[jira] [Commented] (YARN-3553) TreeSet is not a nice container for organizing schedulableEntities.
[ https://issues.apache.org/jira/browse/YARN-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526295#comment-14526295 ] Xianyin Xin commented on YARN-3553: --- Thanks, [~leftnoteasy] and [~cwelch]. Now that it is not an issue, I will just close it. TreeSet is not a nice container for organizing schedulableEntities. --- Key: YARN-3553 URL: https://issues.apache.org/jira/browse/YARN-3553 Project: Hadoop YARN Issue Type: Wish Components: scheduler Reporter: Xianyin Xin In a TreeSet, an element is identified by the comparator, not by object reference. If any *attribute that is used for comparing two elements* of a specific element is modified by other methods, the TreeSet ends up in an unsorted state and cannot become sorted again unless we reconstruct another TreeSet from the elements. To avoid this, one must be *very careful* when trying to modify the attributes of an object (such as increasing or decreasing the used capacity of a schedulableEntity). An example is in AbstractComparatorOrderingPolicy.java, line 63, {code} protected void reorderSchedulableEntity(S schedulableEntity) { //remove, update comparable data, and reinsert to update position in order schedulableEntities.remove(schedulableEntity); updateSchedulingResourceUsage( schedulableEntity.getSchedulingResourceUsage()); schedulableEntities.add(schedulableEntity); } {code} This method removes the schedulableEntity first and then reinserts it so as to reorder the set. However, the changes to the schedulableEntity should be made between those two operations, and because the comparator of the class is not known here, we don't know which attributes of the schedulableEntity were changed. If we change the schedulableEntity outside the method and then inform the orderingPolicy that we made such a change, the operation schedulableEntities.remove(schedulableEntity) will not work correctly, since an element of a TreeSet is located via the comparator. Any implementing class of this abstract class should override this method, but few do. Another choice is to modify a schedulableEntity manually, but then we must not forget to reorder the set, and we must remember the order: remove, modify the attributes (used for comparing), insert; or use an iterator to mark the schedulableEntity so that we can remove and reinsert it correctly. YARN-897 is an example where we fell into this trap. If the comparator becomes more complex in the future, e.g., if we consider other types of resources in the comparator, such traps will multiply and be scattered everywhere, making it easy for a TreeSet to end up in an unsorted state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
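To make the trap concrete, here is a minimal, self-contained sketch; the Entity class, its usage field, and the comparator below are illustrative stand-ins, not the actual SchedulableEntity API.
{code}
import java.util.Comparator;
import java.util.TreeSet;

public class TreeSetReorderDemo {
  static class Entity {
    final String id;
    int usage; // attribute used by the comparator
    Entity(String id, int usage) { this.id = id; this.usage = usage; }
  }

  public static void main(String[] args) {
    TreeSet<Entity> set = new TreeSet<>(
        Comparator.<Entity>comparingInt(e -> e.usage).thenComparing(e -> e.id));
    Entity a = new Entity("a", 1);
    set.add(a);
    set.add(new Entity("b", 2));
    set.add(new Entity("c", 3));

    // Wrong: mutate the comparison key while the element is still in the set.
    a.usage = 5;
    // Lookup now walks the wrong branch of the tree, so the element may not
    // be found and stays stranded in a mis-ordered position.
    System.out.println(set.remove(a)); // may print false

    // Safe pattern, as in reorderSchedulableEntity(): remove, mutate, re-add.
    // set.remove(entity); entity.usage = newValue; set.add(entity);
  }
}
{code}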
[jira] [Commented] (YARN-1612) FairScheduler: Enable delay scheduling by default
[ https://issues.apache.org/jira/browse/YARN-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526296#comment-14526296 ] Chen He commented on YARN-1612: --- Thank you for reviewing, Karthik, I will update the patch tomorrow. FairScheduler: Enable delay scheduling by default - Key: YARN-1612 URL: https://issues.apache.org/jira/browse/YARN-1612 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Sandy Ryza Assignee: Chen He Attachments: YARN-1612-v2.patch, YARN-1612.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3523) Cleanup ResourceManagerAdministrationProtocol interface audience
[ https://issues.apache.org/jira/browse/YARN-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526301#comment-14526301 ] Naganarasimha G R commented on YARN-3523: - Hi [~vinodkv], Point noted :). To my knowledge it was mistakenly set in 2 JIRAs and was not intentional (otherwise it would have been set to a version earlier than 2.8.0). My 2 cents here: if there is an option in JIRA to disable/enable editing for a particular group, can we apply that so that the Fix Version can be added/modified by committers only? Cleanup ResourceManagerAdministrationProtocol interface audience Key: YARN-3523 URL: https://issues.apache.org/jira/browse/YARN-3523 Project: Hadoop YARN Issue Type: Bug Components: client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3523.20150422-1.patch I noticed ResourceManagerAdministrationProtocol has @Private audience for the class and @Public audience for methods. It doesn't make sense to me. We should make class audience and methods audience consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nijel updated YARN-3018: Attachment: YARN-3018-1.patch Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Assignee: nijel Priority: Trivial Attachments: YARN-3018-1.patch For the configuration item yarn.scheduler.capacity.node-locality-delay, the default value given in code is -1: public static final int DEFAULT_NODE_LOCALITY_DELAY = -1; In the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can these be unified, to avoid confusion when the user creates the file without this configuration? If the user expects the values in the file to be the default values, they will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
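For reference, a minimal sketch of how the code-side default is picked up, using the standard Hadoop Configuration API; the wrapper class here is made up for illustration.
{code}
import org.apache.hadoop.conf.Configuration;

public class NodeLocalityDelayDefaultDemo {
  static final String NODE_LOCALITY_DELAY =
      "yarn.scheduler.capacity.node-locality-delay";
  static final int DEFAULT_NODE_LOCALITY_DELAY = -1; // code-side default

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // With no capacity-scheduler.xml entry, the code default (-1) is returned,
    // while the shipped default xml would supply 40; hence the confusion.
    System.out.println(
        conf.getInt(NODE_LOCALITY_DELAY, DEFAULT_NODE_LOCALITY_DELAY));
  }
}
{code}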
[jira] [Commented] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526307#comment-14526307 ] nijel commented on YARN-3018: - Thanks [~leftnoteasy] Uploaded the patch Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Assignee: nijel Priority: Trivial Attachments: YARN-3018-1.patch For the configuration item yarn.scheduler.capacity.node-locality-delay the default value given in code is -1 public static final int DEFAULT_NODE_LOCALITY_DELAY = -1; In the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can it be unified to avoid confusion when the user creates the file without this configuration. IF he expects the values in the file to be default values, then it will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526308#comment-14526308 ] Hadoop QA commented on YARN-3069: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 46s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 35s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 43s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 52s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 1s | The patch has 4 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 23s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:red}-1{color} | yarn tests | 1m 55s | Tests failed in hadoop-yarn-common. | | | | 38m 49s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.conf.TestYarnConfigurationFields | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730122/YARN-3069.005.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / a319771 | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7682/artifact/patchprocess/whitespace.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7682/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7682/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7682/console | This message was automatically generated. Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch The following properties are currently not defined in yarn-default.xml. These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. 
org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker
org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore
security.applicationhistory.protocol.acl
yarn.app.container.log.backups
yarn.app.container.log.dir
yarn.app.container.log.filesize
yarn.client.app-submission.poll-interval
yarn.client.application-client-protocol.poll-timeout-ms
yarn.is.minicluster
yarn.log.server.url
yarn.minicluster.control-resource-monitoring
yarn.minicluster.fixed.ports
yarn.minicluster.use-rpc
yarn.node-labels.fs-store.retry-policy-spec
yarn.node-labels.fs-store.root-dir
yarn.node-labels.manager-class
yarn.nodemanager.container-executor.os.sched.priority.adjustment
yarn.nodemanager.container-monitor.process-tree.class
yarn.nodemanager.disk-health-checker.enable
yarn.nodemanager.docker-container-executor.image-name
yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms
yarn.nodemanager.linux-container-executor.group
yarn.nodemanager.log.deletion-threads-count
yarn.nodemanager.user-home-dir
yarn.nodemanager.webapp.https.address
yarn.nodemanager.webapp.spnego-keytab-file
yarn.nodemanager.webapp.spnego-principal
yarn.nodemanager.windows-secure-container-executor.group
yarn.resourcemanager.configuration.file-system-based-store
yarn.resourcemanager.delegation-token-renewer.thread-count
yarn.resourcemanager.delegation.key.update-interval
yarn.resourcemanager.delegation.token.max-lifetime
yarn.resourcemanager.delegation.token.renew-interval
[jira] [Commented] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526309#comment-14526309 ] Hadoop QA commented on YARN-3018: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 0m 0s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | release audit | 0m 14s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | whitespace | 0m 0s | The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. | | | | 0m 19s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730124/YARN-3018-1.patch | | Optional Tests | | | git revision | trunk / 3ba1836 | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7683/artifact/patchprocess/whitespace.txt | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7683/console | This message was automatically generated. Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Assignee: nijel Priority: Trivial Attachments: YARN-3018-1.patch For the configuration item yarn.scheduler.capacity.node-locality-delay the default value given in code is -1 public static final int DEFAULT_NODE_LOCALITY_DELAY = -1; In the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can it be unified to avoid confusion when the user creates the file without this configuration. IF he expects the values in the file to be default values, then it will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1971) WindowsLocalWrapperScriptBuilder does not check for errors in generated script
[ https://issues.apache.org/jira/browse/YARN-1971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526344#comment-14526344 ] Remus Rusanu commented on YARN-1971: The problem is that there is no error check in the generated script. For comparison, the ContainerLaunch.WindowsShellScriptBuilder checks each line in the generated script by adding this line automatically in the script, after each command: {code} @if %errorlevel% neq 0 exit /b %errorlevel% {code} I'm not advocating checking for various error conditions before launching the script; I'm saying the generated script itself should have error checking and handling. WindowsLocalWrapperScriptBuilder does not check for errors in generated script -- Key: YARN-1971 URL: https://issues.apache.org/jira/browse/YARN-1971 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Priority: Minor Similar to YARN-1865. The DefaultContainerExecutor.WindowsLocalWrapperScriptBuilder builds a shell script that contains commands that potentially may fail: {code} pout.println("@echo " + containerIdStr + " > " + normalizedPidFile + ".tmp"); pout.println("@move /Y " + normalizedPidFile + ".tmp " + normalizedPidFile); {code} These can fail due to access permissions, the disk being out of space, bad hardware, cosmic rays, etc. There should be proper error checking to ease troubleshooting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
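A rough sketch of the suggested direction, mirroring the WindowsShellScriptBuilder behavior quoted above; the class and method names here are illustrative, not the actual DefaultContainerExecutor code.
{code}
import java.io.PrintStream;

public class WindowsWrapperScriptSketch {
  // Emit the same error check WindowsShellScriptBuilder uses after each command.
  private static void errorCheck(PrintStream pout) {
    pout.println("@if %errorlevel% neq 0 exit /b %errorlevel%");
  }

  static void writeWrapperScript(PrintStream pout, String containerIdStr,
      String normalizedPidFile) {
    pout.println("@echo " + containerIdStr + " > " + normalizedPidFile + ".tmp");
    errorCheck(pout);
    pout.println("@move /Y " + normalizedPidFile + ".tmp " + normalizedPidFile);
    errorCheck(pout);
  }
}
{code}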
[jira] [Commented] (YARN-2775) There is no close method in NMWebServices#getLogs()
[ https://issues.apache.org/jira/browse/YARN-2775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526366#comment-14526366 ] Tsuyoshi Ozawa commented on YARN-2775: -- [~skrho], thank you for taking this issue. I agree that we need to close the files after creating the FileInputStream. How about using a try-with-resources statement, since we now only support JDK 7 or later? http://docs.oracle.com/javase/7/docs/technotes/guides/language/try-with-resources.html {code} try (final FileInputStream fis = ContainerLogsUtils.openLogFileForRead( containerIdStr, logFile, nmContext)) { // use fis } catch (IOException e) { // handle the failure } {code} There is no close method in NMWebServices#getLogs() --- Key: YARN-2775 URL: https://issues.apache.org/jira/browse/YARN-2775 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: skrho Priority: Minor Attachments: YARN-2775_001.patch If the getLogs method is called, FileInputStream objects accumulate in memory, because the FileInputStream object is not closed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
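A generic, self-contained illustration of the suggested pattern (plain java.io, not the NM code): the stream is closed automatically when the block exits, even if an exception is thrown.
{code}
import java.io.FileInputStream;
import java.io.IOException;

public class TryWithResourcesDemo {
  public static void main(String[] args) {
    try (FileInputStream fis = new FileInputStream(args[0])) {
      int first = fis.read(); // use the stream; no explicit close() needed
      System.out.println("first byte: " + first);
    } catch (IOException e) {
      System.err.println("failed to read " + args[0] + ": " + e);
    }
  }
}
{code}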
[jira] [Commented] (YARN-3557) Support Intel Trusted Execution Technology(TXT) in YARN scheduler
[ https://issues.apache.org/jira/browse/YARN-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526399#comment-14526399 ] Sunil G commented on YARN-3557: --- bq.Currently for centralized node label configuration, it only supports admin configure node label through CLI. Apart from the CLI and REST, do you mean exposing this configuration to a specific user (I assume this user would have some security approval in the cluster) so that this user can make the change via REST or the APIs? Support Intel Trusted Execution Technology(TXT) in YARN scheduler - Key: YARN-3557 URL: https://issues.apache.org/jira/browse/YARN-3557 Project: Hadoop YARN Issue Type: New Feature Reporter: Dian Fu Attachments: Support TXT in YARN high level design doc.pdf Intel TXT defines platform-level enhancements that provide the building blocks for creating trusted platforms. A TXT aware YARN scheduler can schedule security sensitive jobs on TXT enabled nodes only. YARN-2492 provides the capacity to restrict YARN applications to run only on cluster nodes that have a specified node label. This is a good mechanism that can be utilized for a TXT aware YARN scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3557) Support Intel Trusted Execution Technology(TXT) in YARN scheduler
[ https://issues.apache.org/jira/browse/YARN-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526416#comment-14526416 ] Dian Fu commented on YARN-3557: --- Hi [~sunilg], Thanks for your comments. {quote}Apart from the CLI and REST, do you mean exposing this configuration to a specific user (I assume this user would have some security approval in the cluster) so that this user can make the change via REST or the APIs?{quote} Exposing this configuration to a specific user could be one option, but it would require users to run a job that updates the labels periodically, which is complicated for users. If we can provide a method similar to YARN-2495 at the RM side, the user will just need to provide a script (which takes the node hostname/IP as input and outputs the node labels). Support Intel Trusted Execution Technology(TXT) in YARN scheduler - Key: YARN-3557 URL: https://issues.apache.org/jira/browse/YARN-3557 Project: Hadoop YARN Issue Type: New Feature Reporter: Dian Fu Attachments: Support TXT in YARN high level design doc.pdf Intel TXT defines platform-level enhancements that provide the building blocks for creating trusted platforms. A TXT aware YARN scheduler can schedule security sensitive jobs on TXT enabled nodes only. YARN-2492 provides the capacity to restrict YARN applications to run only on cluster nodes that have a specified node label. This is a good mechanism that can be utilized for a TXT aware YARN scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.
[ https://issues.apache.org/jira/browse/YARN-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526423#comment-14526423 ] Sunil G commented on YARN-2305: --- Yes, this can be closed. I have checked, and it is not occurring. Still, I will perform a few more tests, and if it persists, I will reopen. Thank you [~leftnoteasy] When a container is in reserved state then total cluster memory is displayed wrongly. - Key: YARN-2305 URL: https://issues.apache.org/jira/browse/YARN-2305 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: J.Andreina Assignee: Sunil G Attachments: Capture.jpg ENV Details: = 3 queues : a(50%),b(25%),c(25%) --- All max utilization is set to 100 2 Node cluster with total memory as 16GB TestSteps: = Execute following 3 jobs with different memory configurations for Map , reducer and AM task ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=a -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=2048 /dir8 /preempt_85 (application_1405414066690_0023) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=b -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=2048 -Dmapreduce.reduce.memory.mb=2048 /dir2 /preempt_86 (application_1405414066690_0025) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=c -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=1024 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=1024 /dir2 /preempt_62 Issue = when 2GB memory is in reserved state, total memory is shown as 15GB and used as 15GB (while total memory is 16GB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nijel updated YARN-3018: Attachment: YARN-3018-2.patch Updated the patch to remove the white spaces Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Assignee: nijel Priority: Trivial Attachments: YARN-3018-1.patch, YARN-3018-2.patch For the configuration item yarn.scheduler.capacity.node-locality-delay the default value given in code is -1 public static final int DEFAULT_NODE_LOCALITY_DELAY = -1; In the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can it be unified to avoid confusion when the user creates the file without this configuration. IF he expects the values in the file to be default values, then it will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526435#comment-14526435 ] Hadoop QA commented on YARN-3018: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730139/YARN-3018-2.patch | | Optional Tests | | | git revision | trunk / bb9ddef | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7684/console | This message was automatically generated. Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Assignee: nijel Priority: Trivial Attachments: YARN-3018-1.patch, YARN-3018-2.patch For the configuration item yarn.scheduler.capacity.node-locality-delay the default value given in code is -1 public static final int DEFAULT_NODE_LOCALITY_DELAY = -1; In the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can it be unified to avoid confusion when the user creates the file without this configuration. IF he expects the values in the file to be default values, then it will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2293) Scoring for NMs to identify a better candidate to launch AMs
[ https://issues.apache.org/jira/browse/YARN-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526441#comment-14526441 ] Sunil G commented on YARN-2293: --- Hi [~zjshen] This work is moved to YARN-2005, I will share a basic prototype soon in that. This can be made as duplicated to YARN-2005. Scoring for NMs to identify a better candidate to launch AMs Key: YARN-2293 URL: https://issues.apache.org/jira/browse/YARN-2293 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Sunil G Assignee: Sunil G Container exit status from NM is giving indications of reasons for its failure. Some times, it may be because of container launching problems in NM. In a heterogeneous cluster, some machines with weak hardware may cause more failures. It will be better not to launch AMs there more often. Also I would like to clear that container failures because of buggy job should not result in decreasing score. As mentioned earlier, based on exit status if a scoring mechanism is added for NMs in RM, then NMs with better scores can be given for launching AMs. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2293) Scoring for NMs to identify a better candidate to launch AMs
[ https://issues.apache.org/jira/browse/YARN-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G resolved YARN-2293. --- Resolution: Duplicate Scoring for NMs to identify a better candidate to launch AMs Key: YARN-2293 URL: https://issues.apache.org/jira/browse/YARN-2293 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Sunil G Assignee: Sunil G Container exit status from NM is giving indications of reasons for its failure. Some times, it may be because of container launching problems in NM. In a heterogeneous cluster, some machines with weak hardware may cause more failures. It will be better not to launch AMs there more often. Also I would like to clear that container failures because of buggy job should not result in decreasing score. As mentioned earlier, based on exit status if a scoring mechanism is added for NMs in RM, then NMs with better scores can be given for launching AMs. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2256) Too many nodemanager and resourcemanager audit logs are generated
[ https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526445#comment-14526445 ] Varun Saxena commented on YARN-2256: [~zjshen], just to brief you on the issue: in our setup we were getting too many audit logs related to container events. We also found some other unnecessary logs (not required for debugging) appearing frequently, and had raised another JIRA for that. So we internally took up the task of cleaning up these logs, which also made a slight improvement in the throughput of the running process (2.4.0). To resolve the problem, one option was to remove these logs completely, but we decided to support different log levels for audit logs, so that if some customer requires these logs we can enable them by merely changing the log4j properties. The scope of these 2 JIRAs is indeed interrelated, but I segregated them because I wasn't sure whether the community would accept support for different log levels. We can decide if we need either one of these. Too many nodemanager and resourcemanager audit logs are generated - Key: YARN-2256 URL: https://issues.apache.org/jira/browse/YARN-2256 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager Affects Versions: 2.4.0 Reporter: Varun Saxena Assignee: Varun Saxena Attachments: YARN-2256.patch Following audit logs are generated too many times (due to the possibility of a large number of containers) : 1. In NM - Audit logs corresponding to Starting, Stopping and finishing of a container 2. In RM - Audit logs corresponding to AM allocating a container and AM releasing a container We can have different log levels even for NM and RM audit logs and move these successful container related logs to DEBUG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
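A small sketch of the log-level idea described above; the class and message format are illustrative, not the actual NMAuditLogger/RMAuditLogger code. Routine per-container success events go to DEBUG so they can be enabled through log4j properties when needed.
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class AuditLogLevelSketch {
  private static final Log AUDITLOG = LogFactory.getLog(AuditLogLevelSketch.class);

  static void logContainerSuccess(String user, String operation, String containerId) {
    // Successful container events are frequent; keep them at DEBUG level.
    if (AUDITLOG.isDebugEnabled()) {
      AUDITLOG.debug("USER=" + user + "\tOPERATION=" + operation
          + "\tRESULT=SUCCESS\tCONTAINERID=" + containerId);
    }
  }
}
{code}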
[jira] [Commented] (YARN-2267) Auxiliary Service support in RM
[ https://issues.apache.org/jira/browse/YARN-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526446#comment-14526446 ] Sunil G commented on YARN-2267: --- It would be a good feature if we could plug a few resource monitoring services into the RM, such as mentioned in *Scenario 1* above. Could you please share the design thoughts for this? The main question is how it can be done in a controlled way; by that I mean the introduction of a plugin should not conflict with the existing behavior of the schedulers, etc. Auxiliary Service support in RM --- Key: YARN-2267 URL: https://issues.apache.org/jira/browse/YARN-2267 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Naganarasimha G R Assignee: Rohith Currently RM does not have a provision to run any auxiliary services. For health/monitoring in RM, it's better to make a plugin mechanism in RM itself, similar to NM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3148) allow CORS related headers to passthrough in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526447#comment-14526447 ] Varun Saxena commented on YARN-3148: Thanks [~gtCarrera] for looking at this. Will update the patch ASAP. allow CORS related headers to passthrough in WebAppProxyServlet --- Key: YARN-3148 URL: https://issues.apache.org/jira/browse/YARN-3148 Project: Hadoop YARN Issue Type: Improvement Reporter: Prakash Ramachandran Assignee: Varun Saxena Attachments: YARN-3148.001.patch currently the WebAppProxyServlet filters the request headers as defined by passThroughHeaders. Tez UI is building a webapp which using rest api to fetch data from the am via the rm tracking url. for this purpose it would be nice to have additional headers allowed especially the ones related to CORS. A few of them that would help are * Origin * Access-Control-Request-Method * Access-Control-Request-Headers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526452#comment-14526452 ] Varun Saxena commented on YARN-2902: [~leftnoteasy], sorry for the delay. I was on long leave and have come back today. We are pretty clear on how to handle it for private resources (as per the comment you highlighted), but hadn't updated the patch, as I need to simulate and investigate further for public resources. I will check it and update ASAP. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nijel updated YARN-3018: Attachment: YARN-3018-3.patch Re trigger the CIS. Patch was wrongly generated sorry for the noise Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Assignee: nijel Priority: Trivial Attachments: YARN-3018-1.patch, YARN-3018-2.patch, YARN-3018-3.patch For the configuration item yarn.scheduler.capacity.node-locality-delay the default value given in code is -1 public static final int DEFAULT_NODE_LOCALITY_DELAY = -1; In the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can it be unified to avoid confusion when the user creates the file without this configuration. IF he expects the values in the file to be default values, then it will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526466#comment-14526466 ] Hadoop QA commented on YARN-3018: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 0m 0s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | release audit | 0m 15s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | | | 0m 18s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730144/YARN-3018-3.patch | | Optional Tests | | | git revision | trunk / bb9ddef | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7685/console | This message was automatically generated. Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Assignee: nijel Priority: Trivial Attachments: YARN-3018-1.patch, YARN-3018-2.patch, YARN-3018-3.patch For the configuration item yarn.scheduler.capacity.node-locality-delay the default value given in code is -1 public static final int DEFAULT_NODE_LOCALITY_DELAY = -1; In the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can it be unified to avoid confusion when the user creates the file without this configuration. IF he expects the values in the file to be default values, then it will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-1662) Capacity Scheduler reservation issue cause Job Hang
[ https://issues.apache.org/jira/browse/YARN-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G resolved YARN-1662. --- Resolution: Invalid Capacity Scheduler reservation issue cause Job Hang --- Key: YARN-1662 URL: https://issues.apache.org/jira/browse/YARN-1662 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.2.0 Environment: Suse 11 SP1 + Linux Reporter: Sunil G There are 2 node managers in my cluster. NM1 with 8GB NM2 with 8GB I am submitting a Job with below details: AM with 2GB Map needs 5GB Reducer needs 3GB slowstart is enabled with 0.5 10maps and 50reducers are assigned. 5maps are completed. Now few reducers got scheduled. Now NM1 has 2GB AM and 3Gb Reducer_1[Used 5GB] NM2 has 3Gb Reducer_2 [Used 3GB] A Map has now reserved(5GB) in NM1 which has only 3Gb free. It hangs forever. Potential issue is, reservation is now blocked in NM1 for a Map which needs 5GB. But the Reducer_1 hangs by waiting for few map ouputs. Reducer side preemption also not happened as few headroom is still available. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1662) Capacity Scheduler reservation issue cause Job Hang
[ https://issues.apache.org/jira/browse/YARN-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526473#comment-14526473 ] Sunil G commented on YARN-1662: --- Yes [~jianhe], we can close this issue. After YARN-1769, we have better reservation behavior too. I checked this and it's not happening now. Capacity Scheduler reservation issue cause Job Hang --- Key: YARN-1662 URL: https://issues.apache.org/jira/browse/YARN-1662 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.2.0 Environment: Suse 11 SP1 + Linux Reporter: Sunil G There are 2 node managers in my cluster. NM1 with 8GB NM2 with 8GB I am submitting a Job with below details: AM with 2GB Map needs 5GB Reducer needs 3GB slowstart is enabled with 0.5 10maps and 50reducers are assigned. 5maps are completed. Now few reducers got scheduled. Now NM1 has 2GB AM and 3Gb Reducer_1[Used 5GB] NM2 has 3Gb Reducer_2 [Used 3GB] A Map has now reserved(5GB) in NM1 which has only 3Gb free. It hangs forever. Potential issue is, reservation is now blocked in NM1 for a Map which needs 5GB. But the Reducer_1 hangs by waiting for few map outputs. Reducer side preemption also did not happen as some headroom is still available. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3097) Logging of resource recovery on NM restart has redundancies
[ https://issues.apache.org/jira/browse/YARN-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526577#comment-14526577 ] Eric Payne commented on YARN-3097: -- {quote} -1 The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {quote} Since the only change in this patch is to change an info log message to a debug log message, no tests were included. Logging of resource recovery on NM restart has redundancies --- Key: YARN-3097 URL: https://issues.apache.org/jira/browse/YARN-3097 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Minor Labels: newbie Attachments: YARN-3097.001.patch ResourceLocalizationService logs that it is recovering a resource with the remote and local paths, but then very shortly afterwards the LocalizedResource emits an INIT-LOCALIZED transition that also logs the same remote and local paths. The recovery message should be a debug message, since it's not conveying any useful information that isn't already covered by the resource state transition log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3562) unit tests failures and issues found from findbug from earlier ATS checkins
[ https://issues.apache.org/jira/browse/YARN-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526578#comment-14526578 ] Naganarasimha G R commented on YARN-3562: - Seems to be some issue with Jenkins, compilation is passing and the test logs are showing as compilation issues ! unit tests failures and issues found from findbug from earlier ATS checkins --- Key: YARN-3562 URL: https://issues.apache.org/jira/browse/YARN-3562 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Naganarasimha G R Priority: Minor Attachments: YARN-3562-YARN-2928.001.patch *Issues reported from MAPREDUCE-6337* : A bunch of MR unit tests are failing on our branch whenever the mini YARN cluster needs to bring up multiple node managers. For example, see https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5472/testReport/org.apache.hadoop.mapred/TestClusterMapReduceTestCase/testMapReduceRestarting/ It is because the NMCollectorService is using a fixed port for the RPC (8048). *Issues reported from YARN-3044* : Test case failures and tools(FB CS) issues found : # find bugs issue : Comparison of String objects using == or != in ResourceTrackerService.updateAppCollectorsMap # find bugs issue : Boxing/unboxing to parse a primitive RMTimelineCollectorManager.postPut. Called method Long.longValue() Should call Long.parseLong(String) instead. # find bugs issue : DM_DEFAULT_ENCODING Called method new java.io.FileWriter(String, boolean) At FileSystemTimelineWriterImpl.java:\[line 86\] # hadoop.yarn.server.resourcemanager.TestAppManager, hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions, hadoop.yarn.server.resourcemanager.TestClientRMService hadoop.yarn.server.resourcemanager.logaggregationstatus.TestRMAppLogAggregationStatus, refer https://builds.apache.org/job/PreCommit-YARN-Build/7534/testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3474) Add a way to let NM wait RM to come back, not kill running containers
[ https://issues.apache.org/jira/browse/YARN-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526588#comment-14526588 ] Jun Gong commented on YARN-3474: [~vinodkv] Thank you for the explanation. Closing it now. Add a way to let NM wait RM to come back, not kill running containers - Key: YARN-3474 URL: https://issues.apache.org/jira/browse/YARN-3474 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3474.01.patch When RM HA is enabled and active RM shuts down, standby RM will become active, recover apps and attempts. Apps will not be affected. If there are some cases or bugs that cause both RM could not start normally(e.g. [YARN-2340|https://issues.apache.org/jira/browse/YARN-2340]; RM could not connect with ZK well). NM will kill containers running on it when it could not heartbeat with RM for some time(max retry time is 15 mins by default). Then all apps will be killed. In production cluster, we might come across above cases and fixing these bugs might need time more than 15 mins. In order to let apps not be affected and killed by NM, YARN admin could set a flag(the flag is a znode '/wait-rm-to-come-back/cluster-id' in our solution) to tell NM wait for RM to come back and not kill running containers. After fixing bugs and RM start normally, clear the flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3474) Add a way to let NM wait RM to come back, not kill running containers
[ https://issues.apache.org/jira/browse/YARN-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong resolved YARN-3474. Resolution: Invalid Add a way to let NM wait RM to come back, not kill running containers - Key: YARN-3474 URL: https://issues.apache.org/jira/browse/YARN-3474 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3474.01.patch When RM HA is enabled and active RM shuts down, standby RM will become active, recover apps and attempts. Apps will not be affected. If there are some cases or bugs that cause both RM could not start normally(e.g. [YARN-2340|https://issues.apache.org/jira/browse/YARN-2340]; RM could not connect with ZK well). NM will kill containers running on it when it could not heartbeat with RM for some time(max retry time is 15 mins by default). Then all apps will be killed. In production cluster, we might come across above cases and fixing these bugs might need time more than 15 mins. In order to let apps not be affected and killed by NM, YARN admin could set a flag(the flag is a znode '/wait-rm-to-come-back/cluster-id' in our solution) to tell NM wait for RM to come back and not kill running containers. After fixing bugs and RM start normally, clear the flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2729) Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526623#comment-14526623 ] Naganarasimha G R commented on YARN-2729: - Thanks for the review comments, [~vinodkv]. bq.SCRIPT_NODE_LABELS_PROVIDER and CONFIG_NODE_LABELS_PROVIDER are not needed, delete them, you have separate constants for their prefixes Actually these are not prefixes; as per [~Wangda]'s [comment|https://issues.apache.org/jira/browse/YARN-2729?focusedCommentId=14393545page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14393545] we had decided to have whitelisting for the provider: {{The option will be: yarn.node-labels.nm.provider = config/script/other-class-name.}} These are the modifications for it. bq. DISABLE_NODE_LABELS_PROVIDER_FETCH_TIMER doesn't need to be in YarnConfiguration As per one of Wangda's comments, possible values or default values of configurations had to be kept in YarnConfiguration, hence I placed it here; if required, as per your comment, I can move it to AbstractNodeLabelsProvider. bq. LOG is not used anywhere Are logs expected when the labels are set in {{setNodeLabels}}? I can add them here, but in any case there are logs in NodeStatusUpdaterImpl on successful and unsuccessful attempts. bq. BTW, assuming YARN-3565 goes in first, you will have to make some changes here. bq. I think the format expected from the command should be more structured. Specifically as we expect more per-label attributes in line with YARN-3565. I was thinking about this while working on YARN-3565, but didn't modify the NodeLabelsProvider, because currently the labels (currently partitions) that need to be sent from the NM have to be part of the RM's cluster NodeLabel set. So exclusiveness need not be sent from the NM to the RM, as we currently support specifying exclusiveness only while adding cluster node labels. So IMHO, if there is a plan to make this interface public and stable, it would be better to do these changes now; if not, it would be better to do them after the requirement for constraint labels comes in, so that there is more clarity on the structure. [~wangda] and you can share your opinions on this, and based on that I will do the modifications. bq. Not caused by your patch but worth fixing here. NodeStatusUpdaterImpl shouldn't worry about invalid label-set, previous-valid-labels and label validation. You should move all that functionality into NodeLabelsProvider. As per the class responsibility, I understand that NodeStatusUpdaterImpl is not supposed to have it, but as the provider might be expected to be public we had to ensure that * for every heartbeat, labels are sent across only if modified * basic validations are done before sending the modified labels These need to be done irrespective of the label provider (system's or user's), hence I kept them in NodeStatusUpdaterImpl. If they are required to be moved out, then we need to bring in some intermediate manager (/helper/delegator) class between NodeStatusUpdaterImpl and NodeLabelsProvider. Those changes were also from my previous patch, so no hard feelings in taking care of it if required :). bq. Can you add the documentation for setting this up too? I was planning to raise a JIRA for updating the documentation on top of NodeLabels, but the documentation for it is not yet completed.
If required, I can just add some pdf here. Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup --- Key: YARN-2729 URL: https://issues.apache.org/jira/browse/YARN-2729 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Naganarasimha G R Assignee: Naganarasimha G R Attachments: YARN-2729.20141023-1.patch, YARN-2729.20141024-1.patch, YARN-2729.20141031-1.patch, YARN-2729.20141120-1.patch, YARN-2729.20141210-1.patch, YARN-2729.20150309-1.patch, YARN-2729.20150322-1.patch, YARN-2729.20150401-1.patch, YARN-2729.20150402-1.patch, YARN-2729.20150404-1.patch Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
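For illustration, a rough sketch of what a script-based provider could look like; the class shape and the comma-separated output convention are assumptions for this example, not the actual YARN-2729 interface.
{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

public class ScriptNodeLabelsProviderSketch {
  private final String scriptPath;

  public ScriptNodeLabelsProviderSketch(String scriptPath) {
    this.scriptPath = scriptPath;
  }

  /** Runs the script and treats its last output line as a comma-separated label list. */
  public Set<String> fetchNodeLabels() throws IOException, InterruptedException {
    Process p = new ProcessBuilder(scriptPath).redirectErrorStream(true).start();
    String lastLine = "";
    try (BufferedReader r =
             new BufferedReader(new InputStreamReader(p.getInputStream()))) {
      String line;
      while ((line = r.readLine()) != null) {
        lastLine = line.trim();
      }
    }
    if (p.waitFor() != 0) {
      throw new IOException("label script exited with code " + p.exitValue());
    }
    Set<String> labels = new HashSet<>();
    for (String l : lastLine.split(",")) {
      if (!l.trim().isEmpty()) {
        labels.add(l.trim());
      }
    }
    return labels;
  }
}
{code}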
[jira] [Commented] (YARN-3565) NodeHeartbeatRequest/RegisterNodeManagerRequest should use NodeLabel object instead of String
[ https://issues.apache.org/jira/browse/YARN-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526629#comment-14526629 ] Naganarasimha G R commented on YARN-3565: - Thanks for the review comments, [~vinodkv]. I agree with most of your suggestions but had a few queries overall: * Can there be changes again when labels as constraints are introduced? I am not sure exclusivity will have any significance with constraints, if we plan to make use of the NodeLabel class for constraints too. * Will the CLI also require changes for adding and removing cluster node labels and for mapping nodes to labels? * If RMNodeLabelsManager.replaceLabelsOnNode() needs to be modified, then I think we need to make YARN-3521 dependent on this JIRA, right? NodeHeartbeatRequest/RegisterNodeManagerRequest should use NodeLabel object instead of String - Key: YARN-3565 URL: https://issues.apache.org/jira/browse/YARN-3565 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Priority: Blocker Attachments: YARN-3565-20150502-1.patch Now NM HB/Register uses Set<String>; it will be hard to add new fields if we want to support specifying NodeLabel type such as exclusivity/constraints, etc. We need to make sure rolling upgrade works. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3523) Cleanup ResourceManagerAdministrationProtocol interface audience
[ https://issues.apache.org/jira/browse/YARN-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-3523: Attachment: YARN-3523.20150504-1.patch I have checked the 2.7.0 API docs, and neither this class nor its package is captured there. Hence I have modified the audience of the methods to private in this updated patch. Cleanup ResourceManagerAdministrationProtocol interface audience Key: YARN-3523 URL: https://issues.apache.org/jira/browse/YARN-3523 Project: Hadoop YARN Issue Type: Bug Components: client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3523.20150422-1.patch, YARN-3523.20150504-1.patch I noticed ResourceManagerAdministrationProtocol has @Private audience for the class and @Public audience for methods. It doesn't make sense to me. We should make class audience and methods audience consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
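For reference, a tiny sketch of the consistency being aimed for; the method signature is illustrative, not the full protocol. If the interface itself is @Private, its methods should not carry a broader @Public audience.
{code}
import org.apache.hadoop.classification.InterfaceAudience.Private;
import org.apache.hadoop.classification.InterfaceStability.Evolving;

@Private
@Evolving
interface AdminProtocolSketch {
  @Private
  void refreshQueues() throws Exception; // audience matches the interface
}
{code}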
[jira] [Updated] (YARN-2622) RM should put the application related timeline data into a secured domain
[ https://issues.apache.org/jira/browse/YARN-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-2622: - Target Version/s: (was: 2.6.0) RM should put the application related timeline data into a secured domain - Key: YARN-2622 URL: https://issues.apache.org/jira/browse/YARN-2622 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen After YARN-2446, SystemMetricsPublisher doesn't specify any domain, and the application related timeline data is put into the default domain. It is not secured. We should let RM to choose a secured domain to put the system metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526700#comment-14526700 ] Hadoop QA commented on YARN-2618: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12723515/YARN-2618-7.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / bb9ddef | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7687/console | This message was automatically generated. Avoid over-allocation of disk resources --- Key: YARN-2618 URL: https://issues.apache.org/jira/browse/YARN-2618 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch Subtask of YARN-2139. This should include - Add API support for introducing disk I/O as the 3rd type resource. - NM should report this information to the RM - RM should consider this to avoid over-allocation -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3523) Cleanup ResourceManagerAdministrationProtocol interface audience
[ https://issues.apache.org/jira/browse/YARN-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526710#comment-14526710 ] Hadoop QA commented on YARN-3523: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 55s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 42s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 52s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 2s | The applied patch generated 1 new checkstyle issues (total was 17, now 18). | | {color:red}-1{color} | whitespace | 0m 0s | The patch has 6 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 36s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 24s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 0m 27s | Tests passed in hadoop-yarn-api. | | | | 38m 5s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730182/YARN-3523.20150504-1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / bb9ddef | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7686/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7686/artifact/patchprocess/whitespace.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/7686/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7686/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7686/console | This message was automatically generated. Cleanup ResourceManagerAdministrationProtocol interface audience Key: YARN-3523 URL: https://issues.apache.org/jira/browse/YARN-3523 Project: Hadoop YARN Issue Type: Bug Components: client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3523.20150422-1.patch, YARN-3523.20150504-1.patch I noticed ResourceManagerAdministrationProtocol has @Private audience for the class and @Public audience for methods. It doesn't make sense to me. We should make class audience and methods audience consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3521) Support return structured NodeLabel objects in REST API when call getClusterNodeLabels
[ https://issues.apache.org/jira/browse/YARN-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526747#comment-14526747 ] Sunil G commented on YARN-3521: --- 1. bq.Should be exclusivity. Yes, I updated the same. 2. bq.Did we ever call these APIs stable? No. I have changed to a NodeLabelsInfo object and added a new getter which can supply a list/set of string names. 3. bq.Why are we not dropping the name-only records? I have removed *NodeLabelsName* and instead use *NodeLabelsInfo*, and also added a new getter which can give back the String names of the labels. NodeToLabelsName is renamed as NodeToLabelsInfo and internally it also uses NodeLabelInfo. Support return structured NodeLabel objects in REST API when call getClusterNodeLabels -- Key: YARN-3521 URL: https://issues.apache.org/jira/browse/YARN-3521 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Attachments: 0001-YARN-3521.patch, 0002-YARN-3521.patch, 0003-YARN-3521.patch In YARN-3413, yarn cluster CLI returns NodeLabel instead of String, we should make the same change in REST API side to make them consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
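For readers following along, a rough sketch of the kind of DAO described in points 2 and 3 above: a NodeLabelsInfo wrapper holding NodeLabelInfo records plus a convenience getter for the plain string names. The field and method names are guesses for illustration, not the contents of the attached patch.
{code}
import java.util.ArrayList;

import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement(name = "nodeLabelsInfo")
@XmlAccessorType(XmlAccessType.FIELD)
public class NodeLabelsInfo {

  @XmlElement(name = "nodeLabelInfo")
  private ArrayList<NodeLabelInfo> nodeLabelsInfo = new ArrayList<NodeLabelInfo>();

  public NodeLabelsInfo() {
    // JAXB requires a no-arg constructor.
  }

  public ArrayList<NodeLabelInfo> getNodeLabelsInfo() {
    return nodeLabelsInfo;
  }

  // Convenience getter for callers that only want the label names as strings.
  public ArrayList<String> getNodeLabelsName() {
    ArrayList<String> names = new ArrayList<String>();
    for (NodeLabelInfo label : nodeLabelsInfo) {
      names.add(label.getName());
    }
    return names;
  }
}
{code}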
[jira] [Updated] (YARN-3521) Support return structured NodeLabel objects in REST API when call getClusterNodeLabels
[ https://issues.apache.org/jira/browse/YARN-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3521: -- Attachment: 0004-YARN-3521.patch [~vinodkv] and [~leftnoteasy] Pls share your thoughts on this updated patch. IMO I also feel that NodeLabelManager apis can use Object rather than Strings. Admin interface can take this conversion logic. Support return structured NodeLabel objects in REST API when call getClusterNodeLabels -- Key: YARN-3521 URL: https://issues.apache.org/jira/browse/YARN-3521 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Attachments: 0001-YARN-3521.patch, 0002-YARN-3521.patch, 0003-YARN-3521.patch, 0004-YARN-3521.patch In YARN-3413, yarn cluster CLI returns NodeLabel instead of String, we should make the same change in REST API side to make them consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3097) Logging of resource recovery on NM restart has redundancies
[ https://issues.apache.org/jira/browse/YARN-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526755#comment-14526755 ] Jason Lowe commented on YARN-3097: -- +1, committing this. Logging of resource recovery on NM restart has redundancies --- Key: YARN-3097 URL: https://issues.apache.org/jira/browse/YARN-3097 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Minor Labels: newbie Attachments: YARN-3097.001.patch ResourceLocalizationService logs that it is recovering a resource with the remote and local paths, but then very shortly afterwards the LocalizedResource emits an INIT-LOCALIZED transition that also logs the same remote and local paths. The recovery message should be a debug message, since it's not conveying any useful information that isn't already covered by the resource state transition log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3388) Allocation in LeafQueue could get stuck because DRF calculator isn't well supported when computing user-limit
[ https://issues.apache.org/jira/browse/YARN-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526763#comment-14526763 ] Nathan Roberts commented on YARN-3388: -- Yes. I have a patch which I think is close. I need to merge to the latest trunk, then I'll post it for review. Allocation in LeafQueue could get stuck because DRF calculator isn't well supported when computing user-limit - Key: YARN-3388 URL: https://issues.apache.org/jira/browse/YARN-3388 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: YARN-3388-v0.patch, YARN-3388-v1.patch When there are multiple active users in a queue, it should be possible for those users to make use of capacity up-to max_capacity (or close). The resources should be fairly distributed among the active users in the queue. This works pretty well when there is a single resource being scheduled. However, when there are multiple resources the situation gets more complex and the current algorithm tends to get stuck at Capacity. Example illustrated in subsequent comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3097) Logging of resource recovery on NM restart has redundancies
[ https://issues.apache.org/jira/browse/YARN-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526764#comment-14526764 ] Hudson commented on YARN-3097: -- FAILURE: Integrated in Hadoop-trunk-Commit #7723 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7723/]) YARN-3097. Logging of resource recovery on NM restart has redundancies. Contributed by Eric Payne (jlowe: rev 8f65c793f2930bfd16885a2ab188a9970b754974) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/CHANGES.txt Logging of resource recovery on NM restart has redundancies --- Key: YARN-3097 URL: https://issues.apache.org/jira/browse/YARN-3097 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Minor Labels: newbie Fix For: 2.8.0 Attachments: YARN-3097.001.patch ResourceLocalizationService logs that it is recovering a resource with the remote and local paths, but then very shortly afterwards the LocalizedResource emits an INIT-LOCALIZED transition that also logs the same remote and local paths. The recovery message should be a debug message, since it's not conveying any useful information that isn't already covered by the resource state transition log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526775#comment-14526775 ] Jason Lowe commented on YARN-3554: -- YARN-3518 is a separate concern with different ramifications. We should discuss it there and not mix these two. bq. set this to a bigger value maybe based on network partition considerations not only for nm restart. What value do you propose? As pointed out earlier, anything over 10 minutes is pointless since the container allocation expires in that time. Is it common for network partitions to take longer than 3 minutes but less than 10 minutes? If so we should tune the value for that. If not then making the value larger just slows recovery time. bq. 3 mins seems dangerous, If rm fails over and the recover takes serval mins, nm maybe kill all containers, in production env, it's not expected. This JIRA is configuring the amount of time NM clients (i.e.: primarily ApplicationMasters and the RM when launching ApplicationMasters) will try to connect to a particular NM before failing. I'm missing how RM failover leads to a mass killing of containers due to this proposed change. This is not a property used by the NM, so the NM is not going to start killing all containers differently based on an updated value for it. The only case where the RM will use this property is when connecting to NMs to launch AM containers, and it will only do so for NMs that have recently heartbeated. Could you explain how this leads to all containers getting killed on a particular node? Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526791#comment-14526791 ] Wei Yan commented on YARN-2618: --- Thanks, [~djp], I'll rebase the patch. Avoid over-allocation of disk resources --- Key: YARN-2618 URL: https://issues.apache.org/jira/browse/YARN-2618 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch Subtask of YARN-2139. This should include - Add API support for introducing disk I/O as the 3rd type resource. - NM should report this information to the RM - RM should consider this to avoid over-allocation -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526934#comment-14526934 ] Naganarasimha G R commented on YARN-3554: - Hi [~jlowe], earlier my query about the ideal time and [~sandflee]'s comment were related to yarn.resourcemanager.connect.max-wait.ms, and as [~gtCarrera] mentioned, it was just for discussion purposes. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526957#comment-14526957 ] Wilfred Spiegelenburg commented on YARN-3491: - Can we clean up getInitializedLogDirs() and getInitializedLocalDirs() now that we're changing them? Neither of the methods needs to return anything since we do not use the return value. Also, renaming the methods would make it clearer: getInitializedLogDirs() -> initializeLogDirs(), getInitializedLocalDirs() -> initializeLocalDirs() PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch, YARN-3491.001.patch, YARN-3491.002.patch, YARN-3491.003.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which takes about 10+ ms. The total delay will be approximately number of local dirs * 10+ ms. This delay will be added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
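A minimal sketch of the rename/cleanup suggested above, assuming the return values really are unused. The helper names and the dirsHandler-based bodies are illustrative, not the code in the attached patch.
{code}
// Before (shape only): values are computed and returned, but callers ignore them.
//   private List<String> getInitializedLocalDirs() { ... }
//   private List<String> getInitializedLogDirs() { ... }

// After: void methods whose names say what they do rather than what they return.
private void initializeLocalDirs() {
  for (String localDir : dirsHandler.getLocalDirs()) {
    checkAndInitializeLocalDir(localDir);   // hypothetical helper
  }
}

private void initializeLogDirs() {
  for (String logDir : dirsHandler.getLogDirs()) {
    checkAndInitializeLogDir(logDir);       // hypothetical helper
  }
}
{code}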
[jira] [Commented] (YARN-1564) add some basic workflow YARN services
[ https://issues.apache.org/jira/browse/YARN-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526963#comment-14526963 ] Zhijie Shen commented on YARN-1564: --- YARN-2928 is going to support flow as a first-class citizen. It will be great if we can coordinate on this between app management and monitoring. add some basic workflow YARN services - Key: YARN-1564 URL: https://issues.apache.org/jira/browse/YARN-1564 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 2.4.0 Reporter: Steve Loughran Assignee: Steve Loughran Priority: Minor Attachments: YARN-1564-001.patch Original Estimate: 24h Time Spent: 48h Remaining Estimate: 0h I've been using some alternative composite services to help build workflows of process execution in a YARN AM. They and their tests could be moved in YARN for the use by others -this would make it easier to build aggregate services in an AM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526983#comment-14526983 ] Jason Lowe commented on YARN-3554: -- Ah, thanks [~Naganarasimha], sorry I missed that. We can continue discussing the proper RM connect wait time over at YARN-3518, as obviously I cannot keep them straight here. ;-) Are there still objections to lowering it from 15 mins to 3 mins? I'm +1 for the second patch, but I'll wait a few days before committing to give time for alternate proposals. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3422) relatedentities always return empty list when primary filter is set
[ https://issues.apache.org/jira/browse/YARN-3422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527012#comment-14527012 ] Zhijie Shen commented on YARN-3422: --- [~billie.rina...@gmail.com], thanks for explaining the rationale. Hence the attached patch is probably not the right fix. bq. In retrospect, the directional nature of the related entity relationship seems to make things more confusing. Perhaps it would be better if relatedness were bidirectional. I think directional may be okay, but the confusing part is that we're storing A -> B while we query B -> A, even though we always say related entities. In fact, we need to differentiate the two. When storing A, B resides in A's entity as the isRelatedTo entity, and when querying B, A is shown as the relatesTo entity. Of course, we could also query A, and B would be shown as the isRelatedTo entity, which is not supported here. This problem will be resolved in ATS v2. Moreover, it's also a limitation of the way we store the primary filter. The index table is a copy of the whole entity (only the information that comes with the current put), with the primary filter attached as the prefix of the key. It makes it expensive to define one primary key for an entity, and probably results in different snapshots of the entity with different primary keys. In this example, B doesn't have primary filter C, and even if we later add C for B, we will still not be able to get related entity A when querying B via primary filter C. That's one reason why I suggest using a reverse index in YARN-3448. However, for the current LeveldbTimelineStore, I'm not sure we have a quick way to resolve the problem. Thoughts? relatedentities always return empty list when primary filter is set --- Key: YARN-3422 URL: https://issues.apache.org/jira/browse/YARN-3422 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Chang Li Assignee: Chang Li Attachments: YARN-3422.1.patch When you curl for ats entities with a primary filter, the relatedentities fields always return empty list -- This message was sent by Atlassian JIRA (v6.3.4#6332)
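To make the directionality above concrete, here is a small client-side illustration using the public TimelineEntity API. The entity types, ids and the primary filter name are made up, and the comments restate the behaviour described in the comment rather than the store internals.
{code}
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;

public class RelatedEntityDirectionExample {
  public static void main(String[] args) {
    // A is put with a pointer to B: the relation is carried by A's put.
    TimelineEntity a = new TimelineEntity();
    a.setEntityType("TYPE_A");
    a.setEntityId("A");
    a.addRelatedEntity("TYPE_B", "B");

    // B is put with primary filter C (whether at the same time or later).
    TimelineEntity b = new TimelineEntity();
    b.setEntityType("TYPE_B");
    b.setEntityId("B");
    b.addPrimaryFilter("C", "someValue");

    // Querying TYPE_B with primaryFilter C=someValue finds B, but the indexed
    // copy of B written under that filter only contains the information that
    // came with B's own put, so the relation recorded by A's put is missing
    // and relatedentities comes back empty for that query path.
  }
}
{code}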
[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527020#comment-14527020 ] Vinod Kumar Vavilapalli commented on YARN-2618: --- Haven't looked at this so far, Tx for rekicking it Junping! Taking a quick look now.. Avoid over-allocation of disk resources --- Key: YARN-2618 URL: https://issues.apache.org/jira/browse/YARN-2618 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch Subtask of YARN-2139. This should include - Add API support for introducing disk I/O as the 3rd type resource. - NM should report this information to the RM - RM should consider this to avoid over-allocation -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3573) MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked depricated
Mit Desai created YARN-3573: --- Summary: MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked depricated Key: YARN-3573 URL: https://issues.apache.org/jira/browse/YARN-3573 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Mit Desai {code}MiniMRYarnCluster(String testName, int noOfNMs, boolean enableAHS){code} starts the timeline server using *boolean enableAHS*. It is better to have the timelineserver started based on the config value. We should mark this constructor as deprecated to avoid its future use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
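A minimal sketch of the deprecation being proposed; the javadoc wording and the pointer to yarn.timeline-service.enabled are suggestions rather than the eventual patch, and the constructor body is elided.
{code}
/**
 * @deprecated Start the timeline server via configuration
 * (yarn.timeline-service.enabled) and use
 * {@link #MiniMRYarnCluster(String, int)} instead of passing a boolean flag.
 */
@Deprecated
public MiniMRYarnCluster(String testName, int noOfNMs, boolean enableAHS) {
  // existing constructor body unchanged
}
{code}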
[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527035#comment-14527035 ] Vinod Kumar Vavilapalli commented on YARN-2618: --- Okay, quickly scanned. Seems like you are having other related discussions at the umbrella ticket and other JIRAs. So please go ahead. Is this only for trunk or branch-2 also? Avoid over-allocation of disk resources --- Key: YARN-2618 URL: https://issues.apache.org/jira/browse/YARN-2618 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch Subtask of YARN-2139. This should include - Add API support for introducing disk I/O as the 3rd type resource. - NM should report this information to the RM - RM should consider this to avoid over-allocation -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3573) MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked depricated
[ https://issues.apache.org/jira/browse/YARN-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai reassigned YARN-3573: --- Assignee: Mit Desai MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked depricated - Key: YARN-3573 URL: https://issues.apache.org/jira/browse/YARN-3573 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai {code}MiniMRYarnCluster(String testName, int noOfNMs, boolean enableAHS){code} starts the timeline server using *boolean enableAHS*. It is better to have the timelineserver started based on the config value. We should mark this constructor as deprecated to avoid its future use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1612) FairScheduler: Enable delay scheduling by default
[ https://issues.apache.org/jira/browse/YARN-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-1612: -- Attachment: YARN-1612-003.patch patch updated. FairScheduler: Enable delay scheduling by default - Key: YARN-1612 URL: https://issues.apache.org/jira/browse/YARN-1612 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Sandy Ryza Assignee: Chen He Attachments: YARN-1612-003.patch, YARN-1612-v2.patch, YARN-1612.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1612) FairScheduler: Enable delay scheduling by default
[ https://issues.apache.org/jira/browse/YARN-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-1612: -- Attachment: (was: YARN-1612-003.patch) FairScheduler: Enable delay scheduling by default - Key: YARN-1612 URL: https://issues.apache.org/jira/browse/YARN-1612 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Sandy Ryza Assignee: Chen He Attachments: YARN-1612-003.patch, YARN-1612-v2.patch, YARN-1612.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1612) FairScheduler: Enable delay scheduling by default
[ https://issues.apache.org/jira/browse/YARN-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-1612: -- Attachment: YARN-1612-003.patch FairScheduler: Enable delay scheduling by default - Key: YARN-1612 URL: https://issues.apache.org/jira/browse/YARN-1612 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Sandy Ryza Assignee: Chen He Attachments: YARN-1612-003.patch, YARN-1612-v2.patch, YARN-1612.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3573) MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked depricated
[ https://issues.apache.org/jira/browse/YARN-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-3573: Assignee: (was: Mit Desai) MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked depricated - Key: YARN-3573 URL: https://issues.apache.org/jira/browse/YARN-3573 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Mit Desai {code}MiniMRYarnCluster(String testName, int noOfNMs, boolean enableAHS){code} starts the timeline server using *boolean enableAHS*. It is better to have the timelineserver started based on the config value. We should mark this constructor as deprecated to avoid its future use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527069#comment-14527069 ] Jian He commented on YARN-3480: --- [~hex108], generally, it's better to avoid a global config for an outlier app. 1. How often do you see an app fail with a large number of attempts? If it's limited to a few apps, I wouldn't worry so much. bq. make RM recover process much slower. 2. How much slower is it in reality in your case? We've done some benchmarking; recovering 10k apps (with 1 attempt each) on ZK is pretty fast, within 20 seconds or so. 3. Limiting the attempts to be recorded means we are losing history; it's a trade-off. My main point is that if you can provide some real numbers showing how slow the recovery process is in a real scenario, we can figure out where the bottleneck is and how to improve it. Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable Key: YARN-3480 URL: https://issues.apache.org/jira/browse/YARN-3480 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3480.01.patch, YARN-3480.02.patch, YARN-3480.03.patch When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries(attempts), so it will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make RM recover process much slower. It might be better to set max attempts to be stored in RMStateStore. BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to a small value, retried attempts might be very large. So we need to delete some attempts stored in RMAppImpl and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
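For context, the existing global knob the description refers to can be raised as below. The proposed "max attempts to store" property does not exist at this point in the discussion, so no name is shown for it, and the value 10 is only an example.
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmMaxAttemptsExample {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // Raising the global retry budget; with work-preserving AM restarts this
    // also means more attempts are kept in RMAppImpl and the RMStateStore.
    conf.setInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, 10);
    System.out.println(conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, 2));
  }
}
{code}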
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527072#comment-14527072 ] Vinod Kumar Vavilapalli commented on YARN-3554: --- HADOOP-11398 and YARN-3238 are relevant in that they caused AM-NM communication to take a long time to time out. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2267) Auxiliary Service support in RM
[ https://issues.apache.org/jira/browse/YARN-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527084#comment-14527084 ] Zhijie Shen commented on YARN-2267: --- Sunil, my 2 cents: if you can put together a detailed proposal doc to share with the community and use it for further discussion, it will be much easier to catch the community's eye and to understand your proposal. You may want to focus on stating your problem, why it's general, what the possible solutions are, what their pros and cons are, and so on. For example, Vinod may want to understand why we need to make monitoring an aux service instead of a built-in function of the RM. Auxiliary Service support in RM --- Key: YARN-2267 URL: https://issues.apache.org/jira/browse/YARN-2267 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Naganarasimha G R Assignee: Rohith Currently RM does not have a provision to run any Auxiliary services. For health/monitoring in RM, its better to make a plugin mechanism in RM itself, similar to NM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1621) Add CLI to list rows of task attempt ID, container ID, host of container, state of container
[ https://issues.apache.org/jira/browse/YARN-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bartosz Ługowski updated YARN-1621: --- Attachment: (was: YARN-1621.6.patch) Add CLI to list rows of task attempt ID, container ID, host of container, state of container -- Key: YARN-1621 URL: https://issues.apache.org/jira/browse/YARN-1621 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.2.0 Reporter: Tassapol Athiapinya Assignee: Bartosz Ługowski Attachments: YARN-1621.1.patch, YARN-1621.2.patch, YARN-1621.3.patch, YARN-1621.4.patch, YARN-1621.5.patch, YARN-1621.6.patch As more applications are moved to YARN, we need generic CLI to list rows of task attempt ID, container ID, host of container, state of container. Today if YARN application running in a container does hang, there is no way to find out more info because a user does not know where each attempt is running in. For each running application, it is useful to differentiate between running/succeeded/failed/killed containers. {code:title=proposed yarn cli} $ yarn application -list-containers -applicationId appId [-containerState state of container] where containerState is optional filter to list container in given state only. container state can be running/succeeded/killed/failed/all. A user can specify more than one container state at once e.g. KILLED,FAILED. task attempt ID container ID host of container state of container {code} CLI should work with running application/completed application. If a container runs many task attempts, all attempts should be shown. That will likely be the case of Tez container-reuse application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1621) Add CLI to list rows of task attempt ID, container ID, host of container, state of container
[ https://issues.apache.org/jira/browse/YARN-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bartosz Ługowski updated YARN-1621: --- Attachment: YARN-1621.6.patch Add CLI to list rows of task attempt ID, container ID, host of container, state of container -- Key: YARN-1621 URL: https://issues.apache.org/jira/browse/YARN-1621 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.2.0 Reporter: Tassapol Athiapinya Assignee: Bartosz Ługowski Attachments: YARN-1621.1.patch, YARN-1621.2.patch, YARN-1621.3.patch, YARN-1621.4.patch, YARN-1621.5.patch, YARN-1621.6.patch As more applications are moved to YARN, we need generic CLI to list rows of task attempt ID, container ID, host of container, state of container. Today if YARN application running in a container does hang, there is no way to find out more info because a user does not know where each attempt is running in. For each running application, it is useful to differentiate between running/succeeded/failed/killed containers. {code:title=proposed yarn cli} $ yarn application -list-containers -applicationId appId [-containerState state of container] where containerState is optional filter to list container in given state only. container state can be running/succeeded/killed/failed/all. A user can specify more than one container state at once e.g. KILLED,FAILED. task attempt ID container ID host of container state of container {code} CLI should work with running application/completed application. If a container runs many task attempts, all attempts should be shown. That will likely be the case of Tez container-reuse application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527087#comment-14527087 ] Vinod Kumar Vavilapalli commented on YARN-3554: --- bq. Are there still objections to lowering it from 15 mins to 3 mins? I'm +1 for the second patch, but I'll wait a few days before committing to give time for alternate proposals. For our users, we explicitly set yarn.client.nodemanager-connect.max-wait-ms to 60,000 (one minute). As HADOOP-11398 is still not in, this ends up becoming a 6-minute timeout (assuming each of the underlying RPC retries takes 1 sec * 50 times to finish (50 secs), plus a 10-second retry interval, causing 1 min per retry and 6 retries overall). Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
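The arithmetic in the parenthetical above, written out. All of the numbers are the assumptions stated in the comment (1 s per underlying RPC attempt, 50 RPC-level retries, a 10 s retry interval and a 60,000 ms configured max wait), not values read from the code.
{code}
long ipcAttemptMs     = 1000;              // ~1 s per underlying RPC attempt
long ipcRetries       = 50;                // RPC-level retries per connect attempt (~50 s)
long retryIntervalMs  = 10 * 1000;         // yarn.client.nodemanager-connect.retry-interval-ms
long perRetryMs       = ipcRetries * ipcAttemptMs + retryIntervalMs; // ~60 s per retry
long configuredWaitMs = 60 * 1000;         // yarn.client.nodemanager-connect.max-wait-ms
long retries          = configuredWaitMs / retryIntervalMs;          // 6 retries
long effectiveWaitMs  = retries * perRetryMs;                        // ~360 s, i.e. ~6 minutes
{code}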
[jira] [Updated] (YARN-3518) default rm/am expire interval should not less than default resourcemanager connect wait time
[ https://issues.apache.org/jira/browse/YARN-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3518: -- Assignee: sandflee default rm/am expire interval should not less than default resourcemanager connect wait time Key: YARN-3518 URL: https://issues.apache.org/jira/browse/YARN-3518 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager Reporter: sandflee Assignee: sandflee Labels: configuration, newbie Attachments: YARN-3518.001.patch take am for example, if am can't connect to RM, after am expire (600s), RM relaunch am, and there will be two am at the same time until resourcemanager connect max wait time(900s) passed. DEFAULT_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS = 15 * 60 * 1000; DEFAULT_RM_AM_EXPIRY_INTERVAL_MS = 600000; DEFAULT_RM_NM_EXPIRY_INTERVAL_MS = 600000; -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3518) default rm/am expire interval should not less than default resourcemanager connect wait time
[ https://issues.apache.org/jira/browse/YARN-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527100#comment-14527100 ] Vinod Kumar Vavilapalli commented on YARN-3518: --- We need to be careful here. Clients from gateway machines should be treated separately from AMs - a distinction we don't have today. It actually makes sense for clients to retry for a longer time than is usual for AMs. default rm/am expire interval should not less than default resourcemanager connect wait time Key: YARN-3518 URL: https://issues.apache.org/jira/browse/YARN-3518 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager Reporter: sandflee Assignee: sandflee Labels: configuration, newbie Attachments: YARN-3518.001.patch take am for example, if am can't connect to RM, after am expire (600s), RM relaunch am, and there will be two am at the same time until resourcemanager connect max wait time(900s) passed. DEFAULT_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS = 15 * 60 * 1000; DEFAULT_RM_AM_EXPIRY_INTERVAL_MS = 600000; DEFAULT_RM_NM_EXPIRY_INTERVAL_MS = 600000; -- This message was sent by Atlassian JIRA (v6.3.4#6332)
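For reference, the gap the description points at, using the default values listed above (a back-of-the-envelope sketch, not code from YarnConfiguration):
{code}
long connectMaxWaitMs = 15 * 60 * 1000; // DEFAULT_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS (900 s)
long amExpiryMs       = 600000;         // DEFAULT_RM_AM_EXPIRY_INTERVAL_MS (600 s)
long nmExpiryMs       = 600000;         // DEFAULT_RM_NM_EXPIRY_INTERVAL_MS (600 s)
// Window in which the old and the relaunched AM can both be alive:
long doubleAmWindowMs = connectMaxWaitMs - amExpiryMs; // ~300 s with the defaults
{code}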
[jira] [Commented] (YARN-3562) unit tests failures and issues found from findbug from earlier ATS checkins
[ https://issues.apache.org/jira/browse/YARN-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527614#comment-14527614 ] Naganarasimha G R commented on YARN-3562: - Thanks [~sjlee0], yes, lately I have been seeing some strange Jenkins output, and thanks for testing locally. But there might be some other unrelated test case failures since we are modifying the MiniYARNCluster, so I am not sure how to proceed in that case. Also, how do you kick off Jenkins? Delete and re-upload the patch? unit tests failures and issues found from findbug from earlier ATS checkins --- Key: YARN-3562 URL: https://issues.apache.org/jira/browse/YARN-3562 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Naganarasimha G R Priority: Minor Attachments: YARN-3562-YARN-2928.001.patch *Issues reported from MAPREDUCE-6337* : A bunch of MR unit tests are failing on our branch whenever the mini YARN cluster needs to bring up multiple node managers. For example, see https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5472/testReport/org.apache.hadoop.mapred/TestClusterMapReduceTestCase/testMapReduceRestarting/ It is because the NMCollectorService is using a fixed port for the RPC (8048). *Issues reported from YARN-3044* : Test case failures and tools(FB CS) issues found : # find bugs issue : Comparison of String objects using == or != in ResourceTrackerService.updateAppCollectorsMap # find bugs issue : Boxing/unboxing to parse a primitive RMTimelineCollectorManager.postPut. Called method Long.longValue() Should call Long.parseLong(String) instead. # find bugs issue : DM_DEFAULT_ENCODING Called method new java.io.FileWriter(String, boolean) At FileSystemTimelineWriterImpl.java:\[line 86\] # hadoop.yarn.server.resourcemanager.TestAppManager, hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions, hadoop.yarn.server.resourcemanager.TestClientRMService hadoop.yarn.server.resourcemanager.logaggregationstatus.TestRMAppLogAggregationStatus, refer https://builds.apache.org/job/PreCommit-YARN-Build/7534/testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
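For reference, the three findbugs findings listed in the description map to fairly standard fixes; a hedged sketch follows, where the class and variable names (collectorAddr, previousAddr, timestampStr, path) are placeholders, not the branch code.
{code}
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// 1. String comparison: use equals(), not == / !=.
boolean same = collectorAddr != null && collectorAddr.equals(previousAddr);

// 2. Parse the primitive directly instead of boxing via Long.valueOf(...).longValue().
long timestamp = Long.parseLong(timestampStr);

// 3. DM_DEFAULT_ENCODING: give the writer an explicit charset instead of
//    relying on the platform default that new FileWriter(path, true) uses.
Writer out = new BufferedWriter(new OutputStreamWriter(
    new FileOutputStream(path, true), StandardCharsets.UTF_8));
{code}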
[jira] [Commented] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527612#comment-14527612 ] Jian He commented on YARN-3018: --- Hi [~nijel], the code below in CapacitySchedulerConfiguration actually ends up using 0 instead. How about changing the default to 0 and simplifying the code below to {{return getInt(NODE_LOCALITY_DELAY, DEFAULT_NODE_LOCALITY_DELAY);}}?
{code}
public int getNodeLocalityDelay() {
  int delay = getInt(NODE_LOCALITY_DELAY, DEFAULT_NODE_LOCALITY_DELAY);
  return (delay == DEFAULT_NODE_LOCALITY_DELAY) ? 0 : delay;
}
{code}
Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Assignee: nijel Priority: Trivial Attachments: YARN-3018-1.patch, YARN-3018-2.patch, YARN-3018-3.patch For the configuration item yarn.scheduler.capacity.node-locality-delay the default value given in code is -1 public static final int DEFAULT_NODE_LOCALITY_DELAY = -1; In the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can it be unified to avoid confusion when the user creates the file without this configuration. IF he expects the values in the file to be default values, then it will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
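In other words, once the code default matches the effective behaviour, the getter collapses to a single lookup; a sketch of the shape being suggested above:
{code}
// CapacitySchedulerConfiguration, with the default changed as suggested above.
public static final int DEFAULT_NODE_LOCALITY_DELAY = 0;

public int getNodeLocalityDelay() {
  return getInt(NODE_LOCALITY_DELAY, DEFAULT_NODE_LOCALITY_DELAY);
}
{code}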
[jira] [Commented] (YARN-2725) Adding test cases of retrying requests about ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527758#comment-14527758 ] Hudson commented on YARN-2725: -- SUCCESS: Integrated in Hadoop-trunk-Commit #7729 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7729/]) Adding test cases of retrying requests about ZKRMStateStore --- Key: YARN-2725 URL: https://issues.apache.org/jira/browse/YARN-2725 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi Ozawa Assignee: Tsuyoshi Ozawa Fix For: 2.8.0 Attachments: YARN-2725.1.patch, YARN-2725.1.patch YARN-2721 found a race condition for ZK-specific retry semantics. We should add tests about the case of retry requests to ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3375) NodeHealthScriptRunner.shouldRun() check is performing 3 times for starting NodeHealthScriptRunner
[ https://issues.apache.org/jira/browse/YARN-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527756#comment-14527756 ] Hudson commented on YARN-3375: -- SUCCESS: Integrated in Hadoop-trunk-Commit #7729 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7729/]) NodeHealthScriptRunner.shouldRun() check is performing 3 times for starting NodeHealthScriptRunner -- Key: YARN-3375 URL: https://issues.apache.org/jira/browse/YARN-3375 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Devaraj K Assignee: Devaraj K Priority: Minor Fix For: 2.8.0 Attachments: YARN-3375.patch 1. NodeHealthScriptRunner.shouldRun() check is happening 3 times for starting the NodeHealthScriptRunner.
{code:title=NodeManager.java|borderStyle=solid}
if(!NodeHealthScriptRunner.shouldRun(nodeHealthScript)) {
  LOG.info("Abey khali");
  return null;
}
{code}
{code:title=NodeHealthCheckerService.java|borderStyle=solid}
if (NodeHealthScriptRunner.shouldRun(
    conf.get(YarnConfiguration.NM_HEALTH_CHECK_SCRIPT_PATH))) {
  addService(nodeHealthScriptRunner);
}
{code}
{code:title=NodeHealthScriptRunner.java|borderStyle=solid}
if (!shouldRun(nodeHealthScript)) {
  LOG.info("Not starting node health monitor");
  return;
}
{code}
2. If we don't configure a node health script, or the configured health script doesn't have execute permission, the NM logs the below message.
{code:xml}
2015-03-19 19:55:45,713 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: Abey khali
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
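One possible shape of the cleanup, as a sketch under the assumption that the check is kept only where the runner is wired up and the log message is made meaningful; this is not the attached patch.
{code}
// NodeHealthCheckerService.java: decide once whether the health script runner
// should be added, and say why when it is skipped.
String nodeHealthScript =
    conf.get(YarnConfiguration.NM_HEALTH_CHECK_SCRIPT_PATH);
if (NodeHealthScriptRunner.shouldRun(nodeHealthScript)) {
  addService(nodeHealthScriptRunner);
} else {
  LOG.info("Not starting node health script runner: script '"
      + nodeHealthScript + "' is not configured or is not executable.");
}
{code}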
[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527802#comment-14527802 ] zhihai xu commented on YARN-3491: - thanks [~wilfreds] for the review. I uploaded a new patch YARN-3491.004.patch, which addressed all your comments. PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch, YARN-3491.001.patch, YARN-3491.002.patch, YARN-3491.003.patch, YARN-3491.004.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which takes about 10+ ms. The total delay will be approximately number of local dirs * 10+ ms. This delay will be added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3573) MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked depricated
[ https://issues.apache.org/jira/browse/YARN-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3573: --- Attachment: YARN-3573.patch MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked depricated - Key: YARN-3573 URL: https://issues.apache.org/jira/browse/YARN-3573 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Brahma Reddy Battula Attachments: YARN-3573.patch {code}MiniMRYarnCluster(String testName, int noOfNMs, boolean enableAHS){code} starts the timeline server using *boolean enableAHS*. It is better to have the timelineserver started based on the config value. We should mark this constructor as deprecated to avoid its future use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3514) Active directory usernames like domain\login cause YARN failures
[ https://issues.apache.org/jira/browse/YARN-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527639#comment-14527639 ] Wangda Tan commented on YARN-3514: -- [~cnauroth], I think this causes other problems in latest YARN as well, for example: If a user with name with mixed cases for example De, if we have a rule /L in kerberos side to make all names to lower case, when NM doing log aggregation, it will fail because user name doesn't match (in UserGroupInformation is de, but in OS). {code} java.io.IOException: Owner 'De' for path /hadoop/yarn2/log/application_1428608050835_0013/container_1428608050835_0013_01_06/stder r did not match expected owner 'de' at org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:285) at org.apache.hadoop.io.SecureIOUtils.forceSecureOpenForRead(SecureIOUtils.java:219) at org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:204) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogValue.secureOpenFile(AggregatedLogFormat.java:275) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogValue.write(AggregatedLogFormat.java:227) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogWriter.append(AggregatedLogFormat.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl$ContainerLogAggregator.doContainer LogAggregation(AppLogAggregatorImpl.java:534) at ... {code} One possible solution is ignoring cases while compare user name, but that will be problematic when user De/de existed at the same time. Any thoughts? [~cnauroth]. Active directory usernames like domain\login cause YARN failures Key: YARN-3514 URL: https://issues.apache.org/jira/browse/YARN-3514 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Environment: CentOS6 Reporter: john lilley Assignee: Chris Nauroth Priority: Minor Attachments: YARN-3514.001.patch, YARN-3514.002.patch We have a 2.2.0 (Cloudera 5.3) cluster running on CentOS6 that is Kerberos-enabled and uses an external AD domain controller for the KDC. We are able to authenticate, browse HDFS, etc. However, YARN fails during localization because it seems to get confused by the presence of a \ character in the local user name. Our AD authentication on the nodes goes through sssd and set configured to map AD users onto the form domain\username. For example, our test user has a Kerberos principal of hadoopu...@domain.com and that maps onto a CentOS user domain\hadoopuser. We have no problem validating that user with PAM, logging in as that user, su-ing to that user, etc. However, when we attempt to run a YARN application master, the localization step fails when setting up the local cache directory for the AM. 
The error that comes out of the RM logs: 2015-04-17 12:47:09 INFO net.redpoint.yarnapp.Client[0]: monitorApplication: ApplicationReport: appId=1, state=FAILED, progress=0.0, finalStatus=FAILED, diagnostics='Application application_1429295486450_0001 failed 1 times due to AM Container for appattempt_1429295486450_0001_01 exited with exitCode: -1000 due to: Application application_1429295486450_0001 initialization failed (exitCode=255) with output: main : command provided 0 main : user is DOMAIN\hadoopuser main : requested yarn user is domain\hadoopuser org.apache.hadoop.util.DiskChecker$DiskErrorException: Cannot create directory: /data/yarn/nm/usercache/domain%5Chadoopuser/appcache/application_1429295486450_0001/filecache/10 at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:105) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.download(ContainerLocalizer.java:199) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:241) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:347) .Failing this attempt.. Failing the application.' However, when we look on the node launching the AM, we see this: [root@rpb-cdh-kerb-2 ~]# cd /data/yarn/nm/usercache [root@rpb-cdh-kerb-2 usercache]# ls -l drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser There appears to be different treatment of the \ character in different places. Something creates the directory as domain\hadoopuser
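For the "ignore case" option mentioned in the comment above, the owner check would look roughly like the simplified sketch below. The real SecureIOUtils.checkStat signature differs; this only illustrates the trade-off, and as noted it breaks down if De and de can exist as distinct local users.
{code}
import java.io.File;
import java.io.IOException;

// Hypothetical, simplified variant of the owner check in the secure-open path.
private static void checkOwnerIgnoreCase(File f, String owner, String expectedOwner)
    throws IOException {
  if (expectedOwner != null && !expectedOwner.equalsIgnoreCase(owner)) {
    throw new IOException("Owner '" + owner + "' for path " + f
        + " did not match expected owner '" + expectedOwner + "'");
  }
}
{code}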
[jira] [Commented] (YARN-3343) TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527640#comment-14527640 ] Jian He commented on YARN-3343: --- [~rohithsharma], is this still reproducible? It seems not on my side. TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate sometime fails in trunk --- Key: YARN-3343 URL: https://issues.apache.org/jira/browse/YARN-3343 Project: Hadoop YARN Issue Type: Test Reporter: Xuan Gong Assignee: Rohith Priority: Minor Attachments: 0001-YARN-3343.patch Error Message test timed out after 30000 milliseconds Stacktrace java.lang.Exception: test timed out after 30000 milliseconds at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901) at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293) at java.net.InetAddress.getAllByName0(InetAddress.java:1246) at java.net.InetAddress.getAllByName(InetAddress.java:1162) at java.net.InetAddress.getAllByName(InetAddress.java:1098) at java.net.InetAddress.getByName(InetAddress.java:1048) at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java:563) at org.apache.hadoop.yarn.server.resourcemanager.NodesListManager.isValidNode(NodesListManager.java:147) at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.nodeHeartbeat(ResourceTrackerService.java:367) at org.apache.hadoop.yarn.server.resourcemanager.MockNM.nodeHeartbeat(MockNM.java:178) at org.apache.hadoop.yarn.server.resourcemanager.MockNM.nodeHeartbeat(MockNM.java:136) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:206) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate(TestCapacitySchedulerNodeLabelUpdate.java:157) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3521) Support return structured NodeLabel objects in REST API when call getClusterNodeLabels
[ https://issues.apache.org/jira/browse/YARN-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527697#comment-14527697 ] Wangda Tan commented on YARN-3521: -- [~sunilg], Makes sense to me. bq. IMO I also feel that NodeLabelManager apis can use Object rather than Strings. Admin interface can take this conversion logic. Sorry, I didn't get this; currently addToCluserNodeLabels already takes objects instead of Strings, and you're using it in your patch. Support return structured NodeLabel objects in REST API when call getClusterNodeLabels -- Key: YARN-3521 URL: https://issues.apache.org/jira/browse/YARN-3521 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Attachments: 0001-YARN-3521.patch, 0002-YARN-3521.patch, 0003-YARN-3521.patch, 0004-YARN-3521.patch In YARN-3413, yarn cluster CLI returns NodeLabel instead of String, we should make the same change in REST API side to make them consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3574) RM hangs on stopping MetricsSinkAdapter when transitioning to standby
[ https://issues.apache.org/jira/browse/YARN-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3574: -- Description: We've seen a situation that one RM hangs on stopping the MetricsSinkAdapter {code} main-EventThread daemon prio=10 tid=0x7f9b24031000 nid=0x2d18 in Object.wait() [0x7f9afe7eb000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xc058dcf8 (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1) at java.lang.Thread.join(Thread.java:1281) - locked 0xc058dcf8 (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.metrics2.impl.MetricsSinkAdapter.stop(MetricsSinkAdapter.java:202) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stopSinks(MetricsSystemImpl.java:472) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stop(MetricsSystemImpl.java:213) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.shutdown(MetricsSystemImpl.java:592) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdownInstance(DefaultMetricsSystem.java:72) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdown(DefaultMetricsSystem.java:68) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:605) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) - locked 0xc0503568 (a java.lang.Object) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.stopActiveServices(ResourceManager.java:1024) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1076) - locked 0xc03fe3b8 (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:322) - locked 0xc0502b10 (a org.apache.hadoop.yarn.server.resourcemanager.AdminService) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeStandby(EmbeddedElectorService.java:135) at org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:911) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:428) - locked 0xc0718940 (a org.apache.hadoop.ha.ActiveStandbyElector) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) {code} {code} timeline daemon prio=10 tid=0x7f9b34d55000 nid=0x1d93 runnable [0x7f9b0cbbf000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:152) at java.net.SocketInputStream.read(SocketInputStream.java:122) at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at java.io.BufferedInputStream.read(BufferedInputStream.java:254) - locked 0xc0f522c8 (a java.io.BufferedInputStream) at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78) at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106) at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116) at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973) at 
org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735) at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098) at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) at org.apache.hadoop.metrics2.sink.timeline.AbstractTimelineMetricsSink.emitMetrics(AbstractTimelineMetricsSink.java:66) at org.apache.hadoop.metrics2.sink.timeline.HadoopTimelineMetricsSink.putMetrics(HadoopTimelineMetricsSink.java:203) at org.apache.hadoop.metrics2.impl.MetricsSinkAdapter.consume(MetricsSinkAdapter.java:175) at
[jira] [Commented] (YARN-3574) RM hangs on stopping MetricsSinkAdapter when transitioning to standby
[ https://issues.apache.org/jira/browse/YARN-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527698#comment-14527698 ] Jian He commented on YARN-3574: --- [~brahmareddy], I'm also not able to repro.. I wondered if any other folks have seen this issue before. we found this while doing ambari integration testing. I added one more stack trace for the blocking thread in the description. RM hangs on stopping MetricsSinkAdapter when transitioning to standby - Key: YARN-3574 URL: https://issues.apache.org/jira/browse/YARN-3574 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Brahma Reddy Battula We've seen a situation that one RM hangs on stopping the MetricsSinkAdapter {code} main-EventThread daemon prio=10 tid=0x7f9b24031000 nid=0x2d18 in Object.wait() [0x7f9afe7eb000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xc058dcf8 (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1) at java.lang.Thread.join(Thread.java:1281) - locked 0xc058dcf8 (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.metrics2.impl.MetricsSinkAdapter.stop(MetricsSinkAdapter.java:202) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stopSinks(MetricsSystemImpl.java:472) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stop(MetricsSystemImpl.java:213) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.shutdown(MetricsSystemImpl.java:592) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdownInstance(DefaultMetricsSystem.java:72) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdown(DefaultMetricsSystem.java:68) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:605) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) - locked 0xc0503568 (a java.lang.Object) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.stopActiveServices(ResourceManager.java:1024) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1076) - locked 0xc03fe3b8 (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:322) - locked 0xc0502b10 (a org.apache.hadoop.yarn.server.resourcemanager.AdminService) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeStandby(EmbeddedElectorService.java:135) at org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:911) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:428) - locked 0xc0718940 (a org.apache.hadoop.ha.ActiveStandbyElector) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) {code} {code} timeline daemon prio=10 tid=0x7f9b34d55000 nid=0x1d93 runnable [0x7f9b0cbbf000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:152) at java.net.SocketInputStream.read(SocketInputStream.java:122) at 
java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at java.io.BufferedInputStream.read(BufferedInputStream.java:254) - locked 0xc0f522c8 (a java.io.BufferedInputStream) at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78) at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106) at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116) at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973) at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735) at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098) at
[jira] [Commented] (YARN-1612) FairScheduler: Enable delay scheduling by default
[ https://issues.apache.org/jira/browse/YARN-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527820#comment-14527820 ] Hadoop QA commented on YARN-1612: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 33s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 31s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 35s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 50s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 1s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 14s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 52m 36s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 88m 52s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730302/YARN-1612-004.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 551615f | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7695/artifact/patchprocess/whitespace.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7695/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7695/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7695/console | This message was automatically generated. FairScheduler: Enable delay scheduling by default - Key: YARN-1612 URL: https://issues.apache.org/jira/browse/YARN-1612 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Sandy Ryza Assignee: Chen He Attachments: YARN-1612-003.patch, YARN-1612-004.patch, YARN-1612-v2.patch, YARN-1612.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527835#comment-14527835 ] Hadoop QA commented on YARN-3134: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 51s | Pre-patch YARN-2928 compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 37s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 0m 34s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-server-timelineservice. | | | | 26m 0s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730332/YARN-3134-YARN-2928.003.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / 557a395 | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/7698/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7698/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7698/console | This message was automatically generated. [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: SettingupPhoenixstorageforatimelinev2end-to-endtest.pdf, YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134-042115.patch, YARN-3134-042715.patch, YARN-3134-YARN-2928.001.patch, YARN-3134-YARN-2928.002.patch, YARN-3134-YARN-2928.003.patch, YARN-3134DataSchema.pdf Quote the introduction on Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simply our implementation read/write data from/to HBase, and can easily build index and compose complex query. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527847#comment-14527847 ] Jun Gong commented on YARN-3480: [~jianhe], sorry for not specifying our scenario: RM HA is enabled, use ZK to store apps' info, most apps running in the cluster are long running(service) apps, yarn.resourcemanager.am.max-attempts is set to 1 because we have not patched YARN-611 and we want apps to retry more times. There are 10K apps with 1~1 attempts stored in ZK. It will take about 6 mins to recover those apps when RM HA. {quote} 1. How often do you see an app failed with a large number of attempts? If it's limited to a few apps. I wouldn't worry so much. 2. How slower it is in reality in your case? we've done some benchmark, recovering 10k apps(with 1 attempt) on ZK is pretty fast, within 20 seconds or so. {quote} Please see above. I think it will be OK for map-reduce jobs. But it might not be OK for service apps which have been running several months. {quote} 3. Limiting the attempts to be recorded means we are losing history. it's a trade off. {quote} Yes, I agree. Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable Key: YARN-3480 URL: https://issues.apache.org/jira/browse/YARN-3480 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3480.01.patch, YARN-3480.02.patch, YARN-3480.03.patch When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries(attempts), so it will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make RM recover process much slower. It might be better to set max attempts to be stored in RMStateStore. BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to a small value, retried attempts might be very large. So we need to delete some attempts stored in RMStateStore and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2921) MockRM#waitForState methods can be too slow and flaky
[ https://issues.apache.org/jira/browse/YARN-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527855#comment-14527855 ] Tsuyoshi Ozawa commented on YARN-2921: -- [~leftnoteasy] thank you for pinging me. Yes, it looks related. Let me survey MockRM#waitForState methods can be too slow and flaky - Key: YARN-2921 URL: https://issues.apache.org/jira/browse/YARN-2921 Project: Hadoop YARN Issue Type: Improvement Components: test Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Tsuyoshi Ozawa Attachments: YARN-2921.001.patch, YARN-2921.002.patch, YARN-2921.003.patch, YARN-2921.004.patch MockRM#waitForState methods currently sleep for too long (2 seconds and 1 second). This leads to slow tests and sometimes failures if the App/AppAttempt moves to another state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3574) RM hangs on stopping MetricsSinkAdapter when transitioning to standby
[ https://issues.apache.org/jira/browse/YARN-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527616#comment-14527616 ] Brahma Reddy Battula commented on YARN-3574: [~jianhe] I would like to work on this.. I am not able to reproduce this .. can you please give scenario ..? RM hangs on stopping MetricsSinkAdapter when transitioning to standby - Key: YARN-3574 URL: https://issues.apache.org/jira/browse/YARN-3574 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He We've seen a situation that one RM hangs on stopping the MetricsSinkAdapter {code} main-EventThread daemon prio=10 tid=0x7f9b24031000 nid=0x2d18 in Object.wait() [0x7f9afe7eb000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xc058dcf8 (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1) at java.lang.Thread.join(Thread.java:1281) - locked 0xc058dcf8 (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.metrics2.impl.MetricsSinkAdapter.stop(MetricsSinkAdapter.java:202) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stopSinks(MetricsSystemImpl.java:472) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stop(MetricsSystemImpl.java:213) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.shutdown(MetricsSystemImpl.java:592) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdownInstance(DefaultMetricsSystem.java:72) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdown(DefaultMetricsSystem.java:68) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:605) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) - locked 0xc0503568 (a java.lang.Object) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.stopActiveServices(ResourceManager.java:1024) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1076) - locked 0xc03fe3b8 (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:322) - locked 0xc0502b10 (a org.apache.hadoop.yarn.server.resourcemanager.AdminService) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeStandby(EmbeddedElectorService.java:135) at org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:911) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:428) - locked 0xc0718940 (a org.apache.hadoop.ha.ActiveStandbyElector) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) {code} looks like the {{sinkThread.interrupt();}} in MetricsSinkAdapter#stop doesn't really interrupt the thread, which cause it to hang at join. This appears only once. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3574) RM hangs on stopping MetricsSinkAdapter when transitioning to standby
[ https://issues.apache.org/jira/browse/YARN-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula reassigned YARN-3574: -- Assignee: Brahma Reddy Battula RM hangs on stopping MetricsSinkAdapter when transitioning to standby - Key: YARN-3574 URL: https://issues.apache.org/jira/browse/YARN-3574 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Brahma Reddy Battula We've seen a situation that one RM hangs on stopping the MetricsSinkAdapter {code} main-EventThread daemon prio=10 tid=0x7f9b24031000 nid=0x2d18 in Object.wait() [0x7f9afe7eb000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xc058dcf8 (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1) at java.lang.Thread.join(Thread.java:1281) - locked 0xc058dcf8 (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1) at java.lang.Thread.join(Thread.java:1355) at org.apache.hadoop.metrics2.impl.MetricsSinkAdapter.stop(MetricsSinkAdapter.java:202) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stopSinks(MetricsSystemImpl.java:472) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stop(MetricsSystemImpl.java:213) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.shutdown(MetricsSystemImpl.java:592) - locked 0xc04cc1a0 (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdownInstance(DefaultMetricsSystem.java:72) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdown(DefaultMetricsSystem.java:68) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:605) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) - locked 0xc0503568 (a java.lang.Object) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.stopActiveServices(ResourceManager.java:1024) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1076) - locked 0xc03fe3b8 (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:322) - locked 0xc0502b10 (a org.apache.hadoop.yarn.server.resourcemanager.AdminService) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeStandby(EmbeddedElectorService.java:135) at org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:911) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:428) - locked 0xc0718940 (a org.apache.hadoop.ha.ActiveStandbyElector) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) {code} looks like the {{sinkThread.interrupt();}} in MetricsSinkAdapter#stop doesn't really interrupt the thread, which cause it to hang at join. This appears only once. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
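The hang described above is consistent with standard Java thread semantics: Thread.interrupt() only wakes a thread blocked in an interruptible operation (wait/sleep/join or interruptible NIO), while a plain blocking socket read, like the one in the timeline sink's stack trace, keeps blocking, so an unbounded Thread.join() in the stop path never returns. The following self-contained sketch (not taken from the Hadoop sources; class and thread names are invented) reproduces that behaviour:
{code}
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class InterruptJoinHangDemo {
  public static void main(String[] args) throws Exception {
    // A local server that accepts one connection but never writes anything,
    // so the reader thread below blocks forever in SocketInputStream.read().
    ServerSocket server = new ServerSocket(0);
    Socket client = new Socket("localhost", server.getLocalPort());
    Socket accepted = server.accept();

    Thread sinkThread = new Thread(() -> {
      try (InputStream in = client.getInputStream()) {
        in.read(); // blocking socket I/O: NOT responsive to Thread.interrupt()
      } catch (Exception ignored) {
        // not reached in this demo
      }
    }, "demo-sink-thread");
    sinkThread.start();

    Thread.sleep(200);        // let the reader block
    sinkThread.interrupt();   // analogous to sinkThread.interrupt() in stop()
    sinkThread.join(2000);    // bounded join, only so the demo terminates

    // Prints "true": the interrupt did not unblock the socket read, so an
    // unbounded join() here would hang exactly like the RM thread dump above.
    System.out.println("still alive after interrupt+join: " + sinkThread.isAlive());

    accepted.close();         // closing the peer socket is what actually unblocks read()
    client.close();
    server.close();
  }
}
{code}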
[jira] [Commented] (YARN-3573) MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked deprecated
[ https://issues.apache.org/jira/browse/YARN-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527619#comment-14527619 ] Brahma Reddy Battula commented on YARN-3573: [~mitdesai] Thanks for reporting. Attached the patch; kindly review. MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked deprecated - Key: YARN-3573 URL: https://issues.apache.org/jira/browse/YARN-3573 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Brahma Reddy Battula Attachments: YARN-3573.patch {code}MiniMRYarnCluster(String testName, int noOfNMs, boolean enableAHS){code} starts the timeline server using *boolean enableAHS*. It is better to have the timeline server started based on the config value. We should mark this constructor as deprecated to avoid its future use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
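For reference, the usual shape of such a change (shown here on a hypothetical stand-in class, not the actual patch) is to keep the boolean constructor for source compatibility, mark it @Deprecated with a pointer to the configuration-driven alternative, and let the preferred constructor take over:
{code}
/** Hypothetical stand-in for MiniMRYarnCluster, illustrating the pattern only. */
public class MiniClusterExample {
  private final boolean enableAHS;

  /** Preferred: whether to start the timeline server comes from configuration. */
  public MiniClusterExample(String testName, int noOfNMs) {
    // In the real cluster this flag would be read from the YARN configuration;
    // it is hard-coded here only to keep the sketch self-contained.
    this(testName, noOfNMs, false);
  }

  /**
   * @deprecated the timeline server should be enabled via the config value,
   * not via this boolean; kept only so existing callers keep compiling.
   */
  @Deprecated
  public MiniClusterExample(String testName, int noOfNMs, boolean enableAHS) {
    this.enableAHS = enableAHS;
  }
}
{code}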
[jira] [Updated] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-3134: Attachment: YARN-3134-YARN-2928.003.patch Updated my patch according to the latest comments. I've rebased the patch to the latest YARN-2928 branch, with YARN-3551 in. In this version we're no longer swallowing exceptions. I have not made the change on the Phoenix connection string since, according to our previous discussion, we're planning to address this after we've decided which implementation to pursue in the future. A special note to [~zjshen]: I'm not sure my current way to access the singleData section of a TimelineMetric is correct (since the field no longer exists). It would be great if you can take a look at it. Thanks! [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: SettingupPhoenixstorageforatimelinev2end-to-endtest.pdf, YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134-042115.patch, YARN-3134-042715.patch, YARN-3134-YARN-2928.001.patch, YARN-3134-YARN-2928.002.patch, YARN-3134-YARN-2928.003.patch, YARN-3134DataSchema.pdf Quote the introduction on Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simply our implementation read/write data from/to HBase, and can easily build index and compose complex query. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3523) Cleanup ResourceManagerAdministrationProtocol interface audience
[ https://issues.apache.org/jira/browse/YARN-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527649#comment-14527649 ] Naganarasimha G R commented on YARN-3523: - In that case, better to remove @Stable and not add @Unstable... thoughts? Cleanup ResourceManagerAdministrationProtocol interface audience Key: YARN-3523 URL: https://issues.apache.org/jira/browse/YARN-3523 Project: Hadoop YARN Issue Type: Bug Components: client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3523.20150422-1.patch, YARN-3523.20150504-1.patch I noticed ResourceManagerAdministrationProtocol has @Private audience for the class and @Public audience for methods. It doesn't make sense to me. We should make the class audience and method audience consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
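A minimal illustration of the consistency being discussed (using a hypothetical interface, not the real ResourceManagerAdministrationProtocol): declare the audience once at the type level and leave the methods unannotated, so a method-level @Public or @Stable cannot contradict the class-level @Private:
{code}
import org.apache.hadoop.classification.InterfaceAudience.Private;

// Illustrative only: the audience annotation lives on the type, and the
// methods inherit that contract instead of carrying their own, possibly
// conflicting, audience/stability annotations.
@Private
public interface AdminProtocolExample {
  void refreshQueues() throws Exception;
  void refreshNodes() throws Exception;
}
{code}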
[jira] [Commented] (YARN-3557) Support Intel Trusted Execution Technology(TXT) in YARN scheduler
[ https://issues.apache.org/jira/browse/YARN-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527699#comment-14527699 ] Dian Fu commented on YARN-3557: --- Hi [~leftnoteasy], Thanks a lot for your comments. What about the support of both distributed configuration and centralized configuration? Any thoughts about the solution I mentioned in the above comment? Support Intel Trusted Execution Technology(TXT) in YARN scheduler - Key: YARN-3557 URL: https://issues.apache.org/jira/browse/YARN-3557 Project: Hadoop YARN Issue Type: New Feature Reporter: Dian Fu Attachments: Support TXT in YARN high level design doc.pdf Intel TXT defines platform-level enhancements that provide the building blocks for creating trusted platforms. A TXT aware YARN scheduler can schedule security sensitive jobs on TXT enabled nodes only. YARN-2492 provides the capacity to restrict YARN applications to run only on cluster nodes that have a specified node label. This is a good mechanism that be utilized for TXT aware YARN scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3514) Active directory usernames like domain\login cause YARN failures
[ https://issues.apache.org/jira/browse/YARN-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527702#comment-14527702 ] Vinod Kumar Vavilapalli commented on YARN-3514: --- I also doubt if this (the fix by the patch) is the only place where domain\login type of user-names will fail in YARN. Active directory usernames like domain\login cause YARN failures Key: YARN-3514 URL: https://issues.apache.org/jira/browse/YARN-3514 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Environment: CentOS6 Reporter: john lilley Assignee: Chris Nauroth Priority: Minor Attachments: YARN-3514.001.patch, YARN-3514.002.patch We have a 2.2.0 (Cloudera 5.3) cluster running on CentOS6 that is Kerberos-enabled and uses an external AD domain controller for the KDC. We are able to authenticate, browse HDFS, etc. However, YARN fails during localization because it seems to get confused by the presence of a \ character in the local user name. Our AD authentication on the nodes goes through sssd and set configured to map AD users onto the form domain\username. For example, our test user has a Kerberos principal of hadoopu...@domain.com and that maps onto a CentOS user domain\hadoopuser. We have no problem validating that user with PAM, logging in as that user, su-ing to that user, etc. However, when we attempt to run a YARN application master, the localization step fails when setting up the local cache directory for the AM. The error that comes out of the RM logs: 2015-04-17 12:47:09 INFO net.redpoint.yarnapp.Client[0]: monitorApplication: ApplicationReport: appId=1, state=FAILED, progress=0.0, finalStatus=FAILED, diagnostics='Application application_1429295486450_0001 failed 1 times due to AM Container for appattempt_1429295486450_0001_01 exited with exitCode: -1000 due to: Application application_1429295486450_0001 initialization failed (exitCode=255) with output: main : command provided 0 main : user is DOMAIN\hadoopuser main : requested yarn user is domain\hadoopuser org.apache.hadoop.util.DiskChecker$DiskErrorException: Cannot create directory: /data/yarn/nm/usercache/domain%5Chadoopuser/appcache/application_1429295486450_0001/filecache/10 at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:105) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.download(ContainerLocalizer.java:199) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:241) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:347) .Failing this attempt.. Failing the application.' However, when we look on the node launching the AM, we see this: [root@rpb-cdh-kerb-2 ~]# cd /data/yarn/nm/usercache [root@rpb-cdh-kerb-2 usercache]# ls -l drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser There appears to be different treatment of the \ character in different places. Something creates the directory as domain\hadoopuser but something else later attempts to use it as domain%5Chadoopuser. I’m not sure where or why the URL escapement converts the \ to %5C or why this is not consistent. 
I should also mention, for the sake of completeness, our auth_to_local rule is set up to map u...@domain.com to domain\user: RULE:[1:$1@$0](^.*@DOMAIN\.COM$)s/^(.*)@DOMAIN\.COM$/domain\\$1/g -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3523) Cleanup ResourceManagerAdministrationProtocol interface audience
[ https://issues.apache.org/jira/browse/YARN-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527705#comment-14527705 ] Vinod Kumar Vavilapalli commented on YARN-3523: --- Makes sense. Cleanup ResourceManagerAdministrationProtocol interface audience Key: YARN-3523 URL: https://issues.apache.org/jira/browse/YARN-3523 Project: Hadoop YARN Issue Type: Bug Components: client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3523.20150422-1.patch, YARN-3523.20150504-1.patch I noticed ResourceManagerAdministrationProtocol has @Private audience for the class and @Public audience for methods. It doesn't make sense to me. We should make class audience and methods audience consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3562) unit tests failures and issues found from findbug from earlier ATS checkins
[ https://issues.apache.org/jira/browse/YARN-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527752#comment-14527752 ] Hadoop QA commented on YARN-3562: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 53s | Pre-patch YARN-2928 compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 5 new or modified test files. | | {color:green}+1{color} | javac | 7m 36s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 41s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 40s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 25s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:red}-1{color} | yarn tests | 53m 18s | Tests failed in hadoop-yarn-server-resourcemanager. | | {color:green}+1{color} | yarn tests | 2m 34s | Tests passed in hadoop-yarn-server-tests. | | {color:green}+1{color} | yarn tests | 0m 21s | Tests passed in hadoop-yarn-server-timelineservice. | | | | 94m 1s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.TestClientRMService | | | hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisher | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730037/YARN-3562-YARN-2928.001.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / 557a395 | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7694/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-tests test log | https://builds.apache.org/job/PreCommit-YARN-Build/7694/artifact/patchprocess/testrun_hadoop-yarn-server-tests.txt | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/7694/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7694/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7694/console | This message was automatically generated. unit tests failures and issues found from findbug from earlier ATS checkins --- Key: YARN-3562 URL: https://issues.apache.org/jira/browse/YARN-3562 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Naganarasimha G R Priority: Minor Attachments: YARN-3562-YARN-2928.001.patch *Issues reported from MAPREDUCE-6337* : A bunch of MR unit tests are failing on our branch whenever the mini YARN cluster needs to bring up multiple node managers. 
For example, see https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5472/testReport/org.apache.hadoop.mapred/TestClusterMapReduceTestCase/testMapReduceRestarting/ It is because the NMCollectorService is using a fixed port for the RPC (8048). *Issues reported from YARN-3044* : Test case failures and tools(FB CS) issues found : # find bugs issue : Comparison of String objects using == or != in ResourceTrackerService.updateAppCollectorsMap # find bugs issue : Boxing/unboxing to parse a primitive RMTimelineCollectorManager.postPut. Called method Long.longValue() Should call Long.parseLong(String) instead. # find bugs issue : DM_DEFAULT_ENCODING Called method new java.io.FileWriter(String, boolean) At FileSystemTimelineWriterImpl.java:\[line 86\] # hadoop.yarn.server.resourcemanager.TestAppManager, hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions, hadoop.yarn.server.resourcemanager.TestClientRMService hadoop.yarn.server.resourcemanager.logaggregationstatus.TestRMAppLogAggregationStatus, refer https://builds.apache.org/job/PreCommit-YARN-Build/7534/testReport/ -- This message was sent by Atlassian JIRA
[jira] [Updated] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Attachment: YARN-3491.004.patch PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch, YARN-3491.001.patch, YARN-3491.002.patch, YARN-3491.003.patch, YARN-3491.004.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which takes about 10+ ms. The total delay will be approximately number of local dirs * 10+ ms. This delay will be added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527632#comment-14527632 ] Li Lu commented on YARN-3134: - And, one more thing: I'm closing all PreparedStatements implicitly in the try-with-resources statements. This statement will not swallow any exceptions (since there's no catch after it) but will guarantee the resource is released after the block's execution, even if exceptions are thrown. [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: SettingupPhoenixstorageforatimelinev2end-to-endtest.pdf, YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134-042115.patch, YARN-3134-042715.patch, YARN-3134-YARN-2928.001.patch, YARN-3134-YARN-2928.002.patch, YARN-3134-YARN-2928.003.patch, YARN-3134DataSchema.pdf Quote the introduction on the Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simplify our implementation of reading/writing data from/to HBase, and makes it easy to build indexes and compose complex queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
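For readers unfamiliar with the construct, this is roughly what that pattern looks like; the JDBC URL, table and column names below are placeholders, not the schema from the attached patch:
{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TryWithResourcesSketch {
  // Placeholder Phoenix connection string.
  private static final String JDBC_URL = "jdbc:phoenix:localhost";

  public static void writeRow(String id, long createdTime) throws SQLException {
    String sql = "UPSERT INTO entity (id, created_time) VALUES (?, ?)";
    try (Connection conn = DriverManager.getConnection(JDBC_URL);
         PreparedStatement ps = conn.prepareStatement(sql)) {
      ps.setString(1, id);
      ps.setLong(2, createdTime);
      ps.executeUpdate();
      conn.commit();
    }
    // No catch block: any SQLException still propagates to the caller, but
    // both the PreparedStatement and the Connection are closed first.
  }
}
{code}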
[jira] [Commented] (YARN-3518) default rm/am expire interval should not less than default resourcemanager connect wait time
[ https://issues.apache.org/jira/browse/YARN-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527706#comment-14527706 ] sandflee commented on YARN-3518: Agree, we should set the NM, AM and client values separately. default rm/am expire interval should not less than default resourcemanager connect wait time Key: YARN-3518 URL: https://issues.apache.org/jira/browse/YARN-3518 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager Reporter: sandflee Assignee: sandflee Labels: configuration, newbie Attachments: YARN-3518.001.patch Take the AM for example: if the AM can't connect to the RM, then after the AM expiry interval (600s) the RM relaunches the AM, and there will be two AMs at the same time until the resourcemanager connect max wait time (900s) has passed. DEFAULT_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS = 15 * 60 * 1000; DEFAULT_RM_AM_EXPIRY_INTERVAL_MS = 600000; DEFAULT_RM_NM_EXPIRY_INTERVAL_MS = 600000; -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527732#comment-14527732 ] Hadoop QA commented on YARN-3069: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 46s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. | | {color:green}+1{color} | javac | 7m 38s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 42s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 4m 45s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 3s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 39s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 7m 10s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | common tests | 23m 32s | Tests passed in hadoop-common. | | {color:green}+1{color} | mapreduce tests | 9m 42s | Tests passed in hadoop-mapreduce-client-app. | | {color:green}+1{color} | yarn tests | 1m 59s | Tests passed in hadoop-yarn-common. | | {color:red}-1{color} | hdfs tests | 164m 48s | Tests failed in hadoop-hdfs. | | | | 246m 47s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.hdfs.TestFileCreation | | | hadoop.hdfs.TestHDFSFileSystemContract | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730267/YARN-3069.006.patch | | Optional Tests | javac unit findbugs checkstyle javadoc | | git revision | trunk / bf70c5a | | hadoop-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7691/artifact/patchprocess/testrun_hadoop-common.txt | | hadoop-mapreduce-client-app test log | https://builds.apache.org/job/PreCommit-YARN-Build/7691/artifact/patchprocess/testrun_hadoop-mapreduce-client-app.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7691/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-YARN-Build/7691/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7691/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7691/console | This message was automatically generated. Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, YARN-3069.006.patch The following properties are currently not defined in yarn-default.xml. 
These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore security.applicationhistory.protocol.acl yarn.app.container.log.backups yarn.app.container.log.dir yarn.app.container.log.filesize yarn.client.app-submission.poll-interval yarn.client.application-client-protocol.poll-timeout-ms yarn.is.minicluster yarn.log.server.url yarn.minicluster.control-resource-monitoring yarn.minicluster.fixed.ports yarn.minicluster.use-rpc yarn.node-labels.fs-store.retry-policy-spec yarn.node-labels.fs-store.root-dir yarn.node-labels.manager-class yarn.nodemanager.container-executor.os.sched.priority.adjustment yarn.nodemanager.container-monitor.process-tree.class yarn.nodemanager.disk-health-checker.enable yarn.nodemanager.docker-container-executor.image-name yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms
[jira] [Commented] (YARN-3547) FairScheduler: Apps that have no resource demand should not participate scheduling
[ https://issues.apache.org/jira/browse/YARN-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527827#comment-14527827 ] Hadoop QA commented on YARN-3547: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 39s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 35s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 49s | The applied patch generated 1 new checkstyle issues (total was 9, now 10). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 15s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:red}-1{color} | yarn tests | 52m 58s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 89m 29s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730098/YARN-3547.003.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 551615f | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7696/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7696/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7696/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7696/console | This message was automatically generated. FairScheduler: Apps that have no resource demand should not participate scheduling -- Key: YARN-3547 URL: https://issues.apache.org/jira/browse/YARN-3547 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Xianyin Xin Assignee: Xianyin Xin Attachments: YARN-3547.001.patch, YARN-3547.002.patch, YARN-3547.003.patch At present, all of the 'running' apps participate the scheduling process, however, most of them may have no resource demand on a production cluster, as the app's status is running other than waiting for resource at the most of the app's lifetime. It's not a wise way we sort all the 'running' apps and try to fulfill them, especially on a large-scale cluster which has heavy scheduling load. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
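Conceptually the improvement amounts to filtering before sorting: only apps with outstanding demand are sorted and offered resources. The sketch below uses made-up types and is meant only to show the idea, not the attached patch or the FairScheduler API:
{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Made-up app type for the sketch; the scheduler really works on Schedulables. */
class DemoApp {
  final String name;
  final long demandMb;   // outstanding resource demand; 0 means nothing to ask for
  DemoApp(String name, long demandMb) { this.name = name; this.demandMb = demandMb; }
}

public class DemandFilterSketch {
  /** Only apps that still want resources are sorted and considered for the node. */
  static List<DemoApp> appsToSchedule(List<DemoApp> runningApps) {
    List<DemoApp> candidates = new ArrayList<>();
    for (DemoApp app : runningApps) {
      if (app.demandMb > 0) {     // zero-demand apps skip the sort entirely
        candidates.add(app);
      }
    }
    candidates.sort(Comparator.comparingLong(a -> a.demandMb));
    return candidates;
  }
}
{code}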
[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527898#comment-14527898 ] Hadoop QA commented on YARN-3491: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 43s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | javac | 7m 33s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 36s | The applied patch generated 3 new checkstyle issues (total was 177, now 178). | | {color:red}-1{color} | whitespace | 0m 1s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 2s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 5m 57s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 42m 5s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730351/YARN-3491.004.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 338e88a | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7700/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7700/artifact/patchprocess/whitespace.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7700/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7700/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7700/console | This message was automatically generated. PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch, YARN-3491.001.patch, YARN-3491.002.patch, YARN-3491.003.patch, YARN-3491.004.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which takes about 10+ ms. The total delay will be approximately number of local dirs * 10+ ms. This delay will be added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. 
Also, PublicLocalizer#addResource runs in the Dispatcher thread, so the Dispatcher thread will be blocked by PublicLocalizer#addResource for a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
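One way to think about the cost (a conceptual sketch with invented names, not the attached patch): if the per-directory check is memoized, each public resource no longer pays roughly (number of local dirs) x 10+ ms, and the dispatcher thread is not held up:
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Conceptual only: cache the expensive per-directory validation result. */
public class LocalDirCheckCache {
  private final Map<String, Boolean> checked = new ConcurrentHashMap<>();

  public boolean isDirUsable(String localDir) {
    // computeIfAbsent runs the slow check at most once per directory;
    // later localizations hit the cache instead of paying 10+ ms again.
    return checked.computeIfAbsent(localDir, this::slowCheckLocalDir);
  }

  private boolean slowCheckLocalDir(String dir) {
    // stands in for DiskChecker-style existence/permission/ownership checks
    return true;
  }
}
{code}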
[jira] [Commented] (YARN-3521) Support return structured NodeLabel objects in REST API when call getClusterNodeLabels
[ https://issues.apache.org/jira/browse/YARN-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527904#comment-14527904 ] Sunil G commented on YARN-3521: --- [~leftnoteasy] Yes, it's not a valid point. replaceLabelsOnNode and removeFromClusterNodeLabels don't need the NodeLabel object; the name is enough. Please discard my earlier comment. Support return structured NodeLabel objects in REST API when call getClusterNodeLabels -- Key: YARN-3521 URL: https://issues.apache.org/jira/browse/YARN-3521 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Attachments: 0001-YARN-3521.patch, 0002-YARN-3521.patch, 0003-YARN-3521.patch, 0004-YARN-3521.patch In YARN-3413, the yarn cluster CLI returns NodeLabel instead of String; we should make the same change on the REST API side to keep them consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3573) MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked deprecated
[ https://issues.apache.org/jira/browse/YARN-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527906#comment-14527906 ] Hadoop QA commented on YARN-3573: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 5m 9s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 28s | There were no new javac warning messages. | | {color:green}+1{color} | release audit | 0m 19s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 32s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 31s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 0m 40s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | mapreduce tests | 106m 29s | Tests passed in hadoop-mapreduce-client-jobclient. | | | | 122m 45s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730327/YARN-3573.patch | | Optional Tests | javac unit findbugs checkstyle | | git revision | trunk / 551615f | | hadoop-mapreduce-client-jobclient test log | https://builds.apache.org/job/PreCommit-YARN-Build/7699/artifact/patchprocess/testrun_hadoop-mapreduce-client-jobclient.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7699/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7699/console | This message was automatically generated. MiniMRYarnCluster constructor that starts the timeline server using a boolean should be marked depricated - Key: YARN-3573 URL: https://issues.apache.org/jira/browse/YARN-3573 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Brahma Reddy Battula Attachments: YARN-3573.patch {code}MiniMRYarnCluster(String testName, int noOfNMs, boolean enableAHS){code} starts the timeline server using *boolean enableAHS*. It is better to have the timelineserver started based on the config value. We should mark this constructor as deprecated to avoid its future use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3514) Active directory usernames like domain\login cause YARN failures
[ https://issues.apache.org/jira/browse/YARN-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527907#comment-14527907 ] Chris Nauroth commented on YARN-3514: - Looking at the original description, I see upper-case DOMAIN is getting translated to lower-case domain in this environment. It's likely that this environment would get an ownership mismatch error even after getting past the current bug. {code} drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser {code} Nice catch, Wangda. Is it necessary to translate to lower-case, or can the domain portion of the name be left in upper-case to match the OS level? bq. One possible solution is ignoring cases while compare user name, but that will be problematic when user De/de existed at the same time. I've seen a few mentions online that Active Directory is not case-sensitive but is case-preserving. That means it will preserve the case you used in usernames, but the case doesn't matter for comparisons. I've also seen references that DNS has similar behavior with regards to case. I can't find a definitive statement though that this is guaranteed behavior. I'd feel safer making this kind of change if we had a definitive reference. Active directory usernames like domain\login cause YARN failures Key: YARN-3514 URL: https://issues.apache.org/jira/browse/YARN-3514 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Environment: CentOS6 Reporter: john lilley Assignee: Chris Nauroth Priority: Minor Attachments: YARN-3514.001.patch, YARN-3514.002.patch We have a 2.2.0 (Cloudera 5.3) cluster running on CentOS6 that is Kerberos-enabled and uses an external AD domain controller for the KDC. We are able to authenticate, browse HDFS, etc. However, YARN fails during localization because it seems to get confused by the presence of a \ character in the local user name. Our AD authentication on the nodes goes through sssd and set configured to map AD users onto the form domain\username. For example, our test user has a Kerberos principal of hadoopu...@domain.com and that maps onto a CentOS user domain\hadoopuser. We have no problem validating that user with PAM, logging in as that user, su-ing to that user, etc. However, when we attempt to run a YARN application master, the localization step fails when setting up the local cache directory for the AM. 
The error that comes out of the RM logs: 2015-04-17 12:47:09 INFO net.redpoint.yarnapp.Client[0]: monitorApplication: ApplicationReport: appId=1, state=FAILED, progress=0.0, finalStatus=FAILED, diagnostics='Application application_1429295486450_0001 failed 1 times due to AM Container for appattempt_1429295486450_0001_01 exited with exitCode: -1000 due to: Application application_1429295486450_0001 initialization failed (exitCode=255) with output: main : command provided 0 main : user is DOMAIN\hadoopuser main : requested yarn user is domain\hadoopuser org.apache.hadoop.util.DiskChecker$DiskErrorException: Cannot create directory: /data/yarn/nm/usercache/domain%5Chadoopuser/appcache/application_1429295486450_0001/filecache/10 at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:105) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.download(ContainerLocalizer.java:199) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:241) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:347) .Failing this attempt.. Failing the application.' However, when we look on the node launching the AM, we see this: [root@rpb-cdh-kerb-2 ~]# cd /data/yarn/nm/usercache [root@rpb-cdh-kerb-2 usercache]# ls -l drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser There appears to be different treatment of the \ character in different places. Something creates the directory as domain\hadoopuser but something else later attempts to use it as domain%5Chadoopuser. I’m not sure where or why the URL escapement converts the \ to %5C or why this is not consistent. I should also mention, for the sake of completeness, our auth_to_local rule is set up to map u...@domain.com to domain\user: RULE:[1:$1@$0](^.*@DOMAIN\.COM$)s/^(.*)@DOMAIN\.COM$/domain\\$1/g -- This message was sent by Atlassian JIRA (v6.3.4#6332)
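As a side note on where a path like domain%5Chadoopuser can come from: percent-encoding turns a backslash into %5C. The tiny demo below (standard JDK only; the actual YARN code path may use a different encoder) shows the transformation:
{code}
import java.net.URLEncoder;

public class BackslashEncodingDemo {
  public static void main(String[] args) throws Exception {
    String user = "domain\\hadoopuser";
    // Prints "domain%5Chadoopuser": the backslash is percent-encoded as %5C,
    // matching the directory name the localizer later fails to create/find.
    System.out.println(URLEncoder.encode(user, "UTF-8"));
  }
}
{code}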
[jira] [Commented] (YARN-3560) Not able to navigate to the cluster from tracking url (proxy) generated after submission of job
[ https://issues.apache.org/jira/browse/YARN-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527917#comment-14527917 ] Mohammad Shahid Khan commented on YARN-3560: The issue is happening due to incorrect hyperlink URL formation. The system always forms the URL with the default port, even when yarn.resourcemanager.webapp.address is configured with a different port number. Not able to navigate to the cluster from tracking url (proxy) generated after submission of job --- Key: YARN-3560 URL: https://issues.apache.org/jira/browse/YARN-3560 Project: Hadoop YARN Issue Type: Bug Reporter: Anushri Priority: Minor A standalone web proxy server is enabled in the cluster. When a job is submitted, the URL generated contains the proxy. Tracking this URL in the web page, if we try to navigate to the cluster links [about, applications, or scheduler], it gets redirected to some default port instead of the actual configured RM web port, so it throws "webpage not available". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
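As a rough illustration of the direction described in the comment (a sketch only, not the patch for this issue; buildClusterLink is a hypothetical helper): the cluster links should be derived from the configured yarn.resourcemanager.webapp.address rather than the compiled-in default port.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterLinkSketch {
  // Hypothetical helper: form the cluster link from the configured
  // yarn.resourcemanager.webapp.address instead of the default port.
  static String buildClusterLink(Configuration conf, String page) {
    String rmWebApp = conf.get(YarnConfiguration.RM_WEBAPP_ADDRESS,
        YarnConfiguration.DEFAULT_RM_WEBAPP_ADDRESS);
    return "http://" + rmWebApp + "/cluster/" + page;
  }

  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    conf.set(YarnConfiguration.RM_WEBAPP_ADDRESS, "rmhost:25005"); // non-default port
    // Prints http://rmhost:25005/cluster/apps; code that ignores the configured
    // address and falls back to the default would point at the default port instead.
    System.out.println(buildClusterLink(conf, "apps"));
  }
}
{code}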
[jira] [Commented] (YARN-3343) TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527929#comment-14527929 ] Rohith commented on YARN-3343: -- [~jianhe] I was able to reproduce it. While debugging this issue, I found that the 30 sec timeout was too aggressive for the test to complete; on average, the test case took around 35-45 sec. Changed the timeout to 60 sec. TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate sometime fails in trunk --- Key: YARN-3343 URL: https://issues.apache.org/jira/browse/YARN-3343 Project: Hadoop YARN Issue Type: Test Reporter: Xuan Gong Assignee: Rohith Priority: Minor Attachments: 0001-YARN-3343.patch Error Message test timed out after 30000 milliseconds Stacktrace java.lang.Exception: test timed out after 30000 milliseconds at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901) at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293) at java.net.InetAddress.getAllByName0(InetAddress.java:1246) at java.net.InetAddress.getAllByName(InetAddress.java:1162) at java.net.InetAddress.getAllByName(InetAddress.java:1098) at java.net.InetAddress.getByName(InetAddress.java:1048) at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java:563) at org.apache.hadoop.yarn.server.resourcemanager.NodesListManager.isValidNode(NodesListManager.java:147) at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.nodeHeartbeat(ResourceTrackerService.java:367) at org.apache.hadoop.yarn.server.resourcemanager.MockNM.nodeHeartbeat(MockNM.java:178) at org.apache.hadoop.yarn.server.resourcemanager.MockNM.nodeHeartbeat(MockNM.java:136) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:206) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerNodeLabelUpdate.testNodeUpdate(TestCapacitySchedulerNodeLabelUpdate.java:157) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
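The change described above amounts to raising the JUnit timeout on the test. A minimal sketch of that kind of change, assuming the standard JUnit 4 annotation used in the YARN tests (this is not the attached patch, and the class name here is made up):
{code}
import org.junit.Test;

public class TestTimeoutSketch {
  // Before: @Test(timeout = 30000) was too aggressive for a test that takes 35-45 sec.
  // After: allow 60 sec, as described in the comment above.
  @Test(timeout = 60000)
  public void testNodeUpdate() throws Exception {
    // test body elided; only the timeout value matters for this sketch
  }
}
{code}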
[jira] [Updated] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Attachment: (was: YARN-3491.004.patch) PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch, YARN-3491.001.patch, YARN-3491.002.patch, YARN-3491.003.patch Based on profiling, the bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir, which is very slow and takes about 10+ ms, so the total delay is approximately (number of local dirs) * 10+ ms, and this delay is added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized; instead of doing public resource localization in parallel (multithreading), public resource localization is serialized most of the time. Also, PublicLocalizer#addResource runs in the Dispatcher thread, so the Dispatcher thread will be blocked by PublicLocalizer#addResource for a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Attachment: YARN-3491.004.patch PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch, YARN-3491.001.patch, YARN-3491.002.patch, YARN-3491.003.patch, YARN-3491.004.patch Based on profiling, the bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir, which is very slow and takes about 10+ ms, so the total delay is approximately (number of local dirs) * 10+ ms, and this delay is added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized; instead of doing public resource localization in parallel (multithreading), public resource localization is serialized most of the time. Also, PublicLocalizer#addResource runs in the Dispatcher thread, so the Dispatcher thread will be blocked by PublicLocalizer#addResource for a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
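To make the cost concrete, here is a self-contained sketch (all names hypothetical, not NodeManager code) of why the per-directory checks dominate: each addResource pays roughly (number of local dirs) * 10+ ms on the calling thread. One possible direction, moving the slow checks onto a worker pool so the dispatcher thread only enqueues work, is shown for contrast; it is not the attached patch.
{code}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AddResourceCostSketch {
  // Hypothetical stand-in for checkLocalDir: roughly 10+ ms per local dir.
  static void checkLocalDir(String dir) throws InterruptedException {
    Thread.sleep(10);
  }

  public static void main(String[] args) throws Exception {
    List<String> localDirs = Arrays.asList(
        "/data1/yarn/nm", "/data2/yarn/nm", "/data3/yarn/nm", "/data4/yarn/nm");

    // Serial cost as described above: every public resource pays
    // (number of local dirs) * 10+ ms on the calling (Dispatcher) thread.
    long start = System.nanoTime();
    for (String dir : localDirs) {
      checkLocalDir(dir);
    }
    System.out.printf("per-resource setup cost: ~%d ms%n",
        TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));

    // One possible direction (sketch only): push the slow per-dir checks onto a
    // worker pool so the dispatcher thread is not blocked.
    ExecutorService pool = Executors.newFixedThreadPool(localDirs.size());
    for (String dir : localDirs) {
      pool.submit(() -> {
        checkLocalDir(dir);
        return null;
      });
    }
    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.SECONDS);
  }
}
{code}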
[jira] [Updated] (YARN-3552) RM Web UI shows -1 running containers for completed apps
[ https://issues.apache.org/jira/browse/YARN-3552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3552: - Attachment: (was: 0001-YARN-3552.patch) RM Web UI shows -1 running containers for completed apps Key: YARN-3552 URL: https://issues.apache.org/jira/browse/YARN-3552 Project: Hadoop YARN Issue Type: Bug Components: webapp Reporter: Rohith Assignee: Rohith Priority: Trivial Labels: newbie Attachments: 0001-YARN-3552.patch, 0001-YARN-3552.patch, yarn-3352.PNG In RMServerUtils, the default values are negative numbers, which results in the RM web UI also displaying negative numbers. {code} public static final ApplicationResourceUsageReport DUMMY_APPLICATION_RESOURCE_USAGE_REPORT = BuilderUtils.newApplicationResourceUsageReport(-1, -1, Resources.createResource(-1, -1), Resources.createResource(-1, -1), Resources.createResource(-1, -1), 0, 0); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
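For illustration only (not the attached patch; the helper name and class are hypothetical): since the dummy report uses -1 as an "unknown" marker, a display-side guard would avoid rendering the raw negative value for completed apps.
{code}
public class UsageReportDisplaySketch {
  // Hypothetical display helper: the dummy usage report uses -1 to mean "unknown",
  // so the web UI should not render the raw value for completed apps.
  static String displayContainers(int numUsedContainers) {
    return numUsedContainers < 0 ? "N/A" : String.valueOf(numUsedContainers);
  }

  public static void main(String[] args) {
    System.out.println(displayContainers(-1)); // N/A instead of -1
    System.out.println(displayContainers(3));  // 3
  }
}
{code}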