[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) service

2018-05-20 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8320:

Attachment: (was: CPU-isolation-for-latency-sensitive-services-v1.pdf)

> Add support CPU isolation for latency-sensitive  (LS) service
> -
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares” to isolate CPU resources. However,
>  * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler 
> with no support for differentiated latency.
>  * Request latency of services running in containers may fluctuate heavily 
> when all containers share CPUs, which latency-sensitive services cannot afford 
> in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution that uses cgroup cpuset to bind 
> containers to different processors; this is inspired by the isolation 
> technique in the [Borg 
> system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
>  Later I will upload a detailed design doc.
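For illustration, a minimal sketch of what cpuset-based pinning could look like on the NM side (the class, cgroup path and method below are hypothetical, not taken from the attached design doc): create a per-container cpuset cgroup and write the allowed CPUs and memory nodes before the container process is attached.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of cpuset-based CPU pinning for a container.
public final class CpusetBinder {
  // Assumed cgroup mount point; the real path would come from NM cgroup configuration.
  private static final String CPUSET_ROOT = "/sys/fs/cgroup/cpuset/hadoop-yarn";

  public static void bind(String containerId, String cpus, String mems) throws IOException {
    Path dir = Paths.get(CPUSET_ROOT, containerId);
    Files.createDirectories(dir);
    // cpuset requires both cpuset.cpus and cpuset.mems to be set before tasks can be attached.
    Files.write(dir.resolve("cpuset.cpus"), cpus.getBytes()); // e.g. "2-3"
    Files.write(dir.resolve("cpuset.mems"), mems.getBytes()); // e.g. "0"
    // The container's PID would then be written to this cgroup's "tasks" file by the executor.
  }
}
{code}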






[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) service

2018-05-20 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8320:

Attachment: CPU-isolation-for-latency-sensitive-services-v1.pdf

> Add support CPU isolation for latency-sensitive  (LS) service
> -
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares” to isolate CPU resources. However,
>  * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler 
> with no support for differentiated latency.
>  * Request latency of services running in containers may fluctuate heavily 
> when all containers share CPUs, which latency-sensitive services cannot afford 
> in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution that uses cgroup cpuset to bind 
> containers to different processors; this is inspired by the isolation 
> technique in the [Borg 
> system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
>  Later I will upload a detailed design doc.






[jira] [Created] (YARN-8331) Race condition in NM container launched after done

2018-05-20 Thread Yang Wang (JIRA)
Yang Wang created YARN-8331:
---

 Summary: Race condition in NM container launched after done
 Key: YARN-8331
 URL: https://issues.apache.org/jira/browse/YARN-8331
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Yang Wang


When a container is launching in ContainerLaunch#launchContainer and its state is 
SCHEDULED, a kill event sent to the container moves it through 
SCHEDULED -> KILLING -> DONE.
Then ContainerLaunch still sends the CONTAINER_LAUNCHED event and starts the container 
processes. These orphaned container processes will never be cleaned up.
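A self-contained sketch of the kind of guard that could close the race (the enum and method below are illustrative toys, not the actual NM types or the committed fix): re-check the container state right before spawning the process and skip the launch if a kill has already driven it to KILLING or DONE.

{code:java}
// Toy illustration only; mirrors the NM container states seen in the log below.
enum NMContainerState { NEW, SCHEDULED, RUNNING, KILLING, DONE }

final class LaunchGuard {
  // Launch only if no kill has already completed for this container; otherwise the
  // spawned process would never be reaped by the container state machine.
  static boolean shouldLaunch(NMContainerState state) {
    return state != NMContainerState.KILLING && state != NMContainerState.DONE;
  }
}
{code}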

 
{code:java}
2018-05-21 13:11:56,114 INFO  [Thread-11] nodemanager.NMAuditLogger 
(NMAuditLogger.java:logSuccess(94)) - USER=nobody   OPERATION=Start Container 
Request   TARGET=ContainerManageImpl  RESULT=SUCCESS  
APPID=application_0_CONTAINERID=container_0__01_00
2018-05-21 13:11:56,114 INFO  [NM ContainerManager dispatcher] 
application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application 
application_0_ transitioned from NEW to INITING
2018-05-21 13:11:56,114 INFO  [NM ContainerManager dispatcher] 
application.ApplicationImpl (ApplicationImpl.java:transition(446)) - Adding 
container_0__01_00 to application application_0_
2018-05-21 13:11:56,118 INFO  [NM ContainerManager dispatcher] 
application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application 
application_0_ transitioned from INITING to RUNNING
2018-05-21 13:11:56,119 INFO  [NM ContainerManager dispatcher] 
container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container 
container_0__01_00 transitioned from NEW to SCHEDULED
2018-05-21 13:11:56,119 INFO  [NM ContainerManager dispatcher] 
containermanager.AuxServices (AuxServices.java:handle(220)) - Got event 
CONTAINER_INIT for appId application_0_
2018-05-21 13:11:56,119 INFO  [NM ContainerManager dispatcher] 
scheduler.ContainerScheduler (ContainerScheduler.java:startContainer(504)) - 
Starting container [container_0__01_00]
2018-05-21 13:11:56,226 INFO  [NM ContainerManager dispatcher] 
container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container 
container_0__01_00 transitioned from SCHEDULED to KILLING
2018-05-21 13:11:56,227 INFO  [NM ContainerManager dispatcher] 
containermanager.TestContainerManager 
(BaseContainerManagerTest.java:delete(287)) - Psuedo delete: user - nobody, 
type - FILE
2018-05-21 13:11:56,227 INFO  [NM ContainerManager dispatcher] 
nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(94)) - USER=nobody 
 OPERATION=Container Finished - Killed   TARGET=ContainerImplRESULT=SUCCESS 
 APPID=application_0_CONTAINERID=container_0__01_00
2018-05-21 13:11:56,238 INFO  [NM ContainerManager dispatcher] 
container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container 
container_0__01_00 transitioned from KILLING to DONE
2018-05-21 13:11:56,238 INFO  [NM ContainerManager dispatcher] 
application.ApplicationImpl (ApplicationImpl.java:transition(489)) - Removing 
container_0__01_00 from application application_0_
2018-05-21 13:11:56,239 INFO  [NM ContainerManager dispatcher] 
monitor.ContainersMonitorImpl 
(ContainersMonitorImpl.java:onStopMonitoringContainer(932)) - Stopping 
resource-monitoring for container_0__01_00
2018-05-21 13:11:56,239 INFO  [NM ContainerManager dispatcher] 
containermanager.AuxServices (AuxServices.java:handle(220)) - Got event 
CONTAINER_STOP for appId application_0_
2018-05-21 13:11:56,274 WARN  [NM ContainerManager dispatcher] 
container.ContainerImpl (ContainerImpl.java:handle(2106)) - Can't handle this 
event at current state: Current: [DONE], eventType: [CONTAINER_LAUNCHED], 
container: [container_0__01_00]
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
CONTAINER_LAUNCHED at DONE
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:2104)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:104)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1525)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1518)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at 

[jira] [Commented] (YARN-8319) More YARN pages need to honor yarn.resourcemanager.display.per-user-apps

2018-05-20 Thread Sunil Govindan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482201#comment-16482201
 ] 

Sunil Govindan commented on YARN-8319:
--

Updating the v2 patch after fixing the Jenkins-reported issues.

[~vinodkv] [~rohithsharma] [~leftnoteasy] Kindly help to review.

> More YARN pages need to honor yarn.resourcemanager.display.per-user-apps
> 
>
> Key: YARN-8319
> URL: https://issues.apache.org/jira/browse/YARN-8319
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8319.001.patch, YARN-8319.002.patch
>
>
> When this config is on:
>  - Per-queue page on UI2 should filter the app list by user
>  -- TODO: Verify the same with the UI1 per-queue page
>  - ATSv2 with UI2 should filter the list of all users' flows and flow activities
>  - Per-node pages
>  -- Listing of apps and containers on a per-node basis should filter apps and 
> containers by user.
> To this end, because this is no longer just for the ResourceManager, we should 
> also deprecate {{yarn.resourcemanager.display.per-user-apps}} in favor of 
> {{yarn.webapp.filter-app-list-by-user}}.
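For illustration only, the per-user filtering boils down to a check like the following (the helper class and method are hypothetical; only the proposed config key comes from the description above):

{code:java}
import org.apache.hadoop.conf.Configuration;

// Hypothetical helper: hide other users' apps when the filter flag is enabled.
final class AppVisibility {
  static boolean isVisibleTo(String callerUser, String appOwner, Configuration conf) {
    boolean filterByUser = conf.getBoolean("yarn.webapp.filter-app-list-by-user", false);
    // With the flag off, everyone sees everything; with it on, only the owner's apps show up.
    return !filterByUser || callerUser.equals(appOwner);
  }
}
{code}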






[jira] [Updated] (YARN-8319) More YARN pages need to honor yarn.resourcemanager.display.per-user-apps

2018-05-20 Thread Sunil Govindan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-8319:
-
Attachment: YARN-8319.002.patch

> More YARN pages need to honor yarn.resourcemanager.display.per-user-apps
> 
>
> Key: YARN-8319
> URL: https://issues.apache.org/jira/browse/YARN-8319
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8319.001.patch, YARN-8319.002.patch
>
>
> When this config is on:
>  - Per-queue page on UI2 should filter the app list by user
>  -- TODO: Verify the same with the UI1 per-queue page
>  - ATSv2 with UI2 should filter the list of all users' flows and flow activities
>  - Per-node pages
>  -- Listing of apps and containers on a per-node basis should filter apps and 
> containers by user.
> To this end, because this is no longer just for the ResourceManager, we should 
> also deprecate {{yarn.resourcemanager.display.per-user-apps}} in favor of 
> {{yarn.webapp.filter-app-list-by-user}}.






[jira] [Assigned] (YARN-8330) An extra container got launched by RM for yarn-service

2018-05-20 Thread Suma Shivaprasad (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suma Shivaprasad reassigned YARN-8330:
--

Assignee: Suma Shivaprasad

> An extra container got launched by RM for yarn-service
> --
>
> Key: YARN-8330
> URL: https://issues.apache.org/jira/browse/YARN-8330
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Yesha Vora
>Assignee: Suma Shivaprasad
>Priority: Critical
>
> Steps:
> 1. Launch the HBase tarball app.
> 2. List containers for the HBase tarball app.
> {code}
> /usr/hdp/current/hadoop-yarn-client/bin/yarn container -list 
> appattempt_1525463491331_0006_01
> WARNING: YARN_LOG_DIR has been replaced by HADOOP_LOG_DIR. Using value of 
> YARN_LOG_DIR.
> WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of 
> YARN_LOGFILE.
> WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of 
> YARN_PID_DIR.
> WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
> 18/05/04 22:36:11 INFO client.AHSProxy: Connecting to Application History 
> server at xxx/xxx:10200
> 18/05/04 22:36:11 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
> to rm2
> Total number of containers :5
> Container-IdStart Time Finish Time   
> StateHost   Node Http Address 
>LOG-URL
> container_e06_1525463491331_0006_01_02Fri May 04 22:34:26 + 2018  
>  N/A RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_02/hrt_qa
> 2018-05-04 22:36:11,216|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_03
> Fri May 04 22:34:26 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_03/hrt_qa
> 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_01
> Fri May 04 22:34:15 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_01/hrt_qa
> 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_05
> Fri May 04 22:34:56 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_05/hrt_qa
> 2018-05-04 22:36:11,218|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_04
> Fri May 04 22:34:56 + 2018   N/A
> nullxxx:25454  http://xxx:8042
> http://xxx:8188/applicationhistory/logs/xxx:25454/container_e06_1525463491331_0006_01_04/container_e06_1525463491331_0006_01_04/hrt_qa{code}
> Total expected containers = 4 (3 component containers + 1 AM). Instead, RM 
> is listing 5 containers.
> container_e06_1525463491331_0006_01_04 is in null state.
> The YARN service used containers 02, 03 and 05 for components. There is no log 
> available in the NM or AM related to container 04. Only one line is printed 
> in the RM log:
> {code}
> 2018-05-04 22:34:56,618 INFO  rmcontainer.RMContainerImpl 
> (RMContainerImpl.java:handle(489)) - 
> container_e06_1525463491331_0006_01_04 Container Transitioned from NEW to 
> RESERVED{code}






[jira] [Commented] (YARN-8248) Job hangs when a job requests a resource that its queue does not have

2018-05-20 Thread Haibo Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482168#comment-16482168
 ] 

Haibo Chen commented on YARN-8248:
--

[~snemeth] Can you please address the checkstyle issues?

> Job hangs when a job requests a resource that its queue does not have
> -
>
> Key: YARN-8248
> URL: https://issues.apache.org/jira/browse/YARN-8248
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8248-001.patch, YARN-8248-002.patch, 
> YARN-8248-003.patch, YARN-8248-004.patch, YARN-8248-005.patch, 
> YARN-8248-006.patch, YARN-8248-007.patch, YARN-8248-008.patch, 
> YARN-8248-009.patch, YARN-8248-010.patch, YARN-8248-011.patch, 
> YARN-8248-012.patch
>
>
> A job hangs when mapreduce.job.queuename is specified and the queue has 0 of 
> any resource (vcores / memory / other).
> In this scenario, the job should be rejected immediately upon submission, 
> since the specified queue cannot serve the resource needs of the submitted 
> job.
>  
> Command to run:
> {code:java}
> bin/yarn jar 
> "./share/hadoop/mapreduce/hadoop-mapreduce-examples-$MY_HADOOP_VERSION.jar" 
> pi -Dmapreduce.job.queuename=sample_queue 1 1000;{code}
> fair-scheduler.xml queue config (excerpt):
>  
> {code:java}
> <queue name="sample_queue">
>   <minResources>1 mb,0vcores</minResources>
>   <maxResources>9 mb,0vcores</maxResources>
>   <maxRunningApps>50</maxRunningApps>
>   <maxAMShare>-1.0f</maxAMShare>
>   <weight>2.0</weight>
>   <schedulingPolicy>fair</schedulingPolicy>
> </queue>
> {code}
> Diagnostic message from the web UI: 
> {code:java}
> [Wed May 02 06:35:57 -0700 2018] Application is added to the scheduler and is 
> not yet activated. (Resource request:  exceeds current 
> queue or its parents maximum resource allowed).{code}
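A hedged sketch of the kind of submission-time check this issue asks for (the class and method are illustrative; {{Resources.fitsIn}} is an existing scheduler utility, but wiring it into submission handling is only assumed here):

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.util.resource.Resources;

// Illustrative check: reject the application up front when its request can never
// fit into the queue's maximum resources (e.g. a queue configured with 0 vcores).
final class QueueCapacityCheck {
  static void checkFits(Resource requested, Resource queueMax) throws YarnException {
    if (!Resources.fitsIn(requested, queueMax)) {
      throw new YarnException("Resource request " + requested
          + " exceeds maximum resources of the queue " + queueMax);
    }
  }
}
{code}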






[jira] [Commented] (YARN-6578) Return container resource utilization from NM ContainerStatus call

2018-05-20 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482157#comment-16482157
 ] 

Weiwei Yang commented on YARN-6578:
---

Agree with [~fly_in_gis]. The motivation of this JIRA is to 1) give the AM a more 
informative message for update-container requests; 2) potentially optimize 
container preemption based on allocated/utilized resources. There is no place to 
consume other resource types' utilization at this point.

> Return container resource utilization from NM ContainerStatus call
> --
>
> Key: YARN-6578
> URL: https://issues.apache.org/jira/browse/YARN-6578
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Yang Wang
>Assignee: Yang Wang
>Priority: Major
> Attachments: YARN-6578.001.patch, YARN-6578.002.patch, 
> YARN-6578.003.patch
>
>
> When the ApplicationMaster wants to change (increase/decrease) the resources of an 
> allocated container, resource utilization is an important reference indicator 
> for decision making. So, when the AM calls NMClient.getContainerStatus, resource 
> utilization needs to be returned.
> Container resource utilization also needs to be reported to the RM to enable better 
> scheduling.
> So put resource utilization in ContainerStatus.






[jira] [Commented] (YARN-6578) Return container resource utilization from NM ContainerStatus call

2018-05-20 Thread Yang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482142#comment-16482142
 ] 

Yang Wang commented on YARN-6578:
-

[~Naganarasimha] thanks for your comment.

Currently we just return pmem/vmem/vcores in ContainerStatus#getUtilization.

Just as you mentioned, do we need to make ResourceUtilization extensible like 
Resource?

Getting the utilization of extensible resources (GPU/FPGA) is not as easy as for 
pmem/vmem/vcores.

In most use cases, such as scheduling opportunistic containers or increasing/decreasing 
container resources, the utilization of pmem/vmem/vcores is enough.
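For context, a sketch of how an AM could consume this (NMClient#getContainerStatus exists today; {{ContainerStatus#getUtilization}} is the accessor this JIRA proposes, so treat it as proposed API rather than current trunk, and the helper class here is purely illustrative):

{code:java}
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.api.records.ResourceUtilization;
import org.apache.hadoop.yarn.client.api.NMClient;

final class UtilizationProbe {
  // Sketch: decide whether a container is a candidate for a decrease request by
  // comparing reported pmem usage (MB) against its allocation.
  static boolean isUnderUtilized(NMClient nmClient, ContainerId id, NodeId node,
      long allocatedMemMB) throws Exception {
    ContainerStatus status = nmClient.getContainerStatus(id, node);
    ResourceUtilization util = status.getUtilization(); // accessor proposed in this JIRA
    return util.getPhysicalMemory() < allocatedMemMB / 2;
  }
}
{code}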

> Return container resource utilization from NM ContainerStatus call
> --
>
> Key: YARN-6578
> URL: https://issues.apache.org/jira/browse/YARN-6578
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Yang Wang
>Assignee: Yang Wang
>Priority: Major
> Attachments: YARN-6578.001.patch, YARN-6578.002.patch, 
> YARN-6578.003.patch
>
>
> When the ApplicationMaster wants to change (increase/decrease) the resources of an 
> allocated container, resource utilization is an important reference indicator 
> for decision making. So, when the AM calls NMClient.getContainerStatus, resource 
> utilization needs to be returned.
> Container resource utilization also needs to be reported to the RM to enable better 
> scheduling.
> So put resource utilization in ContainerStatus.






[jira] [Comment Edited] (YARN-8079) Support static and archive unmodified local resources in service AM

2018-05-20 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482076#comment-16482076
 ] 

Eric Yang edited comment on YARN-8079 at 5/21/18 12:20 AM:
---

[~leftnoteasy] When the files are placed in the resources directory, the patch 10 
implementation prevents mistakenly overwriting system-level generated files such 
as the .token file and launch_container.sh.  However, this design can create 
inconvenience for some users, because existing Hadoop workloads may already be 
using the top-level localized directory instead of the resources directory.  We may 
not need to worry about launch_container.sh getting overwritten because 
container-executor generates the file after static files are localized.  Apps 
will try to avoid overwriting .token files because they cannot contact HDFS from 
containers if they do.

With a resources directory, it may be easier for the end user to specify a single 
relative directory to bind-mount instead of specifying individual files to 
bind-mount in the yarnfile.  By removing the resources directory, users will need to 
think a bit more about how to manage the bind-mount directories to avoid wordy 
syntax.

With both approaches considered, it all comes down to which approach is easiest to 
use while not creating too much clutter.  In summary, it is probably safe to remove 
the requirement of a "resources" directory from my point of view.


was (Author: eyang):
[~leftnoteasy] When the files are placed in the resources directory, the patch 10 
implementation prevents mistakenly overwriting system-level generated files such 
as the .token file and launch_container.sh.  However, this design can create 
inconvenience for some users, because existing Hadoop workloads may already be 
using the top-level localized directory instead of the resources directory.  We may 
not need to worry about launch_container.sh getting overwritten because 
container-executor generates the file after static files are localized.  Apps 
will try to avoid overwriting .token files because they cannot contact HDFS from 
containers if they do.  In summary, it is likely safe to remove the requirement of 
a "resources" directory from my point of view.

> Support static and archive unmodified local resources in service AM
> ---
>
> Key: YARN-8079
> URL: https://issues.apache.org/jira/browse/YARN-8079
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Suma Shivaprasad
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8079.001.patch, YARN-8079.002.patch, 
> YARN-8079.003.patch, YARN-8079.004.patch, YARN-8079.005.patch, 
> YARN-8079.006.patch, YARN-8079.007.patch, YARN-8079.008.patch, 
> YARN-8079.009.patch, YARN-8079.010.patch
>
>
> Currently, {{srcFile}} is not respected. {{ProviderUtils}} doesn't properly 
> read srcFile; instead it always constructs {{remoteFile}} using 
> componentDir and the fileName of {{destFile}}:
> {code}
> Path remoteFile = new Path(compInstanceDir, fileName);
> {code}
> To me it is a common use case that services have files already in HDFS 
> that need to be localized when components are launched. (For example, if we 
> want to serve a TensorFlow model, we need to localize the model 
> (typically not huge, less than a GB) to local disk; otherwise the launched Docker 
> container has to access HDFS.)
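A minimal sketch of the intended behavior (the null/empty check is illustrative, not the committed fix): honor {{srcFile}} when it is set and only fall back to the per-component directory otherwise.

{code:java}
// Sketch: prefer the user-supplied srcFile; fall back to compInstanceDir + destFile
// name only when no srcFile was given.
Path remoteFile = (srcFile != null && !srcFile.isEmpty())
    ? new Path(srcFile)
    : new Path(compInstanceDir, fileName);
{code}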






[jira] [Commented] (YARN-8079) Support static and archive unmodified local resources in service AM

2018-05-20 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482076#comment-16482076
 ] 

Eric Yang commented on YARN-8079:
-

[~leftnoteasy] When the files are placed in the resources directory, the patch 10 
implementation prevents mistakenly overwriting system-level generated files such 
as the .token file and launch_container.sh.  However, this design can create 
inconvenience for some users, because existing Hadoop workloads may already be 
using the top-level localized directory instead of the resources directory.  We may 
not need to worry about launch_container.sh getting overwritten because 
container-executor generates the file after static files are localized.  Apps 
will try to avoid overwriting .token files because they cannot contact HDFS from 
containers if they do.  In summary, it is likely safe to remove the requirement of 
a "resources" directory from my point of view.

> Support static and archive unmodified local resources in service AM
> ---
>
> Key: YARN-8079
> URL: https://issues.apache.org/jira/browse/YARN-8079
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Suma Shivaprasad
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8079.001.patch, YARN-8079.002.patch, 
> YARN-8079.003.patch, YARN-8079.004.patch, YARN-8079.005.patch, 
> YARN-8079.006.patch, YARN-8079.007.patch, YARN-8079.008.patch, 
> YARN-8079.009.patch, YARN-8079.010.patch
>
>
> Currently, {{srcFile}} is not respected. {{ProviderUtils}} doesn't properly 
> read srcFile; instead it always constructs {{remoteFile}} using 
> componentDir and the fileName of {{destFile}}:
> {code}
> Path remoteFile = new Path(compInstanceDir, fileName);
> {code}
> To me it is a common use case that services have files already in HDFS 
> that need to be localized when components are launched. (For example, if we 
> want to serve a TensorFlow model, we need to localize the model 
> (typically not huge, less than a GB) to local disk; otherwise the launched Docker 
> container has to access HDFS.)






[jira] [Commented] (YARN-8079) Support static and archive unmodified local resources in service AM

2018-05-20 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482065#comment-16482065
 ] 

Wangda Tan commented on YARN-8079:
--

Thanks [~suma.shivaprasad]. 

While using this feature, I found the behavior of downloading to the resources/ folder 
really confusing, since inside the spec:
{code:java}
 {
"dest_file": "presetup-tf.sh",
"type": "STATIC",
"src_file": "hdfs:///tf-job-conf/scripts/configs/presetup-tf.sh"
},{code}

The spec doesn't mention the "resources" directory at all!

To make it simple, I would prefer to remove the default resources folder and place 
all downloaded files directly under the container's local folder. Thoughts?

+ [~gsaha]/[~billie.rinaldi]/[~eyang] for suggestions.

> Support static and archive unmodified local resources in service AM
> ---
>
> Key: YARN-8079
> URL: https://issues.apache.org/jira/browse/YARN-8079
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Suma Shivaprasad
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8079.001.patch, YARN-8079.002.patch, 
> YARN-8079.003.patch, YARN-8079.004.patch, YARN-8079.005.patch, 
> YARN-8079.006.patch, YARN-8079.007.patch, YARN-8079.008.patch, 
> YARN-8079.009.patch, YARN-8079.010.patch
>
>
> Currently, {{srcFile}} is not respected. {{ProviderUtils}} doesn't properly 
> read srcFile; instead it always constructs {{remoteFile}} using 
> componentDir and the fileName of {{destFile}}:
> {code}
> Path remoteFile = new Path(compInstanceDir, fileName);
> {code}
> To me it is a common use case that services have files already in HDFS 
> that need to be localized when components are launched. (For example, if we 
> want to serve a TensorFlow model, we need to localize the model 
> (typically not huge, less than a GB) to local disk; otherwise the launched Docker 
> container has to access HDFS.)






[jira] [Commented] (YARN-7494) Add muti node lookup support for better placement

2018-05-20 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481855#comment-16481855
 ] 

genericqa commented on YARN-7494:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
40s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 9 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 32m 
 7s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 6s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m  2s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
9s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
38s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 47s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 33 new + 1063 unchanged - 1 fixed = 1096 total (was 1064) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 27s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
25s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 67m  6s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
23s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}133m  3s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd |
| JIRA Issue | YARN-7494 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12924083/YARN-7494.008.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 04bcb74a36a9 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 
14:43:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / a23ff8d |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_162 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/20802/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
 |
| unit | 

[jira] [Commented] (YARN-8213) Add Capacity Scheduler metrics

2018-05-20 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481846#comment-16481846
 ] 

Weiwei Yang commented on YARN-8213:
---

I think [~leftnoteasy] posted his comment in a different JIRA; let me copy and 
paste it here:

{noformat}
in general the patch looks good. I'm only not sure about following logic inside 
reinitialize(): 
CapacitySchedulerMetrics.destroy()
Reinitialize could be frequently invoked, probably we should not destroy it on 
every reinitialize
{noformat}

I think we need this because we need to reset the CS metrics counters during a CS 
refresh; otherwise the metrics might lose accuracy when the configuration has changed. 
What do you think?
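For illustration, a toy sketch of the two options being weighed here (class and method names are made up, not the actual CapacitySchedulerMetrics API): tear down and re-create the metrics source on every refresh, or keep it and just zero the counters.

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Toy holder standing in for the scheduler metrics source; illustrative only.
final class SchedulerMetricsHolder {
  private static volatile SchedulerMetricsHolder instance = new SchedulerMetricsHolder();
  private final AtomicLong allocateOps = new AtomicLong();

  // Option A: destroy and re-create on every reinitialize (what destroy() amounts to).
  static void destroyAndRecreate() { instance = new SchedulerMetricsHolder(); }

  // Option B: keep the instance and reset its counters so post-refresh numbers stay accurate.
  static void resetCounters() { instance.allocateOps.set(0); }
}
{code}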

> Add Capacity Scheduler metrics
> --
>
> Key: YARN-8213
> URL: https://issues.apache.org/jira/browse/YARN-8213
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, metrics
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Critical
> Attachments: YARN-8213.001.patch, YARN-8213.002.patch, 
> YARN-8213.003.patch, YARN-8213.004.patch
>
>
> Currently, tuning CS performance is not straightforward because of the lack of 
> metrics. Right now we only have \{{QueueMetrics}}, which mostly tracks 
> queue-level resource counters. Propose to add CS metrics to collect and display 
> more fine-grained performance metrics.






[jira] [Commented] (YARN-8015) Support inter-app placement constraints in AppPlacementAllocator

2018-05-20 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481844#comment-16481844
 ] 

Weiwei Yang commented on YARN-8015:
---

Hi [~leftnoteasy]

Is the above review comment meant for YARN-8213 :) ?

What do you think about the patch for this JIRA? I was trying to get this done 
for the 3.1.1 timeline, as the placement constraint functionality needs it.

Please suggest, thanks.

> Support inter-app placement constraints in AppPlacementAllocator
> 
>
> Key: YARN-8015
> URL: https://issues.apache.org/jira/browse/YARN-8015
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Critical
> Attachments: YARN-8015.001.patch, YARN-8015.002.patch
>
>
> AppPlacementAllocator currently only supports intra-app anti-affinity 
> placement constraints; once YARN-8002 and YARN-8013 are resolved, it needs to 
> support inter-app constraints too. This may also require some refactoring of 
> the existing code. Use this JIRA to track that work.
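For reference, the intra-app form that AppPlacementAllocator handles today can be expressed with the PlacementConstraints builder as below; the inter-app case tracked here would additionally match tags of other applications (sketch only, based on the public API; the wrapper class is illustrative):

{code:java}
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.PlacementTargets.allocationTag;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.targetNotIn;

import org.apache.hadoop.yarn.api.resource.PlacementConstraint;
import org.apache.hadoop.yarn.api.resource.PlacementConstraints;

final class ConstraintExample {
  // Intra-app anti-affinity: no two "hbase"-tagged containers of this app on one node.
  // The inter-app variant would resolve "hbase" against other applications' tags too.
  static PlacementConstraint hbaseAntiAffinity() {
    return PlacementConstraints.build(targetNotIn("node", allocationTag("hbase")));
  }
}
{code}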






[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) service

2018-05-20 Thread Weiwei Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8320:
--
Description: 
Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
“cpu.shares” to isolate CPU resources. However,
 * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler 
with no support for differentiated latency.
 * Request latency of services running in containers may fluctuate heavily when 
all containers share CPUs, which latency-sensitive services cannot afford in our 
production environment.

So we need finer-grained CPU isolation.

My co-workers and I propose a solution that uses cgroup cpuset to bind containers 
to different processors; this is inspired by the isolation technique in the [Borg 
system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
 Later I will upload a detailed design doc.

  was:
Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
“cpu.shares” to isolate CPU resources. However,
* Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler with no 
support for differentiated latency.
* Request latency of services running in containers may fluctuate heavily when 
all containers share CPUs, which latency-sensitive services cannot afford in our 
production environment.

So we need finer-grained CPU isolation.

My co-workers and I propose a solution that uses cgroup cpuset to bind containers 
to different processors according to a [Google’s 
PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
Later I will upload a detailed design doc.

 



> Add support CPU isolation for latency-sensitive  (LS) service
> -
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares” to isolate CPU resources. However,
>  * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler 
> with no support for differentiated latency.
>  * Request latency of services running in containers may fluctuate heavily 
> when all containers share CPUs, which latency-sensitive services cannot afford 
> in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution that uses cgroup cpuset to bind 
> containers to different processors; this is inspired by the isolation 
> technique in the [Borg 
> system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
>  Later I will upload a detailed design doc.


