[jira] [Commented] (YARN-8927) Better handling of "docker.trusted.registries" in container-executor's "trusted_image_check" function

2018-11-04 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674740#comment-16674740
 ] 

Eric Yang commented on YARN-8927:
-

[~tangzhankun] Pseudo code is:

{code}
if (trust local image option is enabled, or the image is in the local trust list) {
  docker images (check whether the image exists in the local cache)
  if (exists locally) {
    docker run
  } else {
    helper();
  }
} else {
  helper();
}

function helper() {
  allowed = false;
  if (image does not contain "/" and docker.trusted.registries has "library") {
    allowed = true;
  } else {
    allowed = image is in docker.trusted.registries or
              docker.privileged-containers-registry;
  }
  if (allowed) {
    docker pull
    docker run
  }
}
{code}

When the trust-local-image option is disabled or the image is not listed, the 
registry image takes precedence.  This covers the 78% majority who trust the 
latest and greatest image from remote repositories.  If the trust-local-image 
option is enabled, the local image takes precedence over remote repositories.  
There is no state to remember in Java because the docker images command already 
knows whether the image is available locally.  The c-e check stays simple and 
fast: it compares the config value with the docker images output without having 
to touch the remote repository for the checks.
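For illustration only, below is a minimal, self-contained Java sketch of the local "docker images" check mentioned above. The real check would live in the C container-executor, so the class and method names here are purely hypothetical; the sketch only relies on the fact that {{docker images -q <image>}} prints an image ID when (and only when) the image is in the local cache.

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class LocalImageCheck {

  // Returns true when "docker images -q <image>" prints a non-empty image ID,
  // i.e. the image is already present in the local Docker cache.
  public static boolean isImageLocal(String image) throws Exception {
    Process p = new ProcessBuilder("docker", "images", "-q", image).start();
    try (BufferedReader r = new BufferedReader(
        new InputStreamReader(p.getInputStream()))) {
      String id = r.readLine();
      p.waitFor();
      return id != null && !id.trim().isEmpty();
    }
  }

  public static void main(String[] args) throws Exception {
    String image = args.length > 0 ? args[0] : "centos:latest";
    System.out.println(image + " local? " + isImageLocal(image));
  }
}
{code}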

> Better handling of "docker.trusted.registries" in container-executor's 
> "trusted_image_check" function
> -
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: Docker
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if we run DistributedShell with "tangzhankun/tensorflow":
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But running a DistributedShell job with "centos", "centos[:tagName]", "ubuntu" 
> or "ubuntu[:tagName]" fails.
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need to better handle the above cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8303) YarnClient should contact TimelineReader for application/attempt/container report

2018-11-04 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674714#comment-16674714
 ] 

Rohith Sharma K S commented on YARN-8303:
-

[~abmodi] could you also update the -clusterid option in logsCLI as we 
discussed in the last weekly call? There are also some findbugs warnings which 
need to be fixed.

> YarnClient should contact TimelineReader for application/attempt/container 
> report
> -
>
> Key: YARN-8303
> URL: https://issues.apache.org/jira/browse/YARN-8303
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Rohith Sharma K S
>Assignee: Abhishek Modi
>Priority: Critical
> Attachments: YARN-8303.001.patch, YARN-8303.002.patch, 
> YARN-8303.poc.patch
>
>
> YarnClient gets app/attempt/container information from the RM. If the RM doesn't 
> have it, the ahsClient is queried. When only ATSv2 is enabled, yarnClient will 
> return empty results. 
> YarnClient is used by many users, which results in empty information for 
> app/attempt/container reports. 
> The proposal is to have an adapter in the yarn client so that app/attempt/container 
> reports can be generated from AHSv2Client, which calls the TimelineReader REST 
> API, gets the entity, and converts it into an app/attempt/container 
> report.
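For illustration, here is a small, self-contained sketch of the adapter idea described above. All names (ReportSource, HistoryAwareClient) are hypothetical and the reports are simplified to strings; the point is only the fallback order: ask the RM first, then fall back to an ATSv2-backed client that builds the report from TimelineReader entities.

{code:java}
interface ReportSource {
  // returns null when the source does not know the application
  String getApplicationReport(String appId);
}

class HistoryAwareClient {
  private final ReportSource rm;        // RM-backed source
  private final ReportSource timeline;  // AHSv2-style source backed by TimelineReader REST

  HistoryAwareClient(ReportSource rm, ReportSource timeline) {
    this.rm = rm;
    this.timeline = timeline;
  }

  String getApplicationReport(String appId) {
    String report = rm.getApplicationReport(appId);
    // RM no longer knows the app: build the report from the timeline service
    return report != null ? report : timeline.getApplicationReport(appId);
  }

  public static void main(String[] args) {
    HistoryAwareClient client = new HistoryAwareClient(
        appId -> null,                         // RM has forgotten the app
        appId -> "report-from-ATSv2:" + appId);
    System.out.println(client.getApplicationReport("application_1234_0001"));
  }
}
{code}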



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8233) NPE in CapacityScheduler#tryCommit when handling allocate/reserve proposal whose allocatedOrReservedContainer is null

2018-11-04 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674709#comment-16674709
 ] 

Hadoop QA commented on YARN-8233:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
31s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 30s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
34s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 40s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
27s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}104m 39s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
26s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}163m 20s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-8233 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12946853/YARN-8233.003.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux d71d91c6d4e6 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 4e3df75 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-YARN-Build/22413/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/22413/testReport/ |
| Max. process+thread count | 949 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 |
| Console output | 

[jira] [Updated] (YARN-8233) NPE in CapacityScheduler#tryCommit when handling allocate/reserve proposal whose allocatedOrReservedContainer is null

2018-11-04 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8233:
--
Target Version/s: 2.10.0, 3.2.0, 2.9.2, 3.0.4, 3.1.2, 3.3.0

> NPE in CapacityScheduler#tryCommit when handling allocate/reserve proposal 
> whose allocatedOrReservedContainer is null
> -
>
> Key: YARN-8233
> URL: https://issues.apache.org/jira/browse/YARN-8233
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8233.001.patch, YARN-8233.002.patch, 
> YARN-8233.003.patch
>
>
> Recently we saw an NPE in CapacityScheduler#tryCommit when trying to find 
> the attemptId by calling {{c.getAllocatedOrReservedContainer().get...}} from 
> an allocate/reserve proposal: allocatedOrReservedContainer was null and an 
> NPE was thrown.
> Reference code:
> {code:java}
> // find the application to accept and apply the ResourceCommitRequest
> if (request.anythingAllocatedOrReserved()) {
>   ContainerAllocationProposal c =
>   request.getFirstAllocatedOrReservedContainer();
>   attemptId =
>   c.getAllocatedOrReservedContainer().getSchedulerApplicationAttempt()
>   .getApplicationAttemptId();   //NPE happens here
> } else { ...
> {code}
> The proposal was constructed in 
> {{CapacityScheduler#createResourceCommitRequest}} and 
> allocatedOrReservedContainer is possibly null in async-scheduling process 
> when node was lost or application was finished (details in 
> {{CapacityScheduler#getSchedulerContainer}}).
> Reference code:
> {code:java}
>   // Allocated something
>   List allocations =
>   csAssignment.getAssignmentInformation().getAllocationDetails();
>   if (!allocations.isEmpty()) {
> RMContainer rmContainer = allocations.get(0).rmContainer;
> allocated = new ContainerAllocationProposal<>(
> getSchedulerContainer(rmContainer, true),   //possibly null
> getSchedulerContainersToRelease(csAssignment),
> 
> getSchedulerContainer(csAssignment.getFulfilledReservedContainer(),
> false), csAssignment.getType(),
> csAssignment.getRequestLocalityType(),
> csAssignment.getSchedulingMode() != null ?
> csAssignment.getSchedulingMode() :
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY,
> csAssignment.getResource());
>   }
> {code}
> I think we should add a null check for allocatedOrReservedContainer before 
> creating allocate/reserve proposals. Besides, the allocation process has already 
> increased the unconfirmed resource of the app when creating an allocate 
> assignment, so if this check finds null, we should decrease the unconfirmed 
> resource of the live app.
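For illustration, a fragment-style sketch of the proposed guard might look like the following. This is not the actual patch; decUnconfirmedRes and the surrounding names are assumptions based on the description above and the quoted reference code.

{code:java}
// Sketch only: resolve the scheduler container first, and bail out when it is
// gone (node lost or application finished) instead of building a proposal.
SchedulerContainer allocatedContainer = getSchedulerContainer(rmContainer, true);
if (allocatedContainer == null) {
  // Roll back the unconfirmed resource that was counted for this app
  // when the allocate assignment was created.
  application.decUnconfirmedRes(csAssignment.getResource());
} else {
  allocated = new ContainerAllocationProposal<>(allocatedContainer,
      getSchedulerContainersToRelease(csAssignment),
      getSchedulerContainer(csAssignment.getFulfilledReservedContainer(), false),
      csAssignment.getType(), csAssignment.getRequestLocalityType(),
      csAssignment.getSchedulingMode() != null
          ? csAssignment.getSchedulingMode()
          : SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY,
      csAssignment.getResource());
}
{code}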



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-2599) Standby RM should also expose some jmx and metrics

2018-11-04 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674687#comment-16674687
 ] 

Weiwei Yang commented on YARN-2599:
---

I agree with [~Tao Yang]'s point: separate metrics allow users to monitor the 
standby RM's health, which is also critical to the cluster. I was not part of the 
earlier discussion on this ticket; if it was cancelled because we were not sure 
what to expose on the standby RM, we can at least start with some basic ones, e.g. 
the heap, as [~Tao Yang] suggested.

[~Naganarasimha], [~rohithsharma], any comments on this?

> Standby RM should also expose some jmx and metrics
> --
>
> Key: YARN-2599
> URL: https://issues.apache.org/jira/browse/YARN-2599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1, 2.7.3, 3.0.0-alpha1
>Reporter: Karthik Kambatla
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-2599.patch
>
>
> YARN-1898 redirects jmx and metrics to the Active. As discussed there, we 
> need to separate out metrics displayed so the Standby RM can also be 
> monitored. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8233) NPE in CapacityScheduler#tryCommit when handling allocate/reserve proposal whose allocatedOrReservedContainer is null

2018-11-04 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674682#comment-16674682
 ] 

Weiwei Yang commented on YARN-8233:
---

LGTM, pending on jenkins, thanks [~Tao Yang].

> NPE in CapacityScheduler#tryCommit when handling allocate/reserve proposal 
> whose allocatedOrReservedContainer is null
> -
>
> Key: YARN-8233
> URL: https://issues.apache.org/jira/browse/YARN-8233
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8233.001.patch, YARN-8233.002.patch, 
> YARN-8233.003.patch
>
>
> Recently we saw an NPE in CapacityScheduler#tryCommit when trying to find 
> the attemptId by calling {{c.getAllocatedOrReservedContainer().get...}} from 
> an allocate/reserve proposal: allocatedOrReservedContainer was null and an 
> NPE was thrown.
> Reference code:
> {code:java}
> // find the application to accept and apply the ResourceCommitRequest
> if (request.anythingAllocatedOrReserved()) {
>   ContainerAllocationProposal c =
>   request.getFirstAllocatedOrReservedContainer();
>   attemptId =
>   c.getAllocatedOrReservedContainer().getSchedulerApplicationAttempt()
>   .getApplicationAttemptId();   //NPE happens here
> } else { ...
> {code}
> The proposal was constructed in 
> {{CapacityScheduler#createResourceCommitRequest}} and 
> allocatedOrReservedContainer is possibly null in async-scheduling process 
> when node was lost or application was finished (details in 
> {{CapacityScheduler#getSchedulerContainer}}).
> Reference code:
> {code:java}
>   // Allocated something
>   List allocations =
>   csAssignment.getAssignmentInformation().getAllocationDetails();
>   if (!allocations.isEmpty()) {
> RMContainer rmContainer = allocations.get(0).rmContainer;
> allocated = new ContainerAllocationProposal<>(
> getSchedulerContainer(rmContainer, true),   //possibly null
> getSchedulerContainersToRelease(csAssignment),
> 
> getSchedulerContainer(csAssignment.getFulfilledReservedContainer(),
> false), csAssignment.getType(),
> csAssignment.getRequestLocalityType(),
> csAssignment.getSchedulingMode() != null ?
> csAssignment.getSchedulingMode() :
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY,
> csAssignment.getResource());
>   }
> {code}
> I think we should add a null check for allocatedOrReservedContainer before 
> creating allocate/reserve proposals. Besides, the allocation process has already 
> increased the unconfirmed resource of the app when creating an allocate 
> assignment, so if this check finds null, we should decrease the unconfirmed 
> resource of the live app.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8833) compute shares may lock the scheduling process

2018-11-04 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674677#comment-16674677
 ] 

Weiwei Yang commented on YARN-8833:
---

Assigned this ticket to [~yoelee]. Thanks for creating the issue, looking 
forward to the patch!

> compute shares may  lock the scheduling process
> ---
>
> Key: YARN-8833
> URL: https://issues.apache.org/jira/browse/YARN-8833
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: liyakun
>Assignee: liyakun
>Priority: Major
>
> When using w2rRatio to compute fair shares, there is a chance of triggering an 
> int overflow and entering an infinite loop.
> Since the compute-share thread holds the writeLock, it may block the 
> scheduling thread.
> This issue occurred in a production environment with 8500 nodes, and we have 
> already fixed it.
>  
> added 2018-10-29: elaborating the problem 
> /**
>  * Compute the resources that would be used given a weight-to-resource ratio
>  * w2rRatio, for use in the computeFairShares algorithm as described in #
>  */
> private static int resourceUsedWithWeightToResourceRatio(double w2rRatio,
>     Collection<? extends Schedulable> schedulables, String type) {
>   int resourcesTaken = 0;
>   for (Schedulable sched : schedulables) {
>     int share = computeShare(sched, w2rRatio, type);
>     resourcesTaken += share;
>   }
>   return resourcesTaken;
> }
> The variable resourcesTaken is an int. It is the accumulated sum of the results of
> computeShare(Schedulable sched, double w2rRatio, String type), each of which is a 
> value between the min share and max share of a queue.
> For example, when there are 3 queues, each with min share = max share = 
> Integer.MAX_VALUE, resourcesTaken will overflow the int range and wrap around 
> (possibly to a negative number).
> When resourceUsedWithWeightToResourceRatio(double w2rRatio, Collection<? extends 
> Schedulable> schedulables, String type) returns such an overflowed value, the 
> loop in 
> computeSharesInternal(), which holds the scheduler lock, may never exit:
>  
> //org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares
> while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
>     < totalResource) {
>   rMax *= 2.0;
> }
> This may block the scheduling thread.
>  
>  
>  
>  
>  
>  
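A tiny, self-contained demonstration of the wrap-around described above (two shares of Integer.MAX_VALUE are used here for brevity; the description uses three queues, but the effect is the same): the int accumulator wraps around while the long total it is compared against does not, so the "< totalResource" condition never becomes false.

{code:java}
public class OverflowDemo {
  public static void main(String[] args) {
    int resourcesTaken = 0;
    resourcesTaken += Integer.MAX_VALUE;   // share of queue 1
    resourcesTaken += Integer.MAX_VALUE;   // share of queue 2: wraps to -2

    long totalResource = 2L * Integer.MAX_VALUE;  // 4294967294, computed as long

    System.out.println(resourcesTaken);                  // -2
    System.out.println(resourcesTaken < totalResource);  // always true -> infinite loop
  }
}
{code}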



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-8833) compute shares may lock the scheduling process

2018-11-04 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang reassigned YARN-8833:
-

Assignee: liyakun

> compute shares may  lock the scheduling process
> ---
>
> Key: YARN-8833
> URL: https://issues.apache.org/jira/browse/YARN-8833
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: liyakun
>Assignee: liyakun
>Priority: Major
>
> When using w2rRatio to compute fair shares, there is a chance of triggering an 
> int overflow and entering an infinite loop.
> Since the compute-share thread holds the writeLock, it may block the 
> scheduling thread.
> This issue occurred in a production environment with 8500 nodes, and we have 
> already fixed it.
>  
> added 2018-10-29: elaborating the problem 
> /**
>  * Compute the resources that would be used given a weight-to-resource ratio
>  * w2rRatio, for use in the computeFairShares algorithm as described in #
>  */
> private static int resourceUsedWithWeightToResourceRatio(double w2rRatio,
>     Collection<? extends Schedulable> schedulables, String type) {
>   int resourcesTaken = 0;
>   for (Schedulable sched : schedulables) {
>     int share = computeShare(sched, w2rRatio, type);
>     resourcesTaken += share;
>   }
>   return resourcesTaken;
> }
> The variable resourcesTaken is an int. It is the accumulated sum of the results of
> computeShare(Schedulable sched, double w2rRatio, String type), each of which is a 
> value between the min share and max share of a queue.
> For example, when there are 3 queues, each with min share = max share = 
> Integer.MAX_VALUE, resourcesTaken will overflow the int range and wrap around 
> (possibly to a negative number).
> When resourceUsedWithWeightToResourceRatio(double w2rRatio, Collection<? extends 
> Schedulable> schedulables, String type) returns such an overflowed value, the 
> loop in 
> computeSharesInternal(), which holds the scheduler lock, may never exit:
>  
> //org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares
> while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
>     < totalResource) {
>   rMax *= 2.0;
> }
> This may block the scheduling thread.
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.

2018-11-04 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang reassigned YARN-8714:
--

Assignee: Zhankun Tang

> [Submarine] Support files/tarballs to be localized for a training job.
> --
>
> Key: YARN-8714
> URL: https://issues.apache.org/jira/browse/YARN-8714
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Zhankun Tang
>Priority: Major
>
> See 
> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7,
>  {{job run --localizations ...}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8858) CapacityScheduler should respect maximum node resource when per-queue maximum-allocation is being used.

2018-11-04 Thread Akira Ajisaka (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674663#comment-16674663
 ] 

Akira Ajisaka commented on YARN-8858:
-

Hi [~cheersyang], the added regression test in the patch for branch-2.8 is 
failing. Would you fix it?

> CapacityScheduler should respect maximum node resource when per-queue 
> maximum-allocation is being used.
> ---
>
> Key: YARN-8858
> URL: https://issues.apache.org/jira/browse/YARN-8858
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 2.9.2, 3.0.4, 3.1.2, 3.3.0
>
> Attachments: YARN-8858-branch-2.8.001.patch, YARN-8858.001.patch, 
> YARN-8858.002.patch
>
>
> This issue happens after YARN-8720.
> Before that, AMS used scheduler.getMaximumAllocation to do the normalization. 
> After that, AMS uses LeafQueue.getMaximumAllocation. The scheduler one uses 
> nodeTracker.getMaximumAllocation, but LeafQueue.getMaximumAllocation doesn't. 
> We should use scheduler.getMaximumAllocation to cap the per-queue 
> maximum-allocation every time.
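For illustration, a fragment-style sketch of that intent (not the actual patch; Resources.componentwiseMin and the getter names are assumptions based on the description) could be:

{code:java}
// Cap the per-queue maximum-allocation by what the nodes can actually offer.
Resource queueMax = leafQueue.getMaximumAllocation();            // per-queue setting
Resource clusterMax = scheduler.getMaximumResourceCapability();  // nodeTracker-backed
Resource effectiveMax = Resources.componentwiseMin(queueMax, clusterMax);
{code}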



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8927) Better handling of "docker.trusted.registries" in container-executor's "trusted_image_check" function

2018-11-04 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674660#comment-16674660
 ] 

Zhankun Tang commented on YARN-8927:


[~eyang], regarding splitting the configuration by run and pull: that was just a 
suggestion. I agree that we should think more and establish a good design in 
point 1 to avoid a revisit.

How the idea of splitting into pull and run came up might be helpful for your 
reference. It came from thinking about how "YARN-3854" would decide which repos it 
can pull from and run. Something was unclear to me about the local image setting 
"docker.trusted.local.image". *Now I seem to prefer that 
"docker.trusted.local.image" be a white-list.*

Consider the scenario below with a boolean flag:

{code:java}
"docker.trusted.local.image" = false
"docker.trusted.registries" = "cmp1, library"
{code}

When a user requests "cmp1/img1" or "centos:latest", YARN-3854 may download it 
first if there is no local image, because we trust "cmp1" and Docker Hub. And 
then, when c-e wants to run the container, it should first check whether this 
"cmp1/img1" is really local.

If it was local before YARN-3854 ran, deny it because 
"docker.trusted.local.image" is false. Otherwise, allow it to run based on the 
privilege/mount white-list check result.

This seems to require YARN to maintain a list of local images in advance in the 
Java layer, because c-e is not long-running.

Although passing the list to c-e and letting c-e do the check is possible, this 
seems awkward and complex. And we would need to handle NM restart and load back 
the original local image names.

So it seems "_docker.trusted.local.image_" should be a white-list to avoid the 
above complexity. And the names could be like:

{code:java}
"docker.trusted.local.images" = "cmp1/img1,centos"
"docker.trusted.registries" = "cmp1,library"
{code}

But the above configuration still doesn't seem that straightforward to me, so the 
configuration below came to mind:

{code:java}
"docker.pull.trusted.registries" = "cmp1,library"
"docker.run.trusted.registries" = "cmp1,library"
{code}

Please correct me if I missed something important. I have no strong opinion on 
either configuration. Any thoughts? [~eyang], [~ebadger]

> Better handling of "docker.trusted.registries" in container-executor's 
> "trusted_image_check" function
> -
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: Docker
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if we run DistributedShell with "tangzhankun/tensorflow":
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But running a DistributedShell job with "centos", "centos[:tagName]", "ubuntu" 
> or "ubuntu[:tagName]" fails.
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need to better handle the above cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-8376) Separate white list for docker.trusted.registries and docker.privileged-container.registries

2018-11-04 Thread Eric Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang reassigned YARN-8376:
---

Assignee: Zhankun Tang

> Separate white list for docker.trusted.registries and 
> docker.privileged-container.registries
> 
>
> Key: YARN-8376
> URL: https://issues.apache.org/jira/browse/YARN-8376
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: docker
>
> In the ideal world, it would be possible to have separate white lists for 
> docker registry depending on the security requirement for each type of docker 
> images:
> 1. Registries from which we can run non-privileged containers without mounts
> 2. Registries from which we can run non-privileged containers with mounts
> 3. Registries from which we can run privileged or non-privileged containers 
> with mounts
> In the current implementation, there are only type 1 and a combined type 2/3.  It 
> would be nice to define a separate white list to differentiate between 2 
> and 3.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-8472) YARN Container Phase 2

2018-11-04 Thread Eric Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang reassigned YARN-8472:
---

Assignee: Zhankun Tang

> YARN Container Phase 2
> --
>
> Key: YARN-8472
> URL: https://issues.apache.org/jira/browse/YARN-8472
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Yang
>Assignee: Zhankun Tang
>Priority: Major
>
> In YARN-3611, we have implemented basic Docker container support for YARN.  
> This story is the next phase to improve container usability.
> Several areas for improvement are:
>  # Software defined network support
>  # Interactive shell to container
>  # User management sss/nscd integration
>  # Runc/containerd support
>  # Metrics/Logs integration with Timeline service v2 
>  # Docker container profiles
>  # Docker cgroup management



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-8472) YARN Container Phase 2

2018-11-04 Thread Eric Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang reassigned YARN-8472:
---

Assignee: Eric Yang  (was: Zhankun Tang)

> YARN Container Phase 2
> --
>
> Key: YARN-8472
> URL: https://issues.apache.org/jira/browse/YARN-8472
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
>
> In YARN-3611, we have implemented basic Docker container support for YARN.  
> This story is the next phase to improve container usability.
> Several areas for improvement are:
>  # Software defined network support
>  # Interactive shell to container
>  # User management sss/nscd integration
>  # Runc/containerd support
>  # Metrics/Logs integration with Timeline service v2 
>  # Docker container profiles
>  # Docker cgroup management



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8233) NPE in CapacityScheduler#tryCommit when handling allocate/reserve proposal whose allocatedOrReservedContainer is null

2018-11-04 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674625#comment-16674625
 ] 

Tao Yang commented on YARN-8233:


Attached the v3 patch to fix checkstyle warnings by removing unused test 
code. The UT failures seem unrelated to this patch; I can't reproduce them 
in my local environment.

> NPE in CapacityScheduler#tryCommit when handling allocate/reserve proposal 
> whose allocatedOrReservedContainer is null
> -
>
> Key: YARN-8233
> URL: https://issues.apache.org/jira/browse/YARN-8233
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8233.001.patch, YARN-8233.002.patch, 
> YARN-8233.003.patch
>
>
> Recently we saw an NPE in CapacityScheduler#tryCommit when trying to find 
> the attemptId by calling {{c.getAllocatedOrReservedContainer().get...}} from 
> an allocate/reserve proposal: allocatedOrReservedContainer was null and an 
> NPE was thrown.
> Reference code:
> {code:java}
> // find the application to accept and apply the ResourceCommitRequest
> if (request.anythingAllocatedOrReserved()) {
>   ContainerAllocationProposal c =
>   request.getFirstAllocatedOrReservedContainer();
>   attemptId =
>   c.getAllocatedOrReservedContainer().getSchedulerApplicationAttempt()
>   .getApplicationAttemptId();   //NPE happens here
> } else { ...
> {code}
> The proposal was constructed in 
> {{CapacityScheduler#createResourceCommitRequest}} and 
> allocatedOrReservedContainer is possibly null in async-scheduling process 
> when node was lost or application was finished (details in 
> {{CapacityScheduler#getSchedulerContainer}}).
> Reference code:
> {code:java}
>   // Allocated something
>   List allocations =
>   csAssignment.getAssignmentInformation().getAllocationDetails();
>   if (!allocations.isEmpty()) {
> RMContainer rmContainer = allocations.get(0).rmContainer;
> allocated = new ContainerAllocationProposal<>(
> getSchedulerContainer(rmContainer, true),   //possibly null
> getSchedulerContainersToRelease(csAssignment),
> 
> getSchedulerContainer(csAssignment.getFulfilledReservedContainer(),
> false), csAssignment.getType(),
> csAssignment.getRequestLocalityType(),
> csAssignment.getSchedulingMode() != null ?
> csAssignment.getSchedulingMode() :
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY,
> csAssignment.getResource());
>   }
> {code}
> I think we should add a null check for allocatedOrReservedContainer before 
> creating allocate/reserve proposals. Besides, the allocation process has already 
> increased the unconfirmed resource of the app when creating an allocate 
> assignment, so if this check finds null, we should decrease the unconfirmed 
> resource of the live app.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8233) NPE in CapacityScheduler#tryCommit when handling allocate/reserve proposal whose allocatedOrReservedContainer is null

2018-11-04 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8233:
---
Attachment: YARN-8233.003.patch

> NPE in CapacityScheduler#tryCommit when handling allocate/reserve proposal 
> whose allocatedOrReservedContainer is null
> -
>
> Key: YARN-8233
> URL: https://issues.apache.org/jira/browse/YARN-8233
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8233.001.patch, YARN-8233.002.patch, 
> YARN-8233.003.patch
>
>
> Recently we saw an NPE in CapacityScheduler#tryCommit when trying to find 
> the attemptId by calling {{c.getAllocatedOrReservedContainer().get...}} from 
> an allocate/reserve proposal: allocatedOrReservedContainer was null and an 
> NPE was thrown.
> Reference code:
> {code:java}
> // find the application to accept and apply the ResourceCommitRequest
> if (request.anythingAllocatedOrReserved()) {
>   ContainerAllocationProposal c =
>   request.getFirstAllocatedOrReservedContainer();
>   attemptId =
>   c.getAllocatedOrReservedContainer().getSchedulerApplicationAttempt()
>   .getApplicationAttemptId();   //NPE happens here
> } else { ...
> {code}
> The proposal was constructed in 
> {{CapacityScheduler#createResourceCommitRequest}} and 
> allocatedOrReservedContainer is possibly null in async-scheduling process 
> when node was lost or application was finished (details in 
> {{CapacityScheduler#getSchedulerContainer}}).
> Reference code:
> {code:java}
>   // Allocated something
>   List allocations =
>   csAssignment.getAssignmentInformation().getAllocationDetails();
>   if (!allocations.isEmpty()) {
> RMContainer rmContainer = allocations.get(0).rmContainer;
> allocated = new ContainerAllocationProposal<>(
> getSchedulerContainer(rmContainer, true),   //possibly null
> getSchedulerContainersToRelease(csAssignment),
> 
> getSchedulerContainer(csAssignment.getFulfilledReservedContainer(),
> false), csAssignment.getType(),
> csAssignment.getRequestLocalityType(),
> csAssignment.getSchedulingMode() != null ?
> csAssignment.getSchedulingMode() :
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY,
> csAssignment.getResource());
>   }
> {code}
> I think we should add a null check for allocatedOrReservedContainer before 
> creating allocate/reserve proposals. Besides, the allocation process has already 
> increased the unconfirmed resource of the app when creating an allocate 
> assignment, so if this check finds null, we should decrease the unconfirmed 
> resource of the live app.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-2599) Standby RM should also expose some jmx and metrics

2018-11-04 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674624#comment-16674624
 ] 

Tao Yang commented on YARN-2599:


Hi, [~Naganarasimha] & [~kasha] & [~rohithsharma],
We have hit problems where the standby RM can't be transitioned to active because 
of a memory leak. If we could monitor heap metrics from the standby RM, perhaps we 
could avoid such problems. Can we re-evaluate whether the standby RM should expose 
jmx?  Thanks.
cc: [~cheersyang], [~leftnoteasy], [~sunilg]
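For what it is worth, the most basic heap numbers are already available from the JVM itself, so exposing them on the standby would not require any RM-specific state. A minimal, self-contained example of reading them (illustrative only, not RM code):

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapMetric {
  public static void main(String[] args) {
    // Standard JMX bean: the same numbers a standby RM could publish.
    MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    System.out.println("heapUsed=" + heap.getUsed()
        + " heapCommitted=" + heap.getCommitted()
        + " heapMax=" + heap.getMax());
  }
}
{code}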

> Standby RM should also expose some jmx and metrics
> --
>
> Key: YARN-2599
> URL: https://issues.apache.org/jira/browse/YARN-2599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1, 2.7.3, 3.0.0-alpha1
>Reporter: Karthik Kambatla
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-2599.patch
>
>
> YARN-1898 redirects jmx and metrics to the Active. As discussed there, we 
> need to separate out metrics displayed so the Standby RM can also be 
> monitored. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8927) Better handling of "docker.trusted.registries" in container-executor's "trusted_image_check" function

2018-11-04 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674622#comment-16674622
 ] 

Zhankun Tang commented on YARN-8927:


[~eyang], sorry for the improper statement of "The problem .." in point 1. 
Yes, we could add a check function like your pseudo-code to allow top-level 
images.

> Better handling of "docker.trusted.registries" in container-executor's 
> "trusted_image_check" function
> -
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: Docker
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if we run DistributedShell with "tangzhankun/tensorflow":
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But running a DistributedShell job with "centos", "centos[:tagName]", "ubuntu" 
> or "ubuntu[:tagName]" fails.
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need to better handle the above cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value

2018-11-04 Thread Wilfred Spiegelenburg (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674614#comment-16674614
 ] 

Wilfred Spiegelenburg commented on YARN-7560:
-

[~zhengchenyu] do you mind if I take over to get this finalised and checked in?

> Resourcemanager hangs when  resourceUsedWithWeightToResourceRatio return a 
> overflow value 
> --
>
> Key: YARN-7560
> URL: https://issues.apache.org/jira/browse/YARN-7560
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 3.0.0
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Major
> Attachments: YARN-7560.000.patch, YARN-7560.001.patch
>
>
> In our cluster, we changed the configuration and then ran refreshQueues, and we 
> found that the ResourceManager hangs. The ResourceManager also can't restart 
> successfully. The jstack output always looks like this:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x7f98e8017000 nid=0x2f5 runnable 
> [0x7f98eed9a000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
> - locked <0x7f8c4a8177a0> (a java.util.HashMap)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
> - locked <0x7f8c4a7eb2e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c4a76ac48> (a java.lang.Object)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c49254268> (a java.lang.Object)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x7f8c467495e0> (a java.lang.Object)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
> {code}
> When we debugged the cluster, we found that resourceUsedWithWeightToResourceRatio 
> returned a negative value, so the loop couldn't exit. In our cluster the sum of 
> all minRes is over Integer.MAX_VALUE, so resourceUsedWithWeightToResourceRatio 
> returns a negative value.
> Below is the loop. totalResource is a long, so it is always positive, but 
> resourceUsedWithWeightToResourceRatio returns an int. Our cluster is so big that 
> resourceUsedWithWeightToResourceRatio returns an overflowed, negative value, so 
> the loop never breaks.
> {code}
> while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
> < totalResource) {
>   rMax *= 2.0;
> }
> {code}
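One illustrative direction for a fix (not necessarily what the attached patches do) is to accumulate the shares in a long so the comparison against the long totalResource cannot be defeated by int wrap-around; a sketch mirroring the method quoted above:

{code:java}
private static long resourceUsedWithWeightToResourceRatio(double w2rRatio,
    Collection<? extends Schedulable> schedulables, String type) {
  long resourcesTaken = 0;
  for (Schedulable sched : schedulables) {
    // computeShare returns an int share per schedulable; summing into a long
    // keeps the total comparable with the long totalResource.
    resourcesTaken += computeShare(sched, w2rRatio, type);
  }
  return resourcesTaken;
}
{code}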



--
This message was sent by 

[jira] [Commented] (YARN-8865) RMStateStore contains large number of expired RMDelegationToken

2018-11-04 Thread Wilfred Spiegelenburg (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674592#comment-16674592
 ] 

Wilfred Spiegelenburg commented on YARN-8865:
-

[~daryn] or [~jlowe] Could you please check the latest patch? I removed the 
reset of the maximum date as per the last comment from Daryn.

> RMStateStore contains large number of expired RMDelegationToken
> ---
>
> Key: YARN-8865
> URL: https://issues.apache.org/jira/browse/YARN-8865
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-8865.001.patch, YARN-8865.002.patch, 
> YARN-8865.003.patch, YARN-8865.004.patch, YARN-8865.005.patch, 
> YARN-8865.006.patch
>
>
> When the RM state store is restored, expired delegation tokens are restored 
> and added to the system. These expired tokens do not get cleaned up or 
> removed. The exact reason why the tokens are still in the store is not clear. 
> We have seen as many as 250,000 tokens in the store some of which were 2 
> years old.
> This has two side effects:
> * for the zookeeper store this leads to a jute buffer exhaustion issue and 
> prevents the RM from becoming active.
> * restore takes longer than needed and heap usage is higher than it should be
> We should not restore already expired tokens since they cannot be renewed or 
> used.
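For illustration, a fragment-style sketch of the restore-time filter suggested above (not the actual patch; the method names are assumptions based on the delegation-token secret manager API):

{code:java}
// During recovery, skip tokens whose maximum lifetime has already passed:
// they cannot be renewed or used, so there is no point restoring them.
long now = System.currentTimeMillis();
if (tokenIdentifier.getMaxDate() < now) {
  return;  // expired token: do not add it back to the secret manager
}
secretManager.addPersistedDelegationToken(tokenIdentifier, renewDate);
{code}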



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8738) FairScheduler configures maxResources or minResources as negative, the value parse to a positive number.

2018-11-04 Thread Sen Zhao (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674580#comment-16674580
 ] 

Sen Zhao commented on YARN-8738:


[~snemeth]. Of course, you can do this.

> FairScheduler configures maxResources or minResources as negative, the value 
> parse to a positive number.
> 
>
> Key: YARN-8738
> URL: https://issues.apache.org/jira/browse/YARN-8738
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Sen Zhao
>Assignee: Szilard Nemeth
>Priority: Major
>
> If maxResources or minResources is configured as a negative number, the value 
> will be positive after parsing.
> If this is a problem, I will fix it. If not, the way 
> FairSchedulerConfiguration#parseNewStyleResource parses negative numbers should 
> be the same as in parseOldStyleResource.
> cc:[~templedf], [~leftnoteasy]
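For illustration, a minimal, self-contained sketch of the kind of validation suggested above (names are illustrative, not FairSchedulerConfiguration itself): reject negative values instead of silently turning them positive.

{code:java}
public class ResourceValueCheck {

  // Parses a resource value and refuses negative numbers.
  public static long parseNonNegative(String raw, String key) {
    long value = Long.parseLong(raw.trim());
    if (value < 0) {
      throw new IllegalArgumentException(
          key + " must not be negative, got " + value);
    }
    return value;
  }

  public static void main(String[] args) {
    System.out.println(parseNonNegative("8192", "maxResources.memory-mb"));  // ok
    // parseNonNegative("-1", "maxResources.memory-mb") would throw instead of
    // being parsed as a positive number.
  }
}
{code}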



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-7631) ResourceRequest with different Capacity (Resource) overrides each other in RM and thus lost

2018-11-04 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth reassigned YARN-7631:


Assignee: Szilard Nemeth

> ResourceRequest with different Capacity (Resource) overrides each other in RM 
> and thus lost
> ---
>
> Key: YARN-7631
> URL: https://issues.apache.org/jira/browse/YARN-7631
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Botong Huang
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: resourcebug.patch
>
>
> Today in AMRMClientImpl, the ResourceRequests (RR) are kept as: RequestId -> 
> Priority -> ResourceName -> ExecutionType -> Resource (Capacity) -> 
> ResourceRequestInfo (the actual RR). This means that only RRs with the same 
> (requestId, priority, resourcename, executionType, resource) will be grouped 
> and aggregated together. 
> On the RM side, however, the mapping is SchedulerRequestKey (RequestId, priority) -> 
> LocalityAppPlacementAllocator (ResourceName -> RR). 
> The issue is that on the RM side, Resource is not part of the key to the RR at all. 
> (Note that executionType is also not in the RM-side key, but that is fine because 
> the RM handles it separately as container update requests.) This means that under 
> the same value of (requestId, priority, resourceName), RRs with different 
> Resource values will be grouped together and override each other in the RM. As a 
> result, some of the container requests are lost and will never be allocated. 
> Furthermore, since the two RRs are kept under different keys on the AMRMClient 
> side, allocation of RR1 will only trigger a cancel for RR1; the pending RR2 
> will not get resent either. 
> I've attached a unit test (resourcebug.patch), which is failing in trunk, to 
> illustrate this issue.
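To make the key mismatch concrete, here is a small, self-contained illustration (hypothetical types; the real code uses Resource objects and ResourceRequestInfo, not strings): the client keys requests by Resource as well, the RM does not, so two requests that differ only in Resource collapse into one entry on the RM side.

{code:java}
import java.util.HashMap;
import java.util.Map;

public class KeyMismatchDemo {
  public static void main(String[] args) {
    // AM side: ... -> resourceName -> resource(capability) -> request
    Map<String, Map<String, String>> amSide = new HashMap<>();
    amSide.computeIfAbsent("*", k -> new HashMap<>()).put("<2GB,1vcore>", "RR1");
    amSide.computeIfAbsent("*", k -> new HashMap<>()).put("<4GB,2vcore>", "RR2");

    // RM side: ... -> resourceName -> request (Resource is not part of the key)
    Map<String, String> rmSide = new HashMap<>();
    rmSide.put("*", "RR1");
    rmSide.put("*", "RR2");  // overrides RR1, which is effectively lost

    System.out.println("AM entries: " + amSide.get("*").size());  // 2
    System.out.println("RM entries: " + rmSide.size());           // 1
  }
}
{code}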



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-7583) Reduce overhead of container reacquisition

2018-11-04 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth reassigned YARN-7583:


Assignee: Szilard Nemeth

> Reduce overhead of container reacquisition
> --
>
> Key: YARN-7583
> URL: https://issues.apache.org/jira/browse/YARN-7583
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: Jason Lowe
>Assignee: Szilard Nemeth
>Priority: Major
>
> When reacquiring containers after a nodemanager restart, the Linux container 
> executor is invoked to essentially kill -0 the process to check whether it is 
> alive.  It would be a lot cheaper on Linux to stat the /proc/<pid> directory, 
> which the nodemanager can do directly, rather than pay for the fork-and-exec 
> through the container executor and risk signal permission issues.
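For illustration, the cheaper liveness check could be as small as a stat of the proc entry; a minimal, self-contained sketch (illustrative only, not the nodemanager code):

{code:java}
import java.nio.file.Files;
import java.nio.file.Paths;

public class ProcLivenessCheck {

  // On Linux, /proc/<pid> exists exactly while the process is alive,
  // so no fork-and-exec of the container-executor is needed.
  public static boolean isAlive(int pid) {
    return Files.exists(Paths.get("/proc", Integer.toString(pid)));
  }

  public static void main(String[] args) {
    System.out.println("pid 1 alive? " + isAlive(1));  // true on a normal Linux host
  }
}
{code}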



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-7744) Fix Get status rest api response when application is destroyed

2018-11-04 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth reassigned YARN-7744:


Assignee: Szilard Nemeth

> Fix Get status rest api response when application is destroyed
> --
>
> Key: YARN-7744
> URL: https://issues.apache.org/jira/browse/YARN-7744
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Yesha Vora
>Assignee: Szilard Nemeth
>Priority: Critical
>
> Steps:
> 1) Create a yarn service 
> 2) Destroy a yarn service
> Run get status for application using REST API:
> {code}
> response json = {u'diagnostics': u'Failed to retrieve service: File does not 
> exist: 
> hdfs://mycluster/user/yarn/.yarn/services/httpd-service/httpd-service.json'}
> status code = 500{code}
> The REST API should respond with proper json including diagnostics and HTTP 
> status code 404
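For illustration, a hedged JAX-RS-style sketch of the expected behaviour (names are illustrative, not the actual service code): map a destroyed/unknown service to a 404 with a small JSON diagnostics body rather than a 500.

{code:java}
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

public class ServiceStatusResponses {

  // Builds the 404 response the description asks for.
  public static Response notFound(String serviceName) {
    String body = "{\"diagnostics\":\"Service " + serviceName + " not found\"}";
    return Response.status(Response.Status.NOT_FOUND)
        .type(MediaType.APPLICATION_JSON)
        .entity(body)
        .build();
  }
}
{code}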



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-7768) yarn application -status appName does not return valid json

2018-11-04 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth reassigned YARN-7768:


Assignee: Szilard Nemeth

> yarn application -status appName does not return valid json
> ---
>
> Key: YARN-7768
> URL: https://issues.apache.org/jira/browse/YARN-7768
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Yesha Vora
>Assignee: Szilard Nemeth
>Priority: Major
>
> yarn application -status <appName> does not return valid json
> 1) It has class names added to the json content, such as class Service, class 
> KerberosPrincipal, class Component, etc.
> 2) The json object should be comma separated.
> {code}
> [hrt_qa@2 hadoopqe]$ yarn application -status httpd-hrt-qa
> WARNING: YARN_LOG_DIR has been replaced by HADOOP_LOG_DIR. Using value of 
> YARN_LOG_DIR.
> WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of 
> YARN_LOGFILE.
> WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of 
> YARN_PID_DIR.
> WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
> 18/01/18 00:33:07 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 18/01/18 00:33:08 WARN shortcircuit.DomainSocketFactory: The short-circuit 
> local reads feature cannot be used because libhadoop cannot be loaded.
> 18/01/18 00:33:08 INFO utils.ServiceApiUtil: Loading service definition from 
> hdfs://mycluster/user/hrt_qa/.yarn/services/httpd-hrt-qa/httpd-hrt-qa.json
> 18/01/18 00:33:09 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
> to rm2
> class Service {
> name: httpd-hrt-qa
> id: application_1516234304810_0001
> artifact: null
> resource: null
> launchTime: null
> numberOfRunningContainers: null
> lifetime: 3600
> placementPolicy: null
> components: [class Component {
> name: httpd
> state: STABLE
> dependencies: []
> readinessCheck: null
> artifact: class Artifact {
> id: centos/httpd-24-centos7:latest
> type: DOCKER
> uri: null
> }
> launchCommand: /usr/bin/run-httpd
> resource: class Resource {
> profile: null
> cpus: 1
> memory: 1024
> additional: null
> }
> numberOfContainers: 2
> containers: [class Container {
> id: container_e05_1516234304810_0001_01_02
> launchTime: Thu Jan 18 00:19:22 UTC 2018
> ip: 172.17.0.2
> hostname: httpd-0.httpd-hrt-qa.hrt_qa.test.com
> bareHost: 5.hwx.site
> state: READY
> componentInstanceName: httpd-0
> resource: null
> artifact: null
> privilegedContainer: null
> }, class Container {
> id: container_e05_1516234304810_0001_01_03
> launchTime: Thu Jan 18 00:19:23 UTC 2018
> ip: 172.17.0.3
> hostname: httpd-1.httpd-hrt-qa.hrt_qa.test.com
> bareHost: 5.hwx.site
> state: READY
> componentInstanceName: httpd-1
> resource: null
> artifact: null
> privilegedContainer: null
> }]
> runPrivilegedContainer: false
> placementPolicy: null
> configuration: class Configuration {
> properties: {}
> env: {}
> files: [class ConfigFile {
> type: TEMPLATE
> destFile: /var/www/html/index.html
> srcFile: null
> properties: 
> {content=TitleHello from 
> ${COMPONENT_INSTANCE_NAME}!}
> }]
> }
> quicklinks: []
> }, class Component {
> name: httpd-proxy
> state: FLEXING
> dependencies: []
> readinessCheck: null
> artifact: class Artifact {
> id: centos/httpd-24-centos7:latest
> type: DOCKER
> uri: null
> }
> launchCommand: /usr/bin/run-httpd
> resource: class Resource {
> profile: null
> cpus: 1
> memory: 1024
> additional: null
> }
> numberOfContainers: 1
> containers: []
> runPrivilegedContainer: false
> placementPolicy: null
> configuration: class Configuration {
> properties: {}
> env: {}
> files: [class ConfigFile {
> type: TEMPLATE
> destFile: /etc/httpd/conf.d/httpd-proxy.conf
> srcFile: httpd-proxy.conf
> properties: {}
> }]
> }
> 

[jira] [Assigned] (YARN-8738) FairScheduler configures maxResources or minResources as negative, the value parse to a positive number.

2018-11-04 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth reassigned YARN-8738:


Assignee: Szilard Nemeth

> FairScheduler configures maxResources or minResources as negative, the value 
> parse to a positive number.
> 
>
> Key: YARN-8738
> URL: https://issues.apache.org/jira/browse/YARN-8738
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Sen Zhao
>Assignee: Szilard Nemeth
>Priority: Major
>
> If maxResources or minResources is configured as a negative number, the value 
> will be positive after parsing.
> If this is a problem, I will fix it. If not, 
> FairSchedulerConfiguration#parseNewStyleResource should parse negative numbers 
> the same way parseOldStyleResource does.
> cc:[~templedf], [~leftnoteasy]
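For illustration only (the queue name and values below are made up), the kind of 
allocation-file entry being described might look like this:

{code:xml}
<allocations>
  <queue name="sample_queue">
    <!-- old-style resource string -->
    <minResources>-1024 mb, -1 vcores</minResources>
    <!-- new-style resource string; the report is that these negative
         values end up positive after parsing -->
    <maxResources>memory-mb=-4096, vcores=-4</maxResources>
  </queue>
</allocations>
{code}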



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8927) Better handling of "docker.trusted.registries" in container-executor's "trusted_image_check" function

2018-11-04 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674136#comment-16674136
 ] 

Eric Yang edited comment on YARN-8927 at 11/4/18 5:09 PM:
--

[~tangzhankun] Kind of, and we add more fine-grained controls.  The summarized 
guidelines are:

# docker.trusted.registries trusts images from Docker Hub as well as from 
private trusted registries.  If the "library" keyword is included, top-level 
images are trusted.
# docker.privileged-containers.registries allows trusted images to run as a 
privileged user.  If the "library" keyword is included, top-level images can 
run privileged.
# docker.trusted.local.image[s] and docker.privileged.local.image[s] are either 
a Boolean flag that trusts all local images or a list that white-lists certain 
local images.  (To be discussed in YARN-8955; a sample configuration is 
sketched below.)
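A minimal container-executor.cfg sketch of the above, for illustration only (the 
registry names are made up, and the local-image key is a placeholder pending the 
YARN-8955 discussion):

{code}
[docker]
  docker.trusted.registries=library,centos,registry.example.com
  docker.privileged-containers.registries=registry.example.com
  # placeholder pending YARN-8955: a boolean, or a white list of local images
  docker.trusted.local.images=true
{code}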



was (Author: eyang):
[~tangzhankun] Kind of, and we add more fine grained controls.  The summarized 
guideline are:

# docker.trusted.registries that can trust images on dockerhub as well as 
private trusted registries.  If "library" keyword is used, top level images are 
trusted.
# docker.privileged.registries that can run trusted images as privileged user.  
If "library" keyword is used, top level images can run with privileged.
# docker.trusted.local.image[s] and docker.privileged.local.image[s] are either 
Boolean flag to trust all local images or work as a list that white list 
certain local images.  (To be discussed in YARN-8955).


> Better handling of "docker.trusted.registries" in container-executor's 
> "trusted_image_check" function
> -
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: Docker
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if run DistrubutedShell with "tangzhankun/tensorflow"
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But run a DistrubutedShell job with "centos", "centos[:tagName]", "ubuntu" 
> and "ubuntu[:tagName]" fails:
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need better handling the above cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8927) Better handling of "docker.trusted.registries" in container-executor's "trusted_image_check" function

2018-11-04 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1667#comment-1667
 ] 

Eric Yang commented on YARN-8927:
-

{quote}But for point 1, "docker.trusted.registries" will be all about non-local 
repos. The problem is that it doesn't implement how to configure trust for 
top-level images like "centos[:tag]". Let's say that when the "library" keyword 
is configured, top-level image names are trusted.{quote}

In pseudo code, can we write:
{code}
if (image does not contain "/" and docker.trusted.registry has "library") {
  allowed = true;
} else {
  check image repository in docker.trusted.registry;
}
{code}
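
As a rough C sketch of that check (illustrative only; the function name and 
parameters here are invented and do not correspond to the actual 
container-executor code):

{code}
#include <string.h>

/* Returns 1 if the image is trusted, 0 otherwise.  trusted_registries holds
 * the comma-split values of docker.trusted.registries. */
static int image_is_trusted(const char *image,
                            char **trusted_registries, int num_registries) {
  const char *slash = strchr(image, '/');
  int i;
  if (slash == NULL) {
    /* Top-level image such as "centos:7": trusted only when the "library"
     * keyword appears in docker.trusted.registries. */
    for (i = 0; i < num_registries; i++) {
      if (strcmp(trusted_registries[i], "library") == 0) {
        return 1;
      }
    }
    return 0;
  }
  /* Otherwise compare the registry prefix (the text before the first '/')
   * against the configured trusted registries. */
  for (i = 0; i < num_registries; i++) {
    size_t len = strlen(trusted_registries[i]);
    if (len == (size_t) (slash - image) &&
        strncmp(trusted_registries[i], image, len) == 0) {
      return 1;
    }
  }
  return 0;
}
{code}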

We should never split the configuration by pull and run.  They are executed in 
the same flow; making a distinction between them could prevent the program from 
working and would confuse system admins.

{quote}For point 2, if we have a "docker.privileged.registries", does it mean 
the existing "docker.privileged-containers.enabled" will be useless? And for 
the mount stuff, how will we handle the relationship with the existing 
"docker.allowed.ro-mounts" and "docker.allowed.rw-mounts"? Should they also be 
deprecated?{quote}

Sorry, it is a typo.  I meant to say docker.privileged-containers.registries.

I am trying to allow the implementation to happen in the order of 1, 2 and 3 
without having to revisit the logic for 1 when 2 is being implemented.

> Better handling of "docker.trusted.registries" in container-executor's 
> "trusted_image_check" function
> -
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: Docker
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if run DistrubutedShell with "tangzhankun/tensorflow"
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But run a DistrubutedShell job with "centos", "centos[:tagName]", "ubuntu" 
> and "ubuntu[:tagName]" fails:
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need better handling the above cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8927) Better handling of "docker.trusted.registries" in container-executor's "trusted_image_check" function

2018-11-04 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674357#comment-16674357
 ] 

Zhankun Tang edited comment on YARN-8927 at 11/4/18 10:26 AM:
--

[~eyang], Thanks for the explanation. I'd like to help on these tasks. Feel 
free to assign them to me if you want. To make sure we are in the same context, 
I list my understanding and questions below. Please correct me if anything is 
wrong.

First of all, I thought the above configurations are effective for both "docker 
pull" and "docker run". YARN-3854's "docker pull" request to container-executor 
will be denied if it does not match the white-list, and the request to run a 
container will be denied by c-e if it does not match the white-list.

But for point 1, "_docker.trusted.registries_" will be all about non-local 
repos. The problem is that it doesn't implement how to configure trust for 
top-level images like "centos[:tag]". Let's say that when the "library" keyword 
is configured, top-level image names are trusted.

But this trust is more related to pull, I think. The image can be pulled, and 
when it is run, the local image check will be applied based on 
"_docker.trusted.local.image_". If so, I feel this configuration is only useful 
for pull? Maybe "docker.pull.trusted.registries" would be a better name?

Thinking more about this, can the configuration of white-listed registries be 
split into two categories, pull and run? (A sketch of what I mean follows 
below.)

"docker.pull.trusted.registries" configures where YARN can pull from.

"docker.run.trusted.registries" configures what can be run (images localized by 
YARN or by an admin).

"docker.run.privileged.registries" is a subset of the above 
"docker.run.trusted.registries".

 

For point 2, if we have a "_docker.privileged.registries_", does it mean the 
existing "_docker.privileged-containers.enabled_" will be useless? And for the 
mount stuff, how will we handle the relationship with the existing 
"docker.allowed.ro-mounts" and "docker.allowed.rw-mounts"? Should they also be 
deprecated?


was (Author: tangzhankun):
[~eyang], Thanks for the explanation. To make us in the same context. I list my 
understanding and questions as below. Please correct me if anything wrong.

First of all, I thought the above configurations are effective for both "docker 
pull" and "docker run". YARN-3854's request of "docker pull" to 
container-executor will be denied if not fit in white-list. The request of a 
running container will be denied by c-e if not fit in white-list.

But for point 1, "_docker.trusted.registries_" will be all about non-local 
repo. The problem is that doesn't implement how to configure the trust of 
top-level images like "centos[:tag]". Let's say, configured "library" keyword, 
top-level pattern image name is trusted.

But this trust is more related to pull I think. It can be pulled. When run, the 
local image check will be involved based on "_docker.trusted.local.image_". If 
so, I feel this configuration is only useful for pull? Maybe 
"docker.pull.trusted.registries" is proper?

Think more on this, can the configuration of white-listed registries be split 
into two categories which is pull and run?

"docker.pull.trusted.registries" configures where YARN can pull from.

"docker.run.trusted.registries" configure what can be run (image localized by 
YARN or admin).

"docker.run.privileged.registries" is a subset of above 
"docker.run.trusted.registries"

 

For point 2, if we have a "_docker.privileged.registries_", does it mean the 
existing "_docker.privileged-containers.enabled_" will be useless? And for the 
mount stuff, how will we handle the relationship with existing 
"docker.allowed.ro-mounts" and "docker.allowed.rw-mounts"? Also deprecated them?

> Better handling of "docker.trusted.registries" in container-executor's 
> "trusted_image_check" function
> -
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: Docker
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if run DistrubutedShell with "tangzhankun/tensorflow"
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But run a DistrubutedShell job with "centos", "centos[:tagName]", "ubuntu" 
> and "ubuntu[:tagName]" fails:
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need better handling the above cases.



--
This message was sent by Atlassian JIRA

[jira] [Comment Edited] (YARN-8927) Better handling of "docker.trusted.registries" in container-executor's "trusted_image_check" function

2018-11-04 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674357#comment-16674357
 ] 

Zhankun Tang edited comment on YARN-8927 at 11/4/18 10:15 AM:
--

[~eyang], Thanks for the explanation. To make sure we are in the same context, 
I list my understanding and questions below. Please correct me if anything is 
wrong.

First of all, I thought the above configurations are effective for both "docker 
pull" and "docker run". YARN-3854's "docker pull" request to container-executor 
will be denied if it does not match the white-list, and the request to run a 
container will be denied by c-e if it does not match the white-list.

But for point 1, "_docker.trusted.registries_" will be all about non-local 
repos. The problem is that it doesn't implement how to configure trust for 
top-level images like "centos[:tag]". Let's say that when the "library" keyword 
is configured, top-level image names are trusted.

But this trust is more related to pull, I think. The image can be pulled, and 
when it is run, the local image check will be applied based on 
"_docker.trusted.local.image_". If so, I feel this configuration is only useful 
for pull? Maybe "docker.pull.trusted.registries" would be a better name?

Thinking more about this, can the configuration of white-listed registries be 
split into two categories, pull and run?

"docker.pull.trusted.registries" configures where YARN can pull from.

"docker.run.trusted.registries" configures what can be run (images localized by 
YARN or by an admin).

"docker.run.privileged.registries" is a subset of the above 
"docker.run.trusted.registries".

 

For point 2, if we have a "_docker.privileged.registries_", does it mean the 
existing "_docker.privileged-containers.enabled_" will be useless? And for the 
mount stuff, how will we handle the relationship with the existing 
"docker.allowed.ro-mounts" and "docker.allowed.rw-mounts"? Should they also be 
deprecated?


was (Author: tangzhankun):
[~eyang], Thanks for the explanation. To make us in the same context. I list my 
understanding and questions as below. Please correct me if anything wrong.

First of all, I thought the above configurations are effective for both "docker 
pull" and "docker run". YARN-3854's request of "docker pull" to 
container-executor will be denied if not fit in white-list. The request of a 
running container will be denied by c-e if not fit in white-list.

But for point 1, "_docker.trusted.registries_" will be all about non-local 
repo. The problem is that doesn't implement how to configure the trust of 
top-level images like "centos[:tag]". Let's say, configured "library" keyword, 
top-level pattern image name is trusted.

But this trust is more related to pull I think. It can be pulled. When run, the 
local image check will be involved based on "_docker.trusted.local.image_". If 
so, I feel this configuration is only useful for pull? Maybe 
"docker.pull.trusted.registries" is proper?

Think more on this, can the configuration of white-listed registries be split 
into two categories which is pull and run?

"docker.pull.trusted.registries" configures where YARN can pull from.

"docker.run.trusted.registries" configure what can be run after image 
localization.

"docker.run.privileged.registries" is a subset of above 
"docker.run.trusted.registries"

 

For point 2, if we have a "_docker.privileged.registries_", does it mean the 
existing "_docker.privileged-containers.enabled_" will be useless? And for the 
mount stuff, how will we handle the relationship with existing 
"docker.allowed.ro-mounts" and "docker.allowed.rw-mounts"? Also deprecated them?

> Better handling of "docker.trusted.registries" in container-executor's 
> "trusted_image_check" function
> -
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: Docker
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if run DistrubutedShell with "tangzhankun/tensorflow"
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But run a DistrubutedShell job with "centos", "centos[:tagName]", "ubuntu" 
> and "ubuntu[:tagName]" fails:
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need better handling the above cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To 

[jira] [Comment Edited] (YARN-8927) Better handling of "docker.trusted.registries" in container-executor's "trusted_image_check" function

2018-11-04 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674357#comment-16674357
 ] 

Zhankun Tang edited comment on YARN-8927 at 11/4/18 10:12 AM:
--

[~eyang], Thanks for the explanation. To make sure we are in the same context, 
I list my understanding and questions below. Please correct me if anything is 
wrong.

First of all, I thought the above configurations are effective for both "docker 
pull" and "docker run". YARN-3854's "docker pull" request to container-executor 
will be denied if it does not match the white-list, and the request to run a 
container will be denied by c-e if it does not match the white-list.

But for point 1, "_docker.trusted.registries_" will be all about non-local 
repos. The problem is that it doesn't implement how to configure trust for 
top-level images like "centos[:tag]". Let's say that when the "library" keyword 
is configured, top-level image names are trusted.

But this trust is more related to pull, I think. The image can be pulled, and 
when it is run, the local image check will be applied based on 
"_docker.trusted.local.image_". If so, I feel this configuration is only useful 
for pull? Maybe "docker.pull.trusted.registries" would be a better name?

Thinking more about this, can the configuration of white-listed registries be 
split into two categories, pull and run?

"docker.pull.trusted.registries" configures where YARN can pull from.

"docker.run.trusted.registries" configures what can be run after image 
localization.

"docker.run.privileged.registries" is a subset of the above 
"docker.run.trusted.registries".

 

For point 2, if we have a "_docker.privileged.registries_", does it mean the 
existing "_docker.privileged-containers.enabled_" will be useless? And for the 
mount stuff, how will we handle the relationship with the existing 
"docker.allowed.ro-mounts" and "docker.allowed.rw-mounts"? Should they also be 
deprecated?


was (Author: tangzhankun):
[~eyang], Thanks for the explanation. To make us in the same context. I list my 
understanding and questions as below. Please correct me if anything wrong.

First of all, I thought the above configurations are effective for both "docker 
pull" and "docker run". YARN-3854's request of "docker pull" to 
container-executor will be denied if not fit in white-list. The request of a 
running container will be denied by c-e if not fit in white-list.

But for point 1, "_docker.trusted.registries_" will be all about non-local 
repo. The problem is that doesn't implement how to configure the trust of 
top-level images like "centos[:tag]". Let's say, configured "library" keyword, 
top-level pattern image name is trusted.

But this trust is more related to pull I think. It can be pulled. When run, the 
local image check will be involved based on "_docker.trusted.local.image_". If 
so, I feel this configuration is only useful for pull? Maybe 
"docker.pull.trusted.registries" is proper?

Think more on this, can the configuration of white-listed registries be split 
into two categories which is pull and run?

"docker.pull.trusted.registries" configures where YARN can pull from.

"docker.run.trusted.registries" configure what can be run after image 
localization.

"docker.run.privileged.registries" is a subset of above 
"docker.run.trusted.registries"

 

For point 2, if we have a "_docker.privileged.registries_", does it mean 
"_docker.privileged-containers.enabled_" will be useless? And for the mount 
stuff, how will we handle the relationship with existing 
"docker.allowed.ro-mounts" and "docker.allowed.rw-mounts"? Also deprecated them?

> Better handling of "docker.trusted.registries" in container-executor's 
> "trusted_image_check" function
> -
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: Docker
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if run DistrubutedShell with "tangzhankun/tensorflow"
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But run a DistrubutedShell job with "centos", "centos[:tagName]", "ubuntu" 
> and "ubuntu[:tagName]" fails:
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need better handling the above cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: 

[jira] [Comment Edited] (YARN-8927) Better handling of "docker.trusted.registries" in container-executor's "trusted_image_check" function

2018-11-04 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674357#comment-16674357
 ] 

Zhankun Tang edited comment on YARN-8927 at 11/4/18 10:10 AM:
--

[~eyang], Thanks for the explanation. To make sure we are in the same context, 
I list my understanding and questions below. Please correct me if anything is 
wrong.

First of all, I thought the above configurations are effective for both "docker 
pull" and "docker run". YARN-3854's "docker pull" request to container-executor 
will be denied if it does not match the white-list, and the request to run a 
container will be denied by c-e if it does not match the white-list.

But for point 1, "_docker.trusted.registries_" will be all about non-local 
repos. The problem is that it doesn't implement how to configure trust for 
top-level images like "centos[:tag]". Let's say that when the "library" keyword 
is configured, top-level image names are trusted.

But this trust is more related to pull, I think. The image can be pulled, and 
when it is run, the local image check will be applied based on 
"_docker.trusted.local.image_". If so, I feel this configuration is only useful 
for pull? Maybe "docker.pull.trusted.registries" would be a better name?

Thinking more about this, can the configuration of white-listed registries be 
split into two categories, pull and run?

"docker.pull.trusted.registries" configures where YARN can pull from.

"docker.run.trusted.registries" configures what can be run after image 
localization.

"docker.run.privileged.registries" is a subset of the above 
"docker.run.trusted.registries".

 

For point 2, if we have a "_docker.privileged.registries_", does it mean 
"_docker.privileged-containers.enabled_" will be useless? And for the mount 
stuff, how will we handle the relationship with the existing 
"docker.allowed.ro-mounts" and "docker.allowed.rw-mounts"? Should they also be 
deprecated?


was (Author: tangzhankun):
[~eyang], Thanks for the explanation. To make us in the same context. I list my 
understanding and questions as below. Please correct me if anything wrong.

First of all, I thought the above configurations are effective for both "docker 
pull" and "docker run". YARN-3854's request of "docker pull" to 
container-executor will be denied if not fit in white-list. The request of a 
running container will be denied by c-e if not fit in white-list.

But for point 1, "_docker.trusted.registries_" will be all about non-local 
repo. The problem is that doesn't implement how to configure the trust of 
top-level images like "centos[:tag]". Let's say, configured "library" keyword, 
top-level pattern image name is trusted.

But this trust is more related to pull I think. It can be pulled. When run, the 
local image check will be involved based on "_docker.trusted.local.image_". If 
so, I feel this configuration is only useful for pull? Maybe 
"docker.pull.trusted.registries" is proper?

 

For point 2, if we have a "_docker.privileged.registries_", does it mean 
"_docker.privileged-containers.enabled_" will be useless? And for the mount 
stuff, how will we handle the relationship with existing 
"docker.allowed.ro-mounts" and "docker.allowed.rw-mounts"? Also deprecated them?

> Better handling of "docker.trusted.registries" in container-executor's 
> "trusted_image_check" function
> -
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: Docker
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if run DistrubutedShell with "tangzhankun/tensorflow"
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But run a DistrubutedShell job with "centos", "centos[:tagName]", "ubuntu" 
> and "ubuntu[:tagName]" fails:
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need better handling the above cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8927) Better handling of "docker.trusted.registries" in container-executor's "trusted_image_check" function

2018-11-04 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674357#comment-16674357
 ] 

Zhankun Tang edited comment on YARN-8927 at 11/4/18 9:58 AM:
-

[~eyang], Thanks for the explanation. To make sure we are in the same context, 
I list my understanding and questions below. Please correct me if anything is 
wrong.

First of all, I thought the above configurations are effective for both "docker 
pull" and "docker run". YARN-3854's "docker pull" request to container-executor 
will be denied if it does not match the white-list, and the request to run a 
container will be denied by c-e if it does not match the white-list.

But for point 1, "_docker.trusted.registries_" will be all about non-local 
repos. The problem is that it doesn't implement how to configure trust for 
top-level images like "centos[:tag]". Let's say that when the "library" keyword 
is configured, top-level image names are trusted.

But this trust is more related to pull, I think. The image can be pulled, and 
when it is run, the local image check will be applied based on 
"_docker.trusted.local.image_". If so, I feel this configuration is only useful 
for pull? Maybe "docker.pull.trusted.registries" would be a better name?

 

For point 2, if we have a "_docker.privileged.registries_", does it mean 
"_docker.privileged-containers.enabled_" will be useless? And for the mount 
stuff, how will we handle the relationship with the existing 
"docker.allowed.ro-mounts" and "docker.allowed.rw-mounts"? Should they also be 
deprecated?


was (Author: tangzhankun):
[~eyang], Thanks for the explanation. To make us in the same context. I list my 
understanding and questions as below. Please correct me if anything wrong.

First of all, I understand the above configurations are effective for both 
"docker pull" and "docker run". YARN-3854's request of "docker pull" to 
container-executor will be denied if not fit in white-list. The request of a 
running container will be denied by c-e if not fit in white-list.

For point 1, "_docker.trusted.registries_" will be all about non-local repo. 
The current logic underneath "docker.trusted.registries" already support 
private trusted registries and docker hub. But it doesn't implement how to 
configure the trust of top-level images like "centos[:tag]". We only need to 
add a check related to "library" keyword in c-e. Configured "library" keyword, 
top-level pattern image name is trusted. It can be pulled. But when run, the 
local image check will be done based on "_docker.trusted.local.image_". Right?

If so, I feel this configuration is only useful for pull ? Maybe 
"docker.pull.trusted.registries" is also proper?

 

For point 2, if we have a "_docker.privileged.registries_", does it mean 
"_docker.privileged-containers.enabled_" will be useless? And for the mount 
stuff, how will we handle the relationship with existing 
"docker.allowed.ro-mounts" and "docker.allowed.rw-mounts"? Also deprecated them?

> Better handling of "docker.trusted.registries" in container-executor's 
> "trusted_image_check" function
> -
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: Docker
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if run DistrubutedShell with "tangzhankun/tensorflow"
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But run a DistrubutedShell job with "centos", "centos[:tagName]", "ubuntu" 
> and "ubuntu[:tagName]" fails:
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need better handling the above cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8927) Better handling of "docker.trusted.registries" in container-executor's "trusted_image_check" function

2018-11-04 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674357#comment-16674357
 ] 

Zhankun Tang edited comment on YARN-8927 at 11/4/18 9:54 AM:
-

[~eyang], Thanks for the explanation. To make sure we are in the same context, 
I list my understanding and questions below. Please correct me if anything is 
wrong.

First of all, I understand the above configurations are effective for both 
"docker pull" and "docker run". YARN-3854's "docker pull" request to 
container-executor will be denied if it does not match the white-list, and the 
request to run a container will be denied by c-e if it does not match the 
white-list.

For point 1, "_docker.trusted.registries_" will be all about non-local repos. 
The current logic underneath "docker.trusted.registries" already supports 
private trusted registries and Docker Hub, but it doesn't implement how to 
configure trust for top-level images like "centos[:tag]". We only need to add a 
check for the "library" keyword in c-e: when the "library" keyword is 
configured, top-level image names are trusted. Such an image can be pulled, but 
when it is run, the local image check will be done based on 
"_docker.trusted.local.image_". Right?

If so, I feel this configuration is only useful for pull? Maybe 
"docker.pull.trusted.registries" is also a proper name?

 

For point 2, if we have a "_docker.privileged.registries_", does it mean 
"_docker.privileged-containers.enabled_" will be useless? And for the mount 
stuff, how will we handle the relationship with the existing 
"docker.allowed.ro-mounts" and "docker.allowed.rw-mounts"? Should they also be 
deprecated?


was (Author: tangzhankun):
[~eyang], Thanks for the explanation. To make us in the same context. I list my 
understanding and questions as below. Please correct me if anything wrong.

First of all, I understand the above configurations are effective for both 
"docker pull" and "docker run". YARN-3854's request of "docker pull" to 
container-executor will be denied if not fit in white-list. The request of a 
running container will be denied by c-e if not fit in white-list.

For point 1, "_docker.trusted.registries_" will be all about non-local repo. 
The current logic underneath "docker.trusted.registries" already support 
private trusted registries and docker hub. But it doesn't implement how to 
configure the trust of top-level images like "centos[:tag]". We only need to 
add a check related to "library" keyword in c-e. Configured "library" keyword, 
top-level pattern image name is trusted. It can be pulled. But when run, the 
local image check will be done based on "_docker.trusted.local.image_". Right?

If so, I feel this configuration is only useful for pull ? Maybe 
"docker.pull.registries" is also proper?

 

For point 2, if we have a "_docker.privileged.registries_", does it mean 
"_docker.privileged-containers.enabled_" will be useless? And for the mount 
stuff, how will we handle the relationship with existing 
"docker.allowed.ro-mounts" and "docker.allowed.rw-mounts"? Also deprecated them?

> Better handling of "docker.trusted.registries" in container-executor's 
> "trusted_image_check" function
> -
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: Docker
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if run DistrubutedShell with "tangzhankun/tensorflow"
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But run a DistrubutedShell job with "centos", "centos[:tagName]", "ubuntu" 
> and "ubuntu[:tagName]" fails:
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need better handling the above cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8927) Better handling of "docker.trusted.registries" in container-executor's "trusted_image_check" function

2018-11-04 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674357#comment-16674357
 ] 

Zhankun Tang edited comment on YARN-8927 at 11/4/18 9:54 AM:
-

[~eyang], Thanks for the explanation. To make sure we are in the same context, 
I list my understanding and questions below. Please correct me if anything is 
wrong.

First of all, I understand the above configurations are effective for both 
"docker pull" and "docker run". YARN-3854's "docker pull" request to 
container-executor will be denied if it does not match the white-list, and the 
request to run a container will be denied by c-e if it does not match the 
white-list.

For point 1, "_docker.trusted.registries_" will be all about non-local repos. 
The current logic underneath "docker.trusted.registries" already supports 
private trusted registries and Docker Hub, but it doesn't implement how to 
configure trust for top-level images like "centos[:tag]". We only need to add a 
check for the "library" keyword in c-e: when the "library" keyword is 
configured, top-level image names are trusted. Such an image can be pulled, but 
when it is run, the local image check will be done based on 
"_docker.trusted.local.image_". Right?

If so, I feel this configuration is only useful for pull? Maybe 
"docker.pull.registries" is also a proper name?

 

For point 2, if we have a "_docker.privileged.registries_", does it mean 
"_docker.privileged-containers.enabled_" will be useless? And for the mount 
stuff, how will we handle the relationship with the existing 
"docker.allowed.ro-mounts" and "docker.allowed.rw-mounts"? Should they also be 
deprecated?


was (Author: tangzhankun):
[~eyang], Thanks for the explanation. To make us in the same context. I list my 
understanding and questions as below. Please correct me if anything wrong.

First of all, I understand the above configurations are effective for both 
"docker pull" and "docker run". YARN-3854's request of "docker pull" to 
container-executor will be denied if not fit in white-list. The request of a 
running container will be denied by c-e if not fit in white-list.

For point 1, "_docker.trusted.registries_" will be all about non-local repo. 
The current logic underneath "docker.trusted.registries" already support 
private trusted registries and docker hub. But it doesn't implement how to 
configure the trust of top-level images like "centos[:tag]". We only need to 
add a check related to "library" keyword in c-e. Configured "library" keyword, 
top-level pattern image name is trusted. It can be pulled. But when run, the 
local image check will be done based on "_docker.trusted.local.image_". Right?

If so, I feel this configuration is only useful for pull ? Maybe the name is 
not proper?

 

For point 2, if we have a "_docker.privileged.registries_", does it mean 
"_docker.privileged-containers.enabled_" will be useless? And for the mount 
stuff, how will we handle the relationship with existing 
"docker.allowed.ro-mounts" and "docker.allowed.rw-mounts"? Also deprecated them?

> Better handling of "docker.trusted.registries" in container-executor's 
> "trusted_image_check" function
> -
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: Docker
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if run DistrubutedShell with "tangzhankun/tensorflow"
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But run a DistrubutedShell job with "centos", "centos[:tagName]", "ubuntu" 
> and "ubuntu[:tagName]" fails:
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need better handling the above cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8927) Better handling of "docker.trusted.registries" in container-executor's "trusted_image_check" function

2018-11-04 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674357#comment-16674357
 ] 

Zhankun Tang edited comment on YARN-8927 at 11/4/18 9:51 AM:
-

[~eyang], Thanks for the explanation. To make sure we are in the same context, 
I list my understanding and questions below. Please correct me if anything is 
wrong.

First of all, I understand the above configurations are effective for both 
"docker pull" and "docker run". YARN-3854's "docker pull" request to 
container-executor will be denied if it does not match the white-list, and the 
request to run a container will be denied by c-e if it does not match the 
white-list.

For point 1, "_docker.trusted.registries_" will be all about non-local repos. 
The current logic underneath "docker.trusted.registries" already supports 
private trusted registries and Docker Hub, but it doesn't implement how to 
configure trust for top-level images like "centos[:tag]". We only need to add a 
check for the "library" keyword in c-e: when the "library" keyword is 
configured, top-level image names are trusted. Such an image can be pulled, but 
when it is run, the local image check will be done based on 
"_docker.trusted.local.image_". Right?

If so, I feel this configuration is only useful for pull? Maybe the name is not 
appropriate?

 

For point 2, if we have a "_docker.privileged.registries_", does it mean 
"_docker.privileged-containers.enabled_" will be useless? And for the mount 
stuff, how will we handle the relationship with the existing 
"docker.allowed.ro-mounts" and "docker.allowed.rw-mounts"? Should they also be 
deprecated?


was (Author: tangzhankun):
[~eyang], Thanks for the explanation. To make us in the same context. I list my 
understanding and questions as below. Please correct me if anything wrong.

First of all, I understand the above configurations are effective for both 
"docker pull" and "docker run". YARN-3854's request of "docker pull" to 
container-executor will be denied by if not fit in white-list. The request of a 
running container will be denied by c-e if not fit in white-list.

For point 1, "_docker.trusted.registries_" will be all about non-local repo. 
The current logic underneath "docker.trusted.registries" already support 
private trusted registries and docker hub. But it doesn't implement how to 
configure the trust of top-level images like "centos[:tag]". We only need to 
add a check related to "library" keyword in c-e. Configured "library" keyword, 
top-level pattern image name is trusted. It can be pulled. But when run, the 
local image check will be done based on "_docker.trusted.local.image_". Right?

If so, I feel this configuration is only useful for pull ? Maybe the name is 
not proper?

 

For point 2, if we have a "_docker.privileged.registries_", does it mean 
"_docker.privileged-containers.enabled_" will be useless? And for the mount 
stuff, how will we handle the relationship with existing 
"docker.allowed.ro-mounts" and "docker.allowed.rw-mounts"? Also deprecated them?

> Better handling of "docker.trusted.registries" in container-executor's 
> "trusted_image_check" function
> -
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: Docker
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if run DistrubutedShell with "tangzhankun/tensorflow"
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But run a DistrubutedShell job with "centos", "centos[:tagName]", "ubuntu" 
> and "ubuntu[:tagName]" fails:
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need better handling the above cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8927) Better handling of "docker.trusted.registries" in container-executor's "trusted_image_check" function

2018-11-04 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674357#comment-16674357
 ] 

Zhankun Tang commented on YARN-8927:


[~eyang], Thanks for the explanation. To make sure we are in the same context, 
I list my understanding and questions below. Please correct me if anything is 
wrong.

First of all, I understand the above configurations are effective for both 
"docker pull" and "docker run". YARN-3854's "docker pull" request to 
container-executor will be denied if it does not match the white-list, and the 
request to run a container will be denied by c-e if it does not match the 
white-list.

For point 1, "_docker.trusted.registries_" will be all about non-local repos. 
The current logic underneath "docker.trusted.registries" already supports 
private trusted registries and Docker Hub, but it doesn't implement how to 
configure trust for top-level images like "centos[:tag]". We only need to add a 
check for the "library" keyword in c-e: when the "library" keyword is 
configured, top-level image names are trusted. Such an image can be pulled, but 
when it is run, the local image check will be done based on 
"_docker.trusted.local.image_". Right?

If so, I feel this configuration is only useful for pull? Maybe the name is not 
appropriate?

 

For point 2, if we have a "_docker.privileged.registries_", does it mean 
"_docker.privileged-containers.enabled_" will be useless? And for the mount 
stuff, how will we handle the relationship with the existing 
"docker.allowed.ro-mounts" and "docker.allowed.rw-mounts"? Should they also be 
deprecated?

> Better handling of "docker.trusted.registries" in container-executor's 
> "trusted_image_check" function
> -
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: Docker
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if run DistrubutedShell with "tangzhankun/tensorflow"
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But run a DistrubutedShell job with "centos", "centos[:tagName]", "ubuntu" 
> and "ubuntu[:tagName]" fails:
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need better handling the above cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org