[ 
https://issues.apache.org/jira/browse/YARN-9073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9073:
-------------------------------
    Description: 
The current GPU/FPGA handling has an issue when c-e.cfg
(container-executor.cfg) does not align with yarn-site.xml. Take GPU as an
example:

One host has GPUs 1,2,3,4,5,6, and "GPU.allowed = 1,2,3" is configured in
c-e.cfg, but yarn-site.xml is configured with "auto", which means GPUs
1,2,3,4,5,6 are all allowed.

An application requests 4 GPUs and the scheduler allocates 1,2,4,5, so
--excluded-gpus is "3". c-e checks that 3 is in its allowed list (1,2,3) and
then denies only 3 in cgroups.

In this case, c-e's allowed list (1,2,3) has no effect, because the
application can still access 4 and 5.
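
The check described above can be sketched roughly as follows. This is
illustrative Python only, not the actual container-executor code (which is
written in C); the function names denied_in_cgroups and leaked_devices are
hypothetical, and the sketch assumes c-e denies an excluded device only when
that device appears in its own allowed list, as the description states.

```python
def denied_in_cgroups(excluded_gpus, ce_allowed):
    """Excluded devices that c-e denies: only those found in its allowed list."""
    allowed = set(ce_allowed)
    return sorted(d for d in excluded_gpus if d in allowed)


def leaked_devices(allocated, excluded_gpus, ce_allowed):
    """Allocated devices that stay accessible although c-e.cfg does not allow them."""
    denied = set(denied_in_cgroups(excluded_gpus, ce_allowed))
    accessible = set(allocated) - denied
    return sorted(accessible - set(ce_allowed))


ce_allowed = [1, 2, 3]     # "GPU.allowed = 1,2,3" in c-e.cfg
allocated = [1, 2, 4, 5]   # scheduler allocation under yarn-site.xml "auto"
excluded_gpus = [3]        # value passed as --excluded-gpus

print(denied_in_cgroups(excluded_gpus, ce_allowed))          # [3]
print(leaked_devices(allocated, excluded_gpus, ce_allowed))  # [4, 5]
```

Under these assumptions only GPU 3 is denied, so GPUs 4 and 5 remain
reachable even though c-e.cfg never allowed them.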

  was:
The current GPU/FPGA handling has an issue when c-e.cfg
(container-executor.cfg) does not align with yarn-site.xml. Take GPU as an
example:

One host has GPUs 1,2,3,4,5, and "GPU.allowed = 1,2,3" is configured in
c-e.cfg, but yarn-site.xml is configured with "auto", which means GPUs
1,2,3,4,5 are all allowed.

An application requests 4 GPUs and the scheduler allocates 1,2,4,5, so
--excluded-gpus is "3". c-e checks that 3 is in its allowed list (1,2,3) and
then denies only 3 in cgroups.

In this case, c-e's allowed list (1,2,3) has no effect, because the
application can still access 4 and 5.


> GPU/FPGA whitelist configuration in container-executor.cfg won't work when 
> yarn-site.xml's allowed devices doesn't align with it
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9073
>                 URL: https://issues.apache.org/jira/browse/YARN-9073
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Zhankun Tang
>            Assignee: Zhankun Tang
>            Priority: Major
>
> The current GPU/FPGA handling has an issue when c-e.cfg
> (container-executor.cfg) does not align with yarn-site.xml. Take GPU as an
> example:
> One host has GPUs 1,2,3,4,5,6, and "GPU.allowed = 1,2,3" is configured in
> c-e.cfg, but yarn-site.xml is configured with "auto", which means GPUs
> 1,2,3,4,5,6 are all allowed.
> An application requests 4 GPUs and the scheduler allocates 1,2,4,5, so
> --excluded-gpus is "3". c-e checks that 3 is in its allowed list (1,2,3)
> and then denies only 3 in cgroups.
> In this case, c-e's allowed list (1,2,3) has no effect, because the
> application can still access 4 and 5.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
