[ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16475453#comment-16475453
 ] 

Szilard Nemeth commented on YARN-4599:
--------------------------------------

Thanks [~miklos.szeg...@cloudera.com] for the patch.

Couple of comments, some of them are minor, some of them could be important:

 

1. I think you have a typo for the property: 
yarn.nodemanager.elastic-memory-control.enabled in yarn-default.xml 
"if all the containers exceeded the a limit." --> should be changed to: 
"...exceeded the limit." (without an "a")

2. I think time unit should be provided for the property 
yarn.nodemanager.elastic-memory-control.timeout in yarn-default.xml.
>From the description, I guess the value 5 means seconds from the code, but 
>only from the description it's ambiguous


3. {{CGroupElasticMemoryController}}:

A.) In constructor: I would create a separate method for create and return the 
OOM handler instance as the constructor is very crowded.
As I see it, the {{oomHandlerClass}} and the instantiation logic both could be 
extracted to the method and then you don't need the {{oomHandlerTemp}} variable 
at all.

B.) In constructor: 
I think it's cleaner to use YarnConfiguration constants in string messages (log 
messages, exception message).
E.g. {{LOG.warn("yarn.nodemanager.elastic-memory-control.enabled}}... --> You 
could use {{YarnConfiguration.NM_ELASTIC_MEMORY_CONTROL_ENABLED}} instead.
If you search for "yarn.nodemanager." in this class it will give you all the 
results I'm talking about.

C.) [Question] In constructor: If both {{controlPhysicalMemory}} and 
{{controlVirtualMemory}} is on, you only warn log a line.
How the NM will work in these conditions, in other words, is it enough to just 
log a warning and continue NM's execution?

D.) In {{getOOMListenerExecutablePath()}}: You print OOM listener's path on 
debug level with: 
{code:java}
conf.get(
 YarnConfiguration.NM_ELASTIC_MEMORY_CONTROL_OOM_LISTENER_PATH,
 defaultPath){code}

I think you should extract this to a final local variable as you are returning 
the same value.

E.) In run(): On the first line of this method, you have a typo in 
{{oomNotResoved}} --> should be {{oomNotResolved}}.

F.) In run(): You have a TODO in this method. Do you have any plans to resolve 
it or it will be resolved later?

G.) In run(): I'm curious whether it can happen that in the while loop's 
statement, {{events.read}} would read more data than 8 bytes or is it perfectly 
safe to rely on that on every read, at most 8 bytes will be read?

H.) In run(): Since can harldy fit on one screen, I would extract the 
substantive part of the run() method to a new method, i.e. from the first 
{{executor.submit()}} call (errorListener) to the last statement until the 
catch.

I.) In run(): {{throw new YarnRuntimeException("OOM was not resolved in 
time.");}}} --> I would include how much time was spent in the log message for 
better troubleshooting.

J.) Javadoc of {{watchAndLogOOMState}} return is wrong, it returns a boolean 
and not an empty string.

K.) In {{setCGroupParameters()}}: Maybe in the {{ResourceHandlerException}}'s 
message would be more understandable with whole words, i.e. instead of p/v 
abbreviations, I would use physical/virtual.

L.) In {{watchAndLogOOMState()}}: Please extract {{"under_oom 1"}} to a static 
final String. This string is also used in {{DefaultOOMHandler}} so you should 
use the same constant.


4. {{DefaultOOMHandler}}

A.) There are several TODOs in this class, what is the plan with those?
A.) Constructor could be package-private
B.) (minor) Javadoc of run() has a typo, should be: "...exceeded its limits" 
instead of "it's limits". There are several other typos like this in this class.
B.) In run(): Call to

 
{code:java}
ArrayList<Container> containers = new ArrayList<>();
 containers.addAll(context.getContainers().values());
{code}
could be this one liner: 
{code:java}
ArrayList<Container> containers = new 
ArrayList<>(context.getContainers().values());{code}
C.) In {{killContainerIfOOM()}}: 
This line specifically:
{code:java}
long request = container.getResource().getMemorySize() * 1024 * 1024;{code}

According to javadoc of 
{{org.apache.hadoop.yarn.api.records.Resource#getMemorySize}}, this will return 
MB of memory size as a default and it depends on the unit of the resource.
You multiply it by 1024 two times, so I guess you are relying on that memory 
size would be bytes instead. 
*This one seems a possible unit conversion issue for me.*

5.) {{TestCGroupElasticMemoryController}}:

A.) {{testConstructorHandler()}}: It's not clear for me why do you expect 
{{YarnException}} with this test.
Moreover, {{YarnException}} could be thrown if the {{CGroupMemoryController}}'s 
constructor could not create an instance of the OOM handler which could hide a 
code error. I would throw a special exception from {{DummyRunnable}}'s 
constructor and assert on that exception from the testcase in order to be sure 
that the OOM handler's constructor is invoked without an issue.

> Set OOM control for memory cgroups
> ----------------------------------
>
>                 Key: YARN-4599
>                 URL: https://issues.apache.org/jira/browse/YARN-4599
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.9.0
>            Reporter: Karthik Kambatla
>            Assignee: Miklos Szegedi
>            Priority: Major
>              Labels: oct16-medium
>         Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, 
> YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, 
> YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, 
> YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to