[ https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16475453#comment-16475453 ]
Szilard Nemeth commented on YARN-4599: -------------------------------------- Thanks [~miklos.szeg...@cloudera.com] for the patch. Couple of comments, some of them are minor, some of them could be important: 1. I think you have a typo for the property: yarn.nodemanager.elastic-memory-control.enabled in yarn-default.xml "if all the containers exceeded the a limit." --> should be changed to: "...exceeded the limit." (without an "a") 2. I think time unit should be provided for the property yarn.nodemanager.elastic-memory-control.timeout in yarn-default.xml. >From the description, I guess the value 5 means seconds from the code, but >only from the description it's ambiguous 3. {{CGroupElasticMemoryController}}: A.) In constructor: I would create a separate method for create and return the OOM handler instance as the constructor is very crowded. As I see it, the {{oomHandlerClass}} and the instantiation logic both could be extracted to the method and then you don't need the {{oomHandlerTemp}} variable at all. B.) In constructor: I think it's cleaner to use YarnConfiguration constants in string messages (log messages, exception message). E.g. {{LOG.warn("yarn.nodemanager.elastic-memory-control.enabled}}... --> You could use {{YarnConfiguration.NM_ELASTIC_MEMORY_CONTROL_ENABLED}} instead. If you search for "yarn.nodemanager." in this class it will give you all the results I'm talking about. C.) [Question] In constructor: If both {{controlPhysicalMemory}} and {{controlVirtualMemory}} is on, you only warn log a line. How the NM will work in these conditions, in other words, is it enough to just log a warning and continue NM's execution? D.) In {{getOOMListenerExecutablePath()}}: You print OOM listener's path on debug level with: {code:java} conf.get( YarnConfiguration.NM_ELASTIC_MEMORY_CONTROL_OOM_LISTENER_PATH, defaultPath){code} I think you should extract this to a final local variable as you are returning the same value. E.) In run(): On the first line of this method, you have a typo in {{oomNotResoved}} --> should be {{oomNotResolved}}. F.) In run(): You have a TODO in this method. Do you have any plans to resolve it or it will be resolved later? G.) In run(): I'm curious whether it can happen that in the while loop's statement, {{events.read}} would read more data than 8 bytes or is it perfectly safe to rely on that on every read, at most 8 bytes will be read? H.) In run(): Since can harldy fit on one screen, I would extract the substantive part of the run() method to a new method, i.e. from the first {{executor.submit()}} call (errorListener) to the last statement until the catch. I.) In run(): {{throw new YarnRuntimeException("OOM was not resolved in time.");}}} --> I would include how much time was spent in the log message for better troubleshooting. J.) Javadoc of {{watchAndLogOOMState}} return is wrong, it returns a boolean and not an empty string. K.) In {{setCGroupParameters()}}: Maybe in the {{ResourceHandlerException}}'s message would be more understandable with whole words, i.e. instead of p/v abbreviations, I would use physical/virtual. L.) In {{watchAndLogOOMState()}}: Please extract {{"under_oom 1"}} to a static final String. This string is also used in {{DefaultOOMHandler}} so you should use the same constant. 4. {{DefaultOOMHandler}} A.) There are several TODOs in this class, what is the plan with those? A.) Constructor could be package-private B.) (minor) Javadoc of run() has a typo, should be: "...exceeded its limits" instead of "it's limits". There are several other typos like this in this class. B.) In run(): Call to {code:java} ArrayList<Container> containers = new ArrayList<>(); containers.addAll(context.getContainers().values()); {code} could be this one liner: {code:java} ArrayList<Container> containers = new ArrayList<>(context.getContainers().values());{code} C.) In {{killContainerIfOOM()}}: This line specifically: {code:java} long request = container.getResource().getMemorySize() * 1024 * 1024;{code} According to javadoc of {{org.apache.hadoop.yarn.api.records.Resource#getMemorySize}}, this will return MB of memory size as a default and it depends on the unit of the resource. You multiply it by 1024 two times, so I guess you are relying on that memory size would be bytes instead. *This one seems a possible unit conversion issue for me.* 5.) {{TestCGroupElasticMemoryController}}: A.) {{testConstructorHandler()}}: It's not clear for me why do you expect {{YarnException}} with this test. Moreover, {{YarnException}} could be thrown if the {{CGroupMemoryController}}'s constructor could not create an instance of the OOM handler which could hide a code error. I would throw a special exception from {{DummyRunnable}}'s constructor and assert on that exception from the testcase in order to be sure that the OOM handler's constructor is invoked without an issue. > Set OOM control for memory cgroups > ---------------------------------- > > Key: YARN-4599 > URL: https://issues.apache.org/jira/browse/YARN-4599 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager > Affects Versions: 2.9.0 > Reporter: Karthik Kambatla > Assignee: Miklos Szegedi > Priority: Major > Labels: oct16-medium > Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, > YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, > YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, > YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch > > > YARN-1856 adds memory cgroups enforcing support. We should also explicitly > set OOM control so that containers are not killed as soon as they go over > their usage. Today, one could set the swappiness to control this, but > clusters with swap turned off exist. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org