YARN containers creating child processes
Hi, what features are available to limit YARN containers from creating child processes? Thanks
Re: HDFS Shell tool
Hi Vity!

Please let me reiterate that I think it's great work, and I'm glad you thought of sharing it with the community. Thanks a lot. I can think of a few reasons for using WebHDFS, although if these are not important to you, it may not be worth the effort:

1. You can point to an HttpFS gateway in case you do not have network access to the datanodes.
2. WebHDFS is a lot more likely to be compatible with different versions of Hadoop (https://github.com/avast/hdfs-shell/blob/master/build.gradle#L80), although the community is trying really hard to maintain compatibility going forward for FileSystem too.
3. You may be able to avoid linking many of the jars that hadoop-client would pull in.

Having said that, there may well be reasons why you don't want to use WebHDFS. Thanks again!

Ravi

On Fri, Feb 10, 2017 at 12:38 AM, Vitásek, Ladislav wrote:
> Hello Ravi,
> I am glad you like it.
> Why should I use WebHDFS? Our cluster sysops, me included, prefer the
> command line. :-)
>
> -Vity
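[For readers following this thread: the WebHDFS REST API Ravi mentions exposes HDFS operations over plain HTTP, so no Hadoop client jars are needed. A minimal sketch of the URL shape — the host name is a placeholder, and 50070 was the default NameNode HTTP port in the Hadoop 2.x line discussed here (it became 9870 in Hadoop 3):]

```shell
# Build a WebHDFS LISTSTATUS URL, the REST equivalent of `hdfs dfs -ls /`.
# HOST and PORT are placeholders for your NameNode's HTTP address.
HOST="namenode.example.com"
PORT=50070
HDFS_PATH="/"
URL="http://${HOST}:${PORT}/webhdfs/v1${HDFS_PATH}?op=LISTSTATUS"
echo "$URL"
# To actually issue the request against a live cluster: curl -i "$URL"
```

[Pointing HOST at an HttpFS gateway instead of the NameNode gives the same URL shape, which is Ravi's point 1 above.]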
YARN and memory management with CapacityScheduler and cgroups
Hello everyone,

I am trying to figure out how to share a Hadoop cluster among users; it is primarily going to be used with Spark on YARN. I have seen that YARN 2.7 provides a scheduler called CapacityScheduler, which would help with multi-tenancy. What is not fully clear to me is how resource management is handled by the NodeManager. I have read a lot of documents, including the book Hadoop: The Definitive Guide (4th edition), and what is still not clear to me is how I can achieve a sort of "soft" (or even hard, where possible) isolation between containers. To quote the book (p. 106):

> In normal operation, the Capacity Scheduler does not preempt containers by forcibly killing them, so if a queue is under capacity due to lack of demand, and then demand increases, the queue will only return to capacity as resources are released from other queues as containers complete. It is possible to mitigate this by configuring queues with a maximum capacity so that they don't eat into other queues' capacities too much. This is at the cost of queue elasticity, of course, so a reasonable trade-off should be found by trial and error.

If I have an arbitrary number of queues, what happens if I set only the following values (and not the maximum-capacity property that controls elasticity)?

yarn.scheduler.capacity.abc.maximum-allocation-mb = 2048
yarn.nodemanager.resource.memory-mb = 1048

Considering that Spark may ask for an arbitrary amount of RAM per executor (e.g., 768 MB), and that each task may take additional memory at runtime (besides overhead; memory spikes, perhaps?), can one container end up taking much more memory than specified in the settings above? Would such a container prevent resources from being allocated to other containers in other queues (or in the same queue)? As far as I know, YARN will eventually kill a container if it uses more RAM than allowed for too long; can I set a timeout for this, like "after 15 seconds of over-use, kill it"?

And finally, if I set a hard limit like the one above, is YARN still going to provide elasticity?

I was also considering using cgroups to enforce such hard limits, and I have found that they won't be included until version 2.9 (https://issues.apache.org/jira/browse/YARN-1856), although from my understanding that JIRA issue is primarily focused on cgroups monitoring (not really enforcing), but I may be wrong about this. (As far as I know, cgroups only enforce vcore limits in 2.7.1, which is something we would like to have for memory as well, so that users don't use more than allowed.)

Could you please help me understand how it works? Thanks in advance.

Kind regards,
Marco
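[For readers following this thread: the per-queue properties Marco asks about live in capacity-scheduler.xml, keyed by the full queue path under root. A sketch only — the queue name "abc" and all values are illustrative, taken from his question, not a recommended configuration:]

```xml
<!-- capacity-scheduler.xml: illustrative sketch only -->
<configuration>
  <!-- Guaranteed share: queue "abc" gets 50% of cluster resources -->
  <property>
    <name>yarn.scheduler.capacity.root.abc.capacity</name>
    <value>50</value>
  </property>
  <!-- Elasticity cap: "abc" may never grow beyond 80% of the cluster;
       this is the maximum-capacity trade-off the book quote describes -->
  <property>
    <name>yarn.scheduler.capacity.root.abc.maximum-capacity</name>
    <value>80</value>
  </property>
  <!-- Largest single container the queue will grant, in MB -->
  <property>
    <name>yarn.scheduler.capacity.root.abc.maximum-allocation-mb</name>
    <value>2048</value>
  </property>
</configuration>
```

[The other property Marco cites, yarn.nodemanager.resource.memory-mb, belongs in yarn-site.xml instead: it is a per-NodeManager setting for the total memory the node offers to containers, not a scheduler queue setting.]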
Re: HDFS Shell tool
Hello Ravi,

I am glad you like it. Why should I use WebHDFS? Our cluster sysops, me included, prefer the command line. :-)

-Vity

2017-02-09 22:21 GMT+01:00 Ravi Prakash:
> Great job Vity!
>
> Thanks a lot for sharing. Have you thought about using WebHDFS?
>
> Thanks
> Ravi
>
> On Thu, Feb 9, 2017 at 7:12 AM, Vitásek, Ladislav wrote:
>
>> Hello Hadoop fans,
>> I would like to inform you about a tool we want to share.
>>
>> We created a new utility - HDFS Shell - to work with HDFS faster.
>>
>> https://github.com/avast/hdfs-shell
>>
>> *Feature highlights*
>> - the hdfs dfs command initiates a JVM for each command call; HDFS Shell
>> does it only once, which means a great speed enhancement when you need
>> to work with HDFS often
>> - commands can be used in a short way - e.g. *hdfs dfs -ls /* and
>> *ls /* both work
>> - *HDFS path completion using the TAB key*
>> - you can easily add any other HDFS manipulation function
>> - command history is persisted in a history log
>> (~/.hdfs-shell/hdfs-shell.log)
>> - support for relative directories + the commands *cd* and *pwd*
>> - it can also be launched as a daemon (using UNIX domain sockets)
>> - 100% Java, open source
>>
>> Your suggestions are welcome.
>>
>> -L. Vitasek aka Vity