Yarn containers creating child process

2017-02-10 Thread Sandesh Hegde
Hi,

What features are available to prevent YARN containers from spawning
child processes?

Thanks


Re: HDFS Shell tool

2017-02-10 Thread Ravi Prakash
Hi Vity!

Please let me reiterate that I think it's great work, and I'm glad you
thought of sharing it with the community. Thanks a lot.

I can think of a few reasons for using WebHDFS, although if these are not
important to you, it may not be worth the effort:
1. You can point to an HttpFS gateway in case you do not have network
access to the datanodes.
2. WebHDFS is a lot more likely to be compatible with different versions of
Hadoop (https://github.com/avast/hdfs-shell/blob/master/build.gradle#L80),
although the community is trying hard to maintain forward compatibility
for the FileSystem API too.
3. You may be able to avoid linking the many jars that hadoop-client
would pull in.

Having said that, there may well be reasons not to use WebHDFS.

Thanks again!
Ravi


On Fri, Feb 10, 2017 at 12:38 AM, Vitásek, Ladislav 
wrote:

> Hello Ravi,
> I am glad you like it.
> Why should I use WebHDFS? Our cluster sysops, myself included, prefer the
> command line. :-)
>
> -Vity
>
> 2017-02-09 22:21 GMT+01:00 Ravi Prakash :
>
>> Great job Vity!
>>
>> Thanks a lot for sharing. Have you thought about using WebHDFS?
>>
>> Thanks
>> Ravi
>>
>> On Thu, Feb 9, 2017 at 7:12 AM, Vitásek, Ladislav 
>> wrote:
>>
>>> Hello Hadoop fans,
>>> I would like to inform you about our tool we want to share.
>>>
>>> We created a new utility - HDFS Shell - to work with HDFS faster.
>>>
>>> https://github.com/avast/hdfs-shell
>>>
>>> *Feature highlights*
>>> - The hdfs dfs command starts a new JVM for each call; HDFS Shell starts
>>> it only once - a great speed enhancement when you need to work
>>> with HDFS often
>>> - Commands can be used in a short form - e.g. *hdfs dfs -ls /* and *ls /*
>>> both work
>>> - *HDFS path completion using TAB key*
>>> - you can easily add any other HDFS manipulation function
>>> - there is a command history persisting in history log
>>> (~/.hdfs-shell/hdfs-shell.log)
>>> - support for relative directory + commands *cd* and *pwd*
>>> - it can be also launched as a daemon (using UNIX domain sockets)
>>> - 100% Java, it's open source
>>>
>>> Your suggestions are welcome.
>>>
>>> -L. Vitasek aka Vity
>>>
>>>
>>
>


YARN and memory management with CapacityScheduler and with cgroups

2017-02-10 Thread Marco B.
Hello everyone,

I am trying to figure out how to share among users a Hadoop cluster which
is primarily going to be used with Spark and YARN. I have seen that YARN
2.7 provides a scheduler called CapacityScheduler which would help with
multi-tenancy. What is not fully clear to me is how resource management is
handled by the NodeManager. I have read a lot of documents, and also the
book Hadoop: The Definitive Guide (4th), and still what is not clear to me
is how I can achieve a sort of "soft" (or even hard, whenever possible)
isolation between containers.

To quote the book (p. 106):

> In normal operation, the Capacity Scheduler does not preempt containers
by forcibly killing them, so if a queue is under capacity due to lack of
demand, and then demand increases, the queue will only return to capacity
as resources are released from other queues as containers complete. It is
possible to mitigate this by configuring queues with a maximum capacity so
that they don't eat into other queues' capacities too much. This is at the
cost of queue elasticity, of course, so a reasonable trade-off should be
found by trial and error.
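To make the trade-off in that passage concrete, here is a
capacity-scheduler.xml sketch (queue names and percentages are made up for
illustration, not taken from this thread): two queues are each guaranteed
50% of the cluster, and each may elastically grow to at most 80%, so
neither can starve the other completely:

```xml
<!-- capacity-scheduler.xml (illustrative values only) -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>a,b</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.a.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.a.maximum-capacity</name>
  <value>80</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.b.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.b.maximum-capacity</name>
  <value>80</value>
</property>
```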

If I have an arbitrary number of queues, what happens if I set only the
following values (and not the maximum-capacity property that controls
elasticity)?
yarn.scheduler.capacity.abc.maximum-allocation-mb = 2048
yarn.nodemanager.resource.memory-mb = 1048
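For reference, those two settings live in different files; a sketch with
the values quoted above (queue path "abc" as in the example):

```xml
<!-- capacity-scheduler.xml: cap on a single allocation in that queue -->
<property>
  <name>yarn.scheduler.capacity.abc.maximum-allocation-mb</name>
  <value>2048</value>
</property>

<!-- yarn-site.xml: total memory the NodeManager offers to containers -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>1048</value>
</property>
```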

Considering that Spark may ask for an arbitrary amount of RAM per executor
(e.g., 768mb), and that each task may take additional memory at runtime
(besides overhead, maybe memory spikes?), can it happen that one container
takes much more memory than specified in the settings above? Is this
container going to prevent resources from being allocated to other
containers in other queues (or same queue as well)? As far as I know, YARN
will eventually kill the container if it uses more RAM for too long -
can I set a timeout, like "after 15 seconds of over-use, kill it"? And
finally, if I set a hard limit like the one above, is YARN still going to
provide elasticity?
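On the question of how much memory a Spark executor's container actually
occupies, a rough sketch of the arithmetic (assumptions, not from this
thread: Spark's default spark.yarn.executor.memoryOverhead in the 1.x/2.x
line is max(384 MB, 10% of executor memory), and YARN rounds each request
up to a multiple of yarn.scheduler.minimum-allocation-mb, default 1024 MB):

```python
import math

def container_size_mb(executor_mb, min_alloc_mb=1024,
                      overhead_factor=0.10, min_overhead_mb=384):
    """Estimate the YARN container size for a Spark executor request."""
    # Spark adds a memory overhead on top of spark.executor.memory.
    overhead = max(min_overhead_mb, int(executor_mb * overhead_factor))
    requested = executor_mb + overhead
    # YARN rounds the request up to a multiple of the minimum allocation.
    return math.ceil(requested / min_alloc_mb) * min_alloc_mb

# The 768 MB executor from the question: 768 + 384 = 1152 MB requested,
# which YARN rounds up to a 2048 MB container.
print(container_size_mb(768))
```

So a "768 MB" executor can legitimately occupy a 2 GB container before any
misbehavior, which is worth keeping in mind when reasoning about limits.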

I was even considering using cgroups to enforce such hard limits, and I
have found out that they won't be included until version 2.9 (
https://issues.apache.org/jira/browse/YARN-1856), although from my
understanding that JIRA issue is primarily focused on cgroups monitoring
(and not really on enforcement), but I may be wrong about this. (As far as I
know, cgroups only enforce vcore limits in v. 2.7.1; we would like the same
for memory, so that users don't use more than allowed.)
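For reference, the cgroups pieces that do exist in the 2.7 line look like
the sketch below (property names as in the Hadoop 2.7 documentation -
verify against your version; as noted above, this enforces CPU, not
memory). One useful side effect: child processes forked by a container
stay in the container's cgroup, so the limits cover them too.

```xml
<!-- yarn-site.xml: use the LinuxContainerExecutor with cgroups -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CGroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/hadoop-yarn</value>
</property>
```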

Could you please help me understand how it works?

Thanks in advance.

Kind regards,
Marco


Re: HDFS Shell tool

2017-02-10 Thread Vitásek, Ladislav
Hello Ravi,
I am glad you like it.
Why should I use WebHDFS? Our cluster sysops, myself included, prefer the
command line. :-)

-Vity
