[ 
https://issues.apache.org/jira/browse/YARN-7468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246626#comment-16246626
 ] 

Clay B. commented on YARN-7468:
-------------------------------

For the driving use case, I run secure clusters (secured on the inside to keep 
data from leaking back out); think of them as a drop box where users can build 
models with restricted data. (Or, my favorite analogy is a 
[glovebox|https://en.wikipedia.org/wiki/File:Vacuum_Dry_Box.jpg] -- things can 
go in, but once in they may be tainted and can't come out except by very 
special decontamination.)

As such, I need the cluster to be reachable, network-wise, to and from the 
local HDFS instances, HBase, databases, etc. Yet only users permissioned for 
data-ingest jobs should be able to reach out and pull data. We can vet, for 
example, Oozie jobs to ensure they do only what we expect, but how do we keep a 
user from reaching out to that same HBase or HDFS (when they otherwise have 
access) and storing data there (or how do we allow a user to push reports to a 
simple service)?

Ideally, I'd have all the external endpoints secured so that this cluster 
cannot talk back to them except through very fine-grained allowances -- but 
it's a big world and I can't. So, I'd like a way to set up firewall-rule 
equivalents on the secure cluster with some help from YARN. The process I have 
in mind looks like the following workflow:

1A. We would set up iptables rules statically beforehand to ensure traffic for 
the various YARN-agreed-upon cgroup contexts, bridge devices or network 
namespaces could only flow where we want; we'd do this via out-of-band 
configuration management -- no need for YARN to do this setup.
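
For illustration, a couple of such pre-seeded rules might look like the sketch 
below; the classids, the 10.0.0.0/8 cluster range and the veth naming 
convention are assumptions on my part, not anything YARN defines today:

{noformat}
# Catch-all for a YARN-agreed net_cls classid: nothing off-cluster unless a
# more specific ACCEPT rule (as in 2A-2C below) sits above this one.
iptables -A OUTPUT -m cgroup --cgroup 0x00100002 ! -d 10.0.0.0/8 -j REJECT

# Equivalent match if containers instead hang off a per-tenant bridge via
# veth pairs named veth-ingest*:
iptables -A FORWARD -m physdev --physdev-in veth-ingest+ ! -d 10.0.0.0/8 -j REJECT
{noformat}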

1B. A user interactively logging onto a machine would be placed into a default 
cgroup/network namespace so they are strictly limited. They would only be 
permitted to talk to the local YARN RM, HDFS NameNodes, DataNodes and Oozie for 
job submission. (This would prevent outbound {{scp}} and allow them only to 
submit a job or view logs.) This too would be configured via our out-of-band 
configuration management.
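
A minimal sketch of those default limits, again with an assumed classid (set at 
login by configuration management, e.g. via a PAM hook) and stock port numbers; 
the real ports would come from our configs:

{noformat}
# Interactive logins land in an assumed net_cls cgroup with classid 0x00100001;
# they may only reach the local RM, NameNode, DataNodes and Oozie.
iptables -A OUTPUT -m cgroup --cgroup 0x00100001 -p tcp -m multiport \
         --dports 8032,8088,8020,50010,11000 -j ACCEPT
iptables -A OUTPUT -m cgroup --cgroup 0x00100001 -j REJECT   # no outbound scp, etc.
{noformat}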

2. Then, when a user submits a job, YARN would set up the OS control (cgroup, 
network namespace or bridge interface[1]) for those processes to match the 
user's name, a queue or some other deterministic handle. (We would use that 
handle in our matching iptables rules, which would be pre-configured via 
configuration management.)
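
As a sketch of what that handle could be (purely an assumption on my part, not 
how YARN does this today): the NodeManager could place each container into a 
per-user or per-queue net_cls cgroup whose classid the operator knows ahead of 
time, so the static rules from 1A/1B match it:

{noformat}
# Hypothetical NM-side step: 0x0010 is an assumed major handle, 0x0002 an id
# derived deterministically from the user or queue name.
mkdir -p /sys/fs/cgroup/net_cls/hadoop-yarn/ingest_user
echo 0x00100002 > /sys/fs/cgroup/net_cls/hadoop-yarn/ingest_user/net_cls.classid
# Each container process for that user is tagged by joining the cgroup:
echo "$CONTAINER_PID" > /sys/fs/cgroup/net_cls/hadoop-yarn/ingest_user/cgroup.procs
{noformat}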

2A. An ingest user for a particular database would be permissioned to reach out 
to that remote database for ingest, to the local HDFS to write data, and to the 
necessary YARN ports. (All external YARN jobs should get strict review, but 
even if we did not strictly review, connections could only flow to this one 
remote location -- that one database -- and expose only what that one role 
account could read, likely data from just one database.)
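
Concretely, with a made-up database address and the assumed ingest classid from 
above, the allowance could be as narrow as one rule inserted ahead of the 1A 
catch-all:

{noformat}
# Ingest role account: may reach the one remote database; the catch-all from
# 1A rejects everything else off-cluster.
iptables -I OUTPUT 1 -m cgroup --cgroup 0x00100002 -d 192.0.2.10 -p tcp --dport 5432 -j ACCEPT
{noformat}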

2B. A role account or human account running ETL and ad hoc intra-cluster jobs 
would not be allowed to talk off the cluster. (Jobs could be arbitrary and 
unreviewed -- but host-based network control, i.e. the software firewall, would 
limit that one user; yay!)
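
For this account the policy is just the 1A catch-all with no carve-outs at all 
(classid again assumed):

{noformat}
# ETL/ad hoc classid: intra-cluster traffic only, nothing off-cluster.
iptables -A OUTPUT -m cgroup --cgroup 0x00100003 ! -d 10.0.0.0/8 -j REJECT
{noformat}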

2C. An egress user responsible for writing scrubbed data back out (e.g. 
reports) could reach out to a specific remote service endpoint to publish data, 
as well as to the local HDFS and YARN. (All such jobs should again get strict 
review, but the network controls would ensure data leakage from this account 
was limited to that one service and to what that one role account could read on 
HDFS.)
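
Again with made-up values, the egress policy might be:

{noformat}
# Egress/report classid: one HTTPS publishing endpoint plus the local
# cluster range; nothing else.
iptables -A OUTPUT -m cgroup --cgroup 0x00100004 -d 198.51.100.20 -p tcp --dport 443 -j ACCEPT
iptables -A OUTPUT -m cgroup --cgroup 0x00100004 ! -d 10.0.0.0/8 -j REJECT
{noformat}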

3. Other use cases could also benefit from this technique:

3A. YARN already uses cgroups with {{tc}} to shape a container's outbound 
traffic; see the JIRAs around YARN-2140.
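
For reference, the same classid handle is what {{tc}} can key on for shaping, 
along these lines (rates and handles are illustrative, not what YARN-2140 
actually configures):

{noformat}
# Shape the assumed 0x00100002 classid to 50mbit on eth0. With net_cls, the
# cgroup classifier steers packets to the tc class whose major:minor matches
# the classid (0x0010:0x0002 -> 10:2).
tc qdisc add dev eth0 root handle 10: htb
tc class add dev eth0 parent 10: classid 10:2 htb rate 50mbit ceil 100mbit
tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup
{noformat}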

3B. In general, we could audit which traffic comes from which users and then 
affect only the bad flows, or bill back for network usage. Today, I worry that 
if a pathological application reaches out to a service and knocks it down, I 
only know the machines involved and have to correlate {{netstat}} output to see 
which user it is (or hope I have a strong correlation)[2]. If I have OS network 
control, I can ask the host-based firewall to log which users/devices 
(namespaces, bridges, etc.) are talking to that service's IP, know who is 
running the pathological job, and throttle it as opposed to killing it.
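
With the same handles in place, the audit piece is just a logging rule ahead of 
the policy (service address made up):

{noformat}
# Record which UID is hammering the troubled service at 203.0.113.5 (--log-uid
# adds the owning socket's UID), so the job can be throttled rather than killed.
iptables -I OUTPUT -d 203.0.113.5 -j LOG --log-prefix "yarn-egress: " --log-uid
{noformat}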

This is not a request for full-scale software-defined-networking integration 
into YARN. For example, I suspect many YARN operators would not have the 
organizational support or manpower to integrate something like the [Cloud 
Native Computing Foundation's Container Network 
Interface|https://github.com/containernetworking/cni/blob/master/SPEC.md] via 
[Project Calico|https://www.projectcalico.org/]. The hope is that this brings 
the "policy-driven network security" aspect of those projects within reach of 
those who operate their own YARN clusters and the underlying OS.

[1]: http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/
[2]: In all fairness, I could use 
[{{tcpspy}}|https://directory.fsf.org/wiki/Tcpspy] today too and have it record 
the PID of processes.

> Provide means for container network policy control
> --------------------------------------------------
>
>                 Key: YARN-7468
>                 URL: https://issues.apache.org/jira/browse/YARN-7468
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>            Reporter: Clay B.
>            Priority: Minor
>
> To prevent data exfiltration from a YARN cluster, it would be very helpful to 
> have "firewall" rules able to map to a user/queue's containers.


