[
https://issues.apache.org/jira/browse/YARN-7468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246626#comment-16246626
]
Clay B. commented on YARN-7468:
-------------------------------
For the driving use-case, I run secure clusters (secured on the inside to keep
data from leaking back out); think of them as a drop box where users can build
models with restricted data. (Or my favorite analogy is a
[glovebox|https://en.wikipedia.org/wiki/File:Vacuum_Dry_Box.jpg] -- things can
go in but once in, they may be tainted and can't come out except by very
special decontamination.)
As such, I need to ensure that, network-wise, the cluster is reachable from/to
the local HDFS instances, HBase, databases, etc. Yet only users permissioned
for data-ingest jobs should reach out and pull data. We can vet, for example,
Oozie jobs to ensure they do only what we expect, but how do we keep a user
from reaching out to the same HBase or HDFS (when they otherwise have access)
and storing data there (or how do we allow a user to push reports to a simple
service)?
Ideally, I'd have all the external endpoints secured to disallow this cluster
from talking back except for very fine-grained allowances -- but it's a big
world and I can't. So, I'd like a way to set up firewall-rule equivalents with
some help from YARN on the secure cluster. The process I have in mind looks
like the following workflow:
1A. We would set up iptables rules statically beforehand to ensure traffic for
the various YARN-agreed-upon cgroup contexts, bridge devices or network
namespaces could only flow where we want; we'd do this via out-of-band
configuration management -- no need for YARN to do this setup.
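For concreteness, a minimal sketch of that static scaffolding, assuming the
agreed-upon handle is a {{net_cls}} classid and using iptables' cgroup match
(the classid and chain names below are made-up placeholders):
{code}
# Static scaffolding via configuration management; YARN never edits
# these rules. One chain per agreed-upon context, matched by a
# hypothetical net_cls classid (0x100001 here); default deny.
iptables -N YARN_INGEST
iptables -A OUTPUT -m cgroup --cgroup 0x100001 -j YARN_INGEST
iptables -A YARN_INGEST -j DROP
# ...analogous YARN_ETL and YARN_EGRESS chains for other classids.
{code}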
1B. A user interactively logging onto a machine would be placed into a default
cgroup/network namespace so they are strictly limited. They would only be
permitted to talk to the local YARN RM, HDFS NameNodes and DataNodes, and
Oozie for job submission. (This would prevent outbound scp and allow them only
to submit a job or view logs.) This too would be configured via our
out-of-band configuration management.
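A sketch of that default confinement using a per-user match on the OUTPUT
chain (the user, subnet and ports are placeholders standing in for the local
RM, NameNode, DataNode and Oozie endpoints):
{code}
# Interactive logins may only reach cluster services; everything
# else generated by this uid is rejected.
iptables -A OUTPUT -m owner --uid-owner alice -d 10.0.0.0/24 -p tcp \
    -m multiport --dports 8032,8020,50010,11000 -j ACCEPT
iptables -A OUTPUT -m owner --uid-owner alice -j REJECT
{code}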
2. Then, when a user submits a job, YARN would set up the OS control (cgroup,
network namespace or bridge interface) for those processes to match the
user's name, a queue or some other deterministic handle. (We would use that
handle for our configuration-managed matching iptables rules, which would be
pre-configured.)
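What that NodeManager-side step could look like at container launch -- to be
clear, this is the proposed behavior, not something YARN does today, and the
paths and ids are illustrative -- tagging the container's cgroup with a
classid derived from the user or queue:
{code}
# Hypothetical container-launch step: derive a deterministic net_cls
# classid from the user/queue and tag the container's cgroup with it.
CG=/sys/fs/cgroup/net_cls/hadoop-yarn/container_1234_0001_01_000002
mkdir -p "$CG"
echo 0x100001 > "$CG/net_cls.classid"   # the deterministic handle
echo "$CONTAINER_PID" > "$CG/tasks"     # tag the container process
{code}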
2A. An ingest user for a particular database would be permissioned to reach
out to that one remote database, to the local HDFS to write data, and to the
necessary YARN ports. (All external YARN jobs should have strict review, but
even if we did not strictly review, connections could only flow to that one
remote location -- that one database and what that one role account could read
-- likely data from only one database.)
2B. A role account or human account for running ETL and ad hoc intra-cluster
jobs would not be allowed to talk off the cluster. (Jobs could be arbitrary
and unreviewed -- but host-based network control, i.e. a software firewall,
would limit that one user; yea!)
2C. An egress user responsible for writing scrubbed data back out (e.g.
reports) could reach out to a specific remote service endpoint to publish
data, as well as to the local HDFS and YARN. (All jobs should again get strict
review, but the network controls would ensure data leakage from this account
was limited to that one service and what that one role account could read on
HDFS.)
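Pulling 2A-2C together, the pre-configured allowances might look like the
following (classids, addresses and ports are all illustrative, continuing the
sketch from 1A):
{code}
# 2A: ingest may reach one remote database plus the local cluster
# subnet (-I lands each ACCEPT ahead of the chain's default DROP).
iptables -I YARN_INGEST -d 192.0.2.10 -p tcp --dport 5432 -j ACCEPT
iptables -I YARN_INGEST -d 10.0.0.0/24 -j ACCEPT
# 2B: ETL/ad hoc jobs stay strictly on-cluster.
iptables -I YARN_ETL -d 10.0.0.0/24 -j ACCEPT
# 2C: egress may additionally publish to one remote service endpoint.
iptables -I YARN_EGRESS -d 198.51.100.7 -p tcp --dport 443 -j ACCEPT
iptables -I YARN_EGRESS -d 10.0.0.0/24 -j ACCEPT
{code}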
3. Other use cases could also benefit from this technique:
3A. YARN already uses cgroups with {{tc}} to shape a container's traffic; see
the JIRAs around YARN-2140.
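As I understand it, that mechanism is roughly an htb class per classid,
selected by {{tc}}'s cgroup classifier -- the same classids sketched above
would feed it directly (rates here are illustrative; a {{net_cls}} classid of
0x100001 selects class 10:1):
{code}
# htb root with a catch-all class, one shaped class per classid, and
# the cgroup classifier mapping net_cls classids onto tc classes.
tc qdisc add dev eth0 root handle 10: htb default 2
tc class add dev eth0 parent 10: classid 10:2 htb rate 10gbit   # default
tc class add dev eth0 parent 10: classid 10:1 htb rate 500mbit  # 0x100001
tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup
{code}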
3B. In general, we could audit what traffic comes from which users and affect
only bad flows, or bill back for network usage. Today, if a pathological
application reaches out to a service and knocks it down, I only know the
machines involved and have to correlate {{netstat}} output to see which user
it is (or hope I have a strong correlation)[2]. With OS-level network control,
I can ask the host-based firewall to log which users/devices (namespaces,
bridges, etc.) are talking to that service's IP, learn who is running the
pathological job, and throttle it as opposed to killing it.
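A sketch of that audit-then-throttle flow, reusing the placeholder classid and
tc class from above (the service IP is again illustrative):
{code}
# Log flows toward the struggling service, keyed by classid, so the
# kernel log names the offending context...
iptables -A OUTPUT -d 203.0.113.5 -m cgroup --cgroup 0x100001 \
    -j LOG --log-prefix "yarn-0x100001: "
# ...then throttle that classid's tc class instead of killing jobs.
tc class change dev eth0 parent 10: classid 10:1 htb rate 10mbit
{code}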
This is not a request for full-scale software-defined-networking integration
into YARN. For example, I suspect many YARN operators would not have the
organizational support or manpower to integrate something like the [Cloud
Native Computing Foundation's Container Network
Interface|https://github.com/containernetworking/cni/blob/master/SPEC.md] via
[Project Calico|https://www.projectcalico.org/]. The hope is that this brings
the "policy-driven network security" aspect of those projects within reach of
those who operate their own YARN clusters and the underlying OS.
[1]: http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/
[2]: In all fairness, I could use
[{{tcpspy}}|https://directory.fsf.org/wiki/Tcpspy] and have it record the PID
of processes today, too.
> Provide means for container network policy control
> --------------------------------------------------
>
> Key: YARN-7468
> URL: https://issues.apache.org/jira/browse/YARN-7468
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Reporter: Clay B.
> Priority: Minor
>
> To prevent data exfiltration from a YARN cluster, it would be very helpful to
> have "firewall" rules able to map to a user/queue's containers.