Thanks Jarek, this is all very good info to know. Unfortunately I'm doing
my evaluation on AWS MWAA so changing the executor isn't an option (at
least not yet). It'll probably be a hard sell to convince management to run
a self-managed Airflow cluster to get the heightened isolation of the
Kubernetes Executor, but maybe it's worth floating the idea :)  Thanks for
the info!

On Sat, Jan 15, 2022 at 11:18 AM Jarek Potiuk <[email protected]> wrote:

> > I mean "one team writing DAGs for multiple clients, and those tasks
> > can't collide". We don't require actual security from malicious users,
> > we just need some safety rails to prevent accidents.
>
> I think "/tmp" is only one of the problems. I am not sure if you are
> aware, but DAG writers have a lot of power in current Airflow. They
> could even accidentally delete, for example, the whole metadata DB
> or all DAG history by issuing an ORM command. There are no
> protections against that (by design, until multi-tenancy is
> implemented). So accidental /tmp clashes caused by inexperienced
> users are the least of your worries, I believe. Airflow (currently)
> places a lot of trust in DAG writers not to do anything "crazy".
> Again, this is by design: the assumption is that DAG writers know
> what they are doing and that their code is reviewed by their peers
> before it is executed.
>
> However, if you do want to focus only on file access, /tmp is still
> not your only problem. Depending on which executor you use, there
> are other possibilities of "clashing":
>
> 1) Local Executor - the tasks run as processes on the same machine
> as the scheduler, and ANY file (not only /tmp) can be shared or
> overwritten. If your teams choose some "/file/file-storage", they
> could also overwrite those files (there is no way to give different
> access levels to tasks belonging to different teams).
>
> 2) Celery Executor - the workers are usually separated from the
> scheduler, but one "Worker" can still handle multiple tasks from
> (potentially) different teams, and the same problems can occur. You
> can potentially separate teams by using different queues (with each
> team having a separate set of workers), but this is not at all
> "safe", as any DAG writer can override the queue to another value -
> effectively, any team member can run DAGs as a member of another
> team. No protection against that (except code review) is currently
> built in.
>
> 3) Kubernetes Executor - here the situation is a bit better. Each task
> always runs in a separate new Pod, and the only shared volumes are
> those you explicitly add in the Pod template. But a user could still
> conceptually run `DELETE * from DA` and delete all DAGs from all
> teams. No protection against such cases (here or with Local/Celery)
> is currently possible.
>
> So, in short - there are no "good" protections. If you want to protect
> against "accidental" /tmp file overrides between teams - use the K8S
> executor.
>
> What you could also do is set TMP_DIR to a different path for each
> team, or make your teams use only DockerOperator or the K8S operator
> to introduce file-level separation (but this would require some
> conventions adopted by the teams and trust that they are not breaking
> them - there is nothing in Airflow to enforce those). You could
> potentially "check" some of those via cluster policies:
>
> https://airflow.apache.org/docs/apache-airflow/stable/concepts/cluster-policies.html
> - but those checks will only be able to "check" whether your
> conventions are followed; you would not be able to detect if a member
> of one team pretends to be a member of another team (unless you also
> add some separation of folders and permissions for DAG submissions and
> link the team to the location of its DAGs). Even this is not foolproof,
> because any DAG writer could override the location dynamically when
> the DAG is parsed.
>
> J.
>
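
A rough sketch of the kind of cluster-policy check Jarek describes above,
assuming a hypothetical convention where each team's DAG files live in
their own folder and each team owns one Celery queue. DAGS_ROOT,
TEAM_QUEUES and team_for are made-up names for illustration; task_policy
and AirflowClusterPolicyViolation are the hook name and exception from the
Airflow 2 cluster-policy docs, and the function would normally live in
airflow_local_settings.py. The fallback exception class is only there so
the sketch runs without Airflow installed.

```python
from pathlib import PurePosixPath
from types import SimpleNamespace

try:
    from airflow.exceptions import AirflowClusterPolicyViolation
except ImportError:  # fallback so the sketch runs without Airflow
    class AirflowClusterPolicyViolation(Exception):
        pass

# Hypothetical convention (not part of Airflow): DAG files live under
# DAGS_ROOT/<team>/ and each team may only use its own queue.
DAGS_ROOT = PurePosixPath("/opt/airflow/dags")
TEAM_QUEUES = {"team_a": "team_a_queue", "team_b": "team_b_queue"}


def team_for(fileloc):
    """Return the team folder a DAG file sits in, or None if it is outside the convention."""
    try:
        rel = PurePosixPath(fileloc).relative_to(DAGS_ROOT)
    except ValueError:
        return None
    # A file directly under DAGS_ROOT has no team folder.
    return rel.parts[0] if len(rel.parts) > 1 else None


def task_policy(task):
    """Reject tasks whose queue does not match the team folder of their DAG file."""
    team = team_for(task.dag.fileloc)
    expected = TEAM_QUEUES.get(team)
    if expected is None:
        raise AirflowClusterPolicyViolation(
            f"DAG file {task.dag.fileloc} is not inside a known team folder"
        )
    if task.queue != expected:
        raise AirflowClusterPolicyViolation(
            f"Task uses queue {task.queue!r}, but team {team!r} must use {expected!r}"
        )


# Stand-in for a real operator, just to exercise the check.
demo_task = SimpleNamespace(
    queue="team_a_queue",
    dag=SimpleNamespace(fileloc="/opt/airflow/dags/team_a/etl.py"),
)
task_policy(demo_task)  # passes silently
```

As noted in the message above, this only catches accidental breaks of the
convention: a DAG writer who sets the location dynamically at parse time
could still get past it.
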
> On Fri, Jan 14, 2022 at 9:40 PM Chris Redekop <[email protected]> wrote:
> >
> > I mean "one team writing DAGs for multiple clients, and those tasks
> > can't collide". We don't require actual security from malicious users,
> > we just need some safety rails to prevent accidents.
> >
> > On Fri, Jan 14, 2022 at 1:31 PM Jed Cunningham <[email protected]> wrote:
> >>
> >> Hey Chris,
> >>
> >> I think the answer depends on what you mean by "multi-tenancy". I
> >> think you mean one team writing DAGs for multiple clients and those
> >> tasks can't collide. If so, the easiest way to have isolated workers
> >> is with KubernetesExecutor. No shared tmp!
> >>
> >> If instead you mean multiple teams sharing an instance (what I
> >> consider multi-tenancy), it's a totally different situation, and in
> >> most cases having separate instances is the right call if you
> >> require "security".
> >>
> >> Remember, DAGs are arbitrary Python and you can do all sorts of
> >> interesting things in them. Do you need isolation for accidental
> >> collisions, or do you need to protect tenant-a from
> >> possibly-bad-actor-tenant-b?
> >>
> >> More reading on Airflow multi-tenancy:
> >> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-1%3A+Improve+Airflow+Security
> >> https://lists.apache.org/[email protected]:lte=1y:multi-tenancy
> >>
> >> Jed
>
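
For the "different temp path per team" idea Jarek mentions earlier in the
thread, a minimal sketch of a helper that task code could call instead of
writing straight into /tmp. team_scratch_dir is a hypothetical name, not
anything Airflow provides, and it only helps if every DAG author actually
uses it. By default it builds on tempfile.gettempdir(), which honors the
TMPDIR environment variable.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def team_scratch_dir(team, base=None):
    """Yield a fresh, private directory under <base>/<team>/ and remove it on exit.

    Hypothetical helper: `team` is whatever label your convention uses
    (client name, team name, ...). `base` defaults to the system temp dir.
    """
    root = Path(base or tempfile.gettempdir()) / team
    root.mkdir(parents=True, exist_ok=True)
    # TemporaryDirectory picks a unique name, so two tasks for the same
    # team running on one worker do not collide either.
    with tempfile.TemporaryDirectory(dir=root) as path:
        yield Path(path)


# Usage inside a task callable: both clients write "out.csv" without clashing.
with team_scratch_dir("client_a") as a_dir, team_scratch_dir("client_b") as b_dir:
    (a_dir / "out.csv").write_text("rows for client a")
    (b_dir / "out.csv").write_text("rows for client b")
```

This is a safety rail against accidents only; nothing stops a task from
writing to another team's folder on a shared worker.
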
