Re: How to sandbox the tasks from each other?

Jarek Potiuk Sat, 15 Jan 2022 10:18:08 -0800

> I mean "one team writing DAGs for multiple clients, and those tasks can't 
> collide". We don't require actual security from malicious users, we just need 
> some safety rails to prevent accidents.

I think "/tmp" is only one of the problems. I am not sure if you are
aware, but DAG writers have a lot of power in current Airflow. They
could even accidentally - for example - delete the whole metadata DB
or all DAG history by issuing an ORM command to delete those. There
are no protections (and that's by design until multi-tenancy is
implemented). So worrying about accidental /tmp clashes by
inexperienced users is the least of your worries, I believe. Airflow
(currently) assumes a lot of trust in the DAG writers that they are
not doing anything "crazy" (again, this is by design: the assumption
is that DAG writers know what they are doing and their code is
reviewed by their peers before it is executed).

However, if you want to focus only on file access, then /tmp is
also not your only problem. Depending on which executor you use, there
are also other possibilities of "clashing":

1) Local Executor - the tasks are run as processes on the same machine
as the scheduler and ANY file (not only /tmp) can be shared/overwritten.
If your teams choose some "/file/file-storage" they could also
overwrite those files (there is no way to provide different access
levels to tasks belonging to different teams).

2) Celery Executor - the workers are usually separated from the
scheduler, but one "Worker" can still handle multiple tasks from
(potentially) different teams and the same problems can occur. You can
potentially separate different teams by using different queues (with
each team having a separate set of workers), but this is not at all
"safe", as any DAG writer can override the queue to another value -
effectively any team member can run DAGs as a member of another team.
No protection against that (except code review) is built in currently.

3) Kubernetes Executor - here the situation is a bit better. Each task
is always run in a separate new Pod and the only shared volumes are
those which you explicitly add in the Pod template (but a user could
still conceptually run `DELETE * from DA` and delete all DAGs from all
teams). No protection against such cases (same as with Local/Celery)
is possible currently.

So in short - there are no "good" protections. If you want to protect
against "accidental" /tmp file overrides between teams - use the K8S
executor.
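
For the narrow case of accidental /tmp clashes, one convention that
works with any executor is for each task to create its own uniquely
named scratch directory instead of writing to fixed paths. A minimal
sketch using only the Python standard library follows; the helper name
`team_scratch_dir` and the team names are hypothetical, and nothing in
Airflow enforces that tasks actually use it.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def team_scratch_dir(team, base=None):
    """Yield a unique scratch directory for one task run.

    `team` is only a readable prefix (hypothetical naming convention);
    uniqueness comes from tempfile.mkdtemp, not from the prefix.
    """
    # base=None lets tempfile choose its default temporary directory.
    path = tempfile.mkdtemp(prefix=f"{team}-", dir=base)
    try:
        yield path
    finally:
        # Remove the directory so leftovers are not picked up by a later task.
        shutil.rmtree(path, ignore_errors=True)


# Two "tasks" writing a file with the same name no longer collide.
with team_scratch_dir("client_a") as a, team_scratch_dir("client_b") as b:
    for directory, text in ((a, "A"), (b, "B")):
        with open(os.path.join(directory, "export.csv"), "w") as fh:
            fh.write(text)
    assert a != b
```

This only guards against accidents: tasks running as the same OS user
on the same machine can still read or delete each other's directories.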

What you could also do is set TMP_DIR to a different path for
each team, or make your teams use only DockerOperator or the K8S
operator to introduce file-level separation (but this would require
some conventions adopted by the teams and trust that they are not
breaking them - there is nothing in Airflow to enforce those). You
could potentially "check" some of those via cluster policies:
https://airflow.apache.org/docs/apache-airflow/stable/concepts/cluster-policies.html
- but those checks will only be able to "check" whether your
conventions are followed; you would not be able to detect if a member
of one team pretends to be a member of another team (unless you also
add some separation of folders and permissions for DAG submissions and
link the team to the DAG location). Even this is not foolproof, because
any DAG writer could override the location dynamically when the DAG is
parsed.
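
To make the "check your conventions" idea concrete, here is a rough
sketch of the kind of logic such a check could contain. It is plain
Python, not Airflow API: the folder layout (`DAGS_ROOT/<team>/...`),
the rule "queue name equals team folder name", and both function names
are assumptions made up for illustration. In a real deployment
something like this would presumably be called from a policy function
in airflow_local_settings.py with the task's queue and the DAG's file
location, and would raise Airflow's policy-violation exception instead
of ValueError.

```python
from pathlib import PurePosixPath

# Hypothetical layout: every team submits DAG files under DAGS_ROOT/<team>/...
DAGS_ROOT = PurePosixPath("/opt/airflow/dags")


def team_from_fileloc(fileloc):
    """Return the team folder a DAG file lives in, or None if outside the layout."""
    try:
        relative = PurePosixPath(fileloc).relative_to(DAGS_ROOT)
    except ValueError:
        # The file is not under DAGS_ROOT at all.
        return None
    # A file placed directly in DAGS_ROOT has no team folder.
    return relative.parts[0] if len(relative.parts) > 1 else None


def check_queue_convention(fileloc, queue):
    """Raise ValueError if the queue does not match the DAG file's team folder."""
    team = team_from_fileloc(fileloc)
    if team is None:
        raise ValueError(f"{fileloc} is not inside a team folder under {DAGS_ROOT}")
    if queue != team:
        raise ValueError(f"{fileloc}: queue {queue!r} does not match team {team!r}")


# Passes silently: the DAG lives in team_a's folder and uses team_a's queue.
check_queue_convention("/opt/airflow/dags/team_a/etl.py", "team_a")
```

As said above, this catches honest mistakes only; a DAG writer who
overrides the location at parse time would get past it.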

J.

On Fri, Jan 14, 2022 at 9:40 PM Chris Redekop <[email protected]> wrote:
>
> I mean "one team writing DAGs for multiple clients, and those tasks can't 
> collide". We don't require actual security from malicious users, we just need 
> some safety rails to prevent accidents.
>
> On Fri, Jan 14, 2022 at 1:31 PM Jed Cunningham <[email protected]> 
> wrote:
>>
>> Hey Chris,
>>
>> I think the answer depends on what you mean by "multi-tenancy". I think you 
>> mean one team writing DAGs for multiple clients and those tasks can't 
>> collide. If so, the easiest way to have isolated workers is with 
>> KubernetesExecutor. No shared tmp!
>>
>> If instead you mean multiple teams sharing an instance (what I consider 
>> multi-tenancy), it's a totally different situation, and in most cases having 
>> separate instances is the right call if you require "security".
>>
>> Remember, DAGs are arbitrary python and you can do all sorts of interesting 
>> things in them. Do you need isolation for accidental collisions, or do you 
>> need to protect tenant-a from possibly-bad-actor-tenant-b?
>>
>> More reading on Airflow multi-tenancy:
>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-1%3A+Improve+Airflow+Security
>> https://lists.apache.org/[email protected]:lte=1y:multi-tenancy
>>
>> Jed
