Hi Charles,
Your engineers have identified a common need, but one which is very difficult
to satisfy.
TL;DR: DoY gets as close to the requirements as possible within the constraints
of YARN and Drill. But, future projects could do more.
Your engineers want resource segregation among tenants: multi-tenancy. This is
very difficult to achieve at the application level. Consider Drill. It would
need some way to identify users to know which tenant they belong to. Then,
Drill would need a way to enqueue users whose queries would exceed the memory
or CPU limit for that tenant. Plus, Drill would have to be able to limit memory
and CPU for each query. Much work has been done to limit memory, but limiting CPU is
very difficult. Mature products such as Teradata can do this, but Teradata has
40 years of effort behind it.
Since it is hard to build multi-tenancy in at the app level (not impossible,
just very, very hard), the thought is to apply it at the cluster level. YARN
does this by limiting the resources available to processes (typically
map/reduce tasks) and by limiting the number of running processes. This works
for M/R because each map task uses disk to shuffle results to a reduce task,
so map and reduce tasks can run asynchronously.
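For illustration, those per-tenant limits usually take the form of Capacity
Scheduler queues. A minimal sketch, with queue and group names invented for
the example:

    <!-- capacity-scheduler.xml: one queue per tenant -->
    <property>
      <name>yarn.scheduler.capacity.root.queues</name>
      <value>tenantA,tenantB</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.tenantA.capacity</name>
      <!-- guaranteed share, as a percent of the parent queue -->
      <value>60</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.tenantA.maximum-capacity</name>
      <!-- hard cap when the cluster is busy -->
      <value>80</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.tenantA.acl_submit_applications</name>
      <!-- format is "users groups"; the leading space means groups only -->
      <value> tenantA-admins</value>
    </property>

YARN enforces these limits on whatever containers run in the queue, regardless
of the application.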
For tools such as Drill, which do in-memory processing (really,
across-the-network exchanges), both the sender and receiver have to run
concurrently. This is much harder to schedule than async M/R tasks: it means
that the entire Drill cluster (of whatever size) must be up and running to
execute a query.
The start-up time for Drill is far, far longer than the run time of a typical
query. So, it is not
feasible to use YARN to launch a Drill cluster for each query the way you would
do with Spark. Instead, under YARN, Drill is a long running service that
handles many queries.
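Concretely, the DoY client manages that long-running service. A sketch of the
lifecycle, using the DoY client script (paths may differ in your install):

    # Launch the Drill cluster under YARN once, not per query
    $DRILL_HOME/bin/drill-on-yarn.sh start

    # Check the Application Master and the running drillbits
    $DRILL_HOME/bin/drill-on-yarn.sh status

    # Stop the cluster when the tenant is done with it
    $DRILL_HOME/bin/drill-on-yarn.sh stop

Clients submit queries to the drillbits directly (located via ZooKeeper); YARN
never sees individual queries.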
Obviously, this is not ideal: I'm sure your engineers want to use a tenant's
resources for Drill when running queries, and for Spark, Hive, or maybe
TensorFlow otherwise. If Drill has to be long-running, I'm sure they'd like to
slosh
resources between tenants as is done in YARN. As noted above, this is a hard
problem that DoY did not attempt to solve.
One might suggest that Drill grab resources from YARN when Tenant A wants to
run a query, and release them when that tenant is done, grabbing new resources
when Tenant B wants to run. Impala tried this with Llama and found it did not
work. (This is why DoY is quite a bit simpler; no reason to rerun a failed
experiment.)
Some folks are looking to Kubernetes (K8s) as a solution. But, that just
replaces YARN with K8s: Drill is still a long-running process.
To solve the problem you identify, you'll need either:
* A bunch of work to build multi-tenancy into Drill itself, or
* A cloud-like solution in which each tenant spins up a Drill cluster within
its budget, spinning it down, or resizing it, to stay within an overall
budget.
The second option can be achieved under YARN with DoY, assuming DoY adds
support for graceful shutdown (or the cluster is reduced in size only when no
queries are active). Longer term, a more modern solution would be
Drill-on-Kubernetes (DoK?), which Abhishek started on.
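DoY already provides the resizing building block for that second option: the
cluster can be grown or shrunk while it runs. A rough sketch (see the DoY docs
for the exact syntax):

    # Grow the tenant's cluster by two drillbits when demand rises
    $DRILL_HOME/bin/drill-on-yarn.sh resize +2

    # Shrink it again when the tenant goes idle; without graceful
    # shutdown this can kill in-flight queries, hence the caveat above
    $DRILL_HOME/bin/drill-on-yarn.sh resize -2

An absolute count (e.g. "resize 5") sets the cluster to that size.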
Engineering is the art of compromise. The question for your engineers is how
to achieve the best result given the limitations of the software available
today, while also helping the Drill community improve the solutions over time.
Thanks,
- Paul
On Sunday, December 30, 2018, 9:38:04 PM PST, Charles Givre
<[email protected]> wrote:
Hi Paul,
Here’s what our engineers said:
From Paul’s response, I understand that there is a slight confusion around how
multi-tenancy has been enabled in our data lake.
Some more details on this –
Drill already has the concept of multi-tenancy where we can have multiple
Drill clusters running on the same data lake, enabled through different ports
and ZooKeeper. But all of this is launched through the same hard-coded YARN
queue that we provide as a config parameter.
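For reference, each such cluster is kept distinct in its own
drill-override.conf; a minimal sketch, with the cluster name and values
invented for the example:

    drill.exec: {
      cluster-id: "tenantA-drillbits"  # unique ID per cluster
      zk.connect: "zk1:2181,zk2:2181,zk3:2181"
      zk.root: "drill/tenantA"         # separate ZK root per cluster
      rpc.user.server.port: 31010      # bump the ports if clusters share hosts
      http.port: 8047                  # per-cluster web UI port
    }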
In our data lake, each tenant has a certain amount of compute capacity allotted
to them which they can use for their project work. This is provisioned through
individual YARN queues for each tenant (resource caging). This restricts
tenants from using cluster resources beyond a certain limit, so they do not
impact other tenants.
Access to these YARN queues is provisioned through ACL memberships.
——
Does this make sense? Is it possible to get Drill to work in this manner, or
should we look into opening up JIRAs and working on new capabilities?
> On Dec 17, 2018, at 21:59, Paul Rogers <[email protected]> wrote:
>
> Hi Kwizera,
> I hope my answer to Charles gave you the information you need. If not, please
> check out the DoY documentation or ask follow-up questions.
> Key thing to remember: Drill is a long-running YARN service; queries DO NOT
> go through YARN queues, they go through Drill directly.
>
> Thanks,
> - Paul
>
> On Monday, December 17, 2018, 11:01:04 AM PST, Kwizera hugues Teddy
><[email protected]> wrote:
>
> Hello,
> Same question here,
> I would like to know how Drill deals with this YARN functionality.
> Cheers.
>
> On Mon, Dec 17, 2018, 17:53 Charles Givre <[email protected]> wrote:
>
>> Hello all,
>> We are trying to set up a Drill cluster on our corporate data lake. Our
>> cluster requires dynamic YARN queue allocation for multi-tenant
>> environment. Is this something that Drill supports or is there a
>> workaround?
>> Thanks!
>> —C