Hi Charles,
Your engineers have identified a common need, but one which is very difficult
to satisfy.
TL;DR: DoY gets as close to the requirements as possible within the constraints
of YARN and Drill. But, future projects could do more.
Your engineers want resource segregation among tenants: multi-tenancy. This is
very difficult to achieve at the application level. Consider Drill. It would
need some way to identify users to know which tenant they belong to. Then,
Drill would need a way to enqueue users whose queries would exceed the memory
or CPU limit for that tenant. Plus, Drill would have to be able to limit memory
and CPU for each query. Much work has been done to limit memory, but limiting CPU is
very difficult. Mature products such as Teradata can do this, but Teradata has
40 years of effort behind it.
Since it is hard to build multi-tenancy in at the app level (not impossible,
just very, very hard), the thought is to apply it at the cluster level. YARN
does this by limiting the resources available to processes (typically
map/reduce tasks) and by limiting the number of running processes. This works
for M/R because each map task uses disk to shuffle results to a reduce task,
so map and reduce tasks can run asynchronously.
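For illustration, those per-tenant limits usually take the form of Capacity
Scheduler queues. A minimal sketch, with queue and group names invented for
the example:

    <!-- capacity-scheduler.xml: one queue per tenant -->
    <property>
      <name>yarn.scheduler.capacity.root.queues</name>
      <value>tenantA,tenantB</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.tenantA.capacity</name>
      <!-- guaranteed share, as a percent of the parent queue -->
      <value>60</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.tenantA.maximum-capacity</name>
      <!-- hard cap when the cluster is busy -->
      <value>80</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.tenantA.acl_submit_applications</name>
      <!-- format is "users groups"; the leading space means groups only -->
      <value> tenantA-admins</value>
    </property>

YARN enforces these limits on whatever containers run in the queue, regardless
of the application.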
For tools such as Drill, which do in-memory processing (really,
across-the-network exchanges), both the sender and receiver have to run
concurrently. This is much harder to schedule than async M/R tasks: it means
that the entire Drill cluster (of whatever size) must be up and running to
execute a query.
The start-up time for Drill is far, far longer than the run time of a typical
query. So, it is not
feasible to use YARN to launch a Drill cluster for each query the way you would
do with Spark. Instead, under YARN, Drill is a long running service that
handles many queries.
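Concretely, the DoY client manages that long-running service. A sketch of the
lifecycle, using the DoY client script (paths may differ in your install):

    # Launch the Drill cluster under YARN once, not per query
    $DRILL_HOME/bin/drill-on-yarn.sh start

    # Check the Application Master and the running drillbits
    $DRILL_HOME/bin/drill-on-yarn.sh status

    # Stop the cluster when the tenant is done with it
    $DRILL_HOME/bin/drill-on-yarn.sh stop

Clients submit queries to the drillbits directly (located via ZooKeeper); YARN
never sees individual queries.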
Obviously, this is not ideal: I'm sure your engineers want to use a tenant's
resources for Drill when running queries, and for Spark, Hive, or maybe
TensorFlow otherwise. If Drill has to be long-running, I'm sure they'd like to
slosh
resources between tenants as is done in YARN. As noted above, this is a hard
problem that DoY did not attempt to solve.
One might suggest that Drill grab resources from YARN when Tenant A wants to
run a query, and release them when that tenant is done, grabbing new resources
when Tenant B wants to run. Impala tried this with Llama and found it did not
work. (This is why DoY is quite a bit simpler; no reason to rerun a failed
experiment.)
Some folks are looking to Kubernetes (K8s) as a solution. But, that just
replaces YARN with K8s: Drill is still a long-running process.
To solve the problem you identify, you'll need either:
* A bunch of work to build multi-tenancy into Drill itself, or
* A cloud-like solution in which each tenant spins up a Drill cluster within
its budget, spinning it down, or resizing it, to stay within an overall
budget.
The second option can be achieved under YARN with DoY, assuming DoY adds
support for graceful shutdown (or the cluster is reduced in size only when no
queries are active). Longer term, a more modern solution would be
Drill-on-Kubernetes (DoK?), which Abhishek started on.
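DoY already provides the resizing building block for that second option: the
cluster can be grown or shrunk while it runs. A rough sketch (see the DoY docs
for the exact syntax):

    # Grow the tenant's cluster by two drillbits when demand rises
    $DRILL_HOME/bin/drill-on-yarn.sh resize +2

    # Shrink it again when the tenant goes idle; without graceful
    # shutdown this can kill in-flight queries, hence the caveat above
    $DRILL_HOME/bin/drill-on-yarn.sh resize -2

An absolute count (e.g. "resize 5") sets the cluster to that size.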
Engineering is the art of compromise. The question for your engineers is how
to achieve the best result given the limitations of the software available
today, while also helping the Drill community improve the solutions over time.
Thanks,
- Paul
On Sunday, December 30, 2018, 9:38:04 PM PST, Charles Givre
<[email protected]> wrote:
Hi Paul,
Here’s what our engineers said:
From Paul’s response, I understand that there is a slight confusion around how
multi-tenancy has been enabled in our data lake.
Some more details on this –
Drill already has the concept of multi-tenancy where we can have multiple
Drill clusters running on the same data lake, enabled through different ports
and ZooKeeper. But all of this is launched through the same hard-coded YARN
queue that we provide as a config parameter.
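For reference, each such cluster is kept distinct in its own
drill-override.conf; a minimal sketch, with the cluster name and values
invented for the example:

    drill.exec: {
      cluster-id: "tenantA-drillbits"  # unique ID per cluster
      zk.connect: "zk1:2181,zk2:2181,zk3:2181"
      zk.root: "drill/tenantA"         # separate ZK root per cluster
      rpc.user.server.port: 31010      # bump the ports if clusters share hosts
      http.port: 8047                  # per-cluster web UI port
    }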
In our data lake, each tenant has a certain amount of compute capacity allotted
to them which they can use for their project work. This is provisioned through
individual YARN queues for each tenant (resource caging). This restricts
tenants from using cluster resources beyond a certain limit, so they do not
impact other tenants.
Access to these YARN queues is provisioned through ACL memberships.
——
Does this make sense? Is it possible to get Drill to work in this manner, or
should we look into opening up JIRAs and working on new capabilities?
> On Dec 17, 2018, at 21:59, Paul Rogers <[email protected]> wrote:
>
> Hi Kwizera,
> I hope my answer to Charles gave you the information you need. If not, please
> check out the DoY documentation or ask follow-up questions.
> Key thing to remember: Drill is a long-running YARN service; queries DO NOT
> go through YARN queues, they go through Drill directly.
>
> Thanks,
> - Paul
>
> On Monday, December 17, 2018, 11:01:04 AM PST, Kwizera hugues Teddy
><[email protected]> wrote:
>
> Hello,
> Same question here,
> I would like to know how Drill deals with this YARN functionality.
> Cheers.
>
> On Mon, Dec 17, 2018, 17:53 Charles Givre <[email protected]> wrote:
>
>> Hello all,
>> We are trying to set up a Drill cluster on our corporate data lake. Our
>> cluster requires dynamic YARN queue allocation for multi-tenant
>> environment. Is this something that Drill supports or is there a
>> workaround?
>> Thanks!
>> —C