I do not think you will get that kind of benchmark information from customers about their production workloads. But from the customers I have worked with who have taken Drill to production, here is some information that may be of use to you:
1. The trend universally has been to use beefier machines for in-memory query engines. We see 256GB RAM and 32 cores as the most frequent configuration, with 2x10GbE on the network side.

2. The most common dedicated cluster size for starting out with Drill in production has been around 16-20 nodes with the above configuration. I have several customers who have deployed on 200+ nodes as well, but in those scenarios Drill is one service among many.

3. The concurrency we see in the above settings is a function of the size of the dataset and the complexity of the customer queries. In general, Little's law holds: the smaller the chunk of work to be processed, the higher the throughput (a worked example follows the quoted message below). Our understanding of this will change further with new releases of Drill, where spill-to-disk features will make it more of a pessimistic execution engine. The use of queues can also change this picture (a configuration sketch follows the worked example).

4. From my company's side, we do have TPCH and TPCDS benchmarks that I share with customers. But such benchmarks are flawed, because they come from the world of traditional warehousing, where the competition was among general-purpose query engines. For example, our tests show that at higher and higher data scales, Drill beats Impala on these benchmarks; the Hive LLAP folks tout the same. But that does not necessarily imply it is the best tool choice for a given production environment. This is why I am resistant to getting into the war of the query engines, in which every engine beats the others under a given set of primed conditions.

5. It is an absolute must that you understand the query patterns the system will have to withstand, with the data characteristics specific to your use case. I would only trust that. Big data systems are going to be application specific and will require tuning. That also means you have to revisit the kinds of analytics you would like your end users to have, which again raises the question: what kinds of analytics truly generate value for the BI user?

Best,
Saurabh

On Wed, Oct 18, 2017 at 10:26 PM, PROJJWAL SAHA <[email protected]> wrote:

> Hi,
>
> Is there any public performance benchmark that users have achieved using
> Drill in production scenarios ? It would be useful if someone can pass me
> any links for customer user stories.
>
> Regards
>
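To make the Little's law point in (3) concrete, here is a worked example; the numbers are purely illustrative and not from any customer or benchmark:

    Little's law:  L = lambda * W
      L      = average number of queries in flight (concurrency)
      lambda = throughput, in queries completed per second
      W      = average time a query spends in the system

    Rearranged, lambda = L / W. At a fixed concurrency of L = 20,
    queries averaging W = 4 seconds give lambda = 20 / 4 = 5 queries/sec.
    Shrink the chunk of work so that W = 1 second, and throughput rises
    to 20 / 1 = 20 queries/sec at the same concurrency.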
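For the queueing knob also mentioned in (3), below is a minimal sketch of enabling Drill's ZooKeeper-based query queues via the exec.queue.* system options. The limits and threshold are illustrative placeholders to size against your own workload, not recommendations:

    ALTER SYSTEM SET `exec.queue.enable` = true;
    -- max "small" (cheap) queries admitted concurrently (illustrative value)
    ALTER SYSTEM SET `exec.queue.small` = 10;
    -- max "large" (expensive) queries admitted concurrently (illustrative value)
    ALTER SYSTEM SET `exec.queue.large` = 2;
    -- planner cost above which a query is classified as "large"
    ALTER SYSTEM SET `exec.queue.threshold` = 30000000;

Queries beyond these limits wait in the queue rather than being admitted, which trades some individual query latency for more predictable behavior under concurrency.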
