I do not think you will get that kind of benchmark information from customers about their production workloads. But from the customers I have worked with who have taken Drill to production, here is some information that may be of use to you:
1. The trend universally has been to use beefier machines for in-memory query engines. We see 256GB RAM and 32 cores as the most frequent configuration, with 2x10GbE on the network side.

2. The most common dedicated cluster size for starting out with Drill in production has been around 16-20 nodes with the above configuration. I have several customers who have deployed on 200+ nodes as well, but in those scenarios Drill is one service among many.

3. The concurrency we see in the above settings is a function of the size of the dataset and the complexity of the customer queries. In general, Little's law holds: the smaller the chunk of work to be processed, the higher the throughput (a worked example follows the quoted message below). Our understanding of this will change further with new releases of Drill, where spill-to-disk features will make it more of a pessimistic execution engine. The use of queues can also change this picture (a configuration sketch follows the worked example).

4. From my company's side, we do have TPCH and TPCDS benchmarks that I share with customers. But such benchmarks are flawed, because they come from the world of traditional warehousing, where the competition was among general-purpose query engines. For example, our tests show that at higher and higher data scales, Drill beats Impala on these benchmarks; the Hive LLAP folks tout the same. But that does not necessarily imply it is the best tool choice for a given production environment. This is why I am resistant to getting into the war of the query engines, in which every engine beats the others under a given set of primed conditions.

5. It is an absolute must that you understand the query patterns the system will have to withstand, with the data characteristics specific to your use case. I would only trust that. Big data systems are going to be application specific and will require tuning. That also means you have to revisit the kinds of analytics you would like your end users to have, which again raises the question: what kinds of analytics truly generate value for the BI user?

Best,
Saurabh

On Wed, Oct 18, 2017 at 10:26 PM, PROJJWAL SAHA <[email protected]> wrote:

> Hi,
>
> Is there any public performance benchmark that users have achieved using
> Drill in production scenarios ? It would be useful if someone can pass me
> any links for customer user stories.
>
> Regards
>
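To make the Little's law point in (3) concrete, here is a worked example; the numbers are purely illustrative and not from any customer or benchmark:

    Little's law:  L = lambda * W
      L      = average number of queries in flight (concurrency)
      lambda = throughput, in queries completed per second
      W      = average time a query spends in the system

    Rearranged, lambda = L / W. At a fixed concurrency of L = 20,
    queries averaging W = 4 seconds give lambda = 20 / 4 = 5 queries/sec.
    Shrink the chunk of work so that W = 1 second, and throughput rises
    to 20 / 1 = 20 queries/sec at the same concurrency.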
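For the queueing knob also mentioned in (3), below is a minimal sketch of enabling Drill's ZooKeeper-based query queues via the exec.queue.* system options. The limits and threshold are illustrative placeholders to size against your own workload, not recommendations:

    ALTER SYSTEM SET `exec.queue.enable` = true;
    -- max "small" (cheap) queries admitted concurrently (illustrative value)
    ALTER SYSTEM SET `exec.queue.small` = 10;
    -- max "large" (expensive) queries admitted concurrently (illustrative value)
    ALTER SYSTEM SET `exec.queue.large` = 2;
    -- planner cost above which a query is classified as "large"
    ALTER SYSTEM SET `exec.queue.threshold` = 30000000;

Queries beyond these limits wait in the queue rather than being admitted, which trades some individual query latency for more predictable behavior under concurrency.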
