Re: Apache Drill Sizing guide

2020-04-13 Thread Paul Rogers
Hi Navin,

Ted is absolutely right. To add a bit of context, here are some of the factors 
we've considered in the past.

Queries: A simple filter scan takes the minimum resources: scan the tables, 
throw away most of the data, and deliver the rows that are needed. Such a use 
case is strongly driven by scan time. As Ted suggests, partitioning drives down 
scan cost. If every query hits the full TB of data, you will need many machines 
& disks to get adequate performance. Depending on your hardware, if you get 100 
MB/s read performance per disk, it will take 10,000 seconds (roughly three hours) to
read your TB of data on one disk. If you have 100 disks, the time drops to 100 
seconds. You didn't mention your storage technology: these numbers are likely 
entirely different for something like S3.
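
To put that arithmetic in one place, here is a back-of-the-envelope sketch; the 100 MB/s per disk figure and the disk counts are just the assumptions above, not measurements, so plug in your own numbers:

    # Rough scan-time estimate: data volume / aggregate disk throughput.
    # All figures are assumptions; substitute measurements from your hardware.
    data_bytes = 1 * 1000**4          # ~1 TB of Parquet
    disk_mb_per_sec = 100             # sequential read per disk (assumed)
    for disks in (1, 10, 100):
        seconds = data_bytes / (disk_mb_per_sec * 1000**2 * disks)
        print(f"{disks:>3} disks -> full scan ~{seconds:,.0f} s")
    #   1 disks -> full scan ~10,000 s
    #  10 disks -> full scan ~1,000 s
    # 100 disks -> full scan ~100 s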


So, you don't want to read the full TB. By using good partitioning (typically 
by date), you might reduce the scan by a factor of 1000. Huge win. And this is 
true whether you use Drill, Spark, Presto or Python to read your data.
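
As a rough sketch of what date partitioning looks like in practice: the paths, host name and column names below are hypothetical, but Drill does expose directory levels as implicit dir0/dir1/... columns that the planner can use to prune, and queries can be submitted over the embedded web server (default port 8047):

    # Sketch: a date-partitioned layout lets Drill prune directories via the
    # implicit dir0/dir1/dir2 columns. Paths, host and column names here are
    # hypothetical; adjust to your own layout.
    #
    #   /data/events/2020/04/13/part-00000.parquet   <- year/month/day dirs
    #
    import requests  # assumes the Drill web server on its default port 8047

    sql = """
        SELECT trade_id, amount
        FROM dfs.`/data/events`
        WHERE dir0 = '2020' AND dir1 = '04' AND dir2 = '13'  -- prunes to one day
          AND amount > 1000000
    """
    resp = requests.post("http://drillbit-host:8047/query.json",
                         json={"queryType": "SQL", "query": sql})
    print(resp.json().get("rows", [])[:5])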

The next question is the selectivity of your queries. In the simple filter 
case, are you returning a few rows or a GB of rows? The more rows, the more 
Drill must grind through the data once it is read. This internal grinding 
requires CPU and benefits from parallelism. The amount you need depends on the 
number of rows processed per query.

There is little memory needed for a pure filter query: Drill reads the data, 
tosses most rows, and returns the remainder to the client. Interesting queries, 
however, do more than filtering: they might group, join, sort and so on. Each 
of these operations carries its own cost. Joins are network heavy (to shuffle 
data). Sorts want enough memory to buffer the entire result set to avoid slow 
disk-based sorts.
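
As a crude way to gauge whether a sort will stay in memory (the row count and row width below are placeholders you would take from your own data and profiles):

    # Crude in-memory sort estimate: buffered bytes ~= rows * average row width.
    # Both inputs are assumptions; take real values from your query profiles.
    rows_sorted = 50_000_000          # rows reaching the sort (assumed)
    avg_row_bytes = 200               # average in-memory row width (assumed)
    sort_bytes = rows_sorted * avg_row_bytes
    print(f"sort needs roughly {sort_bytes / 2**30:.1f} GiB to stay in memory")
    # If this exceeds the memory you can give the sort operator, Drill spills
    # to disk and the query slows down accordingly.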

The query profile will provide lots of good information about the row count, 
memory usage and operators in each of your queries so you can determine the 
resources needed for each. When Ted asks you to analyze each query, the best 
way to do that is to look at the query profile and see which resources were 
needed by that query.
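
If you prefer to pull those numbers programmatically rather than clicking through the web UI, the same profiles are exposed over REST. A minimal sketch, assuming the default port 8047 and a hypothetical host name (the exact JSON field names can vary across Drill versions):

    # Sketch: list recent query profiles over Drill's REST interface.
    import requests

    profiles = requests.get("http://drillbit-host:8047/profiles.json").json()
    for q in profiles.get("finishedQueries", [])[:10]:
        print(q.get("queryId"), q.get("user"), q.get("state"))
    # A single profile, with per-operator row counts and peak memory, is at:
    # requests.get("http://drillbit-host:8047/profiles/<query_id>.json").json()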


Then, there are concurrent users. What do you mean by concurrent? 40 people who 
might use Drill during the day, so that only a few are active at the same time? 
Or 40 users each watching a dashboard that runs 10 queries, refreshed every 
second, which will place a huge load on the system? Most humans are 
intermittent users. Dashboards, when overdone, can kill any system.
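
To put rough numbers on the difference (purely illustrative figures, not measurements):

    # Illustrative load comparison; all figures are assumptions.
    casual_users, queries_per_user_per_hour = 40, 4
    casual_qps = casual_users * queries_per_user_per_hour / 3600
    dashboards, panels, refresh_sec = 40, 10, 1
    dashboard_qps = dashboards * panels / refresh_sec
    print(f"casual analysts: ~{casual_qps:.2f} queries/sec")
    print(f"dashboards:      ~{dashboard_qps:.0f} queries/sec")
    # ~0.04 vs ~400 queries/sec -- roughly four orders of magnitude apart.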

Also, as Ted has said many times, if you run 40 queries a minute, and each 
takes 1 second, then concurrency turns into sequential processing. On the other 
hand, if one query uses all cluster resources for an hour, and you run 10 of 
them per hour, then the workload will fail.

Once you determine the actual concurrency level (the number of queries that run 
at the same time), work out the mix. Sum the resources for those concurrent 
queries. That tells you the cluster capacity you need (plus some safety margin, 
because load is random). Drill does have features to smooth out load peaks by 
queuing queries. They are not state-of-the-art, but they can prevent the 
overloads that otherwise occur at random peak loads when there is not 
sufficient reserve capacity.
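
A sketch of that arithmetic, with placeholder per-query figures you would instead pull from your query profiles:

    # Capacity sketch: sum per-query resources over the truly-concurrent mix,
    # then add a safety margin. Per-query figures below are placeholders.
    workload = [
        # (concurrent copies, cores per query, memory GB per query)
        (10,  2,  4),   # light filter queries
        ( 4,  8, 16),   # joins / aggregations
        ( 1, 16, 64),   # one heavy sort
    ]
    cores  = sum(n * c for n, c, _ in workload)
    mem_gb = sum(n * m for n, _, m in workload)
    margin = 1.3        # ~30% headroom for random peaks (assumed)
    print(f"need ~{cores * margin:.0f} cores and ~{mem_gb * margin:.0f} GB "
          f"of memory across the cluster")

If you want to experiment with the queuing features, they are controlled by the exec.queue.* system options (exec.queue.enable plus the small/large thresholds), if I remember the option names right.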

You didn't mention your deployment model. In classic Hadoop days, with an 
on-prem cluster, you had to work all this out ahead of time so you could plan 
your equipment purchases 3 to 6 months in advance. In the cloud, however, 
especially with K8s, you just resize the cluster based on demand. Drill is not 
quite there yet with our K8s integration, but the team is making good progress 
and we should have a solution soon; contributions/feedback would be very 
helpful.


In short, there are many factors, some rather complex. (We all know it should 
be simple, but having done this with many DBs, it just turns out that it never 
is.)

We'd be happy to offer pointers if you can share a few more specifics. Also, 
perhaps we can distill this discussion into a few pages in the Drill docs.


Thanks,
- Paul

 

On Monday, April 13, 2020, 7:59:08 AM PDT, Ted Dunning wrote:

Navin,

Your specification of 40 concurrent users and data size are only a bit less
than half the story. Without the rest of the story, nobody will be able to
give you even general guidance beyond a useless estimate that it will take
between roughly 1 and 40 drillbits with a gob of memory.

To do better than such non-specific "guidance", you need to add some
additional answers. For example,

What is the query mix?
How long do these queries run without any question of concurrency?
Could that query speed be enhanced with better partitioning?
How are you storing your data?
What promises are you making to these concurrent users?



On Mon, Apr 13, 2020 at 7:21 AM Navin Bhawsar wrote:

> Hi Team ,
>
> We are planning to use Drill to query an HDFS cluster with about a terabyte
> of data in Parquet file format. There will be approx. 

Re: Apache Drill Sizing guide

2020-04-13 Thread Ted Dunning
Navin,

Your specification of 40 concurrent users and data size are only a bit less
than half the story. Without the rest of the story, nobody will be able to
give you even general guidance beyond a useless estimate that it will take
between roughly 1 and 40 drillbits with a gob of memory.

To do better than such non-specific "guidance", you need to add some
additional answers. For example,

What is the query mix?
How long do these queries run without any question of concurrency?
Could that query speed be enhanced with better partitioning?
How are you storing your data?
What promises are you making to these concurrent users?



On Mon, Apr 13, 2020 at 7:21 AM Navin Bhawsar wrote:

> Hi Team ,
>
> We are planning to use Drill to query an HDFS cluster with about a terabyte
> of data in Parquet file format. There will be approx. 40 concurrent users
> using this environment.
>
> Can you please advise on the two points below, considering the above
> workload, for optimum performance:
> 1. Number of drillbits
> 2. Memory config per drillbit
>
> Also, is there any sizing guide for Apache Drill to refer to?
>
>
> Thanks,
> Navin
>


Apache Drill Sizing guide

2020-04-13 Thread Navin Bhawsar
Hi Team ,

We are planning to use Drill to query an HDFS cluster with about a terabyte
of data in Parquet file format. There will be approx. 40 concurrent users
using this environment.

Can you please advise on the two points below, considering the above
workload, for optimum performance:
1. Number of drillbits
2. Memory config per drillbit

Also, is there any sizing guide for Apache Drill to refer to?


Thanks,
Navin