Hi Navin,
Ted is absolutely right. To add a bit of context, here are some of the factors
we've considered in the past.
Queries: A simple filter scan takes the minimum resources: scan the tables,
throw away most of the data, and deliver the rows that are needed. Such a use
case is strongly driven by scan time. As Ted suggests, partitioning drives down
scan cost. If every query hits the full TB of data, you will need many machines
& disks to get adequate performance. Depending on your hardware, if you get 100
MB/s read performance per disk, it will take 10,000 seconds (three hours) to
read your TB of data on one disk. If you have 100 disks, the time drops to 100
seconds. You didn't mention your storage technology: these numbers are likely
entirely different for something like S3.
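To make the arithmetic concrete, here is a quick back-of-envelope sketch in
Python. The 1 TB size and 100 MB/s per-disk rate are just the assumptions
above; plug in your own numbers:

    # Rough scan-time estimate: total bytes divided by aggregate disk bandwidth.
    TOTAL_BYTES = 1 * 10**12           # ~1 TB of Parquet (assumed)
    MB_PER_SEC_PER_DISK = 100          # assumed sequential read rate per disk

    def scan_seconds(num_disks):
        bandwidth = num_disks * MB_PER_SEC_PER_DISK * 10**6   # bytes/sec
        return TOTAL_BYTES / bandwidth

    print(scan_seconds(1))     # 10000.0 seconds, about three hours
    print(scan_seconds(100))   # 100.0 seconds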
So, you don't want to read the full TB. By using good partitioning (typically
by date), you might reduce the scan by a factor of 1000. Huge win. And this is
true whether you use Drill, Spark, Presto or Python to read your data.
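As one illustration (certainly not the only way to do it), here is how you
might write a date-partitioned Parquet layout with PyArrow. The paths and
column names are invented, and the exact Drill syntax for pruning depends on
how you name the directories:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical table with an event_date column used as the partition key.
    table = pa.table({
        "event_date": ["2020-04-01", "2020-04-01", "2020-04-02"],
        "user_id":    [1, 2, 3],
        "amount":     [9.99, 4.50, 12.00],
    })

    # Produces one subdirectory per date value, e.g.
    #   /data/events/event_date=2020-04-01/<file>.parquet
    #   /data/events/event_date=2020-04-02/<file>.parquet
    pq.write_to_dataset(table, root_path="/data/events",
                        partition_cols=["event_date"])

    # A query that filters on the partition directory, for example
    #   SELECT ... FROM dfs.`/data/events` WHERE dir0 = 'event_date=2020-04-01'
    # only touches the matching directory instead of the whole data set.

The same idea applies whatever tool writes the files: the point is that the
filter column shows up in the directory structure, so readers can skip whole
directories.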
The next question is the selectivity of your queries. In the simple filter
case, are you returning a few rows or a GB of rows? The more rows, the more
Drill must grind through the data once it is read. This internal grinding
requires CPU and benefits from parallelism. The amount you need depends on the
number of rows processed per query.
There is little memory needed for a pure filter query. Drill reads the data,
tosses most rows, and returns the remainder to the client. Interesting queries,
however, do more than filtering: they might group, join, sort and so on. Each
of these operations carries its own cost. Joins are network heavy (to shuffle
data). Sorts want enough memory to buffer the entire result set to avoid slow
disk-based sorts.
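As a hedged sketch of the kind of arithmetic involved, a fully in-memory sort
needs roughly rows times bytes-per-row, plus some overhead. All the numbers
below are placeholders to be replaced with values from your own queries:

    # Rough sizing for an in-memory sort that must buffer the whole result set.
    rows_after_filter = 50_000_000    # rows surviving the WHERE clause (assumed)
    avg_row_bytes     = 200           # average in-memory row width (assumed)
    overhead_factor   = 1.5           # rough allowance for internal buffers

    sort_memory_gb = rows_after_filter * avg_row_bytes * overhead_factor / 2**30
    print(f"~{sort_memory_gb:.1f} GB to sort entirely in memory")
    # If the query can't get that much memory, the sort spills to disk,
    # which still works but is much slower.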
The query profile will provide lots of good information about the row count,
memory usage and operators in each of your queries so you can determine the
resources needed for each. When Ted asks you to analyze each query, the best
way to do that is to look at the query profile and see which resources were
needed by that query.
Then, there are concurrent users. What do you mean by concurrent? 40 people who
might use Drill during the day so that only a few are active at the same time?
Or, 40 users each watching a dashboard that runs 10 queries, refreshed every
second, which would place a huge load on the system? Most humans are
intermittent users. Dashboards, when overdone, can kill any system.
Also, as Ted has said many times, if you run 40 queries a minute, and each
takes 1 second, then concurrency turns into sequential processing. On the other
hand, if one query uses all cluster resources for an hour, and you run 10 of
them per hour, then the workload will fail.
Once you determine the actual "concurrent concurrency" level (number of queries
that run at the same time), work out the mix. Sum the resources for those
concurrent queries. That tells you the cluster capacity you need (plus some
safety margin because load is random). Drill does have features to smooth out
load peaks by queuing queries. The queuing is not state-of-the-art, but it can
prevent the overloads that otherwise occur at random load peaks when there is
not enough reserve capacity.
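If it helps, here is a toy sizing sketch along those lines. The query classes,
rates, and per-query costs are invented for illustration; the real numbers
should come from your query profiles:

    # Estimate how many queries run at once (per class) and sum their costs.
    workload = [
        # (name, queries per hour, avg runtime sec, peak memory GB, cores)
        ("dashboard filter",  2400,   1,  0.5,  2),
        ("daily aggregation",   10, 300,  8.0, 16),
    ]

    total_mem_gb = 0.0
    total_cores  = 0.0
    for name, per_hour, runtime_s, mem_gb, cores in workload:
        in_flight = per_hour * runtime_s / 3600.0   # avg queries running at once
        total_mem_gb += in_flight * mem_gb
        total_cores  += in_flight * cores
        print(f"{name}: ~{in_flight:.1f} running at any instant")

    safety = 1.5   # headroom because load peaks are random
    print(f"rough capacity: {total_mem_gb * safety:.0f} GB memory, "
          f"{total_cores * safety:.0f} cores")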
You didn't mention your deployment model. In classic Hadoop days, with an
on-prem cluster, you had to work all this out ahead of time so you could plan
your equipment purchases 3 to 6 months in advance. In the cloud, however,
especially with K8s, you just resize the cluster based on demand. Drill is not
quite there yet with our K8s integration, but the team is making good progress
and we should have a solution soon; contributions/feedback would be very
helpful.
In short, there are many factors, some rather complex. (We all know it should
be simple, but having done this with many DBs, it just turns out that it never
is.)
We'd be happy to offer pointers if you can offer a few more specifics. Also,
perhaps we can distill this discussion into a few pages in the Drill docs.
Thanks,
- Paul
On Monday, April 13, 2020, 7:59:08 AM PDT, Ted Dunning
wrote:
Navin,
Your specification of 40 concurrent users and the data size is only a bit
less than half the story. Without the rest of the story, nobody will be able
to give you even general guidance beyond a useless estimate that it will take
between roughly 1 and 40 drillbits, each with a gob of memory.
To do better than such non-specific "guidance", you need to add some
additional answers. For example,
What is the query mix?
How long do these queries run without any question of concurrency?
Could that query speed be enhanced with better partitioning?
How are you storing your data?
What promises are you making to these concurrent users?
On Mon, Apr 13, 2020 at 7:21 AM Navin Bhawsar
wrote:
> Hi Team,
>
> We are planning to use Drill to query an HDFS cluster with about a terabyte
> of data in Parquet file format. There will be approx.