Hi Ted,
I echo the question about workload: I started with the simplest possible
explanation, hoping that would spur a bit more of a use case description.
Good point on planner cost. Drill uses Apache Calcite for planning. Calcite is
a monster: an interpreter, a rules engine, a generic SQL parser and analyzer.
Calcite is great for what it was designed for: very complex queries against
huge data sets, such as those which Hive queries. (Calcite is used by Hive for
its planning, where it replaced a home-grown planner.)
A simpler planner might be faster, but would have its limits. For example, from
my time on Impala, I learned that Impala's query planning is basically a sprint
from a SQL parse tree to a Thrift query plan with very little optimization other than
Parquet partition pruning. As a result, Impala planning is quick, but extremely
limited: the further your data strays from TPC-H, the worse the plans that
Impala produces. (I spent over a month trying to fix a really bad plan caused
by naive assumptions in the Impala planner.)
Last I heard (a year ago), Impala planned to abandon their home-grown planner
and move to Hive's Calcite-based planner (as part of merging Impala into Hive.)
Don't know how that is going or if it is still the plan. We can guess that
Impala will suffer the same planning overhead as Drill once that work is done.
By contrast, Presto completely rewrote their ad-hoc planner to create a dynamic
planner that can use costs, somewhat like the Calcite planner does, but
specific to Presto (that is, not based on Calcite.) I've not heard how well
Presto handles complex queries, or the quality of its plans. Anyone have
experience with this aspect of Presto?
The challenge is, each new planner (Cockroach DB, Hive when moving to Calcite,
Impala when moving to Hive/Calcite, Presto with their new planner) takes
multiple person-years of effort. Unfortunately, MapR did not have that kind of
time to invest during the MapR-DB project. The frantic, chaotic hacks that were
done could not overcome Calcite's fundamental design limitation of being very
heavy-weight. Of course, the major benefit of MapR-DB is secondary indexes,
which actually require a cost-based, rule-driven planner such as Calcite
because of the large number of potential plans to evaluate. There is no free
lunch. (Aman, who did much of the index-plan work, contributed it to Drill, so
it is available for anyone else with a similar data source.)
All that said, I agree with your point that Drill would clearly benefit from a
faster, simpler planner for the kinds of queries most people seem to do: a
simple query against one or two data sources, with no indexes, on a single
embedded Drillbit. If anyone knows of such a thing as an open source project,
it would be great to hear about it. We could use the "mini-planner" for simple
queries, but switch to Calcite for the heavy-weight queries where the extra
planning cost would be worthwhile. (This idea was tossed around during the
MapR-DB project, but as noted, there simply wasn't time to build a
mini-planner.)
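To make the mini-planner idea concrete, here is a rough sketch of what the dispatch might look like. All of the types here (Planner, PhysicalPlan, PlannerDispatch) are invented for illustration; nothing like this exists in Drill today, and the "is it simple?" heuristic is just a strawman:

```java
// Hypothetical sketch: route simple queries to a lightweight planner,
// everything else to Calcite. All types here are invented for illustration.
interface PhysicalPlan {}

interface Planner {
    PhysicalPlan plan(String sql);
}

class PlannerDispatch {
    private final Planner miniPlanner;
    private final Planner calcitePlanner;

    PlannerDispatch(Planner mini, Planner calcite) {
        this.miniPlanner = mini;
        this.calcitePlanner = calcite;
    }

    // Crude heuristic: a query qualifies for the mini-planner when it touches
    // at most two sources, uses no secondary indexes, and runs on a single
    // embedded Drillbit -- the common case described above.
    static boolean isSimple(int sourceCount, boolean usesIndexes, boolean embedded) {
        return sourceCount <= 2 && !usesIndexes && embedded;
    }

    PhysicalPlan plan(String sql, int sourceCount, boolean usesIndexes, boolean embedded) {
        Planner chosen = isSimple(sourceCount, usesIndexes, embedded)
                ? miniPlanner
                : calcitePlanner;
        return chosen.plan(sql);
    }
}
```

The hard part, of course, is not the dispatch but deciding the heuristic cheaply, before any real planning has happened.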
All this said, planning never has been (to the best of my recollection) the
bottleneck in a multi-user Drill environment. Yes, it slows each individual
query. But it is the run-time costs (CPU and memory contention) that tend to
become an issue as the number of concurrent queries increases. The team has
added some good basic throttling and queuing to handle intense usage spikes.
More can be done (see Teradata for what 40 years of tinkering can get you.)
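The basic shape of that throttling is simple admission control: cap the number of queries running at once, and let the rest wait in a queue rather than pile onto memory and CPU. A minimal sketch, assuming a fixed concurrency limit and a queue timeout (this is illustrative only, not Drill's actual throttling code):

```java
// Illustrative admission control: at most maxConcurrent queries run at once;
// later arrivals queue (FIFO) and are rejected if they wait too long.
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

class QueryGate {
    private final Semaphore slots;
    private final long maxWaitMs;

    QueryGate(int maxConcurrent, long maxWaitMs) {
        this.slots = new Semaphore(maxConcurrent, true); // fair = FIFO queueing
        this.maxWaitMs = maxWaitMs;
    }

    // Returns true if the query was admitted; false if it timed out in the
    // queue and should be rejected (or retried by the client).
    boolean admit() throws InterruptedException {
        return slots.tryAcquire(maxWaitMs, TimeUnit.MILLISECONDS);
    }

    // Called when the query finishes, freeing a slot for a queued query.
    void release() {
        slots.release();
    }
}
```

Real systems also size the limit by expected per-query memory, which is where the memory configuration mentioned below comes in.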
In fact, in this day of K8s, an emerging new design is to run multiple
clusters, each reading data from S3, etc. (So-called "separation of compute and
storage.") In this model, each cluster is (dynamically) sized for a certain
workload; much simpler than the old-school model of a single Drill cluster
handling, say, both TB-sized ETL jobs and sub-second MapR-DB queries. It seems
that Snowflake uses this model. I'm looking forward to trying out Abhishek's
work with K8s to see what we can do.
Thanks,
- Paul
On Wednesday, April 8, 2020, 12:24:12 PM PDT, Ted Dunning
<[email protected]> wrote:
Another thing that users will see when they start trying to use Drill for
concurrent queries is that Drill assumes that it is OK to spend quite a bit
of time optimizing a query before running it. Taking 500 ms to optimize the
query can be a really bad trade-off if your query only takes 100ms to run.
It is possible to tune this very differently, but that exercise is
definitely not a task for a user (or even a less-than-advanced developer).
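The arithmetic behind that trade-off is worth spelling out: whenever plan time exceeds run time, planning dominates end-to-end latency. A trivial illustration using the numbers above (nothing Drill-specific):

```java
// Back-of-envelope: fraction of end-to-end latency spent planning.
// With a 500 ms plan and a 100 ms run, that is 500 / (500 + 100), about 83%.
public class PlanningOverhead {
    static double planningFraction(double planMs, double runMs) {
        return planMs / (planMs + runMs);
    }

    public static void main(String[] args) {
        System.out.printf("%.0f%% of latency spent planning%n",
                100 * planningFraction(500, 100));
    }
}
```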
In the MapR connection between the OJAI API to MapR DB, for instance, the
clear assumption is that queries will be relatively simple and all that
really needs to be done is look for good join ordering and make sure that
secondary indexes are used reasonably well. This meant that retuning for
fast optimization was very worthwhile.
A similar thing was done by Alibaba in their time series query engine.
There, the primary data source is a variant of OpenTSDB and query costs
are dominated by the primary facts (the time series itself). Tuning the
optimizer to not think too much is a good thing.
So, could you say more about your workload so that the Drill community can
say more about what Drill will (or won't) do for you?
On Wed, Apr 8, 2020 at 12:02 PM Paul Rogers <[email protected]>
wrote:
> Hi Ramasamy,
>
> Let's define some terms. By "parallel requests" do you mean multiple
> people submitting queries at the same time? If so, then Drill handles this
> just fine: Drill is designed to run multiple queries from multiple users
> concurrently.
>
> There is a caveat. Many people run Drill in embedded mode when they get
> started. Embedded mode is a single user, single-machine setup that is great
> for testing Drill, exploring small data sets and so on. However, to support
> multiple concurrent queries, the proper way to run Drill is as a service,
> preferably across multiple machines. Further, if you are running a cluster
> of two or more machines, you need some kind of distributed file system: S3,
> Hadoop, etc.
>
>
> Once you start running concurrent queries, memory becomes an important
> consideration, especially if your JSON files are large and you are doing
> memory-intensive operations such as sorting and joins. The Drill
> documentation explains the correct configuration steps.
>
> Thanks,
> - Paul
>
>
>
> On Wednesday, April 8, 2020, 11:00:14 AM PDT, Ramasamy Javakar <
> [email protected]> wrote:
>
> Hi, I did an analytics web application on drill, data set in json file.
> We
> are facing issues while getting multiple parallel requests. Does Apache
> Drill support concurrent requests? Please let me know
>
>
> Thanks & Regards
> Ramasamy
>
> Product Manager
> EzeeInfo Cloud Solutions
> +91 95000 07269
>