Hi Ted,
I echo the question about workload: I started with the simplest possible
explanation, hoping that would spur a bit more of a use case description.
Good point on planner cost. Drill uses Apache Calcite for planning. Calcite is
a monster: an interpreter, a rules engine, a generic SQL parser and analyzer.
Calcite is great for what it was designed for: very complex queries against
huge data sets, such as those which Hive queries. (Calcite is used by Hive for
its planning, where it replaced a home-grown planner.)
A simpler planner might be faster, but would have its limits. For example, from
my time on Impala, I learned that Impala's query planning is basically a sprint
from a SQL parse tree to a Thrift query plan with very little optimization other than
Parquet partition pruning. As a result, Impala planning is quick, but extremely
limited: the further your data strays from TPC-H, the worse the plans that
Impala produces. (I spent over a month trying to fix a really bad plan caused
by naive assumptions in the Impala planner.)
Last I heard (a year ago), Impala planned to abandon their home-grown planner
and move to Hive's Calcite-based planner (as part of merging Impala into Hive.)
Don't know how that is going or if it is still the plan. We can guess that
Impala will suffer the same planning overhead as Drill once that work is done.
By contrast, Presto completely rewrote their ad-hoc planner to create a dynamic
planner that can use costs, somewhat like the Calcite planner does, but
specific to Presto (that is, not based on Calcite.) I've not heard how well
Presto handles complex queries, or the quality of its plans. Anyone have
experience with this aspect of Presto?
The challenge is, each new planner (Cockroach DB, Hive when moving to Calcite,
Impala when moving to Hive/Calcite, Presto with their new planner) takes
multiple person-years of effort. Unfortunately, MapR did not have that kind of
time to invest during the MapR-DB project. The frantic, chaotic hacks that were
done could not overcome Calcite's fundamental design limitation of being very
heavy-weight. Of course, the major benefit of MapR-DB is secondary indexes,
which actually require a cost-based, rule-driven planner such as Calcite
because of the large number of potential plans to evaluate. There is no free
lunch. (Aman, who did much of the index-plan work, contributed it to Drill, so
it is available for anyone else with a similar data source.)
All that said, I agree with your point that Drill would clearly benefit from a
faster, simpler planner for the kinds of queries most people seem to do: a
simple query against one or two data sources, with no indexes, on a single
embedded Drillbit. If anyone knows of such a thing as an open source project,
it would be great to hear about it. We could use the "mini-planner" for simple
queries, but switch to Calcite for the heavy-weight queries where the extra
planning cost would be worthwhile. (This idea was tossed around during the
MapR-DB project, but as noted, there simply wasn't time to build a
mini-planner.)
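To make the mini-planner idea concrete, here is a rough sketch of what the dispatch might look like. All of the types here (Planner, PhysicalPlan, PlannerDispatch) are invented for illustration; nothing like this exists in Drill today, and the "is it simple?" heuristic is just a strawman:

```java
// Hypothetical sketch: route simple queries to a lightweight planner,
// everything else to Calcite. All types here are invented for illustration.
interface PhysicalPlan {}

interface Planner {
    PhysicalPlan plan(String sql);
}

class PlannerDispatch {
    private final Planner miniPlanner;
    private final Planner calcitePlanner;

    PlannerDispatch(Planner mini, Planner calcite) {
        this.miniPlanner = mini;
        this.calcitePlanner = calcite;
    }

    // Crude heuristic: a query qualifies for the mini-planner when it touches
    // at most two sources, uses no secondary indexes, and runs on a single
    // embedded Drillbit -- the common case described above.
    static boolean isSimple(int sourceCount, boolean usesIndexes, boolean embedded) {
        return sourceCount <= 2 && !usesIndexes && embedded;
    }

    PhysicalPlan plan(String sql, int sourceCount, boolean usesIndexes, boolean embedded) {
        Planner chosen = isSimple(sourceCount, usesIndexes, embedded)
                ? miniPlanner
                : calcitePlanner;
        return chosen.plan(sql);
    }
}
```

The hard part, of course, is not the dispatch but deciding the heuristic cheaply, before any real planning has happened.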
All this said, planning never has been (to the best of my recollection) the
bottleneck in a multi-user Drill environment. Yes, it slows each individual
query. But it is the run-time costs (CPU and memory contention) that tend to
become an issue as the number of concurrent queries increases. The team has
added some good basic throttling and queuing to handle intense usage spikes.
More can be done (see Teradata for what 40 years of tinkering can get you.)
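The basic shape of that throttling is simple admission control: cap the number of queries running at once, and let the rest wait in a queue rather than pile onto memory and CPU. A minimal sketch, assuming a fixed concurrency limit and a queue timeout (this is illustrative only, not Drill's actual throttling code):

```java
// Illustrative admission control: at most maxConcurrent queries run at once;
// later arrivals queue (FIFO) and are rejected if they wait too long.
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

class QueryGate {
    private final Semaphore slots;
    private final long maxWaitMs;

    QueryGate(int maxConcurrent, long maxWaitMs) {
        this.slots = new Semaphore(maxConcurrent, true); // fair = FIFO queueing
        this.maxWaitMs = maxWaitMs;
    }

    // Returns true if the query was admitted; false if it timed out in the
    // queue and should be rejected (or retried by the client).
    boolean admit() throws InterruptedException {
        return slots.tryAcquire(maxWaitMs, TimeUnit.MILLISECONDS);
    }

    // Called when the query finishes, freeing a slot for a queued query.
    void release() {
        slots.release();
    }
}
```

Real systems also size the limit by expected per-query memory, which is where the memory configuration mentioned below comes in.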
In fact, in this day of K8s, an emerging new design is to run multiple
clusters, each reading data from S3, etc. (So-called "separation of compute and
storage.") In this model, each cluster is (dynamically) sized for a certain
workload; much simpler than the old-school model of a single Drill cluster
handling, say, both TB-sized ETL jobs and sub-second MapR-DB queries. It seems
that Snowflake uses this model. I'm looking forward to trying out Abhishek's
work with K8s to see what we can do.
Thanks,
- Paul
On Wednesday, April 8, 2020, 12:24:12 PM PDT, Ted Dunning
<[email protected]> wrote:
Another thing that users will see when they start trying to use Drill for
concurrent queries is that Drill assumes that it is OK to spend quite a bit
of time optimizing a query before running it. Taking 500 ms to optimize the
query can be a really bad trade-off if your query only takes 100ms to run.
It is possible to tune this very differently, but that exercise is
definitely not a task for a user (or even a less-than-advanced developer).
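The arithmetic behind that trade-off is worth spelling out: whenever plan time exceeds run time, planning dominates end-to-end latency. A trivial illustration using the numbers above (nothing Drill-specific):

```java
// Back-of-envelope: fraction of end-to-end latency spent planning.
// With a 500 ms plan and a 100 ms run, that is 500 / (500 + 100), about 83%.
public class PlanningOverhead {
    static double planningFraction(double planMs, double runMs) {
        return planMs / (planMs + runMs);
    }

    public static void main(String[] args) {
        System.out.printf("%.0f%% of latency spent planning%n",
                100 * planningFraction(500, 100));
    }
}
```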
In the MapR connection between the OJAI API to MapR DB, for instance, the
clear assumption is that queries will be relatively simple and all that
really needs to be done is look for good join ordering and make sure that
secondary indexes are used reasonably well. This meant that retuning for
fast optimization was very worthwhile.
A similar thing was done by Alibaba in their time series query engine.
There, the primary data source is a variant of OpenTSDB and query costs
are dominated by the primary facts (the time series itself). Tuning the
optimizer to not think too much is a good thing.
So, could you say more about your workload so that the Drill community can
say more about what Drill will (or won't) do for you?
On Wed, Apr 8, 2020 at 12:02 PM Paul Rogers <[email protected]>
wrote:
> Hi Ramasamy,
>
> Let's define some terms. By "parallel requests" do you mean multiple
> people submitting queries at the same time? If so, then Drill handles this
> just fine: Drill is designed to run multiple queries from multiple users
> concurrently.
>
> There is a caveat. Many people run Drill in embedded mode when they get
> started. Embedded mode is a single user, single-machine setup that is great
> for testing Drill, exploring small data sets and so on. However, to support
> multiple concurrent queries, the proper way to run Drill is as a service,
> preferably across multiple machines. Further, if you are running a cluster
> of two or more machines, you need some kind of distributed file system: S3,
> Hadoop, etc.
>
>
> Once you start running concurrent queries, memory becomes an important
> consideration, especially if your JSON files are large and you are doing
> memory-intensive operations such as sorting and joins. The Drill
> documentation explains the correct configuration steps.
>
> Thanks,
> - Paul
>
>
>
> On Wednesday, April 8, 2020, 11:00:14 AM PDT, Ramasamy Javakar <
> [email protected]> wrote:
>
> Hi, I did an analytics web application on drill, data set in json file.
> We
> are facing issues while getting multiple parallel requests. Does Apache
> Drill support concurrent requests? Please let me know
>
>
> Thanks & Regards
> Ramasamy
>
> Product Manager
> EzeeInfo Cloud Solutions
> +91 95000 07269
>