Hi Lokendra,
Your use case is a typical old-school sharded DB app. The design itself is fine.
However, as Tim noted, Drill is not designed for this case. Still, perhaps
Drill could be extended.
As Tim suggested, Drill assumes any Drillbit can operate in any role. So, in
your setup, you would run Drillbits on all your shard storage-nodes. Drill
would schedule reads (more on this shortly) on those nodes. Then, Drill would
do shuffles to other nodes to perform query operations.
In this model, one of your nodes acts as the Foreman for a given user. ZooKeeper
(ZK) tracks all nodes, and each user randomly chooses a Drillbit to act as
Foreman, which means Foreman load is shared across all your Drillbits.
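To make the random-Foreman idea concrete, here is a minimal sketch (not real
Drill code; the endpoint strings and method names are made up) of what a client
does with the Drillbit list it reads from ZK:

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class ForemanPick {
    // Hypothetical: the endpoint list a client would read from the
    // cluster's ZooKeeper registration path. In real Drill the client
    // library does this for you; this just illustrates the idea that
    // every registered Drillbit is an equally valid Foreman.
    static String pickForeman(List<String> drillbits) {
        return drillbits.get(ThreadLocalRandom.current().nextInt(drillbits.size()));
    }

    public static void main(String[] args) {
        List<String> bits = List.of("node-a:31010", "node-b:31010", "node-c:31010");
        String foreman = pickForeman(bits);
        // The chosen Foreman is always one of the registered Drillbits.
        System.out.println(bits.contains(foreman)); // prints true
    }
}
```

Because each user session makes this choice independently, Foreman work spreads
evenly across the cluster without any central scheduler.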
Suppose you wanted to change this. You'd have to extend the way that Drillbits
register themselves in ZK. A Drillbit, when it starts, would be assigned one or
more roles which it would advertise in ZK. The distribution mechanisms in the
Planner would have to be aware of scan-only nodes, compute-only nodes, and
Foreman-only nodes.
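As a rough sketch of what that extension might look like (none of this exists
in Drill today; the role names and the planner check are hypothetical), each
Drillbit would advertise a role set in its ZK registration, and the planner
would filter candidate endpoints by the role a fragment needs:

```java
import java.util.EnumSet;
import java.util.Set;

public class RoleSketch {
    // Hypothetical roles a Drillbit could advertise in ZooKeeper.
    // Drill today has no such concept: every Drillbit plays every role.
    enum DrillbitRole { SCAN, COMPUTE, FOREMAN }

    // A role-aware planner would only assign a fragment to a node
    // that advertises the role the fragment requires.
    static boolean canAssign(Set<DrillbitRole> advertised, DrillbitRole required) {
        return advertised.contains(required);
    }

    public static void main(String[] args) {
        Set<DrillbitRole> storageNode = EnumSet.of(DrillbitRole.SCAN);
        Set<DrillbitRole> computeNode = EnumSet.of(DrillbitRole.COMPUTE,
                                                   DrillbitRole.FOREMAN);

        System.out.println(canAssign(storageNode, DrillbitRole.SCAN));    // true
        System.out.println(canAssign(storageNode, DrillbitRole.FOREMAN)); // false
        System.out.println(canAssign(computeNode, DrillbitRole.FOREMAN)); // true
    }
}
```

The hard part is not the check itself but threading role awareness through the
parallelizer so that scan, exchange, and root fragments each land on the right
subset of nodes.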
Unless you plan to put heavy load on your scan nodes, it is not clear what
benefit you'd gain from forcing Drill into a particular distribution model.
Perhaps you can start by running Drill on just your storage nodes and measuring
performance.
One final point. Drill today knows how to use HDFS block locations to work out
data locality for scans. You'd need to modify this to plug in your own
data-distribution mechanism so that Drill knows which shards to scan on which
nodes. I don't
believe Drill has a plugin-API for this, but I could be wrong. If not, this
would be a great opportunity to define such an API.
Such an API might be helpful for other storage plugins such as Kafka so that
scans are done on nodes with data.
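To illustrate what such a locality API would need to expose, here is a toy
sketch (all names are illustrative, not real Drill APIs): given a shard-to-host
map from your sharding layer, compute per-host affinity weights the planner
could use to place scan fragments:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShardAffinitySketch {
    // Hypothetical: your sharding layer knows which host stores each
    // shard. A locality plugin API would surface this to the planner,
    // much as HDFS block locations are surfaced today.
    static Map<String, Double> affinityByHost(Map<String, String> shardToHost,
                                              List<String> shardsToScan) {
        Map<String, Double> affinity = new HashMap<>();
        double weight = 1.0 / shardsToScan.size(); // equal work per shard
        for (String shard : shardsToScan) {
            // Accumulate weight on the host that stores this shard.
            affinity.merge(shardToHost.get(shard), weight, Double::sum);
        }
        return affinity;
    }

    public static void main(String[] args) {
        Map<String, String> shardToHost = new HashMap<>();
        shardToHost.put("shard-0", "node-a");
        shardToHost.put("shard-1", "node-a");
        shardToHost.put("shard-2", "node-b");

        // The planner would prefer scheduling scan fragments on the
        // hosts with the highest affinity for the shards a query touches.
        System.out.println(affinityByHost(shardToHost,
                List.of("shard-0", "shard-1", "shard-2")));
    }
}
```

With node-a holding two of the three shards, it ends up with twice node-b's
affinity, so two-thirds of the scan work would be scheduled there.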
Thanks,
- Paul
On Tuesday, November 13, 2018, 5:32:32 PM PST, Lokendra Singh Panwar
wrote:
Hi Tim,
Thanks for the reply.
My use case is the following:
- My main DB table is huge, so it is sharded among multiple
storage-nodes.
- Each storage-node stores the assigned shard in a local relational
DB engine.
I was planning to use Drill as a distributed query engine that can
scatter-gather data from these storage-nodes.
So, my overall plan for such architecture, as per my limited understanding
of Drill so far, is:
- Have a Drillbit instance run on each storage-node; this fleet will
act as a leaf-worker fleet.
- (I will write a Storage Plugin to transform data from my local
relational DB engine to the Drill record format.)
- Maintain another fleet that will serve as Foreman and intermediate
query workers, still part of the same Drill cluster.
- The reason I intend to keep the leaf-query fleet (storage-nodes)
segregated from the Foreman/intermediate workers (which work on major
fragments) is:
- storage-nodes (acting as leaf-workers) are a premium commodity in
my cluster, serving data-ingestion as well as query traffic in their
leaf-worker role.
- So, I do not intend to overload them further with the intermediate
query-fragment processing and aggregation that the Foreman and
intermediate pool of workers perform.
Does the above make sense?
Thanks,
Lokendra
On Tue, Nov 13, 2018 at 4:17 PM Timothy Farkas wrote:
> Hi Lokendra,
>
> All Drillbits can function as a foreman if a query is sent to them, and all
> drillbits are considered worker nodes. This is ingrained deeply in the
> design of Drill and it was done with the intention of making Drill
> symmetric. Symmetric here means that each Drillbit is identical to all the
> others. Making this change would be a significant design change.
>
> Why are you interested in running Drill in this way? Do you have a specific
> use case in mind?
>
> Thanks,
> Tim
>
> On Tue, Nov 13, 2018 at 3:37 PM Lokendra Singh Panwar <
> lokendra...@gmail.com>
> wrote:
>
> > Hi,
> >
> > Is it possible to configure Drill such that the Foreman and leaf-worker
> > fleets are separate fleets of nodes?
> > Or if this needs changing the source of Drill, any pointers are
> appreciated
> > too.
> >
> > Thanks,
> > Lokendra
> >
>