Greetings,
I have two servers. The first server (A) contains the zookeeper, a mongodb
database and a drillbit. The second server (B) contains a hadoop
distribution with several hive tables, a postgresql database and the other
drillbit. Both drillbits can see eachother on the drill homepages, since
they are both connected to the zookeeper on server A. When a query like the
one on the bottom is run (the view dfs.lineorder consists of the hive and
postgresql data on server B), drillbit B becomes the foreman and executes
everything. It does not delegate the parts dealing with mongodb tables to
the drillbit on server A, so 90% of the time required to run the query
(approx. 30 minutes) is spent on sending the mongodb collection from server
A to server B. Is there a way to force drillbit B to delegate the mongodb
part of the query to drillbit B, or could this be a misconfiguration
problem?
Also, can drillbits be configured to access different databases (e.g. when
each drillbit has access to a distinct subnet, so that not all drillbits
can access the same databases)?
Thank you.
Sample query:
SELECT SUM(revenue) AS revenue
FROM (
SELECT SUM(lo_extendedprice*lo_discount) AS revenue
FROM dfs.tmp.lineorder, dfs.tmp.`date`
WHERE lo_orderdate = d_datekey
AND d_year = 1993
AND lo_discount BETWEEN 1 AND 3
AND lo_quantity < 25
UNION ALL
SELECT SUM(lo_extendedprice * lo_discount) AS revenue
FROM mongo.test.ssb_europe ssb
WHERE ssb.orderdate.d_year = 1993
AND lo_discount BETWEEN 1 AND 3
AND lo_quantity < 25
);
Configuration (drill-override.conf):
On drillbit A (Windows Server 2008):
drill.exec: {
cluster-id: "drillbits1",
zk.connect: "serverA:2181",
impersonation: {
enabled: true,
max_chained_user_hops: 3
}
}
On drillbit B (Cloudera CDH 5.8.0):
drill.exec: {
cluster-id: "drillbits1",
zk.connect: "serverA:2181",
impersonation: {
enabled: true,
max_chained_user_hops: 3
}
}
Best regards,
Ivan