RE: Strange join table criteria in apache drill

Kunal Khatua Tue, 23 Apr 2019 21:53:26 -0700

(Adding this to the user mailing list for future reference and see if anyone 
else has alternative solutions)


Hi Charles

What you pointed out is the same issue as the mail I shared regarding the GROUP 
BY clause. There seems to be a limitation in the Calcite planner w.r.t. the 
JDBC storage plugin. I'm actually wondering how come you do not have the need 
for table aliases when defining the predicate. Perhaps that *might* resolve the 
issue of the filter being pushed down to the source.

It seems that the GROUP BY (or in your case, the WHERE) clause is being pushed 
down to the source only for the first source.

This isn't a bug in the sense that you are getting wrong results, but is 
definitely a performance bug. As suggested in that thread, one option would be 
to identify which source will be smaller to stream in the absence of the 
filter, and make that the second source in the union query.

An alternate would be to make this a two-step (three actually) query, where you 
create a temporary table with one of the tables (say, Header) and then run the 
actual query modified to do a union of this temporary table and the other JDBC 
source - (model2train). Of course, the third step would be to have this temp 
table dropped for the next run. You'll need a bit of experimentation, but my 
hunch is that you want to use the faster of the 2 subqueries to create the 
temporary table. 

Hope this helps. 

~ Kunal

On 4/23/2019 7:13:57 PM, Charles C X CHEN <[email protected]> wrote:
Hi Kunal,
 
For your example, it is using group by. We are using union all to join the two 
table. However, we find the data cannot be filtered in the data source side. 
Please look into detail in the mail content below. Thx
 
Below is our PoC steps.
 
1)       Catalog the bigquery to apache drill
2)       Catch the oracle db to the apache drill
3)       Run a join table statement and monitor the log
 
We issue the sql select * from oracle_sit.etx01da1.ASP_PMT_HEADER where feature 
='GBP' union all select * from 
bigquery.`hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` where feature 
='GBP'. Both the two database has a selection criteria (feature = ‘GBP’). 
However, just the first data source can put the selection criteria “FEATURE" = 
'GBP' into the sql statement and drill will retrieve all the table of the table 
from the second data source. This will impact the performance much as it will 
make more data to be transferred from remote data source to our drill node. 
Please help to advise the cause. thx
 
SELECT * FROM "ETX01DA1"."ASP_PMT_HEADER" WHERE "FEATURE" = 'GBP'
 
SELECT * FROM `hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train`
 
2019-04-23 11:00:40,284 [23411dbf-7fab-c825-e2c2-6bffc88c00ef:foreman] DEBUG 
o.a.d.e.p.s.h.DefaultSqlHandler - HEP_BOTTOM_UP:Physical Partition Prune 
Planning (1ms):
ScreenPrel: rowcount = 30.0, cumulative cost = {308.0 rows, 938.0 cpu, 0.0 io, 
368640.0 network, 0.0 memory}, id = 862
  ProjectPrel(FEATURE=[$0], TITLE=[$1], LABEL=[$2]): rowcount = 30.0, 
cumulative cost = {305.0 rows, 935.0 cpu, 0.0 io, 368640.0 network, 0.0 
memory}, id = 860
    UnionExchangePrel: rowcount = 30.0, cumulative cost = {275.0 rows, 845.0 
cpu, 0.0 io, 368640.0 network, 0.0 memory}, id = 858
      UnionAllPrel(all=[true]): rowcount = 30.0, cumulative cost = {245.0 rows, 
605.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 856
        JdbcPrel(sql=[SELECT * FROM "ETX01DA1"."ASP_PMT_HEADER" WHERE "FEATURE" 
= 'GBP' ]): rowcount = 15.0, cumulative cost = {15.0 rows, 15.0 cpu, 0.0 io, 
0.0 network, 0.0 memory}, id = 825
        FilterPrel(condition=[=($0, 'GBP')]): rowcount = 15.0, cumulative cost 
= {200.0 rows, 560.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 854
          JdbcPrel(sql=[SELECT * FROM 
`hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` ]): rowcount = 100.0, 
cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id 
= 828
 
2
 9 rows selected (88.054 seconds)
0: jdbc:drill:zk=local> select * from oracle_sit.etx01da1.ASP_PMT_HEADER where 
feature ='GBP' union all select * from 
bigquery.`hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` where feature 
='GBP';
+----------+----------+--------+
| FEATURE  |  TITLE   | LABEL  |
+----------+----------+--------+
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
+----------+----------+--------+
9 rows selected (10.043 seconds)
 
 
Regards,


Charles C X Chen
 
Senior Consultant Specialist, Development
Payments and Global Liquidity and Cash Management (GLCM) | HSBC Operations, 
Services and Technology (HOST)
L22, Office Tower 2, Taikoo Hui, 381 Tianhe Road, Tianhe District, Guangzhou, 
China
___________________________________________________________________________________


Phone.   8620 38589062 Tie-line 718630-89062
Email.   [email protected] [mailto:[email protected]] 
___________________________________________________________________________________
 
From: [email protected] [mailto:[email protected]]
Sent: Wednesday, April 24, 2019 1:41 AM
To: Charles C X CHEN <[email protected]>
Subject: RE: Strange join table criteria in apache drill
 
Hi Charles
That is a current Calcite limitation that would need enhancement.
Here is a similar issue:
https://lists.apache.org/thread.html/f21839ec2c95f7e93980a9d2fafa89adb13998b2303517138f06b4f8@%3Cuser.drill.apache.org%3E
 
[https://lists.apache.org/thread.html/f21839ec2c95f7e93980a9d2fafa89adb13998b2303517138f06b4f8@%3Cuser.drill.apache.org%3E]
In the future, please send an unencrypted mail to the mailing list because 
questions like this help other users discover similar issues and solutions. It 
is very unlikely that someone would open these mails from unknown senders even 
if the domains look genuine.
~ Kunal
 
From: [email protected] [mailto:[email protected]]
Sent: Tue Apr 23, 2019 3:29 AM
To: [email protected] [mailto:[email protected]], 
[email protected] [mailto:[email protected]]
Cc: [email protected] [mailto:[email protected]], 
[email protected] [mailto:[email protected]], 
[email protected] [mailto:[email protected]]
Subject: Strange join table criteria in apache drill
Hi support / kunal,
 
We are exploring if can use apache drill as the data virtualization tool. 
However, we find there is some issue in handling the join table sql. From the 
processing log, just the first table can put the filtering criteria into the 
sql statement to the database. Kindly please help to check if anything wrong in 
our configuration and how to improve it. Thx
 
Below is our PoC steps.
 
1)       Catalog the bigquery to apache drill
2)       Catch the oracle db to the apache drill
3)       Run a join table statement and monitor the log
 
We issue the sql select * from oracle_sit.etx01da1.ASP_PMT_HEADER where feature 
='GBP' union all select * from 
bigquery.`hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` where feature 
='GBP'. Both the two database has a selection criteria (feature = ‘GBP’). 
However, just the first data source can put the selection criteria “FEATURE" = 
'GBP' into the sql statement and drill will retrieve all the table of the table 
from the second data source. This will impact the performance much as it will 
make more data to be transferred from remote data source to our drill node. 
Please help to advise the cause. thx
 
SELECT * FROM "ETX01DA1"."ASP_PMT_HEADER" WHERE "FEATURE" = 'GBP'
 
SELECT * FROM `hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train`
 
2019-04-23 11:00:40,284 [23411dbf-7fab-c825-e2c2-6bffc88c00ef:foreman] DEBUG 
o.a.d.e.p.s.h.DefaultSqlHandler - HEP_BOTTOM_UP:Physical Partition Prune 
Planning (1ms):
ScreenPrel: rowcount = 30.0, cumulative cost = {308.0 rows, 938.0 cpu, 0.0 io, 
368640.0 network, 0.0 memory}, id = 862
  ProjectPrel(FEATURE=[$0], TITLE=[$1], LABEL=[$2]): rowcount = 30.0, 
cumulative cost = {305.0 rows, 935.0 cpu, 0.0 io, 368640.0 network, 0.0 
memory}, id = 860
    UnionExchangePrel: rowcount = 30.0, cumulative cost = {275.0 rows, 845.0 
cpu, 0.0 io, 368640.0 network, 0.0 memory}, id = 858
      UnionAllPrel(all=[true]): rowcount = 30.0, cumulative cost = {245.0 rows, 
605.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 856
        JdbcPrel(sql=[SELECT * FROM "ETX01DA1"."ASP_PMT_HEADER" WHERE "FEATURE" 
= 'GBP' ]): rowcount = 15.0, cumulative cost = {15.0 rows, 15.0 cpu, 0.0 io, 
0.0 network, 0.0 memory}, id = 825
        FilterPrel(condition=[=($0, 'GBP')]): rowcount = 15.0, cumulative cost 
= {200.0 rows, 560.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 854
          JdbcPrel(sql=[SELECT * FROM 
`hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` ]): rowcount = 100.0, 
cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id 
= 828
 
2
 9 rows selected (88.054 seconds)
0: jdbc:drill:zk=local> select * from oracle_sit.etx01da1.ASP_PMT_HEADER where 
feature ='GBP' union all select * from 
bigquery.`hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` where feature 
='GBP';
+----------+----------+--------+
| FEATURE  |  TITLE   | LABEL  |
+----------+----------+--------+
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
+----------+----------+--------+
9 rows selected (10.043 seconds)
 
Regards,
Charles C X Chen
 
Senior Consultant Specialist, Development
Payments and Global Liquidity and Cash Management (GLCM) | HSBC Operations, 
Services and Technology (HOST)
L22, Office Tower 2, Taikoo Hui, 381 Tianhe Road, Tianhe District, Guangzhou, 
China
___________________________________________________________________________________
Phone.   8620 38589062 Tie-line 718630-89062
Email.   [email protected] [mailto:[email protected]] 
___________________________________________________________________________________
 
 
 
*******************************************************************
This e-mail is confidential. It may also be legally privileged.
If you are not the addressee you may not copy, forward, disclose
or use any part of it. If you have received this message in error,
please delete it and all copies from your system and notify the
sender immediately by return e-mail.

Internet communications cannot be guaranteed to be timely,
secure, error or virus-free. The sender does not accept liability
for any errors or omissions.
*******************************************************************
"SAVE PAPER - THINK BEFORE YOU PRINT!"
******************************************************************
This E-mail is confidential. It may also be legally privileged. If
you are not the addressee you may not copy, forward, disclose or
use any part of it. If you have received this message in error,
please delete it and all copies from your system and notify the
sender immediately by return E-mail.

Internet communications cannot be guaranteed to be timely, secure,
error or virus-free. The sender does not accept liability for any
errors or omissions.
******************************************************************
SAVE PAPER - THINK BEFORE YOU PRINT!
*******************************************************************
This e-mail is confidential. It may also be legally privileged.
If you are not the addressee you may not copy, forward, disclose
or use any part of it. If you have received this message in error,
please delete it and all copies from your system and notify the
sender immediately by return e-mail.

Internet communications cannot be guaranteed to be timely,
secure, error or virus-free. The sender does not accept liability
for any errors or omissions.
*******************************************************************
"SAVE PAPER - THINK BEFORE YOU PRINT!"

RE: Strange join table criteria in apache drill

Reply via email to