(Adding this to the user mailing list for future reference and see if anyone else has alternative solutions)
Hi Charles What you pointed out is the same issue as the mail I shared regarding the GROUP BY clause. There seems to be a limitation in the Calcite planner w.r.t. the JDBC storage plugin. I'm actually wondering how come you do not have the need for table aliases when defining the predicate. Perhaps that *might* resolve the issue of the filter being pushed down to the source. It seems that the GROUP BY (or in your case, the WHERE) clause is being pushed down to the source only for the first source. This isn't a bug in the sense that you are getting wrong results, but is definitely a performance bug. As suggested in that thread, one option would be to identify which source will be smaller to stream in the absence of the filter, and make that the second source in the union query. An alternate would be to make this a two-step (three actually) query, where you create a temporary table with one of the tables (say, Header) and then run the actual query modified to do a union of this temporary table and the other JDBC source - (model2train). Of course, the third step would be to have this temp table dropped for the next run. You'll need a bit of experimentation, but my hunch is that you want to use the faster of the 2 subqueries to create the temporary table. Hope this helps. ~ Kunal On 4/23/2019 7:13:57 PM, Charles C X CHEN <[email protected]> wrote: Hi Kunal, For your example, it is using group by. We are using union all to join the two table. However, we find the data cannot be filtered in the data source side. Please look into detail in the mail content below. Thx Below is our PoC steps. 1) Catalog the bigquery to apache drill 2) Catch the oracle db to the apache drill 3) Run a join table statement and monitor the log We issue the sql select * from oracle_sit.etx01da1.ASP_PMT_HEADER where feature ='GBP' union all select * from bigquery.`hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` where feature ='GBP'. Both the two database has a selection criteria (feature = ‘GBP’). However, just the first data source can put the selection criteria “FEATURE" = 'GBP' into the sql statement and drill will retrieve all the table of the table from the second data source. This will impact the performance much as it will make more data to be transferred from remote data source to our drill node. Please help to advise the cause. thx SELECT * FROM "ETX01DA1"."ASP_PMT_HEADER" WHERE "FEATURE" = 'GBP' SELECT * FROM `hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` 2019-04-23 11:00:40,284 [23411dbf-7fab-c825-e2c2-6bffc88c00ef:foreman] DEBUG o.a.d.e.p.s.h.DefaultSqlHandler - HEP_BOTTOM_UP:Physical Partition Prune Planning (1ms): ScreenPrel: rowcount = 30.0, cumulative cost = {308.0 rows, 938.0 cpu, 0.0 io, 368640.0 network, 0.0 memory}, id = 862 ProjectPrel(FEATURE=[$0], TITLE=[$1], LABEL=[$2]): rowcount = 30.0, cumulative cost = {305.0 rows, 935.0 cpu, 0.0 io, 368640.0 network, 0.0 memory}, id = 860 UnionExchangePrel: rowcount = 30.0, cumulative cost = {275.0 rows, 845.0 cpu, 0.0 io, 368640.0 network, 0.0 memory}, id = 858 UnionAllPrel(all=[true]): rowcount = 30.0, cumulative cost = {245.0 rows, 605.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 856 JdbcPrel(sql=[SELECT * FROM "ETX01DA1"."ASP_PMT_HEADER" WHERE "FEATURE" = 'GBP' ]): rowcount = 15.0, cumulative cost = {15.0 rows, 15.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 825 FilterPrel(condition=[=($0, 'GBP')]): rowcount = 15.0, cumulative cost = {200.0 rows, 560.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 854 JdbcPrel(sql=[SELECT * FROM `hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` ]): rowcount = 100.0, cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 828 2 9 rows selected (88.054 seconds) 0: jdbc:drill:zk=local> select * from oracle_sit.etx01da1.ASP_PMT_HEADER where feature ='GBP' union all select * from bigquery.`hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` where feature ='GBP'; +----------+----------+--------+ | FEATURE | TITLE | LABEL | +----------+----------+--------+ | GBP | GBP#GBP | 3.0 | | GBP | GBP#GBP | 3.0 | | GBP | GBP#GBP | 3.0 | | GBP | GBP#GBP | 3.0 | | GBP | GBP#GBP | 3.0 | | GBP | GBP#GBP | 3.0 | | GBP | GBP#GBP | 3.0 | | GBP | GBP#GBP | 3.0 | | GBP | GBP#GBP | 3.0 | +----------+----------+--------+ 9 rows selected (10.043 seconds) Regards, Charles C X Chen Senior Consultant Specialist, Development Payments and Global Liquidity and Cash Management (GLCM) | HSBC Operations, Services and Technology (HOST) L22, Office Tower 2, Taikoo Hui, 381 Tianhe Road, Tianhe District, Guangzhou, China ___________________________________________________________________________________ Phone. 8620 38589062 Tie-line 718630-89062 Email. [email protected] [mailto:[email protected]] ___________________________________________________________________________________ From: [email protected] [mailto:[email protected]] Sent: Wednesday, April 24, 2019 1:41 AM To: Charles C X CHEN <[email protected]> Subject: RE: Strange join table criteria in apache drill Hi Charles That is a current Calcite limitation that would need enhancement. Here is a similar issue: https://lists.apache.org/thread.html/f21839ec2c95f7e93980a9d2fafa89adb13998b2303517138f06b4f8@%3Cuser.drill.apache.org%3E [https://lists.apache.org/thread.html/f21839ec2c95f7e93980a9d2fafa89adb13998b2303517138f06b4f8@%3Cuser.drill.apache.org%3E] In the future, please send an unencrypted mail to the mailing list because questions like this help other users discover similar issues and solutions. It is very unlikely that someone would open these mails from unknown senders even if the domains look genuine. ~ Kunal From: [email protected] [mailto:[email protected]] Sent: Tue Apr 23, 2019 3:29 AM To: [email protected] [mailto:[email protected]], [email protected] [mailto:[email protected]] Cc: [email protected] [mailto:[email protected]], [email protected] [mailto:[email protected]], [email protected] [mailto:[email protected]] Subject: Strange join table criteria in apache drill Hi support / kunal, We are exploring if can use apache drill as the data virtualization tool. However, we find there is some issue in handling the join table sql. From the processing log, just the first table can put the filtering criteria into the sql statement to the database. Kindly please help to check if anything wrong in our configuration and how to improve it. Thx Below is our PoC steps. 1) Catalog the bigquery to apache drill 2) Catch the oracle db to the apache drill 3) Run a join table statement and monitor the log We issue the sql select * from oracle_sit.etx01da1.ASP_PMT_HEADER where feature ='GBP' union all select * from bigquery.`hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` where feature ='GBP'. Both the two database has a selection criteria (feature = ‘GBP’). However, just the first data source can put the selection criteria “FEATURE" = 'GBP' into the sql statement and drill will retrieve all the table of the table from the second data source. This will impact the performance much as it will make more data to be transferred from remote data source to our drill node. Please help to advise the cause. thx SELECT * FROM "ETX01DA1"."ASP_PMT_HEADER" WHERE "FEATURE" = 'GBP' SELECT * FROM `hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` 2019-04-23 11:00:40,284 [23411dbf-7fab-c825-e2c2-6bffc88c00ef:foreman] DEBUG o.a.d.e.p.s.h.DefaultSqlHandler - HEP_BOTTOM_UP:Physical Partition Prune Planning (1ms): ScreenPrel: rowcount = 30.0, cumulative cost = {308.0 rows, 938.0 cpu, 0.0 io, 368640.0 network, 0.0 memory}, id = 862 ProjectPrel(FEATURE=[$0], TITLE=[$1], LABEL=[$2]): rowcount = 30.0, cumulative cost = {305.0 rows, 935.0 cpu, 0.0 io, 368640.0 network, 0.0 memory}, id = 860 UnionExchangePrel: rowcount = 30.0, cumulative cost = {275.0 rows, 845.0 cpu, 0.0 io, 368640.0 network, 0.0 memory}, id = 858 UnionAllPrel(all=[true]): rowcount = 30.0, cumulative cost = {245.0 rows, 605.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 856 JdbcPrel(sql=[SELECT * FROM "ETX01DA1"."ASP_PMT_HEADER" WHERE "FEATURE" = 'GBP' ]): rowcount = 15.0, cumulative cost = {15.0 rows, 15.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 825 FilterPrel(condition=[=($0, 'GBP')]): rowcount = 15.0, cumulative cost = {200.0 rows, 560.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 854 JdbcPrel(sql=[SELECT * FROM `hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` ]): rowcount = 100.0, cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 828 2 9 rows selected (88.054 seconds) 0: jdbc:drill:zk=local> select * from oracle_sit.etx01da1.ASP_PMT_HEADER where feature ='GBP' union all select * from bigquery.`hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` where feature ='GBP'; +----------+----------+--------+ | FEATURE | TITLE | LABEL | +----------+----------+--------+ | GBP | GBP#GBP | 3.0 | | GBP | GBP#GBP | 3.0 | | GBP | GBP#GBP | 3.0 | | GBP | GBP#GBP | 3.0 | | GBP | GBP#GBP | 3.0 | | GBP | GBP#GBP | 3.0 | | GBP | GBP#GBP | 3.0 | | GBP | GBP#GBP | 3.0 | | GBP | GBP#GBP | 3.0 | +----------+----------+--------+ 9 rows selected (10.043 seconds) Regards, Charles C X Chen Senior Consultant Specialist, Development Payments and Global Liquidity and Cash Management (GLCM) | HSBC Operations, Services and Technology (HOST) L22, Office Tower 2, Taikoo Hui, 381 Tianhe Road, Tianhe District, Guangzhou, China ___________________________________________________________________________________ Phone. 8620 38589062 Tie-line 718630-89062 Email. [email protected] [mailto:[email protected]] ___________________________________________________________________________________ ******************************************************************* This e-mail is confidential. It may also be legally privileged. If you are not the addressee you may not copy, forward, disclose or use any part of it. If you have received this message in error, please delete it and all copies from your system and notify the sender immediately by return e-mail. Internet communications cannot be guaranteed to be timely, secure, error or virus-free. The sender does not accept liability for any errors or omissions. ******************************************************************* "SAVE PAPER - THINK BEFORE YOU PRINT!" ****************************************************************** This E-mail is confidential. It may also be legally privileged. If you are not the addressee you may not copy, forward, disclose or use any part of it. If you have received this message in error, please delete it and all copies from your system and notify the sender immediately by return E-mail. Internet communications cannot be guaranteed to be timely, secure, error or virus-free. The sender does not accept liability for any errors or omissions. ****************************************************************** SAVE PAPER - THINK BEFORE YOU PRINT! ******************************************************************* This e-mail is confidential. It may also be legally privileged. If you are not the addressee you may not copy, forward, disclose or use any part of it. If you have received this message in error, please delete it and all copies from your system and notify the sender immediately by return e-mail. Internet communications cannot be guaranteed to be timely, secure, error or virus-free. The sender does not accept liability for any errors or omissions. ******************************************************************* "SAVE PAPER - THINK BEFORE YOU PRINT!"
