Andries Engelbrecht created DRILL-6708: ------------------------------------------
Summary: Flatten operator executes twice on subquery resulting in cartesian of flatten columns, when final query has 2 columns using the flatten column in original query Key: DRILL-6708 URL: https://issues.apache.org/jira/browse/DRILL-6708 Project: Apache Drill Issue Type: Bug Components: Query Planning & Optimization Affects Versions: 1.13.0 Reporter: Andries Engelbrecht Attachments: campaignclicks_50.json The following query with subquery and referencing the flatten column twice in final result, ends up with 1137195 rows vs the expected 140913 rows. {code:java} SELECT ( `Third`.`cust_id`), ( `Third`.`device`), ( `Third`.`prod_id`) ( `Third`.`prod_id`) AS `prod_id2` FROM ( SELECT ( `Second`.`cust_id`), ( `Second`.`device`), ( `Second`.`prod_id`) FROM ( SELECT ( `First`.`cust_id`), ( `First`.`device`), ( `First`.`prod_id`) FROM `dfs.views`.`clicks_campaign_vw` AS `First` ) AS `Second` ) AS `Third` {code} This executed against Drill View listed below {code:java} CREATE or REPLACE VIEW dfs.views.clicks_campaign_vw AS SELECT CAST(`t`.`trans_id` as BIGINT) as trans_id, CAST(`t`.`date` AS DATE) AS `thedate`, CAST(`t`.`user_info`['cust_id'] AS BIGINT) AS `cust_id`, CAST(`t`.`user_info`['device'] AS VARCHAR(20)) AS `device`, CAST(`t`.`user_info`['state'] AS VARCHAR(2)) AS `custstate`, CAST(FLATTEN(`t`.`trans_info`['prod_id']) AS BIGINT) AS `prod_id`, CAST(`t`.`trans_info`['purch_flag'] AS VARCHAR(6)) AS `purch_flag` FROM `dfs`.`clicks`.`campaignclicks_50.json` AS `t` WHERE `t`.`trans_info`['prod_id'][0] IS NOT NULL;{code} Below is the query plan showing FLATTEN invoked twice {code:java} 00-00 Screen : rowType = RecordType(BIGINT cust_id, VARCHAR(20) device, BIGINT prod_id, BIGINT prod_id2): rowcount = 7089.3, cumulative cost = {73965.03 rows, 337056.82999999996 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901702 00-01 ComplexToJson : rowType = RecordType(BIGINT cust_id, VARCHAR(20) device, BIGINT prod_id, BIGINT prod_id2): rowcount = 7089.3, cumulative cost = {73256.1 rows, 336347.89999999997 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901701 00-02 Project(cust_id=[CAST($0):BIGINT], device=[CAST($1):VARCHAR(20) CHARACTER SET "UTF-16LE" COLLATE "UTF-16LE$en_US$primary"], prod_id=[CAST($3):BIGINT], prod_id2=[CAST($4):BIGINT]) : rowType = RecordType(BIGINT cust_id, VARCHAR(20) device, BIGINT prod_id, BIGINT prod_id2): rowcount = 7089.3, cumulative cost = {66166.8 rows, 329258.6 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901700 00-03 Flatten(flattenField=[$4]) : rowType = RecordType(ANY EXPR$0, ANY EXPR$1, ANY EXPR$2, ANY EXPR$4, ANY EXPR$5): rowcount = 7089.3, cumulative cost = {59077.50000000001 rows, 215829.8 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901699 00-04 Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$4=[$3], EXPR$5=[$2]) : rowType = RecordType(ANY EXPR$0, ANY EXPR$1, ANY EXPR$2, ANY EXPR$4, ANY EXPR$5): rowcount = 7089.3, cumulative cost = {51988.200000000004 rows, 208740.5 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901698 00-05 Flatten(flattenField=[$3]) : rowType = RecordType(ANY EXPR$0, ANY EXPR$1, ANY EXPR$2, ANY EXPR$4): rowcount = 7089.3, cumulative cost = {44898.9 rows, 173294.0 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901697 00-06 Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$4=[$2]) : rowType = RecordType(ANY EXPR$0, ANY EXPR$1, ANY EXPR$2, ANY EXPR$4): rowcount = 7089.3, cumulative cost = {37809.6 rows, 166204.7 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901696 00-07 SelectionVectorRemover : rowType = RecordType(ANY ITEM, ANY ITEM1, ANY ITEM2, ANY ITEM3): rowcount = 7089.3, cumulative cost = {30720.3 rows, 137847.5 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901695 00-08 Filter(condition=[IS NOT NULL($3)]) : rowType = RecordType(ANY ITEM, ANY ITEM1, ANY ITEM2, ANY ITEM3): rowcount = 7089.3, cumulative cost = {23631.0 rows, 130758.2 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901694 00-09 Project(ITEM=[ITEM($0, 'cust_id')], ITEM1=[ITEM($0, 'device')], ITEM2=[ITEM($1, 'prod_id')], ITEM3=[ITEM(ITEM($1, 'prod_id'), 0)]) : rowType = RecordType(ANY ITEM, ANY ITEM1, ANY ITEM2, ANY ITEM3): rowcount = 7877.0, cumulative cost = {15754.0 rows, 70893.0 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901693 00-10 Scan(table=[[dfs, clicks, campaignclicks_50.json]], groupscan=[EasyGroupScan [selectionRoot=maprfs:/data/nested/clicks/campaignclicks_50.json, numFiles=1, columns=[`user_info`.`cust_id`, `user_info`.`device`, `trans_info`.`prod_id`, `trans_info`.`prod_id`[0]], files=[maprfs:///data/nested/clicks/campaignclicks_50.json]]]) : rowType = RecordType(ANY user_info, ANY trans_info): rowcount = 7877.0, cumulative cost = {7877.0 rows, 15754.0 cpu, 8067011.0 io, 0.0 network, 0.0 memory}, id = 901692{code} The source data file is attached -- This message was sent by Atlassian JIRA (v7.6.3#76005)