[jira] [Created] (HIVE-25170) Data error in constant propagation caused by wrong colExprMap generated in SemanticAnalyzer
Wei Zhang created HIVE-25170: Summary: Data error in constant propagation caused by wrong colExprMap generated in SemanticAnalyzer Key: HIVE-25170 URL: https://issues.apache.org/jira/browse/HIVE-25170 Project: Hive Issue Type: Bug Components: Query Planning Affects Versions: 3.1.2 Reporter: Wei Zhang Assignee: Wei Zhang {code:java} // code placeholder EXPLAIN SELECT constant_col, key, max(value) FROM ( SELECT 'constant' as constant_col, key, value FROM src DISTRIBUTE BY constant_col, key SORT BY constant_col, key, value ) a GROUP BY constant_col, key LIMIT 10; OK Vertex dependency in root stage Reducer 2 <- Map 1 (SIMPLE_EDGE) Reducer 3 <- Reducer 2 (SIMPLE_EDGE)Stage-0 Fetch Operator limit:10 Stage-1 Reducer 3 File Output Operator [FS_10] Limit [LIM_9] (rows=1 width=368) Number of rows:10 Select Operator [SEL_8] (rows=1 width=368) Output:["_col0","_col1","_col2"] Group By Operator [GBY_7] (rows=1 width=368) Output:["_col0","_col1","_col2"],aggregations:["max(VALUE._col0)"],keys:'constant', 'constant' <-Reducer 2 [SIMPLE_EDGE] SHUFFLE [RS_6] PartitionCols:'constant', 'constant' Group By Operator [GBY_5] (rows=1 width=368) Output:["_col0","_col1","_col2"],aggregations:["max(_col2)"],keys:'constant', 'constant' Select Operator [SEL_3] (rows=500 width=178) Output:["_col2"] <-Map 1 [SIMPLE_EDGE] SHUFFLE [RS_2] PartitionCols:'constant', _col1 Select Operator [SEL_1] (rows=500 width=178) Output:["_col1","_col2"] TableScan [TS_0] (rows=500 width=10) src,src,Tbl:COMPLETE,Col:COMPLETE,Output:["key","value"]{code} Obviously, the `PartitionCols` in Reducer 2 is wrong. Instead of `'constant', 'constant'`, it should be `'constant', _col1` That's because after HIVE-13808, `SemanticAnalyzer` uses `sortCols` to generate the `colExprMap` structure in the key part, while the key columns are generated by `newSortCols`, leading to a column and expr mismatch when the constant column is not the trailing column in the key columns. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-21915) Hive with TEZ UNION ALL and UDTF results in data loss
Wei Zhang created HIVE-21915: Summary: Hive with TEZ UNION ALL and UDTF results in data loss Key: HIVE-21915 URL: https://issues.apache.org/jira/browse/HIVE-21915 Project: Hive Issue Type: Bug Affects Versions: 1.2.1 Reporter: Wei Zhang The HQL syntax is like this: CREATE TEMPORARY TABLE tez_union_all_loss_data AS SELECT xxx, yyy, zzz,1 as tag FROM ods_1 UNION ALL SELECT xxx, yyy, zzz, tag FROM ( SELECT xxx ,get_json_object(get_json_object(tb,'$.a'),'$.b') AS yyy ,zzz ,2 as tag FROM ods_2 LATERAL VIEW EXPLODE(some_udf(uuu)) team_number AS tb ) tbl ; With above HQL, we are expecting that rows with both tag = 2 and tag = 1 appear. In our case however, all the rows with tag = 1 are lost. Dig deeper we can find that the generated two maps have identical task tmp paths. And that results from when UDTF is present, the FileSinkOperator will be processed twice generating the tmp path in GenTezUtils.removeUnionOperators(); -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-9613) Left join query plan outputs wrong column when using subquery
[ https://issues.apache.org/jira/browse/HIVE-9613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322411#comment-14322411 ] Wei Zhang commented on HIVE-9613: - i came accross the same problem, any ideas? Left join query plan outputs wrong column when using subquery -- Key: HIVE-9613 URL: https://issues.apache.org/jira/browse/HIVE-9613 Project: Hive Issue Type: Bug Components: Parser, Query Planning Affects Versions: 0.14.0, 1.0.0 Environment: apache hadoop 2.5.1 Reporter: Li Xin I have a query that outputs a column with wrong contents when using subquery,and the contents of that column is equal to another column,not its own. I have three tables,as follows: table 1: _hivetemp.category_city_rank_: ||category||city||rank|| |jinrongfuwu|shanghai|1| |ktvjiuba|shanghai|2| table 2:_hivetemp.category_match_: ||src_category_en||src_category_cn||dst_category_en||dst_category_cn|| |danbaobaoxiantouzi|投资担保|担保/贷款|jinrongfuwu| |zpwentiyingshi|娱乐/休闲|KTV/酒吧|ktvjiuba| table 3:_hivetemp.city_match_: ||src_city_name_en||dst_city_name_en||city_name_cn|| |sh|shanghai|上海| And the query is : {code} select a.category, a.city, a.rank, b.src_category_en, c.src_city_name_en from hivetemp.category_city_rank a left outer join (select src_category_en, dst_category_en from hivetemp.category_match) b on a.category = b.dst_category_en left outer join (select src_city_name_en, dst_city_name_en from hivetemp.city_match) c on a.city = c.dst_city_name_en {code} which shoud output the results as follows,and i test it in hive 0.13: ||category||city||rank||src_category_en||src_city_name_en|| |jinrongfuwu|shanghai|1|danbaobaoxiantouzi|sh| |ktvjiuba|shanghai|2|zpwentiyingshi|sh| but int hive0.14,the results in the column *src_category_en* is wrong,and is just the *city* contents: ||category||city||rank||src_category_en||src_city_name_en|| |jinrongfuwu|shanghai|1|shanghai|sh| |ktvjiuba|shanghai|2|shanghai|sh| Using explain to examine the execution plan,i can see the first subquery just outputs one column of *dst_category_en*,and *src_category_en* is just missing. {quote} b:category_match TableScan alias: category_match Statistics: Num rows: 131 Data size: 13149 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: dst_category_en (type: string) outputColumnNames: _col1 Statistics: Num rows: 131 Data size: 13149 Basic stats: COMPLETE Column stats: NONE {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)