[jira] [Created] (HIVE-25170) Data error in constant propagation caused by wrong colExprMap generated in SemanticAnalyzer

2021-05-26 Thread Wei Zhang (Jira)
Wei Zhang created HIVE-25170:


 Summary: Data error in constant propagation caused by wrong 
colExprMap generated in SemanticAnalyzer
 Key: HIVE-25170
 URL: https://issues.apache.org/jira/browse/HIVE-25170
 Project: Hive
  Issue Type: Bug
  Components: Query Planning
Affects Versions: 3.1.2
Reporter: Wei Zhang
Assignee: Wei Zhang


 
{code:java}
// code placeholder

EXPLAIN
SELECT constant_col, key, max(value)
FROM
(
  SELECT 'constant' as constant_col, key, value
  FROM src
  DISTRIBUTE BY constant_col, key
  SORT BY constant_col, key, value
) a
GROUP BY constant_col, key
LIMIT 10;

OK
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)Stage-0
  Fetch Operator
limit:10
Stage-1
  Reducer 3
  File Output Operator [FS_10]
Limit [LIM_9] (rows=1 width=368)
  Number of rows:10
  Select Operator [SEL_8] (rows=1 width=368)
Output:["_col0","_col1","_col2"]
Group By Operator [GBY_7] (rows=1 width=368)
  
Output:["_col0","_col1","_col2"],aggregations:["max(VALUE._col0)"],keys:'constant',
 'constant'
<-Reducer 2 [SIMPLE_EDGE]
  SHUFFLE [RS_6]
PartitionCols:'constant', 'constant'
Group By Operator [GBY_5] (rows=1 width=368)
  
Output:["_col0","_col1","_col2"],aggregations:["max(_col2)"],keys:'constant', 
'constant'
  Select Operator [SEL_3] (rows=500 width=178)
Output:["_col2"]
  <-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_2]
  PartitionCols:'constant', _col1
  Select Operator [SEL_1] (rows=500 width=178)
Output:["_col1","_col2"]
TableScan [TS_0] (rows=500 width=10)
  
src,src,Tbl:COMPLETE,Col:COMPLETE,Output:["key","value"]{code}
Obviously, the `PartitionCols` in Reducer 2 is wrong. Instead of `'constant', 
'constant'`, it should be `'constant', _col1`

 

That's because after HIVE-13808,  `SemanticAnalyzer` uses `sortCols` to 
generate the `colExprMap` structure in the key part, while the key columns are 
generated by `newSortCols`, leading to a column and expr mismatch when the 
constant column is not the trailing column in the key columns.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-21915) Hive with TEZ UNION ALL and UDTF results in data loss

2019-06-24 Thread Wei Zhang (JIRA)
Wei Zhang created HIVE-21915:


 Summary: Hive with TEZ UNION ALL and UDTF results in data loss
 Key: HIVE-21915
 URL: https://issues.apache.org/jira/browse/HIVE-21915
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.2.1
Reporter: Wei Zhang


The HQL syntax is like this:

CREATE TEMPORARY TABLE tez_union_all_loss_data AS
SELECT xxx, yyy, zzz,1 as tag
FROM ods_1

UNION ALL

SELECT xxx, yyy, zzz, tag
FROM
(
SELECT xxx
,get_json_object(get_json_object(tb,'$.a'),'$.b') AS yyy
,zzz
,2 as tag
FROM ods_2
LATERAL VIEW EXPLODE(some_udf(uuu)) team_number AS tb
) tbl 
;

 

With above HQL, we are expecting that rows with both tag = 2 and tag = 1 
appear. In our case however, all the rows with tag = 1 are lost.

Dig deeper we can find that the generated two maps have identical task tmp 
paths. And that results from when UDTF is present, the FileSinkOperator will be 
processed twice generating the tmp path in GenTezUtils.removeUnionOperators();

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-9613) Left join query plan outputs wrong column when using subquery

2015-02-15 Thread Wei Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322411#comment-14322411
 ] 

Wei Zhang commented on HIVE-9613:
-

i came accross the same problem, any ideas?

 Left join query plan outputs  wrong column when using subquery
 --

 Key: HIVE-9613
 URL: https://issues.apache.org/jira/browse/HIVE-9613
 Project: Hive
  Issue Type: Bug
  Components: Parser, Query Planning
Affects Versions: 0.14.0, 1.0.0
 Environment: apache hadoop 2.5.1 
Reporter: Li Xin

 I have a query that outputs a column with wrong contents when using 
 subquery,and the contents of that column is equal to another column,not its 
 own.
 I have three tables,as follows:
 table 1: _hivetemp.category_city_rank_:
 ||category||city||rank||
 |jinrongfuwu|shanghai|1|
 |ktvjiuba|shanghai|2|
 table 2:_hivetemp.category_match_:
 ||src_category_en||src_category_cn||dst_category_en||dst_category_cn||
 |danbaobaoxiantouzi|投资担保|担保/贷款|jinrongfuwu|
 |zpwentiyingshi|娱乐/休闲|KTV/酒吧|ktvjiuba|
 table 3:_hivetemp.city_match_:
 ||src_city_name_en||dst_city_name_en||city_name_cn||
 |sh|shanghai|上海|
 And the query is :
 {code}
 select
 a.category,
 a.city,
 a.rank,
 b.src_category_en,
 c.src_city_name_en
 from
 hivetemp.category_city_rank a
 left outer join
 (select
 src_category_en,
 dst_category_en
 from
 hivetemp.category_match) b
 on  a.category = b.dst_category_en
 left outer join
 (select
 src_city_name_en,
 dst_city_name_en
 from
 hivetemp.city_match) c
 on  a.city = c.dst_city_name_en
 {code}
 which shoud output the results as follows,and i test it in hive 0.13:
 ||category||city||rank||src_category_en||src_city_name_en||
 |jinrongfuwu|shanghai|1|danbaobaoxiantouzi|sh|
 |ktvjiuba|shanghai|2|zpwentiyingshi|sh|
 but int hive0.14,the results in the column *src_category_en* is wrong,and is 
 just the *city* contents:
 ||category||city||rank||src_category_en||src_city_name_en||
 |jinrongfuwu|shanghai|1|shanghai|sh|
 |ktvjiuba|shanghai|2|shanghai|sh|
 Using explain to examine the execution plan,i can see the first subquery just 
 outputs one column of *dst_category_en*,and *src_category_en* is just missing.
 {quote}
b:category_match
   TableScan
 alias: category_match
 Statistics: Num rows: 131 Data size: 13149 Basic stats: COMPLETE 
 Column stats: NONE
 Select Operator
   expressions: dst_category_en (type: string)
   outputColumnNames: _col1
   Statistics: Num rows: 131 Data size: 13149 Basic stats: 
 COMPLETE Column stats: NONE
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)