[jira] [Updated] (HIVE-22098) Data loss occurs when multiple tables are join with different bucket_version

2020-09-21 Thread GuangMing Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

GuangMing Lu updated HIVE-22098:

Attachment: join_test.sql

> Data loss occurs when multiple tables are join with different bucket_version
> 
>
> Key: HIVE-22098
> URL: https://issues.apache.org/jira/browse/HIVE-22098
> Project: Hive
>  Issue Type: Bug
>  Components: Operators
>Affects Versions: 3.1.0, 3.1.2
>Reporter: GuangMing Lu
>Assignee: yongtaoliao
>Priority: Blocker
>  Labels: data-loss, wrongresults
> Attachments: HIVE-22098.1.patch, image-2019-08-12-18-45-15-771.png, 
> join_test.sql, table_a_data.orc, table_b_data.orc, table_c_data.orc
>
>
> When different bucketVersion of tables do join and no of reducers is greater 
> than 2, the result is incorrect (*data loss*).
>  *Scenario 1*: Three tables join. The temporary result data of table_a in the 
> first table and table_b in the second table joins result is recorded as 
> tmp_a_b, When it joins with the third table, the bucket_version=2 of the 
> table created by default after hive-3.0.0, temporary data tmp_a_b initialized 
> the bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In 
> the init method, the hash algorithm of selecting join column is selected 
> according to bucketVersion. If bucketVersion = 2 and is not an acid 
> operation, it will acquired the new algorithm of hash. Otherwise, the old 
> algorithm of hash is acquired. Because of the inconsistency of the algorithm 
> of hash, the partition of data allocation caused are different. At stage of 
> Reducer, Data with the same key can not be paired resulting in data loss.
> *Scenario 2*: create two test tables, create table 
> table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES 
> ('bucketing_version'='1'); table_bucketversion_2(col_1 string, col_2 string) 
> TBLPROPERTIES ('bucketing_version'='2');
>  when use table_bucketversion_1 to join table_bucketversion_2, partial result 
> data will be loss due to bucketVerison is different.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-22098) Data loss occurs when multiple tables are join with different bucket_version

2020-09-21 Thread GuangMing Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

GuangMing Lu updated HIVE-22098:

Attachment: (was: join_test.sql)

> Data loss occurs when multiple tables are join with different bucket_version
> 
>
> Key: HIVE-22098
> URL: https://issues.apache.org/jira/browse/HIVE-22098
> Project: Hive
>  Issue Type: Bug
>  Components: Operators
>Affects Versions: 3.1.0, 3.1.2
>Reporter: GuangMing Lu
>Assignee: yongtaoliao
>Priority: Blocker
>  Labels: data-loss, wrongresults
> Attachments: HIVE-22098.1.patch, image-2019-08-12-18-45-15-771.png, 
> join_test.sql, table_a_data.orc, table_b_data.orc, table_c_data.orc
>
>
> When different bucketVersion of tables do join and no of reducers is greater 
> than 2, the result is incorrect (*data loss*).
>  *Scenario 1*: Three tables join. The temporary result data of table_a in the 
> first table and table_b in the second table joins result is recorded as 
> tmp_a_b, When it joins with the third table, the bucket_version=2 of the 
> table created by default after hive-3.0.0, temporary data tmp_a_b initialized 
> the bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In 
> the init method, the hash algorithm of selecting join column is selected 
> according to bucketVersion. If bucketVersion = 2 and is not an acid 
> operation, it will acquired the new algorithm of hash. Otherwise, the old 
> algorithm of hash is acquired. Because of the inconsistency of the algorithm 
> of hash, the partition of data allocation caused are different. At stage of 
> Reducer, Data with the same key can not be paired resulting in data loss.
> *Scenario 2*: create two test tables, create table 
> table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES 
> ('bucketing_version'='1'); table_bucketversion_2(col_1 string, col_2 string) 
> TBLPROPERTIES ('bucketing_version'='2');
>  when use table_bucketversion_1 to join table_bucketversion_2, partial result 
> data will be loss due to bucketVerison is different.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-22098) Data loss occurs when multiple tables are join with different bucket_version

2020-02-22 Thread JithendhiraKumar (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JithendhiraKumar updated HIVE-22098:

Description: 
When different bucketVersion of tables do join and no of reducers is greater 
than 2, the result is incorrect (*data loss*).
 *Scenario 1*: Three tables join. The temporary result data of table_a in the 
first table and table_b in the second table joins result is recorded as 
tmp_a_b, When it joins with the third table, the bucket_version=2 of the table 
created by default after hive-3.0.0, temporary data tmp_a_b initialized the 
bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In the 
init method, the hash algorithm of selecting join column is selected according 
to bucketVersion. If bucketVersion = 2 and is not an acid operation, it will 
acquired the new algorithm of hash. Otherwise, the old algorithm of hash is 
acquired. Because of the inconsistency of the algorithm of hash, the partition 
of data allocation caused are different. At stage of Reducer, Data with the 
same key can not be paired resulting in data loss.

*Scenario 2*: create two test tables, create table table_bucketversion_1(col_1 
string, col_2 string) TBLPROPERTIES ('bucketing_version'='1'); 
table_bucketversion_2(col_1 string, col_2 string) TBLPROPERTIES 
('bucketing_version'='2');
 when use table_bucketversion_1 to join table_bucketversion_2, partial result 
data will be loss due to bucketVerison is different.

 

  was:
When different bucketVersion of tables do join and no of reducers number 
greater than 2, the result is incorrect (*data loss*).
 *Scenario 1*: Three tables join. The temporary result data of table_a in the 
first table and table_b in the second table joins result is recorded as 
tmp_a_b, When it joins with the third table, the bucket_version=2 of the table 
created by default after hive-3.0.0, temporary data tmp_a_b initialized the 
bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In the 
init method, the hash algorithm of selecting join column is selected according 
to bucketVersion. If bucketVersion = 2 and is not an acid operation, it will 
acquired the new algorithm of hash. Otherwise, the old algorithm of hash is 
acquired. Because of the inconsistency of the algorithm of hash, the partition 
of data allocation caused are different. At stage of Reducer, Data with the 
same key can not be paired resulting in data loss.

*Scenario 2*: create two test tables, create table table_bucketversion_1(col_1 
string, col_2 string) TBLPROPERTIES ('bucketing_version'='1'); 
table_bucketversion_2(col_1 string, col_2 string) TBLPROPERTIES 
('bucketing_version'='2');
 when use table_bucketversion_1 to join table_bucketversion_2, partial result 
data will be loss due to bucketVerison is different.

 


> Data loss occurs when multiple tables are join with different bucket_version
> 
>
> Key: HIVE-22098
> URL: https://issues.apache.org/jira/browse/HIVE-22098
> Project: Hive
>  Issue Type: Bug
>  Components: Operators
>Affects Versions: 3.1.0, 3.1.2
>Reporter: LuGuangMing
>Assignee: LuGuangMing
>Priority: Blocker
>  Labels: data-loss, wrongresults
> Attachments: HIVE-22098.1.patch, image-2019-08-12-18-45-15-771.png, 
> join_test.sql, table_a_data.orc, table_b_data.orc, table_c_data.orc
>
>
> When different bucketVersion of tables do join and no of reducers is greater 
> than 2, the result is incorrect (*data loss*).
>  *Scenario 1*: Three tables join. The temporary result data of table_a in the 
> first table and table_b in the second table joins result is recorded as 
> tmp_a_b, When it joins with the third table, the bucket_version=2 of the 
> table created by default after hive-3.0.0, temporary data tmp_a_b initialized 
> the bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In 
> the init method, the hash algorithm of selecting join column is selected 
> according to bucketVersion. If bucketVersion = 2 and is not an acid 
> operation, it will acquired the new algorithm of hash. Otherwise, the old 
> algorithm of hash is acquired. Because of the inconsistency of the algorithm 
> of hash, the partition of data allocation caused are different. At stage of 
> Reducer, Data with the same key can not be paired resulting in data loss.
> *Scenario 2*: create two test tables, create table 
> table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES 
> ('bucketing_version'='1'); table_bucketversion_2(col_1 string, col_2 string) 
> TBLPROPERTIES ('bucketing_version'='2');
>  when use table_bucketversion_1 to join table_bucketversion_2, partial result 
> data will be loss due to bucketVerison is different.
>  



--
This message was sent by Atlassian Jira

[jira] [Updated] (HIVE-22098) Data loss occurs when multiple tables are join with different bucket_version

2020-02-22 Thread JithendhiraKumar (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JithendhiraKumar updated HIVE-22098:

Affects Version/s: 3.1.2

> Data loss occurs when multiple tables are join with different bucket_version
> 
>
> Key: HIVE-22098
> URL: https://issues.apache.org/jira/browse/HIVE-22098
> Project: Hive
>  Issue Type: Bug
>  Components: Operators
>Affects Versions: 3.1.0, 3.1.2
>Reporter: LuGuangMing
>Assignee: LuGuangMing
>Priority: Blocker
>  Labels: data-loss, wrongresults
> Attachments: HIVE-22098.1.patch, image-2019-08-12-18-45-15-771.png, 
> join_test.sql, table_a_data.orc, table_b_data.orc, table_c_data.orc
>
>
> When different bucketVersion of tables do join and no of reducers number 
> greater than 2, the result is incorrect (*data loss*).
>  *Scenario 1*: Three tables join. The temporary result data of table_a in the 
> first table and table_b in the second table joins result is recorded as 
> tmp_a_b, When it joins with the third table, the bucket_version=2 of the 
> table created by default after hive-3.0.0, temporary data tmp_a_b initialized 
> the bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In 
> the init method, the hash algorithm of selecting join column is selected 
> according to bucketVersion. If bucketVersion = 2 and is not an acid 
> operation, it will acquired the new algorithm of hash. Otherwise, the old 
> algorithm of hash is acquired. Because of the inconsistency of the algorithm 
> of hash, the partition of data allocation caused are different. At stage of 
> Reducer, Data with the same key can not be paired resulting in data loss.
> *Scenario 2*: create two test tables, create table 
> table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES 
> ('bucketing_version'='1'); table_bucketversion_2(col_1 string, col_2 string) 
> TBLPROPERTIES ('bucketing_version'='2');
>  when use table_bucketversion_1 to join table_bucketversion_2, partial result 
> data will be loss due to bucketVerison is different.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-22098) Data loss occurs when multiple tables are join with different bucket_version

2020-02-22 Thread JithendhiraKumar (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JithendhiraKumar updated HIVE-22098:

Labels: data-loss wrongresults  (was: )

> Data loss occurs when multiple tables are join with different bucket_version
> 
>
> Key: HIVE-22098
> URL: https://issues.apache.org/jira/browse/HIVE-22098
> Project: Hive
>  Issue Type: Bug
>  Components: Operators
>Affects Versions: 3.1.0
>Reporter: LuGuangMing
>Assignee: LuGuangMing
>Priority: Blocker
>  Labels: data-loss, wrongresults
> Attachments: HIVE-22098.1.patch, image-2019-08-12-18-45-15-771.png, 
> join_test.sql, table_a_data.orc, table_b_data.orc, table_c_data.orc
>
>
> When different bucketVersion of tables do join and no of reducers number 
> greater than 2, the result is incorrect (*data loss*).
>  *Scenario 1*: Three tables join. The temporary result data of table_a in the 
> first table and table_b in the second table joins result is recorded as 
> tmp_a_b, When it joins with the third table, the bucket_version=2 of the 
> table created by default after hive-3.0.0, temporary data tmp_a_b initialized 
> the bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In 
> the init method, the hash algorithm of selecting join column is selected 
> according to bucketVersion. If bucketVersion = 2 and is not an acid 
> operation, it will acquired the new algorithm of hash. Otherwise, the old 
> algorithm of hash is acquired. Because of the inconsistency of the algorithm 
> of hash, the partition of data allocation caused are different. At stage of 
> Reducer, Data with the same key can not be paired resulting in data loss.
> *Scenario 2*: create two test tables, create table 
> table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES 
> ('bucketing_version'='1'); table_bucketversion_2(col_1 string, col_2 string) 
> TBLPROPERTIES ('bucketing_version'='2');
>  when use table_bucketversion_1 to join table_bucketversion_2, partial result 
> data will be loss due to bucketVerison is different.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-22098) Data loss occurs when multiple tables are join with different bucket_version

2020-02-22 Thread JithendhiraKumar (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JithendhiraKumar updated HIVE-22098:

Description: 
When different bucketVersion of tables do join and no of reducers number 
greater than 2, the result is incorrect (*data loss*).
 *Scenario 1*: Three tables join. The temporary result data of table_a in the 
first table and table_b in the second table joins result is recorded as 
tmp_a_b, When it joins with the third table, the bucket_version=2 of the table 
created by default after hive-3.0.0, temporary data tmp_a_b initialized the 
bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In the 
init method, the hash algorithm of selecting join column is selected according 
to bucketVersion. If bucketVersion = 2 and is not an acid operation, it will 
acquired the new algorithm of hash. Otherwise, the old algorithm of hash is 
acquired. Because of the inconsistency of the algorithm of hash, the partition 
of data allocation caused are different. At stage of Reducer, Data with the 
same key can not be paired resulting in data loss.

*Scenario 2*: create two test tables, create table table_bucketversion_1(col_1 
string, col_2 string) TBLPROPERTIES ('bucketing_version'='1'); 
table_bucketversion_2(col_1 string, col_2 string) TBLPROPERTIES 
('bucketing_version'='2');
 when use table_bucketversion_1 to join table_bucketversion_2, partial result 
data will be loss due to bucketVerison is different.

 

  was:
When different bucketVersion of tables do join and  reducers number greater 
than 2, result is easy to lose data.
*Scenario 1*: Three tables join. The temporary result data of table_a in the 
first table and table_b in the second table joins result is recorded as 
tmp_a_b, When it joins with the third table, the bucket_version=2 of the table 
created by default after hive-3.0.0, temporary data tmp_a_b initialized the 
bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In the 
init method, the hash algorithm of selecting join column is selected according 
to bucketVersion. If bucketVersion = 2 and is not an acid operation, it will 
acquired the new algorithm of hash. Otherwise, the old algorithm of hash is 
acquired. Because of the inconsistency of the algorithm of hash, the partition 
of data allocation caused are different. At stage of Reducer, Data with the 
same key can not be paired resulting in data loss.

*Scenario 2*: create two test tables, create table table_bucketversion_1(col_1 
string, col_2 string) TBLPROPERTIES ('bucketing_version'='1'); 
table_bucketversion_2(col_1 string, col_2 string) TBLPROPERTIES 
('bucketing_version'='2');
when use table_bucketversion_1 to join table_bucketversion_2, partial result 
data will be loss due to bucketVerison is different.

 


> Data loss occurs when multiple tables are join with different bucket_version
> 
>
> Key: HIVE-22098
> URL: https://issues.apache.org/jira/browse/HIVE-22098
> Project: Hive
>  Issue Type: Bug
>  Components: Operators
>Affects Versions: 3.1.0
>Reporter: LuGuangMing
>Assignee: LuGuangMing
>Priority: Blocker
> Attachments: HIVE-22098.1.patch, image-2019-08-12-18-45-15-771.png, 
> join_test.sql, table_a_data.orc, table_b_data.orc, table_c_data.orc
>
>
> When different bucketVersion of tables do join and no of reducers number 
> greater than 2, the result is incorrect (*data loss*).
>  *Scenario 1*: Three tables join. The temporary result data of table_a in the 
> first table and table_b in the second table joins result is recorded as 
> tmp_a_b, When it joins with the third table, the bucket_version=2 of the 
> table created by default after hive-3.0.0, temporary data tmp_a_b initialized 
> the bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In 
> the init method, the hash algorithm of selecting join column is selected 
> according to bucketVersion. If bucketVersion = 2 and is not an acid 
> operation, it will acquired the new algorithm of hash. Otherwise, the old 
> algorithm of hash is acquired. Because of the inconsistency of the algorithm 
> of hash, the partition of data allocation caused are different. At stage of 
> Reducer, Data with the same key can not be paired resulting in data loss.
> *Scenario 2*: create two test tables, create table 
> table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES 
> ('bucketing_version'='1'); table_bucketversion_2(col_1 string, col_2 string) 
> TBLPROPERTIES ('bucketing_version'='2');
>  when use table_bucketversion_1 to join table_bucketversion_2, partial result 
> data will be loss due to bucketVerison is different.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-22098) Data loss occurs when multiple tables are join with different bucket_version

2020-02-21 Thread JithendhiraKumar (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JithendhiraKumar updated HIVE-22098:

Priority: Blocker  (was: Major)

> Data loss occurs when multiple tables are join with different bucket_version
> 
>
> Key: HIVE-22098
> URL: https://issues.apache.org/jira/browse/HIVE-22098
> Project: Hive
>  Issue Type: Bug
>  Components: Operators
>Affects Versions: 3.1.0
>Reporter: LuGuangMing
>Assignee: LuGuangMing
>Priority: Blocker
> Attachments: HIVE-22098.1.patch, image-2019-08-12-18-45-15-771.png, 
> join_test.sql, table_a_data.orc, table_b_data.orc, table_c_data.orc
>
>
> When different bucketVersion of tables do join and  reducers number greater 
> than 2, result is easy to lose data.
> *Scenario 1*: Three tables join. The temporary result data of table_a in the 
> first table and table_b in the second table joins result is recorded as 
> tmp_a_b, When it joins with the third table, the bucket_version=2 of the 
> table created by default after hive-3.0.0, temporary data tmp_a_b initialized 
> the bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In 
> the init method, the hash algorithm of selecting join column is selected 
> according to bucketVersion. If bucketVersion = 2 and is not an acid 
> operation, it will acquired the new algorithm of hash. Otherwise, the old 
> algorithm of hash is acquired. Because of the inconsistency of the algorithm 
> of hash, the partition of data allocation caused are different. At stage of 
> Reducer, Data with the same key can not be paired resulting in data loss.
> *Scenario 2*: create two test tables, create table 
> table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES 
> ('bucketing_version'='1'); table_bucketversion_2(col_1 string, col_2 string) 
> TBLPROPERTIES ('bucketing_version'='2');
> when use table_bucketversion_1 to join table_bucketversion_2, partial result 
> data will be loss due to bucketVerison is different.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-22098) Data loss occurs when multiple tables are join with different bucket_version

2019-09-20 Thread LuGuangMing (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LuGuangMing updated HIVE-22098:
---
Status: Patch Available  (was: Reopened)

> Data loss occurs when multiple tables are join with different bucket_version
> 
>
> Key: HIVE-22098
> URL: https://issues.apache.org/jira/browse/HIVE-22098
> Project: Hive
>  Issue Type: Bug
>  Components: Operators
>Affects Versions: 3.1.0
>Reporter: LuGuangMing
>Assignee: LuGuangMing
>Priority: Major
> Attachments: HIVE-22098.1.patch, image-2019-08-12-18-45-15-771.png, 
> join_test.sql, table_a_data.orc, table_b_data.orc, table_c_data.orc
>
>
> When different bucketVersion of tables do join and  reducers number greater 
> than 2, result is easy to lose data.
> *Scenario 1*: Three tables join. The temporary result data of table_a in the 
> first table and table_b in the second table joins result is recorded as 
> tmp_a_b, When it joins with the third table, the bucket_version=2 of the 
> table created by default after hive-3.0.0, temporary data tmp_a_b initialized 
> the bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In 
> the init method, the hash algorithm of selecting join column is selected 
> according to bucketVersion. If bucketVersion = 2 and is not an acid 
> operation, it will acquired the new algorithm of hash. Otherwise, the old 
> algorithm of hash is acquired. Because of the inconsistency of the algorithm 
> of hash, the partition of data allocation caused are different. At stage of 
> Reducer, Data with the same key can not be paired resulting in data loss.
> *Scenario 2*: create two test tables, create table 
> table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES 
> ('bucketing_version'='1'); table_bucketversion_2(col_1 string, col_2 string) 
> TBLPROPERTIES ('bucketing_version'='2');
> when use table_bucketversion_1 to join table_bucketversion_2, partial result 
> data will be loss due to bucketVerison is different.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-22098) Data loss occurs when multiple tables are join with different bucket_version

2019-08-12 Thread LuGuangMing (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LuGuangMing updated HIVE-22098:
---
Attachment: HIVE-22098.1.patch

> Data loss occurs when multiple tables are join with different bucket_version
> 
>
> Key: HIVE-22098
> URL: https://issues.apache.org/jira/browse/HIVE-22098
> Project: Hive
>  Issue Type: Bug
>  Components: Operators
>Affects Versions: 3.1.0
>Reporter: LuGuangMing
>Assignee: LuGuangMing
>Priority: Major
> Attachments: HIVE-22098.1.patch, image-2019-08-12-18-45-15-771.png, 
> join_test.sql, table_a_data.orc, table_b_data.orc, table_c_data.orc
>
>
> When different bucketVersion of tables do join and  reducers number greater 
> than 2, result is easy to lose data.
> *Scenario 1*: Three tables join. The temporary result data of table_a in the 
> first table and table_b in the second table joins result is recorded as 
> tmp_a_b, When it joins with the third table, the bucket_version=2 of the 
> table created by default after hive-3.0.0, temporary data tmp_a_b initialized 
> the bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In 
> the init method, the hash algorithm of selecting join column is selected 
> according to bucketVersion. If bucketVersion = 2 and is not an acid 
> operation, it will acquired the new algorithm of hash. Otherwise, the old 
> algorithm of hash is acquired. Because of the inconsistency of the algorithm 
> of hash, the partition of data allocation caused are different. At stage of 
> Reducer, Data with the same key can not be paired resulting in data loss.
> *Scenario 2*: create two test tables, create table 
> table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES 
> ('bucketing_version'='1'); table_bucketversion_2(col_1 string, col_2 string) 
> TBLPROPERTIES ('bucketing_version'='2');
> when use table_bucketversion_1 to join table_bucketversion_2, partial result 
> data will be loss due to bucketVerison is different.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (HIVE-22098) Data loss occurs when multiple tables are join with different bucket_version

2019-08-12 Thread LuGuangMing (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LuGuangMing updated HIVE-22098:
---
Attachment: (was: HIVE-22098.1.patch)

> Data loss occurs when multiple tables are join with different bucket_version
> 
>
> Key: HIVE-22098
> URL: https://issues.apache.org/jira/browse/HIVE-22098
> Project: Hive
>  Issue Type: Bug
>  Components: Operators
>Affects Versions: 3.1.0
>Reporter: LuGuangMing
>Assignee: LuGuangMing
>Priority: Major
> Attachments: image-2019-08-12-18-45-15-771.png, join_test.sql, 
> table_a_data.orc, table_b_data.orc, table_c_data.orc
>
>
> When different bucketVersion of tables do join and  reducers number greater 
> than 2, result is easy to lose data.
> *Scenario 1*: Three tables join. The temporary result data of table_a in the 
> first table and table_b in the second table joins result is recorded as 
> tmp_a_b, When it joins with the third table, the bucket_version=2 of the 
> table created by default after hive-3.0.0, temporary data tmp_a_b initialized 
> the bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In 
> the init method, the hash algorithm of selecting join column is selected 
> according to bucketVersion. If bucketVersion = 2 and is not an acid 
> operation, it will acquired the new algorithm of hash. Otherwise, the old 
> algorithm of hash is acquired. Because of the inconsistency of the algorithm 
> of hash, the partition of data allocation caused are different. At stage of 
> Reducer, Data with the same key can not be paired resulting in data loss.
> *Scenario 2*: create two test tables, create table 
> table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES 
> ('bucketing_version'='1'); table_bucketversion_2(col_1 string, col_2 string) 
> TBLPROPERTIES ('bucketing_version'='2');
> when use table_bucketversion_1 to join table_bucketversion_2, partial result 
> data will be loss due to bucketVerison is different.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (HIVE-22098) Data loss occurs when multiple tables are join with different bucket_version

2019-08-12 Thread LuGuangMing (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LuGuangMing updated HIVE-22098:
---
Attachment: HIVE-22098.1.patch

> Data loss occurs when multiple tables are join with different bucket_version
> 
>
> Key: HIVE-22098
> URL: https://issues.apache.org/jira/browse/HIVE-22098
> Project: Hive
>  Issue Type: Bug
>  Components: Operators
>Affects Versions: 3.1.0
>Reporter: LuGuangMing
>Assignee: LuGuangMing
>Priority: Major
> Attachments: HIVE-22098.1.patch, image-2019-08-12-18-45-15-771.png, 
> join_test.sql, table_a_data.orc, table_b_data.orc, table_c_data.orc
>
>
> When different bucketVersion of tables do join and  reducers number greater 
> than 2, result is easy to lose data.
> *Scenario 1*: Three tables join. The temporary result data of table_a in the 
> first table and table_b in the second table joins result is recorded as 
> tmp_a_b, When it joins with the third table, the bucket_version=2 of the 
> table created by default after hive-3.0.0, temporary data tmp_a_b initialized 
> the bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In 
> the init method, the hash algorithm of selecting join column is selected 
> according to bucketVersion. If bucketVersion = 2 and is not an acid 
> operation, it will acquired the new algorithm of hash. Otherwise, the old 
> algorithm of hash is acquired. Because of the inconsistency of the algorithm 
> of hash, the partition of data allocation caused are different. At stage of 
> Reducer, Data with the same key can not be paired resulting in data loss.
> *Scenario 2*: create two test tables, create table 
> table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES 
> ('bucketing_version'='1'); table_bucketversion_2(col_1 string, col_2 string) 
> TBLPROPERTIES ('bucketing_version'='2');
> when use table_bucketversion_1 to join table_bucketversion_2, partial result 
> data will be loss due to bucketVerison is different.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (HIVE-22098) Data loss occurs when multiple tables are join with different bucket_version

2019-08-12 Thread LuGuangMing (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LuGuangMing updated HIVE-22098:
---
Attachment: image-2019-08-12-18-45-15-771.png

> Data loss occurs when multiple tables are join with different bucket_version
> 
>
> Key: HIVE-22098
> URL: https://issues.apache.org/jira/browse/HIVE-22098
> Project: Hive
>  Issue Type: Bug
>  Components: Operators
>Affects Versions: 3.1.0
>Reporter: LuGuangMing
>Assignee: LuGuangMing
>Priority: Major
> Attachments: image-2019-08-12-18-45-15-771.png, join_test.sql, 
> table_a_data.orc, table_b_data.orc, table_c_data.orc
>
>
> When different bucketVersion of tables do join and  reducers number greater 
> than 2, result is easy to lose data.
> *Scenario 1*: Three tables join. The temporary result data of table_a in the 
> first table and table_b in the second table joins result is recorded as 
> tmp_a_b, When it joins with the third table, the bucket_version=2 of the 
> table created by default after hive-3.0.0, temporary data tmp_a_b initialized 
> the bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In 
> the init method, the hash algorithm of selecting join column is selected 
> according to bucketVersion. If bucketVersion = 2 and is not an acid 
> operation, it will acquired the new algorithm of hash. Otherwise, the old 
> algorithm of hash is acquired. Because of the inconsistency of the algorithm 
> of hash, the partition of data allocation caused are different. At stage of 
> Reducer, Data with the same key can not be paired resulting in data loss.
> *Scenario 2*: create two test tables, create table 
> table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES 
> ('bucketing_version'='1'); table_bucketversion_2(col_1 string, col_2 string) 
> TBLPROPERTIES ('bucketing_version'='2');
> when use table_bucketversion_1 to join table_bucketversion_2, partial result 
> data will be loss due to bucketVerison is different.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (HIVE-22098) Data loss occurs when multiple tables are join with different bucket_version

2019-08-12 Thread LuGuangMing (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LuGuangMing updated HIVE-22098:
---
Summary: Data loss occurs when multiple tables are join with different 
bucket_version  (was: Data loss occurs when joins occur on tables with 
different bucket_version)

> Data loss occurs when multiple tables are join with different bucket_version
> 
>
> Key: HIVE-22098
> URL: https://issues.apache.org/jira/browse/HIVE-22098
> Project: Hive
>  Issue Type: Bug
>  Components: Operators
>Affects Versions: 3.1.0
>Reporter: LuGuangMing
>Assignee: LuGuangMing
>Priority: Major
> Attachments: join_test.sql, table_a_data.orc, table_b_data.orc, 
> table_c_data.orc
>
>
> When different bucketVersion of tables do join and  reducers number greater 
> than 2, result is easy to lose data.
> *Scenario 1*: Three tables join. The temporary result data of table_a in the 
> first table and table_b in the second table joins result is recorded as 
> tmp_a_b, When it joins with the third table, the bucket_version=2 of the 
> table created by default after hive-3.0.0, temporary data tmp_a_b initialized 
> the bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In 
> the init method, the hash algorithm of selecting join column is selected 
> according to bucketVersion. If bucketVersion = 2 and is not an acid 
> operation, it will acquired the new algorithm of hash. Otherwise, the old 
> algorithm of hash is acquired. Because of the inconsistency of the algorithm 
> of hash, the partition of data allocation caused are different. At stage of 
> Reducer, Data with the same key can not be paired resulting in data loss.
> *Scenario 2*: create two test tables, create table 
> table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES 
> ('bucketing_version'='1'); table_bucketversion_2(col_1 string, col_2 string) 
> TBLPROPERTIES ('bucketing_version'='2');
> when use table_bucketversion_1 to join table_bucketversion_2, partial result 
> data will be loss due to bucketVerison is different.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)