[jira] [Updated] (HIVE-14970) repeated insert into is broken for buckets (incorrect results for tablesample, BucketingSortingReduceSinkOptimizer)

2016-11-02 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-14970:

Summary: repeated insert into is broken for buckets (incorrect results for 
tablesample, BucketingSortingReduceSinkOptimizer)  (was: repeated insert into 
is broken for buckets (incorrect results for tablesample))

> repeated insert into is broken for buckets (incorrect results for 
> tablesample, BucketingSortingReduceSinkOptimizer)
> ---
>
> Key: HIVE-14970
> URL: https://issues.apache.org/jira/browse/HIVE-14970
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Priority: Critical
>
> Running on a regular CLI driver
> {noformat}
> CREATE TABLE src_bucket(key STRING, value STRING) CLUSTERED BY (key) SORTED 
> BY (key) INTO 2 BUCKETS;
> insert into table src_bucket select key,value from srcpart limit 10;
> dfs -ls ${hiveconf:hive.metastore.warehouse.dir}/src_bucket/;
> select *, INPUT__FILE__NAME from src_bucket;
> select * from src_bucket tablesample (bucket 1 out of 2) s;
> select * from src_bucket tablesample (bucket 2 out of 2) s;
> insert into table src_bucket select key,value from srcpart limit 10;
> dfs -ls ${hiveconf:hive.metastore.warehouse.dir}/src_bucket/;
> select *, INPUT__FILE__NAME from src_bucket;
> select * from src_bucket tablesample (bucket 1 out of 2) s;
> select * from src_bucket tablesample (bucket 2 out of 2) s;
> {noformat}
> Results in the following (with masking disabled and grepping away the noise).
> Looks like bucket mapping completely breaks due to extra files, which may 
> have implications for all the optimizations that depend on them.
> This should work or at least fail if this is not supported.
> {noformat}
> PREHOOK: query: CREATE TABLE src_bucket(key STRING, value STRING) CLUSTERED 
> BY (key) SORTED BY (key) INTO 2 BUCKETS
> PREHOOK: query: insert into table src_bucket select key,value from srcpart 
> limit 10
> Found 2 items
> -rwxr-xr-x   1 sergey staff 46 2016-10-14 16:09 
> pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
> -rwxr-xr-x   1 sergey staff 68 2016-10-14 16:09 
> pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
> PREHOOK: query: select *, INPUT__FILE__NAME from src_bucket
> 165   val_165 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
> 255   val_255 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
> 484   val_484 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
> 86val_86  
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
> 238   val_238 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
> 27val_27  
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
> 278   val_278 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
> 311   val_311 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
> 409   val_409 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
> 98val_98  
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
> PREHOOK: query: select * from src_bucket tablesample (bucket 1 out of 2) s
> 165   val_165
> 255   val_255
> 484   val_484
> 86val_86
> PREHOOK: query: select * from src_bucket tablesample (bucket 2 out of 2) s
> 238   val_238
> 27val_27
> 278   val_278
> 311   val_311
> 409   val_409
> 98val_98
> {noformat}
> So far so good.
> {noformat}
> PREHOOK: query: insert into table src_bucket select key,value from srcpart 
> limit 10
> Found 4 items
> -rwxr-xr-x   1 sergey staff 46 2016-10-14 16:09 
> pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
> -rwxr-xr-x   1 sergey staff 46 2016-10-14 16:09 
> pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0_copy_1
> -rwxr-xr-x   1 sergey staff 68 2016-10-14 16:09 
> pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
> -rwxr-xr-x   1 sergey staff 68 2016-10-14 16:09 
> pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0_copy_1
> PREHOOK: query: select *, INPUT__FILE__NAME from src_bucket
> 165   val_165 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
> 255   val_255 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
> 484   val_484 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
> 86val_86  
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehou

[jira] [Updated] (HIVE-14970) repeated insert into is broken for buckets (incorrect results for tablesample)

2016-10-17 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-14970:

Summary: repeated insert into is broken for buckets (incorrect results for 
tablesample)  (was: repeated insert into is broken for buckets (incorrect 
results))

> repeated insert into is broken for buckets (incorrect results for tablesample)
> --
>
> Key: HIVE-14970
> URL: https://issues.apache.org/jira/browse/HIVE-14970
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Priority: Critical
>
> Running on a regular CLI driver
> {noformat}
> CREATE TABLE src_bucket(key STRING, value STRING) CLUSTERED BY (key) SORTED 
> BY (key) INTO 2 BUCKETS;
> insert into table src_bucket select key,value from srcpart limit 10;
> dfs -ls ${hiveconf:hive.metastore.warehouse.dir}/src_bucket/;
> select *, INPUT__FILE__NAME from src_bucket;
> select * from src_bucket tablesample (bucket 1 out of 2) s;
> select * from src_bucket tablesample (bucket 2 out of 2) s;
> insert into table src_bucket select key,value from srcpart limit 10;
> dfs -ls ${hiveconf:hive.metastore.warehouse.dir}/src_bucket/;
> select *, INPUT__FILE__NAME from src_bucket;
> select * from src_bucket tablesample (bucket 1 out of 2) s;
> select * from src_bucket tablesample (bucket 2 out of 2) s;
> {noformat}
> Results in the following (with masking disabled and grepping away the noise).
> Looks like bucket mapping completely breaks due to extra files, which may 
> have implications for all the optimizations that depend on them.
> This should work or at least fail if this is not supported.
> {noformat}
> PREHOOK: query: CREATE TABLE src_bucket(key STRING, value STRING) CLUSTERED 
> BY (key) SORTED BY (key) INTO 2 BUCKETS
> PREHOOK: query: insert into table src_bucket select key,value from srcpart 
> limit 10
> Found 2 items
> -rwxr-xr-x   1 sergey staff 46 2016-10-14 16:09 
> pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
> -rwxr-xr-x   1 sergey staff 68 2016-10-14 16:09 
> pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
> PREHOOK: query: select *, INPUT__FILE__NAME from src_bucket
> 165   val_165 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
> 255   val_255 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
> 484   val_484 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
> 86val_86  
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
> 238   val_238 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
> 27val_27  
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
> 278   val_278 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
> 311   val_311 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
> 409   val_409 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
> 98val_98  
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
> PREHOOK: query: select * from src_bucket tablesample (bucket 1 out of 2) s
> 165   val_165
> 255   val_255
> 484   val_484
> 86val_86
> PREHOOK: query: select * from src_bucket tablesample (bucket 2 out of 2) s
> 238   val_238
> 27val_27
> 278   val_278
> 311   val_311
> 409   val_409
> 98val_98
> {noformat}
> So far so good.
> {noformat}
> PREHOOK: query: insert into table src_bucket select key,value from srcpart 
> limit 10
> Found 4 items
> -rwxr-xr-x   1 sergey staff 46 2016-10-14 16:09 
> pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
> -rwxr-xr-x   1 sergey staff 46 2016-10-14 16:09 
> pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0_copy_1
> -rwxr-xr-x   1 sergey staff 68 2016-10-14 16:09 
> pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
> -rwxr-xr-x   1 sergey staff 68 2016-10-14 16:09 
> pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0_copy_1
> PREHOOK: query: select *, INPUT__FILE__NAME from src_bucket
> 165   val_165 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
> 255   val_255 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
> 484   val_484 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
> 86val_86  
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
> 165   val_165 
> pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0_copy_1
> 

[jira] [Updated] (HIVE-14970) repeated insert into is broken for buckets (incorrect results)

2016-10-14 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-14970:

Description: 
Running on a regular CLI driver
{noformat}
CREATE TABLE src_bucket(key STRING, value STRING) CLUSTERED BY (key) SORTED BY 
(key) INTO 2 BUCKETS;

insert into table src_bucket select key,value from srcpart limit 10;
dfs -ls ${hiveconf:hive.metastore.warehouse.dir}/src_bucket/;
select *, INPUT__FILE__NAME from src_bucket;
select * from src_bucket tablesample (bucket 1 out of 2) s;
select * from src_bucket tablesample (bucket 2 out of 2) s;

insert into table src_bucket select key,value from srcpart limit 10;
dfs -ls ${hiveconf:hive.metastore.warehouse.dir}/src_bucket/;
select *, INPUT__FILE__NAME from src_bucket;
select * from src_bucket tablesample (bucket 1 out of 2) s;
select * from src_bucket tablesample (bucket 2 out of 2) s;
{noformat}

Results in the following (with masking disabled and grepping away the noise).
Looks like bucket mapping completely breaks due to extra files, which may have 
implications for all the optimizations that depend on them.
This should work or at least fail if this is not supported.
{noformat}
PREHOOK: query: CREATE TABLE src_bucket(key STRING, value STRING) CLUSTERED BY 
(key) SORTED BY (key) INTO 2 BUCKETS
PREHOOK: query: insert into table src_bucket select key,value from srcpart 
limit 10
Found 2 items
-rwxr-xr-x   1 sergey staff 46 2016-10-14 16:09 
pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
-rwxr-xr-x   1 sergey staff 68 2016-10-14 16:09 
pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
PREHOOK: query: select *, INPUT__FILE__NAME from src_bucket
165 val_165 
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
255 val_255 
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
484 val_484 
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
86  val_86  
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
238 val_238 
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
27  val_27  
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
278 val_278 
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
311 val_311 
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
409 val_409 
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
98  val_98  
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
PREHOOK: query: select * from src_bucket tablesample (bucket 1 out of 2) s
165 val_165
255 val_255
484 val_484
86  val_86
PREHOOK: query: select * from src_bucket tablesample (bucket 2 out of 2) s
238 val_238
27  val_27
278 val_278
311 val_311
409 val_409
98  val_98
{noformat}
So far so good.
{noformat}
PREHOOK: query: insert into table src_bucket select key,value from srcpart 
limit 10
Found 4 items
-rwxr-xr-x   1 sergey staff 46 2016-10-14 16:09 
pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
-rwxr-xr-x   1 sergey staff 46 2016-10-14 16:09 
pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0_copy_1
-rwxr-xr-x   1 sergey staff 68 2016-10-14 16:09 
pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
-rwxr-xr-x   1 sergey staff 68 2016-10-14 16:09 
pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0_copy_1
PREHOOK: query: select *, INPUT__FILE__NAME from src_bucket
165 val_165 
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
255 val_255 
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
484 val_484 
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
86  val_86  
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0
165 val_165 
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0_copy_1
255 val_255 
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0_copy_1
484 val_484 
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0_copy_1
86  val_86  
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/00_0_copy_1
238 val_238 
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
27  val_27  
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
278 val_278 
pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/01_0
311 val_311 
pfile:/Users/sergey/git/hive/itests/qtest/target/warehou