[jira] [Comment Edited] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-20 Thread Alexander Behm (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370276#comment-16370276
 ] 

Alexander Behm edited comment on HIVE-18743 at 2/20/18 5:06 PM:


Thank you, [~kgyrtkirk]. I agree completely. I'm very much in favor of getting 
rid of all non-obvious side effects of Metastore API calls. Stats collection is 
one of those side effects. As is today, it is very hard to reason about what 
exactly the Metastore will do how expensive API calls are.


was (Author: alex.behm):
Thanks you, [~kgyrtkirk]. I agree completely. I'm very much in favor of getting 
rid of all non-obvious side effects of Metastore API calls. Stats collection is 
one of those side effects. As is today, it is very hard to reason about what 
exactly the Metastore will do how expensive API calls are.

> CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround 
> is buggy.
> ---
>
> Key: HIVE-18743
> URL: https://issues.apache.org/jira/browse/HIVE-18743
> Project: Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 1.2.0, 1.1.0
>Reporter: Alexander Behm
>Assignee: Alexander Kolbasov
>Priority: Major
>
> When hive.stats.autogather=true then the Metastore lists all files under the 
> table directory to populate basic stats like file counts and sizes. This file 
> listing operation can be very expensive particularly on filesystems like S3.
> One way to address this issue is to reconfigure hive.stats.autogather=false.
> *Here's the bug*
> It is my understanding that the DO_NOT_UPDATE_STATS table property is 
> intended to selectively prevent this stats collection. Unfortunately, this 
> table property is checked *after* the expensive file listing operation, so 
> the DO_NOT_UPDATE_STATS does not seem to work as intended. See:
> https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L633
> Relevant code snippet:
> {code}
>   public static boolean updateTableStatsFast(Database db, Table tbl, 
> Warehouse wh,
>  boolean madeDir, boolean 
> forceRecompute, EnvironmentContext environmentContext) throws MetaException {
> if (tbl.getPartitionKeysSize() == 0) {
>   // Update stats only when unpartitioned
>   FileStatus[] fileStatuses = wh.getFileStatusesForUnpartitionedTable(db, 
> tbl);
>   return updateTableStatsFast(tbl, fileStatuses, madeDir, forceRecompute, 
> environmentContext); <--- DO_NOT_UPDATE_STATS is checked in here after 
> wh.getFileStatusesForUnpartitionedTable() has already been called
> } else {
>   return false;
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-20 Thread Alexander Behm (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370276#comment-16370276
 ] 

Alexander Behm commented on HIVE-18743:
---

Thanks you, [~kgyrtkirk]. I agree completely. I'm very much in favor of getting 
rid of all non-obvious side effects of Metastore API calls. Stats collection is 
one of those side effects. As is today, it is very hard to reason about what 
exactly the Metastore will do how expensive API calls are.

> CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround 
> is buggy.
> ---
>
> Key: HIVE-18743
> URL: https://issues.apache.org/jira/browse/HIVE-18743
> Project: Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 1.2.0, 1.1.0
>Reporter: Alexander Behm
>Assignee: Alexander Kolbasov
>Priority: Major
>
> When hive.stats.autogather=true then the Metastore lists all files under the 
> table directory to populate basic stats like file counts and sizes. This file 
> listing operation can be very expensive particularly on filesystems like S3.
> One way to address this issue is to reconfigure hive.stats.autogather=false.
> *Here's the bug*
> It is my understanding that the DO_NOT_UPDATE_STATS table property is 
> intended to selectively prevent this stats collection. Unfortunately, this 
> table property is checked *after* the expensive file listing operation, so 
> the DO_NOT_UPDATE_STATS does not seem to work as intended. See:
> https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L633
> Relevant code snippet:
> {code}
>   public static boolean updateTableStatsFast(Database db, Table tbl, 
> Warehouse wh,
>  boolean madeDir, boolean 
> forceRecompute, EnvironmentContext environmentContext) throws MetaException {
> if (tbl.getPartitionKeysSize() == 0) {
>   // Update stats only when unpartitioned
>   FileStatus[] fileStatuses = wh.getFileStatusesForUnpartitionedTable(db, 
> tbl);
>   return updateTableStatsFast(tbl, fileStatuses, madeDir, forceRecompute, 
> environmentContext); <--- DO_NOT_UPDATE_STATS is checked in here after 
> wh.getFileStatusesForUnpartitionedTable() has already been called
> } else {
>   return false;
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-15670) column_stats_accurate may not fit in PARTITION_PARAMS.VALUE

2017-10-27 Thread Alexander Behm (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223239#comment-16223239
 ] 

Alexander Behm commented on HIVE-15670:
---

Thanks for the response. From the perspective of a client issuing an RPC to 
alter column stats, it seems like a rather questionable side-effect to also 
alter the table metadata with a pretty big payload. Instead of "fixing" this 
issue by changing the database schema, could we instead remove the JSON string 
altogether?

I'm definitely not familiar with the implementation details, just trying to 
provide a perspective from a Metastore client that is not Hive.

> column_stats_accurate may not fit in PARTITION_PARAMS.VALUE
> ---
>
> Key: HIVE-15670
> URL: https://issues.apache.org/jira/browse/HIVE-15670
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>
> The JSON can be too big with many columns (see setColumnStatsState method).
> We can make JSON more compact by only storing the list of columns with true 
> values. Or we can even store a bitmask in a dedicated column, and adjust it 
> when altering table (rare enough). Or we can just change the VALUE column to 
> text blob (might be a painful change wrt upgrade scripts, and supporting all 
> the DBs' varied blob implementations, esp. in directsql).
> Storing denormalized flags in a separate table will probably be slow, 
> comparatively.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-15670) column_stats_accurate may not fit in PARTITION_PARAMS.VALUE

2017-10-24 Thread Alexander Behm (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217912#comment-16217912
 ] 

Alexander Behm commented on HIVE-15670:
---

May I ask what's the purpose of storing this JSON in the tableproperties? Seems 
pretty expensive to me. If you want to keep track of the accuracy of column 
stats, why not populate a "last updated" timestamp in the appropriate column 
statistic?

> column_stats_accurate may not fit in PARTITION_PARAMS.VALUE
> ---
>
> Key: HIVE-15670
> URL: https://issues.apache.org/jira/browse/HIVE-15670
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>
> The JSON can be too big with many columns (see setColumnStatsState method).
> We can make JSON more compact by only storing the list of columns with true 
> values. Or we can even store a bitmask in a dedicated column, and adjust it 
> when altering table (rare enough). Or we can just change the VALUE column to 
> text blob (might be a painful change wrt upgrade scripts, and supporting all 
> the DBs' varied blob implementations, esp. in directsql).
> Storing denormalized flags in a separate table will probably be slow, 
> comparatively.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-15653) Some ALTER TABLE commands drop table stats

2017-09-19 Thread Alexander Behm (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172752#comment-16172752
 ] 

Alexander Behm commented on HIVE-15653:
---

[~ctang.ma], am I understanding correctly that there is no interest in fixing 
this issue on the Metastore side? I understand that all clients can pass 
DO_NOT_UPDATE_STATS.

> Some ALTER TABLE commands drop table stats
> --
>
> Key: HIVE-15653
> URL: https://issues.apache.org/jira/browse/HIVE-15653
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore, Statistics
>Affects Versions: 1.1.0
>Reporter: Alexander Behm
>Assignee: Chaoyu Tang
>Priority: Critical
> Fix For: 2.3.0
>
> Attachments: HIVE-15653.1.patch, HIVE-15653.2.patch, 
> HIVE-15653.3.patch, HIVE-15653.4.patch, HIVE-15653.5.patch, 
> HIVE-15653.6.patch, HIVE-15653.patch
>
>
> Some ALTER TABLE commands drop the table stats. That may make sense for some 
> ALTER TABLE operations, but certainly not for others. Personally, I I think 
> ALTER TABLE should only change what was requested by the user without any 
> side effects that may be unclear to users. In particular, collecting stats 
> can be an expensive operation so it's rather inconvenient for users if they 
> get wiped accidentally.
> Repro:
> {code}
> create table t (i int);
> insert into t values(1);
> analyze table t compute statistics;
> alter table t set tblproperties('test'='test');
> hive> describe formatted t;
> OK
> # col_namedata_type   comment 
>
> i int 
>
> # Detailed Table Information   
> Database: default  
> Owner:abehm
> CreateTime:   Tue Jan 17 18:13:34 PST 2017 
> LastAccessTime:   UNKNOWN  
> Protect Mode: None 
> Retention:0
> Location: hdfs://localhost:20500/test-warehouse/t  
> Table Type:   MANAGED_TABLE
> Table Parameters:  
>   COLUMN_STATS_ACCURATE   false   
>   last_modified_byabehm   
>   last_modified_time  1484705748  
>   numFiles1   
>   numRows -1  
>   rawDataSize -1  
>   testtest
>   totalSize   2   
>   transient_lastDdlTime   1484705748  
>
> # Storage Information  
> SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
>  
> InputFormat:  org.apache.hadoop.mapred.TextInputFormat 
> OutputFormat: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
> Compressed:   No   
> Num Buckets:  -1   
> Bucket Columns:   []   
> Sort Columns: []   
> Storage Desc Params:   
>   serialization.format1   
> Time taken: 0.169 seconds, Fetched: 34 row(s)
> {code}
> The same behavior can be observed with several other ALTER TABLE commands.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-6590) Hive does not work properly with boolean partition columns (wrong results and inserts to incorrect HDFS path)

2017-02-15 Thread Alexander Behm (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868306#comment-15868306
 ] 

Alexander Behm commented on HIVE-6590:
--

[~kgyrtkirk], can you explain how your new proposed behavior is better? It 
seems like there is still a fundamental issue that partition-key values do not 
have a canonical string representation, so you could still have multiple HDFS 
paths referring to the same logical partition.

> Hive does not work properly with boolean partition columns (wrong results and 
> inserts to incorrect HDFS path)
> -
>
> Key: HIVE-6590
> URL: https://issues.apache.org/jira/browse/HIVE-6590
> Project: Hive
>  Issue Type: Bug
>  Components: Database/Schema, Metastore
>Affects Versions: 0.10.0
>Reporter: Lenni Kuff
>Assignee: Zoltan Haindrich
> Attachments: HIVE-6590.1.patch, HIVE-6590.2.patch, HIVE-6590.3.patch
>
>
> Hive does not work properly with boolean partition columns. Queries return 
> wrong results and also insert to incorrect HDFS paths.
> {code}
> create table bool_part(int_col int) partitioned by(bool_col boolean);
> # This works, creating 3 unique partitions!
> ALTER TABLE bool_table ADD PARTITION (bool_col=FALSE);
> ALTER TABLE bool_table ADD PARTITION (bool_col=false);
> ALTER TABLE bool_table ADD PARTITION (bool_col=False);
> {code}
> The first problem is that Hive cannot filter on a bool partition key column. 
> "select * from bool_part" returns the correct results, but if you apply a 
> filter on the bool partition key column hive won't return any results.
> The second problem is that Hive seems to just call "toString()" on the 
> boolean literal value. This means you can end up with multiple partitions 
> (FALSE, false, FaLSE, etc) mapping to the literal value 'FALSE'. For example, 
> if you can add three partition in have for the same logic value "false" doing:
> ALTER TABLE bool_table ADD PARTITION (bool_col=FALSE) -> 
> /test-warehouse/bool_table/bool_col=FALSE/
> ALTER TABLE bool_table ADD PARTITION (bool_col=false) -> 
> /test-warehouse/bool_table/bool_col=false/
> ALTER TABLE bool_table ADD PARTITION (bool_col=False) -> 
> /test-warehouse/bool_table/bool_col=False/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] (HIVE-15653) Some ALTER TABLE commands drop table stats

2017-01-31 Thread Alexander Behm (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847540#comment-15847540
 ] 

Alexander Behm commented on HIVE-15653:
---

Thanks, [~ctang.ma]. I've already added a workaround to Impala, but not using 
StatsSetupConst.DO_NOT_UPDATE_STATS because sometimes we do want to update 
stats (e.g. compute stats).
See: http://gerrit.cloudera.org:8080/5731

> Some ALTER TABLE commands drop table stats
> --
>
> Key: HIVE-15653
> URL: https://issues.apache.org/jira/browse/HIVE-15653
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 1.1.0
>Reporter: Alexander Behm
>Assignee: Chaoyu Tang
>Priority: Critical
> Fix For: 2.2.0
>
> Attachments: HIVE-15653.1.patch, HIVE-15653.2.patch, 
> HIVE-15653.3.patch, HIVE-15653.4.patch, HIVE-15653.5.patch, 
> HIVE-15653.6.patch, HIVE-15653.patch
>
>
> Some ALTER TABLE commands drop the table stats. That may make sense for some 
> ALTER TABLE operations, but certainly not for others. Personally, I I think 
> ALTER TABLE should only change what was requested by the user without any 
> side effects that may be unclear to users. In particular, collecting stats 
> can be an expensive operation so it's rather inconvenient for users if they 
> get wiped accidentally.
> Repro:
> {code}
> create table t (i int);
> insert into t values(1);
> analyze table t compute statistics;
> alter table t set tblproperties('test'='test');
> hive> describe formatted t;
> OK
> # col_namedata_type   comment 
>
> i int 
>
> # Detailed Table Information   
> Database: default  
> Owner:abehm
> CreateTime:   Tue Jan 17 18:13:34 PST 2017 
> LastAccessTime:   UNKNOWN  
> Protect Mode: None 
> Retention:0
> Location: hdfs://localhost:20500/test-warehouse/t  
> Table Type:   MANAGED_TABLE
> Table Parameters:  
>   COLUMN_STATS_ACCURATE   false   
>   last_modified_byabehm   
>   last_modified_time  1484705748  
>   numFiles1   
>   numRows -1  
>   rawDataSize -1  
>   testtest
>   totalSize   2   
>   transient_lastDdlTime   1484705748  
>
> # Storage Information  
> SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
>  
> InputFormat:  org.apache.hadoop.mapred.TextInputFormat 
> OutputFormat: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
> Compressed:   No   
> Num Buckets:  -1   
> Bucket Columns:   []   
> Sort Columns: []   
> Storage Desc Params:   
>   serialization.format1   
> Time taken: 0.169 seconds, Fetched: 34 row(s)
> {code}
> The same behavior can be observed with several other ALTER TABLE commands.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] (HIVE-15653) Some ALTER TABLE commands drop table stats

2017-01-31 Thread Alexander Behm (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847514#comment-15847514
 ] 

Alexander Behm commented on HIVE-15653:
---

Thanks for the info, [~ctang.ma]! I agree the 
alter_table_with_environmentContext() seems useful. 

Is my understanding correct that even after your patch the HMS alter_table() 
will drop stats in some cases?

> Some ALTER TABLE commands drop table stats
> --
>
> Key: HIVE-15653
> URL: https://issues.apache.org/jira/browse/HIVE-15653
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 1.1.0
>Reporter: Alexander Behm
>Assignee: Chaoyu Tang
>Priority: Critical
> Fix For: 2.2.0
>
> Attachments: HIVE-15653.1.patch, HIVE-15653.2.patch, 
> HIVE-15653.3.patch, HIVE-15653.4.patch, HIVE-15653.5.patch, 
> HIVE-15653.6.patch, HIVE-15653.patch
>
>
> Some ALTER TABLE commands drop the table stats. That may make sense for some 
> ALTER TABLE operations, but certainly not for others. Personally, I I think 
> ALTER TABLE should only change what was requested by the user without any 
> side effects that may be unclear to users. In particular, collecting stats 
> can be an expensive operation so it's rather inconvenient for users if they 
> get wiped accidentally.
> Repro:
> {code}
> create table t (i int);
> insert into t values(1);
> analyze table t compute statistics;
> alter table t set tblproperties('test'='test');
> hive> describe formatted t;
> OK
> # col_namedata_type   comment 
>
> i int 
>
> # Detailed Table Information   
> Database: default  
> Owner:abehm
> CreateTime:   Tue Jan 17 18:13:34 PST 2017 
> LastAccessTime:   UNKNOWN  
> Protect Mode: None 
> Retention:0
> Location: hdfs://localhost:20500/test-warehouse/t  
> Table Type:   MANAGED_TABLE
> Table Parameters:  
>   COLUMN_STATS_ACCURATE   false   
>   last_modified_byabehm   
>   last_modified_time  1484705748  
>   numFiles1   
>   numRows -1  
>   rawDataSize -1  
>   testtest
>   totalSize   2   
>   transient_lastDdlTime   1484705748  
>
> # Storage Information  
> SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
>  
> InputFormat:  org.apache.hadoop.mapred.TextInputFormat 
> OutputFormat: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
> Compressed:   No   
> Num Buckets:  -1   
> Bucket Columns:   []   
> Sort Columns: []   
> Storage Desc Params:   
>   serialization.format1   
> Time taken: 0.169 seconds, Fetched: 34 row(s)
> {code}
> The same behavior can be observed with several other ALTER TABLE commands.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15653) Some ALTER TABLE commands drop table stats

2017-01-19 Thread Alexander Behm (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830771#comment-15830771
 ] 

Alexander Behm commented on HIVE-15653:
---

Good to know thanks. Regarding ALTER TABLE SET LOCATION: I suppose it's 
arguable. Imo, we should not add side-effects to ALTER TABLE especially when it 
comes to stats because they are expensive to compute. This is more of a product 
question, so maybe [~grahn] or [~skumar] can weigh in.

> Some ALTER TABLE commands drop table stats
> --
>
> Key: HIVE-15653
> URL: https://issues.apache.org/jira/browse/HIVE-15653
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 1.1.0
>Reporter: Alexander Behm
>Assignee: Chaoyu Tang
>Priority: Critical
> Attachments: HIVE-15653.patch
>
>
> Some ALTER TABLE commands drop the table stats. That may make sense for some 
> ALTER TABLE operations, but certainly not for others. Personally, I I think 
> ALTER TABLE should only change what was requested by the user without any 
> side effects that may be unclear to users. In particular, collecting stats 
> can be an expensive operation so it's rather inconvenient for users if they 
> get wiped accidentally.
> Repro:
> {code}
> create table t (i int);
> insert into t values(1);
> analyze table t compute statistics;
> alter table t set tblproperties('test'='test');
> hive> describe formatted t;
> OK
> # col_namedata_type   comment 
>
> i int 
>
> # Detailed Table Information   
> Database: default  
> Owner:abehm
> CreateTime:   Tue Jan 17 18:13:34 PST 2017 
> LastAccessTime:   UNKNOWN  
> Protect Mode: None 
> Retention:0
> Location: hdfs://localhost:20500/test-warehouse/t  
> Table Type:   MANAGED_TABLE
> Table Parameters:  
>   COLUMN_STATS_ACCURATE   false   
>   last_modified_byabehm   
>   last_modified_time  1484705748  
>   numFiles1   
>   numRows -1  
>   rawDataSize -1  
>   testtest
>   totalSize   2   
>   transient_lastDdlTime   1484705748  
>
> # Storage Information  
> SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
>  
> InputFormat:  org.apache.hadoop.mapred.TextInputFormat 
> OutputFormat: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
> Compressed:   No   
> Num Buckets:  -1   
> Bucket Columns:   []   
> Sort Columns: []   
> Storage Desc Params:   
>   serialization.format1   
> Time taken: 0.169 seconds, Fetched: 34 row(s)
> {code}
> The same behavior can be observed with several other ALTER TABLE commands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HIVE-15653) Some ALTER TABLE commands drop table stats

2017-01-19 Thread Alexander Behm (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830719#comment-15830719
 ] 

Alexander Behm edited comment on HIVE-15653 at 1/19/17 10:28 PM:
-

[~ctang.ma] Impala calls the Metastore API alter_table(). I tried the following 
alterations in Impala and those did wipe the table stats:
ALTER TABLE ADD COLUMNS
ALTER TABLE CHANGE COLUMN
ALTER TABLE SET TBLPROPERTIES
ALTER TABLE SET SERDEPROPERTIES
ALTER TABLE SET LOCATION
ALTER TABLE SET FILEFORMAT
ALTER TABLE SET CACHED

So I would say most ALTER commands do wipe the stats (from Impala). Just trying 
to make sure the fix on the Hive side is complete, i.e. the alter_table() API 
call on the Metastore is fixed and not just the Hive DDL commands.

The ALTER TABLE RENAME command worked fine (preserved table stats).


was (Author: alex.behm):
[~ctang.ma] Impala calls the Metastore API alter_table(). I tried the following 
alterations and those did wipe the table stats:
ALTER TABLE ADD COLUMNS
ALTER TABLE CHANGE COLUMN
ALTER TABLE SET TBLPROPERTIES
ALTER TABLE SET SERDEPROPERTIES
ALTER TABLE SET LOCATION
ALTER TABLE SET FILEFORMAT
ALTER TABLE SET CACHED

So I would say most ALTER commands do wipe the stats. Just trying to make sure 
the fix on the Hive side is complete (i.e. the alter_table() API call on the 
Metastore).

The ALTER TABLE RENAME command worked fine (preserved table stats).

> Some ALTER TABLE commands drop table stats
> --
>
> Key: HIVE-15653
> URL: https://issues.apache.org/jira/browse/HIVE-15653
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 1.1.0
>Reporter: Alexander Behm
>Assignee: Chaoyu Tang
>Priority: Critical
> Attachments: HIVE-15653.patch
>
>
> Some ALTER TABLE commands drop the table stats. That may make sense for some 
> ALTER TABLE operations, but certainly not for others. Personally, I I think 
> ALTER TABLE should only change what was requested by the user without any 
> side effects that may be unclear to users. In particular, collecting stats 
> can be an expensive operation so it's rather inconvenient for users if they 
> get wiped accidentally.
> Repro:
> {code}
> create table t (i int);
> insert into t values(1);
> analyze table t compute statistics;
> alter table t set tblproperties('test'='test');
> hive> describe formatted t;
> OK
> # col_namedata_type   comment 
>
> i int 
>
> # Detailed Table Information   
> Database: default  
> Owner:abehm
> CreateTime:   Tue Jan 17 18:13:34 PST 2017 
> LastAccessTime:   UNKNOWN  
> Protect Mode: None 
> Retention:0
> Location: hdfs://localhost:20500/test-warehouse/t  
> Table Type:   MANAGED_TABLE
> Table Parameters:  
>   COLUMN_STATS_ACCURATE   false   
>   last_modified_byabehm   
>   last_modified_time  1484705748  
>   numFiles1   
>   numRows -1  
>   rawDataSize -1  
>   testtest
>   totalSize   2   
>   transient_lastDdlTime   1484705748  
>
> # Storage Information  
> SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
>  
> InputFormat:  org.apache.hadoop.mapred.TextInputFormat 
> OutputFormat: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
> Compressed:   No   
> Num Buckets:  -1   
> Bucket Columns:   []   
> Sort Columns: []   
> Storage Desc Params:   
>   serialization.format1   
> Time taken: 0.169 seconds, Fetched: 34 row(s)
> {code}
> The same behavior can be observed with several other ALTER TABLE commands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15653) Some ALTER TABLE commands drop table stats

2017-01-19 Thread Alexander Behm (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830719#comment-15830719
 ] 

Alexander Behm commented on HIVE-15653:
---

[~ctang.ma] Impala calls the Metastore API alter_table(). I tried the following 
alterations and those did wipe the table stats:
ALTER TABLE ADD COLUMNS
ALTER TABLE CHANGE COLUMN
ALTER TABLE SET TBLPROPERTIES
ALTER TABLE SET SERDEPROPERTIES
ALTER TABLE SET LOCATION
ALTER TABLE SET FILEFORMAT
ALTER TABLE SET CACHED

So I would say most ALTER commands do wipe the stats. Just trying to make sure 
the fix on the Hive side is complete (i.e. the alter_table() API call on the 
Metastore).

The ALTER TABLE RENAME command worked fine (preserved table stats).

> Some ALTER TABLE commands drop table stats
> --
>
> Key: HIVE-15653
> URL: https://issues.apache.org/jira/browse/HIVE-15653
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 1.1.0
>Reporter: Alexander Behm
>Assignee: Chaoyu Tang
>Priority: Critical
> Attachments: HIVE-15653.patch
>
>
> Some ALTER TABLE commands drop the table stats. That may make sense for some 
> ALTER TABLE operations, but certainly not for others. Personally, I I think 
> ALTER TABLE should only change what was requested by the user without any 
> side effects that may be unclear to users. In particular, collecting stats 
> can be an expensive operation so it's rather inconvenient for users if they 
> get wiped accidentally.
> Repro:
> {code}
> create table t (i int);
> insert into t values(1);
> analyze table t compute statistics;
> alter table t set tblproperties('test'='test');
> hive> describe formatted t;
> OK
> # col_namedata_type   comment 
>
> i int 
>
> # Detailed Table Information   
> Database: default  
> Owner:abehm
> CreateTime:   Tue Jan 17 18:13:34 PST 2017 
> LastAccessTime:   UNKNOWN  
> Protect Mode: None 
> Retention:0
> Location: hdfs://localhost:20500/test-warehouse/t  
> Table Type:   MANAGED_TABLE
> Table Parameters:  
>   COLUMN_STATS_ACCURATE   false   
>   last_modified_byabehm   
>   last_modified_time  1484705748  
>   numFiles1   
>   numRows -1  
>   rawDataSize -1  
>   testtest
>   totalSize   2   
>   transient_lastDdlTime   1484705748  
>
> # Storage Information  
> SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
>  
> InputFormat:  org.apache.hadoop.mapred.TextInputFormat 
> OutputFormat: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
> Compressed:   No   
> Num Buckets:  -1   
> Bucket Columns:   []   
> Sort Columns: []   
> Storage Desc Params:   
>   serialization.format1   
> Time taken: 0.169 seconds, Fetched: 34 row(s)
> {code}
> The same behavior can be observed with several other ALTER TABLE commands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15653) Some ALTER TABLE commands drop table stats

2017-01-17 Thread Alexander Behm (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827334#comment-15827334
 ] 

Alexander Behm commented on HIVE-15653:
---

Note that this problem seems to be specific to unpartitioned tables.
Partitioned tables work ok as far as I can tell.

> Some ALTER TABLE commands drop table stats
> --
>
> Key: HIVE-15653
> URL: https://issues.apache.org/jira/browse/HIVE-15653
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 1.1.0
>Reporter: Alexander Behm
>Priority: Critical
>
> Some ALTER TABLE commands drop the table stats. That may make sense for some 
> ALTER TABLE operations, but certainly not for others. Personally, I I think 
> ALTER TABLE should only change what was requested by the user without any 
> side effects that may be unclear to users. In particular, collecting stats 
> can be an expensive operation so it's rather inconvenient for users if they 
> get wiped accidentally.
> Repro:
> {code}
> create table t (i int);
> insert into t values(1);
> analyze table t compute statistics;
> alter table t set tblproperties('test'='test');
> hive> describe formatted t;
> OK
> # col_namedata_type   comment 
>
> i int 
>
> # Detailed Table Information   
> Database: default  
> Owner:abehm
> CreateTime:   Tue Jan 17 18:13:34 PST 2017 
> LastAccessTime:   UNKNOWN  
> Protect Mode: None 
> Retention:0
> Location: hdfs://localhost:20500/test-warehouse/t  
> Table Type:   MANAGED_TABLE
> Table Parameters:  
>   COLUMN_STATS_ACCURATE   false   
>   last_modified_byabehm   
>   last_modified_time  1484705748  
>   numFiles1   
>   numRows -1  
>   rawDataSize -1  
>   testtest
>   totalSize   2   
>   transient_lastDdlTime   1484705748  
>
> # Storage Information  
> SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
>  
> InputFormat:  org.apache.hadoop.mapred.TextInputFormat 
> OutputFormat: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
> Compressed:   No   
> Num Buckets:  -1   
> Bucket Columns:   []   
> Sort Columns: []   
> Storage Desc Params:   
>   serialization.format1   
> Time taken: 0.169 seconds, Fetched: 34 row(s)
> {code}
> The same behavior can be observed with several other ALTER TABLE commands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-6590) Hive does not work properly with boolean partition columns (wrong results and inserts to incorrect HDFS path)

2016-08-16 Thread Alexander Behm (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422294#comment-15422294
 ] 

Alexander Behm commented on HIVE-6590:
--

A related and severe issue is that ALTER TABLE DROP PARTITION may drop 
partitions that were not specified in the ALTER statement. This could lead to 
accidental data loss.

Reproduction:
{code}
CREATE TABLE broken (c int) PARTITIONED BY (b1 BOOLEAN, s STRING, b2 BOOLEAN, i 
INT);

# Insert a few variants of 'false' partition-key values.
INSERT INTO TABLE broken PARTITION(b1=false,s='a',b2=false,i=0) VALUES(1);
INSERT INTO TABLE broken PARTITION(b1=FALSE,s='a',b2=false,i=0) VALUES(3);
INSERT INTO TABLE broken PARTITION(b1=false,s='a',b2=False,i=0) VALUES(5);
INSERT INTO TABLE broken PARTITION(b1=false,s='a',b2=FalsE,i=0) VALUES(7);

# Insert a few variants of 'true' partition-key values.
INSERT INTO TABLE broken PARTITION(b1=true,s='a',b2=true,i=0) VALUES(2);
INSERT INTO TABLE broken PARTITION(b1=TRUE,s='a',b2=true,i=0) VALUES(4);
INSERT INTO TABLE broken PARTITION(b1=true,s='a',b2=True,i=0) VALUES(6);
INSERT INTO TABLE broken PARTITION(b1=true,s='a',b2=TruE,i=0) VALUES(8);

# Insert a few variants of mixed 'true'/'false' partition-key values.
INSERT INTO TABLE broken PARTITION(b1=false,s='a',b2=true,i=0) VALUES(100);
INSERT INTO TABLE broken PARTITION(b1=FALSE,s='a',b2=TRUE,i=0) VALUES(1000);
INSERT INTO TABLE broken PARTITION(b1=true,s='a',b2=false,i=0) VALUES(1);
INSERT INTO TABLE broken PARTITION(b1=tRUe,s='a',b2=fALSe,i=0) VALUES(10);

# Very broken partition drop.
hive> ALTER TABLE broken DROP PARTITION(b1=true,s='a',b2=true,i=0);
Dropped the partition b1=false/s=a/b2=false/i=0
Dropped the partition b1=false/s=a/b2=False/i=0
Dropped the partition b1=false/s=a/b2=FalsE/i=0
Dropped the partition b1=FALSE/s=a/b2=false/i=0
Dropped the partition b1=false/s=a/b2=true/i=0
Dropped the partition b1=FALSE/s=a/b2=TRUE/i=0
Dropped the partition b1=true/s=a/b2=false/i=0
Dropped the partition b1=tRUe/s=a/b2=fALSe/i=0
Dropped the partition b1=true/s=a/b2=true/i=0
Dropped the partition b1=true/s=a/b2=True/i=0
Dropped the partition b1=true/s=a/b2=TruE/i=0
Dropped the partition b1=TRUE/s=a/b2=true/i=0
OK
Time taken: 1.387 seconds
{code}


> Hive does not work properly with boolean partition columns (wrong results and 
> inserts to incorrect HDFS path)
> -
>
> Key: HIVE-6590
> URL: https://issues.apache.org/jira/browse/HIVE-6590
> Project: Hive
>  Issue Type: Bug
>  Components: Database/Schema, Metastore
>Affects Versions: 0.10.0
>Reporter: Lenni Kuff
>
> Hive does not work properly with boolean partition columns. Queries return 
> wrong results and also insert to incorrect HDFS paths.
> {code}
> create table bool_part(int_col int) partitioned by(bool_col boolean);
> # This works, creating 3 unique partitions!
> ALTER TABLE bool_table ADD PARTITION (bool_col=FALSE);
> ALTER TABLE bool_table ADD PARTITION (bool_col=false);
> ALTER TABLE bool_table ADD PARTITION (bool_col=False);
> {code}
> The first problem is that Hive cannot filter on a bool partition key column. 
> "select * from bool_part" returns the correct results, but if you apply a 
> filter on the bool partition key column hive won't return any results.
> The second problem is that Hive seems to just call "toString()" on the 
> boolean literal value. This means you can end up with multiple partitions 
> (FALSE, false, FaLSE, etc) mapping to the literal value 'FALSE'. For example, 
> if you can add three partition in have for the same logic value "false" doing:
> ALTER TABLE bool_table ADD PARTITION (bool_col=FALSE) -> 
> /test-warehouse/bool_table/bool_col=FALSE/
> ALTER TABLE bool_table ADD PARTITION (bool_col=false) -> 
> /test-warehouse/bool_table/bool_col=false/
> ALTER TABLE bool_table ADD PARTITION (bool_col=False) -> 
> /test-warehouse/bool_table/bool_col=False/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10427) collect_list() and collect_set() should accept struct types as argument

2015-09-23 Thread Alexander Behm (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905854#comment-14905854
 ] 

Alexander Behm commented on HIVE-10427:
---

[~csun], [~lskuff], I was hoping to get this in for CDH 5.5 in the context of 
nested types support in Impala. I tried collect_list() on a nightly cluster 
just now, and the fix doesn't seem to be in. This is what I get:

{code}
hive> select collect_list(named_struct("f1", i)) from test;
FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but 
struct was passed as parameter 1.
{code}


> collect_list() and collect_set() should accept struct types as argument
> ---
>
> Key: HIVE-10427
> URL: https://issues.apache.org/jira/browse/HIVE-10427
> Project: Hive
>  Issue Type: Wish
>  Components: UDF
>Reporter: Alexander Behm
>Assignee: Chao Sun
>  Labels: TODOC1.3
> Fix For: 1.3.0, 2.0.0
>
> Attachments: HIVE-10427.1.patch, HIVE-10427.2.patch, 
> HIVE-10427.3.patch, HIVE-10427.4.patch
>
>
> The collect_list() and collect_set() functions currently only accept scalar 
> argument types. It would be very useful if these functions could also accept 
> struct argument types for creating nested data from flat data.
> For example, suppose I wanted to create a nested customers/orders table from 
> two flat tables, customers and orders. Then it'd be very convenient to write 
> something like this:
> {code}
> insert into table nested_customers_orders
> select c.*, collect_list(named_struct("oid", o.oid, "order_date": o.date...))
> from customers c inner join orders o on (c.cid = o.oid)
> group by c.cid
> {code}
> Thanks you for your consideration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-11622) Creating an Avro table with a complex map-typed column leads to incorrect column type.

2015-08-22 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm updated HIVE-11622:
--
Description: 
In the following CREATE TABLE the following map-typed column leads to the wrong 
type. I suspect some problem with inferring the Avro schema from the column 
definitions, but I am not sure.

Reproduction:
{code}
hive create table t (c mapstring,arrayint) stored as avro;
OK
Time taken: 0.101 seconds
hive desc t;
OK
c   arraymapstring,int  from deserializer   
Time taken: 0.135 seconds, Fetched: 1 row(s)
{code}

Note how the type shown in DESCRIBE is not the type originally passed in the 
CREATE TABLE.

However, *sometimes* the DESCRIBE shows the correct output. You may also try 
these steps which produce a similar problem to increase the chance of hitting 
this issue:

{code}
hive create table t (c arraymapstring,int) stored as avro;
OK
Time taken: 0.063 seconds
hive desc t;
OK
c   mapstring,arrayint  from deserializer   
Time taken: 0.152 seconds, Fetched: 1 row(s)
{code}

  was:
In the following CREATE TABLE the following map-typed column leads to the wrong 
type. I suspect some problem with inferring the Avro schema from the column 
definitions, but I am not sure.

Reproduction:
{code}
hive create table t (c mapstring,arrayint) stored as avro;
OK
Time taken: 0.101 seconds
hive desc t;
OK
c   arraymapstring,int  from deserializer   
Time taken: 0.135 seconds, Fetched: 1 row(s)
{code}

Note how the type shown in DESCRIBE is not the type originally passed in the 
CREATE TABLE.



 Creating an Avro table with a complex map-typed column leads to incorrect 
 column type.
 --

 Key: HIVE-11622
 URL: https://issues.apache.org/jira/browse/HIVE-11622
 Project: Hive
  Issue Type: Bug
  Components: Database/Schema
Affects Versions: 1.1.0
Reporter: Alexander Behm

 In the following CREATE TABLE the following map-typed column leads to the 
 wrong type. I suspect some problem with inferring the Avro schema from the 
 column definitions, but I am not sure.
 Reproduction:
 {code}
 hive create table t (c mapstring,arrayint) stored as avro;
 OK
 Time taken: 0.101 seconds
 hive desc t;
 OK
 c arraymapstring,int  from deserializer   
 Time taken: 0.135 seconds, Fetched: 1 row(s)
 {code}
 Note how the type shown in DESCRIBE is not the type originally passed in the 
 CREATE TABLE.
 However, *sometimes* the DESCRIBE shows the correct output. You may also try 
 these steps which produce a similar problem to increase the chance of hitting 
 this issue:
 {code}
 hive create table t (c arraymapstring,int) stored as avro;
 OK
 Time taken: 0.063 seconds
 hive desc t;
 OK
 c mapstring,arrayint  from deserializer   
 Time taken: 0.152 seconds, Fetched: 1 row(s)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)