[jira] [Comment Edited] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.
[ https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370276#comment-16370276 ] Alexander Behm edited comment on HIVE-18743 at 2/20/18 5:06 PM: Thank you, [~kgyrtkirk]. I agree completely. I'm very much in favor of getting rid of all non-obvious side effects of Metastore API calls. Stats collection is one of those side effects. As is today, it is very hard to reason about what exactly the Metastore will do how expensive API calls are. was (Author: alex.behm): Thanks you, [~kgyrtkirk]. I agree completely. I'm very much in favor of getting rid of all non-obvious side effects of Metastore API calls. Stats collection is one of those side effects. As is today, it is very hard to reason about what exactly the Metastore will do how expensive API calls are. > CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround > is buggy. > --- > > Key: HIVE-18743 > URL: https://issues.apache.org/jira/browse/HIVE-18743 > Project: Hive > Issue Type: Improvement > Components: Metastore >Affects Versions: 1.2.0, 1.1.0 >Reporter: Alexander Behm >Assignee: Alexander Kolbasov >Priority: Major > > When hive.stats.autogather=true then the Metastore lists all files under the > table directory to populate basic stats like file counts and sizes. This file > listing operation can be very expensive particularly on filesystems like S3. > One way to address this issue is to reconfigure hive.stats.autogather=false. > *Here's the bug* > It is my understanding that the DO_NOT_UPDATE_STATS table property is > intended to selectively prevent this stats collection. Unfortunately, this > table property is checked *after* the expensive file listing operation, so > the DO_NOT_UPDATE_STATS does not seem to work as intended. See: > https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L633 > Relevant code snippet: > {code} > public static boolean updateTableStatsFast(Database db, Table tbl, > Warehouse wh, > boolean madeDir, boolean > forceRecompute, EnvironmentContext environmentContext) throws MetaException { > if (tbl.getPartitionKeysSize() == 0) { > // Update stats only when unpartitioned > FileStatus[] fileStatuses = wh.getFileStatusesForUnpartitionedTable(db, > tbl); > return updateTableStatsFast(tbl, fileStatuses, madeDir, forceRecompute, > environmentContext); <--- DO_NOT_UPDATE_STATS is checked in here after > wh.getFileStatusesForUnpartitionedTable() has already been called > } else { > return false; > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.
[ https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370276#comment-16370276 ] Alexander Behm commented on HIVE-18743: --- Thanks you, [~kgyrtkirk]. I agree completely. I'm very much in favor of getting rid of all non-obvious side effects of Metastore API calls. Stats collection is one of those side effects. As is today, it is very hard to reason about what exactly the Metastore will do how expensive API calls are. > CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround > is buggy. > --- > > Key: HIVE-18743 > URL: https://issues.apache.org/jira/browse/HIVE-18743 > Project: Hive > Issue Type: Improvement > Components: Metastore >Affects Versions: 1.2.0, 1.1.0 >Reporter: Alexander Behm >Assignee: Alexander Kolbasov >Priority: Major > > When hive.stats.autogather=true then the Metastore lists all files under the > table directory to populate basic stats like file counts and sizes. This file > listing operation can be very expensive particularly on filesystems like S3. > One way to address this issue is to reconfigure hive.stats.autogather=false. > *Here's the bug* > It is my understanding that the DO_NOT_UPDATE_STATS table property is > intended to selectively prevent this stats collection. Unfortunately, this > table property is checked *after* the expensive file listing operation, so > the DO_NOT_UPDATE_STATS does not seem to work as intended. See: > https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L633 > Relevant code snippet: > {code} > public static boolean updateTableStatsFast(Database db, Table tbl, > Warehouse wh, > boolean madeDir, boolean > forceRecompute, EnvironmentContext environmentContext) throws MetaException { > if (tbl.getPartitionKeysSize() == 0) { > // Update stats only when unpartitioned > FileStatus[] fileStatuses = wh.getFileStatusesForUnpartitionedTable(db, > tbl); > return updateTableStatsFast(tbl, fileStatuses, madeDir, forceRecompute, > environmentContext); <--- DO_NOT_UPDATE_STATS is checked in here after > wh.getFileStatusesForUnpartitionedTable() has already been called > } else { > return false; > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-15670) column_stats_accurate may not fit in PARTITION_PARAMS.VALUE
[ https://issues.apache.org/jira/browse/HIVE-15670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223239#comment-16223239 ] Alexander Behm commented on HIVE-15670: --- Thanks for the response. From the perspective of a client issuing an RPC to alter column stats, it seems like a rather questionable side-effect to also alter the table metadata with a pretty big payload. Instead of "fixing" this issue by changing the database schema, could we instead remove the JSON string altogether? I'm definitely not familiar with the implementation details, just trying to provide a perspective from a Metastore client that is not Hive. > column_stats_accurate may not fit in PARTITION_PARAMS.VALUE > --- > > Key: HIVE-15670 > URL: https://issues.apache.org/jira/browse/HIVE-15670 > Project: Hive > Issue Type: Bug >Reporter: Sergey Shelukhin > > The JSON can be too big with many columns (see setColumnStatsState method). > We can make JSON more compact by only storing the list of columns with true > values. Or we can even store a bitmask in a dedicated column, and adjust it > when altering table (rare enough). Or we can just change the VALUE column to > text blob (might be a painful change wrt upgrade scripts, and supporting all > the DBs' varied blob implementations, esp. in directsql). > Storing denormalized flags in a separate table will probably be slow, > comparatively. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-15670) column_stats_accurate may not fit in PARTITION_PARAMS.VALUE
[ https://issues.apache.org/jira/browse/HIVE-15670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217912#comment-16217912 ] Alexander Behm commented on HIVE-15670: --- May I ask what's the purpose of storing this JSON in the tableproperties? Seems pretty expensive to me. If you want to keep track of the accuracy of column stats, why not populate a "last updated" timestamp in the appropriate column statistic? > column_stats_accurate may not fit in PARTITION_PARAMS.VALUE > --- > > Key: HIVE-15670 > URL: https://issues.apache.org/jira/browse/HIVE-15670 > Project: Hive > Issue Type: Bug >Reporter: Sergey Shelukhin > > The JSON can be too big with many columns (see setColumnStatsState method). > We can make JSON more compact by only storing the list of columns with true > values. Or we can even store a bitmask in a dedicated column, and adjust it > when altering table (rare enough). Or we can just change the VALUE column to > text blob (might be a painful change wrt upgrade scripts, and supporting all > the DBs' varied blob implementations, esp. in directsql). > Storing denormalized flags in a separate table will probably be slow, > comparatively. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-15653) Some ALTER TABLE commands drop table stats
[ https://issues.apache.org/jira/browse/HIVE-15653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172752#comment-16172752 ] Alexander Behm commented on HIVE-15653: --- [~ctang.ma], am I understanding correctly that there is no interest in fixing this issue on the Metastore side? I understand that all clients can pass DO_NOT_UPDATE_STATS. > Some ALTER TABLE commands drop table stats > -- > > Key: HIVE-15653 > URL: https://issues.apache.org/jira/browse/HIVE-15653 > Project: Hive > Issue Type: Bug > Components: Metastore, Statistics >Affects Versions: 1.1.0 >Reporter: Alexander Behm >Assignee: Chaoyu Tang >Priority: Critical > Fix For: 2.3.0 > > Attachments: HIVE-15653.1.patch, HIVE-15653.2.patch, > HIVE-15653.3.patch, HIVE-15653.4.patch, HIVE-15653.5.patch, > HIVE-15653.6.patch, HIVE-15653.patch > > > Some ALTER TABLE commands drop the table stats. That may make sense for some > ALTER TABLE operations, but certainly not for others. Personally, I I think > ALTER TABLE should only change what was requested by the user without any > side effects that may be unclear to users. In particular, collecting stats > can be an expensive operation so it's rather inconvenient for users if they > get wiped accidentally. > Repro: > {code} > create table t (i int); > insert into t values(1); > analyze table t compute statistics; > alter table t set tblproperties('test'='test'); > hive> describe formatted t; > OK > # col_namedata_type comment > > i int > > # Detailed Table Information > Database: default > Owner:abehm > CreateTime: Tue Jan 17 18:13:34 PST 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention:0 > Location: hdfs://localhost:20500/test-warehouse/t > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE false > last_modified_byabehm > last_modified_time 1484705748 > numFiles1 > numRows -1 > rawDataSize -1 > testtest > totalSize 2 > transient_lastDdlTime 1484705748 > > # Storage Information > SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > > InputFormat: org.apache.hadoop.mapred.TextInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Compressed: No > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.169 seconds, Fetched: 34 row(s) > {code} > The same behavior can be observed with several other ALTER TABLE commands. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-6590) Hive does not work properly with boolean partition columns (wrong results and inserts to incorrect HDFS path)
[ https://issues.apache.org/jira/browse/HIVE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868306#comment-15868306 ] Alexander Behm commented on HIVE-6590: -- [~kgyrtkirk], can you explain how your new proposed behavior is better? It seems like there is still a fundamental issue that partition-key values do not have a canonical string representation, so you could still have multiple HDFS paths referring to the same logical partition. > Hive does not work properly with boolean partition columns (wrong results and > inserts to incorrect HDFS path) > - > > Key: HIVE-6590 > URL: https://issues.apache.org/jira/browse/HIVE-6590 > Project: Hive > Issue Type: Bug > Components: Database/Schema, Metastore >Affects Versions: 0.10.0 >Reporter: Lenni Kuff >Assignee: Zoltan Haindrich > Attachments: HIVE-6590.1.patch, HIVE-6590.2.patch, HIVE-6590.3.patch > > > Hive does not work properly with boolean partition columns. Queries return > wrong results and also insert to incorrect HDFS paths. > {code} > create table bool_part(int_col int) partitioned by(bool_col boolean); > # This works, creating 3 unique partitions! > ALTER TABLE bool_table ADD PARTITION (bool_col=FALSE); > ALTER TABLE bool_table ADD PARTITION (bool_col=false); > ALTER TABLE bool_table ADD PARTITION (bool_col=False); > {code} > The first problem is that Hive cannot filter on a bool partition key column. > "select * from bool_part" returns the correct results, but if you apply a > filter on the bool partition key column hive won't return any results. > The second problem is that Hive seems to just call "toString()" on the > boolean literal value. This means you can end up with multiple partitions > (FALSE, false, FaLSE, etc) mapping to the literal value 'FALSE'. For example, > if you can add three partition in have for the same logic value "false" doing: > ALTER TABLE bool_table ADD PARTITION (bool_col=FALSE) -> > /test-warehouse/bool_table/bool_col=FALSE/ > ALTER TABLE bool_table ADD PARTITION (bool_col=false) -> > /test-warehouse/bool_table/bool_col=false/ > ALTER TABLE bool_table ADD PARTITION (bool_col=False) -> > /test-warehouse/bool_table/bool_col=False/ -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] (HIVE-15653) Some ALTER TABLE commands drop table stats
[ https://issues.apache.org/jira/browse/HIVE-15653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847540#comment-15847540 ] Alexander Behm commented on HIVE-15653: --- Thanks, [~ctang.ma]. I've already added a workaround to Impala, but not using StatsSetupConst.DO_NOT_UPDATE_STATS because sometimes we do want to update stats (e.g. compute stats). See: http://gerrit.cloudera.org:8080/5731 > Some ALTER TABLE commands drop table stats > -- > > Key: HIVE-15653 > URL: https://issues.apache.org/jira/browse/HIVE-15653 > Project: Hive > Issue Type: Bug > Components: Metastore >Affects Versions: 1.1.0 >Reporter: Alexander Behm >Assignee: Chaoyu Tang >Priority: Critical > Fix For: 2.2.0 > > Attachments: HIVE-15653.1.patch, HIVE-15653.2.patch, > HIVE-15653.3.patch, HIVE-15653.4.patch, HIVE-15653.5.patch, > HIVE-15653.6.patch, HIVE-15653.patch > > > Some ALTER TABLE commands drop the table stats. That may make sense for some > ALTER TABLE operations, but certainly not for others. Personally, I I think > ALTER TABLE should only change what was requested by the user without any > side effects that may be unclear to users. In particular, collecting stats > can be an expensive operation so it's rather inconvenient for users if they > get wiped accidentally. > Repro: > {code} > create table t (i int); > insert into t values(1); > analyze table t compute statistics; > alter table t set tblproperties('test'='test'); > hive> describe formatted t; > OK > # col_namedata_type comment > > i int > > # Detailed Table Information > Database: default > Owner:abehm > CreateTime: Tue Jan 17 18:13:34 PST 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention:0 > Location: hdfs://localhost:20500/test-warehouse/t > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE false > last_modified_byabehm > last_modified_time 1484705748 > numFiles1 > numRows -1 > rawDataSize -1 > testtest > totalSize 2 > transient_lastDdlTime 1484705748 > > # Storage Information > SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > > InputFormat: org.apache.hadoop.mapred.TextInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Compressed: No > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.169 seconds, Fetched: 34 row(s) > {code} > The same behavior can be observed with several other ALTER TABLE commands. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] (HIVE-15653) Some ALTER TABLE commands drop table stats
[ https://issues.apache.org/jira/browse/HIVE-15653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847514#comment-15847514 ] Alexander Behm commented on HIVE-15653: --- Thanks for the info, [~ctang.ma]! I agree the alter_table_with_environmentContext() seems useful. Is my understanding correct that even after your patch the HMS alter_table() will drop stats in some cases? > Some ALTER TABLE commands drop table stats > -- > > Key: HIVE-15653 > URL: https://issues.apache.org/jira/browse/HIVE-15653 > Project: Hive > Issue Type: Bug > Components: Metastore >Affects Versions: 1.1.0 >Reporter: Alexander Behm >Assignee: Chaoyu Tang >Priority: Critical > Fix For: 2.2.0 > > Attachments: HIVE-15653.1.patch, HIVE-15653.2.patch, > HIVE-15653.3.patch, HIVE-15653.4.patch, HIVE-15653.5.patch, > HIVE-15653.6.patch, HIVE-15653.patch > > > Some ALTER TABLE commands drop the table stats. That may make sense for some > ALTER TABLE operations, but certainly not for others. Personally, I I think > ALTER TABLE should only change what was requested by the user without any > side effects that may be unclear to users. In particular, collecting stats > can be an expensive operation so it's rather inconvenient for users if they > get wiped accidentally. > Repro: > {code} > create table t (i int); > insert into t values(1); > analyze table t compute statistics; > alter table t set tblproperties('test'='test'); > hive> describe formatted t; > OK > # col_namedata_type comment > > i int > > # Detailed Table Information > Database: default > Owner:abehm > CreateTime: Tue Jan 17 18:13:34 PST 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention:0 > Location: hdfs://localhost:20500/test-warehouse/t > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE false > last_modified_byabehm > last_modified_time 1484705748 > numFiles1 > numRows -1 > rawDataSize -1 > testtest > totalSize 2 > transient_lastDdlTime 1484705748 > > # Storage Information > SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > > InputFormat: org.apache.hadoop.mapred.TextInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Compressed: No > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.169 seconds, Fetched: 34 row(s) > {code} > The same behavior can be observed with several other ALTER TABLE commands. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (HIVE-15653) Some ALTER TABLE commands drop table stats
[ https://issues.apache.org/jira/browse/HIVE-15653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830771#comment-15830771 ] Alexander Behm commented on HIVE-15653: --- Good to know thanks. Regarding ALTER TABLE SET LOCATION: I suppose it's arguable. Imo, we should not add side-effects to ALTER TABLE especially when it comes to stats because they are expensive to compute. This is more of a product question, so maybe [~grahn] or [~skumar] can weigh in. > Some ALTER TABLE commands drop table stats > -- > > Key: HIVE-15653 > URL: https://issues.apache.org/jira/browse/HIVE-15653 > Project: Hive > Issue Type: Bug > Components: Metastore >Affects Versions: 1.1.0 >Reporter: Alexander Behm >Assignee: Chaoyu Tang >Priority: Critical > Attachments: HIVE-15653.patch > > > Some ALTER TABLE commands drop the table stats. That may make sense for some > ALTER TABLE operations, but certainly not for others. Personally, I I think > ALTER TABLE should only change what was requested by the user without any > side effects that may be unclear to users. In particular, collecting stats > can be an expensive operation so it's rather inconvenient for users if they > get wiped accidentally. > Repro: > {code} > create table t (i int); > insert into t values(1); > analyze table t compute statistics; > alter table t set tblproperties('test'='test'); > hive> describe formatted t; > OK > # col_namedata_type comment > > i int > > # Detailed Table Information > Database: default > Owner:abehm > CreateTime: Tue Jan 17 18:13:34 PST 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention:0 > Location: hdfs://localhost:20500/test-warehouse/t > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE false > last_modified_byabehm > last_modified_time 1484705748 > numFiles1 > numRows -1 > rawDataSize -1 > testtest > totalSize 2 > transient_lastDdlTime 1484705748 > > # Storage Information > SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > > InputFormat: org.apache.hadoop.mapred.TextInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Compressed: No > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.169 seconds, Fetched: 34 row(s) > {code} > The same behavior can be observed with several other ALTER TABLE commands. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HIVE-15653) Some ALTER TABLE commands drop table stats
[ https://issues.apache.org/jira/browse/HIVE-15653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830719#comment-15830719 ] Alexander Behm edited comment on HIVE-15653 at 1/19/17 10:28 PM: - [~ctang.ma] Impala calls the Metastore API alter_table(). I tried the following alterations in Impala and those did wipe the table stats: ALTER TABLE ADD COLUMNS ALTER TABLE CHANGE COLUMN ALTER TABLE SET TBLPROPERTIES ALTER TABLE SET SERDEPROPERTIES ALTER TABLE SET LOCATION ALTER TABLE SET FILEFORMAT ALTER TABLE SET CACHED So I would say most ALTER commands do wipe the stats (from Impala). Just trying to make sure the fix on the Hive side is complete, i.e. the alter_table() API call on the Metastore is fixed and not just the Hive DDL commands. The ALTER TABLE RENAME command worked fine (preserved table stats). was (Author: alex.behm): [~ctang.ma] Impala calls the Metastore API alter_table(). I tried the following alterations and those did wipe the table stats: ALTER TABLE ADD COLUMNS ALTER TABLE CHANGE COLUMN ALTER TABLE SET TBLPROPERTIES ALTER TABLE SET SERDEPROPERTIES ALTER TABLE SET LOCATION ALTER TABLE SET FILEFORMAT ALTER TABLE SET CACHED So I would say most ALTER commands do wipe the stats. Just trying to make sure the fix on the Hive side is complete (i.e. the alter_table() API call on the Metastore). The ALTER TABLE RENAME command worked fine (preserved table stats). > Some ALTER TABLE commands drop table stats > -- > > Key: HIVE-15653 > URL: https://issues.apache.org/jira/browse/HIVE-15653 > Project: Hive > Issue Type: Bug > Components: Metastore >Affects Versions: 1.1.0 >Reporter: Alexander Behm >Assignee: Chaoyu Tang >Priority: Critical > Attachments: HIVE-15653.patch > > > Some ALTER TABLE commands drop the table stats. That may make sense for some > ALTER TABLE operations, but certainly not for others. Personally, I I think > ALTER TABLE should only change what was requested by the user without any > side effects that may be unclear to users. In particular, collecting stats > can be an expensive operation so it's rather inconvenient for users if they > get wiped accidentally. > Repro: > {code} > create table t (i int); > insert into t values(1); > analyze table t compute statistics; > alter table t set tblproperties('test'='test'); > hive> describe formatted t; > OK > # col_namedata_type comment > > i int > > # Detailed Table Information > Database: default > Owner:abehm > CreateTime: Tue Jan 17 18:13:34 PST 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention:0 > Location: hdfs://localhost:20500/test-warehouse/t > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE false > last_modified_byabehm > last_modified_time 1484705748 > numFiles1 > numRows -1 > rawDataSize -1 > testtest > totalSize 2 > transient_lastDdlTime 1484705748 > > # Storage Information > SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > > InputFormat: org.apache.hadoop.mapred.TextInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Compressed: No > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.169 seconds, Fetched: 34 row(s) > {code} > The same behavior can be observed with several other ALTER TABLE commands. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15653) Some ALTER TABLE commands drop table stats
[ https://issues.apache.org/jira/browse/HIVE-15653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830719#comment-15830719 ] Alexander Behm commented on HIVE-15653: --- [~ctang.ma] Impala calls the Metastore API alter_table(). I tried the following alterations and those did wipe the table stats: ALTER TABLE ADD COLUMNS ALTER TABLE CHANGE COLUMN ALTER TABLE SET TBLPROPERTIES ALTER TABLE SET SERDEPROPERTIES ALTER TABLE SET LOCATION ALTER TABLE SET FILEFORMAT ALTER TABLE SET CACHED So I would say most ALTER commands do wipe the stats. Just trying to make sure the fix on the Hive side is complete (i.e. the alter_table() API call on the Metastore). The ALTER TABLE RENAME command worked fine (preserved table stats). > Some ALTER TABLE commands drop table stats > -- > > Key: HIVE-15653 > URL: https://issues.apache.org/jira/browse/HIVE-15653 > Project: Hive > Issue Type: Bug > Components: Metastore >Affects Versions: 1.1.0 >Reporter: Alexander Behm >Assignee: Chaoyu Tang >Priority: Critical > Attachments: HIVE-15653.patch > > > Some ALTER TABLE commands drop the table stats. That may make sense for some > ALTER TABLE operations, but certainly not for others. Personally, I I think > ALTER TABLE should only change what was requested by the user without any > side effects that may be unclear to users. In particular, collecting stats > can be an expensive operation so it's rather inconvenient for users if they > get wiped accidentally. > Repro: > {code} > create table t (i int); > insert into t values(1); > analyze table t compute statistics; > alter table t set tblproperties('test'='test'); > hive> describe formatted t; > OK > # col_namedata_type comment > > i int > > # Detailed Table Information > Database: default > Owner:abehm > CreateTime: Tue Jan 17 18:13:34 PST 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention:0 > Location: hdfs://localhost:20500/test-warehouse/t > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE false > last_modified_byabehm > last_modified_time 1484705748 > numFiles1 > numRows -1 > rawDataSize -1 > testtest > totalSize 2 > transient_lastDdlTime 1484705748 > > # Storage Information > SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > > InputFormat: org.apache.hadoop.mapred.TextInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Compressed: No > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.169 seconds, Fetched: 34 row(s) > {code} > The same behavior can be observed with several other ALTER TABLE commands. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15653) Some ALTER TABLE commands drop table stats
[ https://issues.apache.org/jira/browse/HIVE-15653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827334#comment-15827334 ] Alexander Behm commented on HIVE-15653: --- Note that this problem seems to be specific to unpartitioned tables. Partitioned tables work ok as far as I can tell. > Some ALTER TABLE commands drop table stats > -- > > Key: HIVE-15653 > URL: https://issues.apache.org/jira/browse/HIVE-15653 > Project: Hive > Issue Type: Bug > Components: Metastore >Affects Versions: 1.1.0 >Reporter: Alexander Behm >Priority: Critical > > Some ALTER TABLE commands drop the table stats. That may make sense for some > ALTER TABLE operations, but certainly not for others. Personally, I I think > ALTER TABLE should only change what was requested by the user without any > side effects that may be unclear to users. In particular, collecting stats > can be an expensive operation so it's rather inconvenient for users if they > get wiped accidentally. > Repro: > {code} > create table t (i int); > insert into t values(1); > analyze table t compute statistics; > alter table t set tblproperties('test'='test'); > hive> describe formatted t; > OK > # col_namedata_type comment > > i int > > # Detailed Table Information > Database: default > Owner:abehm > CreateTime: Tue Jan 17 18:13:34 PST 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention:0 > Location: hdfs://localhost:20500/test-warehouse/t > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE false > last_modified_byabehm > last_modified_time 1484705748 > numFiles1 > numRows -1 > rawDataSize -1 > testtest > totalSize 2 > transient_lastDdlTime 1484705748 > > # Storage Information > SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > > InputFormat: org.apache.hadoop.mapred.TextInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Compressed: No > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.169 seconds, Fetched: 34 row(s) > {code} > The same behavior can be observed with several other ALTER TABLE commands. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-6590) Hive does not work properly with boolean partition columns (wrong results and inserts to incorrect HDFS path)
[ https://issues.apache.org/jira/browse/HIVE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422294#comment-15422294 ] Alexander Behm commented on HIVE-6590: -- A related and severe issue is that ALTER TABLE DROP PARTITION may drop partitions that were not specified in the ALTER statement. This could lead to accidental data loss. Reproduction: {code} CREATE TABLE broken (c int) PARTITIONED BY (b1 BOOLEAN, s STRING, b2 BOOLEAN, i INT); # Insert a few variants of 'false' partition-key values. INSERT INTO TABLE broken PARTITION(b1=false,s='a',b2=false,i=0) VALUES(1); INSERT INTO TABLE broken PARTITION(b1=FALSE,s='a',b2=false,i=0) VALUES(3); INSERT INTO TABLE broken PARTITION(b1=false,s='a',b2=False,i=0) VALUES(5); INSERT INTO TABLE broken PARTITION(b1=false,s='a',b2=FalsE,i=0) VALUES(7); # Insert a few variants of 'true' partition-key values. INSERT INTO TABLE broken PARTITION(b1=true,s='a',b2=true,i=0) VALUES(2); INSERT INTO TABLE broken PARTITION(b1=TRUE,s='a',b2=true,i=0) VALUES(4); INSERT INTO TABLE broken PARTITION(b1=true,s='a',b2=True,i=0) VALUES(6); INSERT INTO TABLE broken PARTITION(b1=true,s='a',b2=TruE,i=0) VALUES(8); # Insert a few variants of mixed 'true'/'false' partition-key values. INSERT INTO TABLE broken PARTITION(b1=false,s='a',b2=true,i=0) VALUES(100); INSERT INTO TABLE broken PARTITION(b1=FALSE,s='a',b2=TRUE,i=0) VALUES(1000); INSERT INTO TABLE broken PARTITION(b1=true,s='a',b2=false,i=0) VALUES(1); INSERT INTO TABLE broken PARTITION(b1=tRUe,s='a',b2=fALSe,i=0) VALUES(10); # Very broken partition drop. hive> ALTER TABLE broken DROP PARTITION(b1=true,s='a',b2=true,i=0); Dropped the partition b1=false/s=a/b2=false/i=0 Dropped the partition b1=false/s=a/b2=False/i=0 Dropped the partition b1=false/s=a/b2=FalsE/i=0 Dropped the partition b1=FALSE/s=a/b2=false/i=0 Dropped the partition b1=false/s=a/b2=true/i=0 Dropped the partition b1=FALSE/s=a/b2=TRUE/i=0 Dropped the partition b1=true/s=a/b2=false/i=0 Dropped the partition b1=tRUe/s=a/b2=fALSe/i=0 Dropped the partition b1=true/s=a/b2=true/i=0 Dropped the partition b1=true/s=a/b2=True/i=0 Dropped the partition b1=true/s=a/b2=TruE/i=0 Dropped the partition b1=TRUE/s=a/b2=true/i=0 OK Time taken: 1.387 seconds {code} > Hive does not work properly with boolean partition columns (wrong results and > inserts to incorrect HDFS path) > - > > Key: HIVE-6590 > URL: https://issues.apache.org/jira/browse/HIVE-6590 > Project: Hive > Issue Type: Bug > Components: Database/Schema, Metastore >Affects Versions: 0.10.0 >Reporter: Lenni Kuff > > Hive does not work properly with boolean partition columns. Queries return > wrong results and also insert to incorrect HDFS paths. > {code} > create table bool_part(int_col int) partitioned by(bool_col boolean); > # This works, creating 3 unique partitions! > ALTER TABLE bool_table ADD PARTITION (bool_col=FALSE); > ALTER TABLE bool_table ADD PARTITION (bool_col=false); > ALTER TABLE bool_table ADD PARTITION (bool_col=False); > {code} > The first problem is that Hive cannot filter on a bool partition key column. > "select * from bool_part" returns the correct results, but if you apply a > filter on the bool partition key column hive won't return any results. > The second problem is that Hive seems to just call "toString()" on the > boolean literal value. This means you can end up with multiple partitions > (FALSE, false, FaLSE, etc) mapping to the literal value 'FALSE'. For example, > if you can add three partition in have for the same logic value "false" doing: > ALTER TABLE bool_table ADD PARTITION (bool_col=FALSE) -> > /test-warehouse/bool_table/bool_col=FALSE/ > ALTER TABLE bool_table ADD PARTITION (bool_col=false) -> > /test-warehouse/bool_table/bool_col=false/ > ALTER TABLE bool_table ADD PARTITION (bool_col=False) -> > /test-warehouse/bool_table/bool_col=False/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10427) collect_list() and collect_set() should accept struct types as argument
[ https://issues.apache.org/jira/browse/HIVE-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905854#comment-14905854 ] Alexander Behm commented on HIVE-10427: --- [~csun], [~lskuff], I was hoping to get this in for CDH 5.5 in the context of nested types support in Impala. I tried collect_list() on a nightly cluster just now, and the fix doesn't seem to be in. This is what I get: {code} hive> select collect_list(named_struct("f1", i)) from test; FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but struct was passed as parameter 1. {code} > collect_list() and collect_set() should accept struct types as argument > --- > > Key: HIVE-10427 > URL: https://issues.apache.org/jira/browse/HIVE-10427 > Project: Hive > Issue Type: Wish > Components: UDF >Reporter: Alexander Behm >Assignee: Chao Sun > Labels: TODOC1.3 > Fix For: 1.3.0, 2.0.0 > > Attachments: HIVE-10427.1.patch, HIVE-10427.2.patch, > HIVE-10427.3.patch, HIVE-10427.4.patch > > > The collect_list() and collect_set() functions currently only accept scalar > argument types. It would be very useful if these functions could also accept > struct argument types for creating nested data from flat data. > For example, suppose I wanted to create a nested customers/orders table from > two flat tables, customers and orders. Then it'd be very convenient to write > something like this: > {code} > insert into table nested_customers_orders > select c.*, collect_list(named_struct("oid", o.oid, "order_date": o.date...)) > from customers c inner join orders o on (c.cid = o.oid) > group by c.cid > {code} > Thanks you for your consideration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-11622) Creating an Avro table with a complex map-typed column leads to incorrect column type.
[ https://issues.apache.org/jira/browse/HIVE-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Behm updated HIVE-11622: -- Description: In the following CREATE TABLE the following map-typed column leads to the wrong type. I suspect some problem with inferring the Avro schema from the column definitions, but I am not sure. Reproduction: {code} hive create table t (c mapstring,arrayint) stored as avro; OK Time taken: 0.101 seconds hive desc t; OK c arraymapstring,int from deserializer Time taken: 0.135 seconds, Fetched: 1 row(s) {code} Note how the type shown in DESCRIBE is not the type originally passed in the CREATE TABLE. However, *sometimes* the DESCRIBE shows the correct output. You may also try these steps which produce a similar problem to increase the chance of hitting this issue: {code} hive create table t (c arraymapstring,int) stored as avro; OK Time taken: 0.063 seconds hive desc t; OK c mapstring,arrayint from deserializer Time taken: 0.152 seconds, Fetched: 1 row(s) {code} was: In the following CREATE TABLE the following map-typed column leads to the wrong type. I suspect some problem with inferring the Avro schema from the column definitions, but I am not sure. Reproduction: {code} hive create table t (c mapstring,arrayint) stored as avro; OK Time taken: 0.101 seconds hive desc t; OK c arraymapstring,int from deserializer Time taken: 0.135 seconds, Fetched: 1 row(s) {code} Note how the type shown in DESCRIBE is not the type originally passed in the CREATE TABLE. Creating an Avro table with a complex map-typed column leads to incorrect column type. -- Key: HIVE-11622 URL: https://issues.apache.org/jira/browse/HIVE-11622 Project: Hive Issue Type: Bug Components: Database/Schema Affects Versions: 1.1.0 Reporter: Alexander Behm In the following CREATE TABLE the following map-typed column leads to the wrong type. I suspect some problem with inferring the Avro schema from the column definitions, but I am not sure. Reproduction: {code} hive create table t (c mapstring,arrayint) stored as avro; OK Time taken: 0.101 seconds hive desc t; OK c arraymapstring,int from deserializer Time taken: 0.135 seconds, Fetched: 1 row(s) {code} Note how the type shown in DESCRIBE is not the type originally passed in the CREATE TABLE. However, *sometimes* the DESCRIBE shows the correct output. You may also try these steps which produce a similar problem to increase the chance of hitting this issue: {code} hive create table t (c arraymapstring,int) stored as avro; OK Time taken: 0.063 seconds hive desc t; OK c mapstring,arrayint from deserializer Time taken: 0.152 seconds, Fetched: 1 row(s) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)