[ https://issues.apache.org/jira/browse/HIVE-26313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on HIVE-26313 started by László Végh.
------------------------------------------
> Aggregate all column statistics into a single field in metastore
> ----------------------------------------------------------------
>
>                 Key: HIVE-26313
>                 URL: https://issues.apache.org/jira/browse/HIVE-26313
>             Project: Hive
>          Issue Type: Improvement
>          Components: Standalone Metastore, Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: László Végh
>            Priority: Major
>              Labels: breaking_change
>
> At the moment, column statistics tables in the metastore schema look like
> this (it's similar for _PART_COL_STATS_):
> {noformat}
> CREATE TABLE "APP"."TAB_COL_STATS"(
>     "CAT_NAME" VARCHAR(256) NOT NULL,
>     "DB_NAME" VARCHAR(128) NOT NULL,
>     "TABLE_NAME" VARCHAR(256) NOT NULL,
>     "COLUMN_NAME" VARCHAR(767) NOT NULL,
>     "COLUMN_TYPE" VARCHAR(128) NOT NULL,
>     "LONG_LOW_VALUE" BIGINT,
>     "LONG_HIGH_VALUE" BIGINT,
>     "DOUBLE_LOW_VALUE" DOUBLE,
>     "DOUBLE_HIGH_VALUE" DOUBLE,
>     "BIG_DECIMAL_LOW_VALUE" VARCHAR(4000),
>     "BIG_DECIMAL_HIGH_VALUE" VARCHAR(4000),
>     "NUM_DISTINCTS" BIGINT,
>     "NUM_NULLS" BIGINT NOT NULL,
>     "AVG_COL_LEN" DOUBLE,
>     "MAX_COL_LEN" BIGINT,
>     "NUM_TRUES" BIGINT,
>     "NUM_FALSES" BIGINT,
>     "LAST_ANALYZED" BIGINT,
>     "CS_ID" BIGINT NOT NULL,
>     "TBL_ID" BIGINT NOT NULL,
>     "BIT_VECTOR" BLOB,
>     "ENGINE" VARCHAR(128) NOT NULL
> );
> {noformat}
> The idea is to replace the individual statistics columns with a single blob
> named _STATISTICS_, as follows:
> {noformat}
> CREATE TABLE "APP"."TAB_COL_STATS"(
>     "CAT_NAME" VARCHAR(256) NOT NULL,
>     "DB_NAME" VARCHAR(128) NOT NULL,
>     "TABLE_NAME" VARCHAR(256) NOT NULL,
>     "COLUMN_NAME" VARCHAR(767) NOT NULL,
>     "COLUMN_TYPE" VARCHAR(128) NOT NULL,
>     "STATISTICS" BLOB,
>     "LAST_ANALYZED" BIGINT,
>     "CS_ID" BIGINT NOT NULL,
>     "TBL_ID" BIGINT NOT NULL,
>     "ENGINE" VARCHAR(128) NOT NULL
> );
> {noformat}
> The _STATISTICS_ column could hold the serialization of a JSON-encoded
> string, which will be consumed in a "schema-on-read" fashion.
> Initially, at least the removed column statistics will be encoded in the
> _STATISTICS_ column, but since each "consumer" will read only the portion of
> the schema it is interested in, multiple engines (see the _ENGINE_ column)
> can read and write statistics as they see fit.
> Another advantage is that, if we plan to add more statistics in the future,
> we won't need to change the Thrift interface for the metastore again.
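
The "schema-on-read" idea described in the issue can be sketched as follows. This is only an illustration under stated assumptions, not the Hive implementation: an in-memory SQLite database stands in for the metastore backend, the table is reduced to a few of the proposed columns, and the helper names (`make_store`, `write_stats`, `read_stats`) and the JSON key names (`numNulls`, `longLowValue`, and so on) are hypothetical.

```python
import json
import sqlite3


def make_store():
    """Create an in-memory stand-in for the proposed TAB_COL_STATS table.

    Assumption: only a subset of the proposed columns is modelled here.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE TAB_COL_STATS (
            TABLE_NAME  TEXT NOT NULL,
            COLUMN_NAME TEXT NOT NULL,
            COLUMN_TYPE TEXT NOT NULL,
            STATISTICS  BLOB,
            ENGINE      TEXT NOT NULL
        )
        """
    )
    return conn


def write_stats(conn, table, column, col_type, engine, stats):
    """Serialize a dict of statistics to JSON and store it as a single blob."""
    blob = json.dumps(stats).encode("utf-8")
    conn.execute(
        "INSERT INTO TAB_COL_STATS "
        "(TABLE_NAME, COLUMN_NAME, COLUMN_TYPE, STATISTICS, ENGINE) "
        "VALUES (?, ?, ?, ?, ?)",
        (table, column, col_type, blob, engine),
    )


def read_stats(conn, table, column, engine, wanted):
    """Return only the statistics this consumer asks for.

    Keys that are absent from the blob are skipped, and keys the consumer
    does not know about are ignored (schema-on-read).
    """
    row = conn.execute(
        "SELECT STATISTICS FROM TAB_COL_STATS "
        "WHERE TABLE_NAME = ? AND COLUMN_NAME = ? AND ENGINE = ?",
        (table, column, engine),
    ).fetchone()
    if row is None or row[0] is None:
        return {}
    doc = json.loads(row[0].decode("utf-8"))
    return {key: doc[key] for key in wanted if key in doc}


if __name__ == "__main__":
    conn = make_store()
    # Hypothetical key names; "customSketch" plays the role of a statistic
    # added later by one engine without any change to the table schema.
    write_stats(
        conn, "orders", "amount", "bigint", "hive",
        {"longLowValue": 1, "longHighValue": 900, "numNulls": 0,
         "customSketch": "opaque-value"},
    )
    print(read_stats(conn, "orders", "amount", "hive",
                     ["numNulls", "longHighValue"]))
```

In this sketch a reader that only understands `numNulls` and `longHighValue` is unaffected by the extra `customSketch` entry, which is the property that would let new statistics be added without another schema or interface change.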