[jira] [Work started] (HIVE-26313) Aggregate all column statistics into a single field in metastore

Jira Wed, 12 Oct 2022 01:03:03 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-26313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Work on HIVE-26313 started by László Végh.
------------------------------------------
> Aggregate all column statistics into a single field in metastore
> ----------------------------------------------------------------
>
>                 Key: HIVE-26313
>                 URL: https://issues.apache.org/jira/browse/HIVE-26313
>             Project: Hive
>          Issue Type: Improvement
>          Components: Standalone Metastore, Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: László Végh
>            Priority: Major
>              Labels: breaking_change
>
> At the moment, column statistics tables in the metastore schema look like 
> this (it's similar for _PART_COL_STATS_):
> {noformat}
> CREATE TABLE "APP"."TAB_COL_STATS"(
>     "CAT_NAME" VARCHAR(256) NOT NULL,
>     "DB_NAME" VARCHAR(128) NOT NULL,
>     "TABLE_NAME" VARCHAR(256) NOT NULL,
>     "COLUMN_NAME" VARCHAR(767) NOT NULL,
>     "COLUMN_TYPE" VARCHAR(128) NOT NULL,
>     "LONG_LOW_VALUE" BIGINT,
>     "LONG_HIGH_VALUE" BIGINT,
>     "DOUBLE_LOW_VALUE" DOUBLE,
>     "DOUBLE_HIGH_VALUE" DOUBLE,
>     "BIG_DECIMAL_LOW_VALUE" VARCHAR(4000),
>     "BIG_DECIMAL_HIGH_VALUE" VARCHAR(4000),
>     "NUM_DISTINCTS" BIGINT,
>     "NUM_NULLS" BIGINT NOT NULL,
>     "AVG_COL_LEN" DOUBLE,
>     "MAX_COL_LEN" BIGINT,
>     "NUM_TRUES" BIGINT,
>     "NUM_FALSES" BIGINT,
>     "LAST_ANALYZED" BIGINT,
>     "CS_ID" BIGINT NOT NULL,
>     "TBL_ID" BIGINT NOT NULL,
>     "BIT_VECTOR" BLOB,
>     "ENGINE" VARCHAR(128) NOT NULL
> );
> {noformat}
> The idea is to replace the individual statistics columns with a single blob 
> named _STATISTICS_, as follows:
> {noformat}
> CREATE TABLE "APP"."TAB_COL_STATS"(
>     "CAT_NAME" VARCHAR(256) NOT NULL,
>     "DB_NAME" VARCHAR(128) NOT NULL,
>     "TABLE_NAME" VARCHAR(256) NOT NULL,
>     "COLUMN_NAME" VARCHAR(767) NOT NULL,
>     "COLUMN_TYPE" VARCHAR(128) NOT NULL,
>     "STATISTICS" BLOB,
>     "LAST_ANALYZED" BIGINT,
>     "CS_ID" BIGINT NOT NULL,
>     "TBL_ID" BIGINT NOT NULL,
>     "ENGINE" VARCHAR(128) NOT NULL
> );
> {noformat}
> The _STATISTICS_ column could hold the serialization of a JSON-encoded string, 
> which will be consumed in a "schema-on-read" fashion.
> Initially, at least the statistics from the removed columns will be encoded in 
> the _STATISTICS_ column. Because each "consumer" reads only the portion of the 
> schema it is interested in, multiple engines (see the _ENGINE_ column) can 
> read and write statistics as they see fit.
> Another advantage is that, if we add more statistics in the future, 
> we won't need to change the Thrift interface for the metastore again.
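
To make the "schema-on-read" idea concrete, here is a minimal sketch in Python. It is not the Hive implementation (the metastore is written in Java) and the issue does not specify a JSON layout. The key names simply reuse the removed column names, and `encode_statistics`, `read_statistics`, the base64 handling of `BIT_VECTOR`, and the `HISTOGRAM` field are all hypothetical names and choices made for illustration only.

```python
import base64
import json
from typing import Any, Dict, Iterable, Optional

# Columns removed from TAB_COL_STATS in the proposal (BIT_VECTOR handled separately).
LEGACY_STAT_COLUMNS = (
    "LONG_LOW_VALUE", "LONG_HIGH_VALUE",
    "DOUBLE_LOW_VALUE", "DOUBLE_HIGH_VALUE",
    "BIG_DECIMAL_LOW_VALUE", "BIG_DECIMAL_HIGH_VALUE",
    "NUM_DISTINCTS", "NUM_NULLS",
    "AVG_COL_LEN", "MAX_COL_LEN",
    "NUM_TRUES", "NUM_FALSES",
)


def encode_statistics(row: Dict[str, Any]) -> bytes:
    """Pack the non-null legacy statistics of one row into a JSON blob."""
    stats = {name: row[name] for name in LEGACY_STAT_COLUMNS
             if row.get(name) is not None}
    bit_vector = row.get("BIT_VECTOR")
    if bit_vector is not None:
        # Assumption: binary data is base64-encoded, since JSON has no bytes type.
        stats["BIT_VECTOR"] = base64.b64encode(bit_vector).decode("ascii")
    return json.dumps(stats, sort_keys=True).encode("utf-8")


def read_statistics(blob: Optional[bytes], wanted: Iterable[str]) -> Dict[str, Any]:
    """Schema-on-read: return only the requested fields and ignore all others."""
    if not blob:
        return {}
    stats = json.loads(blob.decode("utf-8"))
    return {name: stats[name] for name in wanted if name in stats}


# A row as it might look in the current schema (values are made up).
row = {
    "LONG_LOW_VALUE": 1,
    "LONG_HIGH_VALUE": 100,
    "NUM_DISTINCTS": 42,
    "NUM_NULLS": 0,
    "BIT_VECTOR": b"\x01\x02",
}
blob = encode_statistics(row)

# Another engine adds a field of its own (hypothetical name) without any
# change to the table schema or to the reader below.
extended = json.loads(blob.decode("utf-8"))
extended["HISTOGRAM"] = [1, 50, 100]
blob_v2 = json.dumps(extended, sort_keys=True).encode("utf-8")

# This consumer only cares about two fields and never sees HISTOGRAM.
basic = read_statistics(blob_v2, ["NUM_DISTINCTS", "NUM_NULLS"])
print(basic)
```

The reader selects fields by name and silently skips anything it does not know, which is what lets a new statistic appear in the blob without touching existing consumers. A real design would also have to settle on versioning and on how missing fields are treated, neither of which the issue text covers.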



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
