[jira] [Commented] (HIVE-17108) Parquet file does not gather statistic such as "RAW DATA SIZE" automatically
[ https://issues.apache.org/jira/browse/HIVE-17108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094128#comment-16094128 ]

liyunzhang_intel commented on HIVE-17108:
-----------------------------------------

The detailed reason why Parquet files do not gather statistics such as "RAW DATA SIZE" automatically:

When executing "INSERT OVERWRITE TABLE xxx SELECT * xxx", Hive with ORC updates the statistics from the ORC footer in [FileSinkOperator#closeOp|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1060], while Hive with Parquet does not. OrcRecordWriter implements StatsProvidingRecordWriter; ParquetRecordWriterWrapper does not.

But I guess that even if ParquetRecordWriterWrapper implemented [StatsProvidingRecordWriter|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/StatsProvidingRecordWriter.java], statistics like "RAW DATA SIZE" still could not be updated, because org.apache.parquet.hadoop.ParquetWriter does not provide an interface like getRawDataSize() or getRawCount().

> Parquet file does not gather statistic such as "RAW DATA SIZE" automatically
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-17108
>                 URL: https://issues.apache.org/jira/browse/HIVE-17108
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>
> In [parquet_analyze.q|https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/parquet_analyze.q#L27],
> we need to run "ANALYZE TABLE parquet_create_people COMPUTE STATISTICS noscan"
> to update the statistics.
> In [orc_analyze.q|https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/orc_analyze.q#L45],
> we need not do that if we set hive.stats.autogather to true.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
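The mechanism described in the comment above can be illustrated with a minimal Java sketch. It is not Hive's code: every type here (SketchStats, SketchRecordWriter, SketchStatsProvidingRecordWriter, OrcLikeWriter, ParquetLikeWriter, StatsSketch) is a hypothetical, simplified stand-in, and the stat keys are chosen only for illustration. It assumes that stats are published on close only for writers that implement a stats-providing interface, which is what the comment reports for FileSinkOperator#closeOp.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the stats object a writer can report. */
final class SketchStats {
    final long rawDataSize;
    final long rowCount;

    SketchStats(long rawDataSize, long rowCount) {
        this.rawDataSize = rawDataSize;
        this.rowCount = rowCount;
    }
}

/** Hypothetical stand-in for a plain record writer. */
interface SketchRecordWriter {
    void close();
}

/** Hypothetical stand-in for a writer that can also report stats. */
interface SketchStatsProvidingRecordWriter extends SketchRecordWriter {
    SketchStats getStats();
}

/** ORC-like writer: implements the stats interface, so its stats are visible. */
final class OrcLikeWriter implements SketchStatsProvidingRecordWriter {
    private final SketchStats stats;

    OrcLikeWriter(long rawDataSize, long rowCount) {
        this.stats = new SketchStats(rawDataSize, rowCount);
    }

    @Override public void close() { /* nothing to flush in this sketch */ }

    @Override public SketchStats getStats() { return stats; }
}

/** Parquet-like writer: writes data but exposes no stats. */
final class ParquetLikeWriter implements SketchRecordWriter {
    @Override public void close() { /* nothing to flush in this sketch */ }
}

final class StatsSketch {
    /** Closes each writer and publishes stats only for stats-providing writers. */
    static Map<String, Long> collectOnClose(List<? extends SketchRecordWriter> writers) {
        Map<String, Long> published = new HashMap<>();
        for (SketchRecordWriter writer : writers) {
            writer.close();
            if (writer instanceof SketchStatsProvidingRecordWriter) {
                SketchStats s = ((SketchStatsProvidingRecordWriter) writer).getStats();
                published.merge("rawDataSize", s.rawDataSize, Long::sum);
                published.merge("numRows", s.rowCount, Long::sum);
            }
        }
        return published;
    }

    public static void main(String[] args) {
        // 336 and 3 are the rawDataSize and numRows reported for orc_create_people in this thread.
        System.out.println(collectOnClose(Arrays.asList(new OrcLikeWriter(336L, 3L))));
        System.out.println(collectOnClose(Arrays.asList(new ParquetLikeWriter())));
    }
}
```

In this sketch the Parquet-like writer contributes nothing, so the published map stays empty for it. Making it implement the interface would only help if the underlying writer could supply the raw data size and row count, which is the limitation the comment points out for ParquetWriter.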
[jira] [Commented] (HIVE-17108) Parquet file does not gather statistic such as "RAW DATA SIZE" automatically
[ https://issues.apache.org/jira/browse/HIVE-17108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091284#comment-16091284 ]

liyunzhang_intel commented on HIVE-17108:
-----------------------------------------

[~pxiong]: when I look at the code around [orc_analyze.q|https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/orc_analyze.q#L45], it seems that ORC automatically gathers statistics like "RAW DATA SIZE" without executing "analyze table compute statistics". But reading the code, whether for ORC or Parquet, it calls [org.apache.hadoop.hive.ql.exec.StatsNoJobTask#aggregateStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsNoJobTask.java#L266] to update the statistics. The query {{INSERT OVERWRITE TABLE orc_create_people SELECT * FROM orc_create_people_staging ORDER BY id}} only calls StatsJobTask and does not call StatsNoJobTask. So why does ORC update the statistics automatically?

Let's use an example to explain in more detail:
{noformat}
set hive.execution.engine=mr;
use default;
drop table if exists orc_create_people_staging;
drop table if exists orc_create_people;

CREATE TABLE orc_create_people_staging (
  id int,
  first_name string,
  last_name string,
  address string,
  salary decimal,
  start_date timestamp,
  state string);

LOAD DATA LOCAL INPATH './orc_create_people.txt' OVERWRITE INTO TABLE orc_create_people_staging;

CREATE TABLE orc_create_people (
  id int,
  first_name string,
  last_name string,
  address string,
  salary decimal,
  start_date timestamp,
  state string)
STORED AS orc;

explain extended INSERT OVERWRITE TABLE orc_create_people SELECT * FROM orc_create_people_staging ORDER BY id;
INSERT OVERWRITE TABLE orc_create_people SELECT * FROM orc_create_people_staging ORDER BY id;
desc formatted orc_create_people;
{noformat}

The result of "desc formatted orc_create_people" is as follows; the rawDataSize is correct (value 336):
{code}
# col_name              data_type               comment

id                      int
first_name              string
last_name               string
address                 string
salary                  decimal(10,0)
start_date              timestamp
state                   string

# Detailed Table Information
Database:               default
Owner:                  root
CreateTime:             Tue Jul 18 04:29:16 EDT 2017
LastAccessTime:         UNKNOWN
Retention:              0
Location:               hdfs://bdpe42:8020/user/hive/warehouse/orc_create_people
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        numFiles                1
        numRows                 3
        rawDataSize             336
        totalSize               537
        transient_lastDdlTime   1500366574

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format
{code}
[jira] [Commented] (HIVE-17108) Parquet file does not gather statistic such as "RAW DATA SIZE" automatically
[ https://issues.apache.org/jira/browse/HIVE-17108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090965#comment-16090965 ]

liyunzhang_intel commented on HIVE-17108:
-----------------------------------------

[~csun], [~xuefuz]: If we must use "ANALYZE TABLE parquet_create_people COMPUTE STATISTICS noscan" to get statistics such as "RAW DATA SIZE", we need to update the parquet*q tests, such as [parquet_join.q|https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/parquet_join.q], to add "analyze table".

Let's use part of parquet_join.q to explain: after running "analyze table parquet_jointable1 compute statistics noscan", the raw data size changes from 4 to 108.
{code}
set hive.mapred.mode=nonstrict;

drop table if exists staging;
drop table if exists parquet_jointable1;
drop table if exists parquet_jointable2;

create table staging (key int, value string) stored as textfile;
insert into table staging select distinct key, value from src order by key limit 2;

create table parquet_jointable1 stored as parquet as select * from staging;
create table parquet_jointable2 stored as parquet as select key,key+1,concat(value,"value") as myvalue from staging;

-- MR join
describe formatted parquet_jointable1;
explain select p2.myvalue from parquet_jointable1 p1 join parquet_jointable2 p2 on p1.key=p2.key;

--update the statistics of parquet_jointable1
analyze table parquet_jointable1 COMPUTE STATISTICS noscan;
describe formatted parquet_jointable1;

--now the datasize of parquet_jointable1 changes from 4 to 108
explain select p2.myvalue from parquet_jointable1 p1 join parquet_jointable2 p2 on p1.key=p2.key;
{code}

The output of the script:
{code}
# Detailed Table Information
Database:               default
Owner:                  root
CreateTime:             Mon Jul 17 21:34:52 EDT 2017
LastAccessTime:         UNKNOWN
Retention:              0
Location:               hdfs://bdpe42:8020/user/hive/warehouse/parquet_jointable1
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        numFiles                1
        numRows                 2
        rawDataSize             4
        totalSize               345
        transient_lastDdlTime   1500341692

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.202 seconds, Fetched: 31 row(s)
OK
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      DagName: root_20170717213454_eadfaac1-9d9d-4bc0-bb65-cb4beef4505a:5
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: p1
                  Statistics: Num rows: 2 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 2 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                    Spark HashTable Sink Operator
                      keys:
                        0 key (type: int)
                        1 key (type: int)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
      DagName: root_20170717213454_eadfaac1-9d9d-4bc0-bb65-cb4beef4505a:4
      Vertices:
        Map 2
            Map Operator Tree:
                TableScan
                  alias: p2
                  Statistics: Num rows: 2 Data size: 6 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 2 Data size: 6 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      keys:
                        0 key (type: int)
                        1 key (type: int)
                      outputColumnNames: _col7
                      input vertices:
                        0 Map 1
                      Statistics: Num rows: 2 Data size: 4 Basic st
[jira] [Commented] (HIVE-17108) Parquet file does not gather statistic such as "RAW DATA SIZE" automatically
[ https://issues.apache.org/jira/browse/HIVE-17108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089491#comment-16089491 ]

liyunzhang_intel commented on HIVE-17108:
-----------------------------------------

[~csun] or [~xuefuz]: could you help take a look at this? Thanks!