[jira] [Commented] (HIVE-17108) Parquet file does not gather statistic such as "RAW DATA SIZE" automatically

2017-07-19 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094128#comment-16094128
 ] 

liyunzhang_intel commented on HIVE-17108:
-

The detailed reason why Parquet files do not gather statistics such as "RAW DATA 
SIZE" automatically:
When executing "INSERT OVERWRITE TABLE xxx SELECT * xxx",
Hive with ORC updates the statistics from the ORC footer in 
[FileSinkOperator#closeOp|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1060],
 while Hive with Parquet does not. 
OrcRecordWriter implements StatsProvidingRecordWriter;
ParquetRecordWriterWrapper does not.

But I guess that even if ParquetRecordWriterWrapper implemented 
[StatsProvidingRecordWriter|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/StatsProvidingRecordWriter.java],
 statistics like "RAW DATA SIZE" could not be updated, because 
org.apache.parquet.hadoop.ParquetWriter does not provide methods like 
getRawDataSize() or getRawCount().
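
To make the difference concrete, here is a minimal, self-contained Java sketch of the kind of check described above. It is not Hive's code: the nested types are simplified, hypothetical stand-ins (only the name StatsProvidingRecordWriter is borrowed from Hive), and the "raw data size" is just a toy character count. It only illustrates that a writer which does not expose statistics leaves nothing to collect when the sink is closed.

```java
import java.util.OptionalLong;

class StatsSketch {
    // Hypothetical, simplified stand-in for a statistics holder.
    static final class Stats {
        final long rawDataSize;
        final long rowCount;

        Stats(long rawDataSize, long rowCount) {
            this.rawDataSize = rawDataSize;
            this.rowCount = rowCount;
        }
    }

    // Hypothetical stand-in for a plain record writer.
    interface RecordWriter {
        void write(String row);
    }

    // Simplified stand-in for a writer that can also report statistics.
    interface StatsProvidingRecordWriter extends RecordWriter {
        Stats getStats();
    }

    // Models the ORC case: the writer tracks statistics while writing.
    static final class OrcLikeWriter implements StatsProvidingRecordWriter {
        private long rawDataSize;
        private long rowCount;

        @Override
        public void write(String row) {
            rawDataSize += row.length(); // toy measure, not ORC's real accounting
            rowCount++;
        }

        @Override
        public Stats getStats() {
            return new Stats(rawDataSize, rowCount);
        }
    }

    // Models the Parquet case: the writer exposes no statistics at all.
    static final class ParquetLikeWriter implements RecordWriter {
        @Override
        public void write(String row) {
            // Rows are written, but nothing is counted or exposed.
        }
    }

    // Models the check at close time: statistics are only available
    // when the writer implements the stats-providing interface.
    static OptionalLong rawDataSizeOnClose(RecordWriter writer) {
        if (writer instanceof StatsProvidingRecordWriter) {
            Stats stats = ((StatsProvidingRecordWriter) writer).getStats();
            return OptionalLong.of(stats.rawDataSize);
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        RecordWriter orc = new OrcLikeWriter();
        RecordWriter parquet = new ParquetLikeWriter();
        for (String row : new String[] {"alice", "bob"}) {
            orc.write(row);
            parquet.write(row);
        }
        System.out.println("orc-like raw data size: " + rawDataSizeOnClose(orc));
        System.out.println("parquet-like raw data size: " + rawDataSizeOnClose(parquet));
    }
}
```

In this sketch, making ParquetLikeWriter implement the interface would not be enough on its own: it would still need some underlying source for the numbers, which is the gap noted above for ParquetWriter.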

> Parquet file does not gather statistic such as "RAW DATA SIZE" automatically 
> -
>
> Key: HIVE-17108
> URL: https://issues.apache.org/jira/browse/HIVE-17108
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>
> In 
> [parquet_analyze.q|https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/parquet_analyze.q#L27],
>  we need to run "ANALYZE TABLE parquet_create_people COMPUTE STATISTICS noscan" 
> to update the statistics. 
> In 
> [orc_analyze.q|https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/orc_analyze.q#L45],
>  we do not need to do that if hive.stats.autogather is set to true.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17108) Parquet file does not gather statistic such as "RAW DATA SIZE" automatically

2017-07-18 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091284#comment-16091284
 ] 

liyunzhang_intel commented on HIVE-17108:
-

[~pxiong]: Looking at 
[orc_analyze.q|https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/orc_analyze.q#L45],
 it seems that ORC automatically gathers statistics like "RAW DATA SIZE" 
without executing "analyze table compute statistics". But reading the code, 
for both ORC and Parquet it is 
[org.apache.hadoop.hive.ql.exec.StatsNoJobTask#aggregateStats|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsNoJobTask.java#L266]
 that updates the statistics. The query {{INSERT OVERWRITE TABLE orc_create_people 
SELECT * FROM orc_create_people_staging ORDER BY id}} only calls 
StatsJobTask, not StatsNoJobTask. So why does ORC update its statistics 
automatically?

Let's use an example to explain in more detail:
{noformat}
set hive.execution.engine=mr;
use default;
drop table if exists orc_create_people_staging;
drop table if exists orc_create_people;

CREATE TABLE orc_create_people_staging (
  id int,
  first_name string,
  last_name string,
  address string,
  salary decimal,
  start_date timestamp,
  state string);

LOAD DATA LOCAL INPATH './orc_create_people.txt' OVERWRITE INTO TABLE 
orc_create_people_staging;

CREATE TABLE orc_create_people (
  id int,
  first_name string,
  last_name string,
  address string,
  salary decimal,
  start_date timestamp,
  state string)
STORED AS orc;

explain extended INSERT OVERWRITE TABLE orc_create_people SELECT * FROM 
orc_create_people_staging ORDER BY id;
INSERT OVERWRITE TABLE orc_create_people SELECT * FROM 
orc_create_people_staging ORDER BY id;
desc formatted orc_create_people;

{noformat}


The result of "desc formatted orc_create_people" is as follows; the rawDataSize 
is correct (value 336):
{code}
# col_name  data_type   comment 
 
id  int 
first_name  string  
last_name   string  
address string  
salary  decimal(10,0)   
start_date  timestamp   
state   string  
 
# Detailed Table Information 
Database:   default  
Owner:  root 
CreateTime: Tue Jul 18 04:29:16 EDT 2017 
LastAccessTime: UNKNOWN  
Retention:  0
Location:   hdfs://bdpe42:8020/user/hive/warehouse/orc_create_people 
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
numFiles    1   
numRows 3   
rawDataSize 336 
totalSize   537 
transient_lastDdlTime   1500366574  
 
# Storage Information
SerDe Library:  org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:    org.apache.hadoop.hive.ql.io.orc.OrcInputFormat  
OutputFormat:   org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
 
Compressed: No   
Num Buckets:-1   
Bucket Columns: []   
Sort Columns:   []   
Storage Desc Params: 
serialization.format
{code}



[jira] [Commented] (HIVE-17108) Parquet file does not gather statistic such as "RAW DATA SIZE" automatically

2017-07-17 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090965#comment-16090965
 ] 

liyunzhang_intel commented on HIVE-17108:
-

[~csun], [~xuefuz]: If we must use "ANALYZE TABLE parquet_create_people 
COMPUTE STATISTICS noscan" to get statistics such as "RAW DATA SIZE", we 
need to update the parquet*.q tests, such as 
[parquet_join.q|https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/parquet_join.q],
to add "analyze table".
Let's use part of parquet_join.q to explain:
after running "analyze table parquet_jointable1 compute statistics noscan", the raw 
data size changes from 4 to 108.
{code}
set hive.mapred.mode=nonstrict;

drop table if exists staging;
drop table if exists parquet_jointable1;
drop table if exists parquet_jointable2;

create table staging (key int, value string) stored as textfile;
insert into table staging select distinct key, value from src order by key 
limit 2;

create table parquet_jointable1 stored as parquet as select * from staging;
create table parquet_jointable2 stored as parquet as select 
key,key+1,concat(value,"value") as myvalue from staging;


-- MR join
describe formatted parquet_jointable1;
explain select p2.myvalue from parquet_jointable1 p1 join parquet_jointable2 p2 
on p1.key=p2.key;
--update the statistics of parquet_jointable1
analyze table parquet_jointable1 COMPUTE STATISTICS noscan;
describe formatted parquet_jointable1;
--now the datasize of parquet_jointable1 changes from 4 to 108
explain select p2.myvalue from parquet_jointable1 p1 join parquet_jointable2 p2 
on p1.key=p2.key;
{code}

The output of the script:
{code}
# Detailed Table Information 
Database:   default  
Owner:  root 
CreateTime: Mon Jul 17 21:34:52 EDT 2017 
LastAccessTime: UNKNOWN  
Retention:  0
Location:   hdfs://bdpe42:8020/user/hive/warehouse/parquet_jointable1
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
numFiles    1   
numRows 2   
rawDataSize 4   
totalSize   345 
transient_lastDdlTime   1500341692  
 
# Storage Information
SerDe Library:  org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe  
InputFormat:    org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:   org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat   
Compressed: No   
Num Buckets:-1   
Bucket Columns: []   
Sort Columns:   []   
Storage Desc Params: 
serialization.format    1   
Time taken: 0.202 seconds, Fetched: 31 row(s)
OK
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
Spark
  DagName: root_20170717213454_eadfaac1-9d9d-4bc0-bb65-cb4beef4505a:5
  Vertices:
Map 1 
Map Operator Tree:
TableScan
  alias: p1
  Statistics: Num rows: 2 Data size: 4 Basic stats: COMPLETE 
Column stats: NONE
  Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 2 Data size: 4 Basic stats: COMPLETE 
Column stats: NONE
Spark HashTable Sink Operator
  keys:
0 key (type: int)
1 key (type: int)
Local Work:
  Map Reduce Local Work

  Stage: Stage-1
Spark
  DagName: root_20170717213454_eadfaac1-9d9d-4bc0-bb65-cb4beef4505a:4
  Vertices:
Map 2 
Map Operator Tree:
TableScan
  alias: p2
  Statistics: Num rows: 2 Data size: 6 Basic stats: COMPLETE 
Column stats: NONE
  Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 2 Data size: 6 Basic stats: COMPLETE 
Column stats: NONE
Map Join Operator
  condition map:
   Inner Join 0 to 1
  keys:
0 key (type: int)
1 key (type: int)
  outputColumnNames: _col7
  input vertices:
0 Map 1
  Statistics: Num rows: 2 Data size: 4 Basic st

[jira] [Commented] (HIVE-17108) Parquet file does not gather statistic such as "RAW DATA SIZE" automatically

2017-07-17 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089491#comment-16089491
 ] 

liyunzhang_intel commented on HIVE-17108:
-

[~csun] or [~xuefuz]: could you please take a look? Thanks!
