[jira] [Commented] (HIVE-14143) RawDataSize of RCFile is zero after analyze
[ https://issues.apache.org/jira/browse/HIVE-14143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15361156#comment-15361156 ]

Nemon Lou commented on HIVE-14143:
----------------------------------

[~pxiong]
In many places, the "ids" passed in are exactly the table scan's needed column IDs, so ids.size() is itself "sizeOfColumnsInTableScan". The check "ids.size() != sizeOfColumnsInTableScan" will therefore always be false.
{code}
ColumnProjectionUtils.appendReadColumns(
    jobConf, ts.getNeededColumnIDs(), ts.getNeededColumns());
{code}
In the case of count(1) or stats gathering, "sizeOfColumnsInTableScan" is zero. We need to find a way to distinguish these two cases:
For count(1), READ_ALL_COLUMNS should be set to false.
For stats gathering on RCFile, READ_ALL_COLUMNS should be set to true, so that all columns are read and rawDataSize can be calculated.

> RawDataSize of RCFile is zero after analyze
> -------------------------------------------
>
>                 Key: HIVE-14143
>                 URL: https://issues.apache.org/jira/browse/HIVE-14143
>             Project: Hive
>          Issue Type: Bug
>          Components: Statistics
>    Affects Versions: 1.2.1, 2.1.0
>            Reporter: Nemon Lou
>            Assignee: Abhishek
>            Priority: Minor
>         Attachments: HIVE-14143.1.patch, HIVE-14143.patch
>
>
> After running the following analyze command, rawDataSize becomes zero for RCFile tables.
> {noformat}
> analyze table RCFILE_TABLE compute statistics ;
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
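One way to picture the distinction asked for in the comment above is to carry the reason for the scan alongside the column IDs. The following is only a sketch under stated assumptions: the `ScanIntent` enum and the `shouldReadAllColumns` helper are hypothetical names invented for illustration and do not exist in Hive, and nothing in the thread says the eventual fix took this shape.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative sketch only; none of these names are Hive APIs. */
class ReadAllColumnsSketch {

  /** Hypothetical marker for why the table scan runs. */
  enum ScanIntent { QUERY, STATS_GATHER }

  /**
   * An empty ID list is treated as "read all columns" only when the scan
   * gathers statistics; for an ordinary query such as count(1) it keeps
   * meaning "no column needed".
   */
  static boolean shouldReadAllColumns(List<Integer> neededIds, ScanIntent intent) {
    return intent == ScanIntent.STATS_GATHER && neededIds.isEmpty();
  }

  public static void main(String[] args) {
    List<Integer> empty = Collections.emptyList();
    // count(1): empty IDs, ordinary query
    System.out.println(shouldReadAllColumns(empty, ScanIntent.QUERY));        // prints false
    // analyze ... compute statistics: empty IDs, stats gathering
    System.out.println(shouldReadAllColumns(empty, ScanIntent.STATS_GATHER)); // prints true
    // ordinary projection of two columns
    System.out.println(shouldReadAllColumns(Arrays.asList(0, 2), ScanIntent.QUERY)); // prints false
  }
}
```

With the size of the ID list alone, the first two calls would be indistinguishable, which is the problem the comment describes.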
[jira] [Commented] (HIVE-14143) RawDataSize of RCFile is zero after analyze
[ https://issues.apache.org/jira/browse/HIVE-14143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15360690#comment-15360690 ]

Pengcheng Xiong commented on HIVE-14143:
----------------------------------------

[~nemon], I think we only need to see whether your first patch can be improved around the "// Set READ_ALL_COLUMNS to false" part. Right now, you have
{code}
if (ids.size() > 0) {
  // Set READ_ALL_COLUMNS to false
  conf.setBoolean(READ_ALL_COLUMNS, false);
}
{code}
I would like to know if it is possible to change it to the following, to be consistent with the definition we have in TableScanDesc.java:
{code}
if (ids == null || ids.size() != sizeOfColumnsInTableScan) {
  // Set READ_ALL_COLUMNS to false
  conf.setBoolean(READ_ALL_COLUMNS, false);
}
{code}
Then you do not need to modify TestColumnProjectionUtils.java. Thanks.
[jira] [Commented] (HIVE-14143) RawDataSize of RCFile is zero after analyze
[ https://issues.apache.org/jira/browse/HIVE-14143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15360057#comment-15360057 ]

Nemon Lou commented on HIVE-14143:
----------------------------------

For ORC and LazySimpleSerde, rawDataSize is calculated without regard to column projection. So the rawDataSize calculation for RCFile could work the same way, right?
[jira] [Commented] (HIVE-14143) RawDataSize of RCFile is zero after analyze
[ https://issues.apache.org/jira/browse/HIVE-14143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15360049#comment-15360049 ]

Nemon Lou commented on HIVE-14143:
----------------------------------

Agreed. As described in TableScanDesc.java:
{code}
// Both neededColumnIDs and neededColumns should never be null.
// When neededColumnIDs is an empty list,
// it means no needed column (e.g. we do not need any column to evaluate
// SELECT count(*) FROM t).
private List<Integer> neededColumnIDs;
{code}
I must have been misled by the following code in HiveInputFormat.java:
{code}
private void pushProjection(final JobConf newjob,
    final StringBuilder readColumnsBuffer, final StringBuilder readColumnNamesBuffer) {
  String readColIds = readColumnsBuffer.toString();
  String readColNames = readColumnNamesBuffer.toString();
  boolean readAllColumns = readColIds.isEmpty() ? true : false;
  newjob.setBoolean(ColumnProjectionUtils.READ_ALL_COLUMNS, readAllColumns);
  ...
}
{code}
The solution is not clear to me. Any suggestions?
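The two snippets quoted in the comment above attach opposite meanings to an empty set of column IDs. A minimal sketch of that conflict follows, assuming a table with `totalColumns` columns; the class and method names are hypothetical and only model the two conventions, they are not Hive code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model of the two conventions for an empty ID list; not Hive code. */
class EmptyIdsConventions {

  /** Convention in the quoted pushProjection: empty IDs mean "read all columns". */
  static int columnsReadIfEmptyMeansAll(List<Integer> neededIds, int totalColumns) {
    boolean readAllColumns = neededIds.isEmpty();
    return readAllColumns ? totalColumns : neededIds.size();
  }

  /** Convention in the quoted TableScanDesc comment: empty IDs mean "no column needed". */
  static int columnsReadIfEmptyMeansNone(List<Integer> neededIds) {
    return neededIds.size();
  }

  public static void main(String[] args) {
    List<Integer> empty = Collections.emptyList();
    System.out.println(columnsReadIfEmptyMeansAll(empty, 5)); // prints 5
    System.out.println(columnsReadIfEmptyMeansNone(empty));   // prints 0
    // For a non-empty list the two conventions agree.
    System.out.println(columnsReadIfEmptyMeansAll(Arrays.asList(1, 3), 5)); // prints 2
    System.out.println(columnsReadIfEmptyMeansNone(Arrays.asList(1, 3)));   // prints 2
  }
}
```

The conventions diverge only for the empty list, which is exactly the input that both count(*) and analyze produce.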
[jira] [Commented] (HIVE-14143) RawDataSize of RCFile is zero after analyze
[ https://issues.apache.org/jira/browse/HIVE-14143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15360012#comment-15360012 ]

Pengcheng Xiong commented on HIVE-14143:
----------------------------------------

[~nemon], thanks a lot for your explanation. I think the current assumption that "empty column ids means read all columns" is confusing and misleading. I would prefer the following assumption:
{code}
getNeededColumnIDs==null or empty ===means==> do not need any columns
{code}
If you agree, could you please change the code accordingly? Thanks.
[jira] [Commented] (HIVE-14143) RawDataSize of RCFile is zero after analyze
[ https://issues.apache.org/jira/browse/HIVE-14143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359940#comment-15359940 ]

Nemon Lou commented on HIVE-14143:
----------------------------------

[~pxiong] Thanks for your attention.
RawDataSize for RCFile is the sum of the sizes of the selected columns.
https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStructBase.java#L229
{code}
public long getRawDataSerializedSize() {
  long serializedSize = 0;
  for (int i = 0; i < fieldInfoList.length; ++i) {
    serializedSize += fieldInfoList[i].getSerializedSize();
  }
  return serializedSize;
}
{code}
During projection pushdown, READ_ALL_COLUMNS is always set to false, whether or not the specified columns are empty.
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L656
{code}
for (String alias : aliases) {
  Operator op = this.mrwork.getAliasToWork().get(alias);
  if (op instanceof TableScanOperator) {
    TableScanOperator ts = (TableScanOperator) op;
    // push down projections.
    ColumnProjectionUtils.appendReadColumns(
        jobConf, ts.getNeededColumnIDs(), ts.getNeededColumns());
    // push down filters
    pushFilters(jobConf, ts);

    AcidUtils.setTransactionalTableScan(job, ts.getConf().isAcidTable());
  }
}
{code}
For analyze, the specified column IDs are empty, which is supposed to mean "read all columns".
But because READ_ALL_COLUMNS is false and the ID list is empty, in the end no column is read:
https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarSerDe.java#L104
{code}
List<Integer> notSkipIDs = new ArrayList<Integer>();
if (conf == null || ColumnProjectionUtils.isReadAllColumns(conf)) {
  for (int i = 0; i < size; i++) {
    notSkipIDs.add(i);
  }
} else {
  notSkipIDs = ColumnProjectionUtils.getReadColumnIDs(conf);
}
cachedLazyStruct = new ColumnarStruct(
    cachedObjectInspector, notSkipIDs, serdeParams.getNullSequence());
{code}
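The chain described in the comment above can be replayed in a small standalone model. This is a sketch, not Hive code: the class and method names are hypothetical, the column sizes are made up, and it assumes that a skipped column contributes zero bytes to the sum, as the comment's "size of the selected columns" wording suggests.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of the column selection and size summation quoted above. */
class RawDataSizeSketch {

  /** Mirrors the notSkipIDs branch in the quoted ColumnarSerDe snippet. */
  static List<Integer> notSkipIds(boolean readAllColumns, List<Integer> readColumnIds,
      int numColumns) {
    List<Integer> ids = new ArrayList<>();
    if (readAllColumns) {
      for (int i = 0; i < numColumns; i++) {
        ids.add(i);
      }
    } else {
      ids.addAll(readColumnIds);
    }
    return ids;
  }

  /** Sums the serialized sizes of the selected columns only (assumption: skipped columns add 0). */
  static long rawDataSize(long[] columnSizes, List<Integer> selected) {
    long total = 0;
    for (int id : selected) {
      total += columnSizes[id];
    }
    return total;
  }

  public static void main(String[] args) {
    long[] sizes = {10L, 20L, 30L};              // made-up per-column serialized sizes
    List<Integer> none = Collections.emptyList(); // analyze passes no column IDs

    // Behaviour reported in the issue: flag forced to false, ID list empty.
    System.out.println(rawDataSize(sizes, notSkipIds(false, none, 3))); // prints 0
    // Behaviour wanted for stats gathering: flag true, every column counted.
    System.out.println(rawDataSize(sizes, notSkipIds(true, none, 3)));  // prints 60
  }
}
```

In this model the zero comes purely from the combination of a false flag and an empty ID list, which matches the explanation given in the comment.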
[jira] [Commented] (HIVE-14143) RawDataSize of RCFile is zero after analyze
[ https://issues.apache.org/jira/browse/HIVE-14143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359371#comment-15359371 ]

Pengcheng Xiong commented on HIVE-14143:
----------------------------------------

[~nemon], patch LGTM. Could you explain the logic behind the code that you have added/changed:
{code}
if (ids.size() > 0) {
  // Set READ_ALL_COLUMNS to false
  conf.setBoolean(READ_ALL_COLUMNS, false);
}
{code}