[jira] [Commented] (HIVE-14143) RawDataSize of RCFile is zero after analyze

2016-07-04 Thread Nemon Lou (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15361156#comment-15361156
 ] 

Nemon Lou commented on HIVE-14143:
--

[~pxiong] In many places the size of the "ids" passed in is just 
"sizeOfColumnsInTableScan", so "ids.size() != sizeOfColumnsInTableScan" will 
always be false there:
{code}
ColumnProjectionUtils.appendReadColumns(
    jobConf, ts.getNeededColumnIDs(), ts.getNeededColumns());
{code}
In the case of count(1) or stats gathering, "sizeOfColumnsInTableScan" is zero. 
We need to find a way to distinguish these two cases.
For count(1), READ_ALL_COLUMNS should be set to false.
For stats gathering on RCFile, READ_ALL_COLUMNS should be set to true so that 
all columns are read and rawDataSize can be calculated.
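
One way to picture the distinction is the sketch below. It is only an illustration, not the patch: the `gatherStats` flag and the class and method names are hypothetical, and it assumes the caller knows whether the scan is gathering statistics.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ReadAllColumnsDecision {
  /**
   * Hypothetical rule, not the committed patch: a non-empty column list always
   * means a real projection; an empty list reads everything only when
   * statistics are being gathered.
   */
  static boolean readAllColumns(List<Integer> neededColumnIds, boolean gatherStats) {
    if (!neededColumnIds.isEmpty()) {
      return false; // a projection was pushed down, read only those columns
    }
    return gatherStats; // empty list: count(1) gives false, stats gathering gives true
  }

  public static void main(String[] args) {
    List<Integer> none = Collections.emptyList();
    System.out.println("count(1):      " + readAllColumns(none, false));
    System.out.println("stats gather:  " + readAllColumns(none, true));
    System.out.println("select c0, c2: " + readAllColumns(Arrays.asList(0, 2), false));
  }
}
```

The sketch does not cover stats gathering combined with a non-empty column list; that case would need its own decision.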



> RawDataSize of RCFile is zero after analyze
>
> Key: HIVE-14143
> URL: https://issues.apache.org/jira/browse/HIVE-14143
> Project: Hive
> Issue Type: Bug
> Components: Statistics
> Affects Versions: 1.2.1, 2.1.0
> Reporter: Nemon Lou
> Assignee: Abhishek
> Priority: Minor
> Attachments: HIVE-14143.1.patch, HIVE-14143.patch
>
> After running the following analyze command, rawDataSize becomes zero for 
> RCFile tables.
> {noformat}
>  analyze table RCFILE_TABLE compute statistics;
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14143) RawDataSize of RCFile is zero after analyze

2016-07-03 Thread Pengcheng Xiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15360690#comment-15360690
 ] 

Pengcheng Xiong commented on HIVE-14143:


[~nemon], I think we only need to see if it is possible to improve your first 
patch regarding the "// Set READ_ALL_COLUMNS to false" part.
Right now, you have:
{code}
if (ids.size() > 0) {
  // Set READ_ALL_COLUMNS to false
  conf.setBoolean(READ_ALL_COLUMNS, false);
}
{code}
I would like to know if it is possible to change it to the following to be 
consistent with the definition we have in TableScanDesc.java:
{code}
if (ids == null || ids.size() != sizeOfColumnsInTableScan) {
  // Set READ_ALL_COLUMNS to false
  conf.setBoolean(READ_ALL_COLUMNS, false);
}
{code}

Then you do not need to modify the TestColumnProjectionUtils.java. Thanks.

> RawDataSize of RCFile is zero after analyze
>
> Key: HIVE-14143
> URL: https://issues.apache.org/jira/browse/HIVE-14143
> Project: Hive
> Issue Type: Bug
> Components: Statistics
> Affects Versions: 1.2.1, 2.1.0
> Reporter: Nemon Lou
> Assignee: Nemon Lou
> Priority: Minor
> Attachments: HIVE-14143.1.patch, HIVE-14143.patch
>
> After running the following analyze command, rawDataSize becomes zero for 
> RCFile tables.
> {noformat}
>  analyze table RCFILE_TABLE compute statistics;
> {noformat}





[jira] [Commented] (HIVE-14143) RawDataSize of RCFile is zero after analyze

2016-07-02 Thread Nemon Lou (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15360057#comment-15360057
 ] 

Nemon Lou commented on HIVE-14143:
--

For ORC and LazySimpleSerde, rawDataSize is calculated without regard to 
column projection.
So the rawDataSize calculation for RCFile can work the same way, right?

> RawDataSize of RCFile is zero after analyze
>
> Key: HIVE-14143
> URL: https://issues.apache.org/jira/browse/HIVE-14143
> Project: Hive
> Issue Type: Bug
> Components: Statistics
> Affects Versions: 1.2.1, 2.1.0
> Reporter: Nemon Lou
> Assignee: Nemon Lou
> Priority: Minor
> Attachments: HIVE-14143.patch
>
> After running the following analyze command, rawDataSize becomes zero for 
> RCFile tables.
> {noformat}
>  analyze table RCFILE_TABLE compute statistics;
> {noformat}





[jira] [Commented] (HIVE-14143) RawDataSize of RCFile is zero after analyze

2016-07-02 Thread Nemon Lou (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15360049#comment-15360049
 ] 

Nemon Lou commented on HIVE-14143:
--

Agreed. As described in TableScanDesc.java:
{code}
  // Both neededColumnIDs and neededColumns should never be null.
  // When neededColumnIDs is an empty list,
  // it means no needed column (e.g. we do not need any column to evaluate
  // SELECT count(*) FROM t).
  private List<Integer> neededColumnIDs;
{code}
 
I must have been misled by the following code in HiveInputFormat.java:
{code}
  private void pushProjection(final JobConf newjob, final StringBuilder readColumnsBuffer,
      final StringBuilder readColumnNamesBuffer) {
    String readColIds = readColumnsBuffer.toString();
    String readColNames = readColumnNamesBuffer.toString();
    boolean readAllColumns = readColIds.isEmpty() ? true : false;
    newjob.setBoolean(ColumnProjectionUtils.READ_ALL_COLUMNS, readAllColumns);
    ...
  }
{code}
The solution is not clear to me. Any suggestions?
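
To make the conflict concrete, here is a small standalone sketch of the two readings of an empty id list. It models only the two rules quoted above; the class, the method names and the `toIdString` helper are hypothetical and are not Hive code.

```java
import java.util.Collections;
import java.util.List;

final class EmptyProjectionMeaning {
  /** Reading from the TableScanDesc comment: an empty id list means no column is needed. */
  static boolean needsNoColumns(List<Integer> neededColumnIds) {
    return neededColumnIds.isEmpty();
  }

  /** Reading from pushProjection: an empty id string switches READ_ALL_COLUMNS on. */
  static boolean readAllColumns(String readColIds) {
    return readColIds.isEmpty();
  }

  /** Hypothetical helper: joins ids into a comma-separated string. */
  static String toIdString(List<Integer> ids) {
    StringBuilder sb = new StringBuilder();
    for (int id : ids) {
      if (sb.length() > 0) {
        sb.append(',');
      }
      sb.append(id);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    List<Integer> ids = Collections.emptyList(); // e.g. SELECT count(*) FROM t
    System.out.println("needs no columns: " + needsNoColumns(ids));
    System.out.println("read all columns: " + readAllColumns(toIdString(ids)));
  }
}
```

For the same empty list both methods return true, so one rule says "read nothing" while the other says "read everything".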



[jira] [Commented] (HIVE-14143) RawDataSize of RCFile is zero after analyze

2016-07-02 Thread Pengcheng Xiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15360012#comment-15360012
 ] 

Pengcheng Xiong commented on HIVE-14143:


[~nemon], thanks a lot for your explanation. I think the current assumption 
that "empty column ids means read all columns" is confusing and misleading. I 
would prefer the following assumption:
{code}
getNeededColumnIDs==null or empty ===means==> do not need any columns
{code}
If you agree, could you please change the code accordingly? Thanks.



[jira] [Commented] (HIVE-14143) RawDataSize of RCFile is zero after analyze

2016-07-01 Thread Nemon Lou (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359940#comment-15359940
 ] 

Nemon Lou commented on HIVE-14143:
--

[~pxiong] Thanks for your attention.

RawDataSize for RCFile is the sum of the serialized sizes of the selected 
columns:
https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStructBase.java#L229
{code}
  public long getRawDataSerializedSize() {
    long serializedSize = 0;
    for (int i = 0; i < fieldInfoList.length; ++i) {
      serializedSize += fieldInfoList[i].getSerializedSize();
    }
    return serializedSize;
  }
{code}

During projection pushdown, READ_ALL_COLUMNS is always set to false, whether 
or not the specified columns are empty:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L656
{code}
for (String alias : aliases) {
  Operator op = this.mrwork.getAliasToWork().get(alias);
  if (op instanceof TableScanOperator) {
    TableScanOperator ts = (TableScanOperator) op;
    // push down projections.
    ColumnProjectionUtils.appendReadColumns(
        jobConf, ts.getNeededColumnIDs(), ts.getNeededColumns());
    // push down filters
    pushFilters(jobConf, ts);

    AcidUtils.setTransactionalTableScan(job, ts.getConf().isAcidTable());
  }
}
{code}
The specified column ids are empty for analyze, which means read all columns.

In the end, no column is read:
https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarSerDe.java#L104
{code}
List<Integer> notSkipIDs = new ArrayList<Integer>();
if (conf == null || ColumnProjectionUtils.isReadAllColumns(conf)) {
  for (int i = 0; i < size; i++) {
    notSkipIDs.add(i);
  }
} else {
  notSkipIDs = ColumnProjectionUtils.getReadColumnIDs(conf);
}
cachedLazyStruct = new ColumnarStruct(
    cachedObjectInspector, notSkipIDs, serdeParams.getNullSequence());
{code}
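
The chain above can be reproduced in a few lines of standalone Java. This is a simplified model under assumptions, not the Hive classes: `notSkipIds` mirrors the branch in ColumnarSerDe, `rawDataSize` sums per-column sizes over the columns that were actually read, and the column sizes are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class NotSkipIdsSketch {
  /** Mirrors the quoted branch: every column if readAll is set, else only the listed ids. */
  static List<Integer> notSkipIds(boolean readAllColumns, List<Integer> readColumnIds,
      int columnCount) {
    List<Integer> ids = new ArrayList<>();
    if (readAllColumns) {
      for (int i = 0; i < columnCount; i++) {
        ids.add(i);
      }
    } else {
      ids.addAll(readColumnIds);
    }
    return ids;
  }

  /** Simplified stand-in for the size calculation: sums only the columns that were read. */
  static long rawDataSize(long[] columnSizes, List<Integer> notSkipIds) {
    long total = 0;
    for (int id : notSkipIds) {
      total += columnSizes[id];
    }
    return total;
  }

  public static void main(String[] args) {
    long[] sizes = {10, 20, 30}; // made-up serialized sizes for three columns
    List<Integer> none = Collections.emptyList();
    // analyze today: empty id list and READ_ALL_COLUMNS forced to false
    System.out.println(rawDataSize(sizes, notSkipIds(false, none, sizes.length)));
    // same empty id list with READ_ALL_COLUMNS left true
    System.out.println(rawDataSize(sizes, notSkipIds(true, none, sizes.length)));
  }
}
```

With the flag forced to false the model reports a size of 0, which matches the symptom in this issue; with the flag left true it reports 60.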



[jira] [Commented] (HIVE-14143) RawDataSize of RCFile is zero after analyze

2016-07-01 Thread Pengcheng Xiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359371#comment-15359371
 ] 

Pengcheng Xiong commented on HIVE-14143:


[~nemon], patch LGTM. Could you explain the logic behind the code that you 
have added/changed:
{code}
if (ids.size() > 0) {
  // Set READ_ALL_COLUMNS to false
  conf.setBoolean(READ_ALL_COLUMNS, false);
}
{code}
