[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-25 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776989#comment-16776989
 ] 

BELUGA BEHR commented on HIVE-20079:


[~asinkovits] Thanks for the clarification.  I see now that the current 
implementation just counts columns; I'm on the same page now.

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch, HIVE-20079.4.patch, HIVE-20079.5.patch, HIVE-20079.6.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-25 Thread Antal Sinkovits (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776941#comment-16776941
 ] 

Antal Sinkovits commented on HIVE-20079:


[~belugabehr]
The title of the jira is "Populate more accurate rawDataSize for parquet 
format" - and that is what this patch does, since the current logic uses 1 byte per column.
As I said, I agree with the consolidation, but that is a much larger piece of work, and 
this patch provides a more usable _approximation_ of the raw data size.
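
A minimal sketch of the two estimates being contrasted here, assuming a simple row/column model; the class, method names, and the row-group size are illustrative, not Hive's actual code:

{code}
// Hypothetical sketch only - contrasts the old per-column constant with a
// footer-based approximation. Names and numbers are illustrative.
public class RawDataSizeEstimates {

  // Pre-patch style estimate: a flat 1 byte per column per row.
  static long perColumnConstantEstimate(long numRows, int numColumns) {
    return numRows * numColumns;
  }

  // Patch-style estimate: sum the per-row-group byte sizes reported in the
  // Parquet footer (stands in for summing block.getTotalByteSize()).
  static long footerBasedEstimate(long[] rowGroupByteSizes) {
    long total = 0;
    for (long size : rowGroupByteSizes) {
      total += size;
    }
    return total;
  }

  public static void main(String[] args) {
    // 2 rows x 2 columns -> the "rawDataSize 4" shown in the issue description.
    System.out.println(perColumnConstantEstimate(2, 2));
    // One row group with a made-up uncompressed size of 350 bytes.
    System.out.println(footerBasedEstimate(new long[] {350}));
  }
}
{code}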

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch, HIVE-20079.4.patch, HIVE-20079.5.patch, HIVE-20079.6.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-25 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776925#comment-16776925
 ] 

BELUGA BEHR commented on HIVE-20079:


This patch is still incorrect.  It's actually producing the same wrong numbers 
as before, though perhaps a bit more efficiently.

{code}
totalSize += block.getTotalByteSize();
{code}

{{getTotalByteSize()}} is not the same as "rawDataSize".


bq. rawDataSize—Approximate size of data in memory

https://www.cloudera.com/documentation/enterprise/5-15-x/topics/admin_hos_tuning.html

That means that for a single table row with 4 INTs (values: 1, 2, 3, 4) I would 
expect a rawDataSize of (4 bytes x 4 Java ints) = 16 bytes.  However, Parquet 
would report this as 4 bytes because of the way that Parquet packs these 
numbers internally.  Hive should look at the row count and 
multiply it by the in-memory sizes of the row's data types.

The {{AbstractSerDe}} class should have code to facilitate all of this, like 
{{readNumber()}}, {{readString(int numBytes)}}, etc., that can be called as each 
row is read.
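
A hedged sketch of that row-count-times-type-size idea; the helper names and byte widths below are assumptions for illustration, not Hive's actual JavaDataModel figures or {{AbstractSerDe}} API:

{code}
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only - estimates rawDataSize as numRows multiplied by
// the assumed in-memory width of each column's type.
public class RowBasedRawDataSize {

  // Illustrative per-type in-memory widths (assumptions, not JavaDataModel).
  static int inMemoryBytes(String hiveType) {
    switch (hiveType) {
      case "int":    return 4;   // Java int
      case "bigint": return 8;   // Java long
      case "double": return 8;   // Java double
      case "string": return 16;  // rough per-value object overhead, a guess
      default:       return 8;   // fallback for types not listed here
    }
  }

  static long estimateRawDataSize(long numRows, List<String> columnTypes) {
    long bytesPerRow = 0;
    for (String type : columnTypes) {
      bytesPerRow += inMemoryBytes(type);
    }
    return numRows * bytesPerRow;
  }

  public static void main(String[] args) {
    // One row of 4 ints -> 16 bytes under this model, not the packed size.
    List<String> types = Arrays.asList("int", "int", "int", "int");
    System.out.println(estimateRawDataSize(1, types));
  }
}
{code}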

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch, HIVE-20079.4.patch, HIVE-20079.5.patch, HIVE-20079.6.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-25 Thread Peter Vary (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776738#comment-16776738
 ] 

Peter Vary commented on HIVE-20079:
---

+1

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch, HIVE-20079.4.patch, HIVE-20079.5.patch, HIVE-20079.6.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-25 Thread Antal Sinkovits (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776727#comment-16776727
 ] 

Antal Sinkovits commented on HIVE-20079:


[~stakiar]
The main gap is what is stored in the file footer. Parquet stores a different 
value than ORC does, and that value is internal to Parquet.

I agree that a consistent approach would be good, but it goes further than ORC 
and Parquet, because text files store yet another value. So this should 
be consolidated regardless of the file format. I'll create a jira for this.

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch, HIVE-20079.4.patch, HIVE-20079.5.patch, HIVE-20079.6.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-25 Thread Antal Sinkovits (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776734#comment-16776734
 ] 

Antal Sinkovits commented on HIVE-20079:


https://issues.apache.org/jira/browse/HIVE-21315

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch, HIVE-20079.4.patch, HIVE-20079.5.patch, HIVE-20079.6.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-22 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775431#comment-16775431
 ] 

Hive QA commented on HIVE-20079:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12959742/HIVE-20079.6.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:green}SUCCESS:{color} +1 due to 15811 tests passed

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/16203/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/16203/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-16203/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12959742 - PreCommit-HIVE-Build

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch, HIVE-20079.4.patch, HIVE-20079.5.patch, HIVE-20079.6.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-22 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775378#comment-16775378
 ] 

Hive QA commented on HIVE-20079:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  8m 
34s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
6s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
42s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 
58s{color} | {color:blue} ql in master has 2261 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
0s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
41s{color} | {color:green} ql: The patch generated 0 new + 14 unchanged - 5 
fixed = 14 total (was 19) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
15s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 24m 43s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/data/hiveptest/working/yetus_PreCommit-HIVE-Build-16203/dev-support/hive-personality.sh
 |
| git revision | master / c45751f |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| modules | C: ql U: ql |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-16203/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.



> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch, HIVE-20079.4.patch, HIVE-20079.5.patch, HIVE-20079.6.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-21 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774563#comment-16774563
 ] 

Hive QA commented on HIVE-20079:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12959639/HIVE-20079.5.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 15810 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[timestamptz_2] 
(batchId=86)
org.apache.hive.jdbc.TestTriggersTezSessionPoolManager.testTriggerCustomCreatedDynamicPartitions
 (batchId=264)
org.apache.hive.jdbc.TestTriggersTezSessionPoolManager.testTriggerCustomCreatedDynamicPartitionsUnionAll
 (batchId=264)
org.apache.hive.jdbc.TestTriggersTezSessionPoolManager.testTriggerHighShuffleBytes
 (batchId=264)
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/16189/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/16189/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-16189/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12959639 - PreCommit-HIVE-Build

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch, HIVE-20079.4.patch, HIVE-20079.5.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-21 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774536#comment-16774536
 ] 

Hive QA commented on HIVE-20079:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  8m 
33s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
3s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
36s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  4m 
10s{color} | {color:blue} ql in master has 2260 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
59s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
40s{color} | {color:green} ql: The patch generated 0 new + 14 unchanged - 5 
fixed = 14 total (was 19) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
15s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 24m 41s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/data/hiveptest/working/yetus_PreCommit-HIVE-Build-16189/dev-support/hive-personality.sh
 |
| git revision | master / e71b096 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| modules | C: ql U: ql |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-16189/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.



> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch, HIVE-20079.4.patch, HIVE-20079.5.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-21 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774090#comment-16774090
 ] 

Hive QA commented on HIVE-20079:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12959567/HIVE-20079.4.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 15810 tests 
executed
*Failed tests:*
{noformat}
org.apache.hive.jdbc.miniHS2.TestHs2ConnectionMetricsHttp.testOpenConnectionMetrics
 (batchId=266)
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/16179/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/16179/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-16179/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12959567 - PreCommit-HIVE-Build

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch, HIVE-20079.4.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-21 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774064#comment-16774064
 ] 

Hive QA commented on HIVE-20079:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  8m 
36s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
6s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
39s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 
55s{color} | {color:blue} ql in master has 2260 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
57s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
39s{color} | {color:red} ql: The patch generated 3 new + 13 unchanged - 6 fixed 
= 16 total (was 19) {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git 
apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply 
{color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
14s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 24m  6s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/data/hiveptest/working/yetus_PreCommit-HIVE-Build-16179/dev-support/hive-personality.sh
 |
| git revision | master / 89b9f12 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| checkstyle | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-16179/yetus/diff-checkstyle-ql.txt
 |
| whitespace | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-16179/yetus/whitespace-eol.txt
 |
| modules | C: ql U: ql |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-16179/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.



> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch, HIVE-20079.4.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-20 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773273#comment-16773273
 ] 

Hive QA commented on HIVE-20079:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12959444/HIVE-20079.3.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 11 failed/errored test(s), 15790 tests 
executed
*Failed tests:*
{noformat}
TestMiniSparkOnYarnCliDriver - did not produce a TEST-*.xml file (likely timed 
out) (batchId=191)

[infer_bucket_sort_num_buckets.q,gen_udf_example_add10.q,spark_explainuser_1.q,spark_use_ts_stats_for_mapjoin.q,orc_merge6.q,orc_merge5.q,bucketmapjoin6.q,spark_opt_shuffle_serde.q,temp_table_external.q,spark_dynamic_partition_pruning_6.q,dynamic_rdd_cache.q,auto_sortmerge_join_16.q,vector_outer_join3.q,spark_dynamic_partition_pruning_7.q,schemeAuthority.q,parallel_orderby.q,vector_outer_join1.q,load_hdfs_file_with_space_in_the_name.q,spark_dynamic_partition_pruning_recursive_mapjoin.q,spark_dynamic_partition_pruning_mapjoin_only.q]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_stats] 
(batchId=48)
org.apache.hadoop.hive.metastore.TestObjectStore.testDirectSQLDropParitionsCleanup
 (batchId=230)
org.apache.hadoop.hive.metastore.TestObjectStore.testDirectSQLDropPartitionsCacheCrossSession
 (batchId=230)
org.apache.hadoop.hive.metastore.TestObjectStore.testDirectSqlErrorMetrics 
(batchId=230)
org.apache.hadoop.hive.metastore.TestObjectStore.testEmptyTrustStoreProps 
(batchId=230)
org.apache.hadoop.hive.metastore.TestObjectStore.testMaxEventResponse 
(batchId=230)
org.apache.hadoop.hive.metastore.TestObjectStore.testPartitionOps (batchId=230)
org.apache.hadoop.hive.metastore.TestObjectStore.testQueryCloseOnError 
(batchId=230)
org.apache.hadoop.hive.metastore.TestObjectStore.testRoleOps (batchId=230)
org.apache.hadoop.hive.metastore.TestObjectStore.testUseSSLProperty 
(batchId=230)
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/16164/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/16164/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-16164/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 11 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12959444 - PreCommit-HIVE-Build

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-20 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773250#comment-16773250
 ] 

Hive QA commented on HIVE-20079:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  8m 
17s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
7s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
36s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  4m  
2s{color} | {color:blue} ql in master has 2260 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
0s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
9s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
42s{color} | {color:red} ql: The patch generated 3 new + 13 unchanged - 6 fixed 
= 16 total (was 19) {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git 
apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply 
{color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
14s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 24m 37s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/data/hiveptest/working/yetus_PreCommit-HIVE-Build-16164/dev-support/hive-personality.sh
 |
| git revision | master / 89b9f12 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| checkstyle | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-16164/yetus/diff-checkstyle-ql.txt
 |
| whitespace | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-16164/yetus/whitespace-eol.txt
 |
| modules | C: ql U: ql |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-16164/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.



> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-20 Thread Sahil Takiar (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773098#comment-16773098
 ] 

Sahil Takiar commented on HIVE-20079:
-

How does ORC handle this? Is there a fundamental reason we can't mimic the same 
thing they are doing? Getting things consistent with how ORC handles this 
makes more sense to me than implementing two different approaches for ORC vs. 
Parquet and ending up with an inconsistent definition of {{rawDataSize}} 
depending on the file format. That said, this patch is probably a better estimate, 
so I see no reason not to proceed with it.

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-20 Thread Antal Sinkovits (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773048#comment-16773048
 ] 

Antal Sinkovits commented on HIVE-20079:


[~stakiar] I'm afraid that's not an option, as it would cause a discrepancy 
between the calculations.
Check my comment here:
https://issues.apache.org/jira/browse/HIVE-20523



> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-20 Thread Sahil Takiar (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773046#comment-16773046
 ] 

Sahil Takiar commented on HIVE-20079:
-

FYI, I don't think {{block.getTotalByteSize}} provides the size of the data when 
loaded into memory. After talking to a few Parquet folks, it appears no such method 
to get the raw data size exists. If we want to implement this patch, we will have to do 
something similar to what ORC does - 
https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/ReaderImpl.java#L601
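
For reference, a rough sketch of that ORC-style approach, where per-column raw sizes are derived from footer statistics and then summed; the statistics class, field names, and constants below are simplified stand-ins, not the real ORC or Hive API:

{code}
// Hypothetical sketch only - not the real org.apache.orc.impl.ReaderImpl code.
public class OrcStyleRawDataSize {

  // Simplified stand-in for one column's footer statistics.
  static class ColumnStats {
    final long numValues;        // non-null values in the column
    final long totalStringBytes; // accumulated character bytes (string columns)
    final boolean isString;

    ColumnStats(long numValues, long totalStringBytes, boolean isString) {
      this.numValues = numValues;
      this.totalStringBytes = totalStringBytes;
      this.isString = isString;
    }
  }

  // Fixed-width types: numValues * type width. String-like types: accumulated
  // character bytes plus a per-value object overhead (constants are guesses).
  static long rawSizeOfColumn(ColumnStats stats, int fixedWidthBytes) {
    if (stats.isString) {
      return stats.totalStringBytes + stats.numValues * 16L;
    }
    return stats.numValues * fixedWidthBytes;
  }

  public static void main(String[] args) {
    // The table from the issue description: 2 rows of (int, string).
    ColumnStats idCol  = new ColumnStats(2, 0, false);
    ColumnStats strCol =
        new ColumnStats(2, "this is string 0".length() + "string 1".length(), true);
    long rawDataSize = rawSizeOfColumn(idCol, 4) + rawSizeOfColumn(strCol, 0);
    System.out.println(rawDataSize); // much closer to an in-memory size than 4
  }
}
{code}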

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-18 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771292#comment-16771292
 ] 

Hive QA commented on HIVE-20079:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12931668/HIVE-20079.2.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/16131/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/16131/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-16131/

Messages:
{noformat}
 This message was trimmed, see log for full details 
/data/hiveptest/working/scratch/build.patch:2050: trailing whitespace.
id  int 
/data/hiveptest/working/scratch/build.patch:2051: trailing whitespace.
str string  
error: patch failed: 
ql/src/test/results/clientpositive/llap/vector_partitioned_date_time.q.out:4323
Falling back to three-way merge...
Applied patch to 
'ql/src/test/results/clientpositive/llap/vector_partitioned_date_time.q.out' 
with conflicts.
error: patch failed: 
ql/src/test/results/clientpositive/nested_column_pruning.q.out:135
Falling back to three-way merge...
Applied patch to 
'ql/src/test/results/clientpositive/nested_column_pruning.q.out' with conflicts.
error: patch failed: ql/src/test/results/clientpositive/parquet_analyze.q.out:93
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/parquet_analyze.q.out' 
with conflicts.
error: patch failed: 
ql/src/test/results/clientpositive/parquet_complex_types_vectorization.q.out:102
Falling back to three-way merge...
Applied patch to 
'ql/src/test/results/clientpositive/parquet_complex_types_vectorization.q.out' 
with conflicts.
error: patch failed: ql/src/test/results/clientpositive/parquet_join.q.out:76
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/parquet_join.q.out' with 
conflicts.
error: patch failed: 
ql/src/test/results/clientpositive/parquet_map_type_vectorization.q.out:114
Falling back to three-way merge...
Applied patch to 
'ql/src/test/results/clientpositive/parquet_map_type_vectorization.q.out' with 
conflicts.
error: patch failed: 
ql/src/test/results/clientpositive/parquet_no_row_serde.q.out:139
Falling back to three-way merge...
Applied patch to 
'ql/src/test/results/clientpositive/parquet_no_row_serde.q.out' with conflicts.
error: patch failed: 
ql/src/test/results/clientpositive/parquet_vectorization_13.q.out:80
Falling back to three-way merge...
Applied patch to 
'ql/src/test/results/clientpositive/parquet_vectorization_13.q.out' with 
conflicts.
error: patch failed: 
ql/src/test/results/clientpositive/parquet_vectorization_14.q.out:80
Falling back to three-way merge...
Applied patch to 
'ql/src/test/results/clientpositive/parquet_vectorization_14.q.out' with 
conflicts.
error: patch failed: 
ql/src/test/results/clientpositive/parquet_vectorization_15.q.out:76
Falling back to three-way merge...
Applied patch to 
'ql/src/test/results/clientpositive/parquet_vectorization_15.q.out' with 
conflicts.
error: patch failed: 
ql/src/test/results/clientpositive/parquet_vectorization_16.q.out:53
Falling back to three-way merge...
Applied patch to 
'ql/src/test/results/clientpositive/parquet_vectorization_16.q.out' with 
conflicts.
error: patch failed: 
ql/src/test/results/clientpositive/parquet_vectorization_17.q.out:61
Falling back to three-way merge...
Applied patch to 
'ql/src/test/results/clientpositive/parquet_vectorization_17.q.out' with 
conflicts.
error: patch failed: 
ql/src/test/results/clientpositive/parquet_vectorization_2.q.out:59
Falling back to three-way merge...
Applied patch to 
'ql/src/test/results/clientpositive/parquet_vectorization_2.q.out' with 
conflicts.
error: patch failed: 
ql/src/test/results/clientpositive/parquet_vectorization_3.q.out:64
Falling back to three-way merge...
Applied patch to 
'ql/src/test/results/clientpositive/parquet_vectorization_3.q.out' with 
conflicts.
error: patch failed: 
ql/src/test/results/clientpositive/parquet_vectorization_4.q.out:59
Falling back to three-way merge...
Applied patch to 
'ql/src/test/results/clientpositive/parquet_vectorization_4.q.out' with 
conflicts.
error: patch failed: 
ql/src/test/results/clientpositive/parquet_vectorization_5.q.out:53
Falling back to three-way merge...
Applied patch to 
'ql/src/test/results/clientpositive/parquet_vectorization_5.q.out' with 
conflicts.
error: patch failed: 
ql/src/test/results/clientpositive/parquet_vectorization_6.q.out:55
Falling back to three-way merge...
Applied patch to 
'ql/src/test/results/clientpositive/parquet_vectorization_6.q.out' with 
conflicts.
error: patch failed: 
ql/src/test/results/clientpositive/parquet_vectorization_7.q.out:67
Falling 

[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2018-12-15 Thread Antal Sinkovits (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722137#comment-16722137
 ] 

Antal Sinkovits commented on HIVE-20079:


Hi, 
I'm planning to work on this next week. Let me know if there are any concerns 
regarding it. Thanks.

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2018-11-05 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675625#comment-16675625
 ] 

Hive QA commented on HIVE-20079:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12931668/HIVE-20079.2.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/14750/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/14750/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-14750/

Messages:
{noformat}
 This message was trimmed, see log for full details 
Applied patch to 
'ql/src/test/results/clientpositive/parquet_vectorization_limit.q.out' cleanly.
error: patch failed: 
ql/src/test/results/clientpositive/spark/parquet_vectorization_0.q.out:34
Falling back to three-way merge...
Applied patch to 
'ql/src/test/results/clientpositive/spark/parquet_vectorization_0.q.out' with 
conflicts.
error: patch failed: 
ql/src/test/results/clientpositive/spark/parquet_vectorization_1.q.out:60
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_1.q.out' cleanly.
error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_10.q.out:64
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_10.q.out' with conflicts.
error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_11.q.out:46
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_11.q.out' cleanly.
error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_12.q.out:83
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_12.q.out' with conflicts.
error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_13.q.out:85
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_13.q.out' with conflicts.
error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_14.q.out:85
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_14.q.out' with conflicts.
error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_15.q.out:81
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_15.q.out' with conflicts.
error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_16.q.out:58
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_16.q.out' with conflicts.
error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_17.q.out:66
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_17.q.out' with conflicts.
error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_2.q.out:64
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_2.q.out' cleanly.
error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_3.q.out:69
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_3.q.out' cleanly.
error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_4.q.out:64
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_4.q.out' cleanly.
error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_5.q.out:58
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_5.q.out' cleanly.
error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_6.q.out:58
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_6.q.out' with conflicts.
error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_7.q.out:72
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_7.q.out' with conflicts.
error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_8.q.out:68
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_8.q.out' with conflicts.
error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_9.q.out:58
Falling back to three-way merge...

[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2018-09-14 Thread Sahil Takiar (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16615173#comment-16615173
 ] 

Sahil Takiar commented on HIVE-20079:
-

[~aihuaxu] Are you still planning to work on this? If not, mind if I assign it to myself?

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles 1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2018-07-15 Thread Aihua Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544613#comment-16544613
 ] 

Aihua Xu commented on HIVE-20079:
-

[~stakiar] Thanks for taking a look. Noticed that as well and I will check that out.

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles 1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2018-07-14 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544385#comment-16544385
 ] 

Hive QA commented on HIVE-20079:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12931668/HIVE-20079.2.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 14 failed/errored test(s), 14645 tests executed
*Failed tests:*
{noformat}
TestMiniDruidCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=191)
  [druidmini_dynamic_partition.q,druidmini_expressions.q,druidmini_test_alter.q,druidmini_test1.q,druidmini_test_insert.q]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_struct_type_vectorization] (batchId=27)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_non_dictionary_encoding_vectorization] (batchId=89)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_vectorization] (batchId=14)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_0] (batchId=17)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_10] (batchId=23)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_11] (batchId=39)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_12] (batchId=24)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_1] (batchId=11)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_parquet_projection] (batchId=45)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorized_parquet_types] (batchId=70)
org.apache.hadoop.hive.cli.TestMiniDruidCliDriver.testCliDriver[druid_timestamptz] (batchId=192)
org.apache.hadoop.hive.cli.TestMiniDruidCliDriver.testCliDriver[druidmini_joins] (batchId=192)
org.apache.hadoop.hive.cli.TestMiniDruidCliDriver.testCliDriver[druidmini_masking] (batchId=192)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/12619/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/12619/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-12619/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 14 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12931668 - PreCommit-HIVE-Build

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles 1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2018-07-14 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544382#comment-16544382
 ] 

Hive QA commented on HIVE-20079:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 47s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  5s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 36s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 49s{color} | {color:blue} ql in master has 2291 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 54s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  1s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 35s{color} | {color:red} ql: The patch generated 3 new + 13 unchanged - 6 fixed = 16 total (was 19) {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m  3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 55s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 12s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 22m 53s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-12619/dev-support/hive-personality.sh |
| git revision | master / 1b5903b |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-12619/yetus/diff-checkstyle-ql.txt |
| whitespace | http://104.198.109.242/logs//PreCommit-HIVE-Build-12619/yetus/whitespace-eol.txt |
| modules | C: ql U: ql |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-12619/yetus.txt |
| Powered by | Apache Yetus   http://yetus.apache.org |


This message was automatically generated.



> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles 1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2018-07-14 Thread Sahil Takiar (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544333#comment-16544333
 ] 

Sahil Takiar commented on HIVE-20079:
-

Some of the updates to the q.out files change {{rawDataSize}} to a value of 0, which doesn't look right - e.g. parquet_vectorization_0.q.out

So does {{BlockMetaData#getTotalByteSize}} return the size of the block when loaded into memory? It looks like we close the file first and then re-open it to get this metadata; is there any way to collect it while the file is still open?
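
For readers skimming the thread, here is a minimal Java sketch of the footer-based estimate under discussion: it sums {{BlockMetaData#getTotalByteSize}} over the row groups listed in a Parquet file footer. This is an illustration only, not the attached patch; the class name {{ParquetRawSizeEstimator}} and the standalone entry point are hypothetical, and it assumes parquet-hadoop and hadoop-common are on the classpath.

{noformat}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

/** Hypothetical illustration for this discussion; not part of HIVE-20079. */
public class ParquetRawSizeEstimator {

  /**
   * Estimates an uncompressed data size for a Parquet file by summing the
   * total byte size recorded for every row group in the file footer.
   */
  public static long estimateRawDataSize(Configuration conf, Path file) throws IOException {
    // Reads only the footer metadata; no row data is scanned.
    ParquetMetadata footer = ParquetFileReader.readFooter(conf, file);
    long totalBytes = 0L;
    for (BlockMetaData rowGroup : footer.getBlocks()) {
      // Per the Parquet format, total_byte_size is the uncompressed size of
      // all the column data in the row group.
      totalBytes += rowGroup.getTotalByteSize();
    }
    return totalBytes;
  }

  public static void main(String[] args) throws IOException {
    Path file = new Path(args[0]);
    System.out.println("estimated rawDataSize = " + estimateRawDataSize(new Configuration(), file));
  }
}
{noformat}

Note that a sketch like this still re-reads the footer after the file has been closed; whether the same per-row-group totals can be captured from the in-memory footer before the writer closes the file is exactly the open question above.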

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles 1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2018-07-14 Thread Aihua Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544305#comment-16544305
 ] 

Aihua Xu commented on HIVE-20079:
-

patch-2: fix affected unit tests.

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles 1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2018-07-12 Thread Sahil Takiar (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542117#comment-16542117
 ] 

Sahil Takiar commented on HIVE-20079:
-

Looks similar to HIVE-16887

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
> Attachments: HIVE-20079.1.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles 1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2018-07-04 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533250#comment-16533250
 ] 

Hive QA commented on HIVE-20079:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12930200/HIVE-20079.1.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 66 failed/errored test(s), 14638 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[nested_column_pruning] (batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_analyze] (batchId=23)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_complex_types_vectorization] (batchId=75)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_join] (batchId=20)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_map_type_vectorization] (batchId=87)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_no_row_serde] (batchId=73)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_struct_type_vectorization] (batchId=27)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_non_dictionary_encoding_vectorization] (batchId=89)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_vectorization] (batchId=14)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_0] (batchId=17)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_10] (batchId=23)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_11] (batchId=39)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_12] (batchId=24)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_13] (batchId=54)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_14] (batchId=40)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_15] (batchId=90)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_16] (batchId=85)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_17] (batchId=30)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_1] (batchId=11)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_2] (batchId=3)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_3] (batchId=80)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_4] (batchId=45)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_5] (batchId=73)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_6] (batchId=43)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_7] (batchId=88)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_8] (batchId=14)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_9] (batchId=31)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_decimal_date] (batchId=31)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_div0] (batchId=80)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_limit] (batchId=25)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_offset_limit] (batchId=34)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_part_project] (batchId=37)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_pushdown] (batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_numeric_overflows] (batchId=72)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_parquet_projection] (batchId=45)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorized_parquet_types] (batchId=69)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_partitioned_date_time] (batchId=175)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning] (batchId=184)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_join] (batchId=116)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_0] (batchId=114)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_10] (batchId=117)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_11] (batchId=124)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_12] (batchId=118)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_13] (batchId=131)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_14] (batchId=124)

[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2018-07-04 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533242#comment-16533242
 ] 

Hive QA commented on HIVE-20079:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  8m 16s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 11s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 40s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  4m 29s{color} | {color:blue} ql in master has 2287 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  2s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 10s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 42s{color} | {color:red} ql: The patch generated 3 new + 8 unchanged - 3 fixed = 11 total (was 11) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  0s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 15s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 25m 32s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-12388/dev-support/hive-personality.sh |
| git revision | master / 5e2a530 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-12388/yetus/diff-checkstyle-ql.txt |
| modules | C: ql U: ql |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-12388/yetus.txt |
| Powered by | Apache Yetus   http://yetus.apache.org |


This message was automatically generated.



> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
> Attachments: HIVE-20079.1.patch
>
>
> Run the following queries and you will see the raw data for the table is 4 
> (that is the number of fields) incorrectly. We need to populate correct data 
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles 1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)