[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776989#comment-16776989 ]

BELUGA BEHR commented on HIVE-20079:

[~asinkovits] Thanks for the clarification. I see now that the current implementation just counts columns. I'm on the same page now. Thanks.

> Populate more accurate rawDataSize for parquet format
> -----------------------------------------------------
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
> Issue Type: Improvement
> Components: File Formats
> Affects Versions: 2.0.0
> Reporter: Aihua Xu
> Assignee: Antal Sinkovits
> Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, HIVE-20079.3.patch, HIVE-20079.4.patch, HIVE-20079.5.patch, HIVE-20079.6.patch
>
> Run the following queries and you will see that the raw data size for the table is incorrectly reported as 4 (that is, the number of fields). We need to populate the correct data size so that the data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
> COLUMN_STATS_ACCURATE true
> numFiles 1
> numRows 2
> rawDataSize 4
> totalSize 373
> transient_lastDdlTime 1530660523
> {noformat}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776941#comment-16776941 ]

Antal Sinkovits commented on HIVE-20079:

[~belugabehr] The title of the jira is "Populate more accurate rawDataSize for parquet format", and that is what the patch does, because the current logic uses 1 byte per column. As I said, I agree with the consolidation, but that is a much larger piece of work, and this patch provides a more usable _approximation_ of the raw data size.
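The "1 byte per column" behaviour mentioned above can be checked against the numbers in the issue description with a little arithmetic. The sketch below is only an illustration: the class and method names are hypothetical and are not Hive code, and it assumes that "1 byte per column" means one byte per field value, which is consistent with the reported rawDataSize of 4 for 2 rows and 2 columns.

```java
/** Hypothetical illustration, not Hive code. */
class FieldCountEstimate {

    /** Assumption: the pre-patch value is one byte per field value, i.e. rows x columns. */
    static long fieldCount(long numRows, int numColumns) {
        return numRows * numColumns;
    }

    public static void main(String[] args) {
        // parquet_stats from the issue description: 2 rows, 2 columns (id, str)
        System.out.println(fieldCount(2, 2)); // prints 4
    }
}
```

Under that assumption the value does not depend on the column types or on the string lengths, which is why it is of little use for split planning.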
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776925#comment-16776925 ]

BELUGA BEHR commented on HIVE-20079:

This patch is still incorrect. It is actually producing the same wrong numbers as before, though perhaps a bit more efficiently.

{code}
totalSize += block.getTotalByteSize();
{code}

{{getTotalByteSize()}} is not the same as "rawDataSize".

bq. rawDataSize - Approximate size of data in memory

https://www.cloudera.com/documentation/enterprise/5-15-x/topics/admin_hos_tuning.html

That means that for a single table row with 4 INTs (values: 1,2,3,4) I would expect a rawDataSize of (4 bytes x 4 Java ints) = 16 bytes. However, Parquet would report this as 4 bytes because of the way that Parquet packs these numbers internally. Hive should look at the row count and multiply it by the sizes of the row's data types. The {{AbstractSerDe}} class should have code to facilitate all of this, such as {{readNumber()}}, {{readString(int numBytes)}}, etc., that can be called as each row is read.
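The "row count multiplied by data type sizes" idea in the comment above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patch and not Hive's API: the class, the enum, and the method names are all hypothetical, fixed-width types are charged their Java primitive width, and strings are charged an assumed average byte length supplied by the caller.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Hive: estimates rawDataSize from row count and column types. */
class RawDataSizeEstimate {

    /** Illustrative column kinds only; Hive's real type system is much richer. */
    enum ColumnType { INT, BIGINT, DOUBLE, STRING }

    /** Assumed in-memory width, in bytes, of one value of a fixed-width type. */
    static long fixedWidth(ColumnType type) {
        switch (type) {
            case INT:    return Integer.BYTES; // 4
            case BIGINT: return Long.BYTES;    // 8
            case DOUBLE: return Double.BYTES;  // 8
            default:
                throw new IllegalArgumentException("not a fixed-width type: " + type);
        }
    }

    /** Row count times the summed per-column widths; strings use the caller's average length. */
    static long estimate(long numRows, List<ColumnType> columns, long avgStringBytes) {
        long bytesPerRow = 0;
        for (ColumnType column : columns) {
            bytesPerRow += (column == ColumnType.STRING) ? avgStringBytes : fixedWidth(column);
        }
        return numRows * bytesPerRow;
    }

    public static void main(String[] args) {
        // One row with four INT columns, as in the comment: 1 x (4 + 4 + 4 + 4).
        List<ColumnType> fourInts = Arrays.asList(
                ColumnType.INT, ColumnType.INT, ColumnType.INT, ColumnType.INT);
        System.out.println(estimate(1, fourInts, 0)); // prints 16

        // parquet_stats from the issue: 2 rows of (int, string); the two strings are
        // 16 and 8 characters long, so an average of 12 bytes is assumed here.
        List<ColumnType> idAndStr = Arrays.asList(ColumnType.INT, ColumnType.STRING);
        System.out.println(estimate(2, idAndStr, 12)); // prints 32
    }
}
```

The estimate ignores object headers, nulls, and encoding overhead, so it should be read as a rough in-memory approximation rather than an exact figure.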
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776738#comment-16776738 ]

Peter Vary commented on HIVE-20079:

+1
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776727#comment-16776727 ]

Antal Sinkovits commented on HIVE-20079:

[~stakiar] The main gap is what is stored in the file footer. Parquet stores a different value than ORC, and that value is internal to Parquet. I agree that a consistent approach would be good, but it goes further than ORC and Parquet, because for text files it is also a different value. So this should be consolidated regardless of the file format. I'll create a jira for this.
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776734#comment-16776734 ]

Antal Sinkovits commented on HIVE-20079:

https://issues.apache.org/jira/browse/HIVE-21315
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775431#comment-16775431 ]

Hive QA commented on HIVE-20079:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12959742/HIVE-20079.6.patch

SUCCESS: +1 due to 2 test(s) being added or modified.

SUCCESS: +1 due to 15811 tests passed

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/16203/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/16203/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-16203/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12959742 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775378#comment-16775378 ]

Hive QA commented on HIVE-20079:

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 8m 34s | master passed |
| +1 | compile | 1m 6s | master passed |
| +1 | checkstyle | 0m 42s | master passed |
| 0 | findbugs | 3m 58s | ql in master has 2261 extant Findbugs warnings. |
| +1 | javadoc | 1m 0s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 1m 27s | the patch passed |
| +1 | compile | 1m 9s | the patch passed |
| +1 | javac | 1m 9s | the patch passed |
| +1 | checkstyle | 0m 41s | ql: The patch generated 0 new + 14 unchanged - 5 fixed = 14 total (was 19) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | findbugs | 4m 21s | the patch passed |
| +1 | javadoc | 0m 58s | the patch passed |
|| || || || Other Tests ||
| +1 | asflicense | 0m 15s | The patch does not generate ASF License warnings. |
| | | 24m 43s | |

|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-16203/dev-support/hive-personality.sh |
| git revision | master / c45751f |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| modules | C: ql U: ql |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-16203/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774563#comment-16774563 ]

Hive QA commented on HIVE-20079:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12959639/HIVE-20079.5.patch

SUCCESS: +1 due to 2 test(s) being added or modified.

ERROR: -1 due to 4 failed/errored test(s), 15810 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[timestamptz_2] (batchId=86)
org.apache.hive.jdbc.TestTriggersTezSessionPoolManager.testTriggerCustomCreatedDynamicPartitions (batchId=264)
org.apache.hive.jdbc.TestTriggersTezSessionPoolManager.testTriggerCustomCreatedDynamicPartitionsUnionAll (batchId=264)
org.apache.hive.jdbc.TestTriggersTezSessionPoolManager.testTriggerHighShuffleBytes (batchId=264)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/16189/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/16189/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-16189/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12959639 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774536#comment-16774536 ]

Hive QA commented on HIVE-20079:

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 8m 33s | master passed |
| +1 | compile | 1m 3s | master passed |
| +1 | checkstyle | 0m 36s | master passed |
| 0 | findbugs | 4m 10s | ql in master has 2260 extant Findbugs warnings. |
| +1 | javadoc | 0m 59s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 1m 27s | the patch passed |
| +1 | compile | 1m 6s | the patch passed |
| +1 | javac | 1m 6s | the patch passed |
| +1 | checkstyle | 0m 40s | ql: The patch generated 0 new + 14 unchanged - 5 fixed = 14 total (was 19) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | findbugs | 4m 19s | the patch passed |
| +1 | javadoc | 1m 0s | the patch passed |
|| || || || Other Tests ||
| +1 | asflicense | 0m 15s | The patch does not generate ASF License warnings. |
| | | 24m 41s | |

|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-16189/dev-support/hive-personality.sh |
| git revision | master / e71b096 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| modules | C: ql U: ql |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-16189/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774090#comment-16774090 ]

Hive QA commented on HIVE-20079:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12959567/HIVE-20079.4.patch

SUCCESS: +1 due to 2 test(s) being added or modified.

ERROR: -1 due to 1 failed/errored test(s), 15810 tests executed

*Failed tests:*
{noformat}
org.apache.hive.jdbc.miniHS2.TestHs2ConnectionMetricsHttp.testOpenConnectionMetrics (batchId=266)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/16179/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/16179/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-16179/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12959567 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774064#comment-16774064 ]

Hive QA commented on HIVE-20079:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 8m 36s | master passed |
| +1 | compile | 1m 6s | master passed |
| +1 | checkstyle | 0m 39s | master passed |
| 0 | findbugs | 3m 55s | ql in master has 2260 extant Findbugs warnings. |
| +1 | javadoc | 0m 57s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 1m 25s | the patch passed |
| +1 | compile | 1m 3s | the patch passed |
| +1 | javac | 1m 3s | the patch passed |
| -1 | checkstyle | 0m 39s | ql: The patch generated 3 new + 13 unchanged - 6 fixed = 16 total (was 19) |
| -1 | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| +1 | findbugs | 4m 3s | the patch passed |
| +1 | javadoc | 0m 56s | the patch passed |
|| || || || Other Tests ||
| +1 | asflicense | 0m 14s | The patch does not generate ASF License warnings. |
| | | 24m 6s | |

|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-16179/dev-support/hive-personality.sh |
| git revision | master / 89b9f12 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-16179/yetus/diff-checkstyle-ql.txt |
| whitespace | http://104.198.109.242/logs//PreCommit-HIVE-Build-16179/yetus/whitespace-eol.txt |
| modules | C: ql U: ql |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-16179/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773273#comment-16773273 ]

Hive QA commented on HIVE-20079:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12959444/HIVE-20079.3.patch

SUCCESS: +1 due to 2 test(s) being added or modified.

ERROR: -1 due to 11 failed/errored test(s), 15790 tests executed

*Failed tests:*
{noformat}
TestMiniSparkOnYarnCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=191)
	[infer_bucket_sort_num_buckets.q,gen_udf_example_add10.q,spark_explainuser_1.q,spark_use_ts_stats_for_mapjoin.q,orc_merge6.q,orc_merge5.q,bucketmapjoin6.q,spark_opt_shuffle_serde.q,temp_table_external.q,spark_dynamic_partition_pruning_6.q,dynamic_rdd_cache.q,auto_sortmerge_join_16.q,vector_outer_join3.q,spark_dynamic_partition_pruning_7.q,schemeAuthority.q,parallel_orderby.q,vector_outer_join1.q,load_hdfs_file_with_space_in_the_name.q,spark_dynamic_partition_pruning_recursive_mapjoin.q,spark_dynamic_partition_pruning_mapjoin_only.q]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_stats] (batchId=48)
org.apache.hadoop.hive.metastore.TestObjectStore.testDirectSQLDropParitionsCleanup (batchId=230)
org.apache.hadoop.hive.metastore.TestObjectStore.testDirectSQLDropPartitionsCacheCrossSession (batchId=230)
org.apache.hadoop.hive.metastore.TestObjectStore.testDirectSqlErrorMetrics (batchId=230)
org.apache.hadoop.hive.metastore.TestObjectStore.testEmptyTrustStoreProps (batchId=230)
org.apache.hadoop.hive.metastore.TestObjectStore.testMaxEventResponse (batchId=230)
org.apache.hadoop.hive.metastore.TestObjectStore.testPartitionOps (batchId=230)
org.apache.hadoop.hive.metastore.TestObjectStore.testQueryCloseOnError (batchId=230)
org.apache.hadoop.hive.metastore.TestObjectStore.testRoleOps (batchId=230)
org.apache.hadoop.hive.metastore.TestObjectStore.testUseSSLProperty (batchId=230)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/16164/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/16164/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-16164/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 11 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12959444 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773250#comment-16773250 ] Hive QA commented on HIVE-20079: | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 17s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 7s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 36s{color} | {color:green} master passed {color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 4m 2s{color} | {color:blue} ql in master has 2260 extant Findbugs warnings. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 0s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 9s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 9s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 42s{color} | {color:red} ql: The patch generated 3 new + 13 unchanged - 6 fixed = 16 total (was 19) {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. 
Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 18s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 57s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 14s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 24m 37s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Optional Tests | asflicense javac javadoc findbugs checkstyle compile | | uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux | | Build tool | maven | | Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-16164/dev-support/hive-personality.sh | | git revision | master / 89b9f12 | | Default Java | 1.8.0_111 | | findbugs | v3.0.0 | | checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-16164/yetus/diff-checkstyle-ql.txt | | whitespace | http://104.198.109.242/logs//PreCommit-HIVE-Build-16164/yetus/whitespace-eol.txt | | modules | C: ql U: ql | | Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-16164/yetus.txt | | Powered by | Apache Yetushttp://yetus.apache.org | This message was automatically generated. 
> Populate more accurate rawDataSize for parquet format > - > > Key: HIVE-20079 > URL: https://issues.apache.org/jira/browse/HIVE-20079 > Project: Hive > Issue Type: Improvement > Components: File Formats >Affects Versions: 2.0.0 >Reporter: Aihua Xu >Assignee: Antal Sinkovits >Priority: Major > Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, > HIVE-20079.3.patch > > > Run the following queries and you will see the raw data for the table is 4 > (that is the number of fields) incorrectly. We need to populate correct data > size so data can be split properly. > {noformat} > SET hive.stats.autogather=true; > CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET; > INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1'); > DESC FORMATTED parquet_stats; > {noformat} > {noformat} > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 2 > rawDataSize 4 > totalSize 373 > transient_lastDdlTime 1530660523 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773098#comment-16773098 ] Sahil Takiar commented on HIVE-20079: - How does ORC handle this? Is there a fundamental reason we can't mimic the same thing they are doing? Getting things to be consistent with how ORC handles this makes more sense to me than implementing two different approaches for ORC vs. Parquet and ending up with an inconsistent definition of {{rawDataSize}} depending on the file format. Sure, this patch is probably a better estimation so I see no reason to not proceed with it. > Populate more accurate rawDataSize for parquet format > - > > Key: HIVE-20079 > URL: https://issues.apache.org/jira/browse/HIVE-20079 > Project: Hive > Issue Type: Improvement > Components: File Formats >Affects Versions: 2.0.0 >Reporter: Aihua Xu >Assignee: Antal Sinkovits >Priority: Major > Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, > HIVE-20079.3.patch > > > Run the following queries and you will see the raw data for the table is 4 > (that is the number of fields) incorrectly. We need to populate correct data > size so data can be split properly. > {noformat} > SET hive.stats.autogather=true; > CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET; > INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1'); > DESC FORMATTED parquet_stats; > {noformat} > {noformat} > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 2 > rawDataSize 4 > totalSize 373 > transient_lastDdlTime 1530660523 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773048#comment-16773048 ] Antal Sinkovits commented on HIVE-20079: [~stakiar] I'm afraid that's not an option, as it will cause a discrepancy between the calculations. See my comment here: https://issues.apache.org/jira/browse/HIVE-20523 > Populate more accurate rawDataSize for parquet format > - > > Key: HIVE-20079 > URL: https://issues.apache.org/jira/browse/HIVE-20079 > Project: Hive > Issue Type: Improvement > Components: File Formats >Affects Versions: 2.0.0 >Reporter: Aihua Xu >Assignee: Antal Sinkovits >Priority: Major > Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, > HIVE-20079.3.patch > > > Run the following queries and you will see the raw data for the table is 4 > (that is the number of fields) incorrectly. We need to populate correct data > size so data can be split properly. > {noformat} > SET hive.stats.autogather=true; > CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET; > INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1'); > DESC FORMATTED parquet_stats; > {noformat} > {noformat} > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 2 > rawDataSize 4 > totalSize 373 > transient_lastDdlTime 1530660523 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773046#comment-16773046 ] Sahil Takiar commented on HIVE-20079: - FYI I don't think {{block.getTotalByteSize}} provides the size of the data when loaded into memory. Talking to a few Parquet folks, no such method to get the raw data size exists. If we want to implement this patch we will have to do something similar to what ORC does - https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/ReaderImpl.java#L601 > Populate more accurate rawDataSize for parquet format > - > > Key: HIVE-20079 > URL: https://issues.apache.org/jira/browse/HIVE-20079 > Project: Hive > Issue Type: Improvement > Components: File Formats >Affects Versions: 2.0.0 >Reporter: Aihua Xu >Assignee: Antal Sinkovits >Priority: Major > Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, > HIVE-20079.3.patch > > > Run the following queries and you will see the raw data for the table is 4 > (that is the number of fields) incorrectly. We need to populate correct data > size so data can be split properly. > {noformat} > SET hive.stats.autogather=true; > CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET; > INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1'); > DESC FORMATTED parquet_stats; > {noformat} > {noformat} > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 2 > rawDataSize 4 > totalSize 373 > transient_lastDdlTime 1530660523 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
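To make the suggestion in the comment above concrete, here is a minimal sketch of a per-column estimate of uncompressed data size, loosely in the spirit of a statistics-based approach. It is not the ORC implementation and not the code in any HIVE-20079 patch: the class name `RawDataSizeSketch`, both helper methods, and the sizing rules (fixed width times value count for primitives, summed UTF-8 byte lengths for strings) are assumptions made for illustration only. The input is the two-row `parquet_stats` table from the issue description.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Hive, ORC, or Parquet.
public class RawDataSizeSketch {

    // Fixed-width column: value count times the in-memory width of one value.
    static long fixedWidthColumnSize(long numValues, int bytesPerValue) {
        return numValues * bytesPerValue;
    }

    // String column: sum of the UTF-8 byte lengths of all values.
    // (Assumption: a real implementation would read this from column
    // statistics rather than from the values themselves.)
    static long stringColumnSize(List<String> values) {
        long total = 0;
        for (String v : values) {
            total += v.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    public static void main(String[] args) {
        // Rows from the issue description: (0, 'this is string 0'), (1, 'string 1')
        long idColumn = fixedWidthColumnSize(2, Integer.BYTES);
        long strColumn = stringColumnSize(Arrays.asList("this is string 0", "string 1"));
        System.out.println(idColumn + strColumn); // prints 32
    }
}
```

Under these assumed rules the example table comes to 32 bytes (8 for the two ints, 24 for the two strings), compared with the value of 4 currently reported as rawDataSize.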
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771292#comment-16771292 ] Hive QA commented on HIVE-20079: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12931668/HIVE-20079.2.patch {color:red}ERROR:{color} -1 due to build exiting with an error Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/16131/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/16131/console Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-16131/ Messages: {noformat} This message was trimmed, see log for full details /data/hiveptest/working/scratch/build.patch:2050: trailing whitespace. id int /data/hiveptest/working/scratch/build.patch:2051: trailing whitespace. str string error: patch failed: ql/src/test/results/clientpositive/llap/vector_partitioned_date_time.q.out:4323 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/llap/vector_partitioned_date_time.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/nested_column_pruning.q.out:135 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/nested_column_pruning.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/parquet_analyze.q.out:93 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/parquet_analyze.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/parquet_complex_types_vectorization.q.out:102 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/parquet_complex_types_vectorization.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/parquet_join.q.out:76 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/parquet_join.q.out' with conflicts. 
error: patch failed: ql/src/test/results/clientpositive/parquet_map_type_vectorization.q.out:114 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/parquet_map_type_vectorization.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/parquet_no_row_serde.q.out:139 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/parquet_no_row_serde.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/parquet_vectorization_13.q.out:80 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/parquet_vectorization_13.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/parquet_vectorization_14.q.out:80 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/parquet_vectorization_14.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/parquet_vectorization_15.q.out:76 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/parquet_vectorization_15.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/parquet_vectorization_16.q.out:53 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/parquet_vectorization_16.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/parquet_vectorization_17.q.out:61 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/parquet_vectorization_17.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/parquet_vectorization_2.q.out:59 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/parquet_vectorization_2.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/parquet_vectorization_3.q.out:64 Falling back to three-way merge... 
Applied patch to 'ql/src/test/results/clientpositive/parquet_vectorization_3.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/parquet_vectorization_4.q.out:59 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/parquet_vectorization_4.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/parquet_vectorization_5.q.out:53 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/parquet_vectorization_5.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/parquet_vectorization_6.q.out:55 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/parquet_vectorization_6.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/parquet_vectorization_7.q.out:67 Falling
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722137#comment-16722137 ] Antal Sinkovits commented on HIVE-20079: Hi, I'm planning to work on this next week. Let me know if there are any concerns regarding it. Thanks. > Populate more accurate rawDataSize for parquet format > - > > Key: HIVE-20079 > URL: https://issues.apache.org/jira/browse/HIVE-20079 > Project: Hive > Issue Type: Improvement > Components: File Formats >Affects Versions: 2.0.0 >Reporter: Aihua Xu >Priority: Major > Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch > > > Run the following queries and you will see the raw data for the table is 4 > (that is the number of fields) incorrectly. We need to populate correct data > size so data can be split properly. > {noformat} > SET hive.stats.autogather=true; > CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET; > INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1'); > DESC FORMATTED parquet_stats; > {noformat} > {noformat} > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 2 > rawDataSize 4 > totalSize 373 > transient_lastDdlTime 1530660523 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675625#comment-16675625 ] Hive QA commented on HIVE-20079: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12931668/HIVE-20079.2.patch {color:red}ERROR:{color} -1 due to build exiting with an error Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/14750/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/14750/console Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-14750/ Messages: {noformat} This message was trimmed, see log for full details Applied patch to 'ql/src/test/results/clientpositive/parquet_vectorization_limit.q.out' cleanly. error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_0.q.out:34 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_0.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_1.q.out:60 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_1.q.out' cleanly. error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_10.q.out:64 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_10.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_11.q.out:46 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_11.q.out' cleanly. error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_12.q.out:83 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_12.q.out' with conflicts. 
error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_13.q.out:85 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_13.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_14.q.out:85 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_14.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_15.q.out:81 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_15.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_16.q.out:58 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_16.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_17.q.out:66 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_17.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_2.q.out:64 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_2.q.out' cleanly. error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_3.q.out:69 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_3.q.out' cleanly. error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_4.q.out:64 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_4.q.out' cleanly. 
error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_5.q.out:58 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_5.q.out' cleanly. error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_6.q.out:58 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_6.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_7.q.out:72 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_7.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_8.q.out:68 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/spark/parquet_vectorization_8.q.out' with conflicts. error: patch failed: ql/src/test/results/clientpositive/spark/parquet_vectorization_9.q.out:58 Falling back to three-way
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16615173#comment-16615173 ] Sahil Takiar commented on HIVE-20079: - [~aihuaxu] not sure if you are still planning to work on this? If not, mind if I assign it to myself? > Populate more accurate rawDataSize for parquet format > - > > Key: HIVE-20079 > URL: https://issues.apache.org/jira/browse/HIVE-20079 > Project: Hive > Issue Type: Improvement > Components: File Formats >Affects Versions: 2.0.0 >Reporter: Aihua Xu >Assignee: Sahil Takiar >Priority: Major > Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch > > > Run the following queries and you will see the raw data for the table is 4 > (that is the number of fields) incorrectly. We need to populate correct data > size so data can be split properly. > {noformat} > SET hive.stats.autogather=true; > CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET; > INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1'); > DESC FORMATTED parquet_stats; > {noformat} > {noformat} > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 2 > rawDataSize 4 > totalSize 373 > transient_lastDdlTime 1530660523 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544613#comment-16544613 ] Aihua Xu commented on HIVE-20079: - [~stakiar] Thanks for taking a look. I noticed that as well and will check it out. > Populate more accurate rawDataSize for parquet format > - > > Key: HIVE-20079 > URL: https://issues.apache.org/jira/browse/HIVE-20079 > Project: Hive > Issue Type: Improvement > Components: File Formats >Affects Versions: 2.0.0 >Reporter: Aihua Xu >Assignee: Aihua Xu >Priority: Major > Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch > > > Run the following queries and you will see the raw data for the table is 4 > (that is the number of fields) incorrectly. We need to populate correct data > size so data can be split properly. > {noformat} > SET hive.stats.autogather=true; > CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET; > INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1'); > DESC FORMATTED parquet_stats; > {noformat} > {noformat} > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 2 > rawDataSize 4 > totalSize 373 > transient_lastDdlTime 1530660523 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544385#comment-16544385 ] Hive QA commented on HIVE-20079: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12931668/HIVE-20079.2.patch {color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified. {color:red}ERROR:{color} -1 due to 14 failed/errored test(s), 14645 tests executed *Failed tests:* {noformat} TestMiniDruidCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=191) [druidmini_dynamic_partition.q,druidmini_expressions.q,druidmini_test_alter.q,druidmini_test1.q,druidmini_test_insert.q] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_struct_type_vectorization] (batchId=27) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_non_dictionary_encoding_vectorization] (batchId=89) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_vectorization] (batchId=14) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_0] (batchId=17) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_10] (batchId=23) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_11] (batchId=39) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_12] (batchId=24) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_1] (batchId=11) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_parquet_projection] (batchId=45) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorized_parquet_types] (batchId=70) org.apache.hadoop.hive.cli.TestMiniDruidCliDriver.testCliDriver[druid_timestamptz] (batchId=192) org.apache.hadoop.hive.cli.TestMiniDruidCliDriver.testCliDriver[druidmini_joins] (batchId=192) org.apache.hadoop.hive.cli.TestMiniDruidCliDriver.testCliDriver[druidmini_masking] (batchId=192) {noformat} 
Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/12619/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/12619/console Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-12619/ Messages: {noformat} Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.YetusPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 14 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12931668 - PreCommit-HIVE-Build > Populate more accurate rawDataSize for parquet format > - > > Key: HIVE-20079 > URL: https://issues.apache.org/jira/browse/HIVE-20079 > Project: Hive > Issue Type: Improvement > Components: File Formats >Affects Versions: 2.0.0 >Reporter: Aihua Xu >Assignee: Aihua Xu >Priority: Major > Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch > > > Run the following queries and you will see the raw data for the table is 4 > (that is the number of fields) incorrectly. We need to populate correct data > size so data can be split properly. > {noformat} > SET hive.stats.autogather=true; > CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET; > INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1'); > DESC FORMATTED parquet_stats; > {noformat} > {noformat} > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 2 > rawDataSize 4 > totalSize 373 > transient_lastDdlTime 1530660523 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544382#comment-16544382 ]

Hive QA commented on HIVE-20079:

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 47s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 5s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 36s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 3m 49s{color} | {color:blue} ql in master has 2291 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 54s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 35s{color} | {color:red} ql: The patch generated 3 new + 13 unchanged - 6 fixed = 16 total (was 19) {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 12s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 22m 53s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-12619/dev-support/hive-personality.sh |
| git revision | master / 1b5903b |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-12619/yetus/diff-checkstyle-ql.txt |
| whitespace | http://104.198.109.242/logs//PreCommit-HIVE-Build-12619/yetus/whitespace-eol.txt |
| modules | C: ql U: ql |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-12619/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |

This message was automatically generated.

> Populate more accurate rawDataSize for parquet format
> -----------------------------------------------------
>
>                 Key: HIVE-20079
>                 URL: https://issues.apache.org/jira/browse/HIVE-20079
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 2.0.0
>            Reporter: Aihua Xu
>            Assignee: Aihua Xu
>            Priority: Major
>         Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch
>
>
> Run the following queries and you will see the raw data for the table is 4
> (that is the number of fields) incorrectly. We need to populate correct data
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles                1
>   numRows                 2
>   rawDataSize             4
>   totalSize               373
>   transient_lastDdlTime   1530660523
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544333#comment-16544333 ]

Sahil Takiar commented on HIVE-20079:
-------------------------------------

Some of the updates to the q.out files change {{rawDataSize}} to a value of 0, which doesn't look right, e.g. parquet_vectorization_0.q.out.

So does {{BlockMetaData#getTotalByteSize}} return the size of the block when loaded into memory?

It looks like we close the file first and then re-open it to get this metadata. Is there any way to collect it while the file is still open?

> Populate more accurate rawDataSize for parquet format
> -----------------------------------------------------
>
>                 Key: HIVE-20079
>                 URL: https://issues.apache.org/jira/browse/HIVE-20079
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 2.0.0
>            Reporter: Aihua Xu
>            Assignee: Aihua Xu
>            Priority: Major
>         Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch
>
>
> Run the following queries and you will see the raw data for the table is 4
> (that is the number of fields) incorrectly. We need to populate correct data
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles                1
>   numRows                 2
>   rawDataSize             4
>   totalSize               373
>   transient_lastDdlTime   1530660523
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
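To make the two estimates discussed in this thread concrete, here is a minimal sketch. It is not Hive's or Parquet's code: `BlockMeta`, `raw_size_by_field_count` and `raw_size_from_blocks` are hypothetical names, `total_byte_size` only stands in for whatever {{BlockMetaData#getTotalByteSize}} returns, and the byte size used is an invented placeholder. The field-count formula (rows times columns) is a reading of the issue description ("4, that is the number of fields") and is an assumption as well.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BlockMeta:
    """Hypothetical stand-in for one Parquet block's footer metadata."""
    row_count: int
    total_byte_size: int  # plays the role of BlockMetaData#getTotalByteSize


def raw_size_by_field_count(num_rows: int, num_columns: int) -> int:
    """Behaviour reported in the issue: one unit counted per field."""
    return num_rows * num_columns


def raw_size_from_blocks(blocks: Iterable[BlockMeta]) -> int:
    """Approximation discussed in the thread: sum the per-block byte sizes."""
    return sum(block.total_byte_size for block in blocks)


# parquet_stats from the issue description: 2 rows, 2 columns (id, str).
field_count_estimate = raw_size_by_field_count(num_rows=2, num_columns=2)

# Placeholder value only; it is not measured from a real Parquet file.
blocks = [BlockMeta(row_count=2, total_byte_size=100)]
block_estimate = raw_size_from_blocks(blocks)

print(field_count_estimate)  # 4, the rawDataSize shown by DESC FORMATTED
print(block_estimate)
```

The sketch does not answer whether the per-block size reflects the in-memory size of the data; it only shows that the second estimate scales with the stored bytes rather than with the number of fields.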
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544305#comment-16544305 ]

Aihua Xu commented on HIVE-20079:
---------------------------------

patch-2: fix affected unit tests.

> Populate more accurate rawDataSize for parquet format
> -----------------------------------------------------
>
>                 Key: HIVE-20079
>                 URL: https://issues.apache.org/jira/browse/HIVE-20079
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 2.0.0
>            Reporter: Aihua Xu
>            Assignee: Aihua Xu
>            Priority: Major
>         Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch
>
>
> Run the following queries and you will see the raw data for the table is 4
> (that is the number of fields) incorrectly. We need to populate correct data
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles                1
>   numRows                 2
>   rawDataSize             4
>   totalSize               373
>   transient_lastDdlTime   1530660523
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542117#comment-16542117 ]

Sahil Takiar commented on HIVE-20079:
-------------------------------------

Looks similar to HIVE-16887

> Populate more accurate rawDataSize for parquet format
> -----------------------------------------------------
>
>                 Key: HIVE-20079
>                 URL: https://issues.apache.org/jira/browse/HIVE-20079
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 2.0.0
>            Reporter: Aihua Xu
>            Assignee: Aihua Xu
>            Priority: Major
>         Attachments: HIVE-20079.1.patch
>
>
> Run the following queries and you will see the raw data for the table is 4
> (that is the number of fields) incorrectly. We need to populate correct data
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles                1
>   numRows                 2
>   rawDataSize             4
>   totalSize               373
>   transient_lastDdlTime   1530660523
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533250#comment-16533250 ]

Hive QA commented on HIVE-20079:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12930200/HIVE-20079.1.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 66 failed/errored test(s), 14638 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[nested_column_pruning] (batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_analyze] (batchId=23)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_complex_types_vectorization] (batchId=75)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_join] (batchId=20)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_map_type_vectorization] (batchId=87)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_no_row_serde] (batchId=73)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_struct_type_vectorization] (batchId=27)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_non_dictionary_encoding_vectorization] (batchId=89)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_vectorization] (batchId=14)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_0] (batchId=17)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_10] (batchId=23)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_11] (batchId=39)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_12] (batchId=24)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_13] (batchId=54)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_14] (batchId=40)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_15] (batchId=90)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_16] (batchId=85)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_17] (batchId=30)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_1] (batchId=11)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_2] (batchId=3)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_3] (batchId=80)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_4] (batchId=45)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_5] (batchId=73)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_6] (batchId=43)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_7] (batchId=88)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_8] (batchId=14)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_9] (batchId=31)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_decimal_date] (batchId=31)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_div0] (batchId=80)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_limit] (batchId=25)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_offset_limit] (batchId=34)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_part_project] (batchId=37)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_pushdown] (batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_numeric_overflows] (batchId=72)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_parquet_projection] (batchId=45)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorized_parquet_types] (batchId=69)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_partitioned_date_time] (batchId=175)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning] (batchId=184)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_join] (batchId=116)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_0] (batchId=114)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_10] (batchId=117)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_11] (batchId=124)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_12] (batchId=118)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_13] (batchId=131)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_14] (batchId=124)
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533242#comment-16533242 ]

Hive QA commented on HIVE-20079:

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 16s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 11s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 40s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 4m 29s{color} | {color:blue} ql in master has 2287 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 2s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 10s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 42s{color} | {color:red} ql: The patch generated 3 new + 8 unchanged - 3 fixed = 11 total (was 11) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 0s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 15s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 25m 32s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-12388/dev-support/hive-personality.sh |
| git revision | master / 5e2a530 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-12388/yetus/diff-checkstyle-ql.txt |
| modules | C: ql U: ql |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-12388/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |

This message was automatically generated.

> Populate more accurate rawDataSize for parquet format
> -----------------------------------------------------
>
>                 Key: HIVE-20079
>                 URL: https://issues.apache.org/jira/browse/HIVE-20079
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 2.0.0
>            Reporter: Aihua Xu
>            Assignee: Aihua Xu
>            Priority: Major
>         Attachments: HIVE-20079.1.patch
>
>
> Run the following queries and you will see the raw data for the table is 4
> (that is the number of fields) incorrectly. We need to populate correct data
> size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles                1
>   numRows                 2
>   rawDataSize             4
>   totalSize               373
>   transient_lastDdlTime   1530660523
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)