[jira] [Commented] (HIVE-11266) count(*) wrong result based on table statistics for external tables
[ https://issues.apache.org/jira/browse/HIVE-11266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246083#comment-17246083 ]

Tristan Stevens commented on HIVE-11266:

[~findepi] this is true; however, with managed (i.e. non-external) tables, modifying the underlying data without performing a REFRESH is not supported. With external tables, however, it is expected behaviour. This is essentially the definition of MANAGED vs. EXTERNAL.

> count(*) wrong result based on table statistics for external tables
> ---
>
> Key: HIVE-11266
> URL: https://issues.apache.org/jira/browse/HIVE-11266
> Project: Hive
> Issue Type: Bug
> Affects Versions: 1.1.0
> Reporter: Simone Battaglia
> Assignee: Jesus Camacho Rodriguez
> Priority: Blocker
> Fix For: 3.0.0
>
> Attachments: HIVE-11266.01.patch, HIVE-11266.patch
>
>
> Hive returns a wrong count result on an external table with table statistics if
> I change the table's data files.
> This is the scenario in detail:
> 1) create external table my_table (...) location 'my_location';
> 2) analyze table my_table compute statistics;
> 3) change/add/delete one or more files in the 'my_location' directory;
> 4) select count(*) from my_table;
> In this case the count query doesn't generate an MR job and returns the result
> based on table statistics. This result is wrong because it is based on
> statistics stored in the Hive metastore and doesn't take into account
> modifications made to the data files.
> Setting "hive.compute.query.using.stats" to FALSE avoids the problem,
> but the default value of this property is TRUE.
> I think that this post on Stack Overflow, which shows another type of bug
> in the case of multiple inserts, is also related to the one that I reported:
> http://stackoverflow.com/questions/24080276/wrong-result-for-count-in-hive-table

--
This message was sent by Atlassian Jira (v8.3.4#803005)
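The four steps quoted in the description, together with the workaround the reporter mentions, can be sketched in HiveQL roughly as follows. This is only an illustration: the column list is hypothetical (the report elides it as "(...)"), 'my_location' is the report's own placeholder, and whether the first count is actually served from statistics depends on the Hive version in use.

```sql
-- Step 1. Hypothetical schema: the original report elides the column list as (...).
CREATE EXTERNAL TABLE my_table (id INT, payload STRING)
LOCATION 'my_location';

-- Step 2. Store table-level statistics (including the row count) in the metastore.
ANALYZE TABLE my_table COMPUTE STATISTICS;

-- Step 3 happens outside Hive: add, change, or delete files under 'my_location'.

-- Step 4. On an affected version this may be answered from the stored
-- statistics without launching a job, so it can return the stale row count.
SELECT COUNT(*) FROM my_table;

-- Workaround from the report: disable stats-based answers for this session,
-- then count again so that the data files are actually scanned.
SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM my_table;
```

Comparing the two counts after step 3 is a simple way to tell whether the first one came from stale statistics.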
[jira] [Commented] (HIVE-11266) count(*) wrong result based on table statistics for external tables
[ https://issues.apache.org/jira/browse/HIVE-11266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17245953#comment-17245953 ]

Piotr Findeisen commented on HIVE-11266:

{quote}This is not just external tables - any tables where users are directly modifying the underlying data can be impacted by this.{quote}
{quote}Yes, I agree with you, external table is just my personal use case.{quote}
[~tmgstev] [~simobatt] was there a follow-up issue to this? From the attached patch (same as https://github.com/apache/hive/commit/a2dff9e13acc62ecc0388b3b2e221f26c9184dbb), I see this was fixed for external tables only.
[jira] [Commented] (HIVE-11266) count(*) wrong result based on table statistics for external tables
[ https://issues.apache.org/jira/browse/HIVE-11266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197551#comment-16197551 ]

Hive QA commented on HIVE-11266:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12891109/HIVE-11266.01.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 11191 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[optimize_nullscan] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_explainuser_1] (batchId=171)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] (batchId=101)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[stats_noscan_2] (batchId=117)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/7197/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/7197/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-7197/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12891109 - PreCommit-HIVE-Build

> count(*) wrong result based on table statistics for external tables
> ---
>
> Key: HIVE-11266
> URL: https://issues.apache.org/jira/browse/HIVE-11266
> Project: Hive
> Issue Type: Bug
> Affects Versions: 1.1.0
> Reporter: Simone Battaglia
> Assignee: Jesus Camacho Rodriguez
> Priority: Blocker
> Attachments: HIVE-11266.01.patch, HIVE-11266.patch
[jira] [Commented] (HIVE-11266) count(*) wrong result based on table statistics for external tables
[ https://issues.apache.org/jira/browse/HIVE-11266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197401#comment-16197401 ]

Jesus Camacho Rodriguez commented on HIVE-11266:

Adding test to the patch.
[jira] [Commented] (HIVE-11266) count(*) wrong result based on table statistics for external tables
[ https://issues.apache.org/jira/browse/HIVE-11266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197143#comment-16197143 ]

Ashutosh Chauhan commented on HIVE-11266:

+1

> count(*) wrong result based on table statistics for external tables
> ---
>
> Key: HIVE-11266
> URL: https://issues.apache.org/jira/browse/HIVE-11266
> Project: Hive
> Issue Type: Bug
> Affects Versions: 1.1.0
> Reporter: Simone Battaglia
> Assignee: Jesus Camacho Rodriguez
> Priority: Blocker
> Attachments: HIVE-11266.patch
[jira] [Commented] (HIVE-11266) count(*) wrong result based on table statistics for external tables
[ https://issues.apache.org/jira/browse/HIVE-11266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195776#comment-16195776 ]

Hive QA commented on HIVE-11266:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12890844/HIVE-11266.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 11190 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_predicate_pushdown] (batchId=231)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_single_sourced_multi_insert] (batchId=231)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[optimize_nullscan] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_explainuser_1] (batchId=171)
org.apache.hadoop.hive.cli.TestTezPerfCliDriver.testCliDriver[query23] (batchId=239)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/7181/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/7181/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-7181/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12890844 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-11266) count(*) wrong result based on table statistics
[ https://issues.apache.org/jira/browse/HIVE-11266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102369#comment-16102369 ]

Sergey Shelukhin commented on HIVE-11266:

[~pxiong] the issue is still there for external tables. The semantics for external tables in Hive are not well defined, but many people assume (and I agree) that it's ok to manually manage these using file operations, which invalidates the stats without Hive knowing about it. I don't think this setting should be used for external tables. Thoughts?
cc [~ashutoshc] [~hagleitn]

> count(*) wrong result based on table statistics
> ---
>
> Key: HIVE-11266
> URL: https://issues.apache.org/jira/browse/HIVE-11266
> Project: Hive
> Issue Type: Bug
> Affects Versions: 1.1.0
> Reporter: Simone Battaglia
> Assignee: Pengcheng Xiong
> Priority: Critical
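As a rough illustration of the point about file operations invalidating the stats: one option, assuming whoever changes the files can also run Hive statements afterwards, is to recompute the statistics after each manual change. The thread does not prescribe this; it is only a sketch, and my_table is the placeholder name from the issue description.

```sql
-- After files under the table location were changed outside Hive,
-- recompute the table statistics so the stored row count matches the data.
ANALYZE TABLE my_table COMPUTE STATISTICS;

-- Inspect the stored statistics; numRows is listed among the table parameters.
DESCRIBE FORMATTED my_table;
```

This only helps when every manual change is followed by a recompute, which is exactly the guarantee Hive cannot enforce for external tables.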
[jira] [Commented] (HIVE-11266) count(*) wrong result based on table statistics
[ https://issues.apache.org/jira/browse/HIVE-11266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900854#comment-15900854 ]

Tristan Stevens commented on HIVE-11266:

If Hive is still serving results directly from the stats then with external tables it cannot guarantee their accuracy.
[jira] [Commented] (HIVE-11266) count(*) wrong result based on table statistics
[ https://issues.apache.org/jira/browse/HIVE-11266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900851#comment-15900851 ]

Pengcheng Xiong commented on HIVE-11266:

I see. We have changed a lot since then. This should already be fixed in recent Hive versions.
[jira] [Commented] (HIVE-11266) count(*) wrong result based on table statistics
[ https://issues.apache.org/jira/browse/HIVE-11266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900832#comment-15900832 ]

Simone commented on HIVE-11266:

It was Hive 1.1.0 in the CDH distribution.
[jira] [Commented] (HIVE-11266) count(*) wrong result based on table statistics
[ https://issues.apache.org/jira/browse/HIVE-11266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900364#comment-15900364 ]

Pengcheng Xiong commented on HIVE-11266:

Hello there, which version of Hive are you using? I saw you put the label 1.1.0 under affected versions. Does that mean that you are using Hive 1.1? Thanks.
[jira] [Commented] (HIVE-11266) count(*) wrong result based on table statistics
[ https://issues.apache.org/jira/browse/HIVE-11266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14638644#comment-14638644 ]

Tristan Stevens commented on HIVE-11266:

This is not just external tables - any tables where users are directly modifying the underlying data can be impacted by this.

count(*) wrong result based on table statistics
---

Key: HIVE-11266
URL: https://issues.apache.org/jira/browse/HIVE-11266
Project: Hive
Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Simone
Priority: Critical
[jira] [Commented] (HIVE-11266) count(*) wrong result based on table statistics
[ https://issues.apache.org/jira/browse/HIVE-11266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14638659#comment-14638659 ]

Simone commented on HIVE-11266:

Yes, I agree with you, external table is just my personal use case.