Re: Review Request: HIVE-2144 reduce workload generated by JDBCStatsPublisher
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/765/#review696 ---

trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
https://reviews.apache.org/r/765/#comment1395
Here we need to catch SQLRecoverableException and retry.

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1396
If these parameters are present in conf/hive-default.xml, you don't need to set them again here, since new JobConf() should read from hive-default.xml.

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1397
The usual use case for aggregateStats() is that the key is a prefix (e.g., file_000) of the string inserted by publishStats, so that all keys matching the prefix are aggregated. Can you add one more test for aggregateStats('file_000')?

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1398
Won't this also change the stats at the 2nd publishStat()?

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1399
Also add another aggStats for the prefix.

- Ning

On 2011-05-19 23:14:26, Tomasz Nykiel wrote:

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/765/ ---

(Updated 2011-05-19 23:14:26)

Review request for hive.

Summary
---

Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics: a first query to check whether the ID was already inserted by another task, and a second query to insert a new row or update the existing one. The latter occurs very rarely, since duplicates most likely originate from speculative or failed tasks.

Currently the schema of the stats table is:

PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT )

and it does not declare any integrity constraints. We amend it to:

PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY, ROW_COUNT BIGINT )

HIVE-2144 improves performance by greedily performing the insertion: instead of executing two queries per inserted row, we can execute one INSERT query. In the case of a primary key constraint violation, we perform a single UPDATE query. The UPDATE query needs to check whether the currently inserted stats are newer than the ones already in the table.

This addresses bug HIVE-2144.
https://issues.apache.org/jira/browse/HIVE-2144

Diffs
-

trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125140
trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION

Diff: https://reviews.apache.org/r/765/diff

Testing
---

TestStatsPublisher JUnit test:
- basic behaviour
- multiple updates
- cleanup of the statistics table after aggregation

Standalone testing on the cluster:
- insert/analyze queries over non-partitioned/partitioned tables

NOTE: for the correct behaviour, the primary-key index needs to be created, or the PARTITION_STAT_TABLE table dropped, which triggers creation of the table with the constraint declared.

Thanks,
Tomasz
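The INSERT-first protocol in the summary above can be sketched with runnable SQL. This is a minimal stand-in, not the patch itself: Python's sqlite3 replaces Hive's Java/JDBC publisher, the helper name publish_stat is invented, and the tie-breaking rule (a larger ROW_COUNT supersedes the stored one) is an assumed reading of "newer".

```python
# Sketch of the INSERT-first strategy: one INSERT in the common case, one
# UPDATE only on a primary-key collision. sqlite3 stands in for the JDBC
# connection; the table matches the amended schema from the review request.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PARTITION_STAT_TABLE "
             "(ID VARCHAR(255) PRIMARY KEY, ROW_COUNT BIGINT)")

def publish_stat(conn, stat_id, row_count):
    """Hypothetical helper mirroring the publish path described above."""
    try:
        conn.execute("INSERT INTO PARTITION_STAT_TABLE (ID, ROW_COUNT) "
                     "VALUES (?, ?)", (stat_id, row_count))
    except sqlite3.IntegrityError:
        # Duplicate ID, e.g. from a speculative or failed task: overwrite
        # only when the incoming stats supersede the stored ones (assumed
        # policy: larger ROW_COUNT wins).
        conn.execute("UPDATE PARTITION_STAT_TABLE SET ROW_COUNT = ? "
                     "WHERE ID = ? AND ROW_COUNT < ?",
                     (row_count, stat_id, row_count))

publish_stat(conn, "file_00001", 200)   # plain INSERT
publish_stat(conn, "file_00001", 100)   # collision, smaller: left unchanged
row = conn.execute("SELECT ROW_COUNT FROM PARTITION_STAT_TABLE "
                   "WHERE ID = 'file_00001'").fetchone()
print(row[0])
```

In the common no-duplicate case this issues exactly one statement per row, which is the halving of the workload the patch is after.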
[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher
[ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036707#comment-13036707 ]

jirapos...@reviews.apache.org commented on HIVE-2144:

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/765/#review696 ---

trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
https://reviews.apache.org/r/765/#comment1395
Here we need to catch SQLRecoverableException and retry.

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1396
If these parameters are present in conf/hive-default.xml, you don't need to set them again here, since new JobConf() should read from hive-default.xml.

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1397
The usual use case for aggregateStats() is that the key is a prefix (e.g., file_000) of the string inserted by publishStats, so that all keys matching the prefix are aggregated. Can you add one more test for aggregateStats('file_000')?

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1398
Won't this also change the stats at the 2nd publishStat()?

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1399
Also add another aggStats for the prefix.

- Ning
reduce workload generated by JDBCStatsPublisher
---

Key: HIVE-2144
URL: https://issues.apache.org/jira/browse/HIVE-2144
Project: Hive
Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Tomasz Nykiel
Attachments: HIVE-2144.patch

In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (most likely a speculative or previously failed task). Depending on whether the ID is there, an INSERT or UPDATE query is issued. So there are basically two queries per row inserted into the intermediate stats table. This workload could be cut in half if we insert anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the IDs (e.g., using group-by and max()).
[jira] [Commented] (HIVE-2036) Update bitmap indexes for automatic usage
[ https://issues.apache.org/jira/browse/HIVE-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036709#comment-13036709 ]

Marquis Wang commented on HIVE-2036:

Russell is right. hive.index.compact.file is deprecated and replaced with hive.index.blockfilter.file (I think). I kept the former around for backwards-compatibility reasons, but we should try to avoid using it.

Update bitmap indexes for automatic usage
-

Key: HIVE-2036
URL: https://issues.apache.org/jira/browse/HIVE-2036
Project: Hive
Issue Type: Improvement
Components: Indexing
Affects Versions: 0.8.0
Reporter: Russell Melick
Assignee: Syed S. Albiz

HIVE-1644 will provide automatic usage of indexes, and HIVE-1803 adds bitmap index support. The bitmap code will need to be extended after it is committed to enable automatic use of indexing. Most work will be focused in the BitmapIndexHandler, which needs to generate the re-entrant QL index query. There may also be significant work in the IndexPredicateAnalyzer to support predicates with ORs, instead of just ANDs as it handles currently.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
Review Request: HIVE-2117: insert overwrite ignoring partition location
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/772/ ---

Review request for hive and Carl Steinbach.

Summary
---

This change resolves a regression introduced by HIVE-1707, specifically that the partition location (set via alter table ... partition ... location) is not being respected. I addressed this by using the user-specified location (as done originally), except in the case of cross-filesystem moves (which was the concern in HIVE-1707).

This addresses bug HIVE-2117.
https://issues.apache.org/jira/browse/HIVE-2117

Diffs
-

ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 916b235
ql/src/test/org/apache/hadoop/hive/ql/BaseTestQueries.java PRE-CREATION
ql/src/test/org/apache/hadoop/hive/ql/QTestUtil.java 4685471
ql/src/test/org/apache/hadoop/hive/ql/TestLocationQueries.java PRE-CREATION
ql/src/test/org/apache/hadoop/hive/ql/TestMTQueries.java 8c7c0b8
ql/src/test/queries/clientpositive/alter5.q PRE-CREATION
ql/src/test/results/clientpositive/alter5.q.out PRE-CREATION

Diff: https://reviews.apache.org/r/772/diff

Testing
---

I added a new test which verifies the partition location explicitly, as the existing tests ignore this detail. This test failed without my fix applied; it passes with the fix applied.

Thanks,
Patrick
[jira] [Commented] (HIVE-2117) insert overwrite ignoring partition location
[ https://issues.apache.org/jira/browse/HIVE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036908#comment-13036908 ]

jirapos...@reviews.apache.org commented on HIVE-2117:

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/772/ ---

insert overwrite ignoring partition location
-

Key: HIVE-2117
URL: https://issues.apache.org/jira/browse/HIVE-2117
Project: Hive
Issue Type: Bug
Affects Versions: 0.7.0, 0.8.0
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Blocker
Attachments: HIVE-2117_br07.patch, HIVE-2117_br07.patch, HIVE-2117_trunk.patch, data.txt

The following code works differently in 0.5.0 vs 0.7.0. In 0.5.0 the partition location is respected. However, in 0.7.0, while the initial partition is created with the specified location path/parta, the insert overwrite ... results in the partition being written to path/dt=a (note that path is the same in both cases).

{code}
create table foo_stg (bar INT, car INT);
load data local inpath 'data.txt' into table foo_stg;
create table foo4 (bar INT, car INT) partitioned by (dt STRING) LOCATION '/user/hive/warehouse/foo4';
alter table foo4 add partition (dt='a') location '/user/hive/warehouse/foo4/parta';
from foo_stg fs insert overwrite table foo4 partition (dt='a') select *;
{code}

From what I can tell, HIVE-1707 introduced this via a change to org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Path, String, Map<String, String>, boolean, boolean), specifically:

{code}
+ Path partPath = new Path(tbl.getDataLocation().getPath(),
+     Warehouse.makePartPath(partSpec));
+
+ Path newPartPath = new Path(loadPath.toUri().getScheme(), loadPath
+     .toUri().getAuthority(), partPath.toUri().getPath());
{code}

Reading the description on HIVE-1707, it seems that this may have been done purposefully; however, given that the partition location is explicitly specified for the partition in question, it seems like it should be honored (especially given that the table location has not changed). This difference in behavior is causing a regression in existing production Hive-based code. I'd like to take a stab at addressing this, any suggestions?

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
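The effect of the quoted HIVE-1707 hunk can be reproduced outside Hadoop: the new partition URI is rebuilt from the load path's scheme/authority plus the default table-derived path, so the custom partition location never enters the computation. The sketch below mirrors that arithmetic with Python's urllib.parse rather than Hadoop's Path class; the namenode authority nn:8020 and the scratch directory are invented placeholder values.

```python
# Illustration of the path reconstruction in the quoted diff (a stand-in,
# not Hadoop's Path API). Authority "nn:8020" and the scratch dir are
# made-up values for the demo.
from urllib.parse import urlparse, urlunparse

table_loc = "hdfs://nn:8020/user/hive/warehouse/foo4"
custom_part_loc = table_loc + "/parta"   # set via ALTER TABLE ... LOCATION

# Warehouse.makePartPath(partSpec) yields the *default* layout "dt=a",
# derived from the table location alone:
part_path = urlparse(table_loc).path + "/dt=a"

# new Path(loadPath.scheme, loadPath.authority, partPath.path):
load = urlparse("hdfs://nn:8020/tmp/hive-scratch/123")
new_part_path = urlunparse((load.scheme, load.netloc, part_path, "", "", ""))

# custom_part_loc was never consulted, so the overwrite lands under
# .../foo4/dt=a rather than the registered .../foo4/parta.
print(new_part_path)
```

This is exactly the symptom in the bug report: the initial partition honors path/parta, but the insert overwrite writes to path/dt=a.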
[jira] [Commented] (HIVE-2117) insert overwrite ignoring partition location
[ https://issues.apache.org/jira/browse/HIVE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036912#comment-13036912 ]

Patrick Hunt commented on HIVE-2117:

I posted reviews up on reviewboard:
trunk: https://reviews.apache.org/r/773/
branch-0.7: https://reviews.apache.org/r/772/
[jira] [Commented] (HIVE-2117) insert overwrite ignoring partition location
[ https://issues.apache.org/jira/browse/HIVE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036914#comment-13036914 ]

jirapos...@reviews.apache.org commented on HIVE-2117:

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/773/ ---

Review request for hive and Carl Steinbach.

Summary
---

This change resolves a regression introduced by HIVE-1707, specifically that the partition location (set via alter table ... partition ... location) is not being respected. I addressed this by using the user-specified location (as done originally), except in the case of cross-filesystem moves (which was the concern in HIVE-1707).

This addresses bug HIVE-2117.
https://issues.apache.org/jira/browse/HIVE-2117

Diffs
-

ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java bcacd35
ql/src/test/org/apache/hadoop/hive/ql/BaseTestQueries.java PRE-CREATION
ql/src/test/org/apache/hadoop/hive/ql/QTestUtil.java 06a0447
ql/src/test/org/apache/hadoop/hive/ql/TestLocationQueries.java PRE-CREATION
ql/src/test/org/apache/hadoop/hive/ql/TestMTQueries.java 8c7c0b8
ql/src/test/queries/clientpositive/alter5.q PRE-CREATION
ql/src/test/results/clientpositive/alter5.q.out PRE-CREATION

Diff: https://reviews.apache.org/r/773/diff

Testing
---

I added a new test which verifies the partition location explicitly, as the existing tests ignore this detail. This test failed without my fix applied; it passes with the fix applied.

Thanks,
Patrick
[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher
[ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036932#comment-13036932 ]

jirapos...@reviews.apache.org commented on HIVE-2144:

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/765/#review702 ---

trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
https://reviews.apache.org/r/765/#comment1402
Yes. That's correct.

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1403
ok.

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1404
I will amend the test cases to aggregate over prefixes. I will also add one simple test case to aggregate over an exact match.

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1405
The original value inserted in line 120 is 200. Neither 100 nor 150 should change the values.

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1406
As discussed before, I will improve the test cases to aggregate over prefixes.

- Tomasz

reduce workload generated by JDBCStatsPublisher
---

Key: HIVE-2144
URL: https://issues.apache.org/jira/browse/HIVE-2144
Project: Hive
Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Tomasz Nykiel
Attachments: HIVE-2144.patch

In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (most likely a speculative or previously failed task). Depending on whether the ID is there, an INSERT or UPDATE query is issued. So there are basically two queries per row inserted into the intermediate stats table. This workload could be cut in half if we insert anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the IDs (e.g., using group-by and max()). The benefit is that even though the aggregation query is more expensive, it is only run once per query.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
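The aggregation-side dedup suggested here (group-by plus max() per ID) can be exercised directly. As a sketch only: sqlite3 stands in for the JDBC-backed stats table, the prefix file_ and the row values are invented, and the exact query in the patch may differ.

```python
# Dedup-at-aggregation sketch: duplicate IDs are tolerated at insert time
# and collapsed once per query with MAX() per ID, then summed over the
# task-ID prefix.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PARTITION_STAT_TABLE "
             "(ID VARCHAR(255), ROW_COUNT BIGINT)")
conn.executemany(
    "INSERT INTO PARTITION_STAT_TABLE VALUES (?, ?)",
    [("file_00001", 200),   # original task
     ("file_00001", 200),   # duplicate from a speculative task
     ("file_00002", 50)])

# MAX() per ID removes the duplicate; the outer SUM aggregates all IDs
# sharing the prefix, so the duplicate row is not double-counted.
(total,) = conn.execute(
    "SELECT SUM(cnt) FROM "
    "  (SELECT ID, MAX(ROW_COUNT) AS cnt FROM PARTITION_STAT_TABLE "
    "   WHERE ID LIKE 'file_%' GROUP BY ID)").fetchone()
print(total)
```

The trade-off matches the description: inserts get cheaper (no pre-check SELECT) while the single aggregation query absorbs the dedup work.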
Re: Review Request: HIVE-2144 reduce workload generated by JDBCStatsPublisher
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/765/#review702 ---

trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
https://reviews.apache.org/r/765/#comment1402
Yes. That's correct.

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1403
ok.

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1404
I will amend the test cases to aggregate over prefixes. I will also add one simple test case to aggregate over an exact match.

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1405
The original value inserted in line 120 is 200. Neither 100 nor 150 should change the values.

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1406
As discussed before, I will improve the test cases to aggregate over prefixes.

- Tomasz
[jira] [Updated] (HIVE-2096) throw an error if the input is larger than a threshold for index input format
[ https://issues.apache.org/jira/browse/HIVE-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-2096:
---
Resolution: Fixed
Status: Resolved (was: Patch Available)

Committed! Thanks Wojciech! (Still trying to figure out how to assign this task to Wojciech :) )

throw an error if the input is larger than a threshold for index input format
-

Key: HIVE-2096
URL: https://issues.apache.org/jira/browse/HIVE-2096
Project: Hive
Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Namit Jain
Attachments: HIVE-2096.1.patch.txt, HIVE-2096.2.patch.txt, HIVE-2096.3.patch.txt, HIVE-2096.4.patch.txt

This can hang forever.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HIVE-2100) virtual column references inside subqueries cause execution exceptions
[ https://issues.apache.org/jira/browse/HIVE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-2100:
---
Status: Patch Available (was: Open)

virtual column references inside subqueries cause execution exceptions
--

Key: HIVE-2100
URL: https://issues.apache.org/jira/browse/HIVE-2100
Project: Hive
Issue Type: Bug
Reporter: Joydeep Sen Sarma
Attachments: HIVE-2100.txt

example:

create table jssarma_nilzma_bad as select a.fname, a.offset, a.val from (select hash(eventid,userid,eventtime,browsercookie,userstate,useragent,userip,serverip,clienttime,geoid,countrycode,actionid,lastimpressionid,lastnavimpressionid,impressiontype,fullurl,fullreferrer,pagesection,modulesection,adsection) as val, INPUT__FILE__NAME as fname, BLOCK__OFFSET__INSIDE__FILE as offset from nectar_impression_lzma_unverified where ds='2010-07-28') a join jssarma_hc_diff b on (a.val=b.val);

causes:

Caused by: java.lang.RuntimeException: Map operator initialization failed
at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:121)
...
18 more Caused by: java.lang.RuntimeException: cannot find field input__file__name from [org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@664310d0, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@3d04fc23, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@12457d21, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@101a0ae6, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@1dc18a4c, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@d5e92d7, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@3bfa681c, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@34c92507, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@19e09a4, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@2e8aeed0, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@2344b18f, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@72e5355f, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@26132ae7, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@3465b738, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@1dfd868, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@ef894ce, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@61f1680f, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@2fe6e305, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@5f4275d4, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@445e228, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@802b249] at 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:321)
at org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector.getStructFieldRef(UnionStructObjectInspector.java:96)
at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:57)
at org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:878)
at org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:904)
at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
at org.apache.hadoop.hive.ql.exec.FilterOperator.initializeOp(FilterOperator.java:73)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
[jira] [Commented] (HIVE-2100) virtual column references inside subqueries cause execution exceptions
[ https://issues.apache.org/jira/browse/HIVE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036946#comment-13036946 ] He Yongqiang commented on HIVE-2100: running tests, and also put this into PA queue virtual column references inside subqueries cause execution exceptions -- Key: HIVE-2100 URL: https://issues.apache.org/jira/browse/HIVE-2100 Project: Hive Issue Type: Bug Reporter: Joydeep Sen Sarma Attachments: HIVE-2100.txt example: create table jssarma_nilzma_bad as select a.fname, a.offset, a.val from (select hash(eventid,userid,eventtime,browsercookie,userstate,useragent,userip,serverip,clienttime,geoid,countrycode,actionid,lastimpressionid,lastnavimpressionid,impressiontype,fullurl,fullreferrer,pagesection,modulesection,adsection) as val, INPUT__FILE__NAME as fname, BLOCK__OFFSET__INSIDE__FILE as offset from nectar_impression_lzma_unverified where ds='2010-07-28') a join jssarma_hc_diff b on (a.val=b.val); causes Caused by: java.lang.RuntimeException: Map operator initialization failed at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:121) ... 
18 more Caused by: java.lang.RuntimeException: cannot find field input__file__name from [org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@664310d0, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@3d04fc23, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@12457d21, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@101a0ae6, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@1dc18a4c, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@d5e92d7, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@3bfa681c, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@34c92507, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@19e09a4, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@2e8aeed0, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@2344b18f, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@72e5355f, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@26132ae7, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@3465b738, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@1dfd868, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@ef894ce, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@61f1680f, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@2fe6e305, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@5f4275d4, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@445e228, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@802b249] at 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:321) at org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector.getStructFieldRef(UnionStructObjectInspector.java:96) at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:57) at org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:878) at org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:904) at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389) at org.apache.hadoop.hive.ql.exec.FilterOperator.initializeOp(FilterOperator.java:73) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389) at org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357) at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444) at
[jira] [Updated] (HIVE-2117) insert overwrite ignoring partition location
[ https://issues.apache.org/jira/browse/HIVE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Hunt updated HIVE-2117: --- Status: Patch Available (was: Open) insert overwrite ignoring partition location Key: HIVE-2117 URL: https://issues.apache.org/jira/browse/HIVE-2117 Project: Hive Issue Type: Bug Affects Versions: 0.7.0, 0.8.0 Reporter: Patrick Hunt Assignee: Patrick Hunt Priority: Blocker Attachments: HIVE-2117_br07.patch, HIVE-2117_br07.patch, HIVE-2117_trunk.patch, data.txt The following code works differently in 0.5.0 vs 0.7.0. In 0.5.0 the partition location is respected. However in 0.7.0, while the initial partition is created with the specified location path/parta, the insert overwrite ... results in the partition written to path/dt=a (note that path is the same in both cases). {code} create table foo_stg (bar INT, car INT); load data local inpath 'data.txt' into table foo_stg; create table foo4 (bar INT, car INT) partitioned by (dt STRING) LOCATION '/user/hive/warehouse/foo4'; alter table foo4 add partition (dt='a') location '/user/hive/warehouse/foo4/parta'; from foo_stg fs insert overwrite table foo4 partition (dt='a') select *; {code} From what I can tell HIVE-1707 introduced this via a change to org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Path, String, Map<String, String>, boolean, boolean), specifically: {code} + Path partPath = new Path(tbl.getDataLocation().getPath(), + Warehouse.makePartPath(partSpec)); + + Path newPartPath = new Path(loadPath.toUri().getScheme(), loadPath + .toUri().getAuthority(), partPath.toUri().getPath()); {code} Reading the description on HIVE-1707 it seems that this may have been done purposefully, however given that the partition location is explicitly specified for the partition in question, it seems like that should be honored (esp. given the table location has not changed). This difference in behavior is causing a regression in existing production Hive-based code. 
I'd like to take a stab at addressing this, any suggestions? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
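The effect of the quoted HIVE-1707 change can be modeled outside Hadoop. Below is a minimal Python sketch (function and variable names are hypothetical; the real logic is the Java loadPartition() quoted above) showing that rebuilding the partition path from the table location plus the canonical key=value suffix silently discards the partition's declared LOCATION:

```python
import posixpath

def rebuilt_partition_path(table_location, part_spec):
    """Model of the loadPartition() logic quoted above: the partition path
    is always table_location + /k1=v1/..., regardless of any LOCATION
    declared when the partition was added."""
    suffix = "/".join("%s=%s" % (k, v) for k, v in part_spec.items())
    return posixpath.join(table_location, suffix)

table = "/user/hive/warehouse/foo4"
declared = table + "/parta"  # location from ALTER TABLE ... ADD PARTITION
rebuilt = rebuilt_partition_path(table, {"dt": "a"})
# The insert overwrite writes to the rebuilt path, not the declared one:
print(declared, "vs", rebuilt)
```

This is exactly the divergence in the bug report: the declared path ends in /parta while the rebuilt path ends in /dt=a.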
[jira] [Commented] (HIVE-2117) insert overwrite ignoring partition location
[ https://issues.apache.org/jira/browse/HIVE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037096#comment-13037096 ] Carl Steinbach commented on HIVE-2117: -- +1. Will commit if tests pass. insert overwrite ignoring partition location Key: HIVE-2117 URL: https://issues.apache.org/jira/browse/HIVE-2117 Project: Hive Issue Type: Bug Affects Versions: 0.7.0, 0.8.0 Reporter: Patrick Hunt Assignee: Patrick Hunt Priority: Blocker Attachments: HIVE-2117_br07.patch, HIVE-2117_br07.patch, HIVE-2117_trunk.patch, data.txt The following code works differently in 0.5.0 vs 0.7.0. In 0.5.0 the partition location is respected. However in 0.7.0, while the initial partition is created with the specified location path/parta, the insert overwrite ... results in the partition written to path/dt=a (note that path is the same in both cases). {code} create table foo_stg (bar INT, car INT); load data local inpath 'data.txt' into table foo_stg; create table foo4 (bar INT, car INT) partitioned by (dt STRING) LOCATION '/user/hive/warehouse/foo4'; alter table foo4 add partition (dt='a') location '/user/hive/warehouse/foo4/parta'; from foo_stg fs insert overwrite table foo4 partition (dt='a') select *; {code} From what I can tell HIVE-1707 introduced this via a change to org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Path, String, Map<String, String>, boolean, boolean), specifically: {code} + Path partPath = new Path(tbl.getDataLocation().getPath(), + Warehouse.makePartPath(partSpec)); + + Path newPartPath = new Path(loadPath.toUri().getScheme(), loadPath + .toUri().getAuthority(), partPath.toUri().getPath()); {code} Reading the description on HIVE-1707 it seems that this may have been done purposefully, however given that the partition location is explicitly specified for the partition in question, it seems like that should be honored (esp. given the table location has not changed). 
This difference in behavior is causing a regression in existing production Hive-based code. I'd like to take a stab at addressing this, any suggestions? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-2175) NPE if zookeeper is down
[ https://issues.apache.org/jira/browse/HIVE-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037166#comment-13037166 ] Namit Jain commented on HIVE-2175: -- We should throw a more meaningful exception. NPE if zookeeper is down Key: HIVE-2175 URL: https://issues.apache.org/jira/browse/HIVE-2175 Project: Hive Issue Type: Bug Reporter: Namit Jain Assignee: He Yongqiang ERROR ZooKeeperHiveLockManager (ZooKeeperHiveLockManager.java:lock(337)) - Failed to get ZooKeeper lock: java.lang.NullPointerException -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HIVE-2175) NPE if zookeeper is down
NPE if zookeeper is down Key: HIVE-2175 URL: https://issues.apache.org/jira/browse/HIVE-2175 Project: Hive Issue Type: Bug Reporter: Namit Jain Assignee: He Yongqiang ERROR ZooKeeperHiveLockManager (ZooKeeperHiveLockManager.java:lock(337)) - Failed to get ZooKeeper lock: java.lang.NullPointerException -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
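The fix Namit suggests amounts to failing fast with an actionable message instead of letting a dead or absent ZooKeeper session surface as a NullPointerException. A hedged sketch of that shape (Python standing in for the Java ZooKeeperHiveLockManager; the class and parameter names here are hypothetical):

```python
class LockException(Exception):
    """Raised when a lock cannot be acquired, naming the likely cause."""

def acquire_lock(zk_client, lock_path):
    # Fail fast with a descriptive error rather than dereferencing a
    # missing ZooKeeper handle and crashing with an NPE-style error later.
    if zk_client is None:
        raise LockException(
            "cannot acquire lock %s: no ZooKeeper connection; check that "
            "the ensemble in hive.zookeeper.quorum is reachable" % lock_path)
    return zk_client.create(lock_path)
```

The key point is that the error now names the lock path and the configuration property to check, instead of a bare NullPointerException at lock(337).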
[jira] [Created] (HIVE-2176) Schema creation scripts are incomplete since they leave out tables that are specific to DataNucleus
Schema creation scripts are incomplete since they leave out tables that are specific to DataNucleus --- Key: HIVE-2176 URL: https://issues.apache.org/jira/browse/HIVE-2176 Project: Hive Issue Type: Bug Components: Configuration, Metastore Affects Versions: 0.7.0 Reporter: Esteban Gutierrez When using the DDL SQL scripts to create the Metastore, tables like SEQUENCE_TABLE are missing, which forces the user to change the configuration so that DataNucleus does all the provisioning of the Metastore tables. Adding the missing table definitions to the DDL scripts will make it possible to have a functional Hive Metastore without granting additional privileges to the Metastore user and/or enabling the datanucleus.autoCreateSchema property in hive-site.xml. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HIVE-2176) Schema creation scripts are incomplete since they leave out tables that are specific to DataNucleus
[ https://issues.apache.org/jira/browse/HIVE-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Esteban Gutierrez updated HIVE-2176: Description: When using the DDL SQL scripts to create the Metastore, tables like SEQUENCE_TABLE are missing, which forces the user to change the configuration so that DataNucleus does all the provisioning of the Metastore tables. Adding the missing table definitions to the DDL scripts will make it possible to have a functional Hive Metastore without granting additional privileges to the Metastore user and/or enabling the datanucleus.autoCreateSchema property in hive-site.xml. [After running hive-schema-0.7.0.mysql.sql and revoking ALTER and CREATE privileges from the 'metastoreuser'] hive> show tables; FAILED: Error in metadata: javax.jdo.JDOException: Exception thrown calling table.exists() for `SEQUENCE_TABLE` NestedThrowables: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: CREATE command denied to user 'metastoreuser'@'localhost' for table 'SEQUENCE_TABLE' FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask was: When using the DDL SQL scripts to create the Metastore, tables like SEQUENCE_TABLE are missing, which forces the user to change the configuration so that DataNucleus does all the provisioning of the Metastore tables. 
Adding the missing table definitions to the DDL scripts will make it possible to have a functional Hive Metastore without granting additional privileges to the Metastore user and/or enabling the datanucleus.autoCreateSchema property in hive-site.xml. Schema creation scripts are incomplete since they leave out tables that are specific to DataNucleus --- Key: HIVE-2176 URL: https://issues.apache.org/jira/browse/HIVE-2176 Project: Hive Issue Type: Bug Components: Configuration, Metastore Affects Versions: 0.7.0 Reporter: Esteban Gutierrez Labels: derby, mysql, postgres When using the DDL SQL scripts to create the Metastore, tables like SEQUENCE_TABLE are missing, which forces the user to change the configuration so that DataNucleus does all the provisioning of the Metastore tables. Adding the missing table definitions to the DDL scripts will make it possible to have a functional Hive Metastore without granting additional privileges to the Metastore user and/or enabling the datanucleus.autoCreateSchema property in hive-site.xml. [After running hive-schema-0.7.0.mysql.sql and revoking ALTER and CREATE privileges from the 'metastoreuser'] hive> show tables; FAILED: Error in metadata: javax.jdo.JDOException: Exception thrown calling table.exists() for `SEQUENCE_TABLE` NestedThrowables: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: CREATE command denied to user 'metastoreuser'@'localhost' for table 'SEQUENCE_TABLE' FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
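As a sketch of what the DDL scripts would need to grow, here is a SEQUENCE_TABLE layout in the shape DataNucleus's table-based value generator expects, exercised against an in-memory SQLite database. The exact column names and types below are an assumption based on DataNucleus defaults and should be verified against the DataNucleus version Hive bundles before adding them to the scripts:

```python
import sqlite3

# Assumed DataNucleus-style layout (SEQUENCE_NAME keyed per model class);
# verify against the bundled DataNucleus version before relying on it.
SEQUENCE_TABLE_DDL = """
CREATE TABLE SEQUENCE_TABLE (
  SEQUENCE_NAME VARCHAR(255) NOT NULL,
  NEXT_VAL BIGINT NOT NULL,
  PRIMARY KEY (SEQUENCE_NAME)
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SEQUENCE_TABLE_DDL)
# DataNucleus keys each sequence by the persistent class that uses it:
conn.execute("INSERT INTO SEQUENCE_TABLE VALUES "
             "('org.apache.hadoop.hive.metastore.model.MDatabase', 1)")
```

With this table pre-created by the script, the metastore user no longer needs CREATE privileges at runtime, which is exactly the failure shown in the stack trace above.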
[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher
[ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037221#comment-13037221 ] jirapos...@reviews.apache.org commented on HIVE-2144: - --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/765/ --- (Updated 2011-05-21 01:49:07.819494) Review request for hive. Changes --- -Amended the test cases to accommodate prefix aggregation -Fixed unnecessary conf settings -Fixed exception handling in JDBCStatsPublisher.publishStats - SQLRecoverableException is handled when executing the update statement. Summary --- Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics: a first query to check if the ID was inserted by another task, and a second query to insert a new or update the existing row. The latter occurs very rarely, since duplicates most likely originate from speculative or failed tasks. Currently the schema of the stat table is the following: PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared. We amend it to: PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ). HIVE-2144 improves on performance by greedily performing the insertion statement. Then, instead of executing two queries per row inserted, we can execute one INSERT query. In the case of a primary key constraint violation, we perform a single UPDATE query. The UPDATE query needs to check the condition that the currently inserted stats are newer than the ones already in the table. This addresses bug HIVE-2144. 
https://issues.apache.org/jira/browse/HIVE-2144 Diffs (updated) - trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125468 trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION Diff: https://reviews.apache.org/r/765/diff Testing --- TestStatsPublisher JUnit test: - basic behaviour - multiple updates - cleanup of the statistics table after aggregation Standalone testing on the cluster. - insert/analyze queries over non-partitioned/partitioned tables NOTE. For the correct behaviour, the primary_key index needs to be created, or the PARTITION_STAT_TABLE table dropped - which triggers creation of the table with the constraint declared. Thanks, Tomasz reduce workload generated by JDBCStatsPublisher --- Key: HIVE-2144 URL: https://issues.apache.org/jira/browse/HIVE-2144 Project: Hive Issue Type: Improvement Reporter: Ning Zhang Assignee: Tomasz Nykiel Attachments: HIVE-2144.patch In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (most likely a speculative or previously failed task). Depending on whether the ID is there, an INSERT or UPDATE query is issued. So there are basically 2 queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
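The publish path described in the review summary can be sketched against an in-memory SQLite database (the real publisher is Java/JDBC and catches the driver's key-violation exception; the names below are illustrative, and the "newer stats win" condition is modeled here as keeping the larger count): insert greedily, and only on a primary-key violation fall back to a single conditional UPDATE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PARTITION_STAT_TABLE "
             "(ID VARCHAR(255) PRIMARY KEY, ROW_COUNT BIGINT)")

def publish_stats(conn, stat_id, row_count):
    try:
        # Common case: the ID is new, so one INSERT is all we pay.
        conn.execute("INSERT INTO PARTITION_STAT_TABLE VALUES (?, ?)",
                     (stat_id, row_count))
    except sqlite3.IntegrityError:
        # Rare case: a speculative/failed task already published this ID.
        # Update only if the new stats supersede the stored ones.
        conn.execute("UPDATE PARTITION_STAT_TABLE SET ROW_COUNT = ? "
                     "WHERE ID = ? AND ROW_COUNT < ?",
                     (row_count, stat_id, row_count))

publish_stats(conn, "file_00000_task_1", 100)
publish_stats(conn, "file_00000_task_1", 150)  # duplicate -> UPDATE path
```

This is the halving the review describes: one statement per row in the common case, two only on the rare duplicate.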
[jira] [Updated] (HIVE-2144) reduce workload generated by JDBCStatsPublisher
[ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomasz Nykiel updated HIVE-2144: Attachment: HIVE-2144.1.patch Fixed after revision 1. reduce workload generated by JDBCStatsPublisher --- Key: HIVE-2144 URL: https://issues.apache.org/jira/browse/HIVE-2144 Project: Hive Issue Type: Improvement Reporter: Ning Zhang Assignee: Tomasz Nykiel Attachments: HIVE-2144.1.patch, HIVE-2144.patch In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (most likely a speculative or previously failed task). Depending on whether the ID is there, an INSERT or UPDATE query is issued. So there are basically 2 queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
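The alternative Ning describes in the issue body, insert blindly into an unconstrained table and dedup once at aggregation time, can be sketched the same way (SQLite in place of the real JDBC backend; the prefix semantics follow the aggregateStats usage discussed in the review, where a key like file_000 matches all task-suffixed IDs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No primary key: duplicate IDs from speculative tasks land as extra rows.
conn.execute("CREATE TABLE PARTITION_STAT_TABLE "
             "(ID VARCHAR(255), ROW_COUNT BIGINT)")
conn.executemany("INSERT INTO PARTITION_STAT_TABLE VALUES (?, ?)",
                 [("file_00000_a", 100),
                  ("file_00000_a", 150),   # duplicate from a retried task
                  ("file_00001_b", 200)])

def aggregate_stats(conn, prefix):
    # Dedup per ID with MAX, then sum over every ID matching the prefix.
    # This query is more expensive but runs only once per Hive query.
    (total,) = conn.execute(
        "SELECT SUM(c) FROM (SELECT MAX(ROW_COUNT) AS c "
        "FROM PARTITION_STAT_TABLE WHERE ID LIKE ? GROUP BY ID)",
        (prefix + "%",)).fetchone()
    return total

print(aggregate_stats(conn, "file_000"))  # 350: deduped 150 + 200
```

The committed HIVE-2144 patch went the primary-key route instead, but the group-by/max dedup shows why blind inserts stay correct either way.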