[jira] [Commented] (HIVE-15580) Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15837765#comment-15837765 ]

Ferdinand Xu commented on HIVE-15580:
-------------------------------------

Hi [~xuefuz], I am on leave for Chinese New Year. I will run some workloads to evaluate this patch once a cluster is available.

> Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark
> -------------------------------------------------------------------------
>
>                 Key: HIVE-15580
>                 URL: https://issues.apache.org/jira/browse/HIVE-15580
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>             Fix For: 2.2.0
>
>         Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch,
> HIVE-15580.2.patch, HIVE-15580.2.patch, HIVE-15580.3.patch,
> HIVE-15580.4.patch, HIVE-15580.5.patch, HIVE-15580.patch
>
>
> Currently, orderBy (sortBy) and groupBy in Hive on Spark use unbounded
> memory. For orderBy, Hive accumulates key groups using ArrayList (described
> in HIVE-15527). For groupBy, Hive currently uses Spark's groupByKey operator,
> which has the shortcoming of not being able to spill to disk within a key
> group. Thus, for a large key group, memory usage is also unbounded.
> It's likely that this change will impact performance. We will profile and
> optimize afterwards. We could also make this change configurable.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
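To make the memory concern in the issue description concrete, here is a minimal sketch in plain Python. It is not taken from the attached patches and does not use Spark or Hive APIs; the function names are hypothetical. It contrasts buffering every value of a key in a list (the unbounded pattern the description attributes to ArrayList accumulation and groupByKey) with streaming over input that is assumed to be already sorted by key, for example by a sort step that can spill to disk.

```python
from itertools import groupby
from operator import itemgetter


def group_materialized(pairs):
    """Buffer all values of each key in a list.

    Memory grows with the size of the largest key group, which is the
    unbounded pattern described in the issue. (Hypothetical helper.)
    """
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups


def group_streaming(sorted_pairs):
    """Yield (key, lazy value iterator) from input sorted by key.

    Assumption: the caller has already sorted the pairs by key. Values are
    pulled from the underlying iterator on demand, so a key group is never
    buffered as a whole. Each value iterator must be consumed before moving
    on to the next key. (Hypothetical helper.)
    """
    for key, rows in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, (value for _, value in rows)


def sum_per_key(sorted_pairs):
    """Aggregate each key group while streaming over it."""
    return {key: sum(values) for key, values in group_streaming(sorted_pairs)}
```

The streaming variant trades the in-memory buffer for a requirement that the input be sorted, which is one plausible source of the performance impact the description anticipates.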
[jira] [Commented] (HIVE-15580) Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15836414#comment-15836414 ]

Xuefu Zhang commented on HIVE-15580:
------------------------------------

Hi [~Ferd] and [~dapengsun], I'm wondering if you could help measure the performance impact of the patch here. We at Uber don't have a dedicated environment, so getting accurate measurements is challenging. It would be great if you could help. Based on the results, we may have some follow-up work to do. Thanks.
[jira] [Commented] (HIVE-15580) Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832383#comment-15832383 ]

Xuefu Zhang commented on HIVE-15580:
------------------------------------

Thanks, Chao! I will commit this first and create a couple of followups. [~lirui], it would be great if you can also take a look at the patch. I will incorporate your comments (if any) as followups as well. Thanks.
[jira] [Commented] (HIVE-15580) Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832377#comment-15832377 ]

Chao Sun commented on HIVE-15580:
---------------------------------

+1
[jira] [Commented] (HIVE-15580) Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832194#comment-15832194 ]

Xuefu Zhang commented on HIVE-15580:
------------------------------------

RB: https://reviews.apache.org/r/55776/
[jira] [Commented] (HIVE-15580) Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831746#comment-15831746 ]

Hive QA commented on HIVE-15580:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12848310/HIVE-15580.5.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 13 failed/errored test(s), 10965 tests executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=235)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[specialChar] (batchId=22)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_join_with_different_encryption_keys] (batchId=159)
org.apache.hadoop.hive.cli.TestHBaseNegativeCliDriver.testCliDriver[cascade_dbdrop] (batchId=226)
org.apache.hadoop.hive.cli.TestHBaseNegativeCliDriver.testCliDriver[generatehfiles_require_family_path] (batchId=226)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=135)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a] (batchId=136)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[escape1] (batchId=139)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[escape2] (batchId=154)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[offset_limit_ppd_optimizer] (batchId=151)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[schema_evol_text_vec_part] (batchId=149)
org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testCliDriver[cluster_tasklog_retrieval] (batchId=87)
org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testCliDriver[minimr_broken_pipe] (batchId=87)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/3065/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/3065/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-3065/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 13 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12848310 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-15580) Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831579#comment-15831579 ]

Hive QA commented on HIVE-15580:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12848310/HIVE-15580.5.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 13 failed/errored test(s), 10965 tests executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=235)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[specialChar] (batchId=22)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_join_with_different_encryption_keys] (batchId=159)
org.apache.hadoop.hive.cli.TestHBaseNegativeCliDriver.testCliDriver[cascade_dbdrop] (batchId=226)
org.apache.hadoop.hive.cli.TestHBaseNegativeCliDriver.testCliDriver[generatehfiles_require_family_path] (batchId=226)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=135)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a] (batchId=136)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[escape1] (batchId=139)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[escape2] (batchId=154)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[schema_evol_text_vec_part] (batchId=149)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_varchar_simple] (batchId=152)
org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testCliDriver[cluster_tasklog_retrieval] (batchId=87)
org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testCliDriver[minimr_broken_pipe] (batchId=87)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/3063/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/3063/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-3063/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 13 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12848310 - PreCommit-HIVE-Build