[jira] [Commented] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537540#comment-14537540 ] Selina Zhang commented on HIVE-10036: - The above two unit test failures seem irrelevant to this patch. Writing ORC format big table causes OOM - too many fixed sized stream buffers - Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Labels: orcfile Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch, HIVE-10036.7.patch, HIVE-10036.8.patch, HIVE-10036.9.patch ORC writer keeps multiple out steams for each column. Each output stream is allocated fixed size ByteBuffer (configurable, default to 256K). For a big table, the memory cost is unbearable. Specially when HCatalog dynamic partition involves, several hundreds files may be open and writing at the same time (same problems for FileSinkOperator). Global ORC memory manager controls the buffer size, but it only got kicked in at 5000 rows interval. An enhancement could be done here, but the problem is reducing the buffer size introduces worse compression and more IOs in read path. Sacrificing the read performance is always not a good choice. I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound to the existing configurable buffer size. Most of the streams does not need large buffer so the performance got improved significantly. Comparing to Facebook's hive-dwrf, I monitored 2x performance gain with this fix. Solving OOM for ORC completely maybe needs lots of effort , but this is definitely a low hanging fruit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537415#comment-14537415 ] Hive QA commented on HIVE-10036: {color:red}Overall{color}: -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12731823/HIVE-10036.9.patch {color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 8921 tests executed *Failed tests:* {noformat} org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver_encryption_insert_partition_static org.apache.hive.hcatalog.streaming.TestStreaming.testTransactionBatchCommit_Json {noformat} Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3841/testReport Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3841/console Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-3841/ Messages: {noformat} Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 2 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12731823 - PreCommit-HIVE-TRUNK-Build Writing ORC format big table causes OOM - too many fixed sized stream buffers - Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Labels: orcfile Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch, HIVE-10036.7.patch, HIVE-10036.8.patch, HIVE-10036.9.patch ORC writer keeps multiple out steams for each column. Each output stream is allocated fixed size ByteBuffer (configurable, default to 256K). For a big table, the memory cost is unbearable. Specially when HCatalog dynamic partition involves, several hundreds files may be open and writing at the same time (same problems for FileSinkOperator). Global ORC memory manager controls the buffer size, but it only got kicked in at 5000 rows interval. An enhancement could be done here, but the problem is reducing the buffer size introduces worse compression and more IOs in read path. Sacrificing the read performance is always not a good choice. I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound to the existing configurable buffer size. Most of the streams does not need large buffer so the performance got improved significantly. Comparing to Facebook's hive-dwrf, I monitored 2x performance gain with this fix. Solving OOM for ORC completely maybe needs lots of effort , but this is definitely a low hanging fruit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14536335#comment-14536335 ] Hive QA commented on HIVE-10036: {color:red}Overall{color}: -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12731308/HIVE-10036.8.patch {color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 8919 tests executed *Failed tests:* {noformat} org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver_encryption_insert_partition_static org.apache.hadoop.hive.ql.io.orc.TestOrcFile.testMemoryManagementV12[0] org.apache.hadoop.hive.ql.io.orc.TestOrcFile.testMemoryManagementV12[1] {noformat} Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3824/testReport Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3824/console Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-3824/ Messages: {noformat} Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 3 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12731308 - PreCommit-HIVE-TRUNK-Build Writing ORC format big table causes OOM - too many fixed sized stream buffers - Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Labels: orcfile Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch, HIVE-10036.7.patch, HIVE-10036.8.patch ORC writer keeps multiple out steams for each column. Each output stream is allocated fixed size ByteBuffer (configurable, default to 256K). For a big table, the memory cost is unbearable. Specially when HCatalog dynamic partition involves, several hundreds files may be open and writing at the same time (same problems for FileSinkOperator). Global ORC memory manager controls the buffer size, but it only got kicked in at 5000 rows interval. An enhancement could be done here, but the problem is reducing the buffer size introduces worse compression and more IOs in read path. Sacrificing the read performance is always not a good choice. I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound to the existing configurable buffer size. Most of the streams does not need large buffer so the performance got improved significantly. Comparing to Facebook's hive-dwrf, I monitored 2x performance gain with this fix. Solving OOM for ORC completely maybe needs lots of effort , but this is definitely a low hanging fruit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526023#comment-14526023 ] Selina Zhang commented on HIVE-10036: - [~gopalv] Thank you! I included the io.netty to ql/pom.xml and uploaded a new patch. Writing ORC format big table causes OOM - too many fixed sized stream buffers - Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Labels: orcfile Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch, HIVE-10036.7.patch ORC writer keeps multiple out steams for each column. Each output stream is allocated fixed size ByteBuffer (configurable, default to 256K). For a big table, the memory cost is unbearable. Specially when HCatalog dynamic partition involves, several hundreds files may be open and writing at the same time (same problems for FileSinkOperator). Global ORC memory manager controls the buffer size, but it only got kicked in at 5000 rows interval. An enhancement could be done here, but the problem is reducing the buffer size introduces worse compression and more IOs in read path. Sacrificing the read performance is always not a good choice. I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound to the existing configurable buffer size. Most of the streams does not need large buffer so the performance got improved significantly. Comparing to Facebook's hive-dwrf, I monitored 2x performance gain with this fix. Solving OOM for ORC completely maybe needs lots of effort , but this is definitely a low hanging fruit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510096#comment-14510096 ] Owen O'Malley commented on HIVE-10036: -- I understand the problem, but this patch creates more trouble than it fixes. The original design is such that you don't do buffer copies. This patch destroys that design and adds both buffer copies and reallocations. We should set the buffer sizes down for the bit vectors, but this patch is going the wrong way. Writing ORC format big table causes OOM - too many fixed sized stream buffers - Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Labels: orcfile Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch ORC writer keeps multiple out steams for each column. Each output stream is allocated fixed size ByteBuffer (configurable, default to 256K). For a big table, the memory cost is unbearable. Specially when HCatalog dynamic partition involves, several hundreds files may be open and writing at the same time (same problems for FileSinkOperator). Global ORC memory manager controls the buffer size, but it only got kicked in at 5000 rows interval. An enhancement could be done here, but the problem is reducing the buffer size introduces worse compression and more IOs in read path. Sacrificing the read performance is always not a good choice. I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound to the existing configurable buffer size. Most of the streams does not need large buffer so the performance got improved significantly. Comparing to Facebook's hive-dwrf, I monitored 2x performance gain with this fix. Solving OOM for ORC completely maybe needs lots of effort , but this is definitely a low hanging fruit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14495849#comment-14495849 ] Gopal V commented on HIVE-10036: [~selinazh]: I added this to my ORC bench for HadoopSummit and it looks like we're not shipping io.netty inside hive-exec.jar. The io.netty:netty-buffer needs to be included in the maven-shade lines the dependencies for ql/pom.xml. Writing ORC format big table causes OOM - too many fixed sized stream buffers - Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch ORC writer keeps multiple out steams for each column. Each output stream is allocated fixed size ByteBuffer (configurable, default to 256K). For a big table, the memory cost is unbearable. Specially when HCatalog dynamic partition involves, several hundreds files may be open and writing at the same time (same problems for FileSinkOperator). Global ORC memory manager controls the buffer size, but it only got kicked in at 5000 rows interval. An enhancement could be done here, but the problem is reducing the buffer size introduces worse compression and more IOs in read path. Sacrificing the read performance is always not a good choice. I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound to the existing configurable buffer size. Most of the streams does not need large buffer so the performance got improved significantly. Comparing to Facebook's hive-dwrf, I monitored 2x performance gain with this fix. Solving OOM for ORC completely maybe needs lots of effort , but this is definitely a low hanging fruit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493925#comment-14493925 ] Hive QA commented on HIVE-10036: {color:red}Overall{color}: -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12725128/HIVE-10036.6.patch {color:red}ERROR:{color} -1 due to 323 failed/errored test(s), 8688 tests executed *Failed tests:* {noformat} TestMinimrCliDriver-bucketmapjoin6.q-constprog_partitioner.q-infer_bucket_sort_dyn_part.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-external_table_with_space_in_location_path.q-infer_bucket_sort_merge.q-auto_sortmerge_join_16.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-groupby2.q-import_exported_table.q-bucketizedhiveinputformat.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-index_bitmap3.q-stats_counter_partitioned.q-temp_table_external.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-infer_bucket_sort_map_operators.q-join1.q-bucketmapjoin7.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-infer_bucket_sort_num_buckets.q-disable_merge_for_bucketing.q-uber_reduce.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-infer_bucket_sort_reducers_power_two.q-scriptfile1.q-scriptfile1_win.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-leftsemijoin_mr.q-load_hdfs_file_with_space_in_the_name.q-root_dir_external_table.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-list_bucket_dml_10.q-bucket_num_reducers.q-bucket6.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-load_fs2.q-file_with_header_footer.q-ql_rewrite_gbtoidx_cbo_1.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-parallel_orderby.q-reduce_deduplicate.q-ql_rewrite_gbtoidx_cbo_2.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-ql_rewrite_gbtoidx.q-smb_mapjoin_8.q - did not produce a TEST-*.xml file TestMinimrCliDriver-schemeAuthority2.q-bucket4.q-input16_cc.q-and-1-more - did not produce a TEST-*.xml file org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_join org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_vectorization org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_vectorization_partition org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_vectorization_project org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_alter_merge_2_orc org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_alter_merge_orc org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_alter_merge_stats_orc org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_filter org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_groupby org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_limit org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_part org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_select org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_table org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_union org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_delete org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_delete_own_table org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_update org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_update_own_table org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_char_serde org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_date_serde org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_decimal_4 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_decimal_join2 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_all_non_partitioned org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_all_partitioned org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_orig_table org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_tmp_table org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_where_no_match org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_where_non_partitioned org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_where_partitioned org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_whole_partition org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_opt_vectorization org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_optimization2 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_optimization_acid org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_extrapolate_part_stats_full org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_extrapolate_part_stats_partial
[jira] [Commented] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491458#comment-14491458 ] Hive QA commented on HIVE-10036: {color:red}Overall{color}: -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12724721/HIVE-10036.5.patch {color:red}ERROR:{color} -1 due to 316 failed/errored test(s), 8672 tests executed *Failed tests:* {noformat} TestMinimrCliDriver-bucketmapjoin6.q-constprog_partitioner.q-infer_bucket_sort_dyn_part.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-external_table_with_space_in_location_path.q-infer_bucket_sort_merge.q-auto_sortmerge_join_16.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-groupby2.q-import_exported_table.q-bucketizedhiveinputformat.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-index_bitmap3.q-stats_counter_partitioned.q-temp_table_external.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-infer_bucket_sort_map_operators.q-join1.q-bucketmapjoin7.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-infer_bucket_sort_num_buckets.q-disable_merge_for_bucketing.q-uber_reduce.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-infer_bucket_sort_reducers_power_two.q-scriptfile1.q-scriptfile1_win.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-leftsemijoin_mr.q-load_hdfs_file_with_space_in_the_name.q-root_dir_external_table.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-list_bucket_dml_10.q-bucket_num_reducers.q-bucket6.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-load_fs2.q-file_with_header_footer.q-ql_rewrite_gbtoidx_cbo_1.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-parallel_orderby.q-reduce_deduplicate.q-ql_rewrite_gbtoidx_cbo_2.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-ql_rewrite_gbtoidx.q-smb_mapjoin_8.q - did not produce a TEST-*.xml file TestMinimrCliDriver-schemeAuthority2.q-bucket4.q-input16_cc.q-and-1-more - did not produce a TEST-*.xml file org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_join org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_vectorization org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_vectorization_partition org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_vectorization_project org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_alter_merge_2_orc org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_alter_merge_orc org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_alter_merge_stats_orc org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_filter org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_groupby org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_limit org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_part org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_select org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_table org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_union org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_delete org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_delete_own_table org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_update org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_update_own_table org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_char_serde org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_date_serde org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_decimal_4 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_decimal_join2 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_all_non_partitioned org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_all_partitioned org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_orig_table org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_tmp_table org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_where_no_match org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_where_non_partitioned org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_where_partitioned org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_whole_partition org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_opt_vectorization org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_optimization2 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_optimization_acid org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_extrapolate_part_stats_full org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_extrapolate_part_stats_partial
[jira] [Commented] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392404#comment-14392404 ] Prasanth Jayachandran commented on HIVE-10036: -- [~selinazh] Thanks for the patch! Left some comments in RB. Writing ORC format big table causes OOM - too many fixed sized stream buffers - Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch ORC writer keeps multiple out steams for each column. Each output stream is allocated fixed size ByteBuffer (configurable, default to 256K). For a big table, the memory cost is unbearable. Specially when HCatalog dynamic partition involves, several hundreds files may be open and writing at the same time (same problems for FileSinkOperator). Global ORC memory manager controls the buffer size, but it only got kicked in at 5000 rows interval. An enhancement could be done here, but the problem is reducing the buffer size introduces worse compression and more IOs in read path. Sacrificing the read performance is always not a good choice. I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound to the existing configurable buffer size. Most of the streams does not need large buffer so the performance got improved significantly. Comparing to Facebook's hive-dwrf, I monitored 2x performance gain with this fix. Solving OOM for ORC completely maybe needs lots of effort , but this is definitely a low hanging fruit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378863#comment-14378863 ] Selina Zhang commented on HIVE-10036: - [~gopalv] I checked my maven repository, it seems avro 1.7.5 pulled in the netty-3.4.0.Final, which override the one defined in main pom.xml. Writing ORC format big table causes OOM - too many fixed sized stream buffers - Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch ORC writer keeps multiple out steams for each column. Each output stream is allocated fixed size ByteBuffer (configurable, default to 256K). For a big table, the memory cost is unbearable. Specially when HCatalog dynamic partition involves, several hundreds files may be open and writing at the same time (same problems for FileSinkOperator). Global ORC memory manager controls the buffer size, but it only got kicked in at 5000 rows interval. An enhancement could be done here, but the problem is reducing the buffer size introduces worse compression and more IOs in read path. Sacrificing the read performance is always not a good choice. I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound to the existing configurable buffer size. Most of the streams does not need large buffer so the performance got improved significantly. Comparing to Facebook's hive-dwrf, I monitored 2x performance gain with this fix. Solving OOM for ORC completely maybe needs lots of effort , but this is definitely a low hanging fruit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378176#comment-14378176 ] Selina Zhang commented on HIVE-10036: - The review request: https://reviews.apache.org/r/32445/ Thank you! Writing ORC format big table causes OOM - too many fixed sized stream buffers - Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch ORC writer keeps multiple out steams for each column. Each output stream is allocated fixed size ByteBuffer (configurable, default to 256K). For a big table, the memory cost is unbearable. Specially when HCatalog dynamic partition involves, several hundreds files may be open and writing at the same time (same problems for FileSinkOperator). Global ORC memory manager controls the buffer size, but it only got kicked in at 5000 rows interval. An enhancement could be done here, but the problem is reducing the buffer size introduces worse compression and more IOs in read path. Sacrificing the read performance is always not a good choice. I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound to the existing configurable buffer size. Most of the streams does not need large buffer so the performance got improved significantly. Comparing to Facebook's hive-dwrf, I monitored 2x performance gain with this fix. Solving OOM for ORC completely maybe needs lots of effort , but this is definitely a low hanging fruit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378260#comment-14378260 ] Gopal V commented on HIVE-10036: [~selinazh]: quick q - the netty version hive-1.2 is 4.0. Do you know where the netty-3 ChannelBuffer in the patch is from? The recent release renamed that to ByteBuf. Writing ORC format big table causes OOM - too many fixed sized stream buffers - Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch ORC writer keeps multiple out steams for each column. Each output stream is allocated fixed size ByteBuffer (configurable, default to 256K). For a big table, the memory cost is unbearable. Specially when HCatalog dynamic partition involves, several hundreds files may be open and writing at the same time (same problems for FileSinkOperator). Global ORC memory manager controls the buffer size, but it only got kicked in at 5000 rows interval. An enhancement could be done here, but the problem is reducing the buffer size introduces worse compression and more IOs in read path. Sacrificing the read performance is always not a good choice. I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound to the existing configurable buffer size. Most of the streams does not need large buffer so the performance got improved significantly. Comparing to Facebook's hive-dwrf, I monitored 2x performance gain with this fix. Solving OOM for ORC completely maybe needs lots of effort , but this is definitely a low hanging fruit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377054#comment-14377054 ] Hive QA commented on HIVE-10036: {color:green}Overall{color}: +1 all checks pass Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12706769/HIVE-10036.3.patch {color:green}SUCCESS:{color} +1 7824 tests passed Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3123/testReport Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3123/console Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-3123/ Messages: {noformat} Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase {noformat} This message is automatically generated. ATTACHMENT ID: 12706769 - PreCommit-HIVE-TRUNK-Build Writing ORC format big table causes OOM - too many fixed sized stream buffers - Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch ORC writer keeps multiple out steams for each column. Each output stream is allocated fixed size ByteBuffer (configurable, default to 256K). For a big table, the memory cost is unbearable. Specially when HCatalog dynamic partition involves, several hundreds files may be open and writing at the same time (same problems for FileSinkOperator). Global ORC memory manager controls the buffer size, but it only got kicked in at 5000 rows interval. An enhancement could be done here, but the problem is reducing the buffer size introduces worse compression and more IOs in read path. Sacrificing the read performance is always not a good choice. I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound to the existing configurable buffer size. Most of the streams does not need large buffer so the performance got improved significantly. Comparing to Facebook's hive-dwrf, I monitored 2x performance gain with this fix. Solving OOM for ORC completely maybe needs lots of effort , but this is definitely a low hanging fruit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers
[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372158#comment-14372158 ] Hive QA commented on HIVE-10036: {color:red}Overall{color}: -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12706001/HIVE-10036.1.patch {color:red}ERROR:{color} -1 due to 39 failed/errored test(s), 7819 tests executed *Failed tests:* {noformat} org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_join org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_vectorization org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_part org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_delete org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_delete_own_table org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_update org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_update_own_table org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_all_partitioned org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_opt_vectorization org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_optimization_acid org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_extrapolate_part_stats_full org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_extrapolate_part_stats_partial org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_insert_values_partitioned org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_insert_values_tmp_table org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_empty_files org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_merge5 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_merge6 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_vectorization_ppd org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_transform_acid org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_update_after_multiple_inserts org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorization_short_regress org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_delete_all_partitioned org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_dynpart_sort_opt_vectorization org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_insert_values_tmp_table org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_orc_merge5 org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_orc_merge6 org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_orc_vectorization_ppd org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_update_after_multiple_inserts org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_vectorization_short_regress org.apache.hadoop.hive.ql.io.orc.TestFileDump.testBloomFilter2 org.apache.hadoop.hive.ql.io.orc.TestFileDump.testDictionaryThreshold org.apache.hadoop.hive.ql.io.orc.TestFileDump.testDump org.apache.hadoop.hive.ql.io.orc.TestInStream.testCompressed org.apache.hadoop.hive.ql.io.orc.TestInStream.testDisjointBuffers org.apache.hadoop.hive.ql.io.orc.TestOrcFile.testMemoryManagementV12[0] org.apache.hadoop.hive.ql.io.orc.TestOrcFile.testMemoryManagementV12[1] org.apache.hadoop.hive.ql.io.orc.TestOrcRawRecordMerger.testEmpty org.apache.hive.hcatalog.pig.TestHCatLoader.testColumnarStorePushdown[3] org.apache.hive.hcatalog.pig.TestHCatStorer.testEmptyStore[3] {noformat} Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3095/testReport Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3095/console Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-3095/ Messages: {noformat} Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 39 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12706001 - PreCommit-HIVE-TRUNK-Build Writing ORC format big table causes OOM - too many fixed sized stream buffers - Key: HIVE-10036 URL: https://issues.apache.org/jira/browse/HIVE-10036 Project: Hive Issue Type: Improvement Reporter: Selina Zhang Assignee: Selina Zhang Attachments: HIVE-10036.1.patch ORC writer keeps multiple out steams for each column. Each output stream is allocated fixed size ByteBuffer (configurable, default to 256K). For a big table, the memory cost is unbearable. Specially when HCatalog dynamic partition involves, several hundreds files may be open and writing at the same time (same problems for