[jira] [Commented] (HIVE-18608) ORC should allow selectively disabling dictionary-encoding on specified columns
[ https://issues.apache.org/jira/browse/HIVE-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390928#comment-16390928 ]

Hive QA commented on HIVE-18608:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12913421/HIVE-18608.2-branch-2.2.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 59 failed/errored test(s), 9944 tests executed

*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=244)
TestJdbcDriver2 - did not produce a TEST-*.xml file (likely timed out) (batchId=225)
TestMiniLlapLocalCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=167) [acid_globallimit.q,alter_merge_2_orc.q]
TestMiniSparkOnYarnCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=173) [infer_bucket_sort_reducers_power_two.q,list_bucket_dml_10.q,orc_merge9.q,orc_merge6.q,leftsemijoin_mr.q,bucket6.q,bucketmapjoin7.q,uber_reduce.q,empty_dir_in_table.q,vector_outer_join3.q,index_bitmap_auto.q,vector_outer_join2.q,vector_outer_join1.q,orc_merge1.q,orc_merge_diff_fs.q,load_hdfs_file_with_space_in_the_name.q,scriptfile1_win.q,quotedid_smb.q,truncate_column_buckets.q,orc_merge3.q]
TestMiniSparkOnYarnCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=174) [infer_bucket_sort_num_buckets.q,gen_udf_example_add10.q,insert_overwrite_directory2.q,orc_merge5.q,bucketmapjoin6.q,import_exported_table.q,vector_outer_join0.q,orc_merge4.q,temp_table_external.q,orc_merge_incompat1.q,root_dir_external_table.q,constprog_semijoin.q,auto_sortmerge_join_16.q,schemeAuthority.q,index_bitmap3.q,external_table_with_space_in_location_path.q,parallel_orderby.q,infer_bucket_sort_map_operators.q,bucketizedhiveinputformat.q,remote_script.q]
TestMiniSparkOnYarnCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=175) [scriptfile1.q,vector_outer_join5.q,file_with_header_footer.q,bucket4.q,input16_cc.q,bucket5.q,infer_bucket_sort_merge.q,constprog_partitioner.q,orc_merge2.q,reduce_deduplicate.q,schemeAuthority2.q,load_fs2.q,orc_merge8.q,orc_merge_incompat2.q,infer_bucket_sort_bucketed_table.q,vector_outer_join4.q,disable_merge_for_bucketing.q,vector_inner_join.q,orc_merge7.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=118) [bucketmapjoin4.q,bucket_map_join_spark4.q,union21.q,groupby2_noskew.q,timestamp_2.q,date_join1.q,mergejoins.q,smb_mapjoin_11.q,auto_sortmerge_join_3.q,mapjoin_test_outer.q,vectorization_9.q,merge2.q,groupby6_noskew.q,auto_join_without_localtask.q,multi_join_union.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=119) [join_cond_pushdown_unqual4.q,union_remove_7.q,join13.q,join_vc.q,groupby_cube1.q,bucket_map_join_spark2.q,sample3.q,smb_mapjoin_19.q,stats16.q,union23.q,union.q,union31.q,cbo_udf_udaf.q,ptf_decimal.q,bucketmapjoin2.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=120) [parallel_join1.q,union27.q,union12.q,groupby7_map_multi_single_reducer.q,varchar_join1.q,join7.q,join_reorder4.q,skewjoinopt2.q,bucketsortoptimize_insert_2.q,smb_mapjoin_17.q,script_env_var1.q,groupby7_map.q,groupby3.q,bucketsortoptimize_insert_8.q,union20.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=121) [ptf_general_queries.q,auto_join_reordering_values.q,sample2.q,join1.q,decimal_join.q,mapjoin_subquery2.q,join32_lessSize.q,mapjoin1.q,order2.q,skewjoinopt18.q,union_remove_18.q,join25.q,groupby9.q,bucketsortoptimize_insert_6.q,ctas.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=122) [groupby_map_ppr.q,nullgroup4_multi_distinct.q,join_rc.q,union14.q,smb_mapjoin_12.q,vector_cast_constant.q,union_remove_4.q,auto_join11.q,load_dyn_part7.q,udaf_collect_set.q,vectorization_12.q,groupby_sort_skew_1.q,groupby_sort_skew_1_23.q,smb_mapjoin_25.q,skewjoinopt12.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=123) [skewjoinopt15.q,auto_join18.q,list_bucket_dml_2.q,input1_limit.q,load_dyn_part3.q,union_remove_14.q,auto_sortmerge_join_14.q,auto_sortmerge_join_15.q,union10.q,bucket_map_join_tez2.q,groupby5_map_skew.q,join_reorder.q,sample1.q,bucketmapjoin8.q,union34.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=124) [avro_joins.q,skewjoinopt16.q,auto_join14.q,vectorization_14.q,auto_join26.q,stats1.q,cbo_stats.q,auto_sortmerge_join_6.q,union22.q,union_remove_24.q,union_view.q,smb_mapjoin_22.q,stats15.q,ptf_matchpath.q,transform_ppr1.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=125)
[jira] [Commented] (HIVE-18608) ORC should allow selectively disabling dictionary-encoding on specified columns
[ https://issues.apache.org/jira/browse/HIVE-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390030#comment-16390030 ]

Mithun Radhakrishnan commented on HIVE-18608:

Hey, [~owen.omalley]. I've renamed that property, as you've suggested. Also, thanks for ORC-308. :]

> ORC should allow selectively disabling dictionary-encoding on specified columns
> ---
>
> Key: HIVE-18608
> URL: https://issues.apache.org/jira/browse/HIVE-18608
> Project: Hive
> Issue Type: New Feature
> Components: ORC
> Affects Versions: 3.0.0, 2.4.0, 2.2.1
> Reporter: Mithun Radhakrishnan
> Assignee: Mithun Radhakrishnan
> Priority: Major
> Attachments: HIVE-18608.1-branch-2.2.patch, HIVE-18608.2-branch-2.2.patch
>
> Just as ORC allows the choice of columns to enable bloom-filters on, it would
> be nice to have a way to specify which columns {{DICTIONARY_V2}} encoding
> should be disabled on.
> Currently, the choice of dictionary-encoding depends on the results of
> sampling the first row-stride within a stripe. If the user knows that a
> column's cardinality is bound to prevent an effective dictionary, she might
> choose to simply disable it on just that column, and avoid the cost of
> sampling in the first row-stride.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-18608) ORC should allow selectively disabling dictionary-encoding on specified columns
[ https://issues.apache.org/jira/browse/HIVE-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379551#comment-16379551 ]

Owen O'Malley commented on HIVE-18608:

I've just opened a jira and a pull request that should be useful for this and other changes that need to specify column names:
https://issues.apache.org/jira/browse/ORC-308
It allows you to specify subfields by name, such as your example: myInfoArray._elem_.emailBody.
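To make the idea of selecting a subfield by name concrete, here is a minimal sketch of resolving a dotted path against a nested schema. It is not the ORC-308 implementation: the schema is modelled as plain nested dictionaries, `resolve_path` is a hypothetical helper, and the `_elem_` segment is simply copied from the comment above rather than taken from ORC itself.

```python
def resolve_path(schema, dotted_path):
    """Walk a nested-dict schema along a dotted path and return the node found.

    `schema` is a toy stand-in for a real type description: dicts are
    compound types, strings are primitive type names.
    """
    node = schema
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(f"no field {part!r} in path {dotted_path!r}")
        node = node[part]
    return node


# Hypothetical schema: an array column whose element is a struct.
schema = {
    "myInfoArray": {
        "_elem_": {"emailSubject": "string", "emailBody": "string"},
    },
}

resolved = resolve_path(schema, "myInfoArray._elem_.emailBody")
```

A path that does not exist in the schema raises `KeyError`, which is one plausible way to reject a misspelled column name early.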
[jira] [Commented] (HIVE-18608) ORC should allow selectively disabling dictionary-encoding on specified columns
[ https://issues.apache.org/jira/browse/HIVE-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372101#comment-16372101 ]

Owen O'Malley commented on HIVE-18608:

I'd suggest making the property:

orc.column.encoding.direct=col10,col20
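As an illustration of how a comma-separated property of this shape could be consumed, the sketch below parses the value into a set of column names. The helper `parse_direct_columns` and the dictionary standing in for table properties are hypothetical; only the property name and the `col10,col20` value come from the suggestion above.

```python
def parse_direct_columns(properties, key="orc.column.encoding.direct"):
    """Return the set of column names listed under `key`.

    Surrounding whitespace and empty entries are ignored; a missing
    property yields an empty set.
    """
    raw = properties.get(key, "")
    return {name.strip() for name in raw.split(",") if name.strip()}


# Table properties modelled as a plain dict (an assumption for this sketch).
table_properties = {"orc.column.encoding.direct": "col10,col20"}
direct_columns = parse_direct_columns(table_properties)
```

Returning a set makes the later per-column membership check cheap and tolerant of a column being listed twice.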
[jira] [Commented] (HIVE-18608) ORC should allow selectively disabling dictionary-encoding on specified columns
[ https://issues.apache.org/jira/browse/HIVE-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349660#comment-16349660 ]

Mithun Radhakrishnan commented on HIVE-18608:

I've attached an initial implementation, in which dictionary encoding can be disabled via a table property ({{'orc.skip.dictionary.for.columns'}}).

Note: I've only added support for top-level columns. Specifying this on a {{STRUCT}} will disable dictionary encoding for the entire sub-tree (i.e. all members of the STRUCT), recursively.

It might be good to support selection at an arbitrary depth. E.g. {{myInfoArray.__elem__.emailBody}}.
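The top-level-only behaviour described above can be sketched as follows. This is not the code in the attached patch: the schema is a toy nested dictionary, and `leaf_columns` and `columns_without_dictionary` are hypothetical helpers that only show how naming a top-level STRUCT would sweep in every field beneath it.

```python
def leaf_columns(name, node):
    """Yield the dotted path of every leaf field under `node`."""
    if isinstance(node, dict):
        for child, sub in node.items():
            yield from leaf_columns(f"{name}.{child}", sub)
    else:
        yield name


def columns_without_dictionary(schema, skipped_top_level):
    """List every leaf whose top-level ancestor is in `skipped_top_level`."""
    result = []
    for column, node in schema.items():
        if column in skipped_top_level:
            result.extend(leaf_columns(column, node))
    return result


# Hypothetical schema: one primitive column and one struct column.
schema = {
    "id": "bigint",
    "myInfo": {"emailSubject": "string", "emailBody": "string"},
}

skipped = columns_without_dictionary(schema, {"myInfo"})
```

Here naming `myInfo` affects both `myInfo.emailSubject` and `myInfo.emailBody`, which is why selection at an arbitrary depth would give finer control.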