[jira] [Commented] (HIVE-18608) ORC should allow selectively disabling dictionary-encoding on specified columns

2018-03-08 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390928#comment-16390928
 ] 

Hive QA commented on HIVE-18608:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12913421/HIVE-18608.2-branch-2.2.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 59 failed/errored test(s), 9944 tests executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=244)
TestJdbcDriver2 - did not produce a TEST-*.xml file (likely timed out) 
(batchId=225)
TestMiniLlapLocalCliDriver - did not produce a TEST-*.xml file (likely timed 
out) (batchId=167)
[acid_globallimit.q,alter_merge_2_orc.q]
TestMiniSparkOnYarnCliDriver - did not produce a TEST-*.xml file (likely timed 
out) (batchId=173)

[infer_bucket_sort_reducers_power_two.q,list_bucket_dml_10.q,orc_merge9.q,orc_merge6.q,leftsemijoin_mr.q,bucket6.q,bucketmapjoin7.q,uber_reduce.q,empty_dir_in_table.q,vector_outer_join3.q,index_bitmap_auto.q,vector_outer_join2.q,vector_outer_join1.q,orc_merge1.q,orc_merge_diff_fs.q,load_hdfs_file_with_space_in_the_name.q,scriptfile1_win.q,quotedid_smb.q,truncate_column_buckets.q,orc_merge3.q]
TestMiniSparkOnYarnCliDriver - did not produce a TEST-*.xml file (likely timed 
out) (batchId=174)

[infer_bucket_sort_num_buckets.q,gen_udf_example_add10.q,insert_overwrite_directory2.q,orc_merge5.q,bucketmapjoin6.q,import_exported_table.q,vector_outer_join0.q,orc_merge4.q,temp_table_external.q,orc_merge_incompat1.q,root_dir_external_table.q,constprog_semijoin.q,auto_sortmerge_join_16.q,schemeAuthority.q,index_bitmap3.q,external_table_with_space_in_location_path.q,parallel_orderby.q,infer_bucket_sort_map_operators.q,bucketizedhiveinputformat.q,remote_script.q]
TestMiniSparkOnYarnCliDriver - did not produce a TEST-*.xml file (likely timed 
out) (batchId=175)

[scriptfile1.q,vector_outer_join5.q,file_with_header_footer.q,bucket4.q,input16_cc.q,bucket5.q,infer_bucket_sort_merge.q,constprog_partitioner.q,orc_merge2.q,reduce_deduplicate.q,schemeAuthority2.q,load_fs2.q,orc_merge8.q,orc_merge_incompat2.q,infer_bucket_sort_bucketed_table.q,vector_outer_join4.q,disable_merge_for_bucketing.q,vector_inner_join.q,orc_merge7.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=118)

[bucketmapjoin4.q,bucket_map_join_spark4.q,union21.q,groupby2_noskew.q,timestamp_2.q,date_join1.q,mergejoins.q,smb_mapjoin_11.q,auto_sortmerge_join_3.q,mapjoin_test_outer.q,vectorization_9.q,merge2.q,groupby6_noskew.q,auto_join_without_localtask.q,multi_join_union.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=119)

[join_cond_pushdown_unqual4.q,union_remove_7.q,join13.q,join_vc.q,groupby_cube1.q,bucket_map_join_spark2.q,sample3.q,smb_mapjoin_19.q,stats16.q,union23.q,union.q,union31.q,cbo_udf_udaf.q,ptf_decimal.q,bucketmapjoin2.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=120)

[parallel_join1.q,union27.q,union12.q,groupby7_map_multi_single_reducer.q,varchar_join1.q,join7.q,join_reorder4.q,skewjoinopt2.q,bucketsortoptimize_insert_2.q,smb_mapjoin_17.q,script_env_var1.q,groupby7_map.q,groupby3.q,bucketsortoptimize_insert_8.q,union20.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=121)

[ptf_general_queries.q,auto_join_reordering_values.q,sample2.q,join1.q,decimal_join.q,mapjoin_subquery2.q,join32_lessSize.q,mapjoin1.q,order2.q,skewjoinopt18.q,union_remove_18.q,join25.q,groupby9.q,bucketsortoptimize_insert_6.q,ctas.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=122)

[groupby_map_ppr.q,nullgroup4_multi_distinct.q,join_rc.q,union14.q,smb_mapjoin_12.q,vector_cast_constant.q,union_remove_4.q,auto_join11.q,load_dyn_part7.q,udaf_collect_set.q,vectorization_12.q,groupby_sort_skew_1.q,groupby_sort_skew_1_23.q,smb_mapjoin_25.q,skewjoinopt12.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=123)

[skewjoinopt15.q,auto_join18.q,list_bucket_dml_2.q,input1_limit.q,load_dyn_part3.q,union_remove_14.q,auto_sortmerge_join_14.q,auto_sortmerge_join_15.q,union10.q,bucket_map_join_tez2.q,groupby5_map_skew.q,join_reorder.q,sample1.q,bucketmapjoin8.q,union34.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=124)

[avro_joins.q,skewjoinopt16.q,auto_join14.q,vectorization_14.q,auto_join26.q,stats1.q,cbo_stats.q,auto_sortmerge_join_6.q,union22.q,union_remove_24.q,union_view.q,smb_mapjoin_22.q,stats15.q,ptf_matchpath.q,transform_ppr1.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=125)


[jira] [Commented] (HIVE-18608) ORC should allow selectively disabling dictionary-encoding on specified columns

2018-03-07 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390030#comment-16390030
 ] 

Mithun Radhakrishnan commented on HIVE-18608:
-

Hey, [~owen.omalley]. I've renamed that property, as you've suggested.

Also, thanks for ORC-308. :]

> ORC should allow selectively disabling dictionary-encoding on specified 
> columns
> ---
>
> Key: HIVE-18608
> URL: https://issues.apache.org/jira/browse/HIVE-18608
> Project: Hive
> Issue Type: New Feature
> Components: ORC
> Affects Versions: 3.0.0, 2.4.0, 2.2.1
> Reporter: Mithun Radhakrishnan
> Assignee: Mithun Radhakrishnan
> Priority: Major
> Attachments: HIVE-18608.1-branch-2.2.patch, HIVE-18608.2-branch-2.2.patch
>
>
> Just as ORC allows the choice of columns to enable bloom-filters on, it would 
> be nice to have a way to specify which columns {{DICTIONARY_V2}} encoding 
> should be disabled on.
> Currently, the choice of dictionary-encoding depends on the results of 
> sampling the first row-stride within a stripe. If the user knows that a 
> column's cardinality is bound to prevent an effective dictionary, she might 
> choose to simply disable it on just that column, and avoid the cost of 
> sampling in the first row-stride.
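
To make the sampling step in the description concrete, here is a minimal sketch of a cardinality check over one sample of values. The thread does not spell out the exact heuristic, so the ratio test, the class name {{DictionarySamplingSketch}}, and the {{maxDistinctRatio}} parameter are all assumptions for illustration, not ORC's actual writer code.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; not part of ORC or Hive. */
final class DictionarySamplingSketch {
    private DictionarySamplingSketch() {}

    /**
     * Returns true when the sampled values repeat often enough that a
     * dictionary looks worthwhile. The ratio test and its threshold are
     * assumptions made for this sketch.
     */
    static boolean preferDictionary(List<String> sample, double maxDistinctRatio) {
        Set<String> distinct = new HashSet<>();
        int nonNull = 0;
        for (String value : sample) {
            if (value != null) {
                nonNull++;
                distinct.add(value);
            }
        }
        if (nonNull == 0) {
            // Arbitrary choice for this sketch: no evidence, so no dictionary.
            return false;
        }
        return (double) distinct.size() / nonNull <= maxDistinctRatio;
    }
}
```

A column whose sampled values are nearly all distinct fails this kind of check every time, which is the case where a user might prefer to skip the sampling and disable dictionary encoding up front.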



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18608) ORC should allow selectively disabling dictionary-encoding on specified columns

2018-02-27 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379551#comment-16379551
 ] 

Owen O'Malley commented on HIVE-18608:
--

I've just opened a JIRA and a pull request that should be useful for this and 
other changes that need to specify column names:

https://issues.apache.org/jira/browse/ORC-308

It allows you to specify subfields by name, such as your example: 
myInfoArray._elem_.emailBody.





[jira] [Commented] (HIVE-18608) ORC should allow selectively disabling dictionary-encoding on specified columns

2018-02-21 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372101#comment-16372101
 ] 

Owen O'Malley commented on HIVE-18608:
--

I'd suggest making the property:
orc.column.encoding.direct=col10,col20
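
A comma-separated value like the one suggested above could be read along these lines. This is only a sketch: the class name {{DirectEncodingColumns}} and the choice to trim whitespace and drop empty entries are assumptions, not something the patch or ORC is confirmed to do.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper; not part of ORC or Hive. */
final class DirectEncodingColumns {
    private DirectEncodingColumns() {}

    /**
     * Splits a comma-separated property value into column names.
     * Trimming and skipping empty entries are assumptions of this sketch.
     */
    static Set<String> parse(String propertyValue) {
        Set<String> columns = new LinkedHashSet<>();
        if (propertyValue == null) {
            return columns; // property not set: no columns listed
        }
        for (String name : propertyValue.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                columns.add(trimmed);
            }
        }
        return columns;
    }
}
```

For example, {{DirectEncodingColumns.parse("col10,col20")}} yields the two names col10 and col20, in that order.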


[jira] [Commented] (HIVE-18608) ORC should allow selectively disabling dictionary-encoding on specified columns

2018-02-01 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349660#comment-16349660
 ] 

Mithun Radhakrishnan commented on HIVE-18608:
-

I've attached an initial implementation, where dictionary encoding can be 
disabled via a table-property ({{'orc.skip.dictionary.for.columns'}}).

Note: I've only added support for top-level columns. Specifying this on a 
{{STRUCT}} will disable dictionary encoding for the entire sub-tree (i.e. all 
members of the STRUCT), recursively.
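
The top-level-only behaviour described above can be sketched as follows. This is not the code in the attached patch: the class name {{SkipDictionarySketch}}, the dotted-path representation of nested fields, and the exact-match comparison are assumptions for illustration.

```java
import java.util.Set;

/** Hypothetical illustration only; not taken from the attached patch. */
final class SkipDictionarySketch {
    private SkipDictionarySketch() {}

    /**
     * Returns true when the column at columnPath should skip dictionary
     * encoding. Only the top-level name is compared, so every member of a
     * listed STRUCT is covered as well. Dotted paths are an assumption here.
     */
    static boolean skipsDictionary(Set<String> listedTopLevelColumns, String columnPath) {
        int dot = columnPath.indexOf('.');
        String topLevel = dot < 0 ? columnPath : columnPath.substring(0, dot);
        return listedTopLevelColumns.contains(topLevel);
    }
}
```

With a hypothetical STRUCT column myStruct listed, both myStruct and myStruct.name skip dictionary encoding, while listing myStruct.name on its own has no effect in this sketch.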

It might be good to support selection at an arbitrary depth. E.g. 
{{myInfoArray.__elem__.emailBody}}.
