[jira] [Commented] (HIVE-13250) Compute predicate conversions on the client, instead of per row group
[ https://issues.apache.org/jira/browse/HIVE-13250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205041#comment-15205041 ]

Ashutosh Chauhan commented on HIVE-13250:
-----------------------------------------

I think we can special-case this for equality predicates. I will update the patch for that.

> Compute predicate conversions on the client, instead of per row group
> ---------------------------------------------------------------------
>
>                 Key: HIVE-13250
>                 URL: https://issues.apache.org/jira/browse/HIVE-13250
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 2.1.0
>            Reporter: Siddharth Seth
>            Assignee: Ashutosh Chauhan
>         Attachments: HIVE-13250.2.patch, HIVE-13250.patch
>
>
> When running a query of the form
> select count from table where ts_field = "2016-01-23 00:00:00";
> or
> select count from table where ts_field = 1453507200
> ts_field is of type TIMESTAMP
> The predicate is converted to whatever format is appropriate for TIMESTAMP
> processing on each and every row group.
> It would be far more efficient to process this once on the client - or even
> once per task.
> The same applies to ORC split elimination as well - this is applied for each
> stripe.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
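The cost described in the issue can be pictured with a small Python sketch. It is a toy model, not Hive or ORC code: `RowGroup`, `to_timestamp`, `pick_row_groups_per_group`, and `pick_row_groups_once` are hypothetical names, and plain min/max values stand in for the real row-group index. It only contrasts converting the literal inside the per-row-group loop with converting it once up front.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class RowGroup:
    # Hypothetical stand-in for per-row-group min/max statistics.
    min_ts: datetime
    max_ts: datetime


calls = []  # records every literal conversion so repeated work is visible


def to_timestamp(literal: str) -> datetime:
    """Parse the string literal into a typed value and log the call."""
    calls.append(literal)
    return datetime.strptime(literal, "%Y-%m-%d %H:%M:%S")


def pick_row_groups_per_group(groups, literal):
    # Converts the literal again for every row group that is inspected.
    return [g for g in groups if g.min_ts <= to_timestamp(literal) <= g.max_ts]


def pick_row_groups_once(groups, literal):
    # Converts the literal a single time and reuses the typed value.
    value = to_timestamp(literal)
    return [g for g in groups if g.min_ts <= value <= g.max_ts]


# Six made-up row groups, each covering one day of January 2016.
groups = [
    RowGroup(datetime(2016, 1, d), datetime(2016, 1, d + 1)) for d in range(20, 26)
]
```

Both functions select the same row groups; the only difference is how many times `to_timestamp` runs, which is the overhead the issue asks to remove.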
[jira] [Commented] (HIVE-13250) Compute predicate conversions on the client, instead of per row group
[ https://issues.apache.org/jira/browse/HIVE-13250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204859#comment-15204859 ]

Gopal V commented on HIVE-13250:
--------------------------------

bq. I'd expect the cast to change the value to whatever can be compared directly against storage.

UDFs disable all predicate pushdown. Only constants are applied to PPD, so retaining the UDFToString disabled all PPD into the storage system.
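A rough Python sketch of the rule Gopal describes, under heavy simplification: `Column`, `Literal`, `Call`, and `is_pushable` are hypothetical names invented for illustration, and Hive's real analysis works on Java expression trees and covers far more cases. The sketch only captures the idea that a comparison is a pushdown candidate when one side is a bare column and the other a constant, so wrapping the column in a function call rules it out.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class Call:
    # A function applied to an expression, e.g. a cast wrapped around a column.
    func: str
    arg: "Expr"


Expr = Union[Column, Literal, Call]


def is_pushable(left: Expr, right: Expr) -> bool:
    """True only for the shape `column = constant`, in either order."""
    pair = (type(left), type(right))
    return pair in {(Column, Literal), (Literal, Column)}


# Cast on the column: the column is hidden inside a call, so no pushdown.
cast_on_column = (Call("UDFToString", Column("ts_field")), Literal("2016-01-23 00:00:00"))
# Cast folded into the constant: plain column against plain constant.
folded_constant = (Column("ts_field"), Literal("2016-01-23 00:00:00.0"))
```

In this model `is_pushable(*cast_on_column)` is false and `is_pushable(*folded_constant)` is true, which mirrors why retaining the cast on the column keeps the predicate away from the storage layer.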
[jira] [Commented] (HIVE-13250) Compute predicate conversions on the client, instead of per row group
[ https://issues.apache.org/jira/browse/HIVE-13250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204812#comment-15204812 ]

Siddharth Seth commented on HIVE-13250:
---------------------------------------

bq. I misunderstood this bug report. Without patch, filter expression for {{ ts_field = "2016-01-23 00:00:00"}} gets executed as
(UDFToString(ts_field) = '2016-01-23 00:00:00')
In the patch I made changes such that cast is on constant
(ts_field = UDFTOTimeStamp('2016-01-23 00:00:00'))
which gets folded compile time to
(ts_field = 2016-01-23 00:00:00.0)

I'd expect the cast to change the value to whatever can be compared directly against storage. However, I think the type promotion system is far more complicated - and this may not always be possible.
[jira] [Commented] (HIVE-13250) Compute predicate conversions on the client, instead of per row group
[ https://issues.apache.org/jira/browse/HIVE-13250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204809#comment-15204809 ]

Siddharth Seth commented on HIVE-13250:
---------------------------------------

[~ashutoshc] - this is what was observed. The following exception was seen for every row group in ORC files. Note how the constant is being cast each and every time. The intent was to avoid that.
It seems like this is something that can be avoided on the client itself by casting the constant to whatever format the column requires. Now, with schema evolution, this may not always be possible - which is why the suggestion for once per task.

{code}
2016-02-10 02:15:43,175 [WARN] [TezChild] |orc.RecordReaderImpl|: Exception when evaluating predicate. Skipping ORC PPD. Exception: java.lang.IllegalArgumentException: ORC SARGS could not convert from String to TIMESTAMP
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.getBaseObjectForComparison(RecordReaderImpl.java:659)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.evaluatePredicateRange(RecordReaderImpl.java:373)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.evaluatePredicateProto(RecordReaderImpl.java:338)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$SargApplier.pickRowGroups(RecordReaderImpl.java:710)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.pickRowGroups(RecordReaderImpl.java:751)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:777)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:986)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1019)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:205)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:598)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1269)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1151)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:249)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:193)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:135)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:101)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:149)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
    at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:650)
    at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:621)
    at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:145)
    at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:406)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:128)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:149)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    a
[jira] [Commented] (HIVE-13250) Compute predicate conversions on the client, instead of per row group
[ https://issues.apache.org/jira/browse/HIVE-13250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201376#comment-15201376 ]

Hive QA commented on HIVE-13250:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12793829/HIVE-13250.2.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 383 failed/errored test(s), 9833 tests executed
*Failed tests:*
{noformat}
TestSparkCliDriver-groupby3_map.q-sample2.q-auto_join14.q-and-12-more - did not produce a TEST-*.xml file
TestSparkCliDriver-groupby_map_ppr_multi_distinct.q-table_access_keys_stats.q-groupby4_noskew.q-and-12-more - did not produce a TEST-*.xml file
TestSparkCliDriver-join_rc.q-insert1.q-vectorized_rcfile_columnar.q-and-12-more - did not produce a TEST-*.xml file
TestSparkCliDriver-ppd_join4.q-join9.q-ppd_join3.q-and-12-more - did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_allcolref_in_udf
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_part
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_archive_excludeHadoop20
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join11
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join12
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join13
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join14
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join16
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join27
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join6
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join7
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join_without_localtask
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_smb_mapjoin_14
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_10
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_13
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_14
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_15
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_9
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucket_map_join_spark4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketsortoptimize_insert_2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketsortoptimize_insert_4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketsortoptimize_insert_5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketsortoptimize_insert_6
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketsortoptimize_insert_7
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketsortoptimize_insert_8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cast1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_join
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_rp_auto_join1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_rp_join
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_rp_lineage2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_rp_outer_join_ppr
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_rp_semijoin
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_rp_union
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_semijoin
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_union
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_column_access_stats
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_combine2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_constprog3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_constprog_dp
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer10
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_create_view
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_create_view_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cross_join_merge
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ctas_colname
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cte_2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cte_5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cte_mat_1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cte_mat_2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_database
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynamic_rdd_cache
org.apache.hadoop.hiv
[jira] [Commented] (HIVE-13250) Compute predicate conversions on the client, instead of per row group
[ https://issues.apache.org/jira/browse/HIVE-13250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201819#comment-15201819 ]

Ashutosh Chauhan commented on HIVE-13250:
-----------------------------------------

I misunderstood this bug report. Without the patch, the filter expression for {{ ts_field = "2016-01-23 00:00:00"}} gets executed as
{{(UDFToString(ts_field) = '2016-01-23 00:00:00')}}
In the patch I made changes such that the cast is on the constant
{{(ts_field = UDFTOTimeStamp('2016-01-23 00:00:00'))}}
which gets folded at compile time to
{{(ts_field = 2016-01-23 00:00:00.0)}}
However, this is incorrect. The earlier behavior of casting the column is indeed correct. I tested this on Oracle, MySQL & SQL Server, all of which put the cast on the column and not on the constant.
[~sseth] Do you have anything else in mind for this one?
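The two rewrites discussed in this comment are not interchangeable in general, and a toy Python model can show one way they may diverge. This is an assumption-laden illustration: `ts_to_string` and `string_to_ts` are hypothetical helpers, not Hive's UDFToString or timestamp cast, and the thread does not say which cases led to the conclusion above.

```python
from datetime import datetime


def ts_to_string(ts: datetime) -> str:
    # Toy rendering of a timestamp as text; not Hive's UDFToString.
    return ts.strftime("%Y-%m-%d %H:%M:%S")


def string_to_ts(text: str) -> datetime:
    # Toy parser that tolerates a missing time part; not Hive's cast.
    fmt = "%Y-%m-%d %H:%M:%S" if " " in text else "%Y-%m-%d"
    return datetime.strptime(text, fmt)


def cast_on_column(ts: datetime, literal: str) -> bool:
    # Compare as strings: the column value is converted for every row.
    return ts_to_string(ts) == literal


def cast_on_constant(ts: datetime, literal: str) -> bool:
    # Compare as timestamps: the literal is converted instead.
    return ts == string_to_ts(literal)


row = datetime(2016, 1, 23)
```

With the literal `"2016-01-23 00:00:00"` both comparisons agree for `row`. With the shorter literal `"2016-01-23"`, the string comparison rejects the row while the timestamp comparison accepts it, so moving the cast changes the result in this model.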
[jira] [Commented] (HIVE-13250) Compute predicate conversions on the client, instead of per row group
[ https://issues.apache.org/jira/browse/HIVE-13250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189842#comment-15189842 ]

Gopal V commented on HIVE-13250:
--------------------------------

The int -> timestamp conversion needs docs, thanks to {{hive.int.timestamp.conversion.in.seconds=false}}
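To see why the unit matters for the integer literal in the issue description, here is a small Python sketch. It assumes that the setting Gopal mentions, when `false`, means integers are read as milliseconds rather than seconds, and it uses UTC throughout for simplicity; `int_to_timestamp` is a hypothetical helper, not a Hive function.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def int_to_timestamp(value: int, in_seconds: bool) -> datetime:
    """Interpret an integer as seconds or milliseconds since the epoch (UTC)."""
    unit = timedelta(seconds=1) if in_seconds else timedelta(milliseconds=1)
    return EPOCH + value * unit


as_seconds = int_to_timestamp(1453507200, in_seconds=True)
as_millis = int_to_timestamp(1453507200, in_seconds=False)

print(as_seconds)  # 2016-01-23 00:00:00+00:00
print(as_millis)   # 1970-01-17 19:45:07.200000+00:00
```

Read as seconds, `1453507200` matches the string literal from the issue; read as milliseconds, it lands in January 1970, so the same predicate would select entirely different rows.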
[jira] [Commented] (HIVE-13250) Compute predicate conversions on the client, instead of per row group
[ https://issues.apache.org/jira/browse/HIVE-13250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15188204#comment-15188204 ]

Siddharth Seth commented on HIVE-13250:
---------------------------------------

cc [~ashutoshc], [~prasanth_j] - this was initially filed for ORC split elimination and partition pruning. [~ashutoshc] mentioned that CBO may be affected as well.