[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver

2017-08-30 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17414:

Status: Patch Available  (was: Open)

> HoS DPP + Vectorization generates invalid explain plan due to 
> CombineEquivalentWorkResolver
> ---
>
> Key: HIVE-17414
> URL: https://issues.apache.org/jira/browse/HIVE-17414
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang_intel
> Attachments: HIVE-17414.patch
>
>
> Similar to HIVE-16948, the following query generates an invalid explain plan 
> when HoS DPP is enabled together with vectorization:
> {code:sql}
> select ds from (select distinct(ds) as ds from srcpart union all select 
> distinct(ds) as ds from srcpart) s where s.ds in (select max(srcpart.ds) from 
> srcpart union all select min(srcpart.ds) from srcpart)
> {code}
> Explain Plan:
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>   Edges:
> Reducer 11 <- Map 10 (GROUP, 1)
> Reducer 13 <- Map 12 (GROUP, 1)
> #### A masked pattern was here ####
>   Vertices:
> Map 10
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: max(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Map 12
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: min(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Reducer 11
> Execution mode: vectorized
> Reduce Operator Tree:
>   Group By Operator
> aggregations: max(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE 
> Column stats: NONE
> Filter Operator
>   predicate: _col0 is not null (type: boolean)
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: _col0 (type: string)
>   outputColumnNames: _col0
>   Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
> Spark Partition Pruning Sink Operator
>  

[jira] [Updated] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver

2017-08-30 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated HIVE-17414:

Attachment: HIVE-17414.patch

[~stakiar], [~lirui]: please help review.
Previously we restricted clazz to "SparkPartitionPruningSinkOperator" when calling
SparkUtilities#collectOp(Collection result, Operator root, Class clazz), so now,
when VectorSparkPartitionPruningSinkOperator is used, HIVE-16948 does not work.
The changes in the patch:
{code}
 if (root == null) {
   return;
 }
-if (clazz.equals(root.getClass())) {
+if (clazz.equals(root.getClass()) || clazz.isAssignableFrom(root.getClass())) {
   result.add(root);
 }
{code}
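
To illustrate the difference between the two checks (a minimal, self-contained 
sketch; Base and Sub stand in for SparkPartitionPruningSinkOperator and its 
vectorized subclass):

{code}
class Base {}
class Sub extends Base {}

public class AssignableDemo {
  public static void main(String[] args) {
    Class<?> clazz = Base.class;
    Object root = new Sub();
    // equals compares the exact Class objects, so a subclass never matches:
    System.out.println(clazz.equals(root.getClass()));           // false
    // isAssignableFrom also accepts subclasses:
    System.out.println(clazz.isAssignableFrom(root.getClass())); // true
  }
}
{code}

Strictly speaking, isAssignableFrom already covers the equals case, so the 
second condition alone would suffice.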


[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-30 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148485#comment-16148485
 ] 

Rui Li commented on HIVE-15104:
---

[~xuefuz], I'll check whether that's feasible. Do you think it's OK to create a 
package just for one single class?

> Hive on Spark generate more shuffle data than hive on mr
> 
>
> Key: HIVE-15104
> URL: https://issues.apache.org/jira/browse/HIVE-15104
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 1.2.1
>Reporter: wangwenli
>Assignee: Rui Li
> Attachments: HIVE-15104.1.patch, HIVE-15104.2.patch, 
> HIVE-15104.3.patch, HIVE-15104.4.patch, HIVE-15104.5.patch, 
> HIVE-15104.5.patch, TPC-H 100G.xlsx
>
>
> The same SQL, running on the Spark and MR engines, will generate different 
> sizes of shuffle data.
> I think it is because Hive on MR serializes only part of the HiveKey, while 
> Hive on Spark, which uses kryo, serializes the full HiveKey object.
> What is your opinion?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17405) HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT

2017-08-30 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148476#comment-16148476
 ] 

Sahil Takiar commented on HIVE-17405:
-

Updated patch + RB with updated golden files for the skewjoin qtests. The 
additional call to constant propagation simplified a few of the predicates, but 
there were no changes to query output.

[~lirui], [~kellyzly] could you review?

> HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT
> -
>
> Key: HIVE-17405
> URL: https://issues.apache.org/jira/browse/HIVE-17405
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-17405.1.patch, HIVE-17405.2.patch, 
> HIVE-17405.3.patch, HIVE-17405.4.patch, HIVE-17405.5.patch
>
>
> In {{SparkCompiler#runDynamicPartitionPruning}} we should change {{new 
> ConstantPropagate().transform(parseContext)}} to {{new 
> ConstantPropagate(ConstantPropagateOption.SHORTCUT).transform(parseContext)}}
> Hive-on-Tez does the same thing.
> Running the full constant propagation isn't really necessary; we just want to 
> eliminate any {{and true}} predicates that were introduced by 
> {{SyntheticJoinPredicate}} and {{DynamicPartitionPruningOptimization}}. The 
> {{SyntheticJoinPredicate}} will introduce dummy filter predicates into the 
> operator tree, and {{DynamicPartitionPruningOptimization}} will replace them. 
> The predicates introduced via {{SyntheticJoinPredicate}} are necessary to 
> help {{DynamicPartitionPruningOptimization}} determine if DPP can be used or 
> not.
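
For reference, the change described above boils down to a one-line diff in 
SparkCompiler#runDynamicPartitionPruning (a sketch; the surrounding context is 
illustrative):

{code}
-  new ConstantPropagate().transform(parseContext);
+  new ConstantPropagate(ConstantPropagateOption.SHORTCUT).transform(parseContext);
{code}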



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17405) HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT

2017-08-30 Thread Sahil Takiar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Takiar updated HIVE-17405:

Attachment: HIVE-17405.5.patch

> HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT
> -
>
> Key: HIVE-17405
> URL: https://issues.apache.org/jira/browse/HIVE-17405
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-17405.1.patch, HIVE-17405.2.patch, 
> HIVE-17405.3.patch, HIVE-17405.4.patch, HIVE-17405.5.patch
>
>
> In {{SparkCompiler#runDynamicPartitionPruning}} we should change {{new 
> ConstantPropagate().transform(parseContext)}} to {{new 
> ConstantPropagate(ConstantPropagateOption.SHORTCUT).transform(parseContext)}}
> Hive-on-Tez does the same thing.
> Running the full constant propagation isn't really necessary; we just want to 
> eliminate any {{and true}} predicates that were introduced by 
> {{SyntheticJoinPredicate}} and {{DynamicPartitionPruningOptimization}}. The 
> {{SyntheticJoinPredicate}} will introduce dummy filter predicates into the 
> operator tree, and {{DynamicPartitionPruningOptimization}} will replace them. 
> The predicates introduced via {{SyntheticJoinPredicate}} are necessary to 
> help {{DynamicPartitionPruningOptimization}} determine if DPP can be used or 
> not.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17415) Hit error "SemanticException View xxx is corresponding to LIMIT, rather than a SelectOperator." in Hive queries

2017-08-30 Thread Deepak Jaiswal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Jaiswal updated HIVE-17415:
--
Attachment: HIVE-17415.2.patch

Forgot to add the test in testconfiguration.

> Hit error "SemanticException View xxx is corresponding to LIMIT, rather than 
> a SelectOperator." in Hive queries
> ---
>
> Key: HIVE-17415
> URL: https://issues.apache.org/jira/browse/HIVE-17415
> Project: Hive
>  Issue Type: Bug
>Reporter: Deepak Jaiswal
>Assignee: Deepak Jaiswal
> Attachments: HIVE-17415.1.patch, HIVE-17415.2.patch
>
>
> Hit error "SemanticException View xxx is corresponding to LIMIT, rather than 
> a SelectOperator." in Hive queries when a user creates a view with a LIMIT:
> set 
> hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider;
> create table my_passwd (
> username string,
> uid int);
> insert into my_passwd values
> ("Dev1", 1),
> ("Dev2", 2),
> ("Dev3", 3),
> ("Dev4", 4),
> ("Dev5", 5),
> ("Dev6", 6);
> create view my_passwd_vw as select * from my_passwd limit 3;
> set hive.security.authorization.enabled=true;
> grant select on table my_passwd to user hive_test_user;
> grant select on table my_passwd_vw to user hive_test_user;
> select * from my_passwd_vw;



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17417) Lazy Timestamp and Date serialization is very expensive

2017-08-30 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148384#comment-16148384
 ] 

Sergey Shelukhin commented on HIVE-17417:
-

That seems like a schema problem in addition to potential serialization 
inefficiencies.

> Lazy Timestamp and Date serialization is very expensive
> ---
>
> Key: HIVE-17417
> URL: https://issues.apache.org/jira/browse/HIVE-17417
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Affects Versions: 3.0.0, 2.4.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Critical
> Attachments: date-serialize.png, timestamp-serialize.png
>
>
> In a specific case where a schema contains an array with timestamp and 
> date fields (array size >1), any access to this column is very 
> expensive in terms of CPU, as most of the time goes to serialization of 
> timestamp and date. Refer to the attached profiles: >70% of the time is 
> spent in serialization + toString conversions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17415) Hit error "SemanticException View xxx is corresponding to LIMIT, rather than a SelectOperator." in Hive queries

2017-08-30 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148357#comment-16148357
 ] 

Hive QA commented on HIVE-17415:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12884563/HIVE-17415.1.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 7 failed/errored test(s), 11020 tests 
executed
*Failed tests:*
{noformat}
TestTxnCommandsBase - did not produce a TEST-*.xml file (likely timed out) 
(batchId=280)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[authorization_view_8] 
(batchId=44)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=61)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning]
 (batchId=169)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] 
(batchId=100)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=234)
org.apache.hadoop.hive.ql.io.TestSymlinkTextInputFormat.testCombine 
(batchId=262)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6610/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6610/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6610/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 7 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12884563 - PreCommit-HIVE-Build

> Hit error "SemanticException View xxx is corresponding to LIMIT, rather than 
> a SelectOperator." in Hive queries
> ---
>
> Key: HIVE-17415
> URL: https://issues.apache.org/jira/browse/HIVE-17415
> Project: Hive
>  Issue Type: Bug
>Reporter: Deepak Jaiswal
>Assignee: Deepak Jaiswal
> Attachments: HIVE-17415.1.patch
>
>
> Hit error "SemanticException View xxx is corresponding to LIMIT, rather than 
> a SelectOperator." in Hive queries when a user creates a view with a LIMIT:
> set 
> hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider;
> create table my_passwd (
> username string,
> uid int);
> insert into my_passwd values
> ("Dev1", 1),
> ("Dev2", 2),
> ("Dev3", 3),
> ("Dev4", 4),
> ("Dev5", 5),
> ("Dev6", 6);
> create view my_passwd_vw as select * from my_passwd limit 3;
> set hive.security.authorization.enabled=true;
> grant select on table my_passwd to user hive_test_user;
> grant select on table my_passwd_vw to user hive_test_user;
> select * from my_passwd_vw;



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (HIVE-17383) ArrayIndexOutOfBoundsException in VectorGroupByOperator

2017-08-30 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li reassigned HIVE-17383:
-

Assignee: Rui Li

> ArrayIndexOutOfBoundsException in VectorGroupByOperator
> ---
>
> Key: HIVE-17383
> URL: https://issues.apache.org/jira/browse/HIVE-17383
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Assignee: Rui Li
>
> Query to reproduce:
> {noformat}
> set hive.cbo.enable=false;
> select count(*) from (select key from src group by key) s where s.key='98';
> {noformat}
> The stack trace is:
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeReduceMergePartial.doProcessBatch(VectorGroupByOperator.java:831)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeBase.processBatch(VectorGroupByOperator.java:174)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.process(VectorGroupByOperator.java:1046)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.processVectorGroup(ReduceRecordSource.java:462)
>   ... 18 more
> {noformat}
> More details can be found in HIVE-16823



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-30 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148351#comment-16148351
 ] 

Xuefu Zhang commented on HIVE-15104:


I see. It might be possible to put this class in a new package (jar), for which 
we don't relocate kryo? 

> Hive on Spark generate more shuffle data than hive on mr
> 
>
> Key: HIVE-15104
> URL: https://issues.apache.org/jira/browse/HIVE-15104
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 1.2.1
>Reporter: wangwenli
>Assignee: Rui Li
> Attachments: HIVE-15104.1.patch, HIVE-15104.2.patch, 
> HIVE-15104.3.patch, HIVE-15104.4.patch, HIVE-15104.5.patch, 
> HIVE-15104.5.patch, TPC-H 100G.xlsx
>
>
> The same SQL, running on the Spark and MR engines, will generate different 
> sizes of shuffle data.
> I think it is because Hive on MR serializes only part of the HiveKey, while 
> Hive on Spark, which uses kryo, serializes the full HiveKey object.
> What is your opinion?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17193) HoS: don't combine map works that are targets of different DPPs

2017-08-30 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148344#comment-16148344
 ] 

Rui Li commented on HIVE-17193:
---

Yes I think so. We don't consider DPP when we combine map works in 
CombineEquivalentWorkResolver. This is wrong because if two map works are to be 
pruned by different DPP sinks, their outputs are probably different and 
shouldn't be combined.
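
A hypothetical sketch of the extra condition this implies (the types and 
helpers below are illustrative stand-ins, not the actual 
CombineEquivalentWorkResolver API):

{code}
import java.util.Set;

public class CombineCheckSketch {
  // Illustrative stand-in for a map work: a signature of its operator tree
  // plus the set of DPP sinks that prune it.
  static class Work {
    final String operatorTreeSignature;
    final Set<String> dppSources;
    Work(String sig, Set<String> dpp) {
      this.operatorTreeSignature = sig;
      this.dppSources = dpp;
    }
  }

  // Two map works should only be combined if their operator trees match AND
  // they are pruned by the same DPP sinks; otherwise their runtime inputs
  // can differ even though the trees look identical.
  static boolean canCombine(Work w1, Work w2) {
    return w1.operatorTreeSignature.equals(w2.operatorTreeSignature)
        && w1.dppSources.equals(w2.dppSources);
  }
}
{code}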

> HoS: don't combine map works that are targets of different DPPs
> ---
>
> Key: HIVE-17193
> URL: https://issues.apache.org/jira/browse/HIVE-17193
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Assignee: Rui Li
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-30 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148340#comment-16148340
 ] 

Rui Li commented on HIVE-15104:
---

[~xuefuz], my previous 
[comment|https://issues.apache.org/jira/browse/HIVE-15104?focusedCommentId=15998177=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15998177]
 has some explanations about the relocation problem. Basically, the problem is 
that we need to implement a method defined by Spark, and the method accepts a 
kryo parameter. With relocation, Hive's kryo and Spark's kryo are in different 
packages. If we compile the class in Hive and run it in Spark, Spark will treat 
the method as not implemented because it has a different signature.
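
A self-contained sketch of that mismatch (SparkKryo and HiveKryo are stand-ins 
for the unrelocated and relocated Kryo classes; the relocated package name in 
the comments is an assumption):

{code}
// Stand-ins: after shading, Hive's copy of Kryo lives in a different package,
// so to the JVM it is a completely different type from Spark's Kryo.
class SparkKryo {}  // com.esotericsoftware.kryo.Kryo, as Spark compiled it
class HiveKryo {}   // e.g. org.apache.hive.com.esotericsoftware.kryo.Kryo

interface SparkHook {            // the method Spark defines and later calls
  void apply(SparkKryo kryo);
}

class HiveImpl /* implements SparkHook */ {
  // Relocation rewrites Hive's "Kryo" references, so the compiled method
  // takes HiveKryo and no longer matches SparkHook.apply(SparkKryo):
  public void apply(HiveKryo kryo) {}
  // Uncommenting "implements SparkHook" fails to compile here; across
  // separately built jars it surfaces at runtime as the method being
  // "not implemented" (e.g. AbstractMethodError).
}
{code}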

> Hive on Spark generate more shuffle data than hive on mr
> 
>
> Key: HIVE-15104
> URL: https://issues.apache.org/jira/browse/HIVE-15104
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 1.2.1
>Reporter: wangwenli
>Assignee: Rui Li
> Attachments: HIVE-15104.1.patch, HIVE-15104.2.patch, 
> HIVE-15104.3.patch, HIVE-15104.4.patch, HIVE-15104.5.patch, 
> HIVE-15104.5.patch, TPC-H 100G.xlsx
>
>
> The same SQL, running on the Spark and MR engines, will generate different 
> sizes of shuffle data.
> I think it is because Hive on MR serializes only part of the HiveKey, while 
> Hive on Spark, which uses kryo, serializes the full HiveKey object.
> What is your opinion?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17412) Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q

2017-08-30 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148336#comment-16148336
 ] 

liyunzhang_intel commented on HIVE-17412:
-

[~Ferd]: As Xuefu and Sahil have finished the review, can you help commit the 
patch? Thanks. The reason I triggered Hive QA is that HIVE-17405 will make 
further updates to spark_vectorized_dynamic_partition_pruning.q.out.

> Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-17412
> URL: https://issues.apache.org/jira/browse/HIVE-17412
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-17412.patch
>
>
> for query
> {code}
>  set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> select distinct ds from srcpart;
> {code}
> the result is 
> {code}
> 2008-04-09
> 2008-04-08
> {code}
> The result of the group-by in Spark is not deterministically ordered. Sometimes it returns 
> {code}
> 2008-04-08
> 2008-04-09
> {code}
> Sometimes it returns
> {code}
> 2008-04-09
> 2008-04-08
> {code}
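
For reference, the directive is just a comment at the top of the q file; a 
sketch of how spark_vectorized_dynamic_partition_pruning.q would begin 
(settings copied from the description above):

{code}
-- SORT_QUERY_RESULTS

set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=true;
set hive.spark.dynamic.partition.pruning=true;

select distinct ds from srcpart;
{code}

The test driver then sorts the query output before comparing it against the 
golden file, so either row order passes.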



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17417) Lazy Timestamp and Date serialization is very expensive

2017-08-30 Thread Prasanth Jayachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth Jayachandran updated HIVE-17417:
-
Attachment: date-serialize.png
timestamp-serialize.png

> Lazy Timestamp and Date serialization is very expensive
> ---
>
> Key: HIVE-17417
> URL: https://issues.apache.org/jira/browse/HIVE-17417
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Affects Versions: 3.0.0, 2.4.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Critical
> Attachments: date-serialize.png, timestamp-serialize.png
>
>
> In a specific case where a schema contains an array with timestamp and 
> date fields (array size >1), any access to this column is very 
> expensive in terms of CPU, as most of the time goes to serialization of 
> timestamp and date. Refer to the attached profiles: >70% of the time is 
> spent in serialization + toString conversions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17381) When we enable Parquet Writer Version V2, hive throws an exception: Unsupported encoding: DELTA_BYTE_ARRAY.

2017-08-30 Thread Ferdinand Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148320#comment-16148320
 ] 

Ferdinand Xu commented on HIVE-17381:
-

Thanks [~colinma] for the patch. LGTM +1.

> When we enable Parquet Writer Version V2, hive throws an exception: 
> Unsupported encoding: DELTA_BYTE_ARRAY.
> ---
>
> Key: HIVE-17381
> URL: https://issues.apache.org/jira/browse/HIVE-17381
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Ke Jia
>Assignee: Colin Ma
> Attachments: HIVE-17381.001.patch
>
>
> When we set "hive.vectorized.execution.enabled=true" and 
> "parquet.writer.version=v2" simultaneously, Hive throws the following 
> exception:
> Caused by: java.io.IOException: java.io.IOException: 
> java.lang.UnsupportedOperationException: Unsupported encoding: 
> DELTA_BYTE_ARRAY
>   at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>   at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>   at 
> org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:232)
>   at 
> org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:142)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:254)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:208)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:83)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: java.lang.UnsupportedOperationException: 
> Unsupported encoding: DELTA_BYTE_ARRAY
>   at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>   at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>   at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:365)
>   at 
> org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:167)
>   at 
> org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:52)
>   at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
>   at 
> org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:229)
>   ... 16 more



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (HIVE-17417) Lazy Timestamp and Date serialization is very expensive

2017-08-30 Thread Prasanth Jayachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth Jayachandran reassigned HIVE-17417:



> Lazy Timestamp and Date serialization is very expensive
> ---
>
> Key: HIVE-17417
> URL: https://issues.apache.org/jira/browse/HIVE-17417
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Affects Versions: 3.0.0, 2.4.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Critical
>
> In a specific case where a schema contains an array with timestamp and 
> date fields (array size >1), any access to this column is very 
> expensive in terms of CPU, as most of the time goes to serialization of 
> timestamp and date. Refer to the attached profiles: >70% of the time is 
> spent in serialization + toString conversions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17405) HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT

2017-08-30 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148302#comment-16148302
 ] 

Hive QA commented on HIVE-17405:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12884560/HIVE-17405.4.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 23 failed/errored test(s), 11019 tests 
executed
*Failed tests:*
{noformat}
TestTxnCommandsBase - did not produce a TEST-*.xml file (likely timed out) 
(batchId=280)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=61)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] 
(batchId=100)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=234)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoin_union_remove_1]
 (batchId=138)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoin_union_remove_2]
 (batchId=113)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoinopt10] 
(batchId=110)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoinopt11] 
(batchId=131)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoinopt12] 
(batchId=105)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoinopt14] 
(batchId=132)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoinopt15] 
(batchId=106)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoinopt16] 
(batchId=107)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoinopt17] 
(batchId=137)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoinopt19] 
(batchId=110)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoinopt1] 
(batchId=135)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoinopt20] 
(batchId=133)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoinopt2] 
(batchId=103)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoinopt3] 
(batchId=111)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoinopt4] 
(batchId=112)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoinopt5] 
(batchId=112)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoinopt6] 
(batchId=110)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoinopt7] 
(batchId=123)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoinopt8] 
(batchId=113)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6609/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6609/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6609/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 23 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12884560 - PreCommit-HIVE-Build

> HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT
> -
>
> Key: HIVE-17405
> URL: https://issues.apache.org/jira/browse/HIVE-17405
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-17405.1.patch, HIVE-17405.2.patch, 
> HIVE-17405.3.patch, HIVE-17405.4.patch
>
>
> In {{SparkCompiler#runDynamicPartitionPruning}} we should change {{new 
> ConstantPropagate().transform(parseContext)}} to {{new 
> ConstantPropagate(ConstantPropagateOption.SHORTCUT).transform(parseContext)}}
> Hive-on-Tez does the same thing.
> Running the full constant propagation isn't really necessary; we just want to 
> eliminate any {{and true}} predicates that were introduced by 
> {{SyntheticJoinPredicate}} and {{DynamicPartitionPruningOptimization}}. The 
> {{SyntheticJoinPredicate}} will introduce dummy filter predicates into the 
> operator tree, and {{DynamicPartitionPruningOptimization}} will replace them. 
> The predicates introduced via {{SyntheticJoinPredicate}} are necessary to 
> help {{DynamicPartitionPruningOptimization}} determine if DPP can be used or 
> not.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17408) replication distcp should only be invoked if number of files AND file size cross configured limits

2017-08-30 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated HIVE-17408:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch committed to master. I also added the missing Apache license header for 
the new test case before commit.
Thanks for the patch [~anishek]!


> replication distcp should only be invoked if number of files AND file size 
> cross configured limits
> --
>
> Key: HIVE-17408
> URL: https://issues.apache.org/jira/browse/HIVE-17408
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Affects Versions: 3.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Trivial
> Fix For: 3.0.0
>
> Attachments: HIVE-17408.1.patch
>
>
> CopyUtils currently invokes distcp when either the 
> "hive.exec.copyfile.maxnumfiles" or the "hive.exec.copyfile.maxsize" condition 
> is breached; it should only be invoked when both are breached, so the check 
> should be AND rather than OR.
> distcp cannot do a distributed copy of a single large file, which is more 
> reason to make the above change.
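
A minimal sketch of the proposed check (the method and parameter names are 
illustrative; the real logic lives in CopyUtils and is driven by 
hive.exec.copyfile.maxnumfiles and hive.exec.copyfile.maxsize):

{code}
public class CopyPolicySketch {
  // Sketch only: fall back to distcp when BOTH limits are crossed, instead
  // of the current either/or behavior described above.
  static boolean shouldUseDistCp(long numFiles, long totalBytes,
                                 long maxNumFiles, long maxSize) {
    return numFiles > maxNumFiles && totalBytes > maxSize;
  }
}
{code}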



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Reopened] (HIVE-17367) IMPORT table doesn't load from data dump if a metadata-only dump was already imported.

2017-08-30 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair reopened HIVE-17367:
--

Resolved the wrong JIRA, reopening!
This is not committed.

> IMPORT table doesn't load from data dump if a metadata-only dump was already 
> imported.
> --
>
> Key: HIVE-17367
> URL: https://issues.apache.org/jira/browse/HIVE-17367
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2, Import/Export, repl
>Affects Versions: 3.0.0
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
>  Labels: DR, replication
> Fix For: 3.0.0
>
> Attachments: HIVE-17367.01.patch, HIVE-17367.02.patch
>
>
> Repl v1 creates a set of EXPORT/IMPORT commands to replicate modified data 
> (as per events) across clusters.
> For instance, let's say an insert generates 2 events:
> ALTER_TABLE (ID: 10)
> INSERT (ID: 11)
> Each event generates a set of EXPORT and IMPORT commands.
> The ALTER_TABLE event generates a metadata-only export/import.
> INSERT generates a metadata+data export/import.
> As Hive always dumps the latest copy of a table during export, it sets the 
> latest notification event ID as the table's current state. So, in this 
> example, the import of metadata by the ALTER_TABLE event sets the current 
> state of the table to 11.
> Now, when we try to import the data dumped by the INSERT event, it is a noop 
> because the table's current state (11) equals the dump state (11), which in 
> turn means the data never gets replicated to the target cluster.
> So, it is necessary to allow overwrite of a table/partition if its current 
> state equals the dump state.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17367) IMPORT table doesn't load from data dump if a metadata-only dump was already imported.

2017-08-30 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated HIVE-17367:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch committed to master.
Thanks for the patch [~sankarh] and for the review [~anishek]!


> IMPORT table doesn't load from data dump if a metadata-only dump was already 
> imported.
> --
>
> Key: HIVE-17367
> URL: https://issues.apache.org/jira/browse/HIVE-17367
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2, Import/Export, repl
>Affects Versions: 3.0.0
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
>  Labels: DR, replication
> Fix For: 3.0.0
>
> Attachments: HIVE-17367.01.patch, HIVE-17367.02.patch
>
>
> Repl v1 creates a set of EXPORT/IMPORT commands to replicate modified data 
> (as per events) across clusters.
> For instance, let's say an insert generates 2 events:
> ALTER_TABLE (ID: 10)
> INSERT (ID: 11)
> Each event generates a set of EXPORT and IMPORT commands.
> The ALTER_TABLE event generates a metadata-only export/import.
> INSERT generates a metadata+data export/import.
> As Hive always dumps the latest copy of a table during export, it sets the 
> latest notification event ID as the table's current state. So, in this 
> example, the import of metadata by the ALTER_TABLE event sets the current 
> state of the table to 11.
> Now, when we try to import the data dumped by the INSERT event, it is a noop 
> because the table's current state (11) equals the dump state (11), which in 
> turn means the data never gets replicated to the target cluster.
> So, it is necessary to allow overwrite of a table/partition if its current 
> state equals the dump state.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17100) Improve HS2 operation logs for REPL commands.

2017-08-30 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148247#comment-16148247
 ] 

Thejas M Nair commented on HIVE-17100:
--

+1

> Improve HS2 operation logs for REPL commands.
> -
>
> Key: HIVE-17100
> URL: https://issues.apache.org/jira/browse/HIVE-17100
> Project: Hive
>  Issue Type: Sub-task
>  Components: HiveServer2, repl
>Affects Versions: 2.1.0
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
>  Labels: DR, replication
> Fix For: 3.0.0
>
> Attachments: HIVE-17100.01.patch, HIVE-17100.02.patch, 
> HIVE-17100.03.patch, HIVE-17100.04.patch, HIVE-17100.05.patch, 
> HIVE-17100.06.patch, HIVE-17100.07.patch, HIVE-17100.08.patch, 
> HIVE-17100.09.patch, HIVE-17100.10.patch
>
>
> It is necessary to log the progress of the replication tasks in a structured 
> manner, as follows.
> *+Bootstrap Dump:+*
> * At the start of bootstrap dump, will add one log with below details.
> {color:#59afe1}* Database Name
> * Dump Type (BOOTSTRAP)
> * (Estimated) Total number of tables/views to dump
> * (Estimated) Total number of functions to dump.
> * Dump Start Time{color}
> * After each table dump, will add a log as follows
> {color:#59afe1}* Table/View Name
> * Type (TABLE/VIEW/MATERIALIZED_VIEW)
> * Table dump end time
> * Table dump progress. Format is Table sequence no/(Estimated) Total number 
> of tables and views.{color}
> * After each function dump, will add a log as follows
> {color:#59afe1}* Function Name
> * Function dump end time
> * Function dump progress. Format is Function sequence no/(Estimated) Total 
> number of functions.{color}
> * After completion of all dumps, will add a log as follows to consolidate the 
> dump.
> {color:#59afe1}* Database Name.
> * Dump Type (BOOTSTRAP).
> * Dump End Time.
> * (Actual) Total number of tables/views dumped.
> * (Actual) Total number of functions dumped.
> * Dump Directory.
> * Last Repl ID of the dump.{color}
> *Note:* The actual and estimated numbers of tables/functions may not match if 
> any table/function is dropped while the dump is in progress.
> *+Bootstrap Load:+*
> * At the start of bootstrap load, will add one log with below details.
> {color:#59afe1}* Database Name
> * Dump directory
> * Load Type (BOOTSTRAP)
> * Total number of tables/views to load
> * Total number of functions to load.
> * Load Start Time{color}
> * After each table load, will add a log as follows
> {color:#59afe1}* Table/View Name
> * Type (TABLE/VIEW/MATERIALIZED_VIEW)
> * Table load completion time
> * Table load progress. Format is Table sequence no/Total number of tables and 
> views.{color}
> * After each function load, will add a log as follows
> {color:#59afe1}* Function Name
> * Function load completion time
> * Function load progress. Format is Function sequence no/Total number of 
> functions.{color}
> * After completion of all dumps, will add a log as follows to consolidate the 
> load.
> {color:#59afe1}* Database Name.
> * Load Type (BOOTSTRAP).
> * Load End Time.
> * Total number of tables/views loaded.
> * Total number of functions loaded.
> * Last Repl ID of the loaded database.{color}
> *+Incremental Dump:+*
> * At the start of database dump, will add one log with below details.
> {color:#59afe1}* Database Name
> * Dump Type (INCREMENTAL)
> * (Estimated) Total number of events to dump.
> * Dump Start Time{color}
> * After each event dump, will add a log as follows
> {color:#59afe1}* Event ID
> * Event Type (CREATE_TABLE, DROP_TABLE, ALTER_TABLE, INSERT etc)
> * Event dump end time
> * Event dump progress. Format is Event sequence no/ (Estimated) Total number 
> of events.{color}
> * After completion of all event dumps, will add a log as follows.
> {color:#59afe1}* Database Name.
> * Dump Type (INCREMENTAL).
> * Dump End Time.
> * (Actual) Total number of events dumped.
> * Dump Directory.
> * Last Repl ID of the dump.{color}
> *Note:* The estimated number of events can be terribly inaccurate compared to 
> the actual number, as we don't have the number of events upfront until we read 
> from the metastore NotificationEvents table.
> *+Incremental Load:+*
> * At the start of incremental load, will add one log with below details.
> {color:#59afe1}* Target Database Name 
> * Dump directory
> * Load Type (INCREMENTAL)
> * Total number of events to load
> * Load Start Time{color}
> * After each event load, will add a log as follows
> {color:#59afe1}* Event ID
> * Event Type (CREATE_TABLE, DROP_TABLE, ALTER_TABLE, INSERT etc)
> * Event load end time
> * Event load progress. Format is Event sequence no/ Total number of 
> events.{color}
> * After completion of all event loads, will add a log as follows to 
> consolidate the load.
> {color:#59afe1}* Target Database Name.
> * Load Type (INCREMENTAL).
> * Load End Time.
> * Total 

[jira] [Commented] (HIVE-17006) LLAP: Parquet caching

2017-08-30 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148242#comment-16148242
 ] 

Hive QA commented on HIVE-17006:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12884556/HIVE-17006.03.patch

{color:green}SUCCESS:{color} +1 due to 6 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 11020 tests 
executed
*Failed tests:*
{noformat}
TestTxnCommandsBase - did not produce a TEST-*.xml file (likely timed out) 
(batchId=280)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=61)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_partitioned_date_time]
 (batchId=161)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning]
 (batchId=169)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=234)
org.apache.hadoop.hive.cli.TestSparkNegativeCliDriver.testCliDriver[spark_stage_max_tasks]
 (batchId=241)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6608/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6608/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6608/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12884556 - PreCommit-HIVE-Build

> LLAP: Parquet caching
> -
>
> Key: HIVE-17006
> URL: https://issues.apache.org/jira/browse/HIVE-17006
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-17006.01.patch, HIVE-17006.02.patch, 
> HIVE-17006.03.patch, HIVE-17006.patch, HIVE-17006.WIP.patch
>
>
> There are multiple options to do Parquet caching in LLAP:
> 1) Full elevator (too intrusive for now).
> 2) Page based cache like ORC (requires some changes to Parquet or 
> copy-pasted).
> 3) Cache disk data on column chunk level as is.
> Given that Parquet reads at column chunk granularity, (2) is not as useful as 
> for ORC, but still a good idea. I messaged the dev list about it but didn't 
> get a response; we may follow up later.
> For now, do (3). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16904) during repl load for large number of partitions the metadata file can be huge and can lead to out of memory

2017-08-30 Thread anishek (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148234#comment-16148234
 ] 

anishek commented on HIVE-16904:


On internal runs we saw that for 10,000 partitions with one file each it was 
creating a metadata file of about ~16 MB. Extrapolating this to include 
additional properties and files etc., say 20 MB per 10,000 partitions, then for 
1 million partitions it's about 2 GB.

Adding Java object overhead of about another 50%, we should still be using about 
3 GB of RAM to process this file, which does not seem too large.
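
Spelling out the arithmetic behind that estimate:

{noformat}
16 MB / 10,000 partitions  ~ 1.6 KB per partition (observed)
round up to ~2 KB/partition to cover extra properties, files, etc.
2 KB x 1,000,000 partitions = ~2 GB of JSON
2 GB x 1.5 (Java object overhead) ~ 3 GB of heap
{noformat}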

So parking this for now; will come back to it later if there is still an 
issue.

Sample Code to allow doing this 

{code}

import org.apache.commons.io.FileUtils;
import org.apache.hadoop.hive.metastore.api.Partition;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TJSONProtocol;
import org.codehaus.jackson.JsonFactory;
import org.codehaus.jackson.JsonNode;
import org.codehaus.jackson.JsonParser;
import org.codehaus.jackson.JsonToken;
import org.codehaus.jackson.map.MappingJsonFactory;
import org.codehaus.jackson.map.ObjectMapper;
import org.json.JSONObject;
import org.junit.Test;

import java.io.File;
import java.io.IOException;

import static org.junit.Assert.fail;

public class StreamingJsonTests {

  @Test
  public void testStreaming() throws IOException, TException {
TDeserializer deserializer = new TDeserializer(new TJSONProtocol.Factory());
ObjectMapper mapper = new ObjectMapper();
JsonFactory factory = new MappingJsonFactory();
printMemory("before reading file to parser");
JsonParser parser =
factory.createJsonParser(new File("_metadata"));
if (parser.nextToken() != JsonToken.START_OBJECT)
  fail("cant parse the files");
for (JsonToken jsonToken = parser.nextToken();
 jsonToken != JsonToken.END_OBJECT; jsonToken = parser.nextToken()) {
  if (parser.getCurrentName().equalsIgnoreCase("partitions")) {
break;
  }
}
int count = 0;
printMemory("after finding out the partitions object location");
if (parser.nextToken() == JsonToken.START_ARRAY) {
  while (parser.nextToken() != JsonToken.END_ARRAY) {
JsonNode jsonNode = mapper.readTree(parser);
Partition partition = new Partition();
deserializer.deserialize(partition, jsonNode.asText(), "UTF-8");
count++;
  }
  System.out.println("number of partitions :" + count);
} else {
  fail("no partitions array token");
}
parser.close();
  }

  @Test
  public void testRegular() throws IOException {
printMemory("before starting");
JSONObject jsonObject = new JSONObject(
FileUtils.readFileToString(new File("_metadata")));
printMemory("after reading the file");
jsonObject.toString();
  }

  private void printMemory(String msg) {
Runtime runtime = Runtime.getRuntime();
runtime.gc();
long usedMemory = runtime.totalMemory() - runtime.freeMemory();
System.out.println(msg + " KB used : " + usedMemory / 1024);
  }

}

{code}

An additional problem to look at is the overhead that bootstrap creates on the 
NameNode: all partitions have their own directory hierarchy (for multiple 
partition columns per table) to store the {{_files}}.

> during repl load for large number of partitions the metadata file can be huge 
> and can lead to out of memory 
> 
>
> Key: HIVE-16904
> URL: https://issues.apache.org/jira/browse/HIVE-16904
> Project: Hive
>  Issue Type: Sub-task
>  Components: HiveServer2
>Affects Versions: 3.0.0
>Reporter: anishek
>Assignee: anishek
> Fix For: 3.0.0
>
>
> The metadata pertaining to a table + its partitions is stored in a single 
> file. During repl load, all the data is loaded into memory in one shot and 
> then individual partitions are processed. This can lead to huge memory 
> overhead as the entire file is read into memory. Try to deserialize the 
> partition objects with some sort of streaming JSON deserializer.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17415) Hit error "SemanticException View xxx is corresponding to LIMIT, rather than a SelectOperator." in Hive queries

2017-08-30 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148224#comment-16148224
 ] 

Ashutosh Chauhan commented on HIVE-17415:
-

+1

> Hit error "SemanticException View xxx is corresponding to LIMIT, rather than 
> a SelectOperator." in Hive queries
> ---
>
> Key: HIVE-17415
> URL: https://issues.apache.org/jira/browse/HIVE-17415
> Project: Hive
>  Issue Type: Bug
>Reporter: Deepak Jaiswal
>Assignee: Deepak Jaiswal
> Attachments: HIVE-17415.1.patch
>
>
> Hit error "SemanticException View xxx is corresponding to LIMIT, rather than 
> a SelectOperator." in Hive queries when a user creates a view with a LIMIT:
> set 
> hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider;
> create table my_passwd (
> username string,
> uid int);
> insert into my_passwd values
> ("Dev1", 1),
> ("Dev2", 2),
> ("Dev3", 3),
> ("Dev4", 4),
> ("Dev5", 5),
> ("Dev6", 6);
> create view my_passwd_vw as select * from my_passwd limit 3;
> set hive.security.authorization.enabled=true;
> grant select on table my_passwd to user hive_test_user;
> grant select on table my_passwd_vw to user hive_test_user;
> select * from my_passwd_vw;



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17361) Support LOAD DATA for transactional tables

2017-08-30 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148191#comment-16148191
 ] 

Hive QA commented on HIVE-17361:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12884548/HIVE-17361.2.patch

{color:green}SUCCESS:{color} +1 due to 4 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 23 failed/errored test(s), 11024 tests 
executed
*Failed tests:*
{noformat}
TestTxnCommandsBase - did not produce a TEST-*.xml file (likely timed out) 
(batchId=280)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=61)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning]
 (batchId=169)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=234)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=234)
org.apache.hadoop.hive.ql.TestAcidOnTez.testMapJoinOnMR (batchId=215)
org.apache.hadoop.hive.ql.TestAcidOnTez.testMapJoinOnTez (batchId=215)
org.apache.hadoop.hive.ql.TestAcidOnTez.testMergeJoinOnMR (batchId=215)
org.apache.hadoop.hive.ql.TestAcidOnTez.testMergeJoinOnTez (batchId=215)
org.apache.hadoop.hive.ql.TestTxnCommands2.testACIDwithSchemaEvolutionAndCompaction
 (batchId=270)
org.apache.hadoop.hive.ql.TestTxnCommands2.testCompactWithDelete (batchId=270)
org.apache.hadoop.hive.ql.TestTxnCommands2.testInsertOverwrite2 (batchId=270)
org.apache.hadoop.hive.ql.TestTxnCommands2WithSplitUpdateAndVectorization.testACIDwithSchemaEvolutionAndCompaction
 (batchId=279)
org.apache.hadoop.hive.ql.TestTxnCommands2WithSplitUpdateAndVectorization.testCompactWithDelete
 (batchId=279)
org.apache.hadoop.hive.ql.TestTxnCommands2WithSplitUpdateAndVectorization.testInsertOverwrite2
 (batchId=279)
org.apache.hadoop.hive.ql.TestTxnNoBuckets.testNoBuckets (batchId=270)
org.apache.hadoop.hive.ql.lockmgr.TestDbTxnManager2.testLocksInSubquery 
(batchId=282)
org.apache.hadoop.hive.ql.txn.compactor.TestCompactor.majorCompactAfterAbort 
(batchId=216)
org.apache.hadoop.hive.ql.txn.compactor.TestCompactor.majorCompactWhileStreaming
 (batchId=216)
org.apache.hadoop.hive.ql.txn.compactor.TestCompactor.majorCompactWhileStreamingForSplitUpdate
 (batchId=216)
org.apache.hadoop.hive.ql.txn.compactor.TestCompactor.testStatsAfterCompactionPartTbl
 (batchId=216)
org.apache.hadoop.hive.ql.txn.compactor.TestCompactor.testTableProperties 
(batchId=216)
org.apache.hive.hcatalog.streaming.TestStreaming.testNoBuckets (batchId=192)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6607/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6607/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6607/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 23 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12884548 - PreCommit-HIVE-Build

> Support LOAD DATA for transactional tables
> --
>
> Key: HIVE-17361
> URL: https://issues.apache.org/jira/browse/HIVE-17361
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Wei Zheng
>Assignee: Wei Zheng
> Attachments: HIVE-17361.1.patch, HIVE-17361.2.patch
>
>
> LOAD DATA has not been supported since ACID was introduced. We need to fill 
> this gap between ACID tables and regular Hive tables.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17415) Hit error "SemanticException View xxx is corresponding to LIMIT, rather than a SelectOperator." in Hive queries

2017-08-30 Thread Deepak Jaiswal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Jaiswal updated HIVE-17415:
--
Attachment: HIVE-17415.1.patch

[~ashutoshc] can you please review?

> Hit error "SemanticException View xxx is corresponding to LIMIT, rather than 
> a SelectOperator." in Hive queries
> ---
>
> Key: HIVE-17415
> URL: https://issues.apache.org/jira/browse/HIVE-17415
> Project: Hive
>  Issue Type: Bug
>Reporter: Deepak Jaiswal
>Assignee: Deepak Jaiswal
> Attachments: HIVE-17415.1.patch
>
>
> Hit error "SemanticException View xxx is corresponding to LIMIT, rather than 
> a SelectOperator." in Hive queries when a user creates a view with limits:
> set 
> hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider;
> create table my_passwd (
> username string,
> uid int);
> insert into my_passwd values
> ("Dev1", 1),
> ("Dev2", 2),
> ("Dev3", 3),
> ("Dev4", 4),
> ("Dev5", 5),
> ("Dev6", 6);
> create view my_passwd_vw as select * from my_passwd limit 3;
> set hive.security.authorization.enabled=true;
> grant select on table my_passwd to user hive_test_user;
> grant select on table my_passwd_vw to user hive_test_user;
> select * from my_passwd_vw;



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q

2017-08-30 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148179#comment-16148179
 ] 

Sahil Takiar commented on HIVE-16823:
-

[~kellyzly] sounds good. I attached a new patch to HIVE-17405 with the call to 
constant propagate moved to the end of {{SparkCompiler#optimizeOperatorPlan}}. 
Let's see if there are any test failures. I'm not actually sure whether an 
additional call to {{ConstantPropagate}} will improve performance; I would just 
assume that if it changes things in the explain plan, then the execution of that 
plan should be faster, but that's just a theory.

> "ArrayIndexOutOfBoundsException" in 
> spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-16823
> URL: https://issues.apache.org/jira/browse/HIVE-16823
> Project: Hive
>  Issue Type: Bug
>Reporter: Jianguo Tian
>Assignee: liyunzhang_intel
> Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, 
> HIVE-16823.patch
>
>
> spark_vectorized_dynamic_partition_pruning.q
> {code}
> set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> -- parent is reduce tasks
> select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart 
> group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
> {code}
> The exceptions are as follows:
> {code}
> 2017-06-05T09:20:31,468 ERROR [Executor task launch worker-0] 
> spark.SparkReduceRecordHandler: Fatal error: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) 
> ~[scala-library-2.11.8.jar:?]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:85) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [?:1.8.0_112]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [?:1.8.0_112]
>   at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> 

[jira] [Commented] (HIVE-17405) HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT

2017-08-30 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148178#comment-16148178
 ] 

Sahil Takiar commented on HIVE-17405:
-

Attaching an updated patch with the call to constant propagation moved to the 
end of {{SparkCompiler#optimizeOperatorPlan}} - let's see if there are any test 
failures that result from moving it there.

Re-generated {{spark_vectorized_dynamic_partition_pruning.q}} since it hasn't 
been updated in a long time. I did a diff between the new 
{{spark_vectorized_dynamic_partition_pruning.q.out}} and 
{{spark_dynamic_partition_pruning.q.out}} and the only diffs were in the table 
stats, the {{Execution mode: vectorized}}, and HIVE-17414.

> HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT
> -
>
> Key: HIVE-17405
> URL: https://issues.apache.org/jira/browse/HIVE-17405
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-17405.1.patch, HIVE-17405.2.patch, 
> HIVE-17405.3.patch, HIVE-17405.4.patch
>
>
> In {{SparkCompiler#runDynamicPartitionPruning}} we should change {{new 
> ConstantPropagate().transform(parseContext)}} to {{new 
> ConstantPropagate(ConstantPropagateOption.SHORTCUT).transform(parseContext)}}
> Hive-on-Tez does the same thing.
> Running the full constant propagation isn't really necessary; we just want to 
> eliminate any {{and true}} predicates that were introduced by 
> {{SyntheticJoinPredicate}} and {{DynamicPartitionPruningOptimization}}. The 
> {{SyntheticJoinPredicate}} will introduce dummy filter predicates into the 
> operator tree, and {{DynamicPartitionPruningOptimization}} will replace them. 
> The predicates introduced via {{SyntheticJoinPredicate}} are necessary to 
> help {{DynamicPartitionPruningOptimization}} determine if DPP can be used or 
> not.
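
Spelled out as a before/after fragment (both calls appear verbatim in the 
description above; this is a fragment of {{SparkCompiler#runDynamicPartitionPruning}}, 
not a self-contained program):

{code}
// Before: full constant propagation over the whole operator plan
new ConstantPropagate().transform(parseContext);

// After: SHORTCUT mode, as on Tez; only folds the synthetic "and true"
// predicates left behind by SyntheticJoinPredicate / DPP optimization
new ConstantPropagate(ConstantPropagateOption.SHORTCUT).transform(parseContext);
{code}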



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17415) Hit error "SemanticException View xxx is corresponding to LIMIT, rather than a SelectOperator." in Hive queries

2017-08-30 Thread Deepak Jaiswal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Jaiswal updated HIVE-17415:
--
Status: Patch Available  (was: In Progress)

> Hit error "SemanticException View xxx is corresponding to LIMIT, rather than 
> a SelectOperator." in Hive queries
> ---
>
> Key: HIVE-17415
> URL: https://issues.apache.org/jira/browse/HIVE-17415
> Project: Hive
>  Issue Type: Bug
>Reporter: Deepak Jaiswal
>Assignee: Deepak Jaiswal
>
> Hit error "SemanticException View xxx is corresponding to LIMIT, rather than 
> a SelectOperator." in Hive queries when a user creates a view with limits:
> set 
> hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider;
> create table my_passwd (
> username string,
> uid int);
> insert into my_passwd values
> ("Dev1", 1),
> ("Dev2", 2),
> ("Dev3", 3),
> ("Dev4", 4),
> ("Dev5", 5),
> ("Dev6", 6);
> create view my_passwd_vw as select * from my_passwd limit 3;
> set hive.security.authorization.enabled=true;
> grant select on table my_passwd to user hive_test_user;
> grant select on table my_passwd_vw to user hive_test_user;
> select * from my_passwd_vw;



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-30 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148174#comment-16148174
 ] 

Xuefu Zhang commented on HIVE-15104:


The patch looks good to me. My only concern is the reliability of the runtime 
compilation and jar creation. I'd think it's best if we can avoid that.

I'm not 100% sure about the class loading problem we faced. If we define class 
HiveKryoRegistrator in Hive, with relocation, won't Spark's unrelocated kryo be 
unable to find it?

> Hive on Spark generate more shuffle data than hive on mr
> 
>
> Key: HIVE-15104
> URL: https://issues.apache.org/jira/browse/HIVE-15104
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 1.2.1
>Reporter: wangwenli
>Assignee: Rui Li
> Attachments: HIVE-15104.1.patch, HIVE-15104.2.patch, 
> HIVE-15104.3.patch, HIVE-15104.4.patch, HIVE-15104.5.patch, 
> HIVE-15104.5.patch, TPC-H 100G.xlsx
>
>
> The same SQL, running on the Spark and MR engines, will generate different 
> sizes of shuffle data.
> I think it is because Hive on MR serializes only part of the HiveKey, but Hive 
> on Spark, which uses kryo, serializes the full HiveKey object.
> What is your opinion?
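
To make the serialization difference concrete, here is a minimal, hypothetical 
sketch of shipping only a key's byte payload through kryo, using Spark's 
standard {{KryoRegistrator}} hook. {{SimpleKey}} and {{SimpleKeySerializer}} are 
stand-ins invented for illustration; the real HiveKey and the relocation details 
discussed above differ:

{code}
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import org.apache.spark.serializer.KryoRegistrator;

// Stand-in for HiveKey: a byte payload plus derived state that the shuffle
// does not need to carry.
class SimpleKey {
  byte[] bytes = new byte[0];
  int cachedHash; // recomputable on the reducer side; not worth shuffling
}

// Ships only the payload, mimicking what MR's Writable path serializes,
// instead of letting kryo walk and write every field of the object.
class SimpleKeySerializer extends Serializer<SimpleKey> {
  @Override
  public void write(Kryo kryo, Output output, SimpleKey key) {
    output.writeInt(key.bytes.length, true);
    output.writeBytes(key.bytes);
  }

  @Override
  public SimpleKey read(Kryo kryo, Input input, Class<SimpleKey> type) {
    SimpleKey key = new SimpleKey();
    key.bytes = input.readBytes(input.readInt(true));
    return key;
  }
}

// Hooked up by setting spark.kryo.registrator to this class's fully qualified
// name; the relocation of the kryo packages is the open question above.
public class SimpleKeyRegistrator implements KryoRegistrator {
  @Override
  public void registerClasses(Kryo kryo) {
    kryo.register(SimpleKey.class, new SimpleKeySerializer());
  }
}
{code}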



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17405) HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT

2017-08-30 Thread Sahil Takiar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Takiar updated HIVE-17405:

Attachment: HIVE-17405.4.patch

> HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT
> -
>
> Key: HIVE-17405
> URL: https://issues.apache.org/jira/browse/HIVE-17405
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-17405.1.patch, HIVE-17405.2.patch, 
> HIVE-17405.3.patch, HIVE-17405.4.patch
>
>
> In {{SparkCompiler#runDynamicPartitionPruning}} we should change {{new 
> ConstantPropagate().transform(parseContext)}} to {{new 
> ConstantPropagate(ConstantPropagateOption.SHORTCUT).transform(parseContext)}}
> Hive-on-Tez does the same thing.
> Running the full constant propagation isn't really necessary; we just want to 
> eliminate any {{and true}} predicates that were introduced by 
> {{SyntheticJoinPredicate}} and {{DynamicPartitionPruningOptimization}}. The 
> {{SyntheticJoinPredicate}} will introduce dummy filter predicates into the 
> operator tree, and {{DynamicPartitionPruningOptimization}} will replace them. 
> The predicates introduced via {{SyntheticJoinPredicate}} are necessary to 
> help {{DynamicPartitionPruningOptimization}} determine if DPP can be used or 
> not.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Issue Comment Deleted] (HIVE-16923) Hive-on-Spark DPP Improvements

2017-08-30 Thread Sahil Takiar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Takiar updated HIVE-16923:

Comment: was deleted

(was: Will post a design doc soon.

Two of the biggest limitations of the current DPP implementation are that it 
requires an additional Spark job and it requires writing some intermediate data 
to HDFS. We should evaluate the overhead of these limitations and whether it's 
possible to remove them.

Ideally, DPP shouldn't hurt performance for any query. One way to ensure this 
is to build some type of cost-based model that predicts whether or not DPP will 
help perf. For example, a simple cost-based model could simply enable 
DPP for map-joins only. Since map-joins already require two Spark jobs and 
writing intermediate data to HDFS, there shouldn't be significant overhead to 
running DPP with a map-join.)

> Hive-on-Spark DPP Improvements
> --
>
> Key: HIVE-16923
> URL: https://issues.apache.org/jira/browse/HIVE-16923
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>
> Improvements to Hive-on-Spark DPP so that it is production ready.
> Hive-on-Spark DPP was implemented in HIVE-9152. However, it is disabled by 
> default. The goal of this JIRA is to improve the DPP implementation so that 
> it can be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17006) LLAP: Parquet caching

2017-08-30 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-17006:

Attachment: HIVE-17006.03.patch

Addressing CR feedback, fixing some bugs

> LLAP: Parquet caching
> -
>
> Key: HIVE-17006
> URL: https://issues.apache.org/jira/browse/HIVE-17006
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-17006.01.patch, HIVE-17006.02.patch, 
> HIVE-17006.03.patch, HIVE-17006.patch, HIVE-17006.WIP.patch
>
>
> There are multiple options to do Parquet caching in LLAP:
> 1) Full elevator (too intrusive for now).
> 2) Page-based cache like ORC's (requires some changes to Parquet, or 
> copy-pasted code).
> 3) Cache disk data on column chunk level as is.
> Given that Parquet reads at column chunk granularity, (2) is not as useful as 
> for ORC, but it is still a good idea. I messaged the dev list about it but 
> didn't get a response; we may follow up later.
> For now, do (3). 
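
As a rough illustration of option (3), a cache of raw column chunks might be 
keyed by file and chunk offset. This is only a toy sketch with invented names; 
it has none of the eviction, refcounting, or memory accounting the real LLAP 
cache needs:

{code}
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Invented key type: one Parquet column chunk, identified by file and offset.
final class ChunkKey {
  final String filePath;
  final long chunkOffset;

  ChunkKey(String filePath, long chunkOffset) {
    this.filePath = filePath;
    this.chunkOffset = chunkOffset;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof ChunkKey)) {
      return false;
    }
    ChunkKey k = (ChunkKey) o;
    return chunkOffset == k.chunkOffset && filePath.equals(k.filePath);
  }

  @Override
  public int hashCode() {
    return Objects.hash(filePath, chunkOffset);
  }
}

// Option (3) in miniature: raw chunk bytes cached as-is, with no page-level
// slicing; computeIfAbsent gives one disk read per chunk under concurrency.
public class ColumnChunkCache {
  private final ConcurrentHashMap<ChunkKey, byte[]> cache = new ConcurrentHashMap<>();

  public byte[] getOrLoad(ChunkKey key, Supplier<byte[]> loader) {
    return cache.computeIfAbsent(key, k -> loader.get());
  }
}
{code}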



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver

2017-08-30 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel reassigned HIVE-17414:
---

Assignee: liyunzhang_intel

> HoS DPP + Vectorization generates invalid explain plan due to 
> CombineEquivalentWorkResolver
> ---
>
> Key: HIVE-17414
> URL: https://issues.apache.org/jira/browse/HIVE-17414
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: liyunzhang_intel
>
> Similar to HIVE-16948, the following query generates an invalid explain plan 
> when HoS DPP is enabled + vectorization:
> {code:sql}
> select ds from (select distinct(ds) as ds from srcpart union all select 
> distinct(ds) as ds from srcpart) s where s.ds in (select max(srcpart.ds) from 
> srcpart union all select min(srcpart.ds) from srcpart)
> {code}
> Explain Plan:
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>   Edges:
> Reducer 11 <- Map 10 (GROUP, 1)
> Reducer 13 <- Map 12 (GROUP, 1)
>  A masked pattern was here 
>   Vertices:
> Map 10
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: max(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Map 12
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: min(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Reducer 11
> Execution mode: vectorized
> Reduce Operator Tree:
>   Group By Operator
> aggregations: max(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE 
> Column stats: NONE
> Filter Operator
>   predicate: _col0 is not null (type: boolean)
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: _col0 (type: string)
>   outputColumnNames: _col0
>   Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
> Spark Partition Pruning Sink Operator
>   Target column: ds 

[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q

2017-08-30 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148125#comment-16148125
 ] 

liyunzhang_intel commented on HIVE-16823:
-

Let's fix spark_vectorized_dynamic_partition_pruning.q in HIVE-17405, although 
the target of HIVE-17405 is not 
spark_vectorized_dynamic_partition_pruning.q, after HIVE-17383 is resolved.

> "ArrayIndexOutOfBoundsException" in 
> spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-16823
> URL: https://issues.apache.org/jira/browse/HIVE-16823
> Project: Hive
>  Issue Type: Bug
>Reporter: Jianguo Tian
>Assignee: liyunzhang_intel
> Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, 
> HIVE-16823.patch
>
>
> spark_vectorized_dynamic_partition_pruning.q
> {code}
> set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> -- parent is reduce tasks
> select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart 
> group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
> {code}
> The exceptions are as follows:
> {code}
> 2017-06-05T09:20:31,468 ERROR [Executor task launch worker-0] 
> spark.SparkReduceRecordHandler: Fatal error: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) 
> ~[scala-library-2.11.8.jar:?]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:85) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [?:1.8.0_112]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [?:1.8.0_112]
>   at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:107)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeReduceMergePartial.doProcessBatch(VectorGroupByOperator.java:832)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeBase.processBatch(VectorGroupByOperator.java:179)
>  

[jira] [Commented] (HIVE-17414) HoS DPP + Vectorization generates invalid explain plan due to CombineEquivalentWorkResolver

2017-08-30 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148118#comment-16148118
 ] 

Sahil Takiar commented on HIVE-17414:
-

CC: [~lirui], [~kellyzly] - it's probably because the patch in HIVE-16948 runs 
{{collectOp(..., SparkPartitionPruningSinkOperator.class)}} which will only 
collect {{SparkPartitionPruningSinkOperator}}, but not 
{{VectorSparkPartitionPruningSinkOperator}}.
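
The pitfall generalizes: collecting operators by exact class comparison misses 
vectorized subclasses, while an assignability check picks them up. A 
self-contained toy sketch (the class names below are invented stand-ins, not 
Hive's operator classes):

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-ins for the operator hierarchy discussed above.
class Op {}
class PruningSinkOp extends Op {}
class VectorPruningSinkOp extends PruningSinkOp {}

public class CollectDemo {

  // Exact-class matching: silently skips subclasses such as the vectorized sink.
  static List<Op> collectExact(List<Op> ops, Class<?> target) {
    List<Op> found = new ArrayList<>();
    for (Op op : ops) {
      if (op.getClass() == target) {
        found.add(op);
      }
    }
    return found;
  }

  // Assignability matching: also collects VectorPruningSinkOp instances.
  static List<Op> collectAssignable(List<Op> ops, Class<?> target) {
    List<Op> found = new ArrayList<>();
    for (Op op : ops) {
      if (target.isInstance(op)) {
        found.add(op);
      }
    }
    return found;
  }

  public static void main(String[] args) {
    List<Op> ops = Arrays.asList(new PruningSinkOp(), new VectorPruningSinkOp());
    System.out.println(collectExact(ops, PruningSinkOp.class).size());      // prints 1
    System.out.println(collectAssignable(ops, PruningSinkOp.class).size()); // prints 2
  }
}
{code}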

> HoS DPP + Vectorization generates invalid explain plan due to 
> CombineEquivalentWorkResolver
> ---
>
> Key: HIVE-17414
> URL: https://issues.apache.org/jira/browse/HIVE-17414
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>
> Similar to HIVE-16948, the following query generates an invalid explain plan 
> when HoS DPP is enabled + vectorization:
> {code:sql}
> select ds from (select distinct(ds) as ds from srcpart union all select 
> distinct(ds) as ds from srcpart) s where s.ds in (select max(srcpart.ds) from 
> srcpart union all select min(srcpart.ds) from srcpart)
> {code}
> Explain Plan:
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>   Edges:
> Reducer 11 <- Map 10 (GROUP, 1)
> Reducer 13 <- Map 12 (GROUP, 1)
>  A masked pattern was here 
>   Vertices:
> Map 10
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: max(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Map 12
> Map Operator Tree:
> TableScan
>   alias: srcpart
>   Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: ds (type: string)
> outputColumnNames: ds
> Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: min(ds)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> Execution mode: vectorized
> Reducer 11
> Execution mode: vectorized
> Reduce Operator Tree:
>   Group By Operator
> aggregations: max(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE 
> Column stats: NONE
> Filter Operator
>   predicate: _col0 is not null (type: boolean)
>   Statistics: Num rows: 1 Data size: 184 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: _col0 (type: string)
>   outputColumnNames: _col0
>   Statistics: Num rows: 2 Data size: 368 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> keys: _col0 (type: string)
> mode: hash
> outputColumnNames: _col0
>

[jira] [Work started] (HIVE-17415) Hit error "SemanticException View xxx is corresponding to LIMIT, rather than a SelectOperator." in Hive queries

2017-08-30 Thread Deepak Jaiswal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-17415 started by Deepak Jaiswal.
-
> Hit error "SemanticException View xxx is corresponding to LIMIT, rather than 
> a SelectOperator." in Hive queries
> ---
>
> Key: HIVE-17415
> URL: https://issues.apache.org/jira/browse/HIVE-17415
> Project: Hive
>  Issue Type: Bug
>Reporter: Deepak Jaiswal
>Assignee: Deepak Jaiswal
>
> Hit error "SemanticException View xxx is corresponding to LIMIT, rather than 
> a SelectOperator." in Hive queries when a user creates a view with limits:
> set 
> hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider;
> create table my_passwd (
> username string,
> uid int);
> insert into my_passwd values
> ("Dev1", 1),
> ("Dev2", 2),
> ("Dev3", 3),
> ("Dev4", 4),
> ("Dev5", 5),
> ("Dev6", 6);
> create view my_passwd_vw as select * from my_passwd limit 3;
> set hive.security.authorization.enabled=true;
> grant select on table my_passwd to user hive_test_user;
> grant select on table my_passwd_vw to user hive_test_user;
> select * from my_passwd_vw;



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (HIVE-17415) Hit error "SemanticException View xxx is corresponding to LIMIT, rather than a SelectOperator." in Hive queries

2017-08-30 Thread Deepak Jaiswal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Jaiswal reassigned HIVE-17415:
-


> Hit error "SemanticException View xxx is corresponding to LIMIT, rather than 
> a SelectOperator." in Hive queries
> ---
>
> Key: HIVE-17415
> URL: https://issues.apache.org/jira/browse/HIVE-17415
> Project: Hive
>  Issue Type: Bug
>Reporter: Deepak Jaiswal
>Assignee: Deepak Jaiswal
>
> Hit error "SemanticException View xxx is corresponding to LIMIT, rather than 
> a SelectOperator." in Hive queries when a user creates a view with limits:
> set 
> hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider;
> create table my_passwd (
> username string,
> uid int);
> insert into my_passwd values
> ("Dev1", 1),
> ("Dev2", 2),
> ("Dev3", 3),
> ("Dev4", 4),
> ("Dev5", 5),
> ("Dev6", 6);
> create view my_passwd_vw as select * from my_passwd limit 3;
> set hive.security.authorization.enabled=true;
> grant select on table my_passwd to user hive_test_user;
> grant select on table my_passwd_vw to user hive_test_user;
> select * from my_passwd_vw;



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q

2017-08-30 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148105#comment-16148105
 ] 

liyunzhang_intel commented on HIVE-16823:
-

[~stakiar]: {quote} 
Maybe a follow-up JIRA would be to see what happens when we run 
{{ConstantPropagate()}} at the end of SparkCompiler#optimizeOperatorPlan? 
Theoretically, it should improve performance? But sounds like there are some 
bugs we need to address before getting to that stage.
{quote}

Are there any unit test failures if we put the following code at the end of 
SparkCompiler#optimizeOperatorPlan?
{code}
if (procCtx.conf.getBoolVar(ConfVars.HIVEOPTCONSTANTPROPAGATION)) {
  new ConstantPropagate(ConstantPropagateOption.SHORTCUT)
      .transform(procCtx.parseContext);
}
{code}

I think it is better to put it at the end of SparkCompiler#optimizeOperatorPlan 
than in runDynamicPartitionPruning. This is not related to DPP; the bug was just 
found in a DPP unit test. Besides, why should it improve performance? If you 
know, please tell me, thanks!

> "ArrayIndexOutOfBoundsException" in 
> spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-16823
> URL: https://issues.apache.org/jira/browse/HIVE-16823
> Project: Hive
>  Issue Type: Bug
>Reporter: Jianguo Tian
>Assignee: liyunzhang_intel
> Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, 
> HIVE-16823.patch
>
>
> spark_vectorized_dynamic_partition_pruning.q
> {code}
> set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> -- parent is reduce tasks
> select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart 
> group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
> {code}
> The exceptions are as follows:
> {code}
> 2017-06-05T09:20:31,468 ERROR [Executor task launch worker-0] 
> spark.SparkReduceRecordHandler: Fatal error: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) 
> ~[scala-library-2.11.8.jar:?]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:85) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [?:1.8.0_112]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  

[jira] [Updated] (HIVE-17361) Support LOAD DATA for transactional tables

2017-08-30 Thread Wei Zheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Zheng updated HIVE-17361:
-
Attachment: HIVE-17361.2.patch

patch 2 with a different approach

> Support LOAD DATA for transactional tables
> --
>
> Key: HIVE-17361
> URL: https://issues.apache.org/jira/browse/HIVE-17361
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Wei Zheng
>Assignee: Wei Zheng
> Attachments: HIVE-17361.1.patch, HIVE-17361.2.patch
>
>
> LOAD DATA has not been supported since ACID was introduced. We need to fill 
> this gap between ACID tables and regular Hive tables.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-11548) HCatLoader should support predicate pushdown.

2017-08-30 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148073#comment-16148073
 ] 

Mithun Radhakrishnan commented on HIVE-11548:
-

Hey, [~thejas]. Could I bug you for a review on this one? 

> HCatLoader should support predicate pushdown.
> -
>
> Key: HIVE-11548
> URL: https://issues.apache.org/jira/browse/HIVE-11548
> Project: Hive
>  Issue Type: New Feature
>  Components: HCatalog
>Affects Versions: 3.0.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-11548.1.patch, HIVE-11548.2.patch, 
> HIVE-11548.3.patch, HIVE-11548.4.patch, HIVE-11548.5.patch
>
>
> When one uses {{HCatInputFormat}}/{{HCatLoader}} to read from file-formats 
> that support predicate pushdown (such as ORC, with 
> {{hive.optimize.index.filter=true}}), one sees that the predicates aren't 
> actually pushed down into the storage layer.
> The forthcoming patch should allow for filter-pushdown, if any of the 
> partitions being scanned with {{HCatLoader}} support the functionality. The 
> patch should technically allow the same for users of {{HCatInputFormat}}, but 
> I don't currently have a neat interface to build a compound 
> predicate-expression. Will add this separately, if required.
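
For readers unfamiliar with what "pushed down into the storage layer" looks 
like, here is a small sketch building an ORC {{SearchArgument}}. The builder 
method signatures vary across Hive versions, so treat the exact calls below as 
an assumption in the style of the newer storage-api, not as this patch's code:

{code}
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;

public class SargSketch {
  public static void main(String[] args) {
    // Expresses "uid < 100 AND username = 'Dev1'"; a reader that honors the
    // SearchArgument can skip stripes/row groups whose statistics rule it out.
    SearchArgument sarg = SearchArgumentFactory.newBuilder()
        .startAnd()
          .lessThan("uid", PredicateLeaf.Type.LONG, 100L)
          .equals("username", PredicateLeaf.Type.STRING, "Dev1")
        .end()
        .build();
    System.out.println(sarg);
  }
}
{code}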



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17408) replication distcp should only be invoked if number of files AND file size cross configured limits

2017-08-30 Thread anishek (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148023#comment-16148023
 ] 

anishek commented on HIVE-17408:


* 
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2]:
 runs fine on local machine.
* org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[smb_mapjoin_3]: 
runs fine on local machine

Other tests are failing from previous builds.

> replication distcp should only be invoked if number of files AND file size 
> cross configured limits
> --
>
> Key: HIVE-17408
> URL: https://issues.apache.org/jira/browse/HIVE-17408
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Affects Versions: 3.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Trivial
> Fix For: 3.0.0
>
> Attachments: HIVE-17408.1.patch
>
>
> CopyUtils currently invokes distcp when either the 
> "hive.exec.copyfile.maxnumfiles" or the "hive.exec.copyfile.maxsize" condition 
> is breached; it should only be invoked when both are breached, so the check 
> should be AND rather than OR. 
> distcp cannot do a distributed copy of a single large file, hence more reason 
> to make the above change.
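
The proposed condition is easy to state in code. A self-contained sketch with 
invented names ({{shouldUseDistCp}} is not the actual CopyUtils method):

{code}
public class CopyStrategy {

  // Fall back to a plain filesystem copy unless BOTH limits are breached
  // (AND, not OR), per the description above.
  static boolean shouldUseDistCp(long numFiles, long totalBytes,
                                 long maxNumFiles, long maxBytes) {
    return numFiles > maxNumFiles && totalBytes > maxBytes;
  }

  public static void main(String[] args) {
    long maxNumFiles = 32;             // stands in for hive.exec.copyfile.maxnumfiles
    long maxBytes = 32L * 1024 * 1024; // stands in for hive.exec.copyfile.maxsize

    // One huge file: distcp cannot parallelize a single file, so AND avoids it.
    System.out.println(shouldUseDistCp(1, 10L * 1024 * 1024 * 1024,
        maxNumFiles, maxBytes)); // false

    // Many large files: both limits breached, distcp pays off.
    System.out.println(shouldUseDistCp(1000, 10L * 1024 * 1024 * 1024,
        maxNumFiles, maxBytes)); // true
  }
}
{code}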



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17307) Change the metastore to not use the metrics code in hive/common

2017-08-30 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated HIVE-17307:
--
   Resolution: Fixed
Fix Version/s: 3.0.0
   Status: Resolved  (was: Patch Available)

Patch 5 committed.  Thanks Vihang for all the good feedback.

> Change the metastore to not use the metrics code in hive/common
> ---
>
> Key: HIVE-17307
> URL: https://issues.apache.org/jira/browse/HIVE-17307
> Project: Hive
>  Issue Type: Sub-task
>  Components: Metastore
>Reporter: Alan Gates
>Assignee: Alan Gates
> Fix For: 3.0.0
>
> Attachments: HIVE-17307.2.patch, HIVE-17307.3.patch, 
> HIVE-17307.4.patch, HIVE-17307.5.patch, HIVE-17307.patch
>
>
> As we move code into the standalone metastore module, it cannot use the 
> metrics in hive-common.  We could copy the current Metrics interface or we 
> could change the metastore code to directly use codahale metrics.
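
For reference, using codahale (Dropwizard) metrics directly is compact. A 
minimal, self-contained sketch; the metric names are invented, not the 
metastore's actual ones:

{code}
import com.codahale.metrics.Counter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class MetastoreMetricsSketch {
  private static final MetricRegistry REGISTRY = new MetricRegistry();

  public static void main(String[] args) throws InterruptedException {
    // Counters and timers come straight from the registry; no wrapper interface.
    Counter openConnections = REGISTRY.counter("open_connections");
    Timer getTableCalls = REGISTRY.timer(MetricRegistry.name("api", "get_table"));

    openConnections.inc();
    Timer.Context ctx = getTableCalls.time();
    try {
      Thread.sleep(5); // stand-in for the metastore work being measured
    } finally {
      ctx.stop();
      openConnections.dec();
    }
    System.out.println(getTableCalls.getCount()); // prints 1
  }
}
{code}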



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17307) Change the metastore to not use the metrics code in hive/common

2017-08-30 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated HIVE-17307:
--
Attachment: HIVE-17307.5.patch

Final version of the patch.  This differs from patch 4 only in that I removed 
the unused initialization of the ThreadPool in Metrics.java.

> Change the metastore to not use the metrics code in hive/common
> ---
>
> Key: HIVE-17307
> URL: https://issues.apache.org/jira/browse/HIVE-17307
> Project: Hive
>  Issue Type: Sub-task
>  Components: Metastore
>Reporter: Alan Gates
>Assignee: Alan Gates
> Attachments: HIVE-17307.2.patch, HIVE-17307.3.patch, 
> HIVE-17307.4.patch, HIVE-17307.5.patch, HIVE-17307.patch
>
>
> As we move code into the standalone metastore module, it cannot use the 
> metrics in hive-common.  We could copy the current Metrics interface or we 
> could change the metastore code to directly use codahale metrics.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17307) Change the metastore to not use the metrics code in hive/common

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147949#comment-16147949
 ] 

ASF GitHub Bot commented on HIVE-17307:
---

Github user asfgit closed the pull request at:

https://github.com/apache/hive/pull/235


> Change the metastore to not use the metrics code in hive/common
> ---
>
> Key: HIVE-17307
> URL: https://issues.apache.org/jira/browse/HIVE-17307
> Project: Hive
>  Issue Type: Sub-task
>  Components: Metastore
>Reporter: Alan Gates
>Assignee: Alan Gates
> Attachments: HIVE-17307.2.patch, HIVE-17307.3.patch, 
> HIVE-17307.4.patch, HIVE-17307.patch
>
>
> As we move code into the standalone metastore module, it cannot use the 
> metrics in hive-common.  We could copy the current Metrics interface or we 
> could change the metastore code to directly use codahale metrics.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16886) HMS log notifications may have duplicated event IDs if multiple HMS are running concurrently

2017-08-30 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147865#comment-16147865
 ] 

Hive QA commented on HIVE-16886:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12884510/HIVE-16886.7.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 7 failed/errored test(s), 11014 tests 
executed
*Failed tests:*
{noformat}
TestTxnCommandsBase - did not produce a TEST-*.xml file (likely timed out) 
(batchId=280)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_sortmerge_join_5] 
(batchId=84)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=61)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[load_dyn_part5]
 (batchId=154)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning]
 (batchId=169)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=234)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=234)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6606/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6606/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6606/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 7 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12884510 - PreCommit-HIVE-Build

> HMS log notifications may have duplicated event IDs if multiple HMS are 
> running concurrently
> 
>
> Key: HIVE-16886
> URL: https://issues.apache.org/jira/browse/HIVE-16886
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, Metastore
>Reporter: Sergio Peña
>Assignee: anishek
> Attachments: datastore-identity-holes.diff, HIVE-16886.1.patch, 
> HIVE-16886.2.patch, HIVE-16886.3.patch, HIVE-16886.4.patch, 
> HIVE-16886.5.patch, HIVE-16886.6.patch, HIVE-16886.7.patch
>
>
> When running multiple Hive Metastore servers and DB notifications are 
> enabled, I could see that notifications can be persisted with a duplicated 
> event ID. 
> This does not happen when running multiple threads in a single HMS node due 
> to the locking acquired on the DbNotificationsLog class, but multiple HMS 
> could cause conflicts.
> The issue is in the ObjectStore#addNotificationEvent() method. The event ID 
> fetched from the datastore is used for the new notification, incremented in 
> the server itself, then persisted or updated back to the datastore. If 2 
> servers read the same ID, then these 2 servers write a new notification with 
> the same ID.
> The event ID is neither unique nor a primary key.
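
One conventional way to close such a read-modify-write race is to allocate the 
ID under a database row lock. Purely a hypothetical sketch; the table and 
column names below are invented, not the metastore schema:

{code}
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class EventIdAllocator {

  // Reads and bumps the next event ID inside one transaction, holding a row
  // lock (SELECT ... FOR UPDATE) so two HMS instances cannot observe the same
  // value. EVENT_ID_SEQUENCE / NEXT_EVENT_ID are invented names.
  static long nextEventId(Connection conn) throws SQLException {
    conn.setAutoCommit(false);
    try (PreparedStatement lock = conn.prepareStatement(
        "SELECT NEXT_EVENT_ID FROM EVENT_ID_SEQUENCE FOR UPDATE");
         ResultSet rs = lock.executeQuery()) {
      rs.next();
      long id = rs.getLong(1);
      try (PreparedStatement bump = conn.prepareStatement(
          "UPDATE EVENT_ID_SEQUENCE SET NEXT_EVENT_ID = ?")) {
        bump.setLong(1, id + 1);
        bump.executeUpdate();
      }
      conn.commit();
      return id;
    }
  }
}
{code}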
> Here's a test case using the TestObjectStore class that confirms this issue:
> {noformat}
> @Test
>   public void testConcurrentAddNotifications() throws ExecutionException, 
> InterruptedException {
> final int NUM_THREADS = 2;
> CountDownLatch countIn = new CountDownLatch(NUM_THREADS);
> CountDownLatch countOut = new CountDownLatch(1);
> HiveConf conf = new HiveConf();
> conf.setVar(HiveConf.ConfVars.METASTORE_EXPRESSION_PROXY_CLASS, 
> MockPartitionExpressionProxy.class.getName());
> ExecutorService executorService = 
> Executors.newFixedThreadPool(NUM_THREADS);
> FutureTask<Void> tasks[] = new FutureTask[NUM_THREADS];
> for (int i = 0; i < NUM_THREADS; i++) {
>   final int n = i;
>   tasks[i] = new FutureTask<Void>(new Callable<Void>() {
> @Override
> public Void call() throws Exception {
>   ObjectStore store = new ObjectStore();
>   store.setConf(conf);
>   NotificationEvent dbEvent =
>   new NotificationEvent(0, 0, 
> EventMessage.EventType.CREATE_DATABASE.toString(), "CREATE DATABASE DB" + n);
>   System.out.println("ADDING NOTIFICATION");
>   countIn.countDown();
>   countOut.await();
>   store.addNotificationEvent(dbEvent);
>   System.out.println("FINISH NOTIFICATION");
>   return null;
> }
>   });
>   executorService.execute(tasks[i]);
> }
> countIn.await();
> countOut.countDown();
> for (int i = 0; i < NUM_THREADS; ++i) {
>   tasks[i].get();
> }
> NotificationEventResponse eventResponse = 
> objectStore.getNextNotification(new NotificationEventRequest());
> Assert.assertEquals(2, eventResponse.getEventsSize());
> 

[jira] [Updated] (HIVE-17411) LLAP IO may incorrectly release a refcount in some rare cases

2017-08-30 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-17411:

   Resolution: Fixed
Fix Version/s: 2.4.0
   3.0.0
   2.3.0
   Status: Resolved  (was: Patch Available)

Pushed to master, branch-2 and branch-2.3. Thanks for the reviews!

> LLAP IO may incorrectly release a refcount in some rare cases
> -
>
> Key: HIVE-17411
> URL: https://issues.apache.org/jira/browse/HIVE-17411
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Fix For: 2.3.0, 3.0.0, 2.4.0
>
> Attachments: HIVE-17411.patch
>
>
> In a large stream whose buffers are not reused, and that is separated into 
> many CB (e.g. due to a small ORC compression buffer size), it may happen that 
> some, but not all, buffers that are read together as a unit are evicted from 
> cache.
> If CacheBuffer follows BufferChunk in the buffer list when a stream like this 
> is read, the latter will be converted to ProcCacheChunk;  it is possible for 
> early refcount release logic from the former to release the refcount (for a 
> dictionary stream, the initial refCount is always released early), and then 
> backtrack to the latter to see if we can unlock more buffers. It would then 
> try to decref an uninitialized MemoryBuffer in ProcCacheChunk because 
> ProcCacheChunk looks like a CacheChunk. PCC initial refcounts are released 
> separately after the data is uncompressed.
> I'm assuming this would almost never happen with non-stripe-level streams 
> because one would need a large RG to span 2+ CBs, no overlap with 
> next/previous RGs in 2+ buffers for the early release to kick in, and an 
> unfortunate eviction order. However it's possible with large-ish dictionaries.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16823) "ArrayIndexOutOfBoundsException" in spark_vectorized_dynamic_partition_pruning.q

2017-08-30 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147825#comment-16147825
 ] 

Sahil Takiar commented on HIVE-16823:
-

So HIVE-15269 doesn't actually remove the call to constant propagate. It just 
moves it to the end of {{TezCompiler#optimizeOperatorPlan}}. It seems the 
reason Tez has {{keys: KEY._col0 (type: string)}} in the explain plan, while 
Spark has {{keys: '2008-04-08' (type: string)}} is because Tez calls 
{{ConstantPropagate(ConstantPropagateOption.SHORTCUT)}} while Spark has been 
calling {{ConstantPropagate()}}. I changed this in HIVE-17405 to fix some other 
issues with the call to {{ConstantPropagate}}, and it looks like that made 
{{spark_vectorized_dynamic_partition_pruning.q}} work again.

So looks like HIVE-17405 should fix 
spark_vectorized_dynamic_partition_pruning.q, although sounds like HIVE-17383 
is still a real bug.

Maybe a follow-up JIRA would be to see what happens when we run 
{{ConstantPropagate()}} at the end of {{SparkCompiler#optimizeOperatorPlan}}? 
Theoretically, it should improve performance? But sounds like there are some 
bugs we need to address before getting to that stage.
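
For intuition about what the SHORTCUT mode buys here: per HIVE-17405 the goal 
is just to drop the {{and true}} predicates DPP leaves behind. A toy, 
self-contained model of that folding step, with strings standing in for real 
predicate expressions:

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of the SHORTCUT-style folding: drop literal "true" conjuncts that
// SyntheticJoinPredicate/DPP leave behind, without full constant propagation.
public class ShortcutFoldDemo {

  static List<String> foldAndTrue(List<String> conjuncts) {
    List<String> kept = new ArrayList<>();
    for (String c : conjuncts) {
      if (!"true".equals(c)) {
        kept.add(c);
      }
    }
    // An all-true conjunction collapses to the single literal "true".
    return kept.isEmpty() ? Collections.singletonList("true") : kept;
  }

  public static void main(String[] args) {
    System.out.println(foldAndTrue(Arrays.asList("ds = '2008-04-08'", "true")));
    // prints [ds = '2008-04-08']
  }
}
{code}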

> "ArrayIndexOutOfBoundsException" in 
> spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-16823
> URL: https://issues.apache.org/jira/browse/HIVE-16823
> Project: Hive
>  Issue Type: Bug
>Reporter: Jianguo Tian
>Assignee: liyunzhang_intel
> Attachments: explain.spark, explain.tez, HIVE-16823.1.patch, 
> HIVE-16823.patch
>
>
> spark_vectorized_dynamic_partition_pruning.q
> {code}
> set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> -- parent is reduce tasks
> select count(*) from srcpart join (select ds as ds, ds as `date` from srcpart 
> group by ds) s on (srcpart.ds = s.ds) where s.`date` = '2008-04-08';
> {code}
> The exceptions are as follows:
> {code}
> 2017-06-05T09:20:31,468 ERROR [Executor task launch worker-0] 
> spark.SparkReduceRecordHandler: Fatal error: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
> org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing 
> vector batch (tag=0) Column vector types: 0:BYTES, 1:BYTES
> ["2008-04-08", "2008-04-08"]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:413)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:301)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) 
> ~[scala-library-2.11.8.jar:?]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>  ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:85) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
>   at 
> 

[jira] [Commented] (HIVE-16895) Multi-threaded execution of bootstrap dump of partitions

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147807#comment-16147807
 ] 

ASF GitHub Bot commented on HIVE-16895:
---

Github user anishek closed the pull request at:

https://github.com/apache/hive/pull/217


>  Multi-threaded execution of bootstrap dump of partitions
> -
>
> Key: HIVE-16895
> URL: https://issues.apache.org/jira/browse/HIVE-16895
> Project: Hive
>  Issue Type: Sub-task
>  Components: HiveServer2
>Affects Versions: 3.0.0
>Reporter: anishek
>Assignee: anishek
>  Labels: TODOC3.0
> Fix For: 3.0.0
>
> Attachments: HIVE-16895.1.patch, HIVE-16895.2.patch
>
>
> To allow faster execution of the bootstrap dump phase we dump multiple 
> partitions from the same table simultaneously (see the sketch below). 
> Even though dumping functions is not going to be a blocker, moving to 
> similar execution modes for all metastore objects will make the code more 
> coherent. 
> Bootstrap dump at db level does:
> * bootstrap of all tables
> ** bootstrap of all partitions in a table (scope of the current jira) 
> * bootstrap of all functions 
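
A minimal sketch of the threading pattern: a fixed pool fanning out one task 
per partition and joining on the futures. {{dumpPartition}} is a made-up 
stand-in for the real per-partition dump logic:

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPartitionDump {

  // Invented stand-in for the real per-partition bootstrap dump work.
  static void dumpPartition(String table, String partition) {
    System.out.println(Thread.currentThread().getName()
        + " dumped " + table + "/" + partition);
  }

  public static void main(String[] args) throws Exception {
    List<String> partitions = Arrays.asList(
        "ds=2008-04-08/hr=11", "ds=2008-04-08/hr=12", "ds=2008-04-09/hr=11");
    ExecutorService pool = Executors.newFixedThreadPool(4);
    List<Future<Void>> futures = new ArrayList<>();
    for (String p : partitions) {
      futures.add(pool.submit((Callable<Void>) () -> {
        dumpPartition("srcpart", p);
        return null;
      }));
    }
    for (Future<Void> f : futures) {
      f.get(); // propagate any per-partition failure before declaring success
    }
    pool.shutdown();
  }
}
{code}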



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17144) export of temporary tables not working and it seems to be using distcp rather than filesystem copy

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147806#comment-16147806
 ] 

ASF GitHub Bot commented on HIVE-17144:
---

Github user anishek closed the pull request at:

https://github.com/apache/hive/pull/215


> export of temporary tables not working and it seems to be using distcp rather 
> than filesystem copy
> --
>
> Key: HIVE-17144
> URL: https://issues.apache.org/jira/browse/HIVE-17144
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, HiveServer2
>Affects Versions: 3.0.0
>Reporter: anishek
>Assignee: anishek
> Fix For: 3.0.0
>
> Attachments: HIVE-17144.1.patch
>
>
> {code:sql}
> create temporary table t1 (i int);
> insert into t1 values (3);
> export table t1 to 'hdfs://somelocation';
> {code}
> The above fails. Additionally, it should use a filesystem copy rather than 
> distcp to do the job.
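A minimal sketch of the plain filesystem copy the description asks for; the
class and helper are illustrative, while FileUtil.copy is the standard Hadoop
API for a copy that does not launch a DistCp job:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class PlainFsCopy {
  // For the small data volumes of a temporary table, a direct FileSystem
  // copy is far cheaper than spinning up a DistCp MapReduce job.
  static void copy(Configuration conf, Path src, Path dst) throws Exception {
    FileSystem srcFs = src.getFileSystem(conf);
    FileSystem dstFs = dst.getFileSystem(conf);
    FileUtil.copy(srcFs, src, dstFs, dst, /* deleteSource */ false, conf);
  }
}
{code}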



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16892) Move creation of _files from ReplCopyTask to analysis phase for bootstrap replication

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147804#comment-16147804
 ] 

ASF GitHub Bot commented on HIVE-16892:
---

Github user anishek closed the pull request at:

https://github.com/apache/hive/pull/196


> Move creation of _files from ReplCopyTask to analysis phase for bootstrap 
> replication 
> -
>
> Key: HIVE-16892
> URL: https://issues.apache.org/jira/browse/HIVE-16892
> Project: Hive
>  Issue Type: Sub-task
>  Components: HiveServer2
>Affects Versions: 3.0.0
>Reporter: anishek
>Assignee: anishek
> Fix For: 3.0.0
>
> Attachments: HIVE-16892.1.patch, HIVE-16892.2.patch, 
> HIVE-16892.3.patch, HIVE-16892.4.patch, HIVE-16892.5.patch, HIVE-16892.6.patch
>
>
> During replication bootstrap we create the _files via ReplCopyTask for 
> partitions and tables; this can be done inline as part of the analysis phase 
> rather than creating the ReplCopyTask (see the sketch below).
> This is done to prevent creation of a huge number of these tasks in memory 
> before handing them to the execution engine. 
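A rough sketch of the inline approach; the _files name comes from the
description, while the class, paths, and listing format here are illustrative:

{code:java}
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InlineFilesWriter {
  // Enumerate the data files during analysis and write the listing
  // immediately, instead of materializing one ReplCopyTask per object.
  static void writeFilesList(Configuration conf, Path dataDir, Path exportDir)
      throws Exception {
    FileSystem fs = dataDir.getFileSystem(conf);
    try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
        fs.create(new Path(exportDir, "_files"))))) {
      for (FileStatus stat : fs.listStatus(dataDir)) {
        out.write(stat.getPath().toUri().toString());
        out.newLine();
      }
    }
  }
}
{code}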



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16268) enable incremental repl dump to handle functions metadata

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147797#comment-16147797
 ] 

ASF GitHub Bot commented on HIVE-16268:
---

Github user anishek closed the pull request at:

https://github.com/apache/hive/pull/178


> enable incremental repl dump to handle functions metadata
> -
>
> Key: HIVE-16268
> URL: https://issues.apache.org/jira/browse/HIVE-16268
> Project: Hive
>  Issue Type: Sub-task
>  Components: HiveServer2, repl
>Affects Versions: 2.2.0
>Reporter: anishek
>Assignee: anishek
> Fix For: 3.0.0
>
> Attachments: HIVE-16268.1.patch, HIVE-16268.2.patch, 
> HIVE-16268.3.patch, HIVE-16268.4.patch, HIVE-16268.5.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> This is created separately to ensure that any other replication-related 
> metadata coming from the replication spec is included as part of the 
> function dump output, if needed, when doing an incremental update.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16267) Enable bootstrap function metadata to be loaded in repl load

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147798#comment-16147798
 ] 

ASF GitHub Bot commented on HIVE-16267:
---

Github user anishek closed the pull request at:

https://github.com/apache/hive/pull/179


> Enable bootstrap function metadata to be loaded in repl load
> 
>
> Key: HIVE-16267
> URL: https://issues.apache.org/jira/browse/HIVE-16267
> Project: Hive
>  Issue Type: Sub-task
>  Components: HiveServer2, repl
>Reporter: anishek
>Assignee: anishek
> Fix For: 3.0.0
>
> Attachments: HIVE-16267.1.patch, HIVE-16267.2.patch, 
> HIVE-16267.3.patch, HIVE-16267.4.patch, HIVE-16267.5.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16591) DR for function Binaries on HDFS

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147800#comment-16147800
 ] 

ASF GitHub Bot commented on HIVE-16591:
---

Github user anishek closed the pull request at:

https://github.com/apache/hive/pull/189


> DR for function Binaries on HDFS 
> -
>
> Key: HIVE-16591
> URL: https://issues.apache.org/jira/browse/HIVE-16591
> Project: Hive
>  Issue Type: Sub-task
>  Components: HiveServer2
>Affects Versions: 3.0.0
>Reporter: anishek
>Assignee: anishek
> Fix For: 3.0.0
>
> Attachments: HIVE-16591.1.patch, HIVE-16591.2.patch, 
> HIVE-16591.3.patch
>
>
> # We have to make sure that during incremental dump we don't allow functions 
> to be copied if they have local filesystem "file://" resources. How much 
> system-side work this needs is open: we are going to explicitly provide a 
> caveat for replicating functions wherein only functions created with the 
> "using" clause will be replicated, and since the "using" clause prohibits 
> creating functions with local "file://" resources, additional checks during 
> repl dump might not be required. 
> # We have to make sure that during the bootstrap / incremental dump we append 
> the namenode host + port if functions are created without the fully qualified 
> URI on HDFS (see the sketch after this list); it is unclear how this would 
> play out for the S3 or WASB filesystems.
> # We have to copy the binaries of a function's resource list on CREATE / DROP 
> FUNCTION. The change management file system has to keep a copy of the binary 
> when DROP FUNCTION is called, to allow updating the binary definition of 
> existing functions along with DR. An example list of steps is given in the 
> doc (ReplicateFunctions.pdf) attached to the parent issue.
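For point 2, a minimal sketch of qualifying an unqualified resource URI
against the default filesystem; the class and helper are assumptions, not
code from the patch:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class QualifyResourceUri {
  // If a function resource was registered without a scheme/authority,
  // qualify it against the default FileSystem so the dump carries the
  // namenode host + port.
  static Path qualify(Configuration conf, String resourceUri) throws Exception {
    Path p = new Path(resourceUri);
    FileSystem fs = p.getFileSystem(conf); // falls back to fs.defaultFS
    return fs.makeQualified(p);            // e.g. hdfs://nn:8020/user/udfs/x.jar
  }
}
{code}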



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16269) enable incremental function dump to be loaded via repl load

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147799#comment-16147799
 ] 

ASF GitHub Bot commented on HIVE-16269:
---

Github user anishek closed the pull request at:

https://github.com/apache/hive/pull/182


> enable incremental function dump to be loaded via repl load 
> 
>
> Key: HIVE-16269
> URL: https://issues.apache.org/jira/browse/HIVE-16269
> Project: Hive
>  Issue Type: Sub-task
>  Components: HiveServer2
>Affects Versions: 2.2.0
>Reporter: anishek
>Assignee: anishek
> Fix For: 3.0.0
>
> Attachments: HIVE-16269.1.patch, HIVE-16269.2.patch, 
> HIVE-16269.3.patch
>
>
> Depends on whether there are additional spec elements we put out as part of HIVE-16268.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16893) move replication dump related work in semantic analysis phase to execution phase using a task

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147805#comment-16147805
 ] 

ASF GitHub Bot commented on HIVE-16893:
---

Github user anishek closed the pull request at:

https://github.com/apache/hive/pull/201


> move replication dump related work in semantic analysis phase to execution 
> phase using a task
> -
>
> Key: HIVE-16893
> URL: https://issues.apache.org/jira/browse/HIVE-16893
> Project: Hive
>  Issue Type: Sub-task
>  Components: HiveServer2
>Affects Versions: 3.0.0
>Reporter: anishek
>Assignee: anishek
> Fix For: 3.0.0
>
> Attachments: HIVE-16893.2.patch, HIVE-16893.3.patch, 
> HIVE-16893.4.patch
>
>
> Since we run in to the possibility of creating a large number tasks during 
> replication bootstrap dump
> * we may not be able to hold all of them in memory for really large 
> databases, which might not hold true once we complete HIVE-16892
> * Also a compile time lock is taken such that only one query is run in this 
> phase which in replication bootstrap scenario is going to be a very long 
> running task and hence moving it to execution phase will limit the lock 
> period in compile phase.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16866) existing available UDF is used in TestReplicationScenariosAcrossInstances#testDropFunctionIncrementalReplication

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147801#comment-16147801
 ] 

ASF GitHub Bot commented on HIVE-16866:
---

Github user anishek closed the pull request at:

https://github.com/apache/hive/pull/193


> existing available UDF is used in 
> TestReplicationScenariosAcrossInstances#testDropFunctionIncrementalReplication
>  
> -
>
> Key: HIVE-16866
> URL: https://issues.apache.org/jira/browse/HIVE-16866
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Affects Versions: 3.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Trivial
> Fix For: 3.0.0
>
> Attachments: HIVE-16866.1.patch
>
>
> use ivy://io.github.myui:hivemall:0.4.0-2 instead of 
> ivy://com.yahoo.datasketches:sketches-hive:0.8.2 
> in testDropFunctionIncrementalReplication



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16115) Stop printing progress info from operation logs with beeline progress bar

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147795#comment-16147795
 ] 

ASF GitHub Bot commented on HIVE-16115:
---

Github user anishek closed the pull request at:

https://github.com/apache/hive/pull/155


> Stop printing progress info from operation logs with beeline progress bar
> -
>
> Key: HIVE-16115
> URL: https://issues.apache.org/jira/browse/HIVE-16115
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Affects Versions: 2.2.0
>Reporter: anishek
>Assignee: anishek
>Priority: Minor
> Fix For: 2.3.0
>
> Attachments: HIVE-16115.4.patch, HIVE-16115.5.patch
>
>
> When the progress bar is enabled, we should not print the progress 
> information via the operation logs. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16219) metastore notification_log contains serialized message with non functional fields

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147796#comment-16147796
 ] 

ASF GitHub Bot commented on HIVE-16219:
---

Github user anishek closed the pull request at:

https://github.com/apache/hive/pull/159


> metastore notification_log contains serialized message with  non functional 
> fields
> --
>
> Key: HIVE-16219
> URL: https://issues.apache.org/jira/browse/HIVE-16219
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 2.2.0
>Reporter: anishek
>Assignee: anishek
> Fix For: 2.3.0, 3.0.0
>
> Attachments: HIVE-16219.3.patch
>
>
> The event notification logs stored in the Hive metastore have JSON-serialized 
> messages stored in the NOTIFICATION_LOG table; these messages also store the 
> serialized Thrift API objects in them. When doing a repl dump, however, we 
> serialize the metadata for the replication event + the event Message + 
> additional helper-method getters representing the Thrift objects.
> We should only serialize the metadata for the replication event + the event 
> Message. For example, for CREATE TABLE:
> {code}
> {
>   "eventType": "CREATE_TABLE",
>   "server": "",
>   "servicePrincipal": "",
>   "db": "default",
>   "table": "a",
>   "tableObjJson": 
> "{\"1\":{\"str\":\"a\"},\"2\":{\"str\":\"default\"},\"3\":{\"str\":\"anagarwal\"},\"4\":{\"i32\":1489552350},\"5\":{\"i32\":0},\"6\":{\"i32\":0},\"7\":{\"rec\":{\"1\":{\"lst\":[\"rec\",1,{\"1\":{\"str\":\"name\"},\"2\":{\"str\":\"string\"}}]},\"2\":{\"str\":\"file:/tmp/warehouse/a\"},\"3\":{\"str\":\"org.apache.hadoop.mapred.TextInputFormat\"},\"4\":{\"str\":\"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat\"},\"5\":{\"tf\":0},\"6\":{\"i32\":-1},\"7\":{\"rec\":{\"2\":{\"str\":\"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe\"},\"3\":{\"map\":[\"str\",\"str\",2,{\"field.delim\":\"\\n\",\"serialization.format\":\"\\n\"}]}}},\"8\":{\"lst\":[\"str\",0]},\"9\":{\"lst\":[\"rec\",0]},\"10\":{\"map\":[\"str\",\"str\",0,{}]},\"11\":{\"rec\":{\"1\":{\"lst\":[\"str\",0]},\"2\":{\"lst\":[\"lst\",0]},\"3\":{\"map\":[\"lst\",\"str\",0,{}]}}},\"12\":{\"tf\":0}}},\"8\":{\"lst\":[\"rec\",0]},\"9\":{\"map\":[\"str\",\"str\",7,{\"totalSize\":\"0\",\"EXTERNAL\":\"TRUE\",\"numRows\":\"0\",\"rawDataSize\":\"0\",\"COLUMN_STATS_ACCURATE\":\"{\\\"BASIC_STATS\\\":\\\"true\\\"}\",\"numFiles\":\"0\",\"transient_lastDdlTime\":\"1489552350\"}]},\"12\":{\"str\":\"EXTERNAL_TABLE\"},\"13\":{\"rec\":{\"1\":{\"map\":[\"str\",\"lst\",1,{\"anagarwal\":[\"rec\",4,{\"1\":{\"str\":\"INSERT\"},\"2\":{\"i32\":-1},\"3\":{\"str\":\"anagarwal\"},\"4\":{\"i32\":1},\"5\":{\"tf\":1}},{\"1\":{\"str\":\"SELECT\"},\"2\":{\"i32\":-1},\"3\":{\"str\":\"anagarwal\"},\"4\":{\"i32\":1},\"5\":{\"tf\":1}},{\"1\":{\"str\":\"UPDATE\"},\"2\":{\"i32\":-1},\"3\":{\"str\":\"anagarwal\"},\"4\":{\"i32\":1},\"5\":{\"tf\":1}},{\"1\":{\"str\":\"DELETE\"},\"2\":{\"i32\":-1},\"3\":{\"str\":\"anagarwal\"},\"4\":{\"i32\":1},\"5\":{\"tf\":1}}]}]}}},\"14\":{\"tf\":0}}",
>   "timestamp": 1489552350,
>   "files": [],
>   "tableObj": {
> "tableName": "a",
> "dbName": "default",
> "owner": "anagarwal",
> "createTime": 1489552350,
> "lastAccessTime": 0,
> "retention": 0,
> "sd": {
>   "cols": [
> {
>   "name": "name",
>   "type": "string",
>   "comment": null,
>   "setName": true,
>   "setType": true,
>   "setComment": false
> }
>   ],
>   "location": "file:/tmp/warehouse/a",
>   "inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
>   "outputFormat": 
> "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
>   "compressed": false,
>   "numBuckets": -1,
>   "serdeInfo": {
> "name": null,
> "serializationLib": 
> "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
> "parameters": {
>   "serialization.format": "\n",
>   "field.delim": "\n"
> },
> "setName": false,
> "parametersSize": 2,
> "setParameters": true,
> "setSerializationLib": true
>   },
>   "bucketCols": [],
>   "sortCols": [],
>   "parameters": {},
>   "skewedInfo": {
> "skewedColNames": [],
> "skewedColValues": [],
> "skewedColValueLocationMaps": {},
> "setSkewedColNames": true,
> "setSkewedColValues": true,
> "setSkewedColValueLocationMaps": true,
> "skewedColNamesSize": 0,
> "skewedColNamesIterator": [],
> "skewedColValuesSize": 0,
> "skewedColValuesIterator": [],
> "skewedColValueLocationMapsSize": 0
>   },
>   "storedAsSubDirectories": false,
>   "setSkewedInfo": true,
>   

[jira] [Commented] (HIVE-16045) Print progress bar along with operation log

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147794#comment-16147794
 ] 

ASF GitHub Bot commented on HIVE-16045:
---

Github user anishek closed the pull request at:

https://github.com/apache/hive/pull/153


> Print progress bar along with operation log
> ---
>
> Key: HIVE-16045
> URL: https://issues.apache.org/jira/browse/HIVE-16045
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Affects Versions: 2.2.0
>Reporter: anishek
>Assignee: anishek
> Fix For: 2.3.0
>
> Attachments: HIVE-16045.5.patch, HIVE-16045.6.patch
>
>
> Allow printing of the operation logs and the progress bar such that the 
> operation logs output data first -> are blocked -> the progress bar starts -> 
> the progress bar finishes -> the operation logs are unblocked and finish -> 
> the query results are printed. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-15713) add ldap authentication related configuration to restricted list

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147790#comment-16147790
 ] 

ASF GitHub Bot commented on HIVE-15713:
---

Github user anishek closed the pull request at:

https://github.com/apache/hive/pull/138


> add ldap authentication related configuration to restricted list
> 
>
> Key: HIVE-15713
> URL: https://issues.apache.org/jira/browse/HIVE-15713
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Affects Versions: 2.1.1
>Reporter: anishek
>Assignee: anishek
>Priority: Minor
>  Labels: TODOC2.2
> Fix For: 2.3.0
>
> Attachments: HIVE-15713.1.patch, HIVE-15713.1.patch
>
>
> The various LDAP configuration parameters below should be added to the 
> restricted list of configuration parameters so that users can't change them 
> per session (see the sketch after this list): 
> hive.server2.authentication.ldap.baseDN
> hive.server2.authentication.ldap.url
> hive.server2.authentication.ldap.Domain
> hive.server2.authentication.ldap.groupDNPattern
> hive.server2.authentication.ldap.groupFilter
> hive.server2.authentication.ldap.userDNPattern
> hive.server2.authentication.ldap.userFilter
> hive.server2.authentication.ldap.groupMembershipKey
> hive.server2.authentication.ldap.userMembershipKey
> hive.server2.authentication.ldap.groupClassKey
> hive.server2.authentication.ldap.customLDAPQuery
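For reference, such restriction is driven by the hive.conf.restricted.list
property; a hedged illustration follows, where "<existing entries>" is a
placeholder and the exact default list varies by release:

{noformat}
hive.conf.restricted.list = <existing entries>,hive.server2.authentication.ldap.baseDN,hive.server2.authentication.ldap.url,hive.server2.authentication.ldap.userFilter
{noformat}

With such a value in place, a per-session "set
hive.server2.authentication.ldap.url=..." is rejected by HiveConf.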



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-15906) thrift code regeneration to include new protocol version

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147791#comment-16147791
 ] 

ASF GitHub Bot commented on HIVE-15906:
---

Github user anishek closed the pull request at:

https://github.com/apache/hive/pull/146


> thrift code regeneration to include new protocol version
> 
>
> Key: HIVE-15906
> URL: https://issues.apache.org/jira/browse/HIVE-15906
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 2.2.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
> Fix For: 2.2.0
>
> Attachments: HIVE-15906.1.patch, HIVE-15906.2.patch
>
>
> HIVE-15473  changed the protocol version in thrift file. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-15550) fix arglist logging in schematool

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147787#comment-16147787
 ] 

ASF GitHub Bot commented on HIVE-15550:
---

Github user anishek closed the pull request at:

https://github.com/apache/hive/pull/130


> fix arglist logging in schematool
> -
>
> Key: HIVE-15550
> URL: https://issues.apache.org/jira/browse/HIVE-15550
> Project: Hive
>  Issue Type: Improvement
>  Components: Beeline
>Affects Versions: 2.1.1
>Reporter: anishek
>Assignee: anishek
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: HIVE-15550.1.patch, HIVE-15550.1.patch
>
>
> In DEBUG mode schemaTool prints the password to the log file.
> This is also seen if the user includes the --verbose option.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-15712) new HiveConf in SQLOperation.getSerDe() impacts CPU on hiveserver2

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147789#comment-16147789
 ] 

ASF GitHub Bot commented on HIVE-15712:
---

Github user anishek closed the pull request at:

https://github.com/apache/hive/pull/137


> new HiveConf in SQLOperation.getSerDe() impacts CPU on hiveserver2
> --
>
> Key: HIVE-15712
> URL: https://issues.apache.org/jira/browse/HIVE-15712
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Affects Versions: 2.1.1
>Reporter: anishek
>Assignee: anishek
> Fix For: 2.3.0
>
> Attachments: HIVE-15712.1.patch, HIVE-15712.1.patch
>
>
> In an internal performance test with about 10 concurrent users, we found 
> that about 18% of CPU on hiveserver2 is spent in the creation of new 
> HiveConf() in SQLOperation.getSerDe().
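One common remedy, sketched under the assumption that a shared base
configuration is acceptable; the class here is illustrative and this is not
necessarily the committed fix:

{code:java}
import org.apache.hadoop.hive.conf.HiveConf;

public class ConfCache {
  // Built once at startup; constructing HiveConf from scratch re-reads and
  // parses the XML config files, which is what burned the CPU above.
  private static final HiveConf BASE = new HiveConf();

  static HiveConf newConf() {
    return new HiveConf(BASE); // in-memory copy, no XML parse
  }
}
{code}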



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-15711) Flaky TestEmbeddedThriftBinaryCLIService.testTaskStatus

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147788#comment-16147788
 ] 

ASF GitHub Bot commented on HIVE-15711:
---

Github user anishek closed the pull request at:

https://github.com/apache/hive/pull/136


> Flaky TestEmbeddedThriftBinaryCLIService.testTaskStatus
> ---
>
> Key: HIVE-15711
> URL: https://issues.apache.org/jira/browse/HIVE-15711
> Project: Hive
>  Issue Type: Test
>  Components: Hive
>Affects Versions: 2.1.1
>Reporter: anishek
>Assignee: anishek
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: HIVE-15711.1.patch
>
>
> The above test is flaky and keeps failing in local build environments. 
> Fix it to prevent intermittent failures.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-15473) Progress Bar on Beeline client

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147785#comment-16147785
 ] 

ASF GitHub Bot commented on HIVE-15473:
---

Github user anishek closed the pull request at:

https://github.com/apache/hive/pull/129


> Progress Bar on Beeline client
> --
>
> Key: HIVE-15473
> URL: https://issues.apache.org/jira/browse/HIVE-15473
> Project: Hive
>  Issue Type: Improvement
>  Components: Beeline, HiveServer2
>Affects Versions: 2.1.1
>Reporter: anishek
>Assignee: anishek
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: HIVE-15473.10.patch, HIVE-15473.11.patch, 
> HIVE-15473.2.patch, HIVE-15473.3.patch, HIVE-15473.4.patch, 
> HIVE-15473.5.patch, HIVE-15473.6.patch, HIVE-15473.7.patch, 
> HIVE-15473.8.patch, HIVE-15473.9.patch, io_summary_after_patch.png, 
> io_summary_before_patch.png, screen_shot_beeline.jpg, status_after_patch.png, 
> status_before_patch.png, summary_after_patch.png, summary_before_patch.png
>
>
> Hive Cli allows showing a progress bar for the tez execution engine, as shown in 
> https://issues.apache.org/jira/secure/attachment/12678767/ux-demo.gif
> It would be great to have a similar progress bar displayed when the user is 
> connecting via the beeline command line client as well. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17410) repl load task during subsequent DAG generation does not start from the last partition processed

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147783#comment-16147783
 ] 

ASF GitHub Bot commented on HIVE-17410:
---

GitHub user anishek opened a pull request:

https://github.com/apache/hive/pull/240

HIVE-17410 : repl load task during subsequent DAG generation does not start 
from the last partition processed



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/anishek/hive HIVE-17410

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/hive/pull/240.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #240


commit f9072a7f76484222f0f78398fec6138d0d0847a3
Author: Anishek Agarwal 
Date:   2017-08-30T00:03:39Z

HIVE-17410 : repl load task during subsequent DAG generation does not start 
from the last partition processed




> repl load task during subsequent DAG generation does not start from the last 
> partition processed
> 
>
> Key: HIVE-17410
> URL: https://issues.apache.org/jira/browse/HIVE-17410
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 3.0.0
>Reporter: anishek
>Assignee: anishek
> Attachments: HIVE-17410.1.patch
>
>
> The DAG for the repl load task was to be generated dynamically such that, if 
> the load breaks during a partition load, then subsequent runs start after the 
> last partition processed (see the sketch below).
> We currently identify the point from which we have to process the event, but 
> reinitialize the iterator to start from the beginning of all partitions to 
> process.
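A minimal sketch of the intended resume behavior; the class and the string
partition keys are illustrative:

{code:java}
import java.util.Iterator;
import java.util.List;

public class ResumablePartitionIterator {
  // Rather than re-iterating from the first partition on every DAG
  // (re)generation, skip ahead to just past the last partition processed.
  static Iterator<String> resumeFrom(List<String> partitions, String lastProcessed) {
    int idx = lastProcessed == null ? -1 : partitions.indexOf(lastProcessed);
    return partitions.subList(idx + 1, partitions.size()).iterator();
  }
}
{code}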



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17410) repl load task during subsequent DAG generation does not start from the last partition processed

2017-08-30 Thread anishek (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147734#comment-16147734
 ] 

anishek commented on HIVE-17410:


* org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2]: 
runs fine on the local machine
* org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver: 
seems to be a local environment problem; on the local machine it takes forever 
to run (about 20 minutes so far with no errors, so I stopped it), whereas on 
the test environment it failed quickly.

Other test failures are from older builds.

> repl load task during subsequent DAG generation does not start from the last 
> partition processed
> 
>
> Key: HIVE-17410
> URL: https://issues.apache.org/jira/browse/HIVE-17410
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 3.0.0
>Reporter: anishek
>Assignee: anishek
> Attachments: HIVE-17410.1.patch
>
>
> The DAG for the repl load task was to be generated dynamically such that, if 
> the load breaks during a partition load, then subsequent runs start after the 
> last partition processed.
> We currently identify the point from which we have to process the event, but 
> reinitialize the iterator to start from the beginning of all partitions to 
> process.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17216) Additional qtests for HoS DPP

2017-08-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/HIVE-17216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147726#comment-16147726
 ] 

Sergio Peña commented on HIVE-17216:


LGTM
+1

> Additional qtests for HoS DPP
> -
>
> Key: HIVE-17216
> URL: https://issues.apache.org/jira/browse/HIVE-17216
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-17216.1.patch, HIVE-17216.2.patch, 
> HIVE-17216.3.patch, HIVE-17216.4.patch
>
>
> There are a few query patterns that the current HoS DPP tests don't cover; 
> this adds queries for them to increase coverage.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17307) Change the metastore to not use the metrics code in hive/common

2017-08-30 Thread Vihang Karajgaonkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147715#comment-16147715
 ] 

Vihang Karajgaonkar commented on HIVE-17307:


+1 LGTM. I added one minor comment on the PR; it would be good if you could add 
a line of comment explaining a bit more there. Thanks for the changes 
[~alangates]

> Change the metastore to not use the metrics code in hive/common
> ---
>
> Key: HIVE-17307
> URL: https://issues.apache.org/jira/browse/HIVE-17307
> Project: Hive
>  Issue Type: Sub-task
>  Components: Metastore
>Reporter: Alan Gates
>Assignee: Alan Gates
> Attachments: HIVE-17307.2.patch, HIVE-17307.3.patch, 
> HIVE-17307.4.patch, HIVE-17307.patch
>
>
> As we move code into the standalone metastore module, it cannot use the 
> metrics in hive-common.  We could copy the current Metrics interface or we 
> could change the metastore code to directly use codahale metrics.
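For the second option, a minimal illustration of using Codahale metrics
directly; the registry and metric name here are invented for the example:

{code:java}
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class DirectCodahale {
  static final MetricRegistry REGISTRY = new MetricRegistry();

  static void timedGetTable() {
    Timer timer = REGISTRY.timer("hms.api.get_table"); // invented name
    try (Timer.Context ignored = timer.time()) {
      // ... do the metastore work being measured ...
    }
  }
}
{code}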



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17225) HoS DPP pruning sink ops can target parallel work objects

2017-08-30 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147670#comment-16147670
 ] 

Sahil Takiar commented on HIVE-17225:
-

[~janulatha], [~asherman] addressed your review comments.

> HoS DPP pruning sink ops can target parallel work objects
> -
>
> Key: HIVE-17225
> URL: https://issues.apache.org/jira/browse/HIVE-17225
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Affects Versions: 3.0.0
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE17225.1.patch, HIVE-17225.2.patch, 
> HIVE-17225.3.patch, HIVE-17225.4.patch
>
>
> Setup:
> {code:sql}
> SET hive.spark.dynamic.partition.pruning=true;
> SET hive.strict.checks.cartesian.product=false;
> SET hive.auto.convert.join=true;
> CREATE TABLE partitioned_table1 (col int) PARTITIONED BY (part_col int);
> CREATE TABLE regular_table1 (col int);
> CREATE TABLE regular_table2 (col int);
> ALTER TABLE partitioned_table1 ADD PARTITION (part_col = 1);
> ALTER TABLE partitioned_table1 ADD PARTITION (part_col = 2);
> ALTER TABLE partitioned_table1 ADD PARTITION (part_col = 3);
> INSERT INTO table regular_table1 VALUES (1), (2), (3), (4), (5), (6);
> INSERT INTO table regular_table2 VALUES (1), (2), (3), (4), (5), (6);
> INSERT INTO TABLE partitioned_table1 PARTITION (part_col = 1) VALUES (1);
> INSERT INTO TABLE partitioned_table1 PARTITION (part_col = 2) VALUES (2);
> INSERT INTO TABLE partitioned_table1 PARTITION (part_col = 3) VALUES (3);
> SELECT *
> FROM   partitioned_table1,
>regular_table1 rt1,
>regular_table2 rt2
> WHERE  rt1.col = partitioned_table1.part_col
>AND rt2.col = partitioned_table1.part_col;
> {code}
> Exception:
> {code}
> 2017-08-01T13:27:47,483 ERROR [b0d354a8-4cdb-4ba9-acec-27d14926aaf4 main] 
> ql.Driver: FAILED: Execution Error, return code 3 from 
> org.apache.hadoop.hive.ql.exec.spark.SparkTask. java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.FileNotFoundException: File 
> file:/Users/stakiar/Documents/idea/apache-hive/itests/qtest-spark/target/tmp/scratchdir/stakiar/b0d354a8-4cdb-4ba9-acec-27d14926aaf4/hive_2017-08-01_13-27-45_553_1088589686371686526-1/-mr-10004/3/5
>  does not exist
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:408)
>   at 
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:498)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:82)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:82)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:82)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:82)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:82)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:82)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at 

[jira] [Updated] (HIVE-16886) HMS log notifications may have duplicated event IDs if multiple HMS are running concurrently

2017-08-30 Thread anishek (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-16886:
---
Attachment: HIVE-16886.7.patch

fixing log comments

> HMS log notifications may have duplicated event IDs if multiple HMS are 
> running concurrently
> 
>
> Key: HIVE-16886
> URL: https://issues.apache.org/jira/browse/HIVE-16886
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, Metastore
>Reporter: Sergio Peña
>Assignee: anishek
> Attachments: datastore-identity-holes.diff, HIVE-16886.1.patch, 
> HIVE-16886.2.patch, HIVE-16886.3.patch, HIVE-16886.4.patch, 
> HIVE-16886.5.patch, HIVE-16886.6.patch, HIVE-16886.7.patch
>
>
> When running multiple Hive Metastore servers and DB notifications are 
> enabled, I could see that notifications can be persisted with a duplicated 
> event ID. 
> This does not happen when running multiple threads in a single HMS node due 
> to the locking acquired on the DbNotificationsLog class, but multiple HMS 
> could cause conflicts.
> The issue is in the ObjectStore#addNotificationEvent() method. The event ID 
> fetched from the datastore is used for the new notification, incremented in 
> the server itself, then persisted or updated back to the datastore. If 2 
> servers read the same ID, then these 2 servers write a new notification with 
> the same ID.
> The event ID is neither unique nor a primary key.
> Here's a test case using the TestObjectStore class that confirms this issue:
> {noformat}
> @Test
>   public void testConcurrentAddNotifications() throws ExecutionException, 
> InterruptedException {
> final int NUM_THREADS = 2;
> CountDownLatch countIn = new CountDownLatch(NUM_THREADS);
> CountDownLatch countOut = new CountDownLatch(1);
> HiveConf conf = new HiveConf();
> conf.setVar(HiveConf.ConfVars.METASTORE_EXPRESSION_PROXY_CLASS, 
> MockPartitionExpressionProxy.class.getName());
> ExecutorService executorService = 
> Executors.newFixedThreadPool(NUM_THREADS);
> FutureTask<Void> tasks[] = new FutureTask[NUM_THREADS];
> for (int i = 0; i < NUM_THREADS; ++i) {
>   final int n = i;
>   tasks[i] = new FutureTask<Void>(new Callable<Void>() {
> @Override
> public Void call() throws Exception {
>   ObjectStore store = new ObjectStore();
>   store.setConf(conf);
>   NotificationEvent dbEvent =
>   new NotificationEvent(0, 0, 
> EventMessage.EventType.CREATE_DATABASE.toString(), "CREATE DATABASE DB" + n);
>   System.out.println("ADDING NOTIFICATION");
>   countIn.countDown();
>   countOut.await();
>   store.addNotificationEvent(dbEvent);
>   System.out.println("FINISH NOTIFICATION");
>   return null;
> }
>   });
>   executorService.execute(tasks[i]);
> }
> countIn.await();
> countOut.countDown();
> for (int i = 0; i < NUM_THREADS; ++i) {
>   tasks[i].get();
> }
> NotificationEventResponse eventResponse = 
> objectStore.getNextNotification(new NotificationEventRequest());
> Assert.assertEquals(2, eventResponse.getEventsSize());
> Assert.assertEquals(1, eventResponse.getEvents().get(0).getEventId());
> // This fails because the next notification has an event ID = 1
> Assert.assertEquals(2, eventResponse.getEvents().get(1).getEventId());
>   }
> {noformat}
> The last assertion fails expecting an event ID 1 instead of 2. 
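One conventional way to close the read-increment-write race described above is
to serialize ID allocation with a row lock. A sketch only; the table and
column names are illustrative and this is not necessarily the committed fix:

{code:sql}
-- Inside the notification-add transaction; a second HMS blocks on the row
-- lock until the first commits, so both cannot read the same ID.
SELECT NEXT_EVENT_ID FROM NOTIFICATION_SEQUENCE FOR UPDATE;
UPDATE NOTIFICATION_SEQUENCE SET NEXT_EVENT_ID = NEXT_EVENT_ID + 1;
{code}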



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16886) HMS log notifications may have duplicated event IDs if multiple HMS are running concurrently

2017-08-30 Thread anishek (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147667#comment-16147667
 ] 

anishek commented on HIVE-16886:


* org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[transform_acid]: 
works fine on the local machine
* org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23]: fails 
locally; only the order of the first two statements is incorrect, which was 
specifically changed as part of HIVE-17037. Not sure why it's failing now, as 
it is not part of the current set of changes.
* 
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver: 
seems like a disk IO operation failure, not a test failure; I ran this on the 
local machine for 10 minutes with no end in sight, so I stopped it.

Other tests are failing from older builds.

> HMS log notifications may have duplicated event IDs if multiple HMS are 
> running concurrently
> 
>
> Key: HIVE-16886
> URL: https://issues.apache.org/jira/browse/HIVE-16886
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, Metastore
>Reporter: Sergio Peña
>Assignee: anishek
> Attachments: datastore-identity-holes.diff, HIVE-16886.1.patch, 
> HIVE-16886.2.patch, HIVE-16886.3.patch, HIVE-16886.4.patch, 
> HIVE-16886.5.patch, HIVE-16886.6.patch
>
>
> When running multiple Hive Metastore servers and DB notifications are 
> enabled, I could see that notifications can be persisted with a duplicated 
> event ID. 
> This does not happen when running multiple threads in a single HMS node due 
> to the locking acquired on the DbNotificationsLog class, but multiple HMS 
> could cause conflicts.
> The issue is in the ObjectStore#addNotificationEvent() method. The event ID 
> fetched from the datastore is used for the new notification, incremented in 
> the server itself, then persisted or updated back to the datastore. If 2 
> servers read the same ID, then these 2 servers write a new notification with 
> the same ID.
> The event ID is neither unique nor a primary key.
> Here's a test case using the TestObjectStore class that confirms this issue:
> {noformat}
> @Test
>   public void testConcurrentAddNotifications() throws ExecutionException, 
> InterruptedException {
> final int NUM_THREADS = 2;
> CountDownLatch countIn = new CountDownLatch(NUM_THREADS);
> CountDownLatch countOut = new CountDownLatch(1);
> HiveConf conf = new HiveConf();
> conf.setVar(HiveConf.ConfVars.METASTORE_EXPRESSION_PROXY_CLASS, 
> MockPartitionExpressionProxy.class.getName());
> ExecutorService executorService = 
> Executors.newFixedThreadPool(NUM_THREADS);
> FutureTask<Void> tasks[] = new FutureTask[NUM_THREADS];
> for (int i = 0; i < NUM_THREADS; ++i) {
>   final int n = i;
>   tasks[i] = new FutureTask<Void>(new Callable<Void>() {
> @Override
> public Void call() throws Exception {
>   ObjectStore store = new ObjectStore();
>   store.setConf(conf);
>   NotificationEvent dbEvent =
>   new NotificationEvent(0, 0, 
> EventMessage.EventType.CREATE_DATABASE.toString(), "CREATE DATABASE DB" + n);
>   System.out.println("ADDING NOTIFICATION");
>   countIn.countDown();
>   countOut.await();
>   store.addNotificationEvent(dbEvent);
>   System.out.println("FINISH NOTIFICATION");
>   return null;
> }
>   });
>   executorService.execute(tasks[i]);
> }
> countIn.await();
> countOut.countDown();
> for (int i = 0; i < NUM_THREADS; ++i) {
>   tasks[i].get();
> }
> NotificationEventResponse eventResponse = 
> objectStore.getNextNotification(new NotificationEventRequest());
> Assert.assertEquals(2, eventResponse.getEventsSize());
> Assert.assertEquals(1, eventResponse.getEvents().get(0).getEventId());
> // This fails because the next notification has an event ID = 1
> Assert.assertEquals(2, eventResponse.getEvents().get(1).getEventId());
>   }
> {noformat}
> The last assertion fails expecting an event ID 1 instead of 2. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17193) HoS: don't combine map works that are targets of different DPPs

2017-08-30 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147645#comment-16147645
 ] 

Sahil Takiar commented on HIVE-17193:
-

[~lirui] is this a bug with {{CombineEquivalentWorkResolver}}?

> HoS: don't combine map works that are targets of different DPPs
> ---
>
> Key: HIVE-17193
> URL: https://issues.apache.org/jira/browse/HIVE-17193
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Assignee: Rui Li
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17382) Change startsWith relation introduced in HIVE-17316

2017-08-30 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147619#comment-16147619
 ] 

Hive QA commented on HIVE-17382:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12884453/HIVE-17382.02.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 11015 tests 
executed
*Failed tests:*
{noformat}
TestTxnCommandsBase - did not produce a TEST-*.xml file (likely timed out) 
(batchId=280)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=61)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning]
 (batchId=169)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=234)
org.apache.hadoop.hive.conf.TestHiveConfRestrictList.testMultipleRestrictions 
(batchId=249)
org.apache.hadoop.hive.metastore.datasource.TestDataSourceProviderFactory.testBoneCPConfigCannotBeSet
 (batchId=201)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6605/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6605/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6605/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12884453 - PreCommit-HIVE-Build

> Change startsWith relation introduced in HIVE-17316
> ---
>
> Key: HIVE-17382
> URL: https://issues.apache.org/jira/browse/HIVE-17382
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 3.0.0
>Reporter: Barna Zsombor Klara
>Assignee: Barna Zsombor Klara
> Fix For: 3.0.0
>
> Attachments: HIVE-17382.01.patch, HIVE-17382.02.patch
>
>
> In HiveConf the new name should be checked for whether it starts with a 
> restricted/hidden variable prefix, and not vice versa (sketched below).
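The direction of the check, sketched; the class and variable names are
illustrative:

{code:java}
public class StartsWithCheck {
  // Correct direction: the variable being set must start with the restricted
  // prefix, not the restricted prefix starting with the variable name.
  static boolean isRestricted(String name, String restrictedPrefix) {
    return name.startsWith(restrictedPrefix);
    // the bug being fixed: return restrictedPrefix.startsWith(name);
  }
}
{code}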



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16886) HMS log notifications may have duplicated event IDs if multiple HMS are running concurrently

2017-08-30 Thread anishek (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147592#comment-16147592
 ] 

anishek commented on HIVE-16886:


[~akolb] {{SQLGenerator}} is an artifact of the ACID implementation and we are 
just reusing the clause. From the looks of it SQL injection might be possible, 
but I am not an expert in that area, so I will let [~ekoifman] have a look at 
it; also, maybe track this as a separate bug.

I am just running the failed tests to see if there are any problems; will upload 
a patch soon with the log comment fixes suggested by [~spena]


> HMS log notifications may have duplicated event IDs if multiple HMS are 
> running concurrently
> 
>
> Key: HIVE-16886
> URL: https://issues.apache.org/jira/browse/HIVE-16886
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, Metastore
>Reporter: Sergio Peña
>Assignee: anishek
> Attachments: datastore-identity-holes.diff, HIVE-16886.1.patch, 
> HIVE-16886.2.patch, HIVE-16886.3.patch, HIVE-16886.4.patch, 
> HIVE-16886.5.patch, HIVE-16886.6.patch
>
>
> When running multiple Hive Metastore servers and DB notifications are 
> enabled, I could see that notifications can be persisted with a duplicated 
> event ID. 
> This does not happen when running multiple threads in a single HMS node due 
> to the locking acquired on the DbNotificationsLog class, but multiple HMS 
> could cause conflicts.
> The issue is in the ObjectStore#addNotificationEvent() method. The event ID 
> fetched from the datastore is used for the new notification, incremented in 
> the server itself, then persisted or updated back to the datastore. If 2 
> servers read the same ID, then these 2 servers write a new notification with 
> the same ID.
> The event ID is neither unique nor a primary key.
> Here's a test case using the TestObjectStore class that confirms this issue:
> {noformat}
> @Test
>   public void testConcurrentAddNotifications() throws ExecutionException, 
> InterruptedException {
> final int NUM_THREADS = 2;
> CountDownLatch countIn = new CountDownLatch(NUM_THREADS);
> CountDownLatch countOut = new CountDownLatch(1);
> HiveConf conf = new HiveConf();
> conf.setVar(HiveConf.ConfVars.METASTORE_EXPRESSION_PROXY_CLASS, 
> MockPartitionExpressionProxy.class.getName());
> ExecutorService executorService = 
> Executors.newFixedThreadPool(NUM_THREADS);
> FutureTask<Void> tasks[] = new FutureTask[NUM_THREADS];
> for (int i = 0; i < NUM_THREADS; ++i) {
>   final int n = i;
>   tasks[i] = new FutureTask<Void>(new Callable<Void>() {
> @Override
> public Void call() throws Exception {
>   ObjectStore store = new ObjectStore();
>   store.setConf(conf);
>   NotificationEvent dbEvent =
>   new NotificationEvent(0, 0, 
> EventMessage.EventType.CREATE_DATABASE.toString(), "CREATE DATABASE DB" + n);
>   System.out.println("ADDING NOTIFICATION");
>   countIn.countDown();
>   countOut.await();
>   store.addNotificationEvent(dbEvent);
>   System.out.println("FINISH NOTIFICATION");
>   return null;
> }
>   });
>   executorService.execute(tasks[i]);
> }
> countIn.await();
> countOut.countDown();
> for (int i = 0; i < NUM_THREADS; ++i) {
>   tasks[i].get();
> }
> NotificationEventResponse eventResponse = 
> objectStore.getNextNotification(new NotificationEventRequest());
> Assert.assertEquals(2, eventResponse.getEventsSize());
> Assert.assertEquals(1, eventResponse.getEvents().get(0).getEventId());
> // This fails because the next notification has an event ID = 1
> Assert.assertEquals(2, eventResponse.getEvents().get(1).getEventId());
>   }
> {noformat}
> The last assertion fails expecting an event ID 1 instead of 2. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17412) Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q

2017-08-30 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147582#comment-16147582
 ] 

Sahil Takiar commented on HIVE-17412:
-

LGTM +1

> Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-17412
> URL: https://issues.apache.org/jira/browse/HIVE-17412
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-17412.patch
>
>
> for query
> {code}
>  set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> select distinct ds from srcpart;
> {code}
> the result is 
> {code}
> 2008-04-09
> 2008-04-08
> {code}
> the result of the group by in Spark is not deterministically ordered. Sometimes it returns 
> {code}
> 2008-04-08
> 2008-04-09
> {code}
> Sometimes it returns
> {code}
> 2008-04-09
> 2008-04-08
> {code}
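For reference, the directive goes at the top of the .q file; the qtest driver
then sorts result rows before diffing against the golden file, so either Spark
ordering passes:

{code:sql}
-- SORT_QUERY_RESULTS

set hive.vectorized.execution.enabled=true;
select distinct ds from srcpart;
{code}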



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17183) Disable rename operations during bootstrap dump

2017-08-30 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147507#comment-16147507
 ] 

Hive QA commented on HIVE-17183:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12884440/HIVE-17183.04.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 11015 tests 
executed
*Failed tests:*
{noformat}
TestTxnCommandsBase - did not produce a TEST-*.xml file (likely timed out) 
(batchId=280)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=61)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning]
 (batchId=169)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=234)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=234)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6604/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6604/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6604/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12884440 - PreCommit-HIVE-Build

> Disable rename operations during bootstrap dump
> ---
>
> Key: HIVE-17183
> URL: https://issues.apache.org/jira/browse/HIVE-17183
> Project: Hive
>  Issue Type: Sub-task
>  Components: repl
>Affects Versions: 2.1.0
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
>  Labels: DR, replication
> Fix For: 3.0.0
>
> Attachments: HIVE-17183.01.patch, HIVE-17183.02.patch, 
> HIVE-17183.03.patch, HIVE-17183.04.patch
>
>
> Currently, a bootstrap dump can lead to data loss when any rename happens 
> while the dump is in progress. 
> *Scenario:*
> - Fetch table names (T1 and T2)
> - Dump table T1
> - Renaming table T2 to T3 generates a RENAME event
> - Dumping table T2 is a noop as the table doesn’t exist.
> - After load, the target only has T1.
> - Applying the RENAME event will fail as T2 doesn’t exist in the target.
> This feature can be supported in the next phase of development as it needs a 
> proper design to keep track of renamed tables/partitions. 
> So, for the time being, we shall disable rename operations while a bootstrap 
> dump is in progress to avoid any inconsistent state.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (HIVE-17385) Fix incremental repl error for non-native tables

2017-08-30 Thread Tao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147485#comment-16147485
 ] 

Tao Li edited comment on HIVE-17385 at 8/30/17 3:56 PM:


[~daijy] Regarding the exception, I am sticking with the old logic, i.e. 
throwing if replicationSpec.isInReplicationScope() is false and 
tableHandle.isNonNative() is true. We do skip non-native tables silently if 
replicationSpec.isInReplicationScope() is true, which is the case in our 
incremental dump scenario.

We do the null check in ImportSemanticAnalyzer because we do run into the 
null-table case after making the above dump change. A null check is also good 
practice/convention.


was (Author: taoli-hwx):
[~daijy] Regarding the exception, I am sticking with the old logic, i.e. 
throwing if replicationSpec.isInReplicationScope() is false and 
tableHandle.isNonNative() is true. We do skip the non-native silently if 
replicationSpec.isInReplicationScope() is true.

We do the null check on ImportSemanticAnalyzer because we do run into the null 
table case after making the above dump change. Also a null check is a good 
practice/convention.
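A sketch of the check as described; the two predicate methods are named in the
comment above, while the surrounding method and message are illustrative:

{code:java}
import org.apache.hadoop.hive.ql.metadata.Table;
import org.apache.hadoop.hive.ql.parse.ReplicationSpec;
import org.apache.hadoop.hive.ql.parse.SemanticException;

public class NonNativeCheck {
  static void check(Table tableHandle, ReplicationSpec replicationSpec)
      throws SemanticException {
    if (tableHandle != null && tableHandle.isNonNative()) {
      if (replicationSpec.isInReplicationScope()) {
        return; // replication scope: skip non-native tables silently
      }
      throw new SemanticException(
          "Export/import is not supported for non-native table "
              + tableHandle.getTableName());
    }
  }
}
{code}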

> Fix incremental repl error for non-native tables
> 
>
> Key: HIVE-17385
> URL: https://issues.apache.org/jira/browse/HIVE-17385
> Project: Hive
>  Issue Type: Bug
>  Components: repl
>Reporter: Tao Li
>Assignee: Tao Li
> Attachments: HIVE-17385.1.patch, HIVE-17385.2.patch, 
> HIVE-17385.3.patch, HIVE-17385.4.patch
>
>
> See the error below with incremental replication for non-native (storage 
> handler based) tables. The bug is that we are not checking whether a table 
> should be dumped/exported during incremental dump.
> 2017-08-02T12:31:48,195 ERROR [HiveServer2-Background-Pool: Thread-8078]: 
> exec.DDLTask (DDLTask.java:failed(632)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> MetaException(message:LOCATION may not be specified for HBase.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17276) Check max shuffle size when converting to dynamically partitioned hash join

2017-08-30 Thread Jesus Camacho Rodriguez (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez updated HIVE-17276:
---
   Resolution: Fixed
Fix Version/s: 3.0.0
   Status: Resolved  (was: Patch Available)

Pushed to master, thanks for the review [~ashutoshc]!

> Check max shuffle size when converting to dynamically partitioned hash join
> ---
>
> Key: HIVE-17276
> URL: https://issues.apache.org/jira/browse/HIVE-17276
> Project: Hive
>  Issue Type: Bug
>  Components: Physical Optimizer
>Affects Versions: 3.0.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
> Fix For: 3.0.0
>
> Attachments: HIVE-17276.01.patch, HIVE-17276.02.patch, 
> HIVE-17276.03.patch, HIVE-17276.patch
>
>
> Currently we only check whether the max number of entries in the hashmap for 
> a MapJoin surpasses a certain threshold to decide whether to execute a 
> dynamically partitioned hash join.
> We would like to factor the size of the large input that we will shuffle for 
> the dynamically partitioned hash join into the cost model too (sketched below).
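A sketch of the resulting gate; all names here are invented and the thresholds
would come from configuration:

{code:java}
public class DphjCostCheck {
  // Conversion to a dynamically partitioned hash join proceeds only if both
  // the hash-table size and the bytes the large side would shuffle stay
  // under their thresholds.
  static boolean shouldConvert(long hashEntries, long maxEntries,
                               long shuffleBytes, long maxShuffleBytes) {
    return hashEntries <= maxEntries && shuffleBytes <= maxShuffleBytes;
  }
}
{code}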



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17385) Fix incremental repl error for non-native tables

2017-08-30 Thread Tao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147485#comment-16147485
 ] 

Tao Li commented on HIVE-17385:
---

[~daijy] Regarding the exception, I am sticking with the old logic, i.e. 
throwing if replicationSpec.isInReplicationScope() is false and 
tableHandle.isNonNative() is true. We do skip non-native tables silently if 
replicationSpec.isInReplicationScope() is true.

We do the null check in ImportSemanticAnalyzer because we do run into the 
null-table case after making the above dump change. A null check is also good 
practice/convention.

> Fix incremental repl error for non-native tables
> 
>
> Key: HIVE-17385
> URL: https://issues.apache.org/jira/browse/HIVE-17385
> Project: Hive
>  Issue Type: Bug
>  Components: repl
>Reporter: Tao Li
>Assignee: Tao Li
> Attachments: HIVE-17385.1.patch, HIVE-17385.2.patch, 
> HIVE-17385.3.patch, HIVE-17385.4.patch
>
>
> See the error below with incremental replication for non-native (storage 
> handler based) tables. The bug is that we are not checking whether a table 
> should be dumped/exported during incremental dump.
> 2017-08-02T12:31:48,195 ERROR [HiveServer2-Background-Pool: Thread-8078]: 
> exec.DDLTask (DDLTask.java:failed(632)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> MetaException(message:LOCATION may not be specified for HBase.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17399) Do not remove semijoin branch if it feeds to TS->DPP_EVENT

2017-08-30 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147374#comment-16147374
 ] 

Hive QA commented on HIVE-17399:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12884405/HIVE-17399.2.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 10 failed/errored test(s), 11000 tests 
executed
*Failed tests:*
{noformat}
TestTxnCommandsBase - did not produce a TEST-*.xml file (likely timed out) 
(batchId=280)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=61)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[dynamic_partition_pruning]
 (batchId=151)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vectorized_dynamic_partition_pruning]
 (batchId=152)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning]
 (batchId=169)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=234)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=234)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=104)
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testEnableThriftSerializeInTasks 
(batchId=227)
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testHttpRetryOnServerIdleTimeout 
(batchId=227)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6603/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6603/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6603/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 10 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12884405 - PreCommit-HIVE-Build

> Do not remove semijoin branch if it feeds to TS->DPP_EVENT
> --
>
> Key: HIVE-17399
> URL: https://issues.apache.org/jira/browse/HIVE-17399
> Project: Hive
>  Issue Type: Bug
>Reporter: Deepak Jaiswal
>Assignee: Deepak Jaiswal
> Attachments: HIVE-17399.1.patch, HIVE-17399.2.patch
>
>
> If there is an incoming semijoin branch to a TS which has a DPP event, then try 
> to keep it, as it may serve as an excellent filter for DPP, drastically reducing 
> the input to the join.
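
A toy model of the rule (this is not Hive's operator API; the types below are 
illustrative only):

{code:java}
import java.util.List;

// Keep an incoming semijoin branch when the TableScan it targets also has a
// DPP event child, since the semijoin filter then also prunes the DPP input.
public class SemijoinDppRuleSketch {
  enum Kind { TABLE_SCAN, DPP_EVENT, OTHER }

  static class Op {
    final Kind kind;
    final List<Op> children;
    Op(Kind kind, List<Op> children) { this.kind = kind; this.children = children; }
  }

  static boolean shouldRemoveSemijoinBranch(Op tableScan) {
    // Remove the branch only when no TS -> DPP_EVENT edge exists.
    return tableScan.children.stream().noneMatch(c -> c.kind == Kind.DPP_EVENT);
  }
}
{code}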



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17412) Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q

2017-08-30 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147276#comment-16147276
 ] 

Xuefu Zhang commented on HIVE-17412:


+1

> Add "-- SORT_QUERY_RESULTS" for spark_vectorized_dynamic_partition_pruning.q
> 
>
> Key: HIVE-17412
> URL: https://issues.apache.org/jira/browse/HIVE-17412
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-17412.patch
>
>
> For the query
> {code}
>  set hive.optimize.ppd=true;
> set hive.ppd.remove.duplicatefilters=true;
> set hive.spark.dynamic.partition.pruning=true;
> set hive.optimize.metadataonly=false;
> set hive.optimize.index.filter=true;
> set hive.vectorized.execution.enabled=true;
> set hive.strict.checks.cartesian.product=false;
> select distinct ds from srcpart;
> {code}
> the result is 
> {code}
> 2008-04-09
> 2008-04-08
> {code}
> The result of the group-by in Spark is not deterministically ordered. Sometimes it returns 
> {code}
> 2008-04-08
> 2008-04-09
> {code}
> Sometimes it returns
> {code}
> 2008-04-09
> 2008-04-08
> {code}
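
The {{-- SORT_QUERY_RESULTS}} directive makes the test harness sort the output 
lines before comparing them with the golden file, which removes the ordering 
nondeterminism. A minimal sketch of that normalization step (hypothetical code, 
not the actual qtest framework):

{code:java}
import java.util.Arrays;

// Sorting the result lines makes the comparison independent of the order in
// which Spark's group-by emits rows.
public class SortQueryResultsSketch {
  static String normalize(String results) {
    String[] lines = results.split("\n");
    Arrays.sort(lines);
    return String.join("\n", lines);
  }

  public static void main(String[] args) {
    // Both orderings normalize to the same string.
    System.out.println(normalize("2008-04-09\n2008-04-08")
        .equals(normalize("2008-04-08\n2008-04-09"))); // true
  }
}
{code}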



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17100) Improve HS2 operation logs for REPL commands.

2017-08-30 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147253#comment-16147253
 ] 

Hive QA commented on HIVE-17100:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12884402/HIVE-17100.10.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 11014 tests 
executed
*Failed tests:*
{noformat}
TestTxnCommandsBase - did not produce a TEST-*.xml file (likely timed out) 
(batchId=280)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=61)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning]
 (batchId=169)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=234)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6602/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6602/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6602/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12884402 - PreCommit-HIVE-Build

> Improve HS2 operation logs for REPL commands.
> -
>
> Key: HIVE-17100
> URL: https://issues.apache.org/jira/browse/HIVE-17100
> Project: Hive
>  Issue Type: Sub-task
>  Components: HiveServer2, repl
>Affects Versions: 2.1.0
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
>  Labels: DR, replication
> Fix For: 3.0.0
>
> Attachments: HIVE-17100.01.patch, HIVE-17100.02.patch, 
> HIVE-17100.03.patch, HIVE-17100.04.patch, HIVE-17100.05.patch, 
> HIVE-17100.06.patch, HIVE-17100.07.patch, HIVE-17100.08.patch, 
> HIVE-17100.09.patch, HIVE-17100.10.patch
>
>
> It is necessary to log the progress of the replication tasks in a structured 
> manner, as follows.
> *+Bootstrap Dump:+*
> * At the start of bootstrap dump, will add one log with below details.
> {color:#59afe1}* Database Name
> * Dump Type (BOOTSTRAP)
> * (Estimated) Total number of tables/views to dump
> * (Estimated) Total number of functions to dump.
> * Dump Start Time{color}
> * After each table dump, will add a log as follows
> {color:#59afe1}* Table/View Name
> * Type (TABLE/VIEW/MATERIALIZED_VIEW)
> * Table dump end time
> * Table dump progress. Format is Table sequence no/(Estimated) Total number 
> of tables and views.{color}
> * After each function dump, will add a log as follows
> {color:#59afe1}* Function Name
> * Function dump end time
> * Function dump progress. Format is Function sequence no/(Estimated) Total 
> number of functions.{color}
> * After completion of all dumps, will add a log as follows to consolidate the 
> dump.
> {color:#59afe1}* Database Name.
> * Dump Type (BOOTSTRAP).
> * Dump End Time.
> * (Actual) Total number of tables/views dumped.
> * (Actual) Total number of functions dumped.
> * Dump Directory.
> * Last Repl ID of the dump.{color}
> *Note:* The actual and estimated numbers of tables/functions may not match if 
> any table/function is dropped while the dump is in progress.
> *+Bootstrap Load:+*
> * At the start of bootstrap load, will add one log with below details.
> {color:#59afe1}* Database Name
> * Dump directory
> * Load Type (BOOTSTRAP)
> * Total number of tables/views to load
> * Total number of functions to load.
> * Load Start Time{color}
> * After each table load, will add a log as follows
> {color:#59afe1}* Table/View Name
> * Type (TABLE/VIEW/MATERIALIZED_VIEW)
> * Table load completion time
> * Table load progress. Format is Table sequence no/Total number of tables and 
> views.{color}
> * After each function load, will add a log as follows
> {color:#59afe1}* Function Name
> * Function load completion time
> * Function load progress. Format is Function sequence no/Total number of 
> functions.{color}
> * After completion of all dumps, will add a log as follows to consolidate the 
> load.
> {color:#59afe1}* Database Name.
> * Load Type (BOOTSTRAP).
> * Load End Time.
> * Total number of tables/views loaded.
> * Total number of functions loaded.
> * Last Repl ID of the loaded database.{color}
> *+Incremental Dump:+*
> * At the start of database dump, will add one log with below details.
> {color:#59afe1}* Database Name
> * Dump Type (INCREMENTAL)
> * (Estimated) Total number of events to dump.
> * Dump Start Time{color}
> * After each event dump, will add a log as follows
> 

[jira] [Commented] (HIVE-17405) HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT

2017-08-30 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147160#comment-16147160
 ] 

Hive QA commented on HIVE-17405:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12884395/HIVE-17405.3.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 11014 tests 
executed
*Failed tests:*
{noformat}
TestTxnCommandsBase - did not produce a TEST-*.xml file (likely timed out) 
(batchId=280)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=61)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[unionDistinct_1] 
(batchId=143)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=234)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=234)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6601/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6601/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6601/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12884395 - PreCommit-HIVE-Build

> HoS DPP ConstantPropagate should use ConstantPropagateOption.SHORTCUT
> -
>
> Key: HIVE-17405
> URL: https://issues.apache.org/jira/browse/HIVE-17405
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-17405.1.patch, HIVE-17405.2.patch, 
> HIVE-17405.3.patch
>
>
> In {{SparkCompiler#runDynamicPartitionPruning}} we should change {{new 
> ConstantPropagate().transform(parseContext)}} to {{new 
> ConstantPropagate(ConstantPropagateOption.SHORTCUT).transform(parseContext)}}
> Hive-on-Tez does the same thing.
> Running the full constant propagation isn't really necessary; we just want to 
> eliminate any {{and true}} predicates that were introduced by 
> {{SyntheticJoinPredicate}} and {{DynamicPartitionPruningOptimization}}. The 
> {{SyntheticJoinPredicate}} will introduce dummy filter predicates into the 
> operator tree, and {{DynamicPartitionPruningOptimization}} will replace them. 
> The predicates introduced via {{SyntheticJoinPredicate}} are necessary to 
> help {{DynamicPartitionPruningOptimization}} determine if DPP can be used or 
> not.
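
Spelled out, the proposed one-line change in 
SparkCompiler#runDynamicPartitionPruning is:

{code:java}
// before: runs the full constant propagation pass
new ConstantPropagate().transform(parseContext);

// after: run only the SHORTCUT pass, as Hive-on-Tez does, which is enough to
// strip the "and true" predicates left behind once
// DynamicPartitionPruningOptimization has replaced the synthetic predicates
new ConstantPropagate(ConstantPropagateOption.SHORTCUT).transform(parseContext);
{code}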



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17366) Constraint replication in bootstrap

2017-08-30 Thread Sankar Hariappan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147135#comment-16147135
 ] 

Sankar Hariappan commented on HIVE-17366:
-

[~daijy],
I submitted my review comments in the pull request link. Please have a look and 
let me know if you have any queries.

> Constraint replication in bootstrap
> ---
>
> Key: HIVE-17366
> URL: https://issues.apache.org/jira/browse/HIVE-17366
> Project: Hive
>  Issue Type: New Feature
>  Components: repl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Attachments: HIVE-17366.1.patch
>
>
> Incremental constraint replication is tracked in HIVE-15705. This is to track 
> the bootstrap replication.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17399) Do not remove semijoin branch if it feeds to TS->DPP_EVENT

2017-08-30 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147098#comment-16147098
 ] 

Hive QA commented on HIVE-17399:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12884405/HIVE-17399.2.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 7 failed/errored test(s), 11014 tests 
executed
*Failed tests:*
{noformat}
TestTxnCommandsBase - did not produce a TEST-*.xml file (likely timed out) 
(batchId=280)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=61)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[dynamic_partition_pruning]
 (batchId=151)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vectorized_dynamic_partition_pruning]
 (batchId=152)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning]
 (batchId=169)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=234)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=234)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6600/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6600/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6600/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 7 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12884405 - PreCommit-HIVE-Build

> Do not remove semijoin branch if it feeds to TS->DPP_EVENT
> --
>
> Key: HIVE-17399
> URL: https://issues.apache.org/jira/browse/HIVE-17399
> Project: Hive
>  Issue Type: Bug
>Reporter: Deepak Jaiswal
>Assignee: Deepak Jaiswal
> Attachments: HIVE-17399.1.patch, HIVE-17399.2.patch
>
>
> If there is an incoming semijoin branch to a TS which has a DPP event, then try 
> to keep it, as it may serve as an excellent filter for DPP, drastically reducing 
> the input to the join.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17323) Improve upon HIVE-16260

2017-08-30 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147038#comment-16147038
 ] 

Hive QA commented on HIVE-17323:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12884387/HIVE-17323.5.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 11014 tests 
executed
*Failed tests:*
{noformat}
TestTxnCommandsBase - did not produce a TEST-*.xml file (likely timed out) 
(batchId=280)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=61)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[unionDistinct_1] 
(batchId=143)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning]
 (batchId=169)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=234)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=234)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6599/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6599/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6599/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12884387 - PreCommit-HIVE-Build

> Improve upon HIVE-16260
> ---
>
> Key: HIVE-17323
> URL: https://issues.apache.org/jira/browse/HIVE-17323
> Project: Hive
>  Issue Type: Bug
>Reporter: Deepak Jaiswal
>Assignee: Deepak Jaiswal
> Attachments: HIVE-17323.1.patch, HIVE-17323.2.patch, 
> HIVE-17323.3.patch, HIVE-17323.4.patch, HIVE-17323.5.patch
>
>
> HIVE-16260 allows removal of parallel edges of semijoin with mapjoins.
> https://issues.apache.org/jira/browse/HIVE-16260
> However, while traversing the query tree, it should also consider a dynamic 
> partition pruning edge, which is similar to a semijoin edge, without removing it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17382) Change startsWith relation introduced in HIVE-17316

2017-08-30 Thread Barna Zsombor Klara (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barna Zsombor Klara updated HIVE-17382:
---
Attachment: HIVE-17382.02.patch

Fixed BeeLine tests.

> Change startsWith relation introduced in HIVE-17316
> ---
>
> Key: HIVE-17382
> URL: https://issues.apache.org/jira/browse/HIVE-17382
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 3.0.0
>Reporter: Barna Zsombor Klara
>Assignee: Barna Zsombor Klara
> Fix For: 3.0.0
>
> Attachments: HIVE-17382.01.patch, HIVE-17382.02.patch
>
>
> In HiveConf, the new name should be checked to see whether it starts with a 
> restricted/hidden variable prefix, and not vice versa.
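
In other words, the fix swaps the receiver and the argument of the check. A 
minimal illustration (variable names are illustrative):

{code:java}
// Buggy direction: asks whether the restricted prefix starts with the new name.
static boolean buggyCheck(String newName, String restrictedPrefix) {
  return restrictedPrefix.startsWith(newName);
}

// Intended direction: asks whether the new name starts with the restricted prefix.
static boolean fixedCheck(String newName, String restrictedPrefix) {
  return newName.startsWith(restrictedPrefix);
}
{code}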



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17276) Check max shuffle size when converting to dynamically partitioned hash join

2017-08-30 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146950#comment-16146950
 ] 

Hive QA commented on HIVE-17276:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12884386/HIVE-17276.03.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 7 failed/errored test(s), 11014 tests 
executed
*Failed tests:*
{noformat}
TestTxnCommandsBase - did not produce a TEST-*.xml file (likely timed out) 
(batchId=280)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_queries]
 (batchId=230)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=61)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[unionDistinct_1] 
(batchId=143)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning]
 (batchId=169)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=234)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=234)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6598/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6598/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6598/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 7 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12884386 - PreCommit-HIVE-Build

> Check max shuffle size when converting to dynamically partitioned hash join
> ---
>
> Key: HIVE-17276
> URL: https://issues.apache.org/jira/browse/HIVE-17276
> Project: Hive
>  Issue Type: Bug
>  Components: Physical Optimizer
>Affects Versions: 3.0.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
> Attachments: HIVE-17276.01.patch, HIVE-17276.02.patch, 
> HIVE-17276.03.patch, HIVE-17276.patch
>
>
> Currently we only check whether the max number of entries in the hashmap for a 
> MapJoin surpasses a certain threshold when deciding whether to execute a 
> dynamically partitioned hash join.
> We would like to factor the size of the large input that we will shuffle for 
> the dynamically partitioned hash join into the cost model too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17183) Disable rename operations during bootstrap dump

2017-08-30 Thread Sankar Hariappan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sankar Hariappan updated HIVE-17183:

Status: Patch Available  (was: Open)

> Disable rename operations during bootstrap dump
> ---
>
> Key: HIVE-17183
> URL: https://issues.apache.org/jira/browse/HIVE-17183
> Project: Hive
>  Issue Type: Sub-task
>  Components: repl
>Affects Versions: 2.1.0
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
>  Labels: DR, replication
> Fix For: 3.0.0
>
> Attachments: HIVE-17183.01.patch, HIVE-17183.02.patch, 
> HIVE-17183.03.patch, HIVE-17183.04.patch
>
>
> Currently, a bootstrap dump can lead to data loss when any rename happens 
> while the dump is in progress. 
> *Scenario:*
> - Fetch table names (T1 and T2)
> - Dump table T1
> - Renaming table T2 to T3 generates a RENAME event
> - Dumping table T2 is a no-op as the table doesn’t exist anymore.
> - After load, the target only has T1.
> - Applying the RENAME event will fail as T2 doesn’t exist in the target.
> This feature can be supported in a next phase of development, as it needs a 
> proper design to keep track of renamed tables/partitions. 
> So, for the time being, we shall disable rename operations while a bootstrap 
> dump is in progress, to avoid any inconsistent state.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17183) Disable rename operations during bootstrap dump

2017-08-30 Thread Sankar Hariappan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sankar Hariappan updated HIVE-17183:

Attachment: HIVE-17183.04.patch

Added 04.patch with the below updates.
- Added a UUID to the key to differentiate multiple parallel bootstrap dumps, 
e.g. for the scenario primary -> replica_1 and primary -> replica_2.
- Removed all the dump state entries from the database properties before dumping 
the database object for a bootstrap dump. This avoids the case where renames 
would be permanently blocked on the replica if the dump state property were 
replicated.

Request [~anishek]/[~thejas] to please review the patch.
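
Conceptually, the guard could look like the sketch below; the property-key 
prefix and method name are illustrative, not the actual patch:

{code:java}
import java.util.Map;

// Hypothetical sketch: reject renames while any bootstrap-dump marker is
// present in the database parameters. The key carries a UUID so parallel
// dumps (primary -> replica_1 and primary -> replica_2) don't collide.
public class RenameGuardSketch {
  static final String DUMP_STATE_PREFIX = "bootstrap.dump.state."; // illustrative

  static boolean isRenameAllowed(Map<String, String> dbParams) {
    for (String key : dbParams.keySet()) {
      if (key.startsWith(DUMP_STATE_PREFIX)) {
        return false; // a bootstrap dump is in progress
      }
    }
    return true;
  }
}
{code}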

> Disable rename operations during bootstrap dump
> ---
>
> Key: HIVE-17183
> URL: https://issues.apache.org/jira/browse/HIVE-17183
> Project: Hive
>  Issue Type: Sub-task
>  Components: repl
>Affects Versions: 2.1.0
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
>  Labels: DR, replication
> Fix For: 3.0.0
>
> Attachments: HIVE-17183.01.patch, HIVE-17183.02.patch, 
> HIVE-17183.03.patch, HIVE-17183.04.patch
>
>
> Currently, a bootstrap dump can lead to data loss when any rename happens 
> while the dump is in progress. 
> *Scenario:*
> - Fetch table names (T1 and T2)
> - Dump table T1
> - Renaming table T2 to T3 generates a RENAME event
> - Dumping table T2 is a no-op as the table doesn’t exist anymore.
> - After load, the target only has T1.
> - Applying the RENAME event will fail as T2 doesn’t exist in the target.
> This feature can be supported in a next phase of development, as it needs a 
> proper design to keep track of renamed tables/partitions. 
> So, for the time being, we shall disable rename operations while a bootstrap 
> dump is in progress, to avoid any inconsistent state.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-2526) "lastAccessTime" is always zero when executed through "describe extended " unlike "show table extended like " where lastAccessTime is updated.

2017-08-30 Thread Dmytro Sen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146944#comment-16146944
 ] 

Dmytro Sen commented on HIVE-2526:
--

If you want lastAccessTime updated, just set
{code}
set hive.exec.pre.hooks = 
org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec;
{code}

The hook will update LAST_ACCESS_TIME on every query, as initially implemented 
in HIVE-1819, so both commands will return a non-zero lastAccessTime.


> "lastAceesTime" is always zero when executed through "describe extended 
> " unlike "show table extende like "  where 
> lastAccessTime is updated.
> -
>
> Key: HIVE-2526
> URL: https://issues.apache.org/jira/browse/HIVE-2526
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.9.0
> Environment: Linux : SuSE 11 SP1
>Reporter: Rohith Sharma K S
>Assignee: Priyadarshini
>Priority: Minor
> Attachments: HIVE-2526.patch
>
>
> When the table is accessed (load), lastAccessTime displays the updated 
> access time in 
> "show table extended like ". But "describe extended " 
> always displays zero.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Work stopped] (HIVE-17183) Disable rename operations during bootstrap dump

2017-08-30 Thread Sankar Hariappan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-17183 stopped by Sankar Hariappan.
---
> Disable rename operations during bootstrap dump
> ---
>
> Key: HIVE-17183
> URL: https://issues.apache.org/jira/browse/HIVE-17183
> Project: Hive
>  Issue Type: Sub-task
>  Components: repl
>Affects Versions: 2.1.0
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
>  Labels: DR, replication
> Fix For: 3.0.0
>
> Attachments: HIVE-17183.01.patch, HIVE-17183.02.patch, 
> HIVE-17183.03.patch
>
>
> Currently, a bootstrap dump can lead to data loss when any rename happens 
> while the dump is in progress. 
> *Scenario:*
> - Fetch table names (T1 and T2)
> - Dump table T1
> - Renaming table T2 to T3 generates a RENAME event
> - Dumping table T2 is a no-op as the table doesn’t exist anymore.
> - After load, the target only has T1.
> - Applying the RENAME event will fail as T2 doesn’t exist in the target.
> This feature can be supported in a next phase of development, as it needs a 
> proper design to keep track of renamed tables/partitions. 
> So, for the time being, we shall disable rename operations while a bootstrap 
> dump is in progress, to avoid any inconsistent state.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17280) Data loss in CONCATENATE ORC created by Spark

2017-08-30 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146933#comment-16146933
 ] 

Marco Gaido commented on HIVE-17280:


I see, but this won't fix the problem with files written by Spark. This is the 
way Spark names the files it writes to managed tables, so the issue will still 
be there.

> Data loss in CONCATENATE ORC created by Spark
> -
>
> Key: HIVE-17280
> URL: https://issues.apache.org/jira/browse/HIVE-17280
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, Spark
>Affects Versions: 1.2.1
> Environment: Spark 1.6.3
>Reporter: Marco Gaido
>Priority: Critical
>
> Hive concatenation causes data loss if the ORC files in the table were 
> written by Spark.
> Here are the steps to reproduce the problem:
>  - create a table;
> {code:java}
> hive
> hive> create table aa (a string, b int) stored as orc;
> {code}
>  - insert 2 rows using Spark;
> {code:java}
> spark-shell
> scala> case class AA(a:String, b:Int)
> scala> val df = sc.parallelize(Array(AA("b",2),AA("c",3) )).toDF
> scala> df.write.insertInto("aa")
> {code}
>  - change table schema;
> {code:java}
> hive
> hive> alter table aa add columns(aa string, bb int);
> {code}
> - insert another 2 rows with Spark
> {code:java}
> spark-shell
> scala> case class BB(a:String, b:Int, aa:String, bb:Int)
> scala> val df = sc.parallelize(Array(BB("b",2,"b",2),BB("c",3,"c",3) )).toDF
> scala> df.write.insertInto("aa")
> {code}
> - at this point, running a select statement with Hive correctly returns *4 
> rows* from the table; then run the concatenation
> {code:java}
> hive
> hive> alter table aa concatenate;
> {code}
> At this point, a select returns only *3 rows, i.e. a row is missing*.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17280) Data loss in CONCATENATE ORC created by Spark

2017-08-30 Thread Prasanth Jayachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146914#comment-16146914
 ] 

Prasanth Jayachandran commented on HIVE-17280:
--

That is certainly not the format that Hive expects. After concatenation, merged 
and unmerged (incompatible) files get moved to a staging directory. Then 
MoveTask moves the files from the staging directory to the final destination 
directory (which is also the source directory in the case of concatenation). 
There are certain assumptions around filenames for bucketing, speculative 
execution, etc. in the move task. In the example files that you provided, 
part-0_copy_1 and part-1_copy_1 will be considered the same file written by 
different tasks (from speculative execution), and the largest file will be 
picked as the winner of the speculated execution. This is the same issue as 
HIVE-17403. Hive usually writes files with the format 000000_0, where 000000 is 
the task id/bucket id and the digit after the underscore is the task attempt. I 
am working on a patch that will restrict concatenation for external tables; for 
Hive managed tables, the load data command will make sure the filenames conform 
to Hive's expectation. 
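
For reference, a sketch of the filename convention being described; the regex 
is illustrative, not Hive's actual validation code:

{code:java}
import java.util.regex.Pattern;

// Hive's expected layout is 000000_0 (task/bucket id, underscore, task
// attempt), optionally with a _copy_N suffix; Spark-style part-N names
// don't match it.
public class HiveFileNameSketch {
  private static final Pattern HIVE_NAME =
      Pattern.compile("\\d{6}_\\d+(_copy_\\d+)?");

  static boolean looksLikeHiveName(String fileName) {
    return HIVE_NAME.matcher(fileName).matches();
  }

  public static void main(String[] args) {
    System.out.println(looksLikeHiveName("000000_0"));      // true
    System.out.println(looksLikeHiveName("part-0_copy_1")); // false
  }
}
{code}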

> Data loss in CONCATENATE ORC created by Spark
> -
>
> Key: HIVE-17280
> URL: https://issues.apache.org/jira/browse/HIVE-17280
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, Spark
>Affects Versions: 1.2.1
> Environment: Spark 1.6.3
>Reporter: Marco Gaido
>Priority: Critical
>
> Hive concatenation causes data loss if the ORC files in the table were 
> written by Spark.
> Here are the steps to reproduce the problem:
>  - create a table;
> {code:java}
> hive
> hive> create table aa (a string, b int) stored as orc;
> {code}
>  - insert 2 rows using Spark;
> {code:java}
> spark-shell
> scala> case class AA(a:String, b:Int)
> scala> val df = sc.parallelize(Array(AA("b",2),AA("c",3) )).toDF
> scala> df.write.insertInto("aa")
> {code}
>  - change table schema;
> {code:java}
> hive
> hive> alter table aa add columns(aa string, bb int);
> {code}
> - insert another 2 rows with Spark
> {code:java}
> spark-shell
> scala> case class BB(a:String, b:Int, aa:String, bb:Int)
> scala> val df = sc.parallelize(Array(BB("b",2,"b",2),BB("c",3,"c",3) )).toDF
> scala> df.write.insertInto("aa")
> {code}
> - at this point, running a select statement with Hive correctly returns *4 
> rows* from the table; then run the concatenation
> {code:java}
> hive
> hive> alter table aa concatenate;
> {code}
> At this point, a select returns only *3 rows, i.e. a row is missing*.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17280) Data loss in CONCATENATE ORC created by Spark

2017-08-30 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146882#comment-16146882
 ] 

Marco Gaido commented on HIVE-17280:


The names of the files are:

{noformat}
/apps/hive/warehouse/aa/part-0
/apps/hive/warehouse/aa/part-0_copy_1
/apps/hive/warehouse/aa/part-1
/apps/hive/warehouse/aa/part-1_copy_1
/apps/hive/warehouse/aa/part-1_copy_2
/apps/hive/warehouse/aa/part-2
/apps/hive/warehouse/aa/part-2_copy_1
/apps/hive/warehouse/aa/part-3
/apps/hive/warehouse/aa/part-3_copy_1
{noformat}


> Data loss in CONCATENATE ORC created by Spark
> -
>
> Key: HIVE-17280
> URL: https://issues.apache.org/jira/browse/HIVE-17280
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, Spark
>Affects Versions: 1.2.1
> Environment: Spark 1.6.3
>Reporter: Marco Gaido
>Priority: Critical
>
> Hive concatenation causes data loss if the ORC files in the table were 
> written by Spark.
> Here are the steps to reproduce the problem:
>  - create a table;
> {code:java}
> hive
> hive> create table aa (a string, b int) stored as orc;
> {code}
>  - insert 2 rows using Spark;
> {code:java}
> spark-shell
> scala> case class AA(a:String, b:Int)
> scala> val df = sc.parallelize(Array(AA("b",2),AA("c",3) )).toDF
> scala> df.write.insertInto("aa")
> {code}
>  - change table schema;
> {code:java}
> hive
> hive> alter table aa add columns(aa string, bb int);
> {code}
> - insert another 2 rows with Spark
> {code:java}
> spark-shell
> scala> case class BB(a:String, b:Int, aa:String, bb:Int)
> scala> val df = sc.parallelize(Array(BB("b",2,"b",2),BB("c",3,"c",3) )).toDF
> scala> df.write.insertInto("aa")
> {code}
> - at this point, running a select statement with Hive correctly returns *4 
> rows* from the table; then run the concatenation
> {code:java}
> hive
> hive> alter table aa concatenate;
> {code}
> At this point, a select returns only *3 rows, i.e. a row is missing*.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-15467) escape1.q hangs in TestMiniLlapLocalCliDriver

2017-08-30 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146880#comment-16146880
 ] 

Zoltan Haindrich commented on HIVE-15467:
-

The same issue happened again for me. I've looked into it a bit more, and it 
seems like there are some issues with the nodemanagers: they report that the 
local dirs are bad.

The resourcemanager UI shows this info:
{code}
NodeHealthReport 4/4 local-dirs are bad: 
/home/kirk/hw/asf-hive/itests/qtest/target/hive/hive-localDir-nm-1_0,/home/kirk/hw/asf-hive/itests/qtest/target/hive/hive-localDir-nm-1_2,/home/kirk/hw/asf-hive/itests/qtest/target/hive/hive-localDir-nm-1_1,/home/kirk/hw/asf-hive/itests/qtest/target/hive/hive-localDir-nm-1_3;
 4/4 log-dirs are bad: 
/home/kirk/hw/asf-hive/itests/qtest/target/hive/hive-logDir-nm-1_1,/home/kirk/hw/asf-hive/itests/qtest/target/hive/hive-logDir-nm-1_0,/home/kirk/hw/asf-hive/itests/qtest/target/hive/hive-logDir-nm-1_3,/home/kirk/hw/asf-hive/itests/qtest/target/hive/hive-logDir-nm-1_2
 
{code}

The nodemanagers are in an unworkable state, and because of this the Tez AM gets 
stuck in the initializing state. The resourcemanager UI seems to be unavailable, 
and I've not found any other useful info.

I've switched to a different CLI driver, which didn't get stuck.

> escape1.q hangs in TestMiniLlapLocalCliDriver
> -
>
> Key: HIVE-15467
> URL: https://issues.apache.org/jira/browse/HIVE-15467
> Project: Hive
>  Issue Type: Bug
>Reporter: Pengcheng Xiong
>Assignee: Prasanth Jayachandran
>
> here is part of the log before it hangs
> {code}
> 2016-12-19T15:21:05,779  INFO [LlapScheduler] 
> tezplugins.LlapTaskSchedulerService: ScheduleResult for Task: 
> TaskInfo{task=attempt_1482189645956_0001_33_00_00_1, priority=1, 
> startTime=0, containerId=null, assignedNode=, uniqueId=54, 
> localityDelayTimeout=0} = DELAYED_RESOURCES
> 2016-12-19T15:21:05,779 DEBUG [LlapScheduler] 
> tezplugins.LlapTaskSchedulerService: Attempting to preempt on any host for 
> task=attempt_1482189645956_0001_33_00_00_1, pendingPreemptions=0
> 2016-12-19T15:21:05,779  INFO [LlapScheduler] 
> tezplugins.LlapTaskSchedulerService: Preempting for 
> task=attempt_1482189645956_0001_33_00_00_1 on any available host
> 2016-12-19T15:21:05,779 DEBUG [LlapScheduler] 
> tezplugins.LlapTaskSchedulerService: Unable to schedule all requests at 
> priority=1. Skipping subsequent priority levels
> 2016-12-19T15:21:07,953 DEBUG [AMReporterQueueDrainer] impl.AMReporter: 
> Removing am localhost:61788 with last associated dag 
> QueryIdentifier{appIdentifier='application_1482189645956_0001', 
> dagIdentifier=33} from heartbeat with taskCount=0, amFailed=false
> 2016-12-19T15:21:08,634  INFO [86edca30-bf12-42f8-90cd-a9fbdfbcb546 main] 
> SessionState: Map 1: 0(+1,-1)/1
> 2016-12-19T15:21:11,700  INFO [86edca30-bf12-42f8-90cd-a9fbdfbcb546 main] 
> SessionState: Map 1: 0(+1,-1)/1
> 2016-12-19T15:21:14,755  INFO [86edca30-bf12-42f8-90cd-a9fbdfbcb546 main] 
> SessionState: Map 1: 0(+1,-1)/1
> 2016-12-19T15:21:17,814  INFO [86edca30-bf12-42f8-90cd-a9fbdfbcb546 main] 
> SessionState: Map 1: 0(+1,-1)/1
> 2016-12-19T15:21:20,871  INFO [86edca30-bf12-42f8-90cd-a9fbdfbcb546 main] 
> SessionState: Map 1: 0(+1,-1)/1
> 2016-12-19T15:21:23,931  INFO [86edca30-bf12-42f8-90cd-a9fbdfbcb546 main] 
> SessionState: Map 1: 0(+1,-1)/1
> 2016-12-19T15:21:26,977  INFO [86edca30-bf12-42f8-90cd-a9fbdfbcb546 main] 
> SessionState: Map 1: 0(+1,-1)/1
> 2016-12-19T15:21:30,027  INFO [86edca30-bf12-42f8-90cd-a9fbdfbcb546 main] 
> SessionState: Map 1: 0(+1,-1)/1
> 2016-12-19T15:21:33,078  INFO [86edca30-bf12-42f8-90cd-a9fbdfbcb546 main] 
> SessionState: Map 1: 0(+1,-1)/1
> 2016-12-19T15:21:36,133  INFO [86edca30-bf12-42f8-90cd-a9fbdfbcb546 main] 
> SessionState: Map 1: 0(+1,-1)/1
> 2016-12-19T15:21:39,179  INFO [86edca30-bf12-42f8-90cd-a9fbdfbcb546 main] 
> SessionState: Map 1: 0(+1,-1)/1
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17225) HoS DPP pruning sink ops can target parallel work objects

2017-08-30 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146875#comment-16146875
 ] 

Hive QA commented on HIVE-17225:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12884385/HIVE-17225.4.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 11015 tests 
executed
*Failed tests:*
{noformat}
TestTxnCommandsBase - did not produce a TEST-*.xml file (likely timed out) 
(batchId=280)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=61)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning]
 (batchId=169)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] 
(batchId=100)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=234)
org.apache.hadoop.hive.ql.lockmgr.TestDbTxnManager2.testDummyTxnManagerOnAcidTable
 (batchId=282)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6597/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6597/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6597/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12884385 - PreCommit-HIVE-Build

> HoS DPP pruning sink ops can target parallel work objects
> -
>
> Key: HIVE-17225
> URL: https://issues.apache.org/jira/browse/HIVE-17225
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Affects Versions: 3.0.0
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE17225.1.patch, HIVE-17225.2.patch, 
> HIVE-17225.3.patch, HIVE-17225.4.patch
>
>
> Setup:
> {code:sql}
> SET hive.spark.dynamic.partition.pruning=true;
> SET hive.strict.checks.cartesian.product=false;
> SET hive.auto.convert.join=true;
> CREATE TABLE partitioned_table1 (col int) PARTITIONED BY (part_col int);
> CREATE TABLE regular_table1 (col int);
> CREATE TABLE regular_table2 (col int);
> ALTER TABLE partitioned_table1 ADD PARTITION (part_col = 1);
> ALTER TABLE partitioned_table1 ADD PARTITION (part_col = 2);
> ALTER TABLE partitioned_table1 ADD PARTITION (part_col = 3);
> INSERT INTO table regular_table1 VALUES (1), (2), (3), (4), (5), (6);
> INSERT INTO table regular_table2 VALUES (1), (2), (3), (4), (5), (6);
> INSERT INTO TABLE partitioned_table1 PARTITION (part_col = 1) VALUES (1);
> INSERT INTO TABLE partitioned_table1 PARTITION (part_col = 2) VALUES (2);
> INSERT INTO TABLE partitioned_table1 PARTITION (part_col = 3) VALUES (3);
> SELECT *
> FROM   partitioned_table1,
>regular_table1 rt1,
>regular_table2 rt2
> WHERE  rt1.col = partitioned_table1.part_col
>AND rt2.col = partitioned_table1.part_col;
> {code}
> Exception:
> {code}
> 2017-08-01T13:27:47,483 ERROR [b0d354a8-4cdb-4ba9-acec-27d14926aaf4 main] 
> ql.Driver: FAILED: Execution Error, return code 3 from 
> org.apache.hadoop.hive.ql.exec.spark.SparkTask. java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.FileNotFoundException: File 
> file:/Users/stakiar/Documents/idea/apache-hive/itests/qtest-spark/target/tmp/scratchdir/stakiar/b0d354a8-4cdb-4ba9-acec-27d14926aaf4/hive_2017-08-01_13-27-45_553_1088589686371686526-1/-mr-10004/3/5
>  does not exist
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:408)
>   at 
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:498)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:82)
>   at 

[jira] [Commented] (HIVE-17399) Do not remove semijoin branch if it feeds to TS->DPP_EVENT

2017-08-30 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146827#comment-16146827
 ] 

Gopal V commented on HIVE-17399:


LGTM - +1 tests pending.

One minor nit - "assert false;" should be followed by a return, since not 
everyone runs with -ea.
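
For context: {{assert}} statements are no-ops unless the JVM runs with 
assertions enabled ({{-ea}}), so control flow must stay safe without them. An 
illustrative fragment (not the actual patch):

{code:java}
// Without -ea the assert below does nothing, so execution would fall through
// into the unexpected branch; the explicit return keeps the behavior sane.
public class AssertSketch {
  static void handle(Object node) {
    if (!(node instanceof String)) {
      assert false : "unexpected node type";
      return; // defensive return for JVMs running without -ea
    }
    // ... normal handling of the expected case ...
  }
}
{code}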



> Do not remove semijoin branch if it feeds to TS->DPP_EVENT
> --
>
> Key: HIVE-17399
> URL: https://issues.apache.org/jira/browse/HIVE-17399
> Project: Hive
>  Issue Type: Bug
>Reporter: Deepak Jaiswal
>Assignee: Deepak Jaiswal
> Attachments: HIVE-17399.1.patch, HIVE-17399.2.patch
>
>
> If there is an incoming semijoin branch to a TS which has a DPP event, then try 
> to keep it, as it may serve as an excellent filter for DPP, drastically reducing 
> the input to the join.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


  1   2   >