[jira] [Created] (HIVE-21323) LEFT OUTER JOIN does not generate transitive IS NOT NULL filter on right side

2019-02-25 Thread Vineet Garg (JIRA)
Vineet Garg created HIVE-21323:
--

 Summary: LEFT OUTER JOIN does not generate transitive IS NOT NULL filter on right side
 Key: HIVE-21323
 URL: https://issues.apache.org/jira/browse/HIVE-21323
 Project: Hive
  Issue Type: Improvement
Reporter: Vineet Garg
 Fix For: 4.0.0


{code:sql}
select a.id from a  left outer join c on a.id = c.id
{code}

CBO plan:
{code:sql}
HiveProject(id=[$0])
  HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], cost=[{6.0 rows, 0.0 cpu, 0.0 io}])
    HiveProject(id=[$0])
      HiveTableScan(table=[[hive_21322, a]], table:alias=[a])
    HiveProject(id=[$0])
      HiveTableScan(table=[[hive_21322, c]], table:alias=[c])
{code}

Explain Plan:
{code:sql}
Stage: Stage-1
  Tez
    DagId: vgarg_20190225222008_083d8041-b5dc-4af1-9dac-4ff5305ab864:10
    Edges:
      Map 1 <- Map 2 (BROADCAST_EDGE)
    DagName: vgarg_20190225222008_083d8041-b5dc-4af1-9dac-4ff5305ab864:10
    Vertices:
      Map 1
          Map Operator Tree:
              TableScan
                alias: a
                Statistics: Num rows: 3 Data size: 255 Basic stats: COMPLETE Column stats: COMPLETE
                Select Operator
                  expressions: id (type: string)
                  outputColumnNames: _col0
                  Statistics: Num rows: 3 Data size: 255 Basic stats: COMPLETE Column stats: COMPLETE
                  Map Join Operator
                    condition map:
                         Left Outer Join 0 to 1
                    keys:
                      0 _col0 (type: string)
                      1 _col0 (type: string)
                    outputColumnNames: _col0
                    input vertices:
                      1 Map 2
                    Statistics: Num rows: 3 Data size: 255 Basic stats: COMPLETE Column stats: COMPLETE
                    HybridGraceHashJoin: true
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 3 Data size: 255 Basic stats: COMPLETE Column stats: COMPLETE
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
          Execution mode: vectorized
      Map 2
          Map Operator Tree:
              TableScan
                alias: c
                Statistics: Num rows: 3 Data size: 258 Basic stats: COMPLETE Column stats: COMPLETE
                Select Operator
                  expressions: id (type: string)
                  outputColumnNames: _col0
                  Statistics: Num rows: 3 Data size: 258 Basic stats: COMPLETE Column stats: COMPLETE
                  Reduce Output Operator
                    key expressions: _col0 (type: string)
                    sort order: +
                    Map-reduce partition columns: _col0 (type: string)
                    Statistics: Num rows: 3 Data size: 258 Basic stats: COMPLETE Column stats: COMPLETE
          Execution mode: vectorized

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}

There is no IS NOT NULL filter on {{c.id}}, even though the join condition {{a.id = c.id}} implies that rows of {{c}} with a NULL {{id}} can never match.
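
For comparison, a hand-written rewrite that makes the expected transitive filter explicit (a sketch only, using the tables from the example above):

{code:sql}
-- The join key on the right side can be filtered before the join:
-- rows of c with a NULL id can never satisfy a.id = c.id, and the
-- LEFT OUTER JOIN would drop the non-matching rows of c anyway.
select a.id
from a
left outer join (select id from c where id is not null) c
on a.id = c.id;
{code}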







[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r260134323
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
 ##
 @@ -5216,7 +5216,8 @@ private int updateFirstIncPendingFlag(Hive hive, ReplSetFirstIncLoadFlagDesc des
       for (String tableName : Utils.matchesTbl(hive, dbNameOrPattern, tableNameOrPattern)) {
         org.apache.hadoop.hive.metastore.api.Table tbl = hive.getMSC().getTable(dbNameOrPattern, tableName);
         parameters = tbl.getParameters();
-        if (ReplUtils.isFirstIncPending(parameters)) {
+        String incPendPara = parameters != null ? parameters.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG) : null;
+        if (incPendPara != null && (!flag.equalsIgnoreCase(incPendPara))) {
 
 Review comment:
   No. As per the current code, we change this value only when we are going to change it to the other one; otherwise there is no need to change it, as this is called only after the incremental load is done.




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r260131931
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
 ##
 @@ -5225,7 +5226,8 @@ private int updateFirstIncPendingFlag(Hive hive, ReplSetFirstIncLoadFlagDesc des
       for (String dbName : Utils.matchesDb(hive, dbNameOrPattern)) {
         Database database = hive.getMSC().getDatabase(dbName);
         parameters = database.getParameters();
-        if (ReplUtils.isFirstIncPending(parameters)) {
+        String incPendPara = parameters != null ? parameters.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG) : null;
+        if (incPendPara != null && (!flag.equalsIgnoreCase(incPendPara))) {
 
 Review comment:
   Same as above.




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r260131896
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
 ##
 @@ -5216,7 +5216,8 @@ private int updateFirstIncPendingFlag(Hive hive, ReplSetFirstIncLoadFlagDesc des
       for (String tableName : Utils.matchesTbl(hive, dbNameOrPattern, tableNameOrPattern)) {
         org.apache.hadoop.hive.metastore.api.Table tbl = hive.getMSC().getTable(dbNameOrPattern, tableName);
         parameters = tbl.getParameters();
-        if (ReplUtils.isFirstIncPending(parameters)) {
+        String incPendPara = parameters != null ? parameters.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG) : null;
+        if (incPendPara != null && (!flag.equalsIgnoreCase(incPendPara))) {
 
 Review comment:
   If incPendPara is null we also need to set the flag; as written, it will be skipped.
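   A minimal sketch of the suggested check (illustrative only; variable names follow the diff above):
   ```java
   // Sketch: treat a missing parameter as "needs update" instead of skipping it.
   String incPendPara = parameters != null
       ? parameters.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG) : null;
   if (incPendPara == null || !flag.equalsIgnoreCase(incPendPara)) {
     // set REPL_FIRST_INC_PENDING_FLAG to the new value here
   }
   ```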




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r260131264
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java
 ##
 @@ -426,4 +427,14 @@ public static void copyLastReplId(Map srcParameter, Map

[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r260131280
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ReplSetFirstIncLoadFlagDesc.java
 ##
 @@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.plan;
+import org.apache.hadoop.hive.ql.plan.Explain.Level;
+
+import java.io.Serializable;
+
+/**
+ * ReplSetFirstIncLoadFlagDesc.
+ *
+ */
+@Explain(displayName = "Set First Incr Load Flag", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED })
+public class ReplSetFirstIncLoadFlagDesc extends DDLDesc implements Serializable {
+
+  private static final long serialVersionUID = 1L;
+  String databaseName;
+  String tableName;
+  boolean incLoadPendingFlag;
+
+  /**
+   * For serialization only.
+   */
+  public ReplSetFirstIncLoadFlagDesc() {
+  }
+
+  public ReplSetFirstIncLoadFlagDesc(String databaseName, String tableName, boolean incLoadPendingFlag) {
+    super();
+    this.databaseName = databaseName;
+    this.tableName = tableName;
+    this.incLoadPendingFlag = incLoadPendingFlag;
+  }
+
+  @Explain(displayName="db_name", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED })
+  public String getDatabaseName() {
+    return databaseName;
+  }
+
+  public void setDatabaseName(String databaseName) {
+    this.databaseName = databaseName;
+  }
+
+  @Explain(displayName="table_name", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED })
+  public String getTableName() {
+    return tableName;
+  }
+
+  public void setTableName(String tableName) {
+    this.tableName = tableName;
+  }
+
+  @Explain(displayName="inc load pending flag", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED })
 
 Review comment:
   OK




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r260131319
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1164,6 +1168,9 @@ private static void createReplImportTasks(
       if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) {
         dropTblTask = dropTableTask(table, x, replicationSpec);
         table = null;
+      } else if (!firstIncPending) {
+        // For table level replication, get the flag from table parameter. Check HIVE-21197 for more detail.
+        firstIncPending = ReplUtils.isFirstIncPending(table.getParameters());
 
 Review comment:
   OK




[GitHub] maheshk114 commented on a change in pull request #549: HIVE-21314 : Hive Replication not retaining the owner in the replicated table

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #549: HIVE-21314 : Hive 
Replication not retaining the owner in the replicated table
URL: https://github.com/apache/hive/pull/549#discussion_r260118313
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -316,6 +316,7 @@ public static boolean prepareImport(boolean isImportCmd,
       }
       inReplicationScope = true;
       tblDesc.setReplWriteId(writeId);
+      tblDesc.setOwnerName(tblObj.getOwner());
 
 Review comment:
   It's done only for the repl flow.




[GitHub] maheshk114 commented on a change in pull request #549: HIVE-21314 : Hive Replication not retaining the owner in the replicated table

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #549: HIVE-21314 : Hive 
Replication not retaining the owner in the replicated table
URL: https://github.com/apache/hive/pull/549#discussion_r260118275
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ExternalTableCopyTaskBuilder.java
 ##
 @@ -54,14 +59,53 @@
   List<Task<? extends Serializable>> tasks(TaskTracker tracker) {
     List<Task<? extends Serializable>> tasks = new ArrayList<>();
     Iterator<DirCopyWork> itr = work.getPathsToCopyIterator();
-    while (tracker.canAddMoreTasks() && itr.hasNext()) {
+    int numTaskCanBeAdded = tracker.numTaskCanBeAdded();
+    Task<? extends Serializable> barrierTask = TaskFactory.get(new DependencyCollectionWork(), conf);
+    while (numTaskCanBeAdded-- > 0 && itr.hasNext()) {
       DirCopyWork dirCopyWork = itr.next();
       Task<? extends Serializable> task = TaskFactory.get(dirCopyWork, conf);
       tasks.add(task);
-      tracker.addTask(task);
+      barrierTask.addDependentTask(task);
       LOG.debug("added task for {}", dirCopyWork);
     }
-    return tasks;
+
+    if (!tasks.isEmpty()) {
+      tracker.addDependentTask(barrierTask);
+      tracker.addTaskList(tasks);
+      return Collections.singletonList(barrierTask);
+    } else {
+      return tasks;
+    }
+  }
+
+  private static Integer setTargetPathOwnerInt(Path targetPath, Path sourcePath, HiveConf conf) throws IOException {
+    FileSystem targetFs = targetPath.getFileSystem(conf);
+    if (!targetFs.exists(targetPath)) {
+      targetFs.create(targetPath);
+    }
+    FileStatus status = sourcePath.getFileSystem(conf).getFileStatus(sourcePath);
+    if (status == null) {
+      throw new IOException("source path missing " + sourcePath);
+    }
+    targetPath.getFileSystem(conf).setOwner(targetPath, status.getOwner(), status.getGroup());
+    return null;
+  }
+
+  private static Integer setTargetPathOwner(Path targetPath, Path sourcePath, HiveConf conf, String distCpDoAsUser)
+      throws IOException {
+    if (distCpDoAsUser == null) {
 
 Review comment:
   Same as the distcp case below; I think it's better to handle it.




[GitHub] maheshk114 commented on a change in pull request #549: HIVE-21314 : Hive Replication not retaining the owner in the replicated table

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #549: HIVE-21314 : Hive 
Replication not retaining the owner in the replicated table
URL: https://github.com/apache/hive/pull/549#discussion_r260118223
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ExternalTableCopyTaskBuilder.java
 ##
 @@ -54,14 +59,53 @@
   List<Task<? extends Serializable>> tasks(TaskTracker tracker) {
     List<Task<? extends Serializable>> tasks = new ArrayList<>();
     Iterator<DirCopyWork> itr = work.getPathsToCopyIterator();
-    while (tracker.canAddMoreTasks() && itr.hasNext()) {
+    int numTaskCanBeAdded = tracker.numTaskCanBeAdded();
+    Task<? extends Serializable> barrierTask = TaskFactory.get(new DependencyCollectionWork(), conf);
+    while (numTaskCanBeAdded-- > 0 && itr.hasNext()) {
       DirCopyWork dirCopyWork = itr.next();
       Task<? extends Serializable> task = TaskFactory.get(dirCopyWork, conf);
       tasks.add(task);
-      tracker.addTask(task);
+      barrierTask.addDependentTask(task);
       LOG.debug("added task for {}", dirCopyWork);
     }
-    return tasks;
+
+    if (!tasks.isEmpty()) {
+      tracker.addDependentTask(barrierTask);
+      tracker.addTaskList(tasks);
+      return Collections.singletonList(barrierTask);
+    } else {
+      return tasks;
+    }
+  }
+
+  private static Integer setTargetPathOwnerInt(Path targetPath, Path sourcePath, HiveConf conf) throws IOException {
+    FileSystem targetFs = targetPath.getFileSystem(conf);
+    if (!targetFs.exists(targetPath)) {
+      targetFs.create(targetPath);
+    }
+    FileStatus status = sourcePath.getFileSystem(conf).getFileStatus(sourcePath);
+    if (status == null) {
+      throw new IOException("source path missing " + sourcePath);
 
 Review comment:
   Can the same issue happen for the distcp done below as well?




[GitHub] maheshk114 commented on a change in pull request #549: HIVE-21314 : Hive Replication not retaining the owner in the replicated table

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #549: HIVE-21314 : Hive 
Replication not retaining the owner in the replicated table
URL: https://github.com/apache/hive/pull/549#discussion_r260118102
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ExternalTableCopyTaskBuilder.java
 ##
 @@ -54,14 +59,53 @@
   List<Task<? extends Serializable>> tasks(TaskTracker tracker) {
     List<Task<? extends Serializable>> tasks = new ArrayList<>();
     Iterator<DirCopyWork> itr = work.getPathsToCopyIterator();
-    while (tracker.canAddMoreTasks() && itr.hasNext()) {
+    int numTaskCanBeAdded = tracker.numTaskCanBeAdded();
+    Task<? extends Serializable> barrierTask = TaskFactory.get(new DependencyCollectionWork(), conf);
+    while (numTaskCanBeAdded-- > 0 && itr.hasNext()) {
       DirCopyWork dirCopyWork = itr.next();
       Task<? extends Serializable> task = TaskFactory.get(dirCopyWork, conf);
       tasks.add(task);
-      tracker.addTask(task);
+      barrierTask.addDependentTask(task);
       LOG.debug("added task for {}", dirCopyWork);
     }
-    return tasks;
+
+    if (!tasks.isEmpty()) {
+      tracker.addDependentTask(barrierTask);
+      tracker.addTaskList(tasks);
 
 Review comment:
   removed




[jira] [Created] (HIVE-21322) Multiple table LEFT OUTER JOIN results are incorrect when 'is not null' used in WHERE clause.

2019-02-25 Thread James Norvell (JIRA)
James Norvell created HIVE-21322:


 Summary: Multiple table LEFT OUTER JOIN results are incorrect when 'is not null' used in WHERE clause.
 Key: HIVE-21322
 URL: https://issues.apache.org/jira/browse/HIVE-21322
 Project: Hive
  Issue Type: Bug
  Components: CBO
Affects Versions: 2.3.4
 Environment: Hive 2.3.4 (emr-5.21.0)
Reporter: James Norvell
 Attachments: explain-plans.txt

Reproduction:

Create tables: 
{code:java}
create table a (id string); insert into a values (1),(2),(3);

create table b (id string, name string); insert into b values (1,'a'),(2,'b'),(3,null);

create table c (id string); insert into c values (11),(22),(33);

{code}
When joining a -> b -> c on id, the following query is correct: 
{code:java}
select a.id, b.name from a left outer join b on a.id = b.id left outer join c on a.id = c.id where b.name is not null;

OK
1    a
2    b
Time taken: 10.231 seconds, Fetched: 2 row(s)
{code}
Switching the join order to a -> c -> b produces incorrect results: 
{code:java}
select a.id, b.name from a 
left outer join c on a.id = c.id 
left outer join b on a.id = b.id 
where b.name is not null;

OK
2    b
Time taken: 10.321 seconds, Fetched: 1 row(s)
{code}
Disabling hive.cbo.enable or changing the execution engine to mr avoids the issue: 
{code:java}
set hive.cbo.enable=false;
select a.id, b.name from a left outer join c on a.id = c.id left outer join b on a.id = b.id where b.name is not null;
OK
1    a
2    b
Time taken: 9.614 seconds, Fetched: 2 row(s)


set hive.cbo.enable=true;
set hive.execution.engine=mr;
select a.id, b.name from a left outer join c on a.id = c.id left outer join b on a.id = b.id where b.name is not null;
OK
1    a
2    b
Time taken: 29.377 seconds, Fetched: 2 row(s)
{code}
The issue doesn't occur when using 'is null':
{code:java}
select a.id, b.name from a left outer join c on a.id = c.id left outer join b on a.id = b.id where b.name is null;

OK
3    NULL
Time taken: 9.673 seconds, Fetched: 1 row(s)
{code}
Explain plans for the queries are attached.





[jira] [Created] (HIVE-21321) Remove Class HiveIOExceptionHandlerChain

2019-02-25 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21321:
--

 Summary: Remove Class HiveIOExceptionHandlerChain
 Key: HIVE-21321
 URL: https://issues.apache.org/jira/browse/HIVE-21321
 Project: Hive
  Issue Type: Task
Affects Versions: 4.0.0
Reporter: BELUGA BEHR


I recently stumbled upon this code while tracking down an issue: {{HiveIOExceptionHandlerChain.java}}

Is anyone using this feature? It has a configuration associated with it, {{hive.io.exception.handlers}}. 

The code doesn't seem to have any unit tests.

Can this feature simply be removed?
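
For reference, a sketch of how the hook would be wired up if anyone were using it (the handler class name here is hypothetical):

{code}
set hive.io.exception.handlers=com.example.MyIOExceptionHandler;
{code}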





[jira] [Created] (HIVE-21320) get_fields() and get_tables_by_type() are not protected by HMS server access control

2019-02-25 Thread Na Li (JIRA)
Na Li created HIVE-21320:


 Summary: get_fields() and get_tables_by_type() are not protected by HMS server access control
 Key: HIVE-21320
 URL: https://issues.apache.org/jira/browse/HIVE-21320
 Project: Hive
  Issue Type: Bug
Affects Versions: 4.0.0
Reporter: Na Li
Assignee: Na Li


A user without any privilege can call these functions and get all metadata back, as if the user had full access privileges.
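
For illustration, a minimal sketch of the exposure (database/table names are made up; imports omitted; the calls are standard metastore client API):

{code:java}
// Sketch: neither call below goes through HMS server access control,
// so an unprivileged user still gets the metadata back.
IMetaStoreClient client = new HiveMetaStoreClient(conf);
List<FieldSchema> fields = client.getFields("secure_db", "secure_table");
List<String> tables = client.getTables("secure_db", "*", TableType.MANAGED_TABLE);
{code}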







[GitHub] miklosgergely commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
miklosgergely commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259958209
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
 ##
 @@ -4230,6 +4229,34 @@ public static long unsetBit(long bitmap, int bitIdx) {
 }
   }
 
+  protected boolean isGroupBy(ASTNode expr) {
+    boolean isGroupBy = false;
+    if (expr.getParent() != null && expr.getParent() instanceof Node) {
+      for (Node sibling : ((Node) expr.getParent()).getChildren()) {
+        isGroupBy |= sibling instanceof ASTNode && ((ASTNode) sibling).getType() == HiveParser.TOK_GROUPBY;
+      }
+    }
+    return isGroupBy;
+  }
+
+  protected boolean isSelectDistinct(ASTNode expr) {
+    return expr.getType() == HiveParser.TOK_SELECTDI;
+  }
+
+  protected boolean isAggregateInSelect(Node node, Collection<ASTNode> aggregateFunction) {
+    if (node.getChildren() == null) {
+      return false;
+    }
+
+    for (Node child : node.getChildren()) {
 
 Review comment:
   I doubt there is any. The above example is not valid; it says:
   
   Unsupported SubQuery Expression Invalid subquery. Subquery with DISTINCT clause is not supported!
   




[GitHub] miklosgergely commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
miklosgergely commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259953429
 
 

 ##
 File path: ql/src/test/queries/clientpositive/distinct_groupby.q
 ##
 @@ -0,0 +1,57 @@
+--! qt:dataset:src1
+
 
 Review comment:
   Adding q tests for non-CBO as well, good idea!




[GitHub] miklosgergely commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
miklosgergely commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259953166
 
 

 ##
 File path: ql/src/test/results/clientpositive/distinct_groupby.q.out
 ##
 @@ -0,0 +1,1530 @@
+PREHOOK: query: explain select distinct count(*) from src1 where key in (128,146,150)
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+ A masked pattern was here 
+POSTHOOK: query: explain select distinct count(*) from src1 where key in (128,146,150)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+ A masked pattern was here 
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-0 depends on stages: Stage-1
+
+STAGE PLANS:
+  Stage: Stage-1
+    Map Reduce
+      Map Operator Tree:
+          TableScan
+            alias: src1
+            filterExpr: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: boolean)
+            Statistics: Num rows: 25 Data size: 2150 Basic stats: COMPLETE Column stats: COMPLETE
+            Filter Operator
+              predicate: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: boolean)
+              Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE Column stats: COMPLETE
+              Select Operator
+                Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE Column stats: COMPLETE
+                Group By Operator
+                  aggregations: count()
+                  mode: hash
+                  outputColumnNames: _col0
+                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
+                  Reduce Output Operator
+                    sort order: 
+                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
+                    value expressions: _col0 (type: bigint)
+      Execution mode: vectorized
+      Reduce Operator Tree:
+        Group By Operator
+          aggregations: count(VALUE._col0)
+          mode: mergepartial
+          outputColumnNames: _col0
+          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
+          File Output Operator
+            compressed: false
+            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
+            table:
+                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
+                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
+                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+
+  Stage: Stage-0
+    Fetch Operator
+      limit: -1
+      Processor Tree:
+        ListSink
+
+PREHOOK: query: select distinct count(*) from src1 where key in (128,146,150)
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+ A masked pattern was here 
+POSTHOOK: query: select distinct count(*) from src1 where key in (128,146,150)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+ A masked pattern was here 
+3
+PREHOOK: query: explain select distinct * from (select distinct count(*) from src1 where key in (128,146,150)) as T
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+ A masked pattern was here 
+POSTHOOK: query: explain select distinct * from (select distinct count(*) from src1 where key in (128,146,150)) as T
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+ A masked pattern was here 
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-0 depends on stages: Stage-1
+
+STAGE PLANS:
+  Stage: Stage-1
+    Map Reduce
+      Map Operator Tree:
+          TableScan
+            alias: src1
+            filterExpr: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: boolean)
+            Statistics: Num rows: 25 Data size: 2150 Basic stats: COMPLETE Column stats: COMPLETE
+            Filter Operator
+              predicate: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: boolean)
+              Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE Column stats: COMPLETE
+              Select Operator
+                Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE Column stats: COMPLETE
+                Group By Operator
+                  aggregations: count()
+                  mode: hash
+                  outputColumnNames: _col0
+                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
+                  Reduce Output Operator
+                    sort order: 
+                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
+                    value expressions: _col0 (type: bigint)
+      Execution mode: vectorized
+      Reduce Operator Tree:
+        Group By Operator
+          aggregations: count(VALUE._col0)
+          mode: mergepartial
+          outputColumnNames: _col0
+          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Colum

Re: Review Request 69918: HIVE-21001 Update to Calcite 1.18

2019-02-25 Thread Zoltan Haindrich


> On Feb. 7, 2019, 10:16 p.m., Ashutosh Chauhan wrote:
> > ql/src/test/results/clientpositive/constant_prop_3.q.out
> > Line 286 (original), 286 (patched)
> > 
> >
> > New expression tree is longer compared to original. I guess we try to 
> > apply DeMorgan theorem here, but in this case its a net loss. Perhaps, we 
> > can add a (simple) logic which says if node count in expression tree grows 
> > after the application of theorem we throw away that.
> 
> Zoltan Haindrich wrote:
> simplification is too conservative in 1.18; see: CALCITE-2840
> 
> Ashutosh Chauhan wrote:
> We shall make CALCITE-2840 a blocker for the 1.19 release since it's a regression.

I might have been wrong... this is still present; bisecting to see what caused it.


> On Feb. 7, 2019, 10:16 p.m., Ashutosh Chauhan wrote:
> > ql/src/test/results/clientpositive/llap/subquery_multi.q.out
> > Lines 2312-2313 (patched)
> > 
> >
> > Worse plan than earlier.
> 
> Zoltan Haindrich wrote:
> It seems like the more accurate equals/hashCode caused this change; before
> CALCITE-2632, RexCorrelVariables were not properly compared; it seems that
> this helped/interfered with HiveRelDecorrelator's operations.
> 
> 
> https://github.com/apache/calcite/blob/ef9f926061de21ad713a76ec3ec8110e5cbd92bf/core/src/main/java/org/apache/calcite/rex/RexCorrelVariable.java#L59

Fixed in the latest patch.


> On Feb. 7, 2019, 10:16 p.m., Ashutosh Chauhan wrote:
> > ql/src/test/results/clientpositive/perf/tez/cbo_query13.q.out
> > Lines 117-121 (original), 117-123 (patched)
> > 
> >
> > Looks like join order has changed. Is new order better?

The join order is the same; however, one of the inner joins has its arguments swapped.


> On Feb. 7, 2019, 10:16 p.m., Ashutosh Chauhan wrote:
> > ql/src/test/results/clientpositive/perf/tez/constraints/cbo_ext_query1.q.out
> > Lines 62-65 (original), 62-65 (patched)
> > 
> >
> > Join order changed. Is new order better?

The select has a `val > 1.2 * avg(val)` part;

in the latest patch the join against the Customer table is postponed until the matching rows for the above are found; earlier, Customer was joined on the `val` side first.
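
Roughly the shape involved, as a sketch (table and column names are made up, not the actual TPC-DS text):

    select c.name
    from t
    join customer c on t.customer_id = c.id
    where t.val > 1.2 * (select avg(t2.val) from t t2 where t2.grp = t.grp);

Postponing the Customer join means it only runs against the rows that survive the avg(val) filter.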


> On Feb. 7, 2019, 10:16 p.m., Ashutosh Chauhan wrote:
> > ql/src/test/results/clientpositive/perf/tez/constraints/cbo_query64.q.out
> > Line 301 (original), 301 (patched)
> > 
> >
> > Cast should get folded?

It will be folded now; see CALCITE-2852.


> On Feb. 7, 2019, 10:16 p.m., Ashutosh Chauhan wrote:
> > ql/src/test/results/clientpositive/perf/tez/query70.q.out
> > Line 113 (original), 113 (patched)
> > 
> >
> > UDFToLong(0) should be folded. Can you file a follow-up jira for it?
> 
> Zoltan Haindrich wrote:
> yes; cast(null as string) also seems to be odd
> at the AST level it looks good - Calcite doesn't seem to be leaving an explicit cast there
> 
> Ashutosh Chauhan wrote:
> is this tracked in a jira?

opened: HIVE-21319


- Zoltan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69918/#review212637
---


On Feb. 7, 2019, 8:08 p.m., Zoltan Haindrich wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69918/
> ---
> 
> (Updated Feb. 7, 2019, 8:08 p.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan and Jesús Camacho Rodríguez.
> 
> 
> Bugs: HIVE-21001
> https://issues.apache.org/jira/browse/HIVE-21001
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> patch#1 here is #23 on jira
> 
> 
> Diffs
> -
> 
>   
> accumulo-handler/src/test/results/positive/accumulo_predicate_pushdown.q.out 
> 8a1e0609f9f48434d8147c296984bbc0a6cbae35 
>   hbase-handler/src/test/results/positive/hbase_ppd_key_range.q.out 
> 5e051543133125a57dbf5b83b62f0a13cf7f415a 
>   hbase-handler/src/test/results/positive/hbase_pushdown.q.out 
> 57613c36f9b3376469b1b05e9a9df59bd5365450 
>   pom.xml 240472a30e033a83d1c799e636d8df29cb2c5770 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveRelBuilder.java 
> e85a99e84658a69c4fd93a6c352af4ead768ef67 
>   ql/src/test/queries/clientpositive/druidmini_expressions.q 
> 36aad7937d556e013773f29ecd89bf0629c1937d 
>   ql/src/test/results/clientpositive/alter_partition_coltype.q.out 
> d484f9e2237402fa475cb79a182340d7d83dadb9 
>   ql/src/

[jira] [Created] (HIVE-21319) UDFToLong should be folded in windowing expression

2019-02-25 Thread Zoltan Haindrich (JIRA)
Zoltan Haindrich created HIVE-21319:
---

 Summary: UDFToLong should be folded in windowing expression
 Key: HIVE-21319
 URL: https://issues.apache.org/jira/browse/HIVE-21319
 Project: Hive
  Issue Type: Bug
Reporter: Zoltan Haindrich
Assignee: Zoltan Haindrich


Some simplifications seem not to happen in windowing expressions.

https://reviews.apache.org/r/69918/#comment298485





[jira] [Created] (HIVE-21318) Update thrift client library in branch-3

2019-02-25 Thread Zoltan Haindrich (JIRA)
Zoltan Haindrich created HIVE-21318:
---

 Summary: Update thrift client library in branch-3
 Key: HIVE-21318
 URL: https://issues.apache.org/jira/browse/HIVE-21318
 Project: Hive
  Issue Type: Bug
Reporter: Zoltan Haindrich


https://issues.apache.org/jira/browse/THRIFT-4506?focusedCommentId=16772298&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16772298





[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259878502
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
 ##
 @@ -4230,6 +4229,34 @@ public static long unsetBit(long bitmap, int bitIdx) {
 }
   }
 
+  protected boolean isGroupBy(ASTNode expr) {
+    boolean isGroupBy = false;
+    if (expr.getParent() != null && expr.getParent() instanceof Node) {
+      for (Node sibling : ((Node) expr.getParent()).getChildren()) {
+        isGroupBy |= sibling instanceof ASTNode && ((ASTNode) sibling).getType() == HiveParser.TOK_GROUPBY;
+      }
+    }
+    return isGroupBy;
+  }
+
+  protected boolean isSelectDistinct(ASTNode expr) {
+    return expr.getType() == HiveParser.TOK_SELECTDI;
+  }
+
+  protected boolean isAggregateInSelect(Node node, Collection<ASTNode> aggregateFunction) {
+    if (node.getChildren() == null) {
+      return false;
+    }
+
+    for (Node child : node.getChildren()) {
 
 Review comment:
   I was thinking of something really odd:
   ```
   select distinct (select count(*) from t where t.a=e.a) from e
   ```
   but in this case (beyond the fact that it might not be accepted by Hive at all) the count aggregate is not present at the top level.
   Do you know an example where this method returns false even though there are aggregations being done?




[jira] [Created] (HIVE-21317) Unit Test kafka_storage_handler Is Failing Regularly

2019-02-25 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21317:
--

 Summary: Unit Test kafka_storage_handler Is Failing Regularly
 Key: HIVE-21317
 URL: https://issues.apache.org/jira/browse/HIVE-21317
 Project: Hive
  Issue Type: Task
Affects Versions: 4.0.0
Reporter: BELUGA BEHR


{code}
org.apache.hadoop.hive.cli.TestMiniHiveKafkaCliDriver.testCliDriver[kafka_storage_handler] (batchId=275)
{code}





[GitHub] miklosgergely commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
miklosgergely commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259857044
 
 

 ##
 File path: ql/src/test/results/clientpositive/distinct_groupby.q.out
 ##

[GitHub] miklosgergely commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
miklosgergely commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259839288
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
 ##
 @@ -4230,6 +4229,34 @@ public static long unsetBit(long bitmap, int bitIdx) {
 }
   }
 
+  protected boolean isGroupBy(ASTNode expr) {
+    boolean isGroupBy = false;
+    if (expr.getParent() != null && expr.getParent() instanceof Node) {
+      for (Node sibling : ((Node) expr.getParent()).getChildren()) {
+        isGroupBy |= sibling instanceof ASTNode && ((ASTNode) sibling).getType() == HiveParser.TOK_GROUPBY;
+      }
+    }
+    return isGroupBy;
+  }
+
+  protected boolean isSelectDistinct(ASTNode expr) {
+    return expr.getType() == HiveParser.TOK_SELECTDI;
+  }
+
+  protected boolean isAggregateInSelect(Node node, Collection<ASTNode> aggregateFunction) {
+    if (node.getChildren() == null) {
+      return false;
+    }
+
+    for (Node child : node.getChildren()) {
 
 Review comment:
   I don't see how; all the nodes under the SELECT DISTINCT node are for expressions that may be aggregations. Under the SELECT DISTINCT node there may be things like "1 + count(*)", in which case the aggregation is at a lower level in the tree.
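   For what it's worth, a recursive variant would also catch that nested case (a sketch only, assuming the same Node/ASTNode types and aggregation collection as above):
   ```java
   // Sketch: walk the whole subtree so aggregations nested inside
   // expressions such as "1 + count(*)" are found as well.
   protected boolean containsAggregate(Node node, Collection<ASTNode> aggregateFunctions) {
     if (node.getChildren() == null) {
       return false;
     }
     for (Node child : node.getChildren()) {
       if (aggregateFunctions.contains(child) || containsAggregate(child, aggregateFunctions)) {
         return true;
       }
     }
     return false;
   }
   ```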




[GitHub] miklosgergely commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
miklosgergely commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259837145
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
 ##
 @@ -4230,6 +4229,34 @@ public static long unsetBit(long bitmap, int bitIdx) {
 }
   }
 
+  protected boolean isGroupBy(ASTNode expr) {
 
 Review comment:
   renamed to hasGroupBySibling(ASTNode selectExpr)




[GitHub] dlavati opened a new pull request #550: HIVE-21198 Introduce a database object reference class

2019-02-25 Thread GitBox
dlavati opened a new pull request #550: HIVE-21198 Introduce a database object 
reference class
URL: https://github.com/apache/hive/pull/550
 
 
   




[GitHub] miklosgergely commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
miklosgergely commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259832367
 
 

 ##
 File path: ql/src/test/results/clientpositive/distinct_groupby.q.out
 ##

[GitHub] miklosgergely commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
miklosgergely commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259824341
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
 ##
 @@ -4194,27 +4191,29 @@ public static long unsetBit(long bitmap, int bitIdx) {
   }
 
   /**
-   * This function is a wrapper of parseInfo.getGroupByForClause which
-   * automatically translates SELECT DISTINCT a,b,c to SELECT a,b,c GROUP BY
-   * a,b,c.
+   * Returns the GBY, if present;
+   * DISTINCT, if present, will be handled when generating the SELECT.
*/
   List<ASTNode> getGroupByForClause(QBParseInfo parseInfo, String dest) throws SemanticException {
-    if (parseInfo.getSelForClause(dest).getToken().getType() == HiveParser.TOK_SELECTDI) {
-      ASTNode selectExprs = parseInfo.getSelForClause(dest);
-      List<ASTNode> result = new ArrayList<ASTNode>(selectExprs == null ? 0
-          : selectExprs.getChildCount());
-      if (selectExprs != null) {
-        for (int i = 0; i < selectExprs.getChildCount(); ++i) {
-          if (((ASTNode) selectExprs.getChild(i)).getToken().getType() == HiveParser.QUERY_HINT) {
+    // When *not* invoked by CalcitePlanner, return the DISTINCT as a GBY;
+    // CBO will handle the DISTINCT in CalcitePlannerAction.genSelectLogicalPlan
+    ASTNode selectExpr = parseInfo.getSelForClause(dest);
+    Collection<ASTNode> aggregateFunction = parseInfo.getDestToAggregationExprs().get(dest).values();
+    if (isSelectDistinct(selectExpr) && !isGroupBy(selectExpr) && !isAggregateInSelect(selectExpr, aggregateFunction)) {
+      List<ASTNode> result = new ArrayList<ASTNode>(selectExpr == null ? 0 : selectExpr.getChildCount());
+      if (selectExpr != null) {
 
 Review comment:
   agree, removed




[GitHub] miklosgergely commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
miklosgergely commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259823592
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
 ##
 @@ -3730,7 +3697,9 @@ private RelNode genGBLogicalPlan(QB qb, RelNode srcRel) throws SemanticException
     ASTNode node = (ASTNode) selExprList.getChild(0).getChild(0);
     if (node.getToken().getType() == HiveParser.TOK_ALLCOLREF) {
       // As we said before, here we use genSelectLogicalPlan to rewrite AllColRef
-      srcRel = genSelectLogicalPlan(qb, srcRel, srcRel, null, null, true).getKey();
+      if (!(isSelectDistinct(selExprList) && isGroupBy(selExprList))) {
 
 Review comment:
   you are right, fixed.




[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259820997
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java
 ##
 @@ -289,12 +296,21 @@ private boolean shouldReplayEvent(FileStatus dir, DumpType dumpType, String dbNa
     return updateReplIdTask;
   }
 
-  private Task<? extends Serializable> dbUpdateReplStateTask(String dbName, String replState,
+  private Task<? extends Serializable> dbUpdateReplStateTask(String dbName, String replState, String incLoadPendFlag,
                                                              Task<? extends Serializable> preCursor) {
     HashMap<String, String> mapProp = new HashMap<>();
-    mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState);
 
-    AlterDatabaseDesc alterDbDesc = new AlterDatabaseDesc(dbName, mapProp, new ReplicationSpec(replState, replState));
+    // if the update is for incLoadPendFlag, then send replicationSpec as null to avoid replacement check.
+    ReplicationSpec replicationSpec = null;
+    if (incLoadPendFlag == null) {
+      mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState);
+      replicationSpec = new ReplicationSpec(replState, replState);
+    } else {
+      assert replState == null;
+      mapProp.put(ReplUtils.REPL_FIRST_INC_PENDING_FLAG, incLoadPendFlag);
 
 Review comment:
   Done; added a check, same as the ckpt flag.




[jira] [Created] (HIVE-21316) Comparison of varchar column and string literal should happen in varchar

2019-02-25 Thread Zoltan Haindrich (JIRA)
Zoltan Haindrich created HIVE-21316:
---

 Summary: Comparison of varchar column and string literal should happen in varchar
 Key: HIVE-21316
 URL: https://issues.apache.org/jira/browse/HIVE-21316
 Project: Hive
  Issue Type: Improvement
Reporter: Zoltan Haindrich
Assignee: Zoltan Haindrich


This is most probably the root cause behind HIVE-21310 as well.
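
For illustration, the kind of expression in question (a sketch; {{vc}} is assumed to be declared as a varchar column):

{code:sql}
-- Per the summary, this comparison should be carried out in varchar.
select * from t where vc = 'abc';
{code}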






[GitHub] anishek commented on a change in pull request #549: HIVE-21314 : Hive Replication not retaining the owner in the replicated table

2019-02-25 Thread GitBox
anishek commented on a change in pull request #549: HIVE-21314 : Hive 
Replication not retaining the owner in the replicated table
URL: https://github.com/apache/hive/pull/549#discussion_r259808530
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ExternalTableCopyTaskBuilder.java
 ##
 @@ -54,14 +59,53 @@
   List<Task<? extends Serializable>> tasks(TaskTracker tracker) {
     List<Task<? extends Serializable>> tasks = new ArrayList<>();
     Iterator<DirCopyWork> itr = work.getPathsToCopyIterator();
-    while (tracker.canAddMoreTasks() && itr.hasNext()) {
+    int numTaskCanBeAdded = tracker.numTaskCanBeAdded();
+    Task<? extends Serializable> barrierTask = TaskFactory.get(new DependencyCollectionWork(), conf);
+    while (numTaskCanBeAdded-- > 0 && itr.hasNext()) {
       DirCopyWork dirCopyWork = itr.next();
       Task<? extends Serializable> task = TaskFactory.get(dirCopyWork, conf);
       tasks.add(task);
-      tracker.addTask(task);
+      barrierTask.addDependentTask(task);
       LOG.debug("added task for {}", dirCopyWork);
     }
-    return tasks;
+
+    if (!tasks.isEmpty()) {
+      tracker.addDependentTask(barrierTask);
+      tracker.addTaskList(tasks);
 
 Review comment:
   Why do we need this? All the tasks are already added as dependents of the barrier task.




[GitHub] anishek commented on a change in pull request #549: HIVE-21314 : Hive Replication not retaining the owner in the replicated table

2019-02-25 Thread GitBox
anishek commented on a change in pull request #549: HIVE-21314 : Hive 
Replication not retaining the owner in the replicated table
URL: https://github.com/apache/hive/pull/549#discussion_r259810054
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ExternalTableCopyTaskBuilder.java
 ##
 @@ -54,14 +59,53 @@
   List<Task<? extends Serializable>> tasks(TaskTracker tracker) {
     List<Task<? extends Serializable>> tasks = new ArrayList<>();
     Iterator<DirCopyWork> itr = work.getPathsToCopyIterator();
-    while (tracker.canAddMoreTasks() && itr.hasNext()) {
+    int numTaskCanBeAdded = tracker.numTaskCanBeAdded();
+    Task<? extends Serializable> barrierTask = TaskFactory.get(new DependencyCollectionWork(), conf);
+    while (numTaskCanBeAdded-- > 0 && itr.hasNext()) {
       DirCopyWork dirCopyWork = itr.next();
       Task<? extends Serializable> task = TaskFactory.get(dirCopyWork, conf);
       tasks.add(task);
-      tracker.addTask(task);
+      barrierTask.addDependentTask(task);
       LOG.debug("added task for {}", dirCopyWork);
     }
-    return tasks;
+
+    if (!tasks.isEmpty()) {
+      tracker.addDependentTask(barrierTask);
+      tracker.addTaskList(tasks);
+      return Collections.singletonList(barrierTask);
+    } else {
+      return tasks;
+    }
+  }
+
+  private static Integer setTargetPathOwnerInt(Path targetPath, Path sourcePath, HiveConf conf) throws IOException {
 
 Review comment:
   Why the Integer return type?




[GitHub] anishek commented on a change in pull request #549: HIVE-21314 : Hive Replication not retaining the owner in the replicated table

2019-02-25 Thread GitBox
anishek commented on a change in pull request #549: HIVE-21314 : Hive 
Replication not retaining the owner in the replicated table
URL: https://github.com/apache/hive/pull/549#discussion_r259811426
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -316,6 +316,7 @@ public static boolean prepareImport(boolean isImportCmd,
   }
   inReplicationScope = true;
   tblDesc.setReplWriteId(writeId);
+  tblDesc.setOwnerName(tblObj.getOwner());
 
 Review comment:
   Have to see if this works for the import/export use case, where the required ownership might be different from the replication use case.




[GitHub] anishek commented on a change in pull request #549: HIVE-21314 : Hive Replication not retaining the owner in the replicated table

2019-02-25 Thread GitBox
anishek commented on a change in pull request #549: HIVE-21314 : Hive 
Replication not retaining the owner in the replicated table
URL: https://github.com/apache/hive/pull/549#discussion_r259810994
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ExternalTableCopyTaskBuilder.java
 ##
 @@ -54,14 +59,53 @@
   List<Task<? extends Serializable>> tasks(TaskTracker tracker) {
     List<Task<? extends Serializable>> tasks = new ArrayList<>();
     Iterator<DirCopyWork> itr = work.getPathsToCopyIterator();
-    while (tracker.canAddMoreTasks() && itr.hasNext()) {
+    int numTaskCanBeAdded = tracker.numTaskCanBeAdded();
+    Task<? extends Serializable> barrierTask = TaskFactory.get(new DependencyCollectionWork(), conf);
+    while (numTaskCanBeAdded-- > 0 && itr.hasNext()) {
       DirCopyWork dirCopyWork = itr.next();
       Task<? extends Serializable> task = TaskFactory.get(dirCopyWork, conf);
       tasks.add(task);
-      tracker.addTask(task);
+      barrierTask.addDependentTask(task);
       LOG.debug("added task for {}", dirCopyWork);
     }
-    return tasks;
+
+    if (!tasks.isEmpty()) {
+      tracker.addDependentTask(barrierTask);
+      tracker.addTaskList(tasks);
+      return Collections.singletonList(barrierTask);
+    } else {
+      return tasks;
+    }
+  }
+
+  private static Integer setTargetPathOwnerInt(Path targetPath, Path sourcePath, HiveConf conf) throws IOException {
+    FileSystem targetFs = targetPath.getFileSystem(conf);
+    if (!targetFs.exists(targetPath)) {
+      targetFs.create(targetPath);
+    }
+    FileStatus status = sourcePath.getFileSystem(conf).getFileStatus(sourcePath);
+    if (status == null) {
+      throw new IOException("source path missing " + sourcePath);
+    }
+    targetPath.getFileSystem(conf).setOwner(targetPath, status.getOwner(), status.getGroup());
+    return null;
+  }
+
+  private static Integer setTargetPathOwner(Path targetPath, Path sourcePath, HiveConf conf, String distCpDoAsUser)
+      throws IOException {
+    if (distCpDoAsUser == null) {
+      return setTargetPathOwnerInt(targetPath, sourcePath, conf);
+    }
+    UserGroupInformation proxyUser = UserGroupInformation.createProxyUser(
+        distCpDoAsUser, UserGroupInformation.getLoginUser());
+    try {
+      Path finalTargetPath = targetPath;
+      Path finalSourcePath = sourcePath;
+      return proxyUser.doAs((PrivilegedExceptionAction<Integer>) () ->
+          setTargetPathOwnerInt(finalTargetPath, finalSourcePath, conf));
 
 Review comment:
   Maybe a better method name here?




[GitHub] anishek commented on a change in pull request #549: HIVE-21314 : Hive Replication not retaining the owner in the replicated table

2019-02-25 Thread GitBox
anishek commented on a change in pull request #549: HIVE-21314 : Hive 
Replication not retaining the owner in the replicated table
URL: https://github.com/apache/hive/pull/549#discussion_r259809837
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ExternalTableCopyTaskBuilder.java
 ##
 @@ -54,14 +59,53 @@
   List<Task<? extends Serializable>> tasks(TaskTracker tracker) {
     List<Task<? extends Serializable>> tasks = new ArrayList<>();
     Iterator<DirCopyWork> itr = work.getPathsToCopyIterator();
-    while (tracker.canAddMoreTasks() && itr.hasNext()) {
+    int numTaskCanBeAdded = tracker.numTaskCanBeAdded();
+    Task<? extends Serializable> barrierTask = TaskFactory.get(new DependencyCollectionWork(), conf);
+    while (numTaskCanBeAdded-- > 0 && itr.hasNext()) {
       DirCopyWork dirCopyWork = itr.next();
       Task<? extends Serializable> task = TaskFactory.get(dirCopyWork, conf);
       tasks.add(task);
-      tracker.addTask(task);
+      barrierTask.addDependentTask(task);
       LOG.debug("added task for {}", dirCopyWork);
     }
-    return tasks;
+
+    if (!tasks.isEmpty()) {
+      tracker.addDependentTask(barrierTask);
+      tracker.addTaskList(tasks);
+      return Collections.singletonList(barrierTask);
+    } else {
+      return tasks;
+    }
+  }
+
+  private static Integer setTargetPathOwnerInt(Path targetPath, Path sourcePath, HiveConf conf) throws IOException {
+    FileSystem targetFs = targetPath.getFileSystem(conf);
+    if (!targetFs.exists(targetPath)) {
+      targetFs.create(targetPath);
+    }
+    FileStatus status = sourcePath.getFileSystem(conf).getFileStatus(sourcePath);
+    if (status == null) {
+      throw new IOException("source path missing " + sourcePath);
 
 Review comment:
   This would throw an exception during load if the source path is no longer available because the table was dropped at the source; we should handle that case. The owner should always be the one at the directory level for the table / partition.
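
A minimal sketch of the suggested handling, assuming a missing source directory should be tolerated rather than fail the load (note that Hadoop's FileSystem#getFileStatus throws FileNotFoundException rather than returning null):

```java
// Sketch: if the table was dropped at the source, its directory may be gone
// by load time; skip the owner sync instead of failing the whole load.
FileStatus status;
try {
  status = sourcePath.getFileSystem(conf).getFileStatus(sourcePath);
} catch (FileNotFoundException e) {
  LOG.info("source path {} missing, skipping owner sync", sourcePath);
  return null;
}
targetPath.getFileSystem(conf).setOwner(targetPath, status.getOwner(), status.getGroup());
```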




[GitHub] anishek commented on a change in pull request #549: HIVE-21314 : Hive Replication not retaining the owner in the replicated table

2019-02-25 Thread GitBox
anishek commented on a change in pull request #549: HIVE-21314 : Hive 
Replication not retaining the owner in the replicated table
URL: https://github.com/apache/hive/pull/549#discussion_r259810535
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ExternalTableCopyTaskBuilder.java
 ##
 @@ -54,14 +59,53 @@
   List<Task<? extends Serializable>> tasks(TaskTracker tracker) {
     List<Task<? extends Serializable>> tasks = new ArrayList<>();
     Iterator<DirCopyWork> itr = work.getPathsToCopyIterator();
-    while (tracker.canAddMoreTasks() && itr.hasNext()) {
+    int numTaskCanBeAdded = tracker.numTaskCanBeAdded();
+    Task<? extends Serializable> barrierTask = TaskFactory.get(new DependencyCollectionWork(), conf);
+    while (numTaskCanBeAdded-- > 0 && itr.hasNext()) {
       DirCopyWork dirCopyWork = itr.next();
       Task<? extends Serializable> task = TaskFactory.get(dirCopyWork, conf);
       tasks.add(task);
-      tracker.addTask(task);
+      barrierTask.addDependentTask(task);
       LOG.debug("added task for {}", dirCopyWork);
     }
-    return tasks;
+
+    if (!tasks.isEmpty()) {
+      tracker.addDependentTask(barrierTask);
+      tracker.addTaskList(tasks);
+      return Collections.singletonList(barrierTask);
+    } else {
+      return tasks;
+    }
+  }
+
+  private static Integer setTargetPathOwnerInt(Path targetPath, Path sourcePath, HiveConf conf) throws IOException {
+    FileSystem targetFs = targetPath.getFileSystem(conf);
+    if (!targetFs.exists(targetPath)) {
+      targetFs.create(targetPath);
+    }
+    FileStatus status = sourcePath.getFileSystem(conf).getFileStatus(sourcePath);
+    if (status == null) {
+      throw new IOException("source path missing " + sourcePath);
+    }
+    targetPath.getFileSystem(conf).setOwner(targetPath, status.getOwner(), status.getGroup());
+    return null;
+  }
+
+  private static Integer setTargetPathOwner(Path targetPath, Path sourcePath, HiveConf conf, String distCpDoAsUser)
+      throws IOException {
+    if (distCpDoAsUser == null) {
 
 Review comment:
   I don't think distCpDoAsUser can be null, since we will always need a user to do the distcp, which would be the Beacon super user.




[GitHub] anishek commented on a change in pull request #549: HIVE-21314 : Hive Replication not retaining the owner in the replicated table

2019-02-25 Thread GitBox
anishek commented on a change in pull request #549: HIVE-21314 : Hive 
Replication not retaining the owner in the replicated table
URL: https://github.com/apache/hive/pull/549#discussion_r259810006
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ExternalTableCopyTaskBuilder.java
 ##
 @@ -54,14 +59,53 @@
   List<Task<? extends Serializable>> tasks(TaskTracker tracker) {
     List<Task<? extends Serializable>> tasks = new ArrayList<>();
     Iterator<DirCopyWork> itr = work.getPathsToCopyIterator();
-    while (tracker.canAddMoreTasks() && itr.hasNext()) {
+    int numTaskCanBeAdded = tracker.numTaskCanBeAdded();
+    Task<? extends Serializable> barrierTask = TaskFactory.get(new DependencyCollectionWork(), conf);
+    while (numTaskCanBeAdded-- > 0 && itr.hasNext()) {
       DirCopyWork dirCopyWork = itr.next();
       Task<? extends Serializable> task = TaskFactory.get(dirCopyWork, conf);
       tasks.add(task);
-      tracker.addTask(task);
+      barrierTask.addDependentTask(task);
       LOG.debug("added task for {}", dirCopyWork);
     }
-    return tasks;
+
+    if (!tasks.isEmpty()) {
+      tracker.addDependentTask(barrierTask);
+      tracker.addTaskList(tasks);
+      return Collections.singletonList(barrierTask);
+    } else {
+      return tasks;
+    }
+  }
+
+  private static Integer setTargetPathOwnerInt(Path targetPath, Path sourcePath, HiveConf conf) throws IOException {
+    FileSystem targetFs = targetPath.getFileSystem(conf);
+    if (!targetFs.exists(targetPath)) {
+      targetFs.create(targetPath);
+    }
+    FileStatus status = sourcePath.getFileSystem(conf).getFileStatus(sourcePath);
+    if (status == null) {
+      throw new IOException("source path missing " + sourcePath);
+    }
+    targetPath.getFileSystem(conf).setOwner(targetPath, status.getOwner(), status.getGroup());
+    return null;
+  }
+
+  private static Integer setTargetPathOwner(Path targetPath, Path sourcePath, HiveConf conf, String distCpDoAsUser)
 
 Review comment:
   Why the Integer return type?
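
Presumably the Integer return type exists only because UserGroupInformation.doAs takes a PrivilegedExceptionAction&lt;T&gt; that must return a value; a Void-typed sketch of the same call shape (runAsProxy is a hypothetical helper):

```java
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

// Sketch: an action that only has side effects can be typed Void and return
// null, avoiding the otherwise meaningless Integer result.
static void runAsProxy(UserGroupInformation proxyUser, Runnable action) throws Exception {
  proxyUser.doAs((PrivilegedExceptionAction<Void>) () -> {
    action.run(); // the setOwner call from the diff would go here
    return null;
  });
}
```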




[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259804317
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
 ##
 @@ -5199,6 +5205,35 @@ public static boolean doesTableNeedLocation(Table tbl) {
 return retval;
   }
 
+  private int updateFirstIncPendingFlag(Hive hive, ReplSetFirstIncLoadFlagDesc desc) throws HiveException, TException {
 
 Review comment:
   done




[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259803151
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1147,6 +1145,12 @@ private static void createReplImportTasks(
       if (!waitOnPrecursor){
         throw new SemanticException(ErrorMsg.DATABASE_NOT_EXISTS.getMsg(tblDesc.getDatabaseName()));
       }
+      // For warehouse level replication, if the database itself is getting created in this load, then no need to
+      // check for duplicate copy. Check HIVE-21197 for more detail.
+      firstIncPending = false;
+    } else {
+      // For database replication, get the flag from database parameter. Check HIVE-21197 for more detail.
+      firstIncPending = ReplUtils.isFirstIncPending(parentDb.getParameters());
 
 Review comment:
   done




[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259803182
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1164,6 +1168,9 @@ private static void createReplImportTasks(
       if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) {
         dropTblTask = dropTableTask(table, x, replicationSpec);
         table = null;
+      } else if (!firstIncPending) {
+        // For table level replication, get the flag from table parameter. Check HIVE-21197 for more detail.
+        firstIncPending = ReplUtils.isFirstIncPending(table.getParameters());
 
 Review comment:
   done




[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259803223
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java
 ##
 @@ -48,13 +48,15 @@
 
   private final DatabaseEvent event;
   private final String dbNameToLoadIn;
+  private final boolean isTableLevelLoad;
 
-  public LoadDatabase(Context context, DatabaseEvent event, String dbNameToLoadIn,
+  public LoadDatabase(Context context, DatabaseEvent event, String dbNameToLoadIn, String tblNameToLoadIn,
 
 Review comment:
   yes




[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259802388
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1164,6 +1168,9 @@ private static void createReplImportTasks(
       if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) {
         dropTblTask = dropTableTask(table, x, replicationSpec);
         table = null;
+      } else if (!firstIncPending) {
+        // For table level replication, get the flag from table parameter. Check HIVE-21197 for more detail.
+        firstIncPending = ReplUtils.isFirstIncPending(table.getParameters());
 
 Review comment:
   Changed the logic to avoid the duplicate check in case the number of base directories is more than one.




[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259802015
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
 ##
 @@ -5199,6 +5205,35 @@ public static boolean doesTableNeedLocation(Table tbl) {
 return retval;
   }
 
+  private int updateFirstIncPendingFlag(Hive hive, ReplSetFirstIncLoadFlagDesc desc) throws HiveException, TException {
+    String dbNameOrPattern = desc.getDatabaseName();
+    String tableNameOrPattern = desc.getTableName();
+    String flag = desc.getIncLoadPendingFlag() ? "true" : "false";
+    Map<String, String> parameters;
+    // For database level load tableNameOrPattern will be null. Flag is set only in database for db level load.
+    if (tableNameOrPattern != null && !tableNameOrPattern.isEmpty()) {
+      // For table level load, dbNameOrPattern is db name and not a pattern.
+      for (String tableName : Utils.matchesTbl(hive, dbNameOrPattern, tableNameOrPattern)) {
+        org.apache.hadoop.hive.metastore.api.Table tbl = hive.getMSC().getTable(dbNameOrPattern, tableName);
+        parameters = tbl.getParameters();
+        if (ReplUtils.isFirstIncPending(parameters)) {
 
 Review comment:
   done




[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259800518
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##
 @@ -274,6 +283,19 @@ Long bootStrapDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive hiveDb) th
         for (String tblName : Utils.matchesTbl(hiveDb, dbName, work.tableNameOrPattern)) {
           LOG.debug(
               "analyzeReplDump dumping table: " + tblName + " to db root " + dbRoot.toUri());
+          Table table;
+          try {
+            table = hiveDb.getTable(dbName, tblName);
 
 Review comment:
   done




[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259800145
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java
 ##
 @@ -426,4 +427,14 @@ public static void copyLastReplId(Map<String, String> srcParameter, Map

[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259800197
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java
 ##
 @@ -187,4 +192,14 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) {
       }
     };
   }
+
+  public static boolean isFirstIncPending(Map<String, String> parameter) {
 
 Review comment:
   done




[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259799886
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ReplSetFirstIncLoadFlagDesc.java
 ##
 @@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.plan;
+import org.apache.hadoop.hive.ql.plan.Explain.Level;
+
+import java.io.Serializable;
+
+/**
+ * ReplSetFirstIncLoadFlagDesc.
+ *
+ */
+@Explain(displayName = "Set First Incr Load Flag", explainLevels = { 
Level.USER, Level.DEFAULT, Level.EXTENDED })
+public class ReplSetFirstIncLoadFlagDesc extends DDLDesc implements 
Serializable {
+
+  private static final long serialVersionUID = 1L;
+  String databaseName;
+  String tableName;
+  boolean incLoadPendingFlag;
+
+  /**
+   * For serialization only.
+   */
+  public ReplSetFirstIncLoadFlagDesc() {
+  }
+
+  public ReplSetFirstIncLoadFlagDesc(String databaseName, String tableName, boolean incLoadPendingFlag) {
+super();
+this.databaseName = databaseName;
+this.tableName = tableName;
+this.incLoadPendingFlag = incLoadPendingFlag;
+  }
+
+  @Explain(displayName="db_name", explainLevels = { Level.USER, Level.DEFAULT, 
Level.EXTENDED })
+  public String getDatabaseName() {
+return databaseName;
+  }
+
+  public void setDatabaseName(String databaseName) {
+this.databaseName = databaseName;
+  }
+
+  @Explain(displayName="table_name", explainLevels = { Level.USER, 
Level.DEFAULT, Level.EXTENDED })
+  public String getTableName() {
+return tableName;
+  }
+
+  public void setTableName(String tableName) {
+this.tableName = tableName;
+  }
+
+  @Explain(displayName="inc load pending flag", explainLevels = { Level.USER, 
Level.DEFAULT, Level.EXTENDED })
 
 Review comment:
   no change




[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259799813
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java
 ##
 @@ -112,6 +118,12 @@ public void run() {
 continue;
   }
 
+  if (replIsCompactionDisabledForTable(t)) {
 
 Review comment:
   done




[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259788172
 
 

 ##
 File path: ql/src/test/results/clientpositive/distinct_groupby.q.out
 ##
 @@ -0,0 +1,1530 @@
+PREHOOK: query: explain select distinct count(*) from src1 where key in (128,146,150)
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+#### A masked pattern was here ####
+POSTHOOK: query: explain select distinct count(*) from src1 where key in (128,146,150)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+#### A masked pattern was here ####
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-0 depends on stages: Stage-1
+
+STAGE PLANS:
+  Stage: Stage-1
+    Map Reduce
+      Map Operator Tree:
+          TableScan
+            alias: src1
+            filterExpr: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: boolean)
+            Statistics: Num rows: 25 Data size: 2150 Basic stats: COMPLETE Column stats: COMPLETE
+            Filter Operator
+              predicate: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: boolean)
+              Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE Column stats: COMPLETE
+              Select Operator
+                Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE Column stats: COMPLETE
+                Group By Operator
+                  aggregations: count()
+                  mode: hash
+                  outputColumnNames: _col0
+                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
+                  Reduce Output Operator
+                    sort order: 
+                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
+                    value expressions: _col0 (type: bigint)
+      Execution mode: vectorized
+      Reduce Operator Tree:
+        Group By Operator
+          aggregations: count(VALUE._col0)
+          mode: mergepartial
+          outputColumnNames: _col0
+          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
+          File Output Operator
+            compressed: false
+            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
+            table:
+                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
+                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
+                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+
+  Stage: Stage-0
+    Fetch Operator
+      limit: -1
+      Processor Tree:
+        ListSink
+
+PREHOOK: query: select distinct count(*) from src1 where key in (128,146,150)
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+#### A masked pattern was here ####
+POSTHOOK: query: select distinct count(*) from src1 where key in (128,146,150)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+#### A masked pattern was here ####
+3
+PREHOOK: query: explain select distinct * from (select distinct count(*) from src1 where key in (128,146,150)) as T
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+#### A masked pattern was here ####
+POSTHOOK: query: explain select distinct * from (select distinct count(*) from src1 where key in (128,146,150)) as T
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+#### A masked pattern was here ####
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-0 depends on stages: Stage-1
+
+STAGE PLANS:
+  Stage: Stage-1
+    Map Reduce
+      Map Operator Tree:
+          TableScan
+            alias: src1
+            filterExpr: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: boolean)
+            Statistics: Num rows: 25 Data size: 2150 Basic stats: COMPLETE Column stats: COMPLETE
+            Filter Operator
+              predicate: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: boolean)
+              Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE Column stats: COMPLETE
+              Select Operator
+                Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE Column stats: COMPLETE
+                Group By Operator
+                  aggregations: count()
+                  mode: hash
+                  outputColumnNames: _col0
+                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
+                  Reduce Output Operator
+                    sort order: 
+                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
+                    value expressions: _col0 (type: bigint)
+      Execution mode: vectorized
+      Reduce Operator Tree:
+        Group By Operator
+          aggregations: count(VALUE._col0)
+          mode: mergepartial
+          outputColumnNames: _col0
+          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column s

[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259780016
 
 


[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259784389
 
 


[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259794509
 
 

 ##
 File path: ql/src/test/queries/clientpositive/distinct_groupby.q
 ##
 @@ -0,0 +1,57 @@
+--! qt:dataset:src1
+
 
 Review comment:
   Could we have some tests for the non-CBO path as well?




[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259793850
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
 ##
 @@ -4230,6 +4229,34 @@ public static long unsetBit(long bitmap, int bitIdx) {
 }
   }
 
+  protected boolean isGroupBy(ASTNode expr) {
+    boolean isGroupBy = false;
+    if (expr.getParent() != null && expr.getParent() instanceof Node)
+    for (Node sibling : ((Node) expr.getParent()).getChildren()) {
+      isGroupBy |= sibling instanceof ASTNode && ((ASTNode) sibling).getType() == HiveParser.TOK_GROUPBY;
+    }
+
+    return isGroupBy;
+  }
+
+  protected boolean isSelectDistinct(ASTNode expr) {
+    return expr.getType() == HiveParser.TOK_SELECTDI;
+  }
+
+  protected boolean isAggregateInSelect(Node node, Collection<ASTNode> aggregateFunction) {
+    if (node.getChildren() == null) {
+      return false;
+    }
+
+    for (Node child : node.getChildren()) {
 
 Review comment:
   Is it safe to traverse all the children? I'm wondering whether we can get a false positive somehow...
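
One way the walk could go wrong: an unbounded traversal crosses into a nested subquery and reports its aggregate. A hedged sketch that stops at subquery boundaries (treating TOK_SUBQUERY as the boundary, and matching on assumed aggregate names, is my assumption, not the patch's logic):

```java
// Sketch: recurse through the select tree but stop at subquery boundaries,
// so an aggregate inside a nested query is not counted as "aggregate in select".
static boolean containsAggregate(ASTNode node, java.util.Set<String> aggregateNames) {
  if (node.getType() == HiveParser.TOK_SUBQUERY) {
    return false; // do not look inside nested queries
  }
  if (node.getType() == HiveParser.TOK_FUNCTION && node.getChildCount() > 0
      && aggregateNames.contains(node.getChild(0).getText().toLowerCase())) {
    return true;
  }
  for (int i = 0; i < node.getChildCount(); i++) {
    if (containsAggregate((ASTNode) node.getChild(i), aggregateNames)) {
      return true;
    }
  }
  return false;
}
```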




[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259763434
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
 ##
 @@ -3730,7 +3697,9 @@ private RelNode genGBLogicalPlan(QB qb, RelNode srcRel) throws SemanticException
         ASTNode node = (ASTNode) selExprList.getChild(0).getChild(0);
         if (node.getToken().getType() == HiveParser.TOK_ALLCOLREF) {
           // As we said before, here we use genSelectLogicalPlan to rewrite AllColRef
-          srcRel = genSelectLogicalPlan(qb, srcRel, srcRel, null, null, true).getKey();
+          if (!(isSelectDistinct(selExprList) && isGroupBy(selExprList))) {
 
 Review comment:
   `isSelectDistinct(selExprList)` is always true here; the condition could be changed to `!isGroupBy`.
   
   I'm not sure I understand why we would "skip" rewriting `TOK_ALLCOLREF` when it's not in a group by; I don't think the statements below make sense:
   ```
   create table t (c1 integer,c2 integer);
   select distinct * from t group by c1;
   ```
   I might be missing something here..




[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259776469
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
 ##
 @@ -4194,27 +4191,29 @@ public static long unsetBit(long bitmap, int bitIdx) {
   }
 
   /**
-   * This function is a wrapper of parseInfo.getGroupByForClause which
-   * automatically translates SELECT DISTINCT a,b,c to SELECT a,b,c GROUP BY
-   * a,b,c.
+   * Returns the GBY, if present;
+   * DISTINCT, if present, will be handled when generating the SELECT.
    */
   List<ASTNode> getGroupByForClause(QBParseInfo parseInfo, String dest) throws SemanticException {
-    if (parseInfo.getSelForClause(dest).getToken().getType() == HiveParser.TOK_SELECTDI) {
-      ASTNode selectExprs = parseInfo.getSelForClause(dest);
-      List<ASTNode> result = new ArrayList<ASTNode>(selectExprs == null ? 0
-          : selectExprs.getChildCount());
-      if (selectExprs != null) {
-        for (int i = 0; i < selectExprs.getChildCount(); ++i) {
-          if (((ASTNode) selectExprs.getChild(i)).getToken().getType() == HiveParser.QUERY_HINT) {
+    // When *not* invoked by CalcitePlanner, return the DISTINCT as a GBY
+    // CBO will handle the DISTINCT in CalcitePlannerAction.genSelectLogicalPlan
+    ASTNode selectExpr = parseInfo.getSelForClause(dest);
+    Collection<ASTNode> aggregateFunction = parseInfo.getDestToAggregationExprs().get(dest).values();
+    if (isSelectDistinct(selectExpr) && !isGroupBy(selectExpr) && !isAggregateInSelect(selectExpr, aggregateFunction)) {
+      List<ASTNode> result = new ArrayList<>(selectExpr == null ? 0 : selectExpr.getChildCount());
+      if (selectExpr != null) {
 
 Review comment:
   Please kill these null checks; if it were null we would already have run into an NPE, both in the earlier and in the new code.




[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259778186
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
 ##
 @@ -4194,27 +4191,29 @@ public static long unsetBit(long bitmap, int bitIdx) {
   }
 
   /**
-   * This function is a wrapper of parseInfo.getGroupByForClause which
-   * automatically translates SELECT DISTINCT a,b,c to SELECT a,b,c GROUP BY
-   * a,b,c.
+   * Returns the GBY, if present;
+   * DISTINCT, if present, will be handled when generating the SELECT.
    */
   List<ASTNode> getGroupByForClause(QBParseInfo parseInfo, String dest) throws SemanticException {
-    if (parseInfo.getSelForClause(dest).getToken().getType() == HiveParser.TOK_SELECTDI) {
-      ASTNode selectExprs = parseInfo.getSelForClause(dest);
-      List<ASTNode> result = new ArrayList<ASTNode>(selectExprs == null ? 0
-          : selectExprs.getChildCount());
-      if (selectExprs != null) {
-        for (int i = 0; i < selectExprs.getChildCount(); ++i) {
-          if (((ASTNode) selectExprs.getChild(i)).getToken().getType() == HiveParser.QUERY_HINT) {
+    // When *not* invoked by CalcitePlanner, return the DISTINCT as a GBY
+    // CBO will handle the DISTINCT in CalcitePlannerAction.genSelectLogicalPlan
+    ASTNode selectExpr = parseInfo.getSelForClause(dest);
+    Collection<ASTNode> aggregateFunction = parseInfo.getDestToAggregationExprs().get(dest).values();
+    if (isSelectDistinct(selectExpr) && !isGroupBy(selectExpr) && !isAggregateInSelect(selectExpr, aggregateFunction)) {
 
 Review comment:
   What plan do we end up with if we have a select distinct which has an aggregate (this condition will be false) and CBO is not working? Is there a test for that?
   




[GitHub] kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By

2019-02-25 Thread GitBox
kgyrtkirk commented on a change in pull request #544: HIVE-16924 Support 
distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259792059
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
 ##
 @@ -4230,6 +4229,34 @@ public static long unsetBit(long bitmap, int bitIdx) {
 }
   }
 
+  protected boolean isGroupBy(ASTNode expr) {
 
 Review comment:
   This method name suggests to me that the expression *is* a group by; but instead it seems to check whether the expression has a sibling which is a group by.
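
A sketch of the suggested rename, keeping the quoted logic but naming the helper for what it actually checks (the name itself is just a suggestion):

```java
// Sketch: same check, renamed to say we look for a GROUP BY *sibling* of the
// expression rather than testing the expression itself.
protected boolean hasGroupBySibling(ASTNode expr) {
  if (expr.getParent() == null) {
    return false;
  }
  for (Node sibling : ((Node) expr.getParent()).getChildren()) {
    if (sibling instanceof ASTNode && ((ASTNode) sibling).getType() == HiveParser.TOK_GROUPBY) {
      return true;
    }
  }
  return false;
}
```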




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259793406
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##
 @@ -274,6 +283,19 @@ Long bootStrapDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive hiveDb) th
         for (String tblName : Utils.matchesTbl(hiveDb, dbName, work.tableNameOrPattern)) {
           LOG.debug(
               "analyzeReplDump dumping table: " + tblName + " to db root " + dbRoot.toUri());
+          Table table;
+          try {
+            table = hiveDb.getTable(dbName, tblName);
 
 Review comment:
   yes...agreed




[GitHub] maheshk114 opened a new pull request #549: HIVE-21314 : Hive Replication not retaining the owner in the replicated table

2019-02-25 Thread GitBox
maheshk114 opened a new pull request #549: HIVE-21314 : Hive Replication not 
retaining the owner in the replicated table
URL: https://github.com/apache/hive/pull/549
 
 
   …




[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259778574
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##
 @@ -274,6 +283,19 @@ Long bootStrapDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive hiveDb) th
         for (String tblName : Utils.matchesTbl(hiveDb, dbName, work.tableNameOrPattern)) {
           LOG.debug(
               "analyzeReplDump dumping table: " + tblName + " to db root " + dbRoot.toUri());
+          Table table;
+          try {
+            table = hiveDb.getTable(dbName, tblName);
 
 Review comment:
   "These table level checks can be done only if work.tableNameOrPattern is 
valid." with this change ..the call will be only once 




[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259778574
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##
 @@ -274,6 +283,19 @@ Long bootStrapDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive hiveDb) th
         for (String tblName : Utils.matchesTbl(hiveDb, dbName, work.tableNameOrPattern)) {
           LOG.debug(
               "analyzeReplDump dumping table: " + tblName + " to db root " + dbRoot.toUri());
+          Table table;
+          try {
+            table = hiveDb.getTable(dbName, tblName);
 
 Review comment:
   "We invoke getTable twice. Can we reuse tableTuple that we get below inside 
try block." with this change ..the call will be only once 




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776670
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java
 ##
 @@ -187,4 +188,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) {
       }
     };
   }
+
+  public static boolean isFirstIncDone(Map<String, String> parameter) {
+    if (parameter == null) {
+      return true;
+    }
+    String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG);
 
 Review comment:
   OK




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776841
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/MetaStoreCompactorThread.java
 ##
 @@ -71,6 +74,20 @@ public void init(AtomicBoolean stop, AtomicBoolean looped) throws Exception {
     }
   }
 
+  @Override boolean replIsCompactionDisabledForDatabase(String dbName) throws TException {
+    try {
+      Database database = rs.getDatabase(getDefaultCatalog(conf), dbName);
+      if (database != null) {
+        return !ReplUtils.isFirstIncDone(database.getParameters());
+      }
+      LOG.info("Unable to find database " + dbName);
 
 Review comment:
   OK




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776724
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java
 ##
 @@ -112,6 +118,12 @@ public void run() {
 continue;
   }
 
+  if (replIsCompactionDisabledForTable(t)) {
 
 Review comment:
   OK




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776789
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1147,6 +1145,12 @@ private static void createReplImportTasks(
   if (!waitOnPrecursor){
 throw new 
SemanticException(ErrorMsg.DATABASE_NOT_EXISTS.getMsg(tblDesc.getDatabaseName()));
   }
+  // For warehouse level replication, if the database itself is getting 
created in this load, then no need to
+  // check for duplicate copy. Check HIVE-21197 for more detail.
+  firstIncDone = true;
 
 Review comment:
   OK




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776621
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java
 ##
 @@ -271,12 +302,13 @@ public String getName() {
 LOG.debug("ReplCopyTask:getLoadCopyTask: {}=>{}", srcPath, dstPath);
 if ((replicationSpec != null) && replicationSpec.isInReplicationScope()){
   ReplCopyWork rcwork = new ReplCopyWork(srcPath, dstPath, false);
-  if (replicationSpec.isReplace() && 
conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION)) {
+  if (replicationSpec.isReplace() && 
(conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION) || copyToMigratedTxnTable)) {
 rcwork.setDeleteDestIfExist(true);
 rcwork.setAutoPurge(isAutoPurge);
 rcwork.setNeedRecycle(needRecycle);
   }
   rcwork.setCopyToMigratedTxnTable(copyToMigratedTxnTable);
+  rcwork.setCheckDuplicateCopy(replicationSpec.needDupCopyCheck());
 
 Review comment:
   OK




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776865
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/RemoteCompactorThread.java
 ##
 @@ -53,6 +56,20 @@ public void init(AtomicBoolean stop, AtomicBoolean looped) 
throws Exception {
 }
   }
 
+  @Override boolean replIsCompactionDisabledForDatabase(String dbName) throws 
TException {
 
 Review comment:
   OK




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776541
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
 ##
 @@ -661,6 +663,10 @@ public int execute(DriverContext driverContext) {
   if (work.getAlterMaterializedViewDesc() != null) {
 return alterMaterializedView(db, work.getAlterMaterializedViewDesc());
   }
+
+  if (work.getReplSetFirstIncLoadFlagDesc() != null) {
 
 Review comment:
   OK




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776510
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationMisc.java
 ##
 @@ -0,0 +1,233 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.parse;
+
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore;
+import 
org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection;
+import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId;
+import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils;
+import org.apache.hadoop.hive.shims.Utils;
+import org.junit.*;
+import org.junit.rules.TestName;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.annotation.Nullable;
+import java.io.IOException;
+import java.util.*;
 
 Review comment:
   OK




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259775595
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java
 ##
 @@ -112,6 +118,12 @@ public void run() {
 continue;
   }
 
+  if (replIsCompactionDisabledForTable(t)) {
 
 Review comment:
   The auto_compact flag remains set forever unless someone changes it, so when it is false the additional check can be avoided. But the repl first-inc flag is false most of the time, so we end up with two checks always.

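
The trade-off above is just check ordering. A minimal sketch, assuming a cheap in-memory auto_compact property and an expensive metastore lookup for the repl flag (names hypothetical, not the actual Initiator code):

{code:java}
public class CheckOrderSketch {
  interface CompactionTarget {
    boolean isAutoCompactDisabled(); // in-memory table property, cheap
    String getDbName();
  }

  interface Metastore {
    boolean isFirstIncPendingForDb(String dbName) throws Exception; // remote lookup, expensive
  }

  // Short-circuit ordering: the cheap check runs first, so the metastore
  // round trip only happens for tables where auto-compaction is enabled.
  static boolean shouldSkipCompaction(CompactionTarget t, Metastore ms) throws Exception {
    if (t.isAutoCompactDisabled()) {
      return true;
    }
    return ms.isFirstIncPendingForDb(t.getDbName());
  }
}
{code}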



[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259774866
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/plan/ReplSetFirstIncLoadFlagDesc.java
 ##
 @@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.plan;
+import org.apache.hadoop.hive.ql.plan.Explain.Level;
+
+import java.io.Serializable;
+
+/**
+ * ReplSetFirstIncLoadFlagDesc.
+ *
+ */
+@Explain(displayName = "Set First Incr Load Flag", explainLevels = { 
Level.USER, Level.DEFAULT, Level.EXTENDED })
+public class ReplSetFirstIncLoadFlagDesc extends DDLDesc implements 
Serializable {
+
+  private static final long serialVersionUID = 1L;
+  String databaseName;
+  String tableName;
+  boolean incLoadPendingFlag;
+
+  /**
+   * For serialization only.
+   */
+  public ReplSetFirstIncLoadFlagDesc() {
+  }
+
+  public ReplSetFirstIncLoadFlagDesc(String databaseName, String tableName, 
boolean incLoadPendingFlag) {
+super();
+this.databaseName = databaseName;
+this.tableName = tableName;
+this.incLoadPendingFlag = incLoadPendingFlag;
+  }
+
+  @Explain(displayName="db_name", explainLevels = { Level.USER, Level.DEFAULT, 
Level.EXTENDED })
+  public String getDatabaseName() {
+return databaseName;
+  }
+
+  public void setDatabaseName(String databaseName) {
+this.databaseName = databaseName;
+  }
+
+  @Explain(displayName="table_name", explainLevels = { Level.USER, 
Level.DEFAULT, Level.EXTENDED })
+  public String getTableName() {
+return tableName;
+  }
+
+  public void setTableName(String tableName) {
+this.tableName = tableName;
+  }
+
+  @Explain(displayName="inc load pending flag", explainLevels = { Level.USER, 
Level.DEFAULT, Level.EXTENDED })
 
 Review comment:
   We can keep it, but it is redundant for now because EXPLAIN REPL LOAD just shows the ReplLoadTask and nothing else.




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259774454
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java
 ##
 @@ -426,4 +427,14 @@ public static void copyLastReplId(Map 
srcParameter, Map

[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259773951
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java
 ##
 @@ -187,4 +192,14 @@ public static PathFilter getEventsDirectoryFilter(final 
FileSystem fs) {
   }
 };
   }
+
+  public static boolean isFirstIncPending(Map<String, String> parameter) {
 
 Review comment:
   OK. Then change it to "parameters", with an "s" :)




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259773642
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##
 @@ -274,6 +283,19 @@ Long bootStrapDump(Path dumpRoot, DumpMetaData dmd, Path 
cmRoot, Hive hiveDb) th
 for (String tblName : Utils.matchesTbl(hiveDb, dbName, 
work.tableNameOrPattern)) {
   LOG.debug(
   "analyzeReplDump dumping table: " + tblName + " to db root " + 
dbRoot.toUri());
+  Table table;
+  try {
+table = hiveDb.getTable(dbName, tblName);
 
 Review comment:
   We get the table from the Metastore; it impacts performance.




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259773205
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationEx.java
 ##
 @@ -0,0 +1,273 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.parse;
+
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore;
+import 
org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection;
+import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId;
+import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils;
+import org.apache.hadoop.hive.shims.Utils;
+import org.junit.*;
+import org.junit.rules.TestName;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.*;
+
+import static 
org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable;
+import static org.junit.Assert.*;
+
+/**
+ * TestReplicationWithTableMigrationEx - test replication for Hive2 to Hive3 
(Strict managed tables)
+ */
+public class TestReplicationWithTableMigrationEx {
+  @Rule
+  public final TestName testName = new TestName();
+
+  protected static final Logger LOG = 
LoggerFactory.getLogger(TestReplicationWithTableMigrationEx.class);
+  private static WarehouseInstance primary, replica;
+  private String primaryDbName, replicatedDbName;
+
+  @BeforeClass
+  public static void classLevelSetup() throws Exception {
+HashMap<String, String> overrideProperties = new HashMap<>();
+internalBeforeClassSetup(overrideProperties);
+  }
+
+  static void internalBeforeClassSetup(Map<String, String> overrideConfigs) throws Exception {
+HiveConf conf = new HiveConf(TestReplicationWithTableMigrationEx.class);
+conf.set("dfs.client.use.datanode.hostname", "true");
+conf.set("hadoop.proxyuser." + Utils.getUGI().getShortUserName() + 
".hosts", "*");
+MiniDFSCluster miniDFSCluster =
+new MiniDFSCluster.Builder(conf).numDataNodes(1).format(true).build();
+final DistributedFileSystem fs = miniDFSCluster.getFileSystem();
+HashMap<String, String> hiveConfigs = new HashMap<String, String>() {{
+  put("fs.defaultFS", fs.getUri().toString());
+  put("hive.support.concurrency", "true");
+  put("hive.txn.manager", 
"org.apache.hadoop.hive.ql.lockmgr.DbTxnManager");
+  put("hive.metastore.client.capability.check", "false");
+  put("hive.repl.bootstrap.dump.open.txn.timeout", "1s");
+  put("hive.exec.dynamic.partition.mode", "nonstrict");
+  put("hive.strict.checks.bucketing", "false");
+  put("hive.mapred.mode", "nonstrict");
+  put("mapred.input.dir.recursive", "true");
+  put("hive.metastore.disallow.incompatible.col.type.changes", "false");
+  put("hive.strict.managed.tables", "true");
+  put("hive.metastore.transactional.event.listeners", "");
+}};
+replica = new WarehouseInstance(LOG, miniDFSCluster, hiveConfigs);
+
+HashMap<String, String> configsForPrimary = new HashMap<String, String>() {{
+  put("fs.defaultFS", fs.getUri().toString());
+  put("hive.metastore.client.capability.check", "false");
+  put("hive.repl.bootstrap.dump.open.txn.timeout", "1s");
+  put("hive.exec.dynamic.partition.mode", "nonstrict");
+  put("hive.strict.checks.bucketing", "false");
+  put("hive.mapred.mode", "nonstrict");
+  put("mapred.input.dir.recursive", "true");
+  put("hive.metastore.disallow.incompatible.col.type.changes", "false");
+  put("hive.support.concurrency", "false");
+  put("hive.txn.manager", 
"org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager");
+  put("hive.strict.managed.tables", "false");
+}};
+configsForPrimary.putAll(overrideConfigs);
+primary = new WarehouseInstance(LOG, miniDFSCluster, configsForPrimary

[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259773047
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1164,6 +1168,9 @@ private static void createReplImportTasks(
   if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) {
 dropTblTask = dropTableTask(table, x, replicationSpec);
 table = null;
+  } else if (!firstIncPending) {
+// For table level replication, get the flag from table parameter. 
Check HIVE-21197 for more detail.
+firstIncPending = ReplUtils.isFirstIncPending(table.getParameters());
 
 Review comment:
   It is allowed today; we need not bootstrap for table-level replication if the create table event is part of the incremental dump.




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259772504
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1147,6 +1145,12 @@ private static void createReplImportTasks(
   if (!waitOnPrecursor){
 throw new 
SemanticException(ErrorMsg.DATABASE_NOT_EXISTS.getMsg(tblDesc.getDatabaseName()));
   }
+  // For warehouse level replication, if the database itself is getting 
created in this load, then no need to
+  // check for duplicate copy. Check HIVE-21197 for more detail.
+  firstIncPending = false;
+} else {
+  // For database replication, get the flag from database parameter. Check 
HIVE-21197 for more detail.
+  firstIncPending = ReplUtils.isFirstIncPending(parentDb.getParameters());
 
 Review comment:
   Yes, functionally it is right, but it is not very readable, as we assume the flag from the DB is false for table-level replication. Please add a comment.

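
Pulling the branches discussed in these comments together, a sketch of the flag derivation under the stated assumptions (a null DB parameter map models a database created by this very load; all names are illustrative, not the ImportSemanticAnalyzer code):

{code:java}
import java.util.Map;

public class FirstIncPendingSketch {
  static final String FLAG = "hive.repl.first.inc.pending";

  static boolean isFirstIncPending(Map<String, String> params) {
    return params != null && "true".equalsIgnoreCase(params.get(FLAG));
  }

  // Database created by this load (dbParams == null): nothing can have been
  // half-copied yet, so the duplicate-copy check is not needed. Otherwise take
  // the DB-level flag, and for table-level replication fall back to the
  // table's own flag when the DB flag is not set.
  static boolean firstIncPendingFor(Map<String, String> dbParams, Map<String, String> tblParams) {
    boolean pending = isFirstIncPending(dbParams);
    if (!pending && tblParams != null) {
      pending = isFirstIncPending(tblParams);
    }
    return pending;
  }
}
{code}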



[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259772546
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1164,6 +1168,9 @@ private static void createReplImportTasks(
   if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) {
 dropTblTask = dropTableTask(table, x, replicationSpec);
 table = null;
+  } else if (!firstIncPending) {
+// For table level replication, get the flag from table parameter. 
Check HIVE-21197 for more detail.
+firstIncPending = ReplUtils.isFirstIncPending(table.getParameters());
 
 Review comment:
   Please add a comment.




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259771726
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java
 ##
 @@ -48,13 +48,15 @@
 
   private final DatabaseEvent event;
   private final String dbNameToLoadIn;
+  private final boolean isTableLevelLoad;
 
-  public LoadDatabase(Context context, DatabaseEvent event, String 
dbNameToLoadIn,
+  public LoadDatabase(Context context, DatabaseEvent event, String 
dbNameToLoadIn, String tblNameToLoadIn,
 
 Review comment:
   I think this fix is partial; ideally, table-level replication shouldn't enter LoadDatabase. Please create a bug for this and leave it.




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259771264
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java
 ##
 @@ -289,12 +296,21 @@ private boolean shouldReplayEvent(FileStatus dir, 
DumpType dumpType, String dbNa
 return updateReplIdTask;
   }
 
-  private Task<? extends Serializable> dbUpdateReplStateTask(String dbName, String replState,
+  private Task<? extends Serializable> dbUpdateReplStateTask(String dbName, String replState, String incLoadPendFlag,
 Task<? extends Serializable> preCursor) {
HashMap<String, String> mapProp = new HashMap<>();
-mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState);
 
-AlterDatabaseDesc alterDbDesc = new AlterDatabaseDesc(dbName, mapProp, new 
ReplicationSpec(replState, replState));
+// if the update is for incLoadPendFlag, then send replicationSpec as null 
to avoid replacement check.
+ReplicationSpec replicationSpec = null;
+if (incLoadPendFlag == null) {
+  mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState);
+  replicationSpec = new ReplicationSpec(replState, replState);
+} else {
+  assert replState == null;
+  mapProp.put(ReplUtils.REPL_FIRST_INC_PENDING_FLAG, incLoadPendFlag);
 
 Review comment:
   But this issue is there even if the inc-pending flag is set to false. Please recheck and comment.

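
The two mutually exclusive modes visible in the diff above, reduced to a property-map sketch (the CURR_STATE_ID key name is a stand-in; only the flag constant comes from the patch):

{code:java}
import java.util.HashMap;
import java.util.Map;

public class DbUpdateTaskSketch {
  static final String CURR_STATE_ID = "repl.last.id"; // stand-in key name
  static final String REPL_FIRST_INC_PENDING_FLAG = "hive.repl.first.inc.pending";

  // Either update the replication state id (which goes through the normal
  // replacement check via a ReplicationSpec), or only flip the
  // first-inc-pending flag, in which case no spec is built at all.
  static Map<String, String> buildDbProps(String replState, String incLoadPendFlag) {
    Map<String, String> props = new HashMap<>();
    if (incLoadPendFlag == null) {
      props.put(CURR_STATE_ID, replState);
    } else {
      assert replState == null;
      props.put(REPL_FIRST_INC_PENDING_FLAG, incLoadPendFlag);
    }
    return props;
  }
}
{code}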



[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770576
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java
 ##
 @@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final 
FileSystem fs) {
   }
 };
   }
+
+  public static boolean isFirstIncDone(Map<String, String> parameter) {
+if (parameter == null) {
+  return true;
+}
+String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG);
+return compFlag == null  || compFlag.isEmpty() || 
"false".equalsIgnoreCase(compFlag);
 
 Review comment:
   OK
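
For reference, the helper under review, assembled from the diff into a self-contained form (a sketch; the constant value is the one shown in the ReplUtils diff earlier in this thread):

{code:java}
import java.util.Collections;
import java.util.Map;

public class ReplFlagSketch {
  public static final String REPL_FIRST_INC_PENDING_FLAG = "hive.repl.first.inc.pending";

  // "Done" means the first incremental load is not pending:
  // a missing, empty, or "false" flag all count as done.
  public static boolean isFirstIncDone(Map<String, String> parameters) {
    if (parameters == null) {
      return true;
    }
    String compFlag = parameters.get(REPL_FIRST_INC_PENDING_FLAG);
    return compFlag == null || compFlag.isEmpty() || "false".equalsIgnoreCase(compFlag);
  }

  public static void main(String[] args) {
    System.out.println(isFirstIncDone(null)); // true
    System.out.println(isFirstIncDone(
        Collections.singletonMap(REPL_FIRST_INC_PENDING_FLAG, "true"))); // false
  }
}
{code}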




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770626
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java
 ##
 @@ -289,12 +296,21 @@ private boolean shouldReplayEvent(FileStatus dir, 
DumpType dumpType, String dbNa
 return updateReplIdTask;
   }
 
-  private Task<? extends Serializable> dbUpdateReplStateTask(String dbName, String replState,
+  private Task<? extends Serializable> dbUpdateReplStateTask(String dbName, String replState, String incLoadPendFlag,
 
 Review comment:
   OK




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770334
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCompactor.java
 ##
 @@ -1536,6 +1536,62 @@ public void testCompactionInfoHashCode() {
 Assert.assertEquals("The hash codes must be equal", 
compactionInfo.hashCode(), compactionInfo1.hashCode());
   }
 
+  @Test
+  public void testDisableCompactionDuringReplLoad() throws Exception {
+String tblName = "discomp";
+String database = "discomp_db";
+executeStatementOnDriver("drop database if exists " + database + " 
cascade", driver);
+executeStatementOnDriver("create database " + database, driver);
+executeStatementOnDriver("CREATE TABLE " + database + "." + tblName + "(a 
INT, b STRING) " +
+" PARTITIONED BY(ds string)" +
+" CLUSTERED BY(a) INTO 2 BUCKETS" + //currently ACID requires 
table to be bucketed
+" STORED AS ORC TBLPROPERTIES ('transactional'='true')", driver);
+executeStatementOnDriver("insert into " + database + "." + tblName + " 
partition (ds) values (1, 'fred', " +
+"'today'), (2, 'wilma', 'yesterday')", driver);
+
+executeStatementOnDriver("ALTER TABLE " + database + "." + tblName +
+" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'true')", 
driver);
+List<ShowCompactResponseElement> compacts = getCompactionList();
+Assert.assertEquals(0, compacts.size());
+
+executeStatementOnDriver("alter database " + database +
+" set dbproperties ('hive.repl.first.inc.pending' = 'true')", 
driver);
+executeStatementOnDriver("ALTER TABLE " + database + "." + tblName +
 
 Review comment:
   OK




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770434
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java
 ##
 @@ -370,6 +370,9 @@ private int executeIncrementalLoad(DriverContext 
driverContext) {
 
   // If incremental events are already applied, then check and perform if 
need to bootstrap any tables.
   if (!builder.hasMoreWork() && !work.getPathsToCopyIterator().hasNext()) {
+// No need to set incremental load pending flag for external tables as 
the files will be copied to the same path
 
 Review comment:
   OK
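
The rationale in the code comment above, as a tiny sketch (hypothetical helper, not the ReplLoadTask code): external-table files are copied to the same target path, so a re-copy is idempotent and the flag only needs to be set for managed tables.

{code:java}
import java.util.HashMap;
import java.util.Map;

public class IncPendingFlagSketch {
  static final String FLAG = "hive.repl.first.inc.pending";

  static void maybeSetIncPendingFlag(boolean isExternalTable, Map<String, String> tblParams) {
    if (!isExternalTable) {
      tblParams.put(FLAG, "true"); // only managed tables need the duplicate-copy guard
    }
  }

  public static void main(String[] args) {
    Map<String, String> params = new HashMap<>();
    maybeSetIncPendingFlag(false, params);
    System.out.println(params); // {hive.repl.first.inc.pending=true}
  }
}
{code}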




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770370
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java
 ##
 @@ -61,6 +62,21 @@ public ReplCopyTask(){
 super();
   }
 
+  // If file is already present in base directory, then remove it from the 
list.
+  // Check  HIVE-21197 for more detail
+  private void updateSrcFileListForDupCopy(FileSystem dstFs, Path toPath, List<ReplChangeManager.FileInfo> srcFiles,
+   long writeId, int stmtId) throws IOException {
+ListIterator<ReplChangeManager.FileInfo> iter = srcFiles.listIterator();
+Path basePath = new Path(toPath, AcidUtils.baseOrDeltaSubdir(true, 
writeId, writeId, stmtId));
+while (iter.hasNext()) {
+  Path filePath = new Path(basePath, 
iter.next().getSourcePath().getName());
+  if (dstFs.exists(filePath)) {
 
 Review comment:
   OK
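
A condensed, runnable sketch of the duplicate-copy guard above. It uses plain org.apache.hadoop.fs.Path elements instead of Hive's ReplChangeManager.FileInfo, and a caller-supplied base directory instead of deriving it from writeId/stmtId; both simplifications are assumptions for brevity.

{code:java}
import java.io.IOException;
import java.util.List;
import java.util.ListIterator;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DupCopySketch {
  // Drop source files whose copies already exist under the target base/delta
  // directory, so re-running the same load does not duplicate data.
  static void dropAlreadyCopied(FileSystem dstFs, Path baseDir, List<Path> srcFiles)
      throws IOException {
    ListIterator<Path> iter = srcFiles.listIterator();
    while (iter.hasNext()) {
      Path target = new Path(baseDir, iter.next().getName());
      if (dstFs.exists(target)) {
        iter.remove(); // already copied by a previous (failed or retried) run
      }
    }
  }
}
{code}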




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770296
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationEx.java
 ##
 @@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.parse;
+
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore;
+import 
org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection;
+import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId;
+import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils;
+import org.apache.hadoop.hive.shims.Utils;
+import org.junit.*;
+import org.junit.rules.TestName;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.*;
+
+import static 
org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable;
+import static org.junit.Assert.*;
+
+/**
+ * TestReplicationWithTableMigrationEx - test replication for Hive2 to Hive3 
(Strict managed tables)
+ */
+public class TestReplicationWithTableMigrationEx {
+  @Rule
+  public final TestName testName = new TestName();
+
+  protected static final Logger LOG = 
LoggerFactory.getLogger(TestReplicationWithTableMigrationEx.class);
+  private static WarehouseInstance primary, replica;
+  private String primaryDbName, replicatedDbName;
+
+  @BeforeClass
+  public static void classLevelSetup() throws Exception {
+HashMap<String, String> overrideProperties = new HashMap<>();
+internalBeforeClassSetup(overrideProperties);
+  }
+
+  static void internalBeforeClassSetup(Map<String, String> overrideConfigs) throws Exception {
 
 Review comment:
   OK




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770272
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCompactor.java
 ##
 @@ -1536,6 +1536,62 @@ public void testCompactionInfoHashCode() {
 Assert.assertEquals("The hash codes must be equal", 
compactionInfo.hashCode(), compactionInfo1.hashCode());
   }
 
+  @Test
+  public void testDisableCompactionDuringReplLoad() throws Exception {
+String tblName = "discomp";
+String database = "discomp_db";
+executeStatementOnDriver("drop database if exists " + database + " 
cascade", driver);
+executeStatementOnDriver("create database " + database, driver);
+executeStatementOnDriver("CREATE TABLE " + database + "." + tblName + "(a 
INT, b STRING) " +
+" PARTITIONED BY(ds string)" +
+" CLUSTERED BY(a) INTO 2 BUCKETS" + //currently ACID requires 
table to be bucketed
+" STORED AS ORC TBLPROPERTIES ('transactional'='true')", driver);
+executeStatementOnDriver("insert into " + database + "." + tblName + " 
partition (ds) values (1, 'fred', " +
+"'today'), (2, 'wilma', 'yesterday')", driver);
+
+executeStatementOnDriver("ALTER TABLE " + database + "." + tblName +
+" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'true')", 
driver);
+List<ShowCompactResponseElement> compacts = getCompactionList();
+Assert.assertEquals(0, compacts.size());
+
+executeStatementOnDriver("alter database " + database +
+" set dbproperties ('hive.repl.first.inc.pending' = 'true')", 
driver);
+executeStatementOnDriver("ALTER TABLE " + database + "." + tblName +
+" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'false')", 
driver);
+compacts = getCompactionList();
+Assert.assertEquals(0, compacts.size());
+
+executeStatementOnDriver("alter database " + database +
+" set dbproperties ('hive.repl.first.inc.pending' = 'false')", 
driver);
+executeStatementOnDriver("ALTER TABLE " + database + "." + tblName +
 
 Review comment:
   OK




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259769932
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java
 ##
 @@ -164,6 +165,12 @@ public IncrementalLoadTasksBuilder(String dbName, String 
tableName, String loadP
   lastEventid);
 }
   }
+
+  ReplSetFirstIncLoadFlagDesc desc = new 
ReplSetFirstIncLoadFlagDesc(dbName, tableName);
 
 Review comment:
   OK




[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259769991
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java
 ##
 @@ -60,6 +60,7 @@
 
   public static final String LAST_REPL_ID_KEY = "hive.repl.last.repl.id";
   public static final String REPL_CHECKPOINT_KEY = "hive.repl.ckpt.key";
+  public static final String REPL_FIRST_INC_PENDING_FLAG = 
"hive.repl.first.inc.pending";
 
 Review comment:
   OK




[jira] [Created] (HIVE-21315) Consolidate rawDataSize stat calculation

2019-02-25 Thread Antal Sinkovits (JIRA)
Antal Sinkovits created HIVE-21315:
--

 Summary: Consolidate rawDataSize stat calculation 
 Key: HIVE-21315
 URL: https://issues.apache.org/jira/browse/HIVE-21315
 Project: Hive
  Issue Type: Improvement
Affects Versions: 4.0.0
Reporter: Antal Sinkovits


RawDataSize statistics represent the table size when loaded into memory. Sometimes this value is used to determine whether a table should be used in a map join or not.
This value should probably be the same regardless of the underlying file format used.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259759960
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationEx.java
 ##
 @@ -0,0 +1,273 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.parse;
+
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore;
+import 
org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection;
+import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId;
+import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils;
+import org.apache.hadoop.hive.shims.Utils;
+import org.junit.*;
+import org.junit.rules.TestName;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.*;
+
+import static 
org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable;
+import static org.junit.Assert.*;
+
+/**
+ * TestReplicationWithTableMigrationEx - test replication for Hive2 to Hive3 
(Strict managed tables)
+ */
+public class TestReplicationWithTableMigrationEx {
+  @Rule
+  public final TestName testName = new TestName();
+
+  protected static final Logger LOG = 
LoggerFactory.getLogger(TestReplicationWithTableMigrationEx.class);
+  private static WarehouseInstance primary, replica;
+  private String primaryDbName, replicatedDbName;
+
+  @BeforeClass
+  public static void classLevelSetup() throws Exception {
+HashMap<String, String> overrideProperties = new HashMap<>();
+internalBeforeClassSetup(overrideProperties);
+  }
+
+  static void internalBeforeClassSetup(Map<String, String> overrideConfigs) throws Exception {
+HiveConf conf = new HiveConf(TestReplicationWithTableMigrationEx.class);
+conf.set("dfs.client.use.datanode.hostname", "true");
+conf.set("hadoop.proxyuser." + Utils.getUGI().getShortUserName() + 
".hosts", "*");
+MiniDFSCluster miniDFSCluster =
+new MiniDFSCluster.Builder(conf).numDataNodes(1).format(true).build();
+final DistributedFileSystem fs = miniDFSCluster.getFileSystem();
+HashMap<String, String> hiveConfigs = new HashMap<String, String>() {{
+  put("fs.defaultFS", fs.getUri().toString());
+  put("hive.support.concurrency", "true");
+  put("hive.txn.manager", 
"org.apache.hadoop.hive.ql.lockmgr.DbTxnManager");
+  put("hive.metastore.client.capability.check", "false");
+  put("hive.repl.bootstrap.dump.open.txn.timeout", "1s");
+  put("hive.exec.dynamic.partition.mode", "nonstrict");
+  put("hive.strict.checks.bucketing", "false");
+  put("hive.mapred.mode", "nonstrict");
+  put("mapred.input.dir.recursive", "true");
+  put("hive.metastore.disallow.incompatible.col.type.changes", "false");
+  put("hive.strict.managed.tables", "true");
+  put("hive.metastore.transactional.event.listeners", "");
+}};
+replica = new WarehouseInstance(LOG, miniDFSCluster, hiveConfigs);
+
+HashMap<String, String> configsForPrimary = new HashMap<String, String>() {{
+  put("fs.defaultFS", fs.getUri().toString());
+  put("hive.metastore.client.capability.check", "false");
+  put("hive.repl.bootstrap.dump.open.txn.timeout", "1s");
+  put("hive.exec.dynamic.partition.mode", "nonstrict");
+  put("hive.strict.checks.bucketing", "false");
+  put("hive.mapred.mode", "nonstrict");
+  put("mapred.input.dir.recursive", "true");
+  put("hive.metastore.disallow.incompatible.col.type.changes", "false");
+  put("hive.support.concurrency", "false");
+  put("hive.txn.manager", 
"org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager");
+  put("hive.strict.managed.tables", "false");
+}};
+configsForPrimary.putAll(overrideConfigs);
+primary = new WarehouseInstance(LOG, miniDFSCluster, configsForPrim

[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259759565
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##
 @@ -274,6 +283,19 @@ Long bootStrapDump(Path dumpRoot, DumpMetaData dmd, Path 
cmRoot, Hive hiveDb) th
 for (String tblName : Utils.matchesTbl(hiveDb, dbName, 
work.tableNameOrPattern)) {
   LOG.debug(
   "analyzeReplDump dumping table: " + tblName + " to db root " + 
dbRoot.toUri());
+  Table table;
+  try {
+table = hiveDb.getTable(dbName, tblName);
 
 Review comment:
   Anyway, it is just an extra check, so it does not impact performance.




[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259758817
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java
 ##
 @@ -187,4 +192,14 @@ public static PathFilter getEventsDirectoryFilter(final 
FileSystem fs) {
   }
 };
   }
+
+  public static boolean isFirstIncPending(Map<String, String> parameter) {
 
 Review comment:
   "parameter" is generic enough; I think keeping it is not an issue.




[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259758025
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java
 ##
 @@ -426,4 +427,14 @@ public static void copyLastReplId(Map 
srcParameter, Map

[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259757727
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java
 ##
 @@ -112,6 +118,12 @@ public void run() {
 continue;
   }
 
+  if (replIsCompactionDisabledForTable(t)) {
 
 Review comment:
   I think it won't make much difference.




[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259757650
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/plan/ReplSetFirstIncLoadFlagDesc.java
 ##
 @@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.plan;
+import org.apache.hadoop.hive.ql.plan.Explain.Level;
+
+import java.io.Serializable;
+
+/**
+ * ReplSetFirstIncLoadFlagDesc.
+ *
+ */
+@Explain(displayName = "Set First Incr Load Flag", explainLevels = { 
Level.USER, Level.DEFAULT, Level.EXTENDED })
+public class ReplSetFirstIncLoadFlagDesc extends DDLDesc implements 
Serializable {
+
+  private static final long serialVersionUID = 1L;
+  String databaseName;
+  String tableName;
+  boolean incLoadPendingFlag;
+
+  /**
+   * For serialization only.
+   */
+  public ReplSetFirstIncLoadFlagDesc() {
+  }
+
+  public ReplSetFirstIncLoadFlagDesc(String databaseName, String tableName, 
boolean incLoadPendingFlag) {
+super();
+this.databaseName = databaseName;
+this.tableName = tableName;
+this.incLoadPendingFlag = incLoadPendingFlag;
+  }
+
+  @Explain(displayName="db_name", explainLevels = { Level.USER, Level.DEFAULT, 
Level.EXTENDED })
+  public String getDatabaseName() {
+return databaseName;
+  }
+
+  public void setDatabaseName(String databaseName) {
+this.databaseName = databaseName;
+  }
+
+  @Explain(displayName="table_name", explainLevels = { Level.USER, 
Level.DEFAULT, Level.EXTENDED })
+  public String getTableName() {
+return tableName;
+  }
+
+  public void setTableName(String tableName) {
+this.tableName = tableName;
+  }
+
+  @Explain(displayName="inc load pending flag", explainLevels = { Level.USER, 
Level.DEFAULT, Level.EXTENDED })
 
 Review comment:
   Can it be used for EXPLAIN REPL LOAD?




[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259756959
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1164,6 +1168,9 @@ private static void createReplImportTasks(
       if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) {
         dropTblTask = dropTableTask(table, x, replicationSpec);
         table = null;
+      } else if (!firstIncPending) {
+        // For table level replication, get the flag from table parameter. Check HIVE-21197 for more detail.
+        firstIncPending = ReplUtils.isFirstIncPending(table.getParameters());
 
 Review comment:
   For table-level replication, a create table event should not come.
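
For context, a minimal sketch of the precedence the hunk above implements; the db variable standing in for the target database object is an assumption for illustration:

{code:java}
// Sketch, not the exact patch: the db-level flag is read first; only when it
// is unset does the loader fall back to the table parameter, which is the
// table-level replication path. A create table event is a db-level bootstrap
// case, so the table-parameter fallback is never reached for it.
boolean firstIncPending = ReplUtils.isFirstIncPending(db.getParameters());
if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) {
  dropTblTask = dropTableTask(table, x, replicationSpec);
  table = null;
} else if (!firstIncPending) {
  firstIncPending = ReplUtils.isFirstIncPending(table.getParameters());
}
{code}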



[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259756728
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1164,6 +1168,9 @@ private static void createReplImportTasks(
       if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) {
         dropTblTask = dropTableTask(table, x, replicationSpec);
         table = null;
+      } else if (!firstIncPending) {
+        // For table level replication, get the flag from table parameter. Check HIVE-21197 for more detail.
+        firstIncPending = ReplUtils.isFirstIncPending(table.getParameters());
 
 Review comment:
   !firstIncPending will make sure that it is obtained for table-level replication only.
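
A minimal sketch of what a helper like ReplUtils.isFirstIncPending could look like; the class wrapper and the parameter key used here are assumptions, the real constant lives in ReplUtils:

{code:java}
import java.util.Map;

public final class ReplUtilsSketch {
  // Assumed key; the actual constant name and value may differ in ReplUtils.
  private static final String REPL_FIRST_INC_PENDING_FLAG = "hive.repl.first.inc.pending";

  private ReplUtilsSketch() {
  }

  // True only when the flag parameter is present and set to "true";
  // Boolean.parseBoolean(null) safely returns false.
  public static boolean isFirstIncPending(Map<String, String> parameters) {
    return parameters != null
        && Boolean.parseBoolean(parameters.get(REPL_FIRST_INC_PENDING_FLAG));
  }
}
{code}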



[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259756217
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java
 ##
 @@ -48,13 +48,15 @@
 
   private final DatabaseEvent event;
   private final String dbNameToLoadIn;
+  private final boolean isTableLevelLoad;
 
-  public LoadDatabase(Context context, DatabaseEvent event, String dbNameToLoadIn,
+  public LoadDatabase(Context context, DatabaseEvent event, String dbNameToLoadIn, String tblNameToLoadIn,
 
 Review comment:
   It is causing a null pointer exception if the db-level load is not done.
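
A hedged sketch of a null-safe way to derive the table-level flag, addressing the NPE concern above; the guard itself is our assumption, not the committed fix:

{code:java}
// Sketch: never dereference tblNameToLoadIn directly, so the constructor is
// safe even when no db-level load was performed and the name is null.
private static boolean isTableLevelLoad(String tblNameToLoadIn) {
  return tblNameToLoadIn != null && !tblNameToLoadIn.isEmpty();
}
{code}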

