[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r260131931
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
 ##
 @@ -5225,7 +5226,8 @@ private int updateFirstIncPendingFlag(Hive hive, 
ReplSetFirstIncLoadFlagDesc des
   for (String dbName : Utils.matchesDb(hive, dbNameOrPattern)) {
 Database database = hive.getMSC().getDatabase(dbName);
 parameters = database.getParameters();
-if (ReplUtils.isFirstIncPending(parameters)) {
+String incPendPara = parameters != null ? 
parameters.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG) : null;
+if (incPendPara != null && (!flag.equalsIgnoreCase(incPendPara))) {
 
 Review comment:
   Same as above.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r260131896
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
 ##
 @@ -5216,7 +5216,8 @@ private int updateFirstIncPendingFlag(Hive hive, 
ReplSetFirstIncLoadFlagDesc des
   for (String tableName : Utils.matchesTbl(hive, dbNameOrPattern, 
tableNameOrPattern)) {
 org.apache.hadoop.hive.metastore.api.Table tbl = 
hive.getMSC().getTable(dbNameOrPattern, tableName);
 parameters = tbl.getParameters();
-if (ReplUtils.isFirstIncPending(parameters)) {
+String incPendPara = parameters != null ? 
parameters.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG) : null;
+if (incPendPara != null && (!flag.equalsIgnoreCase(incPendPara))) {
 
 Review comment:
   If incPendPara = null also need to set the flag. Now it will skip.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r260131319
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1164,6 +1168,9 @@ private static void createReplImportTasks(
   if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) {
 dropTblTask = dropTableTask(table, x, replicationSpec);
 table = null;
+  } else if (!firstIncPending) {
+// For table level replication, get the flag from table parameter. 
Check HIVE-21197 for more detail.
+firstIncPending = ReplUtils.isFirstIncPending(table.getParameters());
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r260131264
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java
 ##
 @@ -426,4 +427,14 @@ public static void copyLastReplId(Map 
srcParameter, Map

[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r260131280
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/plan/ReplSetFirstIncLoadFlagDesc.java
 ##
 @@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.plan;
+import org.apache.hadoop.hive.ql.plan.Explain.Level;
+
+import java.io.Serializable;
+
+/**
+ * ReplSetFirstIncLoadFlagDesc.
+ *
+ */
+@Explain(displayName = "Set First Incr Load Flag", explainLevels = { 
Level.USER, Level.DEFAULT, Level.EXTENDED })
+public class ReplSetFirstIncLoadFlagDesc extends DDLDesc implements 
Serializable {
+
+  private static final long serialVersionUID = 1L;
+  String databaseName;
+  String tableName;
+  boolean incLoadPendingFlag;
+
+  /**
+   * For serialization only.
+   */
+  public ReplSetFirstIncLoadFlagDesc() {
+  }
+
+  public ReplSetFirstIncLoadFlagDesc(String databaseName, String tableName, 
boolean incLoadPendingFlag) {
+super();
+this.databaseName = databaseName;
+this.tableName = tableName;
+this.incLoadPendingFlag = incLoadPendingFlag;
+  }
+
+  @Explain(displayName="db_name", explainLevels = { Level.USER, Level.DEFAULT, 
Level.EXTENDED })
+  public String getDatabaseName() {
+return databaseName;
+  }
+
+  public void setDatabaseName(String databaseName) {
+this.databaseName = databaseName;
+  }
+
+  @Explain(displayName="table_name", explainLevels = { Level.USER, 
Level.DEFAULT, Level.EXTENDED })
+  public String getTableName() {
+return tableName;
+  }
+
+  public void setTableName(String tableName) {
+this.tableName = tableName;
+  }
+
+  @Explain(displayName="inc load pending flag", explainLevels = { Level.USER, 
Level.DEFAULT, Level.EXTENDED })
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259793406
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##
 @@ -274,6 +283,19 @@ Long bootStrapDump(Path dumpRoot, DumpMetaData dmd, Path 
cmRoot, Hive hiveDb) th
 for (String tblName : Utils.matchesTbl(hiveDb, dbName, 
work.tableNameOrPattern)) {
   LOG.debug(
   "analyzeReplDump dumping table: " + tblName + " to db root " + 
dbRoot.toUri());
+  Table table;
+  try {
+table = hiveDb.getTable(dbName, tblName);
 
 Review comment:
   yes...agreed


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776670
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java
 ##
 @@ -187,4 +188,12 @@ public static PathFilter getEventsDirectoryFilter(final 
FileSystem fs) {
   }
 };
   }
+
+  public static boolean isFirstIncDone(Map parameter) {
+if (parameter == null) {
+  return true;
+}
+String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG);
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776841
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/MetaStoreCompactorThread.java
 ##
 @@ -71,6 +74,20 @@ public void init(AtomicBoolean stop, AtomicBoolean looped) 
throws Exception {
 }
   }
 
+  @Override boolean replIsCompactionDisabledForDatabase(String dbName) throws 
TException {
+try {
+  Database database = rs.getDatabase(getDefaultCatalog(conf), dbName);
+  if (database != null) {
+return !ReplUtils.isFirstIncDone(database.getParameters());
+  }
+  LOG.info("Unable to find database " + dbName);
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776724
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java
 ##
 @@ -112,6 +118,12 @@ public void run() {
 continue;
   }
 
+  if (replIsCompactionDisabledForTable(t)) {
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776789
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1147,6 +1145,12 @@ private static void createReplImportTasks(
   if (!waitOnPrecursor){
 throw new 
SemanticException(ErrorMsg.DATABASE_NOT_EXISTS.getMsg(tblDesc.getDatabaseName()));
   }
+  // For warehouse level replication, if the database itself is getting 
created in this load, then no need to
+  // check for duplicate copy. Check HIVE-21197 for more detail.
+  firstIncDone = true;
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776621
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java
 ##
 @@ -271,12 +302,13 @@ public String getName() {
 LOG.debug("ReplCopyTask:getLoadCopyTask: {}=>{}", srcPath, dstPath);
 if ((replicationSpec != null) && replicationSpec.isInReplicationScope()){
   ReplCopyWork rcwork = new ReplCopyWork(srcPath, dstPath, false);
-  if (replicationSpec.isReplace() && 
conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION)) {
+  if (replicationSpec.isReplace() && 
(conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION) || copyToMigratedTxnTable)) {
 rcwork.setDeleteDestIfExist(true);
 rcwork.setAutoPurge(isAutoPurge);
 rcwork.setNeedRecycle(needRecycle);
   }
   rcwork.setCopyToMigratedTxnTable(copyToMigratedTxnTable);
+  rcwork.setCheckDuplicateCopy(replicationSpec.needDupCopyCheck());
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776865
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/RemoteCompactorThread.java
 ##
 @@ -53,6 +56,20 @@ public void init(AtomicBoolean stop, AtomicBoolean looped) 
throws Exception {
 }
   }
 
+  @Override boolean replIsCompactionDisabledForDatabase(String dbName) throws 
TException {
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776541
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
 ##
 @@ -661,6 +663,10 @@ public int execute(DriverContext driverContext) {
   if (work.getAlterMaterializedViewDesc() != null) {
 return alterMaterializedView(db, work.getAlterMaterializedViewDesc());
   }
+
+  if (work.getReplSetFirstIncLoadFlagDesc() != null) {
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776510
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationMisc.java
 ##
 @@ -0,0 +1,233 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.parse;
+
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore;
+import 
org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection;
+import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId;
+import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils;
+import org.apache.hadoop.hive.shims.Utils;
+import org.junit.*;
+import org.junit.rules.TestName;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.annotation.Nullable;
+import java.io.IOException;
+import java.util.*;
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259775595
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java
 ##
 @@ -112,6 +118,12 @@ public void run() {
 continue;
   }
 
+  if (replIsCompactionDisabledForTable(t)) {
 
 Review comment:
   auto_compact flag remains forever unless someone changes it. So, if it is 
false, then additional check can be avoided in that case. But repl first inc 
flag is false most of the time. So, we end up with 2 checks always.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259774866
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/plan/ReplSetFirstIncLoadFlagDesc.java
 ##
 @@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.plan;
+import org.apache.hadoop.hive.ql.plan.Explain.Level;
+
+import java.io.Serializable;
+
+/**
+ * ReplSetFirstIncLoadFlagDesc.
+ *
+ */
+@Explain(displayName = "Set First Incr Load Flag", explainLevels = { 
Level.USER, Level.DEFAULT, Level.EXTENDED })
+public class ReplSetFirstIncLoadFlagDesc extends DDLDesc implements 
Serializable {
+
+  private static final long serialVersionUID = 1L;
+  String databaseName;
+  String tableName;
+  boolean incLoadPendingFlag;
+
+  /**
+   * For serialization only.
+   */
+  public ReplSetFirstIncLoadFlagDesc() {
+  }
+
+  public ReplSetFirstIncLoadFlagDesc(String databaseName, String tableName, 
boolean incLoadPendingFlag) {
+super();
+this.databaseName = databaseName;
+this.tableName = tableName;
+this.incLoadPendingFlag = incLoadPendingFlag;
+  }
+
+  @Explain(displayName="db_name", explainLevels = { Level.USER, Level.DEFAULT, 
Level.EXTENDED })
+  public String getDatabaseName() {
+return databaseName;
+  }
+
+  public void setDatabaseName(String databaseName) {
+this.databaseName = databaseName;
+  }
+
+  @Explain(displayName="table_name", explainLevels = { Level.USER, 
Level.DEFAULT, Level.EXTENDED })
+  public String getTableName() {
+return tableName;
+  }
+
+  public void setTableName(String tableName) {
+this.tableName = tableName;
+  }
+
+  @Explain(displayName="inc load pending flag", explainLevels = { Level.USER, 
Level.DEFAULT, Level.EXTENDED })
 
 Review comment:
   We can keep it. But it is redundant for now because EXPLAIN REPL LOAD, just 
shows ReplLoadTask. nothing else.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259774454
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java
 ##
 @@ -426,4 +427,14 @@ public static void copyLastReplId(Map 
srcParameter, Map

[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259773951
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java
 ##
 @@ -187,4 +192,14 @@ public static PathFilter getEventsDirectoryFilter(final 
FileSystem fs) {
   }
 };
   }
+
+  public static boolean isFirstIncPending(Map parameter) {
 
 Review comment:
   OK.. Then change it to parameters. with "s" :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259773642
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##
 @@ -274,6 +283,19 @@ Long bootStrapDump(Path dumpRoot, DumpMetaData dmd, Path 
cmRoot, Hive hiveDb) th
 for (String tblName : Utils.matchesTbl(hiveDb, dbName, 
work.tableNameOrPattern)) {
   LOG.debug(
   "analyzeReplDump dumping table: " + tblName + " to db root " + 
dbRoot.toUri());
+  Table table;
+  try {
+table = hiveDb.getTable(dbName, tblName);
 
 Review comment:
   we get table from Metastore. It impacts perf.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259773205
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationEx.java
 ##
 @@ -0,0 +1,273 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.parse;
+
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore;
+import 
org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection;
+import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId;
+import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils;
+import org.apache.hadoop.hive.shims.Utils;
+import org.junit.*;
+import org.junit.rules.TestName;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.*;
+
+import static 
org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable;
+import static org.junit.Assert.*;
+
+/**
+ * TestReplicationWithTableMigrationEx - test replication for Hive2 to Hive3 
(Strict managed tables)
+ */
+public class TestReplicationWithTableMigrationEx {
+  @Rule
+  public final TestName testName = new TestName();
+
+  protected static final Logger LOG = 
LoggerFactory.getLogger(TestReplicationWithTableMigrationEx.class);
+  private static WarehouseInstance primary, replica;
+  private String primaryDbName, replicatedDbName;
+
+  @BeforeClass
+  public static void classLevelSetup() throws Exception {
+HashMap overrideProperties = new HashMap<>();
+internalBeforeClassSetup(overrideProperties);
+  }
+
+  static void internalBeforeClassSetup(Map overrideConfigs) 
throws Exception {
+HiveConf conf = new HiveConf(TestReplicationWithTableMigrationEx.class);
+conf.set("dfs.client.use.datanode.hostname", "true");
+conf.set("hadoop.proxyuser." + Utils.getUGI().getShortUserName() + 
".hosts", "*");
+MiniDFSCluster miniDFSCluster =
+new MiniDFSCluster.Builder(conf).numDataNodes(1).format(true).build();
+final DistributedFileSystem fs = miniDFSCluster.getFileSystem();
+HashMap hiveConfigs = new HashMap() {{
+  put("fs.defaultFS", fs.getUri().toString());
+  put("hive.support.concurrency", "true");
+  put("hive.txn.manager", 
"org.apache.hadoop.hive.ql.lockmgr.DbTxnManager");
+  put("hive.metastore.client.capability.check", "false");
+  put("hive.repl.bootstrap.dump.open.txn.timeout", "1s");
+  put("hive.exec.dynamic.partition.mode", "nonstrict");
+  put("hive.strict.checks.bucketing", "false");
+  put("hive.mapred.mode", "nonstrict");
+  put("mapred.input.dir.recursive", "true");
+  put("hive.metastore.disallow.incompatible.col.type.changes", "false");
+  put("hive.strict.managed.tables", "true");
+  put("hive.metastore.transactional.event.listeners", "");
+}};
+replica = new WarehouseInstance(LOG, miniDFSCluster, hiveConfigs);
+
+HashMap configsForPrimary = new HashMap() 
{{
+  put("fs.defaultFS", fs.getUri().toString());
+  put("hive.metastore.client.capability.check", "false");
+  put("hive.repl.bootstrap.dump.open.txn.timeout", "1s");
+  put("hive.exec.dynamic.partition.mode", "nonstrict");
+  put("hive.strict.checks.bucketing", "false");
+  put("hive.mapred.mode", "nonstrict");
+  put("mapred.input.dir.recursive", "true");
+  put("hive.metastore.disallow.incompatible.col.type.changes", "false");
+  put("hive.support.concurrency", "false");
+  put("hive.txn.manager", 
"org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager");
+  put("hive.strict.managed.tables", "false");
+}};
+configsForPrimary.putAll(overrideConfigs);
+primary = new WarehouseInstance(LOG, miniDFSCluster, 

[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259773047
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1164,6 +1168,9 @@ private static void createReplImportTasks(
   if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) {
 dropTblTask = dropTableTask(table, x, replicationSpec);
 table = null;
+  } else if (!firstIncPending) {
+// For table level replication, get the flag from table parameter. 
Check HIVE-21197 for more detail.
+firstIncPending = ReplUtils.isFirstIncPending(table.getParameters());
 
 Review comment:
   it is allowed today. we need not bootstrap for table level replication if 
create table event is part of inc dump.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259772504
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1147,6 +1145,12 @@ private static void createReplImportTasks(
   if (!waitOnPrecursor){
 throw new 
SemanticException(ErrorMsg.DATABASE_NOT_EXISTS.getMsg(tblDesc.getDatabaseName()));
   }
+  // For warehouse level replication, if the database itself is getting 
created in this load, then no need to
+  // check for duplicate copy. Check HIVE-21197 for more detail.
+  firstIncPending = false;
+} else {
+  // For database replication, get the flag from database parameter. Check 
HIVE-21197 for more detail.
+  firstIncPending = ReplUtils.isFirstIncPending(parentDb.getParameters());
 
 Review comment:
   Yes, functionally it is right... But not so readable as we assume this flag 
is false from DB if table level replication. Pls add a comment.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259772546
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1164,6 +1168,9 @@ private static void createReplImportTasks(
   if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) {
 dropTblTask = dropTableTask(table, x, replicationSpec);
 table = null;
+  } else if (!firstIncPending) {
+// For table level replication, get the flag from table parameter. 
Check HIVE-21197 for more detail.
+firstIncPending = ReplUtils.isFirstIncPending(table.getParameters());
 
 Review comment:
   Pls add a comment.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259771726
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java
 ##
 @@ -48,13 +48,15 @@
 
   private final DatabaseEvent event;
   private final String dbNameToLoadIn;
+  private final boolean isTableLevelLoad;
 
-  public LoadDatabase(Context context, DatabaseEvent event, String 
dbNameToLoadIn,
+  public LoadDatabase(Context context, DatabaseEvent event, String 
dbNameToLoadIn, String tblNameToLoadIn,
 
 Review comment:
   I think, this fix is partial and ideally, table level replication shoudn't 
enter LoadDatabase. Pls create a bug for this and leave it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259771264
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java
 ##
 @@ -289,12 +296,21 @@ private boolean shouldReplayEvent(FileStatus dir, 
DumpType dumpType, String dbNa
 return updateReplIdTask;
   }
 
-  private Task dbUpdateReplStateTask(String dbName, 
String replState,
+  private Task dbUpdateReplStateTask(String dbName, 
String replState, String incLoadPendFlag,
  Task preCursor) {
 HashMap mapProp = new HashMap<>();
-mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState);
 
-AlterDatabaseDesc alterDbDesc = new AlterDatabaseDesc(dbName, mapProp, new 
ReplicationSpec(replState, replState));
+// if the update is for incLoadPendFlag, then send replicationSpec as null 
to avoid replacement check.
+ReplicationSpec replicationSpec = null;
+if (incLoadPendFlag == null) {
+  mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState);
+  replicationSpec = new ReplicationSpec(replState, replState);
+} else {
+  assert replState == null;
+  mapProp.put(ReplUtils.REPL_FIRST_INC_PENDING_FLAG, incLoadPendFlag);
 
 Review comment:
   But, this issue is there even if inc pending flag is set to false. Please 
recheck and comment.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770576
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java
 ##
 @@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final 
FileSystem fs) {
   }
 };
   }
+
+  public static boolean isFirstIncDone(Map parameter) {
+if (parameter == null) {
+  return true;
+}
+String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG);
+return compFlag == null  || compFlag.isEmpty() || 
"false".equalsIgnoreCase(compFlag);
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770626
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java
 ##
 @@ -289,12 +296,21 @@ private boolean shouldReplayEvent(FileStatus dir, 
DumpType dumpType, String dbNa
 return updateReplIdTask;
   }
 
-  private Task dbUpdateReplStateTask(String dbName, 
String replState,
+  private Task dbUpdateReplStateTask(String dbName, 
String replState, String incLoadPendFlag,
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770334
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCompactor.java
 ##
 @@ -1536,6 +1536,62 @@ public void testCompactionInfoHashCode() {
 Assert.assertEquals("The hash codes must be equal", 
compactionInfo.hashCode(), compactionInfo1.hashCode());
   }
 
+  @Test
+  public void testDisableCompactionDuringReplLoad() throws Exception {
+String tblName = "discomp";
+String database = "discomp_db";
+executeStatementOnDriver("drop database if exists " + database + " 
cascade", driver);
+executeStatementOnDriver("create database " + database, driver);
+executeStatementOnDriver("CREATE TABLE " + database + "." + tblName + "(a 
INT, b STRING) " +
+" PARTITIONED BY(ds string)" +
+" CLUSTERED BY(a) INTO 2 BUCKETS" + //currently ACID requires 
table to be bucketed
+" STORED AS ORC TBLPROPERTIES ('transactional'='true')", driver);
+executeStatementOnDriver("insert into " + database + "." + tblName + " 
partition (ds) values (1, 'fred', " +
+"'today'), (2, 'wilma', 'yesterday')", driver);
+
+executeStatementOnDriver("ALTER TABLE " + database + "." + tblName +
+" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'true')", 
driver);
+List compacts = getCompactionList();
+Assert.assertEquals(0, compacts.size());
+
+executeStatementOnDriver("alter database " + database +
+" set dbproperties ('hive.repl.first.inc.pending' = 'true')", 
driver);
+executeStatementOnDriver("ALTER TABLE " + database + "." + tblName +
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770434
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java
 ##
 @@ -370,6 +370,9 @@ private int executeIncrementalLoad(DriverContext 
driverContext) {
 
   // If incremental events are already applied, then check and perform if 
need to bootstrap any tables.
   if (!builder.hasMoreWork() && !work.getPathsToCopyIterator().hasNext()) {
+// No need to set incremental load pending flag for external tables as 
the files will be copied to the same path
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770370
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java
 ##
 @@ -61,6 +62,21 @@ public ReplCopyTask(){
 super();
   }
 
+  // If file is already present in base directory, then remove it from the 
list.
+  // Check  HIVE-21197 for more detail
+  private void updateSrcFileListForDupCopy(FileSystem dstFs, Path toPath, 
List srcFiles,
+   long writeId, int stmtId) throws 
IOException {
+ListIterator iter = srcFiles.listIterator();
+Path basePath = new Path(toPath, AcidUtils.baseOrDeltaSubdir(true, 
writeId, writeId, stmtId));
+while (iter.hasNext()) {
+  Path filePath = new Path(basePath, 
iter.next().getSourcePath().getName());
+  if (dstFs.exists(filePath)) {
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770296
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationEx.java
 ##
 @@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.parse;
+
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore;
+import 
org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection;
+import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId;
+import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils;
+import org.apache.hadoop.hive.shims.Utils;
+import org.junit.*;
+import org.junit.rules.TestName;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.*;
+
+import static 
org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable;
+import static org.junit.Assert.*;
+
+/**
+ * TestReplicationWithTableMigrationEx - test replication for Hive2 to Hive3 
(Strict managed tables)
+ */
+public class TestReplicationWithTableMigrationEx {
+  @Rule
+  public final TestName testName = new TestName();
+
+  protected static final Logger LOG = 
LoggerFactory.getLogger(TestReplicationWithTableMigrationEx.class);
+  private static WarehouseInstance primary, replica;
+  private String primaryDbName, replicatedDbName;
+
+  @BeforeClass
+  public static void classLevelSetup() throws Exception {
+HashMap overrideProperties = new HashMap<>();
+internalBeforeClassSetup(overrideProperties);
+  }
+
+  static void internalBeforeClassSetup(Map overrideConfigs) 
throws Exception {
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770272
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCompactor.java
 ##
 @@ -1536,6 +1536,62 @@ public void testCompactionInfoHashCode() {
 Assert.assertEquals("The hash codes must be equal", 
compactionInfo.hashCode(), compactionInfo1.hashCode());
   }
 
+  @Test
+  public void testDisableCompactionDuringReplLoad() throws Exception {
+String tblName = "discomp";
+String database = "discomp_db";
+executeStatementOnDriver("drop database if exists " + database + " 
cascade", driver);
+executeStatementOnDriver("create database " + database, driver);
+executeStatementOnDriver("CREATE TABLE " + database + "." + tblName + "(a 
INT, b STRING) " +
+" PARTITIONED BY(ds string)" +
+" CLUSTERED BY(a) INTO 2 BUCKETS" + //currently ACID requires 
table to be bucketed
+" STORED AS ORC TBLPROPERTIES ('transactional'='true')", driver);
+executeStatementOnDriver("insert into " + database + "." + tblName + " 
partition (ds) values (1, 'fred', " +
+"'today'), (2, 'wilma', 'yesterday')", driver);
+
+executeStatementOnDriver("ALTER TABLE " + database + "." + tblName +
+" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'true')", 
driver);
+List compacts = getCompactionList();
+Assert.assertEquals(0, compacts.size());
+
+executeStatementOnDriver("alter database " + database +
+" set dbproperties ('hive.repl.first.inc.pending' = 'true')", 
driver);
+executeStatementOnDriver("ALTER TABLE " + database + "." + tblName +
+" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'false')", 
driver);
+compacts = getCompactionList();
+Assert.assertEquals(0, compacts.size());
+
+executeStatementOnDriver("alter database " + database +
+" set dbproperties ('hive.repl.first.inc.pending' = 'false')", 
driver);
+executeStatementOnDriver("ALTER TABLE " + database + "." + tblName +
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259769932
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java
 ##
 @@ -164,6 +165,12 @@ public IncrementalLoadTasksBuilder(String dbName, String 
tableName, String loadP
   lastEventid);
 }
   }
+
+  ReplSetFirstIncLoadFlagDesc desc = new 
ReplSetFirstIncLoadFlagDesc(dbName, tableName);
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259769991
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java
 ##
 @@ -60,6 +60,7 @@
 
   public static final String LAST_REPL_ID_KEY = "hive.repl.last.repl.id";
   public static final String REPL_CHECKPOINT_KEY = "hive.repl.ckpt.key";
+  public static final String REPL_FIRST_INC_PENDING_FLAG = 
"hive.repl.first.inc.pending";
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259745762
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java
 ##
 @@ -187,4 +192,14 @@ public static PathFilter getEventsDirectoryFilter(final 
FileSystem fs) {
   }
 };
   }
+
+  public static boolean isFirstIncPending(Map parameter) {
 
 Review comment:
   parameter can be changed to dbOrTableProperties.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259743849
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##
 @@ -274,6 +283,19 @@ Long bootStrapDump(Path dumpRoot, DumpMetaData dmd, Path 
cmRoot, Hive hiveDb) th
 for (String tblName : Utils.matchesTbl(hiveDb, dbName, 
work.tableNameOrPattern)) {
   LOG.debug(
   "analyzeReplDump dumping table: " + tblName + " to db root " + 
dbRoot.toUri());
+  Table table;
+  try {
+table = hiveDb.getTable(dbName, tblName);
 
 Review comment:
   We invoke getTable twice. Can we reuse tableTuple that we get below inside 
try block.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259736959
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1164,6 +1168,9 @@ private static void createReplImportTasks(
   if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) {
 dropTblTask = dropTableTask(table, x, replicationSpec);
 table = null;
+  } else if (!firstIncPending) {
+// For table level replication, get the flag from table parameter. 
Check HIVE-21197 for more detail.
+firstIncPending = ReplUtils.isFirstIncPending(table.getParameters());
 
 Review comment:
   For insert or alter events where previous event is create table, then 
referring to firstincpending flag could be wrong for table level replication.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259738265
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationEx.java
 ##
 @@ -0,0 +1,273 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.parse;
+
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore;
+import 
org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection;
+import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId;
+import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils;
+import org.apache.hadoop.hive.shims.Utils;
+import org.junit.*;
+import org.junit.rules.TestName;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.*;
+
+import static 
org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable;
+import static org.junit.Assert.*;
+
+/**
+ * TestReplicationWithTableMigrationEx - test replication for Hive2 to Hive3 
(Strict managed tables)
+ */
+public class TestReplicationWithTableMigrationEx {
+  @Rule
+  public final TestName testName = new TestName();
+
+  protected static final Logger LOG = 
LoggerFactory.getLogger(TestReplicationWithTableMigrationEx.class);
+  private static WarehouseInstance primary, replica;
+  private String primaryDbName, replicatedDbName;
+
+  @BeforeClass
+  public static void classLevelSetup() throws Exception {
+HashMap overrideProperties = new HashMap<>();
+internalBeforeClassSetup(overrideProperties);
+  }
+
+  static void internalBeforeClassSetup(Map overrideConfigs) 
throws Exception {
+HiveConf conf = new HiveConf(TestReplicationWithTableMigrationEx.class);
+conf.set("dfs.client.use.datanode.hostname", "true");
+conf.set("hadoop.proxyuser." + Utils.getUGI().getShortUserName() + 
".hosts", "*");
+MiniDFSCluster miniDFSCluster =
+new MiniDFSCluster.Builder(conf).numDataNodes(1).format(true).build();
+final DistributedFileSystem fs = miniDFSCluster.getFileSystem();
+HashMap hiveConfigs = new HashMap() {{
+  put("fs.defaultFS", fs.getUri().toString());
+  put("hive.support.concurrency", "true");
+  put("hive.txn.manager", 
"org.apache.hadoop.hive.ql.lockmgr.DbTxnManager");
+  put("hive.metastore.client.capability.check", "false");
+  put("hive.repl.bootstrap.dump.open.txn.timeout", "1s");
+  put("hive.exec.dynamic.partition.mode", "nonstrict");
+  put("hive.strict.checks.bucketing", "false");
+  put("hive.mapred.mode", "nonstrict");
+  put("mapred.input.dir.recursive", "true");
+  put("hive.metastore.disallow.incompatible.col.type.changes", "false");
+  put("hive.strict.managed.tables", "true");
+  put("hive.metastore.transactional.event.listeners", "");
+}};
+replica = new WarehouseInstance(LOG, miniDFSCluster, hiveConfigs);
+
+HashMap configsForPrimary = new HashMap() 
{{
+  put("fs.defaultFS", fs.getUri().toString());
+  put("hive.metastore.client.capability.check", "false");
+  put("hive.repl.bootstrap.dump.open.txn.timeout", "1s");
+  put("hive.exec.dynamic.partition.mode", "nonstrict");
+  put("hive.strict.checks.bucketing", "false");
+  put("hive.mapred.mode", "nonstrict");
+  put("mapred.input.dir.recursive", "true");
+  put("hive.metastore.disallow.incompatible.col.type.changes", "false");
+  put("hive.support.concurrency", "false");
+  put("hive.txn.manager", 
"org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager");
+  put("hive.strict.managed.tables", "false");
+}};
+configsForPrimary.putAll(overrideConfigs);
+primary = new WarehouseInstance(LOG, miniDFSCluster, 

[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259735134
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1147,6 +1145,12 @@ private static void createReplImportTasks(
   if (!waitOnPrecursor){
 throw new 
SemanticException(ErrorMsg.DATABASE_NOT_EXISTS.getMsg(tblDesc.getDatabaseName()));
   }
+  // For warehouse level replication, if the database itself is getting 
created in this load, then no need to
+  // check for duplicate copy. Check HIVE-21197 for more detail.
+  firstIncPending = false;
+} else {
+  // For database replication, get the flag from database parameter. Check 
HIVE-21197 for more detail.
+  firstIncPending = ReplUtils.isFirstIncPending(parentDb.getParameters());
 
 Review comment:
   For table level replication, we should not refer to parentDb.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259733418
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java
 ##
 @@ -48,13 +48,15 @@
 
   private final DatabaseEvent event;
   private final String dbNameToLoadIn;
+  private final boolean isTableLevelLoad;
 
-  public LoadDatabase(Context context, DatabaseEvent event, String 
dbNameToLoadIn,
+  public LoadDatabase(Context context, DatabaseEvent event, String 
dbNameToLoadIn, String tblNameToLoadIn,
 
 Review comment:
   For table level replication, we should skip LoadDatabase tasks from caller 
itself. Target Db should exist in that case.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259739290
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
 ##
 @@ -5199,6 +5205,35 @@ public static boolean doesTableNeedLocation(Table tbl) {
 return retval;
   }
 
+  private int updateFirstIncPendingFlag(Hive hive, ReplSetFirstIncLoadFlagDesc 
desc) throws HiveException, TException {
 
 Review comment:
   ReplSetFirstIncLoadFlagDesc can be changed to ReplSetFirstIncPendingFlagDesc.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259753768
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java
 ##
 @@ -112,6 +118,12 @@ public void run() {
 continue;
   }
 
+  if (replIsCompactionDisabledForTable(t)) {
 
 Review comment:
   replIsCompactionDisabledForTable is most likely return false. So, can we 
move this check after noAutoCompactSet check? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259735679
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1164,6 +1168,9 @@ private static void createReplImportTasks(
   if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) {
 dropTblTask = dropTableTask(table, x, replicationSpec);
 table = null;
+  } else if (!firstIncPending) {
+// For table level replication, get the flag from table parameter. 
Check HIVE-21197 for more detail.
+firstIncPending = ReplUtils.isFirstIncPending(table.getParameters());
 
 Review comment:
   This should be referred only in case of table level replication. Also, for 
table level replication with create table event, we should set this flag to 
false always.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259752384
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/plan/ReplSetFirstIncLoadFlagDesc.java
 ##
 @@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.plan;
+import org.apache.hadoop.hive.ql.plan.Explain.Level;
+
+import java.io.Serializable;
+
+/**
+ * ReplSetFirstIncLoadFlagDesc.
+ *
+ */
+@Explain(displayName = "Set First Incr Load Flag", explainLevels = { 
Level.USER, Level.DEFAULT, Level.EXTENDED })
+public class ReplSetFirstIncLoadFlagDesc extends DDLDesc implements 
Serializable {
+
+  private static final long serialVersionUID = 1L;
+  String databaseName;
+  String tableName;
+  boolean incLoadPendingFlag;
+
+  /**
+   * For serialization only.
+   */
+  public ReplSetFirstIncLoadFlagDesc() {
+  }
+
+  public ReplSetFirstIncLoadFlagDesc(String databaseName, String tableName, 
boolean incLoadPendingFlag) {
+super();
+this.databaseName = databaseName;
+this.tableName = tableName;
+this.incLoadPendingFlag = incLoadPendingFlag;
+  }
+
+  @Explain(displayName="db_name", explainLevels = { Level.USER, Level.DEFAULT, 
Level.EXTENDED })
+  public String getDatabaseName() {
+return databaseName;
+  }
+
+  public void setDatabaseName(String databaseName) {
+this.databaseName = databaseName;
+  }
+
+  @Explain(displayName="table_name", explainLevels = { Level.USER, 
Level.DEFAULT, Level.EXTENDED })
+  public String getTableName() {
+return tableName;
+  }
+
+  public void setTableName(String tableName) {
+this.tableName = tableName;
+  }
+
+  @Explain(displayName="inc load pending flag", explainLevels = { Level.USER, 
Level.DEFAULT, Level.EXTENDED })
 
 Review comment:
   @Explain can be removed from this class as this is internal task and won't 
be featured in explain plan.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259751152
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java
 ##
 @@ -426,4 +427,14 @@ public static void copyLastReplId(Map 
srcParameter, Map

[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259739884
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
 ##
 @@ -5199,6 +5205,35 @@ public static boolean doesTableNeedLocation(Table tbl) {
 return retval;
   }
 
+  private int updateFirstIncPendingFlag(Hive hive, ReplSetFirstIncLoadFlagDesc 
desc) throws HiveException, TException {
+String dbNameOrPattern = desc.getDatabaseName();
+String tableNameOrPattern = desc.getTableName();
+String flag = desc.getIncLoadPendingFlag() ? "true" : "false";
+Map parameters;
+// For database level load tableNameOrPattern will be null. Flag is set 
only in database for db level load.
+if (tableNameOrPattern != null && !tableNameOrPattern.isEmpty()) {
+  // For table level load, dbNameOrPattern is db name and not a pattern.
+  for (String tableName : Utils.matchesTbl(hive, dbNameOrPattern, 
tableNameOrPattern)) {
+org.apache.hadoop.hive.metastore.api.Table tbl = 
hive.getMSC().getTable(dbNameOrPattern, tableName);
+parameters = tbl.getParameters();
+if (ReplUtils.isFirstIncPending(parameters)) {
 
 Review comment:
   Why do we have this check? I think, this should set if current flag is 
different from input flag.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-25 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259743377
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##
 @@ -274,6 +283,19 @@ Long bootStrapDump(Path dumpRoot, DumpMetaData dmd, Path 
cmRoot, Hive hiveDb) th
 for (String tblName : Utils.matchesTbl(hiveDb, dbName, 
work.tableNameOrPattern)) {
   LOG.debug(
   "analyzeReplDump dumping table: " + tblName + " to db root " + 
dbRoot.toUri());
+  Table table;
+  try {
+table = hiveDb.getTable(dbName, tblName);
 
 Review comment:
   These table level checks can be done only if work.tableNameOrPattern is 
valid.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-23 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259606013
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java
 ##
 @@ -271,12 +299,13 @@ public String getName() {
 LOG.debug("ReplCopyTask:getLoadCopyTask: {}=>{}", srcPath, dstPath);
 if ((replicationSpec != null) && replicationSpec.isInReplicationScope()){
   ReplCopyWork rcwork = new ReplCopyWork(srcPath, dstPath, false);
-  if (replicationSpec.isReplace() && 
conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION)) {
+  if (replicationSpec.isReplace() && 
(conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION) || copyToMigratedTxnTable)) {
 rcwork.setDeleteDestIfExist(true);
 
 Review comment:
   As per discussion, we need to skip duplicate check for replace case, as we 
may end up with skipping copy of files with same name but different content. 
Also, we may need to move those matching files to new base path which is 
complicated. So, we just skip duplicate check for replace flow.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-23 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259587931
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCompactor.java
 ##
 @@ -1536,6 +1536,62 @@ public void testCompactionInfoHashCode() {
 Assert.assertEquals("The hash codes must be equal", 
compactionInfo.hashCode(), compactionInfo1.hashCode());
   }
 
+  @Test
+  public void testDisableCompactionDuringReplLoad() throws Exception {
+String tblName = "discomp";
+String database = "discomp_db";
+executeStatementOnDriver("drop database if exists " + database + " 
cascade", driver);
+executeStatementOnDriver("create database " + database, driver);
+executeStatementOnDriver("CREATE TABLE " + database + "." + tblName + "(a 
INT, b STRING) " +
+" PARTITIONED BY(ds string)" +
+" CLUSTERED BY(a) INTO 2 BUCKETS" + //currently ACID requires 
table to be bucketed
+" STORED AS ORC TBLPROPERTIES ('transactional'='true')", driver);
+executeStatementOnDriver("insert into " + database + "." + tblName + " 
partition (ds) values (1, 'fred', " +
+"'today'), (2, 'wilma', 'yesterday')", driver);
+
+executeStatementOnDriver("ALTER TABLE " + database + "." + tblName +
+" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'true')", 
driver);
+List compacts = getCompactionList();
+Assert.assertEquals(0, compacts.size());
+
+executeStatementOnDriver("alter database " + database +
+" set dbproperties ('hive.repl.first.inc.pending' = 'true')", 
driver);
+executeStatementOnDriver("ALTER TABLE " + database + "." + tblName +
+" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'false')", 
driver);
+compacts = getCompactionList();
+Assert.assertEquals(0, compacts.size());
+
+executeStatementOnDriver("alter database " + database +
+" set dbproperties ('hive.repl.first.inc.pending' = 'false')", 
driver);
+executeStatementOnDriver("ALTER TABLE " + database + "." + tblName +
 
 Review comment:
   Alter table doesn't make sense now as we removed it from use.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-23 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259587889
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationEx.java
 ##
 @@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.parse;
+
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore;
+import 
org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection;
+import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId;
+import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils;
+import org.apache.hadoop.hive.shims.Utils;
+import org.junit.*;
+import org.junit.rules.TestName;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.*;
+
+import static 
org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable;
+import static org.junit.Assert.*;
+
+/**
+ * TestReplicationWithTableMigrationEx - test replication for Hive2 to Hive3 
(Strict managed tables)
+ */
+public class TestReplicationWithTableMigrationEx {
+  @Rule
+  public final TestName testName = new TestName();
+
+  protected static final Logger LOG = 
LoggerFactory.getLogger(TestReplicationWithTableMigrationEx.class);
+  private static WarehouseInstance primary, replica;
+  private String primaryDbName, replicatedDbName;
+
+  @BeforeClass
+  public static void classLevelSetup() throws Exception {
+HashMap overrideProperties = new HashMap<>();
+internalBeforeClassSetup(overrideProperties);
+  }
+
+  static void internalBeforeClassSetup(Map overrideConfigs) 
throws Exception {
 
 Review comment:
   Can we have a common base class just like BaseReplicationAcrossInstances 
instead of duplicating in both classes?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-23 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259588731
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java
 ##
 @@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final 
FileSystem fs) {
   }
 };
   }
+
+  public static boolean isFirstIncDone(Map parameter) {
+if (parameter == null) {
+  return true;
+}
+String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG);
+return compFlag == null  || compFlag.isEmpty() || 
"false".equalsIgnoreCase(compFlag);
 
 Review comment:
   String.equalsIgnoreCase handles null and empty cases of input argument. 
However, it returns false in both the cases. Can remove these checks but need 
to be careful.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-23 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259588983
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java
 ##
 @@ -289,12 +296,21 @@ private boolean shouldReplayEvent(FileStatus dir, 
DumpType dumpType, String dbNa
 return updateReplIdTask;
   }
 
-  private Task dbUpdateReplStateTask(String dbName, 
String replState,
+  private Task dbUpdateReplStateTask(String dbName, 
String replState, String incLoadPendFlag,
 
 Review comment:
   Shall change incLoadPendFlag to firstIncLoadPendingFlag.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-23 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259588369
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java
 ##
 @@ -271,12 +299,13 @@ public String getName() {
 LOG.debug("ReplCopyTask:getLoadCopyTask: {}=>{}", srcPath, dstPath);
 if ((replicationSpec != null) && replicationSpec.isInReplicationScope()){
   ReplCopyWork rcwork = new ReplCopyWork(srcPath, dstPath, false);
-  if (replicationSpec.isReplace() && 
conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION)) {
+  if (replicationSpec.isReplace() && 
(conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION) || copyToMigratedTxnTable)) {
 rcwork.setDeleteDestIfExist(true);
 
 Review comment:
   If it is insert overwrite case(writeId=2) and let's say few of the listed 
files are there in base_1. Now, this deleteDestIfExists marks to create base_2 
instead of delta_2_2 as it is IOW. So, only partial data would be copied to 
base_2. After commit of writeId=2, all readers would see partial data from 
base_2. So, need to handle this case.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-23 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259587946
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCompactor.java
 ##
 @@ -1536,6 +1536,62 @@ public void testCompactionInfoHashCode() {
 Assert.assertEquals("The hash codes must be equal", 
compactionInfo.hashCode(), compactionInfo1.hashCode());
   }
 
+  @Test
+  public void testDisableCompactionDuringReplLoad() throws Exception {
+String tblName = "discomp";
+String database = "discomp_db";
+executeStatementOnDriver("drop database if exists " + database + " 
cascade", driver);
+executeStatementOnDriver("create database " + database, driver);
+executeStatementOnDriver("CREATE TABLE " + database + "." + tblName + "(a 
INT, b STRING) " +
+" PARTITIONED BY(ds string)" +
+" CLUSTERED BY(a) INTO 2 BUCKETS" + //currently ACID requires 
table to be bucketed
+" STORED AS ORC TBLPROPERTIES ('transactional'='true')", driver);
+executeStatementOnDriver("insert into " + database + "." + tblName + " 
partition (ds) values (1, 'fred', " +
+"'today'), (2, 'wilma', 'yesterday')", driver);
+
+executeStatementOnDriver("ALTER TABLE " + database + "." + tblName +
+" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'true')", 
driver);
+List compacts = getCompactionList();
+Assert.assertEquals(0, compacts.size());
+
+executeStatementOnDriver("alter database " + database +
+" set dbproperties ('hive.repl.first.inc.pending' = 'true')", 
driver);
+executeStatementOnDriver("ALTER TABLE " + database + "." + tblName +
 
 Review comment:
   Alter table doesn't make sense now as we removed it from use.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-23 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259587771
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/MetaStoreCompactorThread.java
 ##
 @@ -71,6 +74,16 @@ public void init(AtomicBoolean stop, AtomicBoolean looped) 
throws Exception {
 }
   }
 
+  @Override boolean replIsCompactionDisabledForDatabase(String dbName) throws 
TException {
+try {
+  Database database = rs.getDatabase(getDefaultCatalog(conf), dbName);
 
 Review comment:
   Shall add a comment on why compaction is disabled if first incremental is 
not done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-23 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259587562
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java
 ##
 @@ -426,4 +427,14 @@ public static void copyLastReplId(Map 
srcParameter, Map

[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-23 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259589411
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java
 ##
 @@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final 
FileSystem fs) {
   }
 };
   }
+
+  public static boolean isFirstIncDone(Map parameter) {
 
 Review comment:
   All the callers negate this result while using. Why don't we change it to 
isFirstIncPending? It will make it uniform with stored DB parameter name as 
well.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-23 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259587524
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java
 ##
 @@ -426,4 +427,14 @@ public static void copyLastReplId(Map 
srcParameter, Map

[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-23 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259589143
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java
 ##
 @@ -289,12 +296,21 @@ private boolean shouldReplayEvent(FileStatus dir, 
DumpType dumpType, String dbNa
 return updateReplIdTask;
   }
 
-  private Task dbUpdateReplStateTask(String dbName, 
String replState,
+  private Task dbUpdateReplStateTask(String dbName, 
String replState, String incLoadPendFlag,
  Task preCursor) {
 HashMap mapProp = new HashMap<>();
-mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState);
 
-AlterDatabaseDesc alterDbDesc = new AlterDatabaseDesc(dbName, mapProp, new 
ReplicationSpec(replState, replState));
+// if the update is for incLoadPendFlag, then send replicationSpec as null 
to avoid replacement check.
+ReplicationSpec replicationSpec = null;
+if (incLoadPendFlag == null) {
+  mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState);
+  replicationSpec = new ReplicationSpec(replState, replState);
+} else {
+  assert replState == null;
+  mapProp.put(ReplUtils.REPL_FIRST_INC_PENDING_FLAG, incLoadPendFlag);
 
 Review comment:
   There is one corner case for A->B->C where EVENT_ALTER_DATABASE should skip 
this additional parameter while dumping.
   Let's say, bootstrap and first incremental done in B. So, this flag is false.
   Now, we trigger bootstrap dump from B->C and concurrently, there is a 
alterDb operation which logs an event with this flag as false.
   Now, when we bootstrap load in C, we set it there as true. During first 
incremental load in C, when we process AlterDb event, we set it to false, so 
that we open up compaction after this event and later events may not guarantee 
duplicate check. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-23 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259586952
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java
 ##
 @@ -135,7 +135,8 @@ private boolean isDbEmpty(String dbName) throws 
HiveException {
   }
 
   private Task alterDbTask(Database dbObj) {
-return alterDbTask(dbObj.getName(), updateDbProps(dbObj, 
context.dumpDirectory), context.hiveConf);
+return alterDbTask(dbObj.getName(), updateDbProps(dbObj, 
context.dumpDirectory, false),
 
 Review comment:
   Alter case also we need to set it. If it is bootstrapping on empty Db. May 
be we can remove the flag and just set it to true always.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-23 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259588001
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java
 ##
 @@ -61,6 +62,21 @@ public ReplCopyTask(){
 super();
   }
 
+  // If file is already present in base directory, then remove it from the 
list.
+  // Check  HIVE-21197 for more detail
+  private void updateSrcFileListForDupCopy(FileSystem dstFs, Path toPath, 
List srcFiles,
+   long writeId, int stmtId) throws 
IOException {
+ListIterator iter = srcFiles.listIterator();
+Path basePath = new Path(toPath, AcidUtils.baseOrDeltaSubdir(true, 
writeId, writeId, stmtId));
+while (iter.hasNext()) {
+  Path filePath = new Path(basePath, 
iter.next().getSourcePath().getName());
+  if (dstFs.exists(filePath)) {
 
 Review comment:
   Do we need to have a retry mechanism if it throws IOException due to network 
glitch?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-23 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259588476
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java
 ##
 @@ -370,6 +370,9 @@ private int executeIncrementalLoad(DriverContext 
driverContext) {
 
   // If incremental events are already applied, then check and perform if 
need to bootstrap any tables.
   if (!builder.hasMoreWork() && !work.getPathsToCopyIterator().hasNext()) {
+// No need to set incremental load pending flag for external tables as 
the files will be copied to the same path
 
 Review comment:
   Add a TODO here saying, this flag needs to be set for non-txn to txn 
migrated tables if they are bootstrapped along with incremental load (table 
level replication).
   With TODO, the current comment make sense like why it is not needed for 
external tables.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-23 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259586959
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java
 ##
 @@ -158,6 +159,15 @@ private boolean isDbEmpty(String dbName) throws 
HiveException {
 // Add the checkpoint key to the Database binding it to current dump 
directory.
 // So, if retry using same dump, we shall skip Database object update.
 parameters.put(ReplUtils.REPL_CHECKPOINT_KEY, dumpDirectory);
+
+if (needSetIncFlag) {
 
 Review comment:
   Need to set it in alter case too. Empty db bootstrap case.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-23 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259589263
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java
 ##
 @@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final 
FileSystem fs) {
   }
 };
   }
+
+  public static boolean isFirstIncDone(Map parameter) {
+if (parameter == null) {
+  return true;
+}
+String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG);
 
 Review comment:
   compFlag doesn't look relevant. Can change it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-22 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259260745
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java
 ##
 @@ -158,6 +159,15 @@ private boolean isDbEmpty(String dbName) throws 
HiveException {
 // Add the checkpoint key to the Database binding it to current dump 
directory.
 // So, if retry using same dump, we shall skip Database object update.
 parameters.put(ReplUtils.REPL_CHECKPOINT_KEY, dumpDirectory);
+
+if (needSetIncFlag) {
 
 Review comment:
   No need to check it. Instead just set needSetIncFlag directly using 
setBoolVar.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-22 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259253737
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/plan/ReplSetFirstIncLoadFlagDesc.java
 ##
 @@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.plan;
+import org.apache.hadoop.hive.ql.plan.Explain.Level;
+
+import java.io.Serializable;
+
+/**
+ * ReplSetFirstIncLoadFlagDesc.
+ *
+ */
+@Explain(displayName = "Set First Incr Load Flag", explainLevels = { 
Level.USER, Level.DEFAULT, Level.EXTENDED })
+public class ReplSetFirstIncLoadFlagDesc extends DDLDesc implements 
Serializable {
+
+  private static final long serialVersionUID = 1L;
+  String databaseName;
+  String tableName;
+
+  /**
+   * For serialization only.
+   */
+  public ReplSetFirstIncLoadFlagDesc() {
+  }
+
+  public ReplSetFirstIncLoadFlagDesc(String databaseName, String tableName) {
 
 Review comment:
   Why is it not taking extra boolean argument for the flag?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-22 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259252494
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1136,6 +1133,7 @@ private static void createReplImportTasks(
 
 Task dropTblTask = null;
 WriteEntity.WriteType lockType = WriteEntity.WriteType.DDL_NO_LOCK;
+Boolean firstIncDone;
 
 Review comment:
   Shall use primitive boolean.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-22 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259261832
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java
 ##
 @@ -60,6 +60,7 @@
 
   public static final String LAST_REPL_ID_KEY = "hive.repl.last.repl.id";
   public static final String REPL_CHECKPOINT_KEY = "hive.repl.ckpt.key";
+  public static final String REPL_FIRST_INC_PENDING_FLAG = 
"hive.repl.first.inc.pending";
 
 Review comment:
   Can we use generic name like "hive.repl.is.data.consistent"? This would 
avoid confusion for incremental bootstrap case of table level replication where 
few tables data will be consistent only after next incremental load.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-22 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259252954
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
 ##
 @@ -1147,6 +1145,12 @@ private static void createReplImportTasks(
   if (!waitOnPrecursor){
 throw new 
SemanticException(ErrorMsg.DATABASE_NOT_EXISTS.getMsg(tblDesc.getDatabaseName()));
   }
+  // For warehouse level replication, if the database itself is getting 
created in this load, then no need to
+  // check for duplicate copy. Check HIVE-21197 for more detail.
+  firstIncDone = true;
 
 Review comment:
   Shall use ternary operator.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-22 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259232906
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
 ##
 @@ -661,6 +663,10 @@ public int execute(DriverContext driverContext) {
   if (work.getAlterMaterializedViewDesc() != null) {
 return alterMaterializedView(db, work.getAlterMaterializedViewDesc());
   }
+
+  if (work.getReplSetFirstIncLoadFlagDesc() != null) {
 
 Review comment:
   Why don't we use alterTableDesc or alterDatabaseDesc instead of new type? We 
are just editing a db/table/partition property.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-22 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259233648
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java
 ##
 @@ -61,6 +62,25 @@ public ReplCopyTask(){
 super();
   }
 
+  // If file is already present in base directory, then remove it from the 
list.
+  // Check  HIVE-21197 for more detail
+  private void updateSrcFileListForDupCopy(FileSystem dstFs, Path toPath, 
List srcFiles,
+   long writeId, int stmtId) throws 
IOException {
+ListIterator iter = srcFiles.listIterator();
+Path basePath = new Path(toPath, AcidUtils.baseOrDeltaSubdir(true, 
writeId, writeId, stmtId));
+while (iter.hasNext()) {
+  Path filePath = new Path(basePath, 
iter.next().getSourcePath().getName());
+  try {
+if (dstFs.getFileStatus(filePath) != null) {
 
 Review comment:
   Shall use fs.exists(path) method instead of getFileStatus. Then, need not 
handle exception as well.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-22 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259254156
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/MetaStoreCompactorThread.java
 ##
 @@ -71,6 +74,20 @@ public void init(AtomicBoolean stop, AtomicBoolean looped) 
throws Exception {
 }
   }
 
+  @Override boolean replIsCompactionDisabledForDatabase(String dbName) throws 
TException {
+try {
+  Database database = rs.getDatabase(getDefaultCatalog(conf), dbName);
+  if (database != null) {
+return !ReplUtils.isFirstIncDone(database.getParameters());
+  }
+  LOG.info("Unable to find database " + dbName);
 
 Review comment:
   Why 2 different flows for No database case?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-22 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259261039
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java
 ##
 @@ -164,6 +165,12 @@ public IncrementalLoadTasksBuilder(String dbName, String 
tableName, String loadP
   lastEventid);
 }
   }
+
+  ReplSetFirstIncLoadFlagDesc desc = new 
ReplSetFirstIncLoadFlagDesc(dbName, tableName);
 
 Review comment:
   Can we combine this with update last replID alterDb task?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-22 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259228579
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationMisc.java
 ##
 @@ -0,0 +1,233 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.parse;
+
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore;
+import 
org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection;
+import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId;
+import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils;
+import org.apache.hadoop.hive.shims.Utils;
+import org.junit.*;
+import org.junit.rules.TestName;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.annotation.Nullable;
+import java.io.IOException;
+import java.util.*;
+
+import static 
org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable;
+import static org.junit.Assert.*;
+
+/**
+ * TestReplicationWithTableMigration - test replication for Hive2 to Hive3 
(Strict managed tables)
+ */
+public class TestReplicationWithTableMigrationMisc {
 
 Review comment:
   The name can be more meaningful. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-22 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259234204
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java
 ##
 @@ -133,8 +161,11 @@ protected int execute(DriverContext driverContext) {
 return 6;
   }
   long writeId = Long.parseLong(writeIdString);
-  toPath = new Path(toPath, 
AcidUtils.baseOrDeltaSubdir(work.getDeleteDestIfExist(), writeId, writeId,
-  
driverContext.getCtx().getHiveTxnManager().getStmtIdAndIncrement()));
+  // Set stmt id 0 for bootstrap load as the directory needs to be 
searched during incremental load to avoid any
+  // duplicate copy from the source. Check HIVE-21197 for more detail.
+  int stmtId = (writeId == 
ReplUtils.REPL_BOOTSTRAP_MIGRATION_BASE_WRITE_ID) ? 0 :
 
 Review comment:
   No need to have this check as it is migration flow. Set it to 0 always.
   Also, it is better to add a final variable in ReplUtils for this instead of 
hardcoding it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-22 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259237893
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java
 ##
 @@ -112,6 +118,12 @@ public void run() {
 continue;
   }
 
+  if (replIsCompactionDisabledForTable(t)) {
 
 Review comment:
   I think, we can disable compaction in DB level only. Even table level 
replication replicates it DB level with list of tables. It will be simple and 
easy to maintain.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-22 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259230730
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationMisc.java
 ##
 @@ -0,0 +1,233 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.parse;
+
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore;
+import 
org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection;
+import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId;
+import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils;
+import org.apache.hadoop.hive.shims.Utils;
+import org.junit.*;
+import org.junit.rules.TestName;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.annotation.Nullable;
+import java.io.IOException;
+import java.util.*;
+
+import static 
org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable;
+import static org.junit.Assert.*;
+
+/**
+ * TestReplicationWithTableMigration - test replication for Hive2 to Hive3 
(Strict managed tables)
+ */
+public class TestReplicationWithTableMigrationMisc {
+  @Rule
+  public final TestName testName = new TestName();
+
+  protected static final Logger LOG = 
LoggerFactory.getLogger(TestReplicationWithTableMigrationMisc.class);
+  private static WarehouseInstance primary, replica;
+  private String primaryDbName, replicatedDbName;
+
+  @BeforeClass
+  public static void classLevelSetup() throws Exception {
+HashMap overrideProperties = new HashMap<>();
+internalBeforeClassSetup(overrideProperties);
+  }
+
+  static void internalBeforeClassSetup(Map overrideConfigs) 
throws Exception {
+HiveConf conf = new HiveConf(TestReplicationWithTableMigrationMisc.class);
+conf.set("dfs.client.use.datanode.hostname", "true");
+conf.set("hadoop.proxyuser." + Utils.getUGI().getShortUserName() + 
".hosts", "*");
+MiniDFSCluster miniDFSCluster =
+new MiniDFSCluster.Builder(conf).numDataNodes(1).format(true).build();
+final DistributedFileSystem fs = miniDFSCluster.getFileSystem();
+HashMap hiveConfigs = new HashMap() {{
+  put("fs.defaultFS", fs.getUri().toString());
+  put("hive.support.concurrency", "true");
+  put("hive.txn.manager", 
"org.apache.hadoop.hive.ql.lockmgr.DbTxnManager");
+  put("hive.metastore.client.capability.check", "false");
+  put("hive.repl.bootstrap.dump.open.txn.timeout", "1s");
+  put("hive.exec.dynamic.partition.mode", "nonstrict");
+  put("hive.strict.checks.bucketing", "false");
+  put("hive.mapred.mode", "nonstrict");
+  put("mapred.input.dir.recursive", "true");
+  put("hive.metastore.disallow.incompatible.col.type.changes", "false");
+  put("hive.strict.managed.tables", "true");
+  put("hive.metastore.transactional.event.listeners", "");
+}};
+replica = new WarehouseInstance(LOG, miniDFSCluster, hiveConfigs);
+
+HashMap configsForPrimary = new HashMap() 
{{
+  put("fs.defaultFS", fs.getUri().toString());
+  put("hive.metastore.client.capability.check", "false");
+  put("hive.repl.bootstrap.dump.open.txn.timeout", "1s");
+  put("hive.exec.dynamic.partition.mode", "nonstrict");
+  put("hive.strict.checks.bucketing", "false");
+  put("hive.mapred.mode", "nonstrict");
+  put("mapred.input.dir.recursive", "true");
+  put("hive.metastore.disallow.incompatible.col.type.changes", "false");
+  put("hive.support.concurrency", "false");
+  put("hive.txn.manager", 
"org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager");
+  put("hive.strict.managed.tables", "false");
+}};
+configsForPrimary.putAll(overrideConfigs);
+primary = new 

[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-22 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259235570
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java
 ##
 @@ -271,12 +302,13 @@ public String getName() {
 LOG.debug("ReplCopyTask:getLoadCopyTask: {}=>{}", srcPath, dstPath);
 if ((replicationSpec != null) && replicationSpec.isInReplicationScope()){
   ReplCopyWork rcwork = new ReplCopyWork(srcPath, dstPath, false);
-  if (replicationSpec.isReplace() && 
conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION)) {
+  if (replicationSpec.isReplace() && 
(conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION) || copyToMigratedTxnTable)) {
 rcwork.setDeleteDestIfExist(true);
 rcwork.setAutoPurge(isAutoPurge);
 rcwork.setNeedRecycle(needRecycle);
   }
   rcwork.setCopyToMigratedTxnTable(copyToMigratedTxnTable);
+  rcwork.setCheckDuplicateCopy(replicationSpec.needDupCopyCheck());
 
 Review comment:
   needDupCopyCheck seems to be a generic flag but we handle only for migration 
case. Instead, shall change it to isFirstIncAfterBootstrap which makes it clear 
that for some flow, need to handle this case differently.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-22 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259230584
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationMisc.java
 ##
 @@ -0,0 +1,233 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.parse;
+
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore;
+import 
org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection;
+import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId;
+import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils;
+import org.apache.hadoop.hive.shims.Utils;
+import org.junit.*;
+import org.junit.rules.TestName;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.annotation.Nullable;
+import java.io.IOException;
+import java.util.*;
 
 Review comment:
   Shall avoid importing *.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-22 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259254333
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/RemoteCompactorThread.java
 ##
 @@ -53,6 +56,20 @@ public void init(AtomicBoolean stop, AtomicBoolean looped) 
throws Exception {
 }
   }
 
+  @Override boolean replIsCompactionDisabledForDatabase(String dbName) throws 
TException {
 
 Review comment:
   Duplicate method. Can we have a static method in MetastoreUtils that takes 
MSC as input?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-22 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259227325
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosAcrossInstances.java
 ##
 @@ -610,6 +610,11 @@ public void testIncrementalDumpOfWarehouse() throws 
Throwable {
 .run("show tables")
 .verifyResults(new String[] { "t1" });
 
+//default db is already created at replica before load. So the incr flag 
should not be set.
+
assertTrue(ReplUtils.isFirstIncDone(replica.getDatabase("default").getParameters()));
+
assertTrue(!ReplUtils.isFirstIncDone(replica.getDatabase(primaryDbName).getParameters()));
 
 Review comment:
   Shall use assertFalse method to be clear. Check for other occurrences too.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled

2019-02-22 Thread GitBox
sankarh commented on a change in pull request #541: HIVE-21197 : Hive 
Replication can add duplicate data during migration to a target with 
hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259236827
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java
 ##
 @@ -187,4 +188,12 @@ public static PathFilter getEventsDirectoryFilter(final 
FileSystem fs) {
   }
 };
   }
+
+  public static boolean isFirstIncDone(Map parameter) {
+if (parameter == null) {
+  return true;
+}
+String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG);
 
 Review comment:
   Shall use getBoolVar to avoid string comparison.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services