[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r260131931

File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java

@@ -5225,7 +5226,8 @@ private int updateFirstIncPendingFlag(Hive hive, ReplSetFirstIncLoadFlagDesc des
     for (String dbName : Utils.matchesDb(hive, dbNameOrPattern)) {
       Database database = hive.getMSC().getDatabase(dbName);
       parameters = database.getParameters();
-      if (ReplUtils.isFirstIncPending(parameters)) {
+      String incPendPara = parameters != null ? parameters.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG) : null;
+      if (incPendPara != null && (!flag.equalsIgnoreCase(incPendPara))) {

Review comment: Same as above.

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r260131896

File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java

@@ -5216,7 +5216,8 @@ private int updateFirstIncPendingFlag(Hive hive, ReplSetFirstIncLoadFlagDesc des
     for (String tableName : Utils.matchesTbl(hive, dbNameOrPattern, tableNameOrPattern)) {
       org.apache.hadoop.hive.metastore.api.Table tbl = hive.getMSC().getTable(dbNameOrPattern, tableName);
       parameters = tbl.getParameters();
-      if (ReplUtils.isFirstIncPending(parameters)) {
+      String incPendPara = parameters != null ? parameters.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG) : null;
+      if (incPendPara != null && (!flag.equalsIgnoreCase(incPendPara))) {

Review comment: If incPendPara is null, the flag also needs to be set; as written, that case is skipped.
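The reviewer's point can be sketched in isolation as follows. This is a minimal, hypothetical rewrite of the condition, not the patch itself: the class and method names are invented for illustration, and the value of the property key is an assumption (only the constant name REPL_FIRST_INC_PENDING_FLAG comes from the patch).

```java
import java.util.Map;

public class FirstIncFlagCheck {
    // Stand-in for ReplUtils.REPL_FIRST_INC_PENDING_FLAG; the actual key
    // string is an assumption here.
    static final String REPL_FIRST_INC_PENDING_FLAG = "hive.repl.first.incremental.pending";

    // Returns true when the flag must be (re)written: either the parameter is
    // missing (the case the reviewer says the patched condition skips) or its
    // current value differs from the desired one.
    static boolean shouldSetFlag(Map<String, String> parameters, String flag) {
        String incPendPara = parameters != null
                ? parameters.get(REPL_FIRST_INC_PENDING_FLAG) : null;
        return incPendPara == null || !flag.equalsIgnoreCase(incPendPara);
    }
}
```

The difference from the patched condition is only the `incPendPara == null` branch, which turns "skip when absent" into "set when absent".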
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r260131319

File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java

@@ -1164,6 +1168,9 @@ private static void createReplImportTasks(
     if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) {
       dropTblTask = dropTableTask(table, x, replicationSpec);
       table = null;
+    } else if (!firstIncPending) {
+      // For table level replication, get the flag from table parameter. Check HIVE-21197 for more detail.
+      firstIncPending = ReplUtils.isFirstIncPending(table.getParameters());

Review comment: OK
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r260131264

File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java

@@ -426,4 +427,14 @@ public static void copyLastReplId(Map srcParameter, Map
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r260131280

File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ReplSetFirstIncLoadFlagDesc.java

@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.plan;
+import org.apache.hadoop.hive.ql.plan.Explain.Level;
+
+import java.io.Serializable;
+
+/**
+ * ReplSetFirstIncLoadFlagDesc.
+ *
+ */
+@Explain(displayName = "Set First Incr Load Flag", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED })
+public class ReplSetFirstIncLoadFlagDesc extends DDLDesc implements Serializable {
+
+  private static final long serialVersionUID = 1L;
+  String databaseName;
+  String tableName;
+  boolean incLoadPendingFlag;
+
+  /**
+   * For serialization only.
+   */
+  public ReplSetFirstIncLoadFlagDesc() {
+  }
+
+  public ReplSetFirstIncLoadFlagDesc(String databaseName, String tableName, boolean incLoadPendingFlag) {
+    super();
+    this.databaseName = databaseName;
+    this.tableName = tableName;
+    this.incLoadPendingFlag = incLoadPendingFlag;
+  }
+
+  @Explain(displayName="db_name", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED })
+  public String getDatabaseName() {
+    return databaseName;
+  }
+
+  public void setDatabaseName(String databaseName) {
+    this.databaseName = databaseName;
+  }
+
+  @Explain(displayName="table_name", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED })
+  public String getTableName() {
+    return tableName;
+  }
+
+  public void setTableName(String tableName) {
+    this.tableName = tableName;
+  }
+
+  @Explain(displayName="inc load pending flag", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED })

Review comment: OK
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259793406

File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java

@@ -274,6 +283,19 @@ Long bootStrapDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive hiveDb) th
     for (String tblName : Utils.matchesTbl(hiveDb, dbName, work.tableNameOrPattern)) {
       LOG.debug(
           "analyzeReplDump dumping table: " + tblName + " to db root " + dbRoot.toUri());
+      Table table;
+      try {
+        table = hiveDb.getTable(dbName, tblName);

Review comment: Yes, agreed.
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776670

File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java

@@ -187,4 +188,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) {
     }
   };
 }
+
+public static boolean isFirstIncDone(Map parameter) {
+  if (parameter == null) {
+    return true;
+  }
+  String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG);

Review comment: OK
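Read on its own, the helper under review implements a "done unless explicitly pending" default: a missing parameter map means nothing was ever marked pending. A self-contained sketch follows; the completion rule for a non-null map is an assumption, since the hunk is cut off before the return statement, and the key string is likewise assumed.

```java
import java.util.Map;

public class ReplFlagUtil {
    // Assumed value; the patch only names the constant REPL_FIRST_INC_PENDING_FLAG.
    static final String REPL_FIRST_INC_PENDING_FLAG = "hive.repl.first.incremental.pending";

    // No parameter map at all means nothing was ever marked pending, so the
    // first incremental load is treated as done.
    static boolean isFirstIncDone(Map<String, String> parameters) {
        if (parameters == null) {
            return true;
        }
        String compFlag = parameters.get(REPL_FIRST_INC_PENDING_FLAG);
        // Assumed rule: done unless the flag is explicitly set to "true".
        return compFlag == null || !compFlag.equalsIgnoreCase("true");
    }
}
```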
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776841

File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/MetaStoreCompactorThread.java

@@ -71,6 +74,20 @@ public void init(AtomicBoolean stop, AtomicBoolean looped) throws Exception {
     }
   }
+
+  @Override boolean replIsCompactionDisabledForDatabase(String dbName) throws TException {
+    try {
+      Database database = rs.getDatabase(getDefaultCatalog(conf), dbName);
+      if (database != null) {
+        return !ReplUtils.isFirstIncDone(database.getParameters());
+      }
+      LOG.info("Unable to find database " + dbName);

Review comment: OK
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776724

File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java

@@ -112,6 +118,12 @@ public void run() {
         continue;
       }
+
+      if (replIsCompactionDisabledForTable(t)) {

Review comment: OK
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776789

File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java

@@ -1147,6 +1145,12 @@ private static void createReplImportTasks(
     if (!waitOnPrecursor){
       throw new SemanticException(ErrorMsg.DATABASE_NOT_EXISTS.getMsg(tblDesc.getDatabaseName()));
     }
+    // For warehouse level replication, if the database itself is getting created in this load, then no need to
+    // check for duplicate copy. Check HIVE-21197 for more detail.
+    firstIncDone = true;

Review comment: OK
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776621

File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java

@@ -271,12 +302,13 @@ public String getName() {
     LOG.debug("ReplCopyTask:getLoadCopyTask: {}=>{}", srcPath, dstPath);
     if ((replicationSpec != null) && replicationSpec.isInReplicationScope()){
       ReplCopyWork rcwork = new ReplCopyWork(srcPath, dstPath, false);
-      if (replicationSpec.isReplace() && conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION)) {
+      if (replicationSpec.isReplace() && (conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION) || copyToMigratedTxnTable)) {
         rcwork.setDeleteDestIfExist(true);
         rcwork.setAutoPurge(isAutoPurge);
         rcwork.setNeedRecycle(needRecycle);
       }
       rcwork.setCopyToMigratedTxnTable(copyToMigratedTxnTable);
+      rcwork.setCheckDuplicateCopy(replicationSpec.needDupCopyCheck());

Review comment: OK
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776865

File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/RemoteCompactorThread.java

@@ -53,6 +56,20 @@ public void init(AtomicBoolean stop, AtomicBoolean looped) throws Exception {
     }
   }
+
+  @Override boolean replIsCompactionDisabledForDatabase(String dbName) throws TException {

Review comment: OK
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776541

File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java

@@ -661,6 +663,10 @@ public int execute(DriverContext driverContext) {
     if (work.getAlterMaterializedViewDesc() != null) {
       return alterMaterializedView(db, work.getAlterMaterializedViewDesc());
     }
+
+    if (work.getReplSetFirstIncLoadFlagDesc() != null) {

Review comment: OK
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259776510

File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationMisc.java

@@ -0,0 +1,233 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.parse;
+
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection;
+import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId;
+import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils;
+import org.apache.hadoop.hive.shims.Utils;
+import org.junit.*;
+import org.junit.rules.TestName;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.annotation.Nullable;
+import java.io.IOException;
+import java.util.*;

Review comment: OK
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259775595

File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java

@@ -112,6 +118,12 @@ public void run() {
         continue;
      }
+
+      if (replIsCompactionDisabledForTable(t)) {

Review comment: The auto_compact flag remains set forever unless someone changes it, so when it is false the additional check can be avoided. The repl first-inc-pending flag, however, is false most of the time, so we end up performing both checks every time.
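The ordering argument above is about short-circuiting: evaluate the long-lived, occasionally-decisive property first so the usually-uninformative lookup only runs when it can matter. A hypothetical illustration follows; the property name, method names, and the predicate parameter are all invented for the sketch and are not Hive's actual API.

```java
import java.util.Map;
import java.util.function.Predicate;

public class CompactionGate {
    // Check the stable NO_AUTO_COMPACTION-style table property first; when it
    // disables compaction, the replication-flag lookup is skipped entirely.
    static boolean shouldCompact(Map<String, String> tblParams,
                                 Predicate<Map<String, String>> replFirstIncPending) {
        if ("true".equalsIgnoreCase(tblParams.get("no_auto_compaction"))) {
            return false; // cheap, long-lived check short-circuits
        }
        // The repl first-inc-pending flag is usually false, so in the common
        // case both checks run, which is the reviewer's point.
        return !replFirstIncPending.test(tblParams);
    }
}
```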
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259774866

File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ReplSetFirstIncLoadFlagDesc.java

@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.plan;
+import org.apache.hadoop.hive.ql.plan.Explain.Level;
+
+import java.io.Serializable;
+
+/**
+ * ReplSetFirstIncLoadFlagDesc.
+ *
+ */
+@Explain(displayName = "Set First Incr Load Flag", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED })
+public class ReplSetFirstIncLoadFlagDesc extends DDLDesc implements Serializable {
+
+  private static final long serialVersionUID = 1L;
+  String databaseName;
+  String tableName;
+  boolean incLoadPendingFlag;
+
+  /**
+   * For serialization only.
+   */
+  public ReplSetFirstIncLoadFlagDesc() {
+  }
+
+  public ReplSetFirstIncLoadFlagDesc(String databaseName, String tableName, boolean incLoadPendingFlag) {
+    super();
+    this.databaseName = databaseName;
+    this.tableName = tableName;
+    this.incLoadPendingFlag = incLoadPendingFlag;
+  }
+
+  @Explain(displayName="db_name", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED })
+  public String getDatabaseName() {
+    return databaseName;
+  }
+
+  public void setDatabaseName(String databaseName) {
+    this.databaseName = databaseName;
+  }
+
+  @Explain(displayName="table_name", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED })
+  public String getTableName() {
+    return tableName;
+  }
+
+  public void setTableName(String tableName) {
+    this.tableName = tableName;
+  }
+
+  @Explain(displayName="inc load pending flag", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED })

Review comment: We can keep it, but it is redundant for now because EXPLAIN REPL LOAD just shows ReplLoadTask and nothing else.
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259774454

File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java

@@ -426,4 +427,14 @@ public static void copyLastReplId(Map srcParameter, Map
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259773951

File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java

@@ -187,4 +192,14 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) {
     }
   };
 }
+
+public static boolean isFirstIncPending(Map parameter) {

Review comment: OK. Then change it to "parameters", with an "s". :)
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259773642

File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java

@@ -274,6 +283,19 @@ Long bootStrapDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive hiveDb) th
     for (String tblName : Utils.matchesTbl(hiveDb, dbName, work.tableNameOrPattern)) {
       LOG.debug(
           "analyzeReplDump dumping table: " + tblName + " to db root " + dbRoot.toUri());
+      Table table;
+      try {
+        table = hiveDb.getTable(dbName, tblName);

Review comment: We get the table from the Metastore, so it impacts performance.
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259773205

File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationEx.java

@@ -0,0 +1,273 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.parse;
+
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection;
+import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId;
+import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils;
+import org.apache.hadoop.hive.shims.Utils;
+import org.junit.*;
+import org.junit.rules.TestName;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.*;
+
+import static org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable;
+import static org.junit.Assert.*;
+
+/**
+ * TestReplicationWithTableMigrationEx - test replication for Hive2 to Hive3 (Strict managed tables)
+ */
+public class TestReplicationWithTableMigrationEx {
+  @Rule
+  public final TestName testName = new TestName();
+
+  protected static final Logger LOG = LoggerFactory.getLogger(TestReplicationWithTableMigrationEx.class);
+  private static WarehouseInstance primary, replica;
+  private String primaryDbName, replicatedDbName;
+
+  @BeforeClass
+  public static void classLevelSetup() throws Exception {
+    HashMap overrideProperties = new HashMap<>();
+    internalBeforeClassSetup(overrideProperties);
+  }
+
+  static void internalBeforeClassSetup(Map overrideConfigs) throws Exception {
+    HiveConf conf = new HiveConf(TestReplicationWithTableMigrationEx.class);
+    conf.set("dfs.client.use.datanode.hostname", "true");
+    conf.set("hadoop.proxyuser." + Utils.getUGI().getShortUserName() + ".hosts", "*");
+    MiniDFSCluster miniDFSCluster =
+        new MiniDFSCluster.Builder(conf).numDataNodes(1).format(true).build();
+    final DistributedFileSystem fs = miniDFSCluster.getFileSystem();
+    HashMap hiveConfigs = new HashMap() {{
+      put("fs.defaultFS", fs.getUri().toString());
+      put("hive.support.concurrency", "true");
+      put("hive.txn.manager", "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager");
+      put("hive.metastore.client.capability.check", "false");
+      put("hive.repl.bootstrap.dump.open.txn.timeout", "1s");
+      put("hive.exec.dynamic.partition.mode", "nonstrict");
+      put("hive.strict.checks.bucketing", "false");
+      put("hive.mapred.mode", "nonstrict");
+      put("mapred.input.dir.recursive", "true");
+      put("hive.metastore.disallow.incompatible.col.type.changes", "false");
+      put("hive.strict.managed.tables", "true");
+      put("hive.metastore.transactional.event.listeners", "");
+    }};
+    replica = new WarehouseInstance(LOG, miniDFSCluster, hiveConfigs);
+
+    HashMap configsForPrimary = new HashMap() {{
+      put("fs.defaultFS", fs.getUri().toString());
+      put("hive.metastore.client.capability.check", "false");
+      put("hive.repl.bootstrap.dump.open.txn.timeout", "1s");
+      put("hive.exec.dynamic.partition.mode", "nonstrict");
+      put("hive.strict.checks.bucketing", "false");
+      put("hive.mapred.mode", "nonstrict");
+      put("mapred.input.dir.recursive", "true");
+      put("hive.metastore.disallow.incompatible.col.type.changes", "false");
+      put("hive.support.concurrency", "false");
+      put("hive.txn.manager", "org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager");
+      put("hive.strict.managed.tables", "false");
+    }};
+    configsForPrimary.putAll(overrideConfigs);
+    primary = new WarehouseInstance(LOG, miniDFSCluster,
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259773047

File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java

@@ -1164,6 +1168,9 @@ private static void createReplImportTasks(
     if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) {
       dropTblTask = dropTableTask(table, x, replicationSpec);
       table = null;
+    } else if (!firstIncPending) {
+      // For table level replication, get the flag from table parameter. Check HIVE-21197 for more detail.
+      firstIncPending = ReplUtils.isFirstIncPending(table.getParameters());

Review comment: It is allowed today; we need not bootstrap for table-level replication if the create table event is part of the incremental dump.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259772504

## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java

@@ -1147,6 +1145,12 @@ private static void createReplImportTasks(
       if (!waitOnPrecursor){
         throw new SemanticException(ErrorMsg.DATABASE_NOT_EXISTS.getMsg(tblDesc.getDatabaseName()));
       }
+      // For warehouse level replication, if the database itself is getting created in this load, then no need to
+      // check for duplicate copy. Check HIVE-21197 for more detail.
+      firstIncPending = false;
+    } else {
+      // For database replication, get the flag from database parameter. Check HIVE-21197 for more detail.
+      firstIncPending = ReplUtils.isFirstIncPending(parentDb.getParameters());

Review comment: Yes, functionally it is right. But it is not very readable, as we assume this flag from the DB is false for table-level replication. Please add a comment.
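The resolution order being discussed (database created in this load, database parameters, table parameters) can be sketched outside Hive as follows. This is a minimal illustration only: the class and the resolveFirstIncPending helper are invented for the sketch, and only the property name "hive.repl.first.inc.pending" comes from the PR itself.

```java
import java.util.Map;

// Hypothetical sketch of the flag-resolution order discussed above; not Hive code.
public class FirstIncPendingResolution {
  static final String FLAG = "hive.repl.first.inc.pending";

  // Pending only when the property is present and not empty/"false".
  static boolean isFirstIncPending(Map<String, String> params) {
    if (params == null) {
      return false;
    }
    String v = params.get(FLAG);
    return v != null && !v.isEmpty() && !"false".equalsIgnoreCase(v);
  }

  // A DB created in this load cannot contain duplicate data, so nothing is
  // pending; otherwise the DB parameters decide for database-level replication,
  // with the table parameters as the table-level fallback.
  static boolean resolveFirstIncPending(Map<String, String> dbParams,
                                        Map<String, String> tblParams,
                                        boolean dbCreatedInThisLoad) {
    if (dbCreatedInThisLoad) {
      return false;
    }
    return isFirstIncPending(dbParams) || isFirstIncPending(tblParams);
  }
}
```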
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259772546

## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java

@@ -1164,6 +1168,9 @@ private static void createReplImportTasks(
     if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) {
       dropTblTask = dropTableTask(table, x, replicationSpec);
       table = null;
+    } else if (!firstIncPending) {
+      // For table level replication, get the flag from table parameter. Check HIVE-21197 for more detail.
+      firstIncPending = ReplUtils.isFirstIncPending(table.getParameters());

Review comment: Please add a comment.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259771726

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java

@@ -48,13 +48,15 @@
   private final DatabaseEvent event;
   private final String dbNameToLoadIn;
+  private final boolean isTableLevelLoad;

-  public LoadDatabase(Context context, DatabaseEvent event, String dbNameToLoadIn,
+  public LoadDatabase(Context context, DatabaseEvent event, String dbNameToLoadIn, String tblNameToLoadIn,

Review comment: I think this fix is partial; ideally, table-level replication shouldn't enter LoadDatabase. Please create a bug for this and leave it.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259771264

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java

@@ -289,12 +296,21 @@ private boolean shouldReplayEvent(FileStatus dir, DumpType dumpType, String dbNa
     return updateReplIdTask;
   }

-  private Task dbUpdateReplStateTask(String dbName, String replState,
+  private Task dbUpdateReplStateTask(String dbName, String replState, String incLoadPendFlag,
                                      Task preCursor) {
     HashMap mapProp = new HashMap<>();
-    mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState);
-    AlterDatabaseDesc alterDbDesc = new AlterDatabaseDesc(dbName, mapProp, new ReplicationSpec(replState, replState));
+    // if the update is for incLoadPendFlag, then send replicationSpec as null to avoid replacement check.
+    ReplicationSpec replicationSpec = null;
+    if (incLoadPendFlag == null) {
+      mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState);
+      replicationSpec = new ReplicationSpec(replState, replState);
+    } else {
+      assert replState == null;
+      mapProp.put(ReplUtils.REPL_FIRST_INC_PENDING_FLAG, incLoadPendFlag);

Review comment: But this issue is there even if the inc pending flag is set to false. Please recheck and comment.
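The two mutually exclusive modes in the hunk above (update the last-replicated-state id, or set the first-inc-pending flag with a null ReplicationSpec so the replacement check is skipped) can be sketched as a pure property-map builder. The class, the buildAlterDbProps helper, and the literal key strings here are illustrative stand-ins, not the actual Hive identifiers; only "hive.repl.first.inc.pending" is taken from the PR.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the two modes of dbUpdateReplStateTask; not Hive code.
public class AlterDbPropsSketch {
  static final String CURR_STATE_ID = "repl.last.id";                      // stand-in key
  static final String REPL_FIRST_INC_PENDING_FLAG = "hive.repl.first.inc.pending";

  static Map<String, String> buildAlterDbProps(String replState, String incLoadPendFlag) {
    Map<String, String> mapProp = new HashMap<>();
    if (incLoadPendFlag == null) {
      mapProp.put(CURR_STATE_ID, replState);            // normal repl-state update
    } else {
      assert replState == null;                         // the two modes are exclusive
      mapProp.put(REPL_FIRST_INC_PENDING_FLAG, incLoadPendFlag);
    }
    return mapProp;
  }
}
```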
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770576

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java

@@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) {
       }
     };
   }
+
+  public static boolean isFirstIncDone(Map parameter) {
+    if (parameter == null) {
+      return true;
+    }
+    String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG);
+    return compFlag == null || compFlag.isEmpty() || "false".equalsIgnoreCase(compFlag);

Review comment: OK
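Standalone, the helper in this hunk behaves like the sketch below: the first incremental load counts as done unless the flag property is present and truthy. The semantics and property name are taken from the diff; only the wrapper class name is invented.

```java
import java.util.Map;

// Illustrative standalone mirror of the isFirstIncDone helper from the hunk above.
public class ReplFlagSketch {
  static final String REPL_FIRST_INC_PENDING_FLAG = "hive.repl.first.inc.pending";

  public static boolean isFirstIncDone(Map<String, String> parameter) {
    if (parameter == null) {
      return true;   // no parameters at all: nothing is marked pending
    }
    String compFlag = parameter.get(REPL_FIRST_INC_PENDING_FLAG);
    // absent, empty, or an explicit "false" all mean the first incremental load is done
    return compFlag == null || compFlag.isEmpty() || "false".equalsIgnoreCase(compFlag);
  }
}
```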
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770626

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java

@@ -289,12 +296,21 @@ private boolean shouldReplayEvent(FileStatus dir, DumpType dumpType, String dbNa
     return updateReplIdTask;
   }

-  private Task dbUpdateReplStateTask(String dbName, String replState,
+  private Task dbUpdateReplStateTask(String dbName, String replState, String incLoadPendFlag,

Review comment: OK
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770334

## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCompactor.java

@@ -1536,6 +1536,62 @@ public void testCompactionInfoHashCode() {
     Assert.assertEquals("The hash codes must be equal", compactionInfo.hashCode(), compactionInfo1.hashCode());
   }

+  @Test
+  public void testDisableCompactionDuringReplLoad() throws Exception {
+    String tblName = "discomp";
+    String database = "discomp_db";
+    executeStatementOnDriver("drop database if exists " + database + " cascade", driver);
+    executeStatementOnDriver("create database " + database, driver);
+    executeStatementOnDriver("CREATE TABLE " + database + "." + tblName + "(a INT, b STRING) " +
+        " PARTITIONED BY(ds string)" +
+        " CLUSTERED BY(a) INTO 2 BUCKETS" + //currently ACID requires table to be bucketed
+        " STORED AS ORC TBLPROPERTIES ('transactional'='true')", driver);
+    executeStatementOnDriver("insert into " + database + "." + tblName + " partition (ds) values (1, 'fred', " +
+        "'today'), (2, 'wilma', 'yesterday')", driver);
+
+    executeStatementOnDriver("ALTER TABLE " + database + "." + tblName +
+        " SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'true')", driver);
+    List compacts = getCompactionList();
+    Assert.assertEquals(0, compacts.size());
+
+    executeStatementOnDriver("alter database " + database +
+        " set dbproperties ('hive.repl.first.inc.pending' = 'true')", driver);
+    executeStatementOnDriver("ALTER TABLE " + database + "." + tblName

Review comment: OK
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770434

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java

@@ -370,6 +370,9 @@ private int executeIncrementalLoad(DriverContext driverContext) {
     // If incremental events are already applied, then check and perform if need to bootstrap any tables.
     if (!builder.hasMoreWork() && !work.getPathsToCopyIterator().hasNext()) {
+      // No need to set incremental load pending flag for external tables as the files will be copied to the same path

Review comment: OK
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770370

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java

@@ -61,6 +62,21 @@ public ReplCopyTask(){
     super();
   }

+  // If file is already present in base directory, then remove it from the list.
+  // Check HIVE-21197 for more detail
+  private void updateSrcFileListForDupCopy(FileSystem dstFs, Path toPath, List srcFiles,
+                                           long writeId, int stmtId) throws IOException {
+    ListIterator iter = srcFiles.listIterator();
+    Path basePath = new Path(toPath, AcidUtils.baseOrDeltaSubdir(true, writeId, writeId, stmtId));
+    while (iter.hasNext()) {
+      Path filePath = new Path(basePath, iter.next().getSourcePath().getName());
+      if (dstFs.exists(filePath)) {

Review comment: OK
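The dedup idea in updateSrcFileListForDupCopy can be sketched without Hive or HDFS: before copying, drop source files whose names already exist under the target base/delta directory, so a re-applied load does not duplicate data. In this sketch a Set of names stands in for the dstFs.exists() probes; the class and method names are invented for illustration.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical standalone sketch of the duplicate-copy filter; not Hive code.
public class DupCopyFilter {
  // existingAtTarget: file names already present under the target base directory.
  public static void dropAlreadyCopied(Set<String> existingAtTarget, List<String> srcFiles) {
    Iterator<String> it = srcFiles.iterator();
    while (it.hasNext()) {
      if (existingAtTarget.contains(it.next())) {
        it.remove();  // copied by a previous, partially applied load: skip it
      }
    }
  }
}
```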
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770296

## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationEx.java

@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.parse;
+
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection;
+import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId;
+import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils;
+import org.apache.hadoop.hive.shims.Utils;
+import org.junit.*;
+import org.junit.rules.TestName;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.*;
+
+import static org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable;
+import static org.junit.Assert.*;
+
+/**
+ * TestReplicationWithTableMigrationEx - test replication for Hive2 to Hive3 (Strict managed tables)
+ */
+public class TestReplicationWithTableMigrationEx {
+  @Rule
+  public final TestName testName = new TestName();
+
+  protected static final Logger LOG = LoggerFactory.getLogger(TestReplicationWithTableMigrationEx.class);
+  private static WarehouseInstance primary, replica;
+  private String primaryDbName, replicatedDbName;
+
+  @BeforeClass
+  public static void classLevelSetup() throws Exception {
+    HashMap overrideProperties = new HashMap<>();
+    internalBeforeClassSetup(overrideProperties);
+  }
+
+  static void internalBeforeClassSetup(Map overrideConfigs) throws Exception {

Review comment: OK
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259770272

## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCompactor.java

@@ -1536,6 +1536,62 @@ public void testCompactionInfoHashCode() {
     Assert.assertEquals("The hash codes must be equal", compactionInfo.hashCode(), compactionInfo1.hashCode());
   }

+  @Test
+  public void testDisableCompactionDuringReplLoad() throws Exception {
+    String tblName = "discomp";
+    String database = "discomp_db";
+    executeStatementOnDriver("drop database if exists " + database + " cascade", driver);
+    executeStatementOnDriver("create database " + database, driver);
+    executeStatementOnDriver("CREATE TABLE " + database + "." + tblName + "(a INT, b STRING) " +
+        " PARTITIONED BY(ds string)" +
+        " CLUSTERED BY(a) INTO 2 BUCKETS" + //currently ACID requires table to be bucketed
+        " STORED AS ORC TBLPROPERTIES ('transactional'='true')", driver);
+    executeStatementOnDriver("insert into " + database + "." + tblName + " partition (ds) values (1, 'fred', " +
+        "'today'), (2, 'wilma', 'yesterday')", driver);
+
+    executeStatementOnDriver("ALTER TABLE " + database + "." + tblName +
+        " SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'true')", driver);
+    List compacts = getCompactionList();
+    Assert.assertEquals(0, compacts.size());
+
+    executeStatementOnDriver("alter database " + database +
+        " set dbproperties ('hive.repl.first.inc.pending' = 'true')", driver);
+    executeStatementOnDriver("ALTER TABLE " + database + "." + tblName +
+        " SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'false')", driver);
+    compacts = getCompactionList();
+    Assert.assertEquals(0, compacts.size());
+
+    executeStatementOnDriver("alter database " + database +
+        " set dbproperties ('hive.repl.first.inc.pending' = 'false')", driver);
+    executeStatementOnDriver("ALTER TABLE " + database + "." + tblName

Review comment: OK
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259769932

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java

@@ -164,6 +165,12 @@ public IncrementalLoadTasksBuilder(String dbName, String tableName, String loadP
           lastEventid);
       }
     }
+
+    ReplSetFirstIncLoadFlagDesc desc = new ReplSetFirstIncLoadFlagDesc(dbName, tableName);

Review comment: OK
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259769991

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java

@@ -60,6 +60,7 @@
   public static final String LAST_REPL_ID_KEY = "hive.repl.last.repl.id";
   public static final String REPL_CHECKPOINT_KEY = "hive.repl.ckpt.key";
+  public static final String REPL_FIRST_INC_PENDING_FLAG = "hive.repl.first.inc.pending";

Review comment: OK
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259745762

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java

@@ -187,4 +192,14 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) {
       }
     };
   }
+
+  public static boolean isFirstIncPending(Map parameter) {

Review comment: The argument name "parameter" can be changed to "dbOrTableProperties".
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259743849

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java

@@ -274,6 +283,19 @@ Long bootStrapDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive hiveDb) th
     for (String tblName : Utils.matchesTbl(hiveDb, dbName, work.tableNameOrPattern)) {
       LOG.debug(
           "analyzeReplDump dumping table: " + tblName + " to db root " + dbRoot.toUri());
+      Table table;
+      try {
+        table = hiveDb.getTable(dbName, tblName);

Review comment: We invoke getTable twice. Can we reuse the tableTuple that we get below, inside the try block?
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259736959

## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java

@@ -1164,6 +1168,9 @@ private static void createReplImportTasks(
     if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) {
       dropTblTask = dropTableTask(table, x, replicationSpec);
       table = null;
+    } else if (!firstIncPending) {
+      // For table level replication, get the flag from table parameter. Check HIVE-21197 for more detail.
+      firstIncPending = ReplUtils.isFirstIncPending(table.getParameters());

Review comment: For insert or alter events where the previous event is create table, referring to the firstIncPending flag could be wrong for table-level replication.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259738265

## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationEx.java

@@ -0,0 +1,273 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.parse;
+
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore;
+import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection;
+import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId;
+import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils;
+import org.apache.hadoop.hive.shims.Utils;
+import org.junit.*;
+import org.junit.rules.TestName;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.*;
+
+import static org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable;
+import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable;
+import static org.junit.Assert.*;
+
+/**
+ * TestReplicationWithTableMigrationEx - test replication for Hive2 to Hive3 (Strict managed tables)
+ */
+public class TestReplicationWithTableMigrationEx {
+  @Rule
+  public final TestName testName = new TestName();
+
+  protected static final Logger LOG = LoggerFactory.getLogger(TestReplicationWithTableMigrationEx.class);
+  private static WarehouseInstance primary, replica;
+  private String primaryDbName, replicatedDbName;
+
+  @BeforeClass
+  public static void classLevelSetup() throws Exception {
+    HashMap overrideProperties = new HashMap<>();
+    internalBeforeClassSetup(overrideProperties);
+  }
+
+  static void internalBeforeClassSetup(Map overrideConfigs) throws Exception {
+    HiveConf conf = new HiveConf(TestReplicationWithTableMigrationEx.class);
+    conf.set("dfs.client.use.datanode.hostname", "true");
+    conf.set("hadoop.proxyuser." + Utils.getUGI().getShortUserName() + ".hosts", "*");
+    MiniDFSCluster miniDFSCluster =
+        new MiniDFSCluster.Builder(conf).numDataNodes(1).format(true).build();
+    final DistributedFileSystem fs = miniDFSCluster.getFileSystem();
+    HashMap hiveConfigs = new HashMap() {{
+      put("fs.defaultFS", fs.getUri().toString());
+      put("hive.support.concurrency", "true");
+      put("hive.txn.manager", "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager");
+      put("hive.metastore.client.capability.check", "false");
+      put("hive.repl.bootstrap.dump.open.txn.timeout", "1s");
+      put("hive.exec.dynamic.partition.mode", "nonstrict");
+      put("hive.strict.checks.bucketing", "false");
+      put("hive.mapred.mode", "nonstrict");
+      put("mapred.input.dir.recursive", "true");
+      put("hive.metastore.disallow.incompatible.col.type.changes", "false");
+      put("hive.strict.managed.tables", "true");
+      put("hive.metastore.transactional.event.listeners", "");
+    }};
+    replica = new WarehouseInstance(LOG, miniDFSCluster, hiveConfigs);
+
+    HashMap configsForPrimary = new HashMap() {{
+      put("fs.defaultFS", fs.getUri().toString());
+      put("hive.metastore.client.capability.check", "false");
+      put("hive.repl.bootstrap.dump.open.txn.timeout", "1s");
+      put("hive.exec.dynamic.partition.mode", "nonstrict");
+      put("hive.strict.checks.bucketing", "false");
+      put("hive.mapred.mode", "nonstrict");
+      put("mapred.input.dir.recursive", "true");
+      put("hive.metastore.disallow.incompatible.col.type.changes", "false");
+      put("hive.support.concurrency", "false");
+      put("hive.txn.manager", "org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager");
+      put("hive.strict.managed.tables", "false");
+    }};
+    configsForPrimary.putAll(overrideConfigs);
+    primary = new WarehouseInstance(LOG, miniDFSCluster,
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259735134

## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java

@@ -1147,6 +1145,12 @@ private static void createReplImportTasks(
       if (!waitOnPrecursor){
         throw new SemanticException(ErrorMsg.DATABASE_NOT_EXISTS.getMsg(tblDesc.getDatabaseName()));
       }
+      // For warehouse level replication, if the database itself is getting created in this load, then no need to
+      // check for duplicate copy. Check HIVE-21197 for more detail.
+      firstIncPending = false;
+    } else {
+      // For database replication, get the flag from database parameter. Check HIVE-21197 for more detail.
+      firstIncPending = ReplUtils.isFirstIncPending(parentDb.getParameters());

Review comment: For table-level replication, we should not refer to parentDb.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259733418

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java

@@ -48,13 +48,15 @@
   private final DatabaseEvent event;
   private final String dbNameToLoadIn;
+  private final boolean isTableLevelLoad;

-  public LoadDatabase(Context context, DatabaseEvent event, String dbNameToLoadIn,
+  public LoadDatabase(Context context, DatabaseEvent event, String dbNameToLoadIn, String tblNameToLoadIn,

Review comment: For table-level replication, we should skip the LoadDatabase tasks from the caller itself. The target DB should exist in that case.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259739290 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java ## @@ -5199,6 +5205,35 @@ public static boolean doesTableNeedLocation(Table tbl) { return retval; } + private int updateFirstIncPendingFlag(Hive hive, ReplSetFirstIncLoadFlagDesc desc) throws HiveException, TException { Review comment: ReplSetFirstIncLoadFlagDesc can be changed to ReplSetFirstIncPendingFlagDesc.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259753768 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java ## @@ -112,6 +118,12 @@ public void run() { continue; } + if (replIsCompactionDisabledForTable(t)) { Review comment: replIsCompactionDisabledForTable will most likely return false. So, can we move this check after the noAutoCompactSet check?
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259735679 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java ## @@ -1164,6 +1168,9 @@ private static void createReplImportTasks( if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) { dropTblTask = dropTableTask(table, x, replicationSpec); table = null; + } else if (!firstIncPending) { +// For table level replication, get the flag from table parameter. Check HIVE-21197 for more detail. +firstIncPending = ReplUtils.isFirstIncPending(table.getParameters()); Review comment: This should be referred to only in the case of table-level replication. Also, for table-level replication with a create table event, we should always set this flag to false.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259752384 ## File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ReplSetFirstIncLoadFlagDesc.java ## @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hadoop.hive.ql.plan; +import org.apache.hadoop.hive.ql.plan.Explain.Level; + +import java.io.Serializable; + +/** + * ReplSetFirstIncLoadFlagDesc. + * + */ +@Explain(displayName = "Set First Incr Load Flag", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED }) +public class ReplSetFirstIncLoadFlagDesc extends DDLDesc implements Serializable { + + private static final long serialVersionUID = 1L; + String databaseName; + String tableName; + boolean incLoadPendingFlag; + + /** + * For serialization only. 
+ */ + public ReplSetFirstIncLoadFlagDesc() { + } + + public ReplSetFirstIncLoadFlagDesc(String databaseName, String tableName, boolean incLoadPendingFlag) { +super(); +this.databaseName = databaseName; +this.tableName = tableName; +this.incLoadPendingFlag = incLoadPendingFlag; + } + + @Explain(displayName="db_name", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED }) + public String getDatabaseName() { +return databaseName; + } + + public void setDatabaseName(String databaseName) { +this.databaseName = databaseName; + } + + @Explain(displayName="table_name", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED }) + public String getTableName() { +return tableName; + } + + public void setTableName(String tableName) { +this.tableName = tableName; + } + + @Explain(displayName="inc load pending flag", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED }) Review comment: @Explain can be removed from this class, as this is an internal task and won't be featured in the explain plan.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259751152 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java ## @@ -426,4 +427,14 @@ public static void copyLastReplId(Map srcParameter, Map
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259739884 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java ## @@ -5199,6 +5205,35 @@ public static boolean doesTableNeedLocation(Table tbl) { return retval; } + private int updateFirstIncPendingFlag(Hive hive, ReplSetFirstIncLoadFlagDesc desc) throws HiveException, TException { +String dbNameOrPattern = desc.getDatabaseName(); +String tableNameOrPattern = desc.getTableName(); +String flag = desc.getIncLoadPendingFlag() ? "true" : "false"; +Map parameters; +// For database level load tableNameOrPattern will be null. Flag is set only in database for db level load. +if (tableNameOrPattern != null && !tableNameOrPattern.isEmpty()) { + // For table level load, dbNameOrPattern is db name and not a pattern. + for (String tableName : Utils.matchesTbl(hive, dbNameOrPattern, tableNameOrPattern)) { +org.apache.hadoop.hive.metastore.api.Table tbl = hive.getMSC().getTable(dbNameOrPattern, tableName); +parameters = tbl.getParameters(); +if (ReplUtils.isFirstIncPending(parameters)) { Review comment: Why do we have this check? I think this should be set only when the current flag differs from the input flag.
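A minimal sketch of the fix suggested in the comment above: write the parameter whenever the stored value differs from the requested one, which also covers the absent-flag case. The class, method, and return-value convention are hypothetical; only the key is assumed to mirror the hive.repl.first.inc.pending property discussed in this review.

```java
import java.util.Map;

public class IncPendingFlagUpdate {
    static final String FLAG_KEY = "hive.repl.first.inc.pending"; // assumed key

    /** Returns true if the parameter map was modified (i.e. an alter call is needed). */
    static boolean setIfDifferent(Map<String, String> parameters, boolean pending) {
        if (parameters == null) {
            return false;
        }
        String requested = pending ? "true" : "false";
        String current = parameters.get(FLAG_KEY);
        // A null current value never equals the requested one, so a missing
        // flag gets written too; identical values skip the metastore update.
        if (!requested.equalsIgnoreCase(current)) {
            parameters.put(FLAG_KEY, requested);
            return true;
        }
        return false;
    }
}
```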
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259743377 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java ## @@ -274,6 +283,19 @@ Long bootStrapDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive hiveDb) th for (String tblName : Utils.matchesTbl(hiveDb, dbName, work.tableNameOrPattern)) { LOG.debug( "analyzeReplDump dumping table: " + tblName + " to db root " + dbRoot.toUri()); + Table table; + try { +table = hiveDb.getTable(dbName, tblName); Review comment: These table level checks can be done only if work.tableNameOrPattern is valid.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259606013 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java ## @@ -271,12 +299,13 @@ public String getName() { LOG.debug("ReplCopyTask:getLoadCopyTask: {}=>{}", srcPath, dstPath); if ((replicationSpec != null) && replicationSpec.isInReplicationScope()){ ReplCopyWork rcwork = new ReplCopyWork(srcPath, dstPath, false); - if (replicationSpec.isReplace() && conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION)) { + if (replicationSpec.isReplace() && (conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION) || copyToMigratedTxnTable)) { rcwork.setDeleteDestIfExist(true); Review comment: As per the discussion, we need to skip the duplicate check for the replace case, as we may end up skipping the copy of files with the same name but different content. Also, we may need to move those matching files to a new base path, which is complicated. So, we just skip the duplicate check for the replace flow.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259587931 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCompactor.java ## @@ -1536,6 +1536,62 @@ public void testCompactionInfoHashCode() { Assert.assertEquals("The hash codes must be equal", compactionInfo.hashCode(), compactionInfo1.hashCode()); } + @Test + public void testDisableCompactionDuringReplLoad() throws Exception { +String tblName = "discomp"; +String database = "discomp_db"; +executeStatementOnDriver("drop database if exists " + database + " cascade", driver); +executeStatementOnDriver("create database " + database, driver); +executeStatementOnDriver("CREATE TABLE " + database + "." + tblName + "(a INT, b STRING) " + +" PARTITIONED BY(ds string)" + +" CLUSTERED BY(a) INTO 2 BUCKETS" + //currently ACID requires table to be bucketed +" STORED AS ORC TBLPROPERTIES ('transactional'='true')", driver); +executeStatementOnDriver("insert into " + database + "." + tblName + " partition (ds) values (1, 'fred', " + +"'today'), (2, 'wilma', 'yesterday')", driver); + +executeStatementOnDriver("ALTER TABLE " + database + "." + tblName + +" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'true')", driver); +List compacts = getCompactionList(); +Assert.assertEquals(0, compacts.size()); + +executeStatementOnDriver("alter database " + database + +" set dbproperties ('hive.repl.first.inc.pending' = 'true')", driver); +executeStatementOnDriver("ALTER TABLE " + database + "." 
+ tblName + +" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'false')", driver); +compacts = getCompactionList(); +Assert.assertEquals(0, compacts.size()); + +executeStatementOnDriver("alter database " + database + +" set dbproperties ('hive.repl.first.inc.pending' = 'false')", driver); +executeStatementOnDriver("ALTER TABLE " + database + "." + tblName + Review comment: Alter table doesn't make sense now as we removed it from use.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259587889 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationEx.java ## @@ -0,0 +1,213 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.hadoop.hive.ql.parse; + +import org.apache.hadoop.hdfs.DistributedFileSystem; +import org.apache.hadoop.hdfs.MiniDFSCluster; +import org.apache.hadoop.hive.conf.HiveConf; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection; +import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId; +import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils; +import org.apache.hadoop.hive.shims.Utils; +import org.junit.*; +import org.junit.rules.TestName; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.util.*; + +import static org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable; +import static org.junit.Assert.*; + +/** + * TestReplicationWithTableMigrationEx - test replication for Hive2 to Hive3 (Strict managed tables) + */ +public class TestReplicationWithTableMigrationEx { + @Rule + public final TestName testName = new TestName(); + + protected static final Logger LOG = LoggerFactory.getLogger(TestReplicationWithTableMigrationEx.class); + private static WarehouseInstance primary, replica; + private String primaryDbName, replicatedDbName; + + @BeforeClass + public static void classLevelSetup() throws Exception { +HashMap overrideProperties = new HashMap<>(); +internalBeforeClassSetup(overrideProperties); + } + + static void internalBeforeClassSetup(Map overrideConfigs) throws Exception { Review comment: Can we have a common base class just like BaseReplicationAcrossInstances instead of duplicating in both classes?
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259588731 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) { } }; } + + public static boolean isFirstIncDone(Map parameter) { +if (parameter == null) { + return true; +} +String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG); +return compFlag == null || compFlag.isEmpty() || "false".equalsIgnoreCase(compFlag); Review comment: String.equalsIgnoreCase handles the null and empty cases of its input argument; however, it returns false in both cases. We can remove these checks, but we need to be careful.
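To make the caution in the comment above concrete: "false".equalsIgnoreCase(x) returns false for a null or empty x, so dropping the explicit null/empty checks would flip the result for an absent flag from true (done) to false (pending). The sketch below is illustrative, not the actual ReplUtils code; the key is only assumed to match the property name used in this review.

```java
import java.util.Map;

public class EqualsIgnoreCaseDemo {
    static final String FLAG_KEY = "hive.repl.first.inc.pending"; // assumed key

    // Mirrors the logic quoted in the diff above: absent/empty/"false" means done.
    static boolean isFirstIncDoneWithChecks(Map<String, String> params) {
        if (params == null) {
            return true;
        }
        String compFlag = params.get(FLAG_KEY);
        return compFlag == null || compFlag.isEmpty() || "false".equalsIgnoreCase(compFlag);
    }

    // Hypothetical "simplified" version without the null/empty checks:
    // an absent flag now reads as pending, which changes behavior.
    static boolean isFirstIncDoneSimplified(Map<String, String> params) {
        if (params == null) {
            return true;
        }
        return "false".equalsIgnoreCase(params.get(FLAG_KEY));
    }
}
```

With the flag absent, the first version reports done while the simplified one reports pending; that divergence is exactly why the checks cannot be removed blindly.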
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259588983 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java ## @@ -289,12 +296,21 @@ private boolean shouldReplayEvent(FileStatus dir, DumpType dumpType, String dbNa return updateReplIdTask; } - private Task dbUpdateReplStateTask(String dbName, String replState, + private Task dbUpdateReplStateTask(String dbName, String replState, String incLoadPendFlag, Review comment: Shall change incLoadPendFlag to firstIncLoadPendingFlag.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259588369 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java ## @@ -271,12 +299,13 @@ public String getName() { LOG.debug("ReplCopyTask:getLoadCopyTask: {}=>{}", srcPath, dstPath); if ((replicationSpec != null) && replicationSpec.isInReplicationScope()){ ReplCopyWork rcwork = new ReplCopyWork(srcPath, dstPath, false); - if (replicationSpec.isReplace() && conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION)) { + if (replicationSpec.isReplace() && (conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION) || copyToMigratedTxnTable)) { rcwork.setDeleteDestIfExist(true); Review comment: If it is the insert overwrite case (writeId=2), let's say a few of the listed files are already there in base_1. Now, this deleteDestIfExist marks it to create base_2 instead of delta_2_2, as it is IOW. So, only partial data would be copied to base_2. After the commit of writeId=2, all readers would see the partial data from base_2. So, we need to handle this case.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259587946 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCompactor.java ## @@ -1536,6 +1536,62 @@ public void testCompactionInfoHashCode() { Assert.assertEquals("The hash codes must be equal", compactionInfo.hashCode(), compactionInfo1.hashCode()); } + @Test + public void testDisableCompactionDuringReplLoad() throws Exception { +String tblName = "discomp"; +String database = "discomp_db"; +executeStatementOnDriver("drop database if exists " + database + " cascade", driver); +executeStatementOnDriver("create database " + database, driver); +executeStatementOnDriver("CREATE TABLE " + database + "." + tblName + "(a INT, b STRING) " + +" PARTITIONED BY(ds string)" + +" CLUSTERED BY(a) INTO 2 BUCKETS" + //currently ACID requires table to be bucketed +" STORED AS ORC TBLPROPERTIES ('transactional'='true')", driver); +executeStatementOnDriver("insert into " + database + "." + tblName + " partition (ds) values (1, 'fred', " + +"'today'), (2, 'wilma', 'yesterday')", driver); + +executeStatementOnDriver("ALTER TABLE " + database + "." + tblName + +" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'true')", driver); +List compacts = getCompactionList(); +Assert.assertEquals(0, compacts.size()); + +executeStatementOnDriver("alter database " + database + +" set dbproperties ('hive.repl.first.inc.pending' = 'true')", driver); +executeStatementOnDriver("ALTER TABLE " + database + "." + tblName + Review comment: Alter table doesn't make sense now as we removed it from use.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259587771 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/MetaStoreCompactorThread.java ## @@ -71,6 +74,16 @@ public void init(AtomicBoolean stop, AtomicBoolean looped) throws Exception { } } + @Override boolean replIsCompactionDisabledForDatabase(String dbName) throws TException { +try { + Database database = rs.getDatabase(getDefaultCatalog(conf), dbName); Review comment: Shall add a comment on why compaction is disabled if the first incremental is not done.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259587562 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java ## @@ -426,4 +427,14 @@ public static void copyLastReplId(Map srcParameter, Map
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259589411 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) { } }; } + + public static boolean isFirstIncDone(Map parameter) { Review comment: All the callers negate this result when using it. Why don't we change it to isFirstIncPending? That would also make it uniform with the stored DB parameter name.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259587524 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java ## @@ -426,4 +427,14 @@ public static void copyLastReplId(Map srcParameter, Map
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259589143 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java ## @@ -289,12 +296,21 @@ private boolean shouldReplayEvent(FileStatus dir, DumpType dumpType, String dbNa return updateReplIdTask; } - private Task dbUpdateReplStateTask(String dbName, String replState, + private Task dbUpdateReplStateTask(String dbName, String replState, String incLoadPendFlag, Task preCursor) { HashMap mapProp = new HashMap<>(); -mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState); -AlterDatabaseDesc alterDbDesc = new AlterDatabaseDesc(dbName, mapProp, new ReplicationSpec(replState, replState)); +// if the update is for incLoadPendFlag, then send replicationSpec as null to avoid replacement check. +ReplicationSpec replicationSpec = null; +if (incLoadPendFlag == null) { + mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState); + replicationSpec = new ReplicationSpec(replState, replState); +} else { + assert replState == null; + mapProp.put(ReplUtils.REPL_FIRST_INC_PENDING_FLAG, incLoadPendFlag); Review comment: There is one corner case for A->B->C where EVENT_ALTER_DATABASE should skip this additional parameter while dumping. Let's say bootstrap and the first incremental are done on B, so this flag is false there. Now, we trigger a bootstrap dump from B->C and, concurrently, an alterDb operation logs an event with this flag as false. When we bootstrap load in C, we set the flag there as true. During the first incremental load in C, when we process the AlterDb event, we would set it to false, so that we open up compaction after this event even though later events may not guarantee the duplicate check.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259586952 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java ## @@ -135,7 +135,8 @@ private boolean isDbEmpty(String dbName) throws HiveException { } private Task alterDbTask(Database dbObj) { -return alterDbTask(dbObj.getName(), updateDbProps(dbObj, context.dumpDirectory), context.hiveConf); +return alterDbTask(dbObj.getName(), updateDbProps(dbObj, context.dumpDirectory, false), Review comment: We need to set it in the alter case too, if it is bootstrapping onto an empty Db. Maybe we can remove the flag and just set it to true always.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259588001 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java ## @@ -61,6 +62,21 @@ public ReplCopyTask(){ super(); } + // If file is already present in base directory, then remove it from the list. + // Check HIVE-21197 for more detail + private void updateSrcFileListForDupCopy(FileSystem dstFs, Path toPath, List srcFiles, + long writeId, int stmtId) throws IOException { +ListIterator iter = srcFiles.listIterator(); +Path basePath = new Path(toPath, AcidUtils.baseOrDeltaSubdir(true, writeId, writeId, stmtId)); +while (iter.hasNext()) { + Path filePath = new Path(basePath, iter.next().getSourcePath().getName()); + if (dstFs.exists(filePath)) { Review comment: Do we need a retry mechanism in case it throws an IOException due to a network glitch?
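One possible shape for the retry asked about above is a bounded retry loop around the exists() probe. This is a sketch only: the ExistsProbe interface stands in for the dstFs.exists(filePath) call, and the attempt count and absence of backoff are illustrative, not part of the patch.

```java
import java.io.IOException;

public class RetryingExistsCheck {
    /** Stand-in for a dstFs.exists(filePath) call on a FileSystem. */
    interface ExistsProbe {
        boolean exists() throws IOException;
    }

    /** Retries the probe up to maxAttempts times, rethrowing the last IOException. */
    static boolean existsWithRetry(ExistsProbe probe, int maxAttempts) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return probe.exists();
            } catch (IOException e) {
                last = e; // assume a transient glitch and try again
            }
        }
        throw last; // all attempts failed: surface the final error
    }
}
```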
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259588476 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java ## @@ -370,6 +370,9 @@ private int executeIncrementalLoad(DriverContext driverContext) { // If incremental events are already applied, then check and perform if need to bootstrap any tables. if (!builder.hasMoreWork() && !work.getPathsToCopyIterator().hasNext()) { +// No need to set incremental load pending flag for external tables as the files will be copied to the same path Review comment: Add a TODO here saying that this flag needs to be set for non-txn to txn migrated tables if they are bootstrapped along with an incremental load (table-level replication). With the TODO, the current comment makes sense as to why it is not needed for external tables.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259586959 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java ## @@ -158,6 +159,15 @@ private boolean isDbEmpty(String dbName) throws HiveException { // Add the checkpoint key to the Database binding it to current dump directory. // So, if retry using same dump, we shall skip Database object update. parameters.put(ReplUtils.REPL_CHECKPOINT_KEY, dumpDirectory); + +if (needSetIncFlag) { Review comment: Need to set it in the alter case too (the empty-DB bootstrap case).
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259589263 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) { } }; } + + public static boolean isFirstIncDone(Map parameter) { +if (parameter == null) { + return true; +} +String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG); Review comment: The name compFlag doesn't look relevant here. Please rename it.
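A version of the helper with a more descriptive local name might look like the following. This is a sketch, not the actual Hive code: the flag semantics (the value "true" meaning the first incremental load is still pending) are inferred from the surrounding discussion.

```java
import java.util.Map;

public class ReplFlagSketch {
    public static final String REPL_FIRST_INC_PENDING_FLAG = "hive.repl.first.inc.pending";

    /** True when no first incremental load is pending for the given object. */
    public static boolean isFirstIncDone(Map<String, String> parameters) {
        if (parameters == null) {
            return true;
        }
        // A descriptive name instead of the unrelated-sounding "compFlag".
        String firstIncPendingFlag = parameters.get(REPL_FIRST_INC_PENDING_FLAG);
        return firstIncPendingFlag == null || !firstIncPendingFlag.equalsIgnoreCase("true");
    }
}
```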
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259260745 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java ## @@ -158,6 +159,15 @@ private boolean isDbEmpty(String dbName) throws HiveException { // Add the checkpoint key to the Database binding it to current dump directory. // So, if retry using same dump, we shall skip Database object update. parameters.put(ReplUtils.REPL_CHECKPOINT_KEY, dumpDirectory); + +if (needSetIncFlag) { Review comment: No need to check it; just set needSetIncFlag directly using setBoolVar.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259253737 ## File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ReplSetFirstIncLoadFlagDesc.java ## @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hadoop.hive.ql.plan; +import org.apache.hadoop.hive.ql.plan.Explain.Level; + +import java.io.Serializable; + +/** + * ReplSetFirstIncLoadFlagDesc. + * + */ +@Explain(displayName = "Set First Incr Load Flag", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED }) +public class ReplSetFirstIncLoadFlagDesc extends DDLDesc implements Serializable { + + private static final long serialVersionUID = 1L; + String databaseName; + String tableName; + + /** + * For serialization only. + */ + public ReplSetFirstIncLoadFlagDesc() { + } + + public ReplSetFirstIncLoadFlagDesc(String databaseName, String tableName) { Review comment: Why does it not take an extra boolean argument for the flag?
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259252494 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java ## @@ -1136,6 +1133,7 @@ private static void createReplImportTasks( Task dropTblTask = null; WriteEntity.WriteType lockType = WriteEntity.WriteType.DDL_NO_LOCK; +Boolean firstIncDone; Review comment: Should use the primitive boolean type.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259261832 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -60,6 +60,7 @@ public static final String LAST_REPL_ID_KEY = "hive.repl.last.repl.id"; public static final String REPL_CHECKPOINT_KEY = "hive.repl.ckpt.key"; + public static final String REPL_FIRST_INC_PENDING_FLAG = "hive.repl.first.inc.pending"; Review comment: Can we use a more generic name like "hive.repl.is.data.consistent"? This would avoid confusion in the incremental-bootstrap case of table-level replication, where some tables' data will be consistent only after the next incremental load.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259252954 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java ## @@ -1147,6 +1145,12 @@ private static void createReplImportTasks( if (!waitOnPrecursor){ throw new SemanticException(ErrorMsg.DATABASE_NOT_EXISTS.getMsg(tblDesc.getDatabaseName())); } + // For warehouse level replication, if the database itself is getting created in this load, then no need to + // check for duplicate copy. Check HIVE-21197 for more detail. + firstIncDone = true; Review comment: Should use a ternary operator.
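The ternary suggestion amounts to collapsing the if/else assignment into a single expression, roughly as below; the method and variable names here are illustrative stand-ins for the ones in createReplImportTasks, not the actual patch.

```java
import java.util.Map;

public class TernarySketch {
    /** Simplified stand-in for ReplUtils.isFirstIncDone. */
    static boolean isFirstIncDone(Map<String, String> dbParams) {
        return dbParams == null || !"true".equalsIgnoreCase(dbParams.get("hive.repl.first.inc.pending"));
    }

    /**
     * If the database is being created by this very load, there is nothing
     * to de-duplicate, so the flag can be decided in one ternary expression.
     */
    static boolean firstIncDone(boolean dbAlreadyExists, Map<String, String> dbParams) {
        return dbAlreadyExists ? isFirstIncDone(dbParams) : true;
    }
}
```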
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259232906 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java ## @@ -661,6 +663,10 @@ public int execute(DriverContext driverContext) { if (work.getAlterMaterializedViewDesc() != null) { return alterMaterializedView(db, work.getAlterMaterializedViewDesc()); } + + if (work.getReplSetFirstIncLoadFlagDesc() != null) { Review comment: Why don't we use alterTableDesc or alterDatabaseDesc instead of a new type? We are just editing a db/table/partition property.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259233648 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java ## @@ -61,6 +62,25 @@ public ReplCopyTask(){ super(); } + // If file is already present in base directory, then remove it from the list. + // Check HIVE-21197 for more detail + private void updateSrcFileListForDupCopy(FileSystem dstFs, Path toPath, List srcFiles, + long writeId, int stmtId) throws IOException { +ListIterator iter = srcFiles.listIterator(); +Path basePath = new Path(toPath, AcidUtils.baseOrDeltaSubdir(true, writeId, writeId, stmtId)); +while (iter.hasNext()) { + Path filePath = new Path(basePath, iter.next().getSourcePath().getName()); + try { +if (dstFs.getFileStatus(filePath) != null) { Review comment: Should use the fs.exists(path) method instead of getFileStatus; then there is no need to handle the exception either.
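With exists() the loop becomes a simple boolean probe with no exception plumbing for the missing-file case. The pure-Java sketch below mimics the shape of updateSrcFileListForDupCopy, with a Set of names standing in for the destination FileSystem (an assumption made purely so the example is runnable):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.ListIterator;
import java.util.Set;

public class DedupSketch {
    /**
     * Drops from srcFiles every name already present at the destination;
     * existingNames plays the role of the dstFs.exists(filePath) probes.
     */
    static void removeAlreadyCopied(List<String> srcFiles, Set<String> existingNames) {
        ListIterator<String> iter = srcFiles.listIterator();
        while (iter.hasNext()) {
            if (existingNames.contains(iter.next())) {
                iter.remove(); // already copied by a previous (retried) run
            }
        }
    }

    public static void main(String[] args) {
        List<String> src = new ArrayList<>(Arrays.asList("f1", "f2", "f3"));
        removeAlreadyCopied(src, new HashSet<>(Arrays.asList("f2")));
        System.out.println(src); // [f1, f3]
    }
}
```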
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259254156 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/MetaStoreCompactorThread.java ## @@ -71,6 +74,20 @@ public void init(AtomicBoolean stop, AtomicBoolean looped) throws Exception { } } + @Override boolean replIsCompactionDisabledForDatabase(String dbName) throws TException { +try { + Database database = rs.getDatabase(getDefaultCatalog(conf), dbName); + if (database != null) { +return !ReplUtils.isFirstIncDone(database.getParameters()); + } + LOG.info("Unable to find database " + dbName); Review comment: Why are there two different flows for the no-database case?
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259261039 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java ## @@ -164,6 +165,12 @@ public IncrementalLoadTasksBuilder(String dbName, String tableName, String loadP lastEventid); } } + + ReplSetFirstIncLoadFlagDesc desc = new ReplSetFirstIncLoadFlagDesc(dbName, tableName); Review comment: Can we combine this with the alterDb task that updates the last replId?
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259228579 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationMisc.java ## @@ -0,0 +1,233 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.hadoop.hive.ql.parse; + +import org.apache.hadoop.hdfs.DistributedFileSystem; +import org.apache.hadoop.hdfs.MiniDFSCluster; +import org.apache.hadoop.hive.conf.HiveConf; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection; +import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId; +import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils; +import org.apache.hadoop.hive.shims.Utils; +import org.junit.*; +import org.junit.rules.TestName; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import javax.annotation.Nullable; +import java.io.IOException; +import java.util.*; + +import static org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable; +import static org.junit.Assert.*; + +/** + * TestReplicationWithTableMigration - test replication for Hive2 to Hive3 (Strict managed tables) + */ +public class TestReplicationWithTableMigrationMisc { Review comment: The name could be more meaningful.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259234204 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java ## @@ -133,8 +161,11 @@ protected int execute(DriverContext driverContext) { return 6; } long writeId = Long.parseLong(writeIdString); - toPath = new Path(toPath, AcidUtils.baseOrDeltaSubdir(work.getDeleteDestIfExist(), writeId, writeId, - driverContext.getCtx().getHiveTxnManager().getStmtIdAndIncrement())); + // Set stmt id 0 for bootstrap load as the directory needs to be searched during incremental load to avoid any + // duplicate copy from the source. Check HIVE-21197 for more detail. + int stmtId = (writeId == ReplUtils.REPL_BOOTSTRAP_MIGRATION_BASE_WRITE_ID) ? 0 : Review comment: No need for this check since this is the migration flow; set it to 0 always. Also, it is better to add a final variable in ReplUtils for this instead of hardcoding it.
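The named-constant suggestion could look roughly like this. Both the constant name and the directory formatter are illustrative assumptions; the formatter only approximates what AcidUtils.baseOrDeltaSubdir produces, and the real Hive format may differ.

```java
public class StmtIdSketch {
    // Hypothetical named constant, per the review, instead of a hardcoded 0:
    // bootstrap-load base directories always use statement id 0 so that the
    // later incremental load can find them for the duplicate-copy check.
    public static final int REPL_BOOTSTRAP_STMT_ID = 0;

    /** Illustrative base-directory name builder (format is an assumption). */
    static String baseSubdir(long writeId, int stmtId) {
        return String.format("base_%07d_%04d", writeId, stmtId);
    }

    public static void main(String[] args) {
        System.out.println(baseSubdir(5L, REPL_BOOTSTRAP_STMT_ID));
    }
}
```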
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259237893 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java ## @@ -112,6 +118,12 @@ public void run() { continue; } + if (replIsCompactionDisabledForTable(t)) { Review comment: I think we can disable compaction at the DB level only. Even table-level replication replicates at the DB level with a list of tables. It will be simpler and easier to maintain.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259230730 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationMisc.java ## @@ -0,0 +1,233 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.hadoop.hive.ql.parse; + +import org.apache.hadoop.hdfs.DistributedFileSystem; +import org.apache.hadoop.hdfs.MiniDFSCluster; +import org.apache.hadoop.hive.conf.HiveConf; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection; +import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId; +import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils; +import org.apache.hadoop.hive.shims.Utils; +import org.junit.*; +import org.junit.rules.TestName; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import javax.annotation.Nullable; +import java.io.IOException; +import java.util.*; + +import static org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable; +import static org.junit.Assert.*; + +/** + * TestReplicationWithTableMigration - test replication for Hive2 to Hive3 (Strict managed tables) + */ +public class TestReplicationWithTableMigrationMisc { + @Rule + public final TestName testName = new TestName(); + + protected static final Logger LOG = LoggerFactory.getLogger(TestReplicationWithTableMigrationMisc.class); + private static WarehouseInstance primary, replica; + private String primaryDbName, replicatedDbName; + + @BeforeClass + public static void classLevelSetup() throws Exception { +HashMap overrideProperties = new HashMap<>(); +internalBeforeClassSetup(overrideProperties); + } + + static void internalBeforeClassSetup(Map overrideConfigs) throws Exception { +HiveConf conf = new HiveConf(TestReplicationWithTableMigrationMisc.class); +conf.set("dfs.client.use.datanode.hostname", "true"); +conf.set("hadoop.proxyuser." 
+ Utils.getUGI().getShortUserName() + ".hosts", "*"); +MiniDFSCluster miniDFSCluster = +new MiniDFSCluster.Builder(conf).numDataNodes(1).format(true).build(); +final DistributedFileSystem fs = miniDFSCluster.getFileSystem(); +HashMap hiveConfigs = new HashMap() {{ + put("fs.defaultFS", fs.getUri().toString()); + put("hive.support.concurrency", "true"); + put("hive.txn.manager", "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager"); + put("hive.metastore.client.capability.check", "false"); + put("hive.repl.bootstrap.dump.open.txn.timeout", "1s"); + put("hive.exec.dynamic.partition.mode", "nonstrict"); + put("hive.strict.checks.bucketing", "false"); + put("hive.mapred.mode", "nonstrict"); + put("mapred.input.dir.recursive", "true"); + put("hive.metastore.disallow.incompatible.col.type.changes", "false"); + put("hive.strict.managed.tables", "true"); + put("hive.metastore.transactional.event.listeners", ""); +}}; +replica = new WarehouseInstance(LOG, miniDFSCluster, hiveConfigs); + +HashMap configsForPrimary = new HashMap() {{ + put("fs.defaultFS", fs.getUri().toString()); + put("hive.metastore.client.capability.check", "false"); + put("hive.repl.bootstrap.dump.open.txn.timeout", "1s"); + put("hive.exec.dynamic.partition.mode", "nonstrict"); + put("hive.strict.checks.bucketing", "false"); + put("hive.mapred.mode", "nonstrict"); + put("mapred.input.dir.recursive", "true"); + put("hive.metastore.disallow.incompatible.col.type.changes", "false"); + put("hive.support.concurrency", "false"); + put("hive.txn.manager", "org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager"); + put("hive.strict.managed.tables", "false"); +}}; +configsForPrimary.putAll(overrideConfigs); +primary = new
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259235570 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java ## @@ -271,12 +302,13 @@ public String getName() { LOG.debug("ReplCopyTask:getLoadCopyTask: {}=>{}", srcPath, dstPath); if ((replicationSpec != null) && replicationSpec.isInReplicationScope()){ ReplCopyWork rcwork = new ReplCopyWork(srcPath, dstPath, false); - if (replicationSpec.isReplace() && conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION)) { + if (replicationSpec.isReplace() && (conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION) || copyToMigratedTxnTable)) { rcwork.setDeleteDestIfExist(true); rcwork.setAutoPurge(isAutoPurge); rcwork.setNeedRecycle(needRecycle); } rcwork.setCopyToMigratedTxnTable(copyToMigratedTxnTable); + rcwork.setCheckDuplicateCopy(replicationSpec.needDupCopyCheck()); Review comment: needDupCopyCheck seems like a generic flag, but we handle it only for the migration case. Instead, we should rename it to isFirstIncAfterBootstrap, which makes it clear that some flows need to handle this case differently.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259230584 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationMisc.java ## @@ -0,0 +1,233 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.hadoop.hive.ql.parse; + +import org.apache.hadoop.hdfs.DistributedFileSystem; +import org.apache.hadoop.hdfs.MiniDFSCluster; +import org.apache.hadoop.hive.conf.HiveConf; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection; +import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId; +import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils; +import org.apache.hadoop.hive.shims.Utils; +import org.junit.*; +import org.junit.rules.TestName; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import javax.annotation.Nullable; +import java.io.IOException; +import java.util.*; Review comment: Shall avoid importing *. 
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259254333 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/RemoteCompactorThread.java ## @@ -53,6 +56,20 @@ public void init(AtomicBoolean stop, AtomicBoolean looped) throws Exception { } } + @Override boolean replIsCompactionDisabledForDatabase(String dbName) throws TException { Review comment: This is a duplicate method. Can we have a static method in MetastoreUtils that takes the MSC as input?
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259227325 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosAcrossInstances.java ## @@ -610,6 +610,11 @@ public void testIncrementalDumpOfWarehouse() throws Throwable { .run("show tables") .verifyResults(new String[] { "t1" }); +//default db is already created at replica before load. So the incr flag should not be set. + assertTrue(ReplUtils.isFirstIncDone(replica.getDatabase("default").getParameters())); + assertTrue(!ReplUtils.isFirstIncDone(replica.getDatabase(primaryDbName).getParameters())); Review comment: Should use the assertFalse method for clarity. Check the other occurrences too.
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259236827 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -187,4 +188,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) { } }; } + + public static boolean isFirstIncDone(Map parameter) { +if (parameter == null) { + return true; +} +String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG); Review comment: Should use getBoolVar to avoid the string comparison.