[GitHub] [flink] GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy URL: https://github.com/apache/flink/pull/9060#discussion_r305710298 ## File path: flink-end-to-end-tests/flink-dataset-fine-grained-recovery-test/src/main/java/org/apache/flink/batch/tests/util/FileBasedOneShotLatch.java ## @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.flink.batch.tests.util; + +import com.sun.nio.file.SensitivityWatchEventModifier; + +import javax.annotation.concurrent.NotThreadSafe; + +import java.io.Closeable; +import java.io.IOException; +import java.nio.file.FileSystems; +import java.nio.file.Files; +import java.nio.file.Path; +import java.nio.file.StandardWatchEventKinds; +import java.nio.file.WatchEvent; +import java.nio.file.WatchKey; +import java.nio.file.WatchService; + +import static org.apache.flink.util.Preconditions.checkNotNull; + +/** + * A synchronization aid that allows a single thread to wait on the creation of a specified file. 
+ */ +@NotThreadSafe +public class FileBasedOneShotLatch implements Closeable { + + private final Path latchFile; + + private final WatchService watchService; + + private boolean released; + + public FileBasedOneShotLatch(final Path latchFile) { + this.latchFile = checkNotNull(latchFile); + + final Path parentDir = checkNotNull(latchFile.getParent(), "latchFile must have a parent"); + this.watchService = initWatchService(parentDir); + } + + private static WatchService initWatchService(final Path parentDir) { + final WatchService watchService = createWatchService(); + watchForLatchFile(watchService, parentDir); + return watchService; + } + + private static WatchService createWatchService() { + try { + return FileSystems.getDefault().newWatchService(); + } catch (IOException e) { + throw new RuntimeException(e); + } + } + + private static void watchForLatchFile(final WatchService watchService, final Path parentDir) { + try { + parentDir.register( + watchService, + new WatchEvent.Kind[]{StandardWatchEventKinds.ENTRY_CREATE}, + SensitivityWatchEventModifier.HIGH); + } catch (IOException e) { + throw new RuntimeException(e); + } + } + + /** +* Waits until the latch file is created. +* +* @throws InterruptedException if interrupted while waiting +*/ + public void await() throws InterruptedException { + if (isReleasedOrReleasable()) { + return; + } + + awaitLatchFile(watchService); + } + + private void awaitLatchFile(final WatchService watchService) throws InterruptedException { + while (true) { + WatchKey take = watchService.take(); + if (isReleasedOrReleasable()) { Review comment: > Prone to files being deleted in-between, but this seems unlikely. True, didn't think about that case. 
I think it's acceptable to leave it as is because:
- we wait until the job is finished before deleting files
- class is contained in this module
- code is simpler (as you already mentioned)
- judging by [this SO answer](https://stackoverflow.com/a/11182515), even WatchService could lose the event if the file is deleted shortly after creation.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services
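The trade-off in this comment can be illustrated with a minimal sketch. It is not the class from the pull request: `FileLatchSketch` and `awaitFile` are hypothetical names, and the sketch assumes the latch file has a parent directory on a file system that supports `WatchService`. It registers for `ENTRY_CREATE` first and then re-checks whether the file exists after every wake-up.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

/** Hypothetical sketch for illustration; not the class under review. */
final class FileLatchSketch {

    private FileLatchSketch() {}

    /** Blocks until latchFile exists. Assumes latchFile has a parent directory. */
    static void awaitFile(final Path latchFile) throws IOException, InterruptedException {
        final Path parentDir = latchFile.getParent();
        try (WatchService watchService = parentDir.getFileSystem().newWatchService()) {
            // Register before the first existence check, so a file created
            // between the two steps is still noticed by the check.
            parentDir.register(watchService, StandardWatchEventKinds.ENTRY_CREATE);
            while (!Files.exists(latchFile)) {
                // Wakes up on any creation in parentDir, not only the latch file.
                final WatchKey watchKey = watchService.take();
                watchKey.pollEvents(); // drain; the loop condition re-checks the file
                watchKey.reset();      // re-arm the key for further events
            }
        }
    }
}
```

Because the loop relies on `Files.exists` rather than on the event itself, a latch file that is created and deleted again before the check runs is not observed, and the caller keeps waiting. That is the case described above as unlikely when files are only deleted after the job has finished.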
[GitHub] [flink] GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy URL: https://github.com/apache/flink/pull/9060#discussion_r305710298 ## File path: flink-end-to-end-tests/flink-dataset-fine-grained-recovery-test/src/main/java/org/apache/flink/batch/tests/util/FileBasedOneShotLatch.java ## @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.flink.batch.tests.util; + +import com.sun.nio.file.SensitivityWatchEventModifier; + +import javax.annotation.concurrent.NotThreadSafe; + +import java.io.Closeable; +import java.io.IOException; +import java.nio.file.FileSystems; +import java.nio.file.Files; +import java.nio.file.Path; +import java.nio.file.StandardWatchEventKinds; +import java.nio.file.WatchEvent; +import java.nio.file.WatchKey; +import java.nio.file.WatchService; + +import static org.apache.flink.util.Preconditions.checkNotNull; + +/** + * A synchronization aid that allows a single thread to wait on the creation of a specified file. 
+ */ +@NotThreadSafe +public class FileBasedOneShotLatch implements Closeable { + + private final Path latchFile; + + private final WatchService watchService; + + private boolean released; + + public FileBasedOneShotLatch(final Path latchFile) { + this.latchFile = checkNotNull(latchFile); + + final Path parentDir = checkNotNull(latchFile.getParent(), "latchFile must have a parent"); + this.watchService = initWatchService(parentDir); + } + + private static WatchService initWatchService(final Path parentDir) { + final WatchService watchService = createWatchService(); + watchForLatchFile(watchService, parentDir); + return watchService; + } + + private static WatchService createWatchService() { + try { + return FileSystems.getDefault().newWatchService(); + } catch (IOException e) { + throw new RuntimeException(e); + } + } + + private static void watchForLatchFile(final WatchService watchService, final Path parentDir) { + try { + parentDir.register( + watchService, + new WatchEvent.Kind[]{StandardWatchEventKinds.ENTRY_CREATE}, + SensitivityWatchEventModifier.HIGH); + } catch (IOException e) { + throw new RuntimeException(e); + } + } + + /** +* Waits until the latch file is created. +* +* @throws InterruptedException if interrupted while waiting +*/ + public void await() throws InterruptedException { + if (isReleasedOrReleasable()) { + return; + } + + awaitLatchFile(watchService); + } + + private void awaitLatchFile(final WatchService watchService) throws InterruptedException { + while (true) { + WatchKey take = watchService.take(); + if (isReleasedOrReleasable()) { Review comment: > Prone to files being deleted in-between, but this seems unlikely. True, didn't think about that case. 
I think it's acceptable to leave it as is because:
- we wait until the job is finished before deleting files
- class is contained in this module
- code is simpler (as you already mentioned)
- judging by [this SO answer](https://stackoverflow.com/a/11182515), even WatchService could miss the file if it is deleted shortly after creation.
[GitHub] [flink] GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy URL: https://github.com/apache/flink/pull/9060#discussion_r305710298 ## File path: flink-end-to-end-tests/flink-dataset-fine-grained-recovery-test/src/main/java/org/apache/flink/batch/tests/util/FileBasedOneShotLatch.java ## @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.flink.batch.tests.util; + +import com.sun.nio.file.SensitivityWatchEventModifier; + +import javax.annotation.concurrent.NotThreadSafe; + +import java.io.Closeable; +import java.io.IOException; +import java.nio.file.FileSystems; +import java.nio.file.Files; +import java.nio.file.Path; +import java.nio.file.StandardWatchEventKinds; +import java.nio.file.WatchEvent; +import java.nio.file.WatchKey; +import java.nio.file.WatchService; + +import static org.apache.flink.util.Preconditions.checkNotNull; + +/** + * A synchronization aid that allows a single thread to wait on the creation of a specified file. 
+ */ +@NotThreadSafe +public class FileBasedOneShotLatch implements Closeable { + + private final Path latchFile; + + private final WatchService watchService; + + private boolean released; + + public FileBasedOneShotLatch(final Path latchFile) { + this.latchFile = checkNotNull(latchFile); + + final Path parentDir = checkNotNull(latchFile.getParent(), "latchFile must have a parent"); + this.watchService = initWatchService(parentDir); + } + + private static WatchService initWatchService(final Path parentDir) { + final WatchService watchService = createWatchService(); + watchForLatchFile(watchService, parentDir); + return watchService; + } + + private static WatchService createWatchService() { + try { + return FileSystems.getDefault().newWatchService(); + } catch (IOException e) { + throw new RuntimeException(e); + } + } + + private static void watchForLatchFile(final WatchService watchService, final Path parentDir) { + try { + parentDir.register( + watchService, + new WatchEvent.Kind[]{StandardWatchEventKinds.ENTRY_CREATE}, + SensitivityWatchEventModifier.HIGH); + } catch (IOException e) { + throw new RuntimeException(e); + } + } + + /** +* Waits until the latch file is created. +* +* @throws InterruptedException if interrupted while waiting +*/ + public void await() throws InterruptedException { + if (isReleasedOrReleasable()) { + return; + } + + awaitLatchFile(watchService); + } + + private void awaitLatchFile(final WatchService watchService) throws InterruptedException { + while (true) { + WatchKey take = watchService.take(); + if (isReleasedOrReleasable()) { Review comment: > Prone to files being deleted in-between, but this seems unlikely. True, didn't think about that case. I think it's acceptable to leave it as is because: - we wait until the job is finished before deleting files - class is contained in this module - code is simpler (as you already mentioned) This is an automated message from the Apache Git Service. 
[GitHub] [flink] GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy URL: https://github.com/apache/flink/pull/9060#discussion_r305113537 ## File path: flink-end-to-end-tests/flink-dataset-fine-grained-recovery-test/src/main/java/org/apache/flink/batch/tests/util/FileBasedOneShotLatch.java ## @@ -0,0 +1,125 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.flink.batch.tests.util; + +import com.sun.nio.file.SensitivityWatchEventModifier; + +import javax.annotation.concurrent.NotThreadSafe; + +import java.io.Closeable; +import java.io.IOException; +import java.nio.file.Files; +import java.nio.file.Path; +import java.nio.file.StandardWatchEventKinds; +import java.nio.file.WatchEvent; +import java.nio.file.WatchKey; +import java.nio.file.WatchService; + +import static org.apache.flink.util.Preconditions.checkNotNull; + +/** + * A synchronization aid that allows a single thread to wait on the creation of a specified file. 
+ */ +@NotThreadSafe +public class FileBasedOneShotLatch implements Closeable { + + private final Path latchFile; + + private final WatchService watchService; + + private boolean released; + + public FileBasedOneShotLatch(final Path latchFile) { + this.latchFile = checkNotNull(latchFile); + + final Path parentDir = checkNotNull(latchFile.getParent(), "latchFile must have a parent"); + this.watchService = initWatchService(parentDir); + } + + private static WatchService initWatchService(final Path parentDir) { + final WatchService watchService = createWatchService(parentDir); + watchForLatchFile(watchService, parentDir); + return watchService; + } + + private static WatchService createWatchService(final Path parentDir) { + try { + return parentDir.getFileSystem().newWatchService(); + } catch (IOException e) { + throw new RuntimeException(e); + } + } + + private static void watchForLatchFile(final WatchService watchService, final Path parentDir) { + try { + parentDir.register( + watchService, + new WatchEvent.Kind[]{StandardWatchEventKinds.ENTRY_CREATE}, + SensitivityWatchEventModifier.HIGH); + } catch (IOException e) { + throw new RuntimeException(e); + } + } + + /** +* Waits until the latch file is created. +* +* @throws InterruptedException if interrupted while waiting +*/ + public void await() throws InterruptedException { + if (isReleasedOrReleasable()) { + return; + } + + awaitLatchFile(watchService); + } + + private void awaitLatchFile(final WatchService watchService) throws InterruptedException { + while (true) { + WatchKey take = watchService.take(); Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy URL: https://github.com/apache/flink/pull/9060#discussion_r305113144 ## File path: flink-end-to-end-tests/flink-dataset-fine-grained-recovery-test/src/main/java/org/apache/flink/batch/tests/util/FileBasedOneShotLatch.java ## @@ -0,0 +1,125 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.flink.batch.tests.util; + +import com.sun.nio.file.SensitivityWatchEventModifier; + +import javax.annotation.concurrent.NotThreadSafe; + +import java.io.Closeable; +import java.io.IOException; +import java.nio.file.Files; +import java.nio.file.Path; +import java.nio.file.StandardWatchEventKinds; +import java.nio.file.WatchEvent; +import java.nio.file.WatchKey; +import java.nio.file.WatchService; + +import static org.apache.flink.util.Preconditions.checkNotNull; + +/** + * A synchronization aid that allows a single thread to wait on the creation of a specified file. 
+ */ +@NotThreadSafe +public class FileBasedOneShotLatch implements Closeable { + + private final Path latchFile; + + private final WatchService watchService; + + private boolean released; + + public FileBasedOneShotLatch(final Path latchFile) { + this.latchFile = checkNotNull(latchFile); + + final Path parentDir = checkNotNull(latchFile.getParent(), "latchFile must have a parent"); + this.watchService = initWatchService(parentDir); + } + + private static WatchService initWatchService(final Path parentDir) { + final WatchService watchService = createWatchService(parentDir); + watchForLatchFile(watchService, parentDir); + return watchService; + } + + private static WatchService createWatchService(final Path parentDir) { + try { + return parentDir.getFileSystem().newWatchService(); + } catch (IOException e) { + throw new RuntimeException(e); + } + } + + private static void watchForLatchFile(final WatchService watchService, final Path parentDir) { + try { + parentDir.register( + watchService, + new WatchEvent.Kind[]{StandardWatchEventKinds.ENTRY_CREATE}, + SensitivityWatchEventModifier.HIGH); + } catch (IOException e) { + throw new RuntimeException(e); + } + } + + /** +* Waits until the latch file is created. +* +* @throws InterruptedException if interrupted while waiting +*/ + public void await() throws InterruptedException { + if (isReleasedOrReleasable()) { + return; + } + + awaitLatchFile(watchService); + } + + private void awaitLatchFile(final WatchService watchService) throws InterruptedException { + while (true) { + WatchKey take = watchService.take(); Review comment: rename to `watchKey` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
URL: https://github.com/apache/flink/pull/9060#discussion_r304980645

## File path: flink-end-to-end-tests/flink-dataset-fine-grained-recovery-test/pom.xml ##

@@ -0,0 +1,76 @@
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
+
+	<modelVersion>4.0.0</modelVersion>
+
+	<parent>
+		<groupId>org.apache.flink</groupId>
+		<artifactId>flink-end-to-end-tests</artifactId>
+		<version>1.9-SNAPSHOT</version>
+		<relativePath>..</relativePath>
+	</parent>
+
+	<artifactId>flink-dataset-fine-grained-recovery-test</artifactId>
+	<name>flink-dataset-fine-grained-recovery-test</name>
+	<packaging>jar</packaging>
+
+	<dependencies>
+		<dependency>
+			<groupId>org.apache.flink</groupId>
+			<artifactId>flink-java</artifactId>
+			<version>${project.version}</version>
+			<scope>provided</scope>
+		</dependency>
+
+		<dependency>
+			<groupId>junit</groupId>
+			<artifactId>junit</artifactId>

Review comment: Removed.
[GitHub] [flink] GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy URL: https://github.com/apache/flink/pull/9060#discussion_r304980030 ## File path: flink-end-to-end-tests/flink-dataset-fine-grained-recovery-test/src/test/java/org/apache/flink/batch/tests/util/FileBasedOneShotLatchTest.java ## @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.flink.batch.tests.util; + +import org.junit.Before; +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.TemporaryFolder; + +import java.io.File; +import java.util.concurrent.atomic.AtomicBoolean; + +import static org.junit.Assert.assertTrue; + +/** + * Tests for {@link FileBasedOneShotLatch}. + */ +public class FileBasedOneShotLatchTest { Review comment: Changed the surefire config. I find it awkward to have a dependency on `flink-test-utils` with `compile` scope since the job is strictly speaking not a test. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [flink] GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
URL: https://github.com/apache/flink/pull/9060#discussion_r304974598

## File path: flink-end-to-end-tests/flink-dataset-fine-grained-recovery-test/pom.xml ##

@@ -0,0 +1,76 @@
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
+
+	<modelVersion>4.0.0</modelVersion>
+
+	<parent>
+		<groupId>org.apache.flink</groupId>
+		<artifactId>flink-end-to-end-tests</artifactId>
+		<version>1.9-SNAPSHOT</version>
+		<relativePath>..</relativePath>
+	</parent>
+
+	<artifactId>flink-dataset-fine-grained-recovery-test</artifactId>
+	<name>flink-dataset-fine-grained-recovery-test</name>
+	<packaging>jar</packaging>
+
+	<dependencies>
+		<dependency>
+			<groupId>org.apache.flink</groupId>
+			<artifactId>flink-java</artifactId>
+			<version>${project.version}</version>
+			<scope>provided</scope>
+		</dependency>
+
+		<dependency>
+			<groupId>junit</groupId>
+			<artifactId>junit</artifactId>

Review comment: Actually, this dependency is not needed since it already comes from `flink-parent`.
[GitHub] [flink] GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy URL: https://github.com/apache/flink/pull/9060#discussion_r304932329 ## File path: flink-end-to-end-tests/flink-dataset-fine-grained-recovery-test/src/test/java/org/apache/flink/batch/tests/util/FileBasedOneShotLatchTest.java ## @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.flink.batch.tests.util; + +import org.junit.Before; +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.TemporaryFolder; + +import java.io.File; +import java.util.concurrent.atomic.AtomicBoolean; + +import static org.junit.Assert.assertTrue; + +/** + * Tests for {@link FileBasedOneShotLatch}. + */ +public class FileBasedOneShotLatchTest { + + @Rule + public TemporaryFolder temporaryFolder = new TemporaryFolder(); Review comment: accidentally used `--amend` when fixing This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy URL: https://github.com/apache/flink/pull/9060#discussion_r304931295 ## File path: flink-end-to-end-tests/flink-dataset-fine-grained-recovery-test/src/test/java/org/apache/flink/batch/tests/util/FileBasedOneShotLatchTest.java ## @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.flink.batch.tests.util; + +import org.junit.Before; +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.TemporaryFolder; + +import java.io.File; +import java.util.concurrent.atomic.AtomicBoolean; + +import static org.junit.Assert.assertTrue; + +/** + * Tests for {@link FileBasedOneShotLatch}. 
+ */ +public class FileBasedOneShotLatchTest { + + @Rule + public TemporaryFolder temporaryFolder = new TemporaryFolder(); + + private FileBasedOneShotLatch latch; + + private File latchFile; + + @Before + public void setUp() { + latchFile = new File(temporaryFolder.getRoot(), "latchFile"); + latch = new FileBasedOneShotLatch(latchFile.toPath()); + } + + @Test + public void awaitReturnsWhenFileIsCreated() throws Exception { + final AtomicBoolean awaitCompleted = new AtomicBoolean(); + final Thread thread = new Thread(() -> { + try { + latch.await(); + awaitCompleted.set(true); + } catch (InterruptedException e) { + Thread.currentThread().interrupt(); + } + }); + thread.start(); + + latchFile.createNewFile(); Review comment: Done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy URL: https://github.com/apache/flink/pull/9060#discussion_r304931209 ## File path: flink-end-to-end-tests/flink-dataset-fine-grained-recovery-test/src/main/java/org/apache/flink/batch/tests/util/FileBasedOneShotLatch.java ## @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.flink.batch.tests.util; + +import com.sun.nio.file.SensitivityWatchEventModifier; + +import javax.annotation.concurrent.NotThreadSafe; + +import java.io.Closeable; +import java.io.IOException; +import java.nio.file.FileSystems; +import java.nio.file.Files; +import java.nio.file.Path; +import java.nio.file.StandardWatchEventKinds; +import java.nio.file.WatchEvent; +import java.nio.file.WatchKey; +import java.nio.file.WatchService; + +import static org.apache.flink.util.Preconditions.checkNotNull; + +/** + * A synchronization aid that allows a single thread to wait on the creation of a specified file. 
+ */ +@NotThreadSafe +public class FileBasedOneShotLatch implements Closeable { + + private final Path latchFile; + + private final WatchService watchService; + + private boolean released; + + public FileBasedOneShotLatch(final Path latchFile) { + this.latchFile = checkNotNull(latchFile); + + final Path parentDir = checkNotNull(latchFile.getParent(), "latchFile must have a parent"); + this.watchService = initWatchService(parentDir); + } + + private static WatchService initWatchService(final Path parentDir) { + final WatchService watchService = createWatchService(); + watchForLatchFile(watchService, parentDir); + return watchService; + } + + private static WatchService createWatchService() { + try { + return FileSystems.getDefault().newWatchService(); Review comment: good suggestion, done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy URL: https://github.com/apache/flink/pull/9060#discussion_r304931093 ## File path: flink-end-to-end-tests/flink-dataset-fine-grained-recovery-test/src/test/java/org/apache/flink/batch/tests/util/FileBasedOneShotLatchTest.java ## @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.flink.batch.tests.util; + +import org.junit.Before; +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.TemporaryFolder; + +import java.io.File; +import java.util.concurrent.atomic.AtomicBoolean; + +import static org.junit.Assert.assertTrue; + +/** + * Tests for {@link FileBasedOneShotLatch}. + */ +public class FileBasedOneShotLatchTest { + + @Rule + public TemporaryFolder temporaryFolder = new TemporaryFolder(); Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy URL: https://github.com/apache/flink/pull/9060#discussion_r304930823 ## File path: flink-end-to-end-tests/flink-dataset-fine-grained-recovery-test/src/test/java/org/apache/flink/batch/tests/util/FileBasedOneShotLatchTest.java ## @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.flink.batch.tests.util; + +import org.junit.Before; +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.TemporaryFolder; + +import java.io.File; +import java.util.concurrent.atomic.AtomicBoolean; + +import static org.junit.Assert.assertTrue; + +/** + * Tests for {@link FileBasedOneShotLatch}. 
+ */ +public class FileBasedOneShotLatchTest { + + @Rule + public TemporaryFolder temporaryFolder = new TemporaryFolder(); + + private FileBasedOneShotLatch latch; + + private File latchFile; + + @Before + public void setUp() { + latchFile = new File(temporaryFolder.getRoot(), "latchFile"); + latch = new FileBasedOneShotLatch(latchFile.toPath()); + } + + @Test + public void awaitReturnsWhenFileIsCreated() throws Exception { + final AtomicBoolean awaitCompleted = new AtomicBoolean(); + final Thread thread = new Thread(() -> { + try { + latch.await(); + awaitCompleted.set(true); + } catch (InterruptedException e) { + Thread.currentThread().interrupt(); + } + }); + thread.start(); + + latchFile.createNewFile(); Review comment: I will add a new test case. The latch should not block. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
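A minimal sketch of the additional case mentioned above, i.e. that `await()` does not block when the latch file already exists before it is called. `PollingOneShotLatch` is a hypothetical stand-in that polls the file system; it is not the `FileBasedOneShotLatch` from this pull request, and the test added to the PR may look different.

```java
import java.nio.file.Files;
import java.nio.file.Path;

final class PreexistingLatchFileSketch {

    /** Hypothetical stand-in latch: polls for the file instead of using a WatchService. */
    static final class PollingOneShotLatch {
        private final Path latchFile;
        private boolean released;

        PollingOneShotLatch(Path latchFile) {
            this.latchFile = latchFile;
        }

        void await() throws InterruptedException {
            while (!released) {
                if (Files.exists(latchFile)) {
                    // One-shot: once released, the latch stays released.
                    released = true;
                } else {
                    Thread.sleep(10);
                }
            }
        }
    }

    /**
     * Creates the latch file first, then calls await() on a separate thread.
     * Returns true if await() completed within timeoutMillis (must be > 0).
     */
    static boolean awaitReturnsForExistingFile(long timeoutMillis) throws Exception {
        Path dir = Files.createTempDirectory("latch-sketch");
        Path latchFile = Files.createFile(dir.resolve("latchFile"));
        PollingOneShotLatch latch = new PollingOneShotLatch(latchFile);

        Thread waiter = new Thread(() -> {
            try {
                latch.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        waiter.start();
        waiter.join(timeoutMillis);

        boolean completed = !waiter.isAlive();
        waiter.interrupt(); // stop the waiter if it is still blocked
        return completed;
    }
}
```

Joining with a timeout keeps the check from hanging if the latch does block, which a plain `thread.join()` would not.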
[GitHub] [flink] GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy URL: https://github.com/apache/flink/pull/9060#discussion_r304926563 ## File path: flink-end-to-end-tests/flink-dataset-fine-grained-recovery-test/src/main/java/org/apache/flink/batch/tests/util/FileBasedOneShotLatch.java ## @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.flink.batch.tests.util; + +import com.sun.nio.file.SensitivityWatchEventModifier; + +import javax.annotation.concurrent.NotThreadSafe; + +import java.io.Closeable; +import java.io.IOException; +import java.nio.file.FileSystems; +import java.nio.file.Files; +import java.nio.file.Path; +import java.nio.file.StandardWatchEventKinds; +import java.nio.file.WatchEvent; +import java.nio.file.WatchKey; +import java.nio.file.WatchService; + +import static org.apache.flink.util.Preconditions.checkNotNull; + +/** + * A synchronization aid that allows a single thread to wait on the creation of a specified file. 
+ */ +@NotThreadSafe +public class FileBasedOneShotLatch implements Closeable { + + private final Path latchFile; + + private final WatchService watchService; + + private boolean released; + + public FileBasedOneShotLatch(final Path latchFile) { + this.latchFile = checkNotNull(latchFile); + + final Path parentDir = checkNotNull(latchFile.getParent(), "latchFile must have a parent"); + this.watchService = initWatchService(parentDir); + } + + private static WatchService initWatchService(final Path parentDir) { + final WatchService watchService = createWatchService(); + watchForLatchFile(watchService, parentDir); + return watchService; + } + + private static WatchService createWatchService() { + try { + return FileSystems.getDefault().newWatchService(); + } catch (IOException e) { + throw new RuntimeException(e); + } + } + + private static void watchForLatchFile(final WatchService watchService, final Path parentDir) { + try { + parentDir.register( + watchService, + new WatchEvent.Kind[]{StandardWatchEventKinds.ENTRY_CREATE}, + SensitivityWatchEventModifier.HIGH); + } catch (IOException e) { + throw new RuntimeException(e); + } + } + + /** +* Waits until the latch file is created. +* +* @throws InterruptedException if interrupted while waiting +*/ + public void await() throws InterruptedException { + if (isReleasedOrReleasable()) { + return; + } + + awaitLatchFile(watchService); + } + + private void awaitLatchFile(final WatchService watchService) throws InterruptedException { + while (true) { + WatchKey take = watchService.take(); + if (isReleasedOrReleasable()) { Review comment: It could be that other files are created in the directory. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
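Because the watch is registered on the parent directory, any file created there wakes up the waiting thread, so the existence of the latch file has to be re-checked after every wake-up. The following is a simplified sketch of that loop using only the standard `java.nio.file` API; the class and method names are hypothetical, and the JDK-specific `SensitivityWatchEventModifier` from the reviewed code is deliberately left out.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

final class LatchFileWaitSketch {

    /** Blocks until latchFile exists, ignoring other files created in the same directory. */
    static void awaitFile(Path latchFile) throws IOException, InterruptedException {
        Path parentDir = latchFile.toAbsolutePath().getParent();
        if (parentDir == null) {
            throw new IllegalArgumentException("latchFile must have a parent");
        }

        try (WatchService watchService = FileSystems.getDefault().newWatchService()) {
            parentDir.register(watchService, StandardWatchEventKinds.ENTRY_CREATE);

            // The check happens after registration, so a file created before or
            // during registration is still noticed without waiting for an event.
            while (!Files.exists(latchFile)) {
                // Wakes up for ANY file created in parentDir, not only latchFile.
                WatchKey key = watchService.take();
                key.pollEvents(); // drain the events; the loop condition decides
                if (!key.reset()) {
                    throw new IOException("Watched directory is no longer accessible: " + parentDir);
                }
            }
        }
    }
}
```

Checking `Files.exists` in the loop condition, rather than inspecting the event's file name, also covers the case where the file was created before `await` was called.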
[GitHub] [flink] GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy URL: https://github.com/apache/flink/pull/9060#discussion_r304509653 ## File path: flink-end-to-end-tests/test-scripts/test_ha_dataset.sh ## @@ -53,20 +52,51 @@ function run_ha_test() { wait_job_running ${JOB_ID} -# start the watchdog that keeps the number of JMs stable -start_ha_jm_watchdog 1 "StandaloneSessionClusterEntrypoint" start_jm_cmd "8081" - +local c for (( c=0; c<${JM_KILLS}; c++ )); do # kill the JM and wait for watchdog to # create a new one which will take over kill_single 'StandaloneSessionClusterEntrypoint' wait_job_running ${JOB_ID} done -cancel_job ${JOB_ID} +for (( c=0; c<${TM_KILLS}; c++ )); do +sleep $(( ( RANDOM % 10 ) + 1 )) +kill_and_replace_random_task_manager +wait_job_running ${JOB_ID} +done + +wait_job_terminal_state ${JOB_ID} "FINISHED" Review comment: I added a new job that blocks on an external condition. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy URL: https://github.com/apache/flink/pull/9060#discussion_r304509103 ## File path: flink-end-to-end-tests/flink-dataset-fine-grained-recovery-test/src/test/java/org/apache/flink/batch/tests/util/FileBasedOneShotLatchTest.java ## @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.flink.batch.tests.util; + +import org.junit.Before; +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.TemporaryFolder; + +import java.io.File; +import java.util.concurrent.atomic.AtomicBoolean; + +import static org.junit.Assert.assertTrue; + +/** + * Tests for {@link FileBasedOneShotLatch}. + */ +public class FileBasedOneShotLatchTest { Review comment: Test is not run due to surefire config in the `flink-end-to-end-tests` module. I don't have a solution. Suggestions welcome. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [flink] GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy URL: https://github.com/apache/flink/pull/9060#discussion_r302172084 ## File path: flink-end-to-end-tests/test-scripts/test_ha_dataset.sh ## @@ -53,20 +52,51 @@ function run_ha_test() { wait_job_running ${JOB_ID} -# start the watchdog that keeps the number of JMs stable -start_ha_jm_watchdog 1 "StandaloneSessionClusterEntrypoint" start_jm_cmd "8081" - +local c for (( c=0; c<${JM_KILLS}; c++ )); do # kill the JM and wait for watchdog to # create a new one which will take over kill_single 'StandaloneSessionClusterEntrypoint' wait_job_running ${JOB_ID} done -cancel_job ${JOB_ID} +for (( c=0; c<${TM_KILLS}; c++ )); do +sleep $(( ( RANDOM % 10 ) + 1 )) +kill_and_replace_random_task_manager +wait_job_running ${JOB_ID} +done + +wait_job_terminal_state ${JOB_ID} "FINISHED"

Review comment:

These are valid concerns.

> How much longer does the test now run for?

The test runs 4.5-5 minutes on my machine. It takes around 2 minutes to complete the batch job after the last injected fault (time determined using unscientific methods). The test in its current form is rather similar to `test_batch_allround.sh` so there is a chance that these can be merged.

> I like neither option, do admit though that this would make it very difficult (or even impossible) to verify the correctness of the output.

I don't see a good solution yet. Here are some options:

1. Make job block on external signals (files), and make job smaller (smaller dataset)
2. Leave it as before, i.e., don't verify correctness of the output (but use infinite data source)

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy URL: https://github.com/apache/flink/pull/9060#discussion_r302167993 ## File path: flink-end-to-end-tests/test-scripts/test_ha_dataset.sh ## @@ -53,20 +50,51 @@ function run_ha_test() { wait_job_running ${JOB_ID} -# start the watchdog that keeps the number of JMs stable -start_ha_jm_watchdog 1 "StandaloneSessionClusterEntrypoint" start_jm_cmd "8081" - +local c for (( c=0; c<${JM_KILLS}; c++ )); do # kill the JM and wait for watchdog to # create a new one which will take over kill_single 'StandaloneSessionClusterEntrypoint' wait_job_running ${JOB_ID} done -cancel_job ${JOB_ID} +for (( c=0; c<${TM_KILLS}; c++ )); do +sleep $(( ( RANDOM % 10 ) + 1 )) +kill_and_replace_random_task_manager +wait_job_running ${JOB_ID} Review comment: `wait_job_running` can be omitted. In fact, it only asserts that the job appears in the output of `flink list`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy URL: https://github.com/apache/flink/pull/9060#discussion_r302055257 ## File path: flink-end-to-end-tests/test-scripts/test_ha_dataset.sh ## @@ -53,20 +52,51 @@ function run_ha_test() { wait_job_running ${JOB_ID} -# start the watchdog that keeps the number of JMs stable -start_ha_jm_watchdog 1 "StandaloneSessionClusterEntrypoint" start_jm_cmd "8081" - +local c for (( c=0; c<${JM_KILLS}; c++ )); do # kill the JM and wait for watchdog to # create a new one which will take over kill_single 'StandaloneSessionClusterEntrypoint' wait_job_running ${JOB_ID} done -cancel_job ${JOB_ID} +for (( c=0; c<${TM_KILLS}; c++ )); do +sleep $(( ( RANDOM % 10 ) + 1 )) +kill_and_replace_random_task_manager +wait_job_running ${JOB_ID} Review comment: `wait_job_running` will terminate the script if the job does not become running within a timeout (10s). Since we are not launching a new process by invoking the function, the main script will exit. Am I missing something? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services