Re: [PR] [Flink 34569][e2e] fail fast if AWS cli container fails to start [flink]

2024-03-17 Thread via GitHub


robobario commented on code in PR #24491:
URL: https://github.com/apache/flink/pull/24491#discussion_r1527607029


##
flink-end-to-end-tests/test-scripts/common_s3_operations.sh:
##
@@ -58,7 +64,11 @@ function aws_cli_stop() {
 if [[ $AWSCLI_CONTAINER_ID ]]; then
   aws_cli_stop
 fi
-aws_cli_start
+aws_cli_start || aws_cli_start

Review Comment:
   I've added in a failsafe to kill/remove if `CONTAINER_ID` is non-empty after 
the docker run fails.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [Flink 34569][e2e] fail fast if AWS cli container fails to start [flink]

2024-03-14 Thread via GitHub


robobario commented on code in PR #24491:
URL: https://github.com/apache/flink/pull/24491#discussion_r1525716429


##
flink-end-to-end-tests/test-scripts/common_s3_operations.sh:
##
@@ -58,7 +64,11 @@ function aws_cli_stop() {
 if [[ $AWSCLI_CONTAINER_ID ]]; then
   aws_cli_stop
 fi
-aws_cli_start
+aws_cli_start || aws_cli_start

Review Comment:
   I'm assuming that `docker run -d ...` with detach did not create a container 
if the exit code is non-zero. So failed `aws_cli_start` shouldn't leave 
something to be cleaned. But I can't find docs stating that.
   
   Happy to remove the retry, getting the test to fail faster with a more 
obvious cause is an improvement.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [Flink 34569][e2e] fail fast if AWS cli container fails to start [flink]

2024-03-14 Thread via GitHub


JingGe commented on code in PR #24491:
URL: https://github.com/apache/flink/pull/24491#discussion_r1525296378


##
flink-end-to-end-tests/test-scripts/common_s3_operations.sh:
##
@@ -58,7 +64,11 @@ function aws_cli_stop() {
 if [[ $AWSCLI_CONTAINER_ID ]]; then
   aws_cli_stop
 fi
-aws_cli_start
+aws_cli_start || aws_cli_start

Review Comment:
   Will simple retry like this let the script be executed multiple times 
without proper cleanup of previous container, given the aws_cli_start is not 
Idempotent, afaiu.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [Flink 34569][e2e] fail fast if AWS cli container fails to start [flink]

2024-03-14 Thread via GitHub


JingGe commented on code in PR #24491:
URL: https://github.com/apache/flink/pull/24491#discussion_r1525296378


##
flink-end-to-end-tests/test-scripts/common_s3_operations.sh:
##
@@ -58,7 +64,11 @@ function aws_cli_stop() {
 if [[ $AWSCLI_CONTAINER_ID ]]; then
   aws_cli_stop
 fi
-aws_cli_start
+aws_cli_start || aws_cli_start

Review Comment:
   Will simple retry like this let the script be executed multiple times 
without proper cleanup of previous containers, given the aws_cli_start is not 
Idempotent, afaiu.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [Flink 34569][e2e] fail fast if AWS cli container fails to start [flink]

2024-03-13 Thread via GitHub


flinkbot commented on PR #24491:
URL: https://github.com/apache/flink/pull/24491#issuecomment-1996064367

   
   ## CI report:
   
   * 119cd86a12b4d65ed43000f58df4a423f9e2aae7 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [Flink 34569][e2e] fail fast if AWS cli container fails to start [flink]

2024-03-13 Thread via GitHub


robobario commented on code in PR #24491:
URL: https://github.com/apache/flink/pull/24491#discussion_r1524015981


##
flink-end-to-end-tests/test-scripts/common_s3_operations.sh:
##
@@ -29,12 +29,18 @@
 #   AWSCLI_CONTAINER_ID
 ###
 function aws_cli_start() {
-  export AWSCLI_CONTAINER_ID=$(docker run -d \

Review Comment:
   see https://www.shellcheck.net/wiki/SC2155



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [Flink 34569][e2e] fail fast if AWS cli container fails to start [flink]

2024-03-13 Thread via GitHub


robobario opened a new pull request, #24491:
URL: https://github.com/apache/flink/pull/24491

   ## What is the purpose of the change
   
   This pull request aims to make end-to-end test scripts that source 
`common_s3_operations.sh` fail fast if the aws cli container fails to start. It 
also adds a single naive retry aiming to recover from a transient network 
failure.
   
   [FLINK-34569](https://issues.apache.org/jira/browse/FLINK-34569) describes 
an issue where an end-to-end test run took 15 minutes to fail after the aws cli 
container failed to start. From the test logs:
   
   ```
   2024-03-02T04:10:55.5496990Z Unable to find image 'banst/awscli:latest' 
locally 2024-03-02T04:10:56.3857380Z docker: Error response from daemon: Head 
"https://registry-1.docker.io/v2/banst/awscli/manifests/latest": read tcp 
10.1.0.97:33016->54.236.113.205:443: read: connection reset by peer. 
2024-03-02T04:10:56.3857877Z See 'docker run --help'. 
2024-03-02T04:10:56.4586492Z Error: No such object:
   ```
   
   This failure isn't handled and so later we were stuck in a loop trying to 
docker exec commands like `docker exec -t "" command`.
   
   To test it locally I've been provoking docker run failures by changing the 
image name to something non-existent.
   
   ## Brief change log
   
 - *Fail fast if aws cli container fails to run*
 - *Add naive retry when creating aws cli container*
 - *Add --rm to jq docker run commands to remove them on exit*
   
   ## Verifying this change
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   I verified that it fails fast by modifying the awscli image to have a 
non-existant name, to provoke a `docker run` failure, causing it to fail like:
   
   ```
   
==
   Running 'test-scripts/test_file_sink.sh s3 StreamingFileSink 
skip_check_exceptions'
   
==
   TEST_DATA_DIR: 
/home/roby/development/redhat-managed-kafka/upstream/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-53909550201
   Flink dist directory: 
/home/roby/development/redhat-managed-kafka/upstream/flink/flink-dist/target/flink-1.20-SNAPSHOT-bin/flink-1.20-SNAPSHOT
   Found AWS bucket robeyoun-testing-flink-13-03-2024, running the e2e test.
   Found AWS access key, running the e2e test.
   Found AWS secret key, running the e2e test.
   Unable to find image 'banstz/awscli:latest' locally
   docker: Error response from daemon: pull access denied for banstz/awscli, 
repository does not exist or may require 'docker login': denied: requested 
access to the resource is denied.
   See 'docker run --help'.
   running aws cli container failed
   Unable to find image 'banstz/awscli:latest' locally
   docker: Error response from daemon: pull access denied for banstz/awscli, 
repository does not exist or may require 'docker login': denied: requested 
access to the resource is denied.
   See 'docker run --help'.
   running aws cli container failed
   running the aws cli container failed
   [FAIL] Test script contains errors.
   Checking for errors...
   No errors in log files.
   Checking for exceptions...
   No exceptions in log files.
   Checking for non-empty .out files...
   grep: 
/home/roby/development/redhat-managed-kafka/upstream/flink/build-target/log/*.out:
 No such file or directory
   No non-empty .out files.
   
   [FAIL] 'test-scripts/test_file_sink.sh s3 StreamingFileSink 
skip_check_exceptions' failed after 0 minutes and 6 seconds! Test exited with 
exit code 1
   ```
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): no
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
 - The serializers:no
 - The runtime per-record code paths (performance sensitive): no
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
 - The S3 file system connector: no
   
   ## Documentation
   
 - Does this pull request introduce a new feature? no
 - If yes, how is the feature documented? not applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org