Hi,

The error is pasted below my procedure.


*### My Procedure for Beam on "a long-running Local Flink Cluster":*

Beam WordCount Quickstart: https://beam.apache.org/get-
started/quickstart-java/

Run:
> cd /user/me/beam
> mvn archetype:generate \
>     -DarchetypeGroupId=org.apache.beam \
>     -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
>     -DarchetypeVersion=2.0.0 \
>     -DgroupId=org.example \
>     -DartifactId=word-count-beam \
>     -Dversion="0.1" \
>     -Dpackage=org.apache.beam.examples \
>     -DinteractiveMode=false

Beam on FlinkRunner Guide: https://beam.apache.org/
documentation/runners/flink/

Navigate into word-count-beam and identify the appropriate Flink version to
be 1.2.1:
> cd word-count-beam
> mvn dependency:tree -Pflink-runner | grep flink

Local Flink Cluster Quickstart Guide: https://ci.apache.org/
projects/flink/flink-docs-release-1.2/quickstart/setup_quickstart.html

Follow the Local Flink Cluster Quickstart Guide. The "Apache Flink Web
Dashboard" opens in a browser showing jobs successfully running and
completed as I submit them. I keep this running.

Back in /user/me/beam/word-count-beam/, run:
> mvn package exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount
\
>     -Dexec.args=" \
>         --runner=FlinkRunner \
>         --flinkMaster=localhost:6123 \
>         --filesToStage=target/word-count-beam-0.1.jar \
>         --inputFile=/user/me/beam/word-count-beam/pom.xml \
>         --output=/user/me/beam/word-count-beam/output_01" \
>     -Pflink-runner

The flinkMaster host:port is identified in the JobManager tab of the Apache
Flink Web Dashboard. Note that the Beam guide says to use
"--filesToStage=target/word-count-beam-bundled-0.1.jar", but Maven actually
only builds "target/word-count-beam-0.1.jar".

The above command runs until it reaches the errors pasted below. The job
never makes it onto the Apache Flink Web Dashboard, and no output is
produced.

Note that the following command (under the "Flink-local" tab on the Beam
Quickstart Guide) works fine, but it starts it's own instance of a Local
Flink Cluster. The job never makes it onto the Apache Flink Web Dashboard
of my long-standing Local Flink Cluster I set up above. This makes sense,
because it doesn't use "-m" to connect to the long-running JobManager.
> mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount
\
>     -Dexec.args=" \
>          --runner=FlinkRunner \
>          --inputFile=pom.xml \
>          --output=counts" \
>     -Pflink-runner


*### My Procedure for Beam on "a long-running Flink Cluster on YARN":*

Flink on YARN Setup: https://ci.apache.org/projects/flink/flink-docs-
release-1.2/setup/yarn_setup.html

Same as procedure above, except I am on a YARN cluster with HDFS and
ZooKeeper, etc.

As before, I have the right Flink version, the Apache Flink Web Dashboard
works, I run the org.apache.beam.examples.WordCount command as above with
the appropriate flinkMaster host:port as identified from the JobManager tab
of the Apache Flink Web Dashboard.

The command runs until it reaches the errors pasted below. The job never
makes it onto the Apache Flink Web Dashboard, and no output is produced.


*### Both Procedures:*

In both procedures, I tested the accompanying DirectRunner and
"Flink-local" commands provided in the Beam Quickstart Guide work fine. It
is only when I attempt to run a job on a long-running Local Flink Cluster
or long-running Flink Cluster on YARN that the below issues occur.


*### The Error:*
...
Jun 15, 2017 9:20:02 AM org.apache.beam.runners.flink.FlinkRunner run
INFO: Starting execution of Flink program.
...
INFO: Starting remoting
Jun 15, 2017 9:20:03 AM akka.event.slf4j.Slf4jLogger$$
anonfun$receive$1$$anonfun$applyOrElse$3 apply$mcV$sp
INFO: Remoting started; listening on addresses :[akka.tcp://
[email protected]:54735]
Jun 15, 2017 9:20:03 AM org.apache.flink.runtime.client.JobClientActor
handleMessage
INFO: Received SubmitJobAndWait(JobGraph(jobId:
18391da12bb10e38134be676e2bc1002)) but there is no connection to a
JobManager yet.
Jun 15, 2017 9:20:03 AM
org.apache.flink.runtime.client.JobSubmissionClientActor
handleCustomMessage
INFO: Received job wordcount-chris0hebert-0615142002-7d71a380 (
18391da12bb10e38134be676e2bc1002).
Jun 15, 2017 9:20:03 AM org.apache.flink.runtime.client.JobClientActor
disconnectFromJobManager
INFO: Disconnect from JobManager null.
Jun 15, 2017 9:20:03 AM akka.event.slf4j.Slf4jLogger$$
anonfun$receive$1$$anonfun$applyOrElse$2 apply$mcV$sp

[**************** THIS LOOKS IMPORTANT ******************]

WARNING: Association with remote system [akka.tcp://flink@localhost:6123]
has failed, address is now gated for [5000] ms. Reason: [Disassociated]
Jun 15, 2017 9:21:03 AM org.apache.flink.runtime.client.JobClientActor
terminate
...
Jun 15, 2017 9:21:03 AM org.apache.beam.runners.flink.FlinkRunner run
SEVERE: Pipeline execution failed
org.apache.flink.client.program.ProgramInvocationException: The program
execution failed: Couldn't retrieve the JobExecutionResult from the
JobManager.
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
...
Caused by: 
org.apache.flink.runtime.client.JobClientActorConnectionTimeoutException:
Lost connection to the JobManager.
at org.apache.flink.runtime.client.JobClientActor.
handleMessage(JobClientActor.java:207)
...
[WARNING]
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
...
Caused by: org.apache.flink.client.program.ProgramInvocationException: The
program execution failed: Couldn't retrieve the JobExecutionResult from the
JobManager.
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
...
Caused by: org.apache.flink.runtime.client.JobExecutionException: Couldn't
retrieve the JobExecutionResult from the JobManager.
at org.apache.flink.runtime.client.JobClient.awaitJobResult(JobClient.java:
294)
...
Caused by: 
org.apache.flink.runtime.client.JobClientActorConnectionTimeoutException:
Lost connection to the JobManager.
at org.apache.flink.runtime.client.JobClientActor.
handleMessage(JobClientActor.java:207)
...
[INFO] ------------------------------------------------------------
------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------
------------
...
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.4.0:java
(default-cli) on project word-count-beam: An exception occured while
executing the Java class. null: InvocationTargetException: Pipeline
execution failed: The program execution failed: Couldn't retrieve the
JobExecutionResult from the JobManager. Lost connection to the JobManager.
-> [Help 1]
...

The same error occurs when I run the the Beam WordCount on the Flink
YARN-Cluster, except obviously my JobManager's address and port is
different when mentioned in the "WARNING".


*### The Ask:*

What am I missing?

Reply via email to