Hi Aljoscha,

On Thu, Jul 20, 2017 at 10:31 AM, Aljoscha Krettek <aljos...@apache.org>
 wrote:

> Just for testing, could you try and place your word-count-beam-0.1.jar
> directly in the Flink lib folder and don't specify it with --filesToStage?
>


I did that and it solved the problem.
Let me elaborate.

First of all, to get the WordCount running in combination with HDFS I needed
to change a few things:

I added this to the shade plugin configuration (pom.xml) to avoid needing to
specify the main class:

<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
  <mainClass>org.apache.beam.examples.WordCount</mainClass>
</transformer>

and these dependencies to get access to HDFS:

<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-io-hadoop-file-system</artifactId>
    <version>${beam.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.8.0</version>
</dependency>

and in WordCount.java I changed the options interface to this:

       public interface WordCountOptions extends HadoopFileSystemOptions {

and added this to the main method:

       options.setHdfsConfiguration(Collections.singletonList(new Configuration()));
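
For reference, a minimal sketch of how those two changes fit together (assuming
the standard WordCount layout from the Beam examples; the pipeline body itself
is untouched):

import java.util.Collections;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

public class WordCount {

  // The options interface now extends HadoopFileSystemOptions instead of PipelineOptions
  public interface WordCountOptions extends HadoopFileSystemOptions {
    // ... the existing --inputFile and --output getters/setters stay as they were ...
  }

  public static void main(String[] args) {
    WordCountOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(WordCountOptions.class);

    // Default Hadoop Configuration; it picks up core-site.xml / hdfs-site.xml from the classpath
    options.setHdfsConfiguration(Collections.singletonList(new Configuration()));

    Pipeline p = Pipeline.create(options);
    // ... rest of the WordCount pipeline unchanged ...
    p.run().waitUntilFinish();
  }
}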

after a

              mvn clean package -Pflink-runner
              cp target/word-count-beam-0.1.jar flink-1.2.1/lib/

I ran

./flink-1.2.1/bin/flink run \
    -m yarn-cluster                                         \
    --yarncontainer                 1                       \
    --yarnslots                     4                       \
    --yarnjobManagerMemory          2000                    \
    --yarntaskManagerMemory         2000                    \
    --yarnname "Word count on Beam on Flink on Yarn"        \
    ./flink-1.2.1/lib/word-count-beam-0.1.jar               \
    --runner=FlinkRunner                                    \
    --inputFile=hdfs:///user/nbasjes/helloworld.txt         \
    --output=hdfs:///user/nbasjes/hello-counts


Which succeeded.

Using this I had enough hints to start digging.

I found that my setup is:
1) HDP 2.3.4.0-3485 (which I know is 'not that new')
2) Everything is configured correctly, including these environment variables:
HADOOP_CONF_DIR=/etc/hadoop/conf/
HBASE_CONF_DIR=/etc/hbase/conf/
HIVE_CONF_DIR=/etc/hive/conf/
YARN_CONF_DIR=/etc/hadoop/conf/

3) Because HBASE_CONF_DIR has been defined (and it should be, because I need
HBase), the flink script includes the output of 'hbase classpath' (side
note: I wrote that for Flink a long time ago).
4) As a result, the classpath of the application now contains
     /usr/hdp/2.3.4.0-3485/hbase/lib/jruby-complete-1.6.8.jar
which bundles an old copy of the Joda-Time classes:
     2313  09-14-2010 00:05   org/joda/time/Duration.class
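
As a side note (purely an illustrative snippet, not part of the actual job, and
the class name is made up): a quick way to spot this kind of conflict is to print
which jar a suspect class was actually loaded from:

public class WhichJodaJar {
  public static void main(String[] args) {
    // Prints the location (jar) that org.joda.time.Duration was loaded from
    System.out.println(
        org.joda.time.Duration.class
            .getProtectionDomain()
            .getCodeSource()
            .getLocation());
  }
}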



So I tried to put 'just' a newer joda version in the lib directory and keep
the application itself in a 'normal' location.

rm flink-1.2.1/lib/word-count-beam-0.1.jar
cp /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/joda-time-2.9.1.jar
 flink-1.2.1/lib/

hdfs dfs -rm hello-counts*


./flink-1.2.1/bin/flink run \
    -m yarn-cluster                                         \
    --yarncontainer                 1                       \
    --yarnslots                     4                       \
    --yarnjobManagerMemory          2000                    \
    --yarntaskManagerMemory         2000                    \
    --yarnname "Word count on Beam on Flink on Yarn"        \
    ./target/word-count-beam-0.1.jar               \
    --runner=FlinkRunner                                    \
    --inputFile=hdfs:///user/nbasjes/helloworld.txt         \
    --output=hdfs:///user/nbasjes/hello-counts

hdfs dfs -cat hello-counts*


And this works too!

Thanks for the suggestion!

Next step: getting both Kafka and HBase connections running ... :)

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
