Hi all,
We're working on a project where we're limited to one big development
machine for now. We want to start developing data processing pipelines in
Python, which should eventually be ported to a currently unknown setup on a
separate cluster or cloud, so we went with Beam for its portability.
For the development setup, we wanted to have the least amount of overhead
possible, so we deployed a one node flink cluster with docker-compose. The
whole setup is defined by the following docker-compose.yml:
```
version: "2.1"
services:
flink-jobmanager:
image: flink:1.9
network_mode: host
command: jobmanager
environment:
- JOB_MANAGER_RPC_ADDRESS=localhost
flink-taskmanager:
image: flink:1.9
network_mode: host
depends_on:
- flink-jobmanager
command: taskmanager
environment:
- JOB_MANAGER_RPC_ADDRESS=localhost
volumes:
- staging-dir:/tmp/beam-artifact-staging
- /usr/bin/docker:/usr/bin/docker
- /var/run/docker.sock:/var/run/docker.sock
user: flink:${DOCKER_GID}
beam-jobserver:
image: apache/beam_flink1.9_job_server:2.20.0
network_mode: host
command: --flink-master=localhost:8081
volumes:
- staging-dir:/tmp/beam-artifact-staging
volumes:
staging-dir:
```
We can submit and run pipelines with the following options:
```
'runner': 'PortableRunner',
'job_endpoint': 'localhost:8099',
```
The environment type for the SDK Harness is configured to the default
'docker'.
However, we cannot write output files to the host system. To fix this,
I tried to mount a host directory to the Beam SDK Container (I had to
rebuild the Beam Job Server jar and image to do this). This seems to have
worked, as the output file is created on the host system. However the
pipeline silently fails, and the output file remains empty. Running the
pipeline with DirectRunner confirms that the pipeline is working.
Looking at the output logs, the following error is thrown in the Flink Task
Manager:
flink-taskmanager_1 | java.lang.NoClassDefFoundError:
org/apache/beam/vendor/grpc/v1p26p0/io/netty/buffer/PoolArena$1
I don't know if this is a result of me rebuilding the Job Server, or caused
by another issue.
We currently do not have a distributed file system available. Is there any
way to make writing to the host system possible?
Kind regards,
Robbe
[image: https://ml6.eu] <https://ml6.eu>
Robbe Sneyders
ML6 Gent
<https://www.google.be/maps/place/ML6/@51.037408,3.7044893,17z/data=!3m1!4b1!4m5!3m4!1s0x47c37161feeca14b:0xb8f72585fdd21c90!8m2!3d51.037408!4d3.706678?hl=nl>
M: +32 474 71 31 08