Hello!

*tl;dr*: settings in `env.java.opts` seem to stop taking effect when a job
is canceled or fails and is then restarted (with or without a
savepoint/checkpoint). If I restart the task managers, `env.java.opts`
seems to take effect again and our job runs without failure. More below.

We consume Snappy-compressed sequence files in our Flink job. This
requires access to the Hadoop native libraries. In our `flink-conf.yaml`
for both the task managers and the job manager, we put:

```
env.java.opts: -Djava.library.path=/usr/local/hadoop/lib/native
```
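
As a sanity check, here is a rough sketch of how one could confirm, from
inside the task manager JVM, whether that path actually made it into the
process (`NativeLibCheck` is just an illustrative name, not our actual
code):

```java
import org.apache.hadoop.util.NativeCodeLoader;

// Illustrative sketch (not our actual code): log what this JVM sees.
public class NativeLibCheck {
    public static void logNativeLibState() {
        // java.library.path is fixed at JVM startup, so this shows whether
        // env.java.opts actually reached the task manager process.
        System.out.println("java.library.path = "
                + System.getProperty("java.library.path"));

        boolean loaded = NativeCodeLoader.isNativeCodeLoaded();
        System.out.println("libhadoop loaded = " + loaded);

        // buildSupportsSnappy() is itself a native method, so only call it
        // once libhadoop has loaded.
        if (loaded) {
            System.out.println("snappy supported = "
                    + NativeCodeLoader.buildSupportsSnappy());
        }
    }
}
```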

If I launch our job on freshly-restarted task managers, the job operates
fine. If at some point I cancel the job or if the job restarts for some
other reason, the job will begin to crashloop because it tries to open a
Snappy-compressed file but doesn't have access to the codec from the native
Hadoop libraries in `/usr/local/hadoop/lib/native`. If I then restart the
task manager while the job is crashlooping, the job starts running
without any codec failures.
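
To compare a fresh start against a restart, a sketch like the following
could log that state from an operator's `open()`, which runs on every
(re)deployment of the task (this assumes the `NativeLibCheck` sketch
above; `DiagnosticMap` is likewise just illustrative):

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Illustrative pass-through map whose open() runs both on the first
// deployment and on every job restart, so its log lines can be compared
// across the two cases.
public class DiagnosticMap extends RichMapFunction<String, String> {
    @Override
    public void open(Configuration parameters) {
        NativeLibCheck.logNativeLibState();
    }

    @Override
    public String map(String value) {
        return value;
    }
}
```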

The only explanation I can come up with for the Snappy failures is that
the `env.java.opts` settings are somehow not being passed through to the
job on restart.

Does anyone know what's going on? Am I missing some additional
configuration? I really appreciate any help!

About our setup:

- Flink Version: 1.7.0
- Deployment: Standalone in HA
- Hadoop/S3 setup: we do *not* set `HADOOP_CLASSPATH`. We use Flink's
shaded jars to access our files in S3. We do not use the
`bundled-with-hadoop` distribution of Flink.

Best,

Aaron Levin
