Hey Aaron, sorry for the late reply.
(1) I think I was able to reproduce this issue using snappy-java. I've filed a ticket here: https://issues.apache.org/jira/browse/FLINK-11402. Can you check the ticket description to see whether it's in line with what you are experiencing? Most importantly, do you see the same exception being reported after cancelling and re-starting the job?

(2) I don't think it's caused by the environment options not being picked up. You can check the head of the JobManager or TaskManager log files to verify that your provided option is picked up as expected. You should see something similar to this:

2019-01-21 22:53:49,863 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
2019-01-21 22:53:49,864 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting StandaloneSessionClusterEntrypoint (Version: 1.7.0, Rev:49da9f9, Date:28.11.2018 @ 17:59:06 UTC)
...
2019-01-21 22:53:49,865 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM Options:
2019-01-21 22:53:49,865 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint -     -Xms1024m
2019-01-21 22:53:49,865 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint -     -Xmx1024m

You are looking for this line:
----> 2019-01-21 22:53:49,865 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint -     -Djava.library.path=/.../org/xerial/snappy/native/Mac/x86_64/ <----

2019-01-21 22:53:49,865 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint -     -Dlog.file=/.../flink-1.7.0/log/flink-standalonesession-0.local.log
...
2019-01-21 22:53:49,866 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments:
2019-01-21 22:53:49,866 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint -     --configDir
2019-01-21 22:53:49,866 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint -     /.../flink-1.7.0/conf
2019-01-21 22:53:49,866 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint -     --executionMode
2019-01-21 22:53:49,866 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint -     cluster
...
2019-01-21 22:53:49,866 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------

Can you verify that you see the log messages as expected?

(3) As noted in FLINK-11402, is it possible to package the snappy library as part of your user code instead of loading the library via java.library.path? In my example, that seems to work fine.

– Ufuk

On Thu, Jan 17, 2019 at 5:53 PM Aaron Levin <aaronle...@stripe.com> wrote:
>
> Hello!
>
> *tl;dr*: settings in `env.java.opts` seem to stop having an impact when a job
> is canceled or fails and is then restarted (with or without
> savepoints/checkpoints). If I restart the task managers, the `env.java.opts`
> settings take effect again and our job runs without failure. More below.
>
> We consume Snappy-compressed sequence files in our Flink job. This requires
> access to the Hadoop native libraries. In our `flink-conf.yaml` for both the
> task manager and the job manager, we put:
>
> ```
> env.java.opts: -Djava.library.path=/usr/local/hadoop/lib/native
> ```
>
> If I launch our job on freshly restarted task managers, the job operates
> fine. If at some point I cancel the job, or if the job restarts for some
> other reason, the job will begin to crash-loop because it tries to open a
> Snappy-compressed file but doesn't have access to the codec from the native
> Hadoop libraries in `/usr/local/hadoop/lib/native`.
> If I then restart the task manager while the job is crash-looping, the job
> starts running without any codec failures.
>
> The only reason I can conjure that would cause the Snappy compression to
> fail is if the `env.java.opts` were not being passed through to the job on
> restart for some reason.
>
> Does anyone know what's going on? Am I missing some additional
> configuration? I really appreciate any help!
>
> About our setup:
>
> - Flink version: 1.7.0
> - Deployment: standalone in HA
> - Hadoop/S3 setup: we do *not* set `HADOOP_CLASSPATH`. We use Flink's
>   shaded jars to access our files in S3. We do not use the
>   `bundled-with-hadoop` distribution of Flink.
>
> Best,
>
> Aaron Levin
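[Editor's note on point (3) above: snappy-java ships its native libraries inside the jar and extracts the right one at runtime, so adding it as an ordinary user-code dependency can sidestep `java.library.path` entirely. A minimal sketch of the Maven fragment, assuming a Maven build; the version shown is only an example and should be matched to your Flink/Hadoop setup:]

```xml
<!-- Hypothetical version; pick the snappy-java release that matches
     your environment. The group/artifact coordinates are the standard
     ones published to Maven Central. -->
<dependency>
  <groupId>org.xerial.snappy</groupId>
  <artifactId>snappy-java</artifactId>
  <version>1.1.7.2</version>
</dependency>
```

[With this on the user-code classpath, the library self-extracts its bundled `.so`/`.dylib` at load time, so a task-manager restart is not needed to re-establish the native path.]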