Sayat Satybaldiyev created FLINK-10286:
------------------------------------------

             Summary: Flink Persist Invalid Job Graph in Zookeeper
                 Key: FLINK-10286
                 URL: https://issues.apache.org/jira/browse/FLINK-10286
             Project: Flink
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.6.0
            Reporter: Sayat Satybaldiyev


In HA mode, Flink 1.6 persists the job graph in ZooKeeper even if the job was 
not accepted by the JobManager (JM). This is particularly bad because, if the 
JM later dies and restarts, it tries to recover the job, fails, and dies 
completely.
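
The problem described above comes down to ordering: the job graph is written to the HA store before the JM has finished setting the job up, and it is not removed when setup fails. The following is a minimal sketch of the expected behavior, not Flink's actual code. `SubmissionSketch`, `submit`, and the in-memory map standing in for ZooKeeper are hypothetical names chosen only for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of job submission in HA mode; these are not Flink classes.
class SubmissionSketch {
    // Stand-in for the ZooKeeper job graph store (job ID -> serialized graph).
    final Map<String, String> jobGraphStore = new ConcurrentHashMap<>();

    // Persists the graph, then runs setup; removes the graph again if setup fails.
    boolean submit(String jobId, String jobGraph, Runnable setUpJobManager) {
        jobGraphStore.put(jobId, jobGraph);
        try {
            setUpJobManager.run();
            return true;
        } catch (RuntimeException e) {
            // Roll back so that a restarted JM does not try to recover this job.
            jobGraphStore.remove(jobId);
            return false;
        }
    }
}
```

With this ordering, a call such as `submit(id, graph, setup)` whose `setup` throws leaves no entry behind for that job ID, so there is nothing for a restarted JM to recover.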

 

How to reproduce:

1. Have a Flink 1.6 cluster running in HA mode.

2. Submit an invalid job. In my case, I used an invalid file system scheme 
for the RocksDB state backend:

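```

import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
env.enableCheckpointing(5000); // checkpoint every 5000 ms

// "hddd" is the deliberately invalid scheme that makes the JM reject the job.
RocksDBStateBackend backend = new RocksDBStateBackend("hddd:///tmp/flink/rocksdb");
backend.setPredefinedOptions(PredefinedOptions.FLASH_SSD_OPTIMIZED);
env.setStateBackend(backend);

```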

Client returns:

```

The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: Could not submit job (JobID: 9680f02ae2f3806c3b4da25bfacd0749)

```

The JM does not accept the job. This is the truncated error log from the JM:

```

Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit job.
... 24 more
Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager

Caused by: java.lang.RuntimeException: Failed to start checkpoint ID counter: Could not find a file system implementation for scheme 'hddd'. The scheme is not directly supported by Flink and no Hadoop file system to support this scheme could be loaded.

 

```

3. Go to ZooKeeper (ZK) and observe that the JM has saved the job graph to ZK:

ls /flink/flink_ns/jobgraphs/9680f02ae2f3806c3b4da25bfacd0749
[7f392fd9-cedc-4978-9186-1f54b98eeeb7]
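
Until the JM cleans up after a rejected job, the invalid scheme from step 2 could be caught on the client side before the job is submitted. This is only a sketch under stated assumptions: `SchemeCheck`, `isSupported`, and the allow-list of schemes are hypothetical, and the schemes a real cluster accepts depend on the file systems available on its classpath.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical pre-submission check; not part of Flink's API.
class SchemeCheck {
    // Assumed allow-list; adjust to the file systems your cluster actually provides.
    static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList("file", "hdfs", "s3"));

    // Returns true only if the URI has a scheme that is in the allow-list.
    static boolean isSupported(String checkpointUri) {
        String scheme = URI.create(checkpointUri).getScheme();
        return scheme != null && SUPPORTED.contains(scheme.toLowerCase());
    }
}
```

For the URI used in this report, `SchemeCheck.isSupported("hddd:///tmp/flink/rocksdb")` returns `false`, so the client could refuse to submit the job instead of leaving a job graph behind in ZK.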



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
