[jira] [Commented] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes

2019-05-27 Thread Konstantin Knauf (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848736#comment-16848736
 ] 

Konstantin Knauf commented on FLINK-12381:
--

[~till.rohrmann] I opened  a PR for the first step for a subtask issue 
https://issues.apache.org/jira/browse/FLINK-12617.

> W/o HA, upon a full restart, checkpointing crashes
> --
>
> Key: FLINK-12381
> URL: https://issues.apache.org/jira/browse/FLINK-12381
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.8.0
> Environment: Same as FLINK-\{12379, 12377, 12376}
>Reporter: Henrik
>Assignee: Konstantin Knauf
>Priority: Major
>
> {code:java}
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> 'gs://example_bucket/flink/checkpoints//chk-16/_metadata'
>  already exists
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
>     at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
>     ... 8 more
> {code}
> Instead, it should either just overwrite the checkpoint or fail to start the 
> job completely. Partial and undefined failure is not what should happen.
>  
> Repro:
>  # Set up a single purpose job cluster (which could use much better docs btw!)
>  # Let it run with GCS checkpointing for a while with rocksdb/gs://example
>  # Kill it
>  # Start it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes

2019-05-24 Thread Konstantin Knauf (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847466#comment-16847466
 ] 

Konstantin Knauf commented on FLINK-12381:
--

Thanks [~till.rohrmann] for the comments.  Overall, we end up with following 
behavior for the {{JobID}} in the {{StandaloneClusterEntryPoint}} by ascending 
precedence:

* 0 (HA), random (non HA)
*  {{--job-id}}

I will do this in a first step and will in parallel look into how to 
automatically set the {{--job-id}} per Deployment in Kubernetes.


> W/o HA, upon a full restart, checkpointing crashes
> --
>
> Key: FLINK-12381
> URL: https://issues.apache.org/jira/browse/FLINK-12381
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.8.0
> Environment: Same as FLINK-\{12379, 12377, 12376}
>Reporter: Henrik
>Assignee: Konstantin Knauf
>Priority: Major
>
> {code:java}
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> 'gs://example_bucket/flink/checkpoints//chk-16/_metadata'
>  already exists
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
>     at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
>     ... 8 more
> {code}
> Instead, it should either just overwrite the checkpoint or fail to start the 
> job completely. Partial and undefined failure is not what should happen.
>  
> Repro:
>  # Set up a single purpose job cluster (which could use much better docs btw!)
>  # Let it run with GCS checkpointing for a while with rocksdb/gs://example
>  # Kill it
>  # Start it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes

2019-05-17 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16842131#comment-16842131
 ] 

Till Rohrmann commented on FLINK-12381:
---

Thanks for tackling this problem [~knaufk].

1) In order to provide a workaround in case that it should break existing 
setups, we could still respect the {{--job-id}} parameter. That way the user 
can still configure the {{JobID}} to be {{0}}.

2) Yes this is the problem. The current {{StandaloneJobClusterEntrypoint}} 
already supports to pass in an explicit {{JobID}} via the {{--job-id}} command 
line option. For K8s it might be more convenient to use an environment variable 
which the deployment could set. One could then say that we give highest 
precedence to the command line option > env variable > {{0}}.

> W/o HA, upon a full restart, checkpointing crashes
> --
>
> Key: FLINK-12381
> URL: https://issues.apache.org/jira/browse/FLINK-12381
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.8.0
> Environment: Same as FLINK-\{12379, 12377, 12376}
>Reporter: Henrik
>Assignee: Konstantin Knauf
>Priority: Major
>
> {code:java}
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> 'gs://example_bucket/flink/checkpoints//chk-16/_metadata'
>  already exists
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
>     at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
>     ... 8 more
> {code}
> Instead, it should either just overwrite the checkpoint or fail to start the 
> job completely. Partial and undefined failure is not what should happen.
>  
> Repro:
>  # Set up a single purpose job cluster (which could use much better docs btw!)
>  # Let it run with GCS checkpointing for a while with rocksdb/gs://example
>  # Kill it
>  # Start it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes

2019-05-16 Thread Konstantin Knauf (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841524#comment-16841524
 ] 

Konstantin Knauf commented on FLINK-12381:
--

[~till.rohrmann] I am happy to take this. Let me summarize what based on the 
discussion, I think, we need to do:

1) To solve the issue at hand, we only need to make sure, that {{JobID}}s are 
per-default randomly generated in {{StandaloneJobClusterEntrypoint}}, instead 
of defaulting to {{0}}, when HA is disabled. Are we breaking any public API by 
this? As far as I know, no, and a note in the release docs suffices.

2) In the case of a Job Cluster with HA enabled my understanding is, that there 
could still be a conflict even with different 
{{high-availability.cluster-id}}s, because the {{cluster-id}} is not part of 
the checkpoint path (or is it?). It only prevents clashes in Zookeeper. So, if 
there are left over checkpoints (reatained checkpoint or job never reached 
terminal state) from a previous job with a different {{cluster-id}}, but same 
{{state.checkpoints.dir}} and same default {{JobID}} ({{0}}), there would still 
be an issue. Therefore, in this case we would like to inject the {{JobID}} 
instead of defaulting to {{0}}.




> W/o HA, upon a full restart, checkpointing crashes
> --
>
> Key: FLINK-12381
> URL: https://issues.apache.org/jira/browse/FLINK-12381
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.8.0
> Environment: Same as FLINK-\{12379, 12377, 12376}
>Reporter: Henrik
>Priority: Major
>
> {code:java}
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> 'gs://example_bucket/flink/checkpoints//chk-16/_metadata'
>  already exists
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
>     at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
>     ... 8 more
> {code}
> Instead, it should either just overwrite the checkpoint or fail to start the 
> job completely. Partial and undefined failure is not what should happen.
>  
> Repro:
>  # Set up a single purpose job cluster (which could use much better docs btw!)
>  # Let it run with GCS checkpointing for a while with rocksdb/gs://example
>  # Kill it
>  # Start it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes

2019-05-09 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836192#comment-16836192
 ] 

Till Rohrmann commented on FLINK-12381:
---

If you are using the {{StandaloneJobClusterEntrypoint}}, then a K8s deployment 
effectively defines the lifetime of a job. Differently said, restarts of the 
entrypoint pod won't be considered as a "new job submission" but as a "job 
recovery". Admittedly, this only makes sense if Flink can also retrieve 
checkpoints from the previous run which is currently only possible when HA is 
enabled.

I guess the desired behavior would be if HA is disabled, then the 
{{StandaloneJobClusterEntrypoint}} should assign a random {{JobID}} when it 
starts. If HA is enabled, then the {{StandaloneJobClusterEntrypoint}} should 
generate the {{JobID}} in a predictable way (either being specified by the 
user, derived from some env variable, defaulting to 0 if nothing has been 
specified).

> W/o HA, upon a full restart, checkpointing crashes
> --
>
> Key: FLINK-12381
> URL: https://issues.apache.org/jira/browse/FLINK-12381
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.8.0
> Environment: Same as FLINK-\{12379, 12377, 12376}
>Reporter: Henrik
>Priority: Major
>
> {code:java}
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> 'gs://example_bucket/flink/checkpoints//chk-16/_metadata'
>  already exists
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
>     at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
>     ... 8 more
> {code}
> Instead, it should either just overwrite the checkpoint or fail to start the 
> job completely. Partial and undefined failure is not what should happen.
>  
> Repro:
>  # Set up a single purpose job cluster (which could use much better docs btw!)
>  # Let it run with GCS checkpointing for a while with rocksdb/gs://example
>  # Kill it
>  # Start it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes

2019-05-08 Thread Henrik (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835800#comment-16835800
 ] 

Henrik commented on FLINK-12381:


Yes, you can see it like that (a new cluster), I suppose.

So does that mean that flink is useless without HA then? Because if I don't 
have HA, and the node I'm running it on, or the k8s pod I'm running it in, 
restarts, it's a new cluster?

> W/o HA, upon a full restart, checkpointing crashes
> --
>
> Key: FLINK-12381
> URL: https://issues.apache.org/jira/browse/FLINK-12381
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.8.0
> Environment: Same as FLINK-\{12379, 12377, 12376}
>Reporter: Henrik
>Priority: Major
>
> {code:java}
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> 'gs://example_bucket/flink/checkpoints//chk-16/_metadata'
>  already exists
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
>     at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
>     ... 8 more
> {code}
> Instead, it should either just overwrite the checkpoint or fail to start the 
> job completely. Partial and undefined failure is not what should happen.
>  
> Repro:
>  # Set up a single purpose job cluster (which could use much better docs btw!)
>  # Let it run with GCS checkpointing for a while with rocksdb/gs://example
>  # Kill it
>  # Start it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes

2019-05-08 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835591#comment-16835591
 ] 

Till Rohrmann commented on FLINK-12381:
---

But you said that you killed a cluster and restarted it. This makes it 
effectively two cluster which are executed, just not concurrently. Since 
killing a cluster can leave some checkpoint files behind, it could pretty much 
explain what you are observing. The easiest way to verify this is to start the 
job cluster with a different {{JobID}} via {{--job-id}}.

> W/o HA, upon a full restart, checkpointing crashes
> --
>
> Key: FLINK-12381
> URL: https://issues.apache.org/jira/browse/FLINK-12381
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.8.0
> Environment: Same as FLINK-\{12379, 12377, 12376}
>Reporter: Henrik
>Priority: Major
>
> {code:java}
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> 'gs://example_bucket/flink/checkpoints//chk-16/_metadata'
>  already exists
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
>     at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
>     ... 8 more
> {code}
> Instead, it should either just overwrite the checkpoint or fail to start the 
> job completely. Partial and undefined failure is not what should happen.
>  
> Repro:
>  # Set up a single purpose job cluster (which could use much better docs btw!)
>  # Let it run with GCS checkpointing for a while with rocksdb/gs://example
>  # Kill it
>  # Start it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes

2019-05-08 Thread Henrik (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835507#comment-16835507
 ] 

Henrik commented on FLINK-12381:


It's just a single "cluster", so I don't think what you write above applies. 
There's only a single app/job. So I don't see how setting a job id helps.

> W/o HA, upon a full restart, checkpointing crashes
> --
>
> Key: FLINK-12381
> URL: https://issues.apache.org/jira/browse/FLINK-12381
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.8.0
> Environment: Same as FLINK-\{12379, 12377, 12376}
>Reporter: Henrik
>Priority: Major
>
> {code:java}
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> 'gs://example_bucket/flink/checkpoints//chk-16/_metadata'
>  already exists
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
>     at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
>     ... 8 more
> {code}
> Instead, it should either just overwrite the checkpoint or fail to start the 
> job completely. Partial and undefined failure is not what should happen.
>  
> Repro:
>  # Set up a single purpose job cluster (which could use much better docs btw!)
>  # Let it run with GCS checkpointing for a while with rocksdb/gs://example
>  # Kill it
>  # Start it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes

2019-05-08 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835474#comment-16835474
 ] 

Till Rohrmann commented on FLINK-12381:
---

I think the observed problem is caused by FLINK-10291. FLINK-10291 was needed 
because the {{StandaloneJobClusterEntrypoin}} generates the {{JobGraph}} from a 
file which is stored within the container image. In order to resume from 
existing checkpoints in case of a failover, it was needed to create the 
{{JobGraph}} with a predictable {{JobID}}.

Due to this change, one needs to configure a different 
{{high-availability.cluster-id}} for different jobs/cluster if using the 
{{StandaloneJobClusterEntrypoint}}. Otherwise it is not possible to distinguish 
different clusters (in the context of the job mode it is equivalent to a job).

> W/o HA, upon a full restart, checkpointing crashes
> --
>
> Key: FLINK-12381
> URL: https://issues.apache.org/jira/browse/FLINK-12381
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.8.0
> Environment: Same as FLINK-\{12379, 12377, 12376}
>Reporter: Henrik
>Priority: Major
>
> {code:java}
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> 'gs://example_bucket/flink/checkpoints//chk-16/_metadata'
>  already exists
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
>     at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
>     ... 8 more
> {code}
> Instead, it should either just overwrite the checkpoint or fail to start the 
> job completely. Partial and undefined failure is not what should happen.
>  
> Repro:
>  # Set up a single purpose job cluster (which could use much better docs btw!)
>  # Let it run with GCS checkpointing for a while with rocksdb/gs://example
>  # Kill it
>  # Start it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes

2019-05-08 Thread Stephan Ewen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835360#comment-16835360
 ] 

Stephan Ewen commented on FLINK-12381:
--

Flink does not overwrite checkpoints, because this could lead to one job 
causing another job data loss.

Paths between jobs should be different due to different JobIDs. Can you share 
your setup, specifically why your job ID is all zero?

> W/o HA, upon a full restart, checkpointing crashes
> --
>
> Key: FLINK-12381
> URL: https://issues.apache.org/jira/browse/FLINK-12381
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.8.0
> Environment: Same as FLINK-\{12379, 12377, 12376}
>Reporter: Henrik
>Priority: Major
>
> {code:java}
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> 'gs://example_bucket/flink/checkpoints//chk-16/_metadata'
>  already exists
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
>     at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
>     ... 8 more
> {code}
> Instead, it should either just overwrite the checkpoint or fail to start the 
> job completely. Partial and undefined failure is not what should happen.
>  
> Repro:
>  # Set up a single purpose job cluster (which could use much better docs btw!)
>  # Let it run with GCS checkpointing for a while with rocksdb/gs://example
>  # Kill it
>  # Start it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)