Re: Flink 1.3 - Checkpointing failing

2017-06-03 Thread Tzu-Li (Gordon) Tai
Hi Mahesh,

Thanks a lot for reporting this. This would be a bug: 
https://issues.apache.org/jira/browse/FLINK-6844.
Apparently the TraversableSerializer could take part in checkpointing and 
therefore should implement the new compatibility methods.

I’ll make sure that the fix for this gets into 1.3.1.

Cheers,
Gordon

On 3 June 2017 at 5:41:12 AM, Ted Yu (yuzhih...@gmail.com) wrote:

If I read CompositeTypeSerializerConfigSnapshot ctor correctly:

    for (TypeSerializer nestedSerializer : nestedSerializers) {
      TypeSerializerConfigSnapshot configSnapshot = 
nestedSerializer.snapshotConfiguration();
      this.nestedSerializersAndConfigs.add(

The UnsupportedOperationException thrown by snapshotConfiguration() should be 
caught without proceeding to nestedSerializersAndConfigs.add(). 

On Fri, Jun 2, 2017 at 7:02 PM, Ted Yu  wrote:
Your case doesn't seem like FLINK-5462 since there was no CancellationException
in the stack trace you posted.



The exception from TraversableSerializer.snapshotConfiguration() was added by 
FLINK-6178

FYI

On Fri, Jun 2, 2017 at 4:04 PM, MAHESH KUMAR  
wrote:
Hi Team,

We have some test cases written using StreamingMultipleProgramsTestBase
It was working fine in version 1.2, we get the following error in version 1.3
It seems like CheckpointCoordinator fails after this error and Checkpointing no 
longer occurs.

I came across this bug: https://issues.apache.org/jira/browse/FLINK-5462 , It 
looks kind of similar but I am not exactly sure.

2017-06-02 16:11:07,048  INFO | flink-akka.actor.default-dispatcher-3 | 
org.apache.flink.runtime.executiongraph.ExecutionGraph  | Could not restart the 
job pipeline_message_auditor (f54182ae17352efb9aa40667c283ce09) because the 
restart strategy prevented it.
org.apache.flink.streaming.runtime.tasks.AsynchronousException: 
java.lang.Exception: Could not materialize checkpoint 1 for operator 
TriggerWindow(TumblingProcessingTimeWindows(4000), 
ReducingStateDescriptor{serializer=com.oracle.ci.flink.streaming.MessageAuditorStreamingHelper$$anonfun$1$$anon$7$$anon$2@e42b922d,
 
reduceFunction=org.apache.flink.streaming.api.scala.function.util.ScalaReduceFunction@59623f44},
 ProcessingTimeTrigger(), WindowedStream.reduce(WindowedStream.java:301)) -> 
(Map -> Sink: auditor_out-kafkaSink, Map -> Sink: auditor_expire-kafkaSink) 
(1/1).
at 
org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:963)
 ~[flink-streaming-java_2.11-1.3.0.jar:1.3.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
~[na:1.8.0_112]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_112]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
~[na:1.8.0_112]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
~[na:1.8.0_112]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_112]
Caused by: java.lang.Exception: Could not materialize checkpoint 1 for operator 
TriggerWindow(TumblingProcessingTimeWindows(4000), 
ReducingStateDescriptor{serializer=com.oracle.ci.flink.streaming.MessageAuditorStreamingHelper$$anonfun$1$$anon$7$$anon$2@e42b922d,
 
reduceFunction=org.apache.flink.streaming.api.scala.function.util.ScalaReduceFunction@59623f44},
 ProcessingTimeTrigger(), WindowedStream.reduce(WindowedStream.java:301)) -> 
(Map -> Sink: auditor_out-kafkaSink, Map -> Sink: auditor_expire-kafkaSink) 
(1/1).
... 6 common frames omitted
Caused by: java.util.concurrent.ExecutionException: 
java.lang.UnsupportedOperationException
at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[na:1.8.0_112]
at java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[na:1.8.0_112]
at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43) 
~[flink-core-1.3.0.jar:1.3.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:893)
 ~[flink-streaming-java_2.11-1.3.0.jar:1.3.0]
... 5 common frames omitted
Suppressed: java.lang.Exception: Could not properly cancel managed keyed state 
future.
at 
org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:90)
 ~[flink-streaming-java_2.11-1.3.0.jar:1.3.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:1018)
 ~[flink-streaming-java_2.11-1.3.0.jar:1.3.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:957)
 ~[flink-streaming-java_2.11-1.3.0.jar:1.3.0]
... 5 common frames omitted
Caused by: java.util.concurrent.ExecutionException: 
java.lang.UnsupportedOperationException
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
at 

Re: Flink 1.3 - Checkpointing failing

2017-06-02 Thread Ted Yu
If I read CompositeTypeSerializerConfigSnapshot ctor correctly:

for (TypeSerializer nestedSerializer : nestedSerializers) {
  TypeSerializerConfigSnapshot configSnapshot =
nestedSerializer.snapshotConfiguration();
  this.nestedSerializersAndConfigs.add(

The UnsupportedOperationException thrown by snapshotConfiguration() should
be caught without proceeding to nestedSerializersAndConfigs.add().

On Fri, Jun 2, 2017 at 7:02 PM, Ted Yu  wrote:

> Your case doesn't seem like FLINK-5462 since there was no 
> CancellationException
> in the stack trace you posted.
>
> The exception from TraversableSerializer.snapshotConfiguration() was
> added by FLINK-6178
>
> FYI
>
> On Fri, Jun 2, 2017 at 4:04 PM, MAHESH KUMAR  > wrote:
>
>> Hi Team,
>>
>> We have some test cases written using StreamingMultipleProgramsTestBase
>> It was working fine in version 1.2, we get the following error in version
>> 1.3
>> It seems like CheckpointCoordinator fails after this error and
>> Checkpointing no longer occurs.
>>
>> I came across this bug: https://issues.apache.org/jira/browse/FLINK-5462
>> , It looks kind of similar but I am not exactly sure.
>>
>> 2017-06-02 16:11:07,048  INFO | flink-akka.actor.default-dispatcher-3 |
>> org.apache.flink.runtime.executiongraph.ExecutionGraph  | Could not
>> restart the job pipeline_message_auditor (f54182ae17352efb9aa40667c283ce09)
>> because the restart strategy prevented it.
>> org.apache.flink.streaming.runtime.tasks.AsynchronousException:
>> java.lang.Exception: Could not materialize checkpoint 1 for operator
>> TriggerWindow(TumblingProcessingTimeWindows(4000),
>> ReducingStateDescriptor{serializer=com.oracle.ci.flink.
>> streaming.MessageAuditorStreamingHelper$$anonfun$1$$anon$7$$
>> anon$2@e42b922d, reduceFunction=org.apache.flin
>> k.streaming.api.scala.function.util.ScalaReduceFunction@59623f44},
>> ProcessingTimeTrigger(), WindowedStream.reduce(WindowedStream.java:301))
>> -> (Map -> Sink: auditor_out-kafkaSink, Map -> Sink:
>> auditor_expire-kafkaSink) (1/1).
>> at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncChe
>> ckpointRunnable.run(StreamTask.java:963) ~[flink-streaming-java_2.11-1.
>> 3.0.jar:1.3.0]
>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> ~[na:1.8.0_112]
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> ~[na:1.8.0_112]
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> ~[na:1.8.0_112]
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> ~[na:1.8.0_112]
>> at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_112]
>> Caused by: java.lang.Exception: Could not materialize checkpoint 1 for
>> operator TriggerWindow(TumblingProcessingTimeWindows(4000),
>> ReducingStateDescriptor{serializer=com.oracle.ci.flink.
>> streaming.MessageAuditorStreamingHelper$$anonfun$1$$anon$7$$
>> anon$2@e42b922d, reduceFunction=org.apache.flin
>> k.streaming.api.scala.function.util.ScalaReduceFunction@59623f44},
>> ProcessingTimeTrigger(), WindowedStream.reduce(WindowedStream.java:301))
>> -> (Map -> Sink: auditor_out-kafkaSink, Map -> Sink:
>> auditor_expire-kafkaSink) (1/1).
>> ... 6 common frames omitted
>> Caused by: java.util.concurrent.ExecutionException:
>> java.lang.UnsupportedOperationException
>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>> ~[na:1.8.0_112]
>> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>> ~[na:1.8.0_112]
>> at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>> ~[flink-core-1.3.0.jar:1.3.0]
>> at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncChe
>> ckpointRunnable.run(StreamTask.java:893) ~[flink-streaming-java_2.11-1.
>> 3.0.jar:1.3.0]
>> ... 5 common frames omitted
>> Suppressed: java.lang.Exception: Could not properly cancel managed keyed
>> state future.
>> at org.apache.flink.streaming.api.operators.OperatorSnapshotRes
>> ult.cancel(OperatorSnapshotResult.java:90) ~[flink-streaming-java_2.11-1.
>> 3.0.jar:1.3.0]
>> at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncChe
>> ckpointRunnable.cleanup(StreamTask.java:1018)
>> ~[flink-streaming-java_2.11-1.3.0.jar:1.3.0]
>> at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncChe
>> ckpointRunnable.run(StreamTask.java:957) ~[flink-streaming-java_2.11-1.
>> 3.0.jar:1.3.0]
>> ... 5 common frames omitted
>> Caused by: java.util.concurrent.ExecutionException:
>> java.lang.UnsupportedOperationException
>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>> at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUt
>> il.java:43)
>> at org.apache.flink.runtime.state.StateUtil.discardStateFuture(
>> StateUtil.java:85)
>> at org.apache.flink.streaming.api.operators.OperatorSnapshotRes
>> ult.cancel(OperatorSnapshotResult.java:88)
>> ... 7 common frames omitted

Re: Flink 1.3 - Checkpointing failing

2017-06-02 Thread Ted Yu
Your case doesn't seem like FLINK-5462 since there was no CancellationException
in the stack trace you posted.

The exception from TraversableSerializer.snapshotConfiguration() was added
by FLINK-6178

FYI

On Fri, Jun 2, 2017 at 4:04 PM, MAHESH KUMAR 
wrote:

> Hi Team,
>
> We have some test cases written using StreamingMultipleProgramsTestBase
> It was working fine in version 1.2, we get the following error in version
> 1.3
> It seems like CheckpointCoordinator fails after this error and
> Checkpointing no longer occurs.
>
> I came across this bug: https://issues.apache.org/jira/browse/FLINK-5462
> , It looks kind of similar but I am not exactly sure.
>
> 2017-06-02 16:11:07,048  INFO | flink-akka.actor.default-dispatcher-3 |
> org.apache.flink.runtime.executiongraph.ExecutionGraph  | Could not
> restart the job pipeline_message_auditor (f54182ae17352efb9aa40667c283ce09)
> because the restart strategy prevented it.
> org.apache.flink.streaming.runtime.tasks.AsynchronousException:
> java.lang.Exception: Could not materialize checkpoint 1 for operator
> TriggerWindow(TumblingProcessingTimeWindows(4000),
> ReducingStateDescriptor{serializer=com.oracle.ci.flink.streaming.
> MessageAuditorStreamingHelper$$anonfun$1$$anon$7$$anon$2@e42b922d,
> reduceFunction=org.apache.flink.streaming.api.scala.function.util.
> ScalaReduceFunction@59623f44}, ProcessingTimeTrigger(),
> WindowedStream.reduce(WindowedStream.java:301)) -> (Map -> Sink:
> auditor_out-kafkaSink, Map -> Sink: auditor_expire-kafkaSink) (1/1).
> at org.apache.flink.streaming.runtime.tasks.StreamTask$
> AsyncCheckpointRunnable.run(StreamTask.java:963)
> ~[flink-streaming-java_2.11-1.3.0.jar:1.3.0]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> ~[na:1.8.0_112]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> ~[na:1.8.0_112]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> ~[na:1.8.0_112]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> ~[na:1.8.0_112]
> at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_112]
> Caused by: java.lang.Exception: Could not materialize checkpoint 1 for
> operator TriggerWindow(TumblingProcessingTimeWindows(4000),
> ReducingStateDescriptor{serializer=com.oracle.ci.flink.streaming.
> MessageAuditorStreamingHelper$$anonfun$1$$anon$7$$anon$2@e42b922d,
> reduceFunction=org.apache.flink.streaming.api.scala.function.util.
> ScalaReduceFunction@59623f44}, ProcessingTimeTrigger(),
> WindowedStream.reduce(WindowedStream.java:301)) -> (Map -> Sink:
> auditor_out-kafkaSink, Map -> Sink: auditor_expire-kafkaSink) (1/1).
> ... 6 common frames omitted
> Caused by: java.util.concurrent.ExecutionException: java.lang.
> UnsupportedOperationException
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> ~[na:1.8.0_112]
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> ~[na:1.8.0_112]
> at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
> ~[flink-core-1.3.0.jar:1.3.0]
> at org.apache.flink.streaming.runtime.tasks.StreamTask$
> AsyncCheckpointRunnable.run(StreamTask.java:893)
> ~[flink-streaming-java_2.11-1.3.0.jar:1.3.0]
> ... 5 common frames omitted
> Suppressed: java.lang.Exception: Could not properly cancel managed keyed
> state future.
> at org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(
> OperatorSnapshotResult.java:90) ~[flink-streaming-java_2.11-1.
> 3.0.jar:1.3.0]
> at org.apache.flink.streaming.runtime.tasks.StreamTask$
> AsyncCheckpointRunnable.cleanup(StreamTask.java:1018)
> ~[flink-streaming-java_2.11-1.3.0.jar:1.3.0]
> at org.apache.flink.streaming.runtime.tasks.StreamTask$
> AsyncCheckpointRunnable.run(StreamTask.java:957)
> ~[flink-streaming-java_2.11-1.3.0.jar:1.3.0]
> ... 5 common frames omitted
> Caused by: java.util.concurrent.ExecutionException: java.lang.
> UnsupportedOperationException
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
> at org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.
> java:85)
> at org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(
> OperatorSnapshotResult.java:88)
> ... 7 common frames omitted
> Caused by: java.lang.UnsupportedOperationException: null
> at org.apache.flink.api.scala.typeutils.TraversableSerializer.
> snapshotConfiguration(TraversableSerializer.scala:155)
> at org.apache.flink.api.common.typeutils.CompositeTypeSerializerConfigS
> napshot.(CompositeTypeSerializerConfigSnapshot.java:53)
> at org.apache.flink.api.java.typeutils.runtime.
> TupleSerializerConfigSnapshot.(TupleSerializerConfigSnapshot.
> java:45)
> at org.apache.flink.api.java.typeutils.runtime.TupleSerializerBase.
> snapshotConfiguration(TupleSerializerBase.java:132)
> at 

Flink 1.3 - Checkpointing failing

2017-06-02 Thread MAHESH KUMAR
Hi Team,

We have some test cases written using StreamingMultipleProgramsTestBase
It was working fine in version 1.2, we get the following error in version
1.3
It seems like CheckpointCoordinator fails after this error and
Checkpointing no longer occurs.

I came across this bug: https://issues.apache.org/jira/browse/FLINK-5462 ,
It looks kind of similar but I am not exactly sure.

2017-06-02 16:11:07,048  INFO | flink-akka.actor.default-dispatcher-3 |
org.apache.flink.runtime.executiongraph.ExecutionGraph  | Could not restart
the job pipeline_message_auditor (f54182ae17352efb9aa40667c283ce09) because
the restart strategy prevented it.
org.apache.flink.streaming.runtime.tasks.AsynchronousException:
java.lang.Exception: Could not materialize checkpoint 1 for operator
TriggerWindow(TumblingProcessingTimeWindows(4000),
ReducingStateDescriptor{serializer=com.oracle.ci.flink.streaming.MessageAuditorStreamingHelper$$anonfun$1$$anon$7$$anon$2@e42b922d,
reduceFunction=org.apache.flink.streaming.api.scala.function.util.ScalaReduceFunction@59623f44},
ProcessingTimeTrigger(), WindowedStream.reduce(WindowedStream.java:301)) ->
(Map -> Sink: auditor_out-kafkaSink, Map -> Sink: auditor_expire-kafkaSink)
(1/1).
at
org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:963)
~[flink-streaming-java_2.11-1.3.0.jar:1.3.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
~[na:1.8.0_112]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_112]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
~[na:1.8.0_112]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
~[na:1.8.0_112]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_112]
Caused by: java.lang.Exception: Could not materialize checkpoint 1 for
operator TriggerWindow(TumblingProcessingTimeWindows(4000),
ReducingStateDescriptor{serializer=com.oracle.ci.flink.streaming.MessageAuditorStreamingHelper$$anonfun$1$$anon$7$$anon$2@e42b922d,
reduceFunction=org.apache.flink.streaming.api.scala.function.util.ScalaReduceFunction@59623f44},
ProcessingTimeTrigger(), WindowedStream.reduce(WindowedStream.java:301)) ->
(Map -> Sink: auditor_out-kafkaSink, Map -> Sink: auditor_expire-kafkaSink)
(1/1).
... 6 common frames omitted
Caused by: java.util.concurrent.ExecutionException:
java.lang.UnsupportedOperationException
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
~[na:1.8.0_112]
at java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[na:1.8.0_112]
at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
~[flink-core-1.3.0.jar:1.3.0]
at
org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:893)
~[flink-streaming-java_2.11-1.3.0.jar:1.3.0]
... 5 common frames omitted
Suppressed: java.lang.Exception: Could not properly cancel managed keyed
state future.
at
org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:90)
~[flink-streaming-java_2.11-1.3.0.jar:1.3.0]
at
org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:1018)
~[flink-streaming-java_2.11-1.3.0.jar:1.3.0]
at
org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:957)
~[flink-streaming-java_2.11-1.3.0.jar:1.3.0]
... 5 common frames omitted
Caused by: java.util.concurrent.ExecutionException:
java.lang.UnsupportedOperationException
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
at
org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.java:85)
at
org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:88)
... 7 common frames omitted
Caused by: java.lang.UnsupportedOperationException: null
at
org.apache.flink.api.scala.typeutils.TraversableSerializer.snapshotConfiguration(TraversableSerializer.scala:155)
at
org.apache.flink.api.common.typeutils.CompositeTypeSerializerConfigSnapshot.(CompositeTypeSerializerConfigSnapshot.java:53)
at
org.apache.flink.api.java.typeutils.runtime.TupleSerializerConfigSnapshot.(TupleSerializerConfigSnapshot.java:45)
at
org.apache.flink.api.java.typeutils.runtime.TupleSerializerBase.snapshotConfiguration(TupleSerializerBase.java:132)
at
org.apache.flink.api.java.typeutils.runtime.TupleSerializerBase.snapshotConfiguration(TupleSerializerBase.java:39)
at
org.apache.flink.runtime.state.RegisteredKeyedBackendStateMetaInfo.snapshot(RegisteredKeyedBackendStateMetaInfo.java:71)
at
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBFullSnapshotOperation.writeKVStateMetaData(RocksDBKeyedStateBackend.java:591)
at