dcausse created this task.
dcausse added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.

TASK DESCRIPTION
  This error causes the pipeline to restart and might trigger the latency alert 
//WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh//.
  
  It was seen on the pipeline running in codfw right after one Kafka node was removed from the cluster.
  The error was not a one-off: it recurred several times afterwards. Timeline:
  
  - 2021-12-15T16:12 kafka-main2003 is removed from the cluster
  - 2021-12-15T16:17 flink fails
  - 2021-12-15T16:19 flink fails
  - 2021-12-15T16:29 flink fails
  - 2021-12-15T16:37 flink fails
  - 2021-12-15T16:42 flink fails
  - 2021-12-16T10:46 flink fails
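
  To make the spacing of the failures easier to read, the timeline above can be expressed as offsets from the broker removal. This is only a minimal sketch: the class and method names are hypothetical, and the timestamps are copied from the list as local date-times with no timezone assumed.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.List;

class FailureTimeline {
    // Minutes elapsed between the broker removal and a later Flink failure.
    static long minutesAfter(LocalDateTime removal, LocalDateTime failure) {
        return Duration.between(removal, failure).toMinutes();
    }

    public static void main(String[] args) {
        // Timestamps taken verbatim from the timeline in this task.
        LocalDateTime removal = LocalDateTime.parse("2021-12-15T16:12");
        List<String> failures = List.of(
            "2021-12-15T16:17",
            "2021-12-15T16:19",
            "2021-12-15T16:29",
            "2021-12-15T16:37",
            "2021-12-15T16:42",
            "2021-12-16T10:46");

        for (String f : failures) {
            long minutes = minutesAfter(removal, LocalDateTime.parse(f));
            System.out.println(f + " flink fails (+" + minutes + " min)");
        }
    }
}
```

  The first five failures fall within 30 minutes of the removal (at +5, +7, +17, +25 and +30 minutes); the last one comes roughly 18.5 hours later (+1114 minutes).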
  
  A pipeline restart after a Kafka broker is removed is expected, but the subsequent failures suggest that this setup (Flink + kafka-main minus one broker) is less stable than usual.
  
  Flink resumes properly without user-facing issues; the failures are noticeable only because the //WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh// alert is triggered.
  
  The flink error stack is:
  
    org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 216620 for operator RDFPatchChunkOperation -> MeasureEventProcessingLatencyOperation -> Sink: codfw.rdf-streaming-updater.mutation:0 (1/1)#1. Failure reason: Checkpoint was declined.
        at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:241)
        at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:162)
        at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:371)
        at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointStreamOperator(SubtaskCheckpointCoordinatorImpl.java:685)
        at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.buildOperatorSnapshotFutures(SubtaskCheckpointCoordinatorImpl.java:606)
        at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:571)
        at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:298)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$9(StreamTask.java:1003)
        at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:993)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:951)
        at org.apache.flink.streaming.runtime.io.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:115)
        at org.apache.flink.streaming.runtime.io.SingleCheckpointBarrierHandler.processBarrier(SingleCheckpointBarrierHandler.java:156)
        at org.apache.flink.streaming.runtime.io.CheckpointedInputGate.handleEvent(CheckpointedInputGate.java:178)
        at org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:155)
        at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:179)
        at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:395)
        at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:191)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:609)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:573)
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:755)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:570)
        at java.base/java.lang.Thread.run(Thread.java:834)
    Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
        
    error_type
    org.apache.flink.runtime.checkpoint.CheckpointException
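
  The 60000 ms in the `Caused by` line matches the documented default of the Kafka producer setting `max.block.ms`, which bounds how long `KafkaProducer.initTransactions()` may block. Whether this pipeline runs with that default, and whether raising it would help rather than merely hide an unhealthy broker or transaction coordinator, are assumptions and not conclusions of this task. As a minimal sketch (the class and helper names are hypothetical), a producer override could be built like this:

```java
import java.util.Properties;

class ProducerTimeouts {
    // Hypothetical helper: builds producer overrides with a custom max.block.ms.
    // "max.block.ms" is a real Kafka producer setting (default 60000 ms);
    // the value passed in is purely illustrative.
    static Properties withMaxBlockMs(long maxBlockMs) {
        Properties props = new Properties();
        props.setProperty("max.block.ms", Long.toString(maxBlockMs));
        return props;
    }

    public static void main(String[] args) {
        // Example only: allow two minutes instead of the default one minute.
        Properties overrides = withMaxBlockMs(120_000L);
        System.out.println("max.block.ms=" + overrides.getProperty("max.block.ms"));
    }
}
```

  Such overrides would have to be merged into the properties passed to the Flink Kafka sink; how the updater job wires its producer configuration is not shown in this task.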

TASK DETAIL
  https://phabricator.wikimedia.org/T297870

To: dcausse
Cc: BTullis, elukey, dcausse, Aklapper, MPhamWMF, CBogen, Namenlos314, Gq86, 
Lucas_Werkmeister_WMDE, EBjune, merbst, Jonas, Xmlizer, jkroll, Wikidata-bugs, 
Jdouglas, aude, Tobias1984, Manybubbles