dcausse created this task.
dcausse added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.

TASK DESCRIPTION
  While deploying a new version of the streaming-updater (0.3.103), Flink failed with:
  
    java.io.EOFException
        at java.base/java.io.DataInputStream.readFully(DataInputStream.java:202)
        at java.base/java.io.DataInputStream.readFully(DataInputStream.java:170)
        at org.apache.flink.api.common.typeutils.base.array.BytePrimitiveArraySerializer.deserialize(BytePrimitiveArraySerializer.java:82)
        at org.apache.flink.contrib.streaming.state.restore.RocksDBFullRestoreOperation.restoreKVStateData(RocksDBFullRestoreOperation.java:229)
        at org.apache.flink.contrib.streaming.state.restore.RocksDBFullRestoreOperation.restoreKeyGroupsInStateHandle(RocksDBFullRestoreOperation.java:158)
        at org.apache.flink.contrib.streaming.state.restore.RocksDBFullRestoreOperation.restore(RocksDBFullRestoreOperation.java:142)
        at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:284)
        at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:587)
        at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:93)
        at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:328)
        at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
        at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
        at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:345)
        at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:163)
        at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:272)
        at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:425)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$2(StreamTask.java:535)
        at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:525)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:565)
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:755)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:570)
        at java.base/java.lang.Thread.run(Thread.java:834)
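  
  The top frames show `DataInputStream.readFully` reaching the end of the stream while a byte array is being deserialized. As a minimal sketch of that failure mode only, assuming a length-prefixed layout (an int length followed by that many bytes), the class below reads a complete record and a truncated one. `TruncatedReadDemo`, `encode`, `decode` and `failsWithEof` are hypothetical names used for illustration; they are not part of Flink or the streaming-updater, and the sketch makes no claim about why the savepoint data ended early.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.Arrays;

class TruncatedReadDemo {
    // Hypothetical helper: writes an int length followed by the payload bytes.
    static byte[] encode(byte[] payload) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(payload.length);
            out.write(payload);
        }
        return bytes.toByteArray();
    }

    // Hypothetical helper: reads the length, then exactly that many bytes.
    static byte[] decode(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int length = in.readInt();
            byte[] payload = new byte[length];
            in.readFully(payload); // throws EOFException if fewer bytes remain
            return payload;
        }
    }

    // Returns true when decoding the given bytes ends in an EOFException.
    static boolean failsWithEof(byte[] data) throws IOException {
        try {
            decode(data);
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] complete = encode(new byte[] {1, 2, 3});
        // Drop the last byte to simulate a record that was cut short.
        byte[] truncated = Arrays.copyOf(complete, complete.length - 1);
        System.out.println("complete fails: " + failsWithEof(complete));
        System.out.println("truncated fails: " + failsWithEof(truncated));
    }
}
```

  In this sketch the complete record decodes normally and only the truncated one raises `java.io.EOFException`, because the declared length is larger than the bytes actually left in the stream.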
  
  notes:
  
  - the problem occurs when using two different savepoints:
    - (thanos) `rdf-streaming-updater-codfw/commons/savepoints/deploy_0_3_103/savepoint-c4b021-9c4cd6541ec7/`
    - (thanos) `rdf-streaming-updater-codfw/commons/savepoints/savepoint-c4b021-818dd669a47e/`
  - the exception is only visible on a taskmanager pod running on `kubernetes2003` (logs <https://logstash.wikimedia.org/app/discover#/?_g=(filters:!(),query:(language:lucene,query:kubernetes2003),refreshInterval:(pause:!t,value:0),time:(from:'2022-02-22T16:00:00.000Z',to:'2022-02-22T19:30:00.000Z'))&_a=(columns:!(log,host,message,kubernetes.pod_id),filters:!(),index:'logstash-*',interval:auto,query:(language:lucene,query:'kubernetes.master_url:%22https:%2F%2Fkubemaster.svc.codfw.wmnet%22%20AND%20kubernetes.namespace_name:%22rdf-streaming-updater%22%20AND%20java.io.EOFException%20AND%20kubernetes.labels.component:taskmanager'),sort:!())>)
  - the problem also occurs when loading the savepoints with the previous version (0.3.99)
  - the deploy worked fine on staging for both wdqs and wcqs, and also for wcqs at codfw
  - the system was able to resume using 0.3.99 and a previous checkpoint `rdf-streaming-updater-codfw/wikidata/checkpoints/e245dd1e76d56d9ded351b27cf2d4c2a/chk-415014`.
  
  AC:
  
  - understand the root cause of the failure

TASK DETAIL
  https://phabricator.wikimedia.org/T302396

To: dcausse
Cc: Aklapper, dcausse, MPhamWMF, CBogen, Namenlos314, Gq86, 
Lucas_Werkmeister_WMDE, EBjune, merbst, Jonas, Xmlizer, jkroll, Wikidata-bugs, 
Jdouglas, aude, Tobias1984, Manybubbles
_______________________________________________
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org