We have had similar issues in the past (e.g. FLINK-24923 [1], FLINK-10683 [2]). The error you're observing is caused by an unexpected byte being read from the socket. The BlobServer protocol expects either 0 (for put messages) or 1 (for get messages) as the header byte of each new message block [3]. Reading a different value, such as the 71 in your stack trace, might mean that some other process is sending data to the port the BlobServer is listening on. Could you check your network traffic?
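
To make the header check described above concrete, here is a minimal sketch in Java. It is not the actual BlobServerConnection code: the class name BlobHeaderCheck and the method describe are hypothetical and exist only for illustration. The only values taken from the protocol are 0 (put) and 1 (get). The sketch also prints the ASCII character for an unexpected byte, since a printable character can hint that a text-based protocol is hitting the port. For example, 71 is the ASCII code for 'G', which would be consistent with (but does not prove) something like an HTTP GET request reaching the blob server port.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Flink.
final class BlobHeaderCheck {
    // Header values as described in the reply: 0 = put, 1 = get.
    static final int PUT_OPERATION = 0;
    static final int GET_OPERATION = 1;

    /** Describes the first byte read from a new connection. */
    static String describe(int firstByte) {
        if (firstByte == PUT_OPERATION) {
            return "PUT";
        }
        if (firstByte == GET_OPERATION) {
            return "GET";
        }
        // Printable ASCII suggests a text protocol rather than a blob client.
        if (firstByte >= 0x20 && firstByte < 0x7f) {
            return "unknown operation " + firstByte
                    + " (printable ASCII '" + (char) firstByte + "')";
        }
        return "unknown operation " + firstByte;
    }

    public static void main(String[] args) {
        System.out.println(describe(0));
        System.out.println(describe(1));
        System.out.println(describe(71));

        // First byte of a plain-text HTTP request line, for comparison.
        byte[] http = "GET / HTTP/1.1\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(describe(http[0] & 0xff));
    }
}
```

If the unexpected byte in your logs is always the same printable character, it may be worth checking whether a health probe, metrics scraper, or load balancer is configured against the blob server port.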

Matthias

[1] https://issues.apache.org/jira/browse/FLINK-24923
[2] https://issues.apache.org/jira/browse/FLINK-10683
[3] https://github.com/apache/flink/blob/ab264e4ab5a3bc6961a5128b1c7e19752508a7ca/flink-runtime/src/main/java/org/apache/flink/runtime/blob/BlobServerConnection.java#L115

On Fri, Jan 20, 2023 at 11:26 PM Yang Liu <y....@fetchrewards.com> wrote:

> Hello,
>
> Is anyone familiar with the "blob server connection"? We have constantly
> been seeing the "Error while executing Blob connection" error, which
> sometimes causes a job stuck in the middle of a run if there are too many
> connection errors and eventually causes a failure, though most of the time
> the streaming run mode can recover from that failure in the subsequent
> iterations of runs, but that slows down the entire process. We tried
> adjusting the blob.fetch.num-concurrent and some other blob parameters, but
> it was not very helpful, so we want to know what might be the root cause of
> the issue. Are there any Flink metrics or tools to help us monitor the blob
> server connections?
>
> We use:
>
> - Flink Kubernetes Operator
> - Flink 1.15.3 and 1.16.0
> - Kafka, filesystem(S3)
> - Hudi 0.11.1
>
> Full error message:
>
> java.io.IOException: Unknown operation 71
>     at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:116) [flink-dist-1.15.3.jar:1.15.3]
> 2023-01-19 16:44:37,448 ERROR org.apache.flink.runtime.blob.BlobServerConnection [] - Error while executing BLOB connection.
>
> Best regards,
> Yang