[ https://issues.apache.org/jira/browse/FLINK-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tzu-Li (Gordon) Tai reopened FLINK-7590:
----------------------------------------

> Flink failed to flush and close the file system output stream for 
> checkpointing because of s3 read timeout
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-7590
>                 URL: https://issues.apache.org/jira/browse/FLINK-7590
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.3.2
>            Reporter: Bowen Li
>            Priority: Major
>             Fix For: 1.4.0, 1.3.4
>
>
> A Flink job failed once over the weekend because of the following issue. It 
> recovered on its own afterwards and has been running well since, but the 
> issue might be worth taking a look at.
> {code:java}
> 2017-09-03 13:18:38,998 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - reduce 
> (14/18) (c97256badc87e995d456e7a13cec5de9) switched from RUNNING to FAILED.
> AsynchronousException{java.lang.Exception: Could not materialize checkpoint 
> 163 for operator reduce (14/18).}
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:970)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.Exception: Could not materialize checkpoint 163 for 
> operator reduce (14/18).
>       ... 6 more
> Caused by: java.util.concurrent.ExecutionException: java.io.IOException: 
> Could not flush and close the file system output stream to 
> s3://xxx/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa in order to obtain the 
> stream state handle
>       at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>       at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>       at 
> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:897)
>       ... 5 more
>       Suppressed: java.lang.Exception: Could not properly cancel managed 
> keyed state future.
>               at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:90)
>               at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:1023)
>               at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:961)
>               ... 5 more
>       Caused by: java.util.concurrent.ExecutionException: 
> java.io.IOException: Could not flush and close the file system output stream 
> to s3://xxx/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa in order to obtain 
> the stream state handle
>               at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>               at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>               at 
> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>               at 
> org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.java:85)
>               at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:88)
>               ... 7 more
>       Caused by: java.io.IOException: Could not flush and close the file 
> system output stream to s3://xxx/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa 
> in order to obtain the stream state handle
>               at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:336)
>               at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBFullSnapshotOperation.closeSnapshotStreamAndGetHandle(RocksDBKeyedStateBackend.java:693)
>               at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBFullSnapshotOperation.closeCheckpointStream(RocksDBKeyedStateBackend.java:531)
>               at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$3.performOperation(RocksDBKeyedStateBackend.java:420)
>               at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$3.performOperation(RocksDBKeyedStateBackend.java:399)
>               at 
> org.apache.flink.runtime.io.async.AbstractAsyncIOCallable.call(AbstractAsyncIOCallable.java:72)
>               at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>               at 
> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40)
>               at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:897)
>               ... 5 more
>       Caused by: com.amazonaws.AmazonClientException: Unable to unmarshall 
> response (Failed to parse XML document with handler class 
> com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$CompleteMultipartUploadHandler).
>  Response Code: 200, Response Text: OK
>               at 
> com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738)
>               at 
> com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399)
>               at 
> com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
>               at 
> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
>               at 
> com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:2524)
>               at 
> com.amazonaws.services.s3.transfer.internal.UploadMonitor.completeMultipartUpload(UploadMonitor.java:236)
>               at 
> com.amazonaws.services.s3.transfer.internal.UploadMonitor.poll(UploadMonitor.java:183)
>               at 
> com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:152)
>               at 
> com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:50)
>               ... 4 more
>       Caused by: com.amazonaws.AmazonClientException: Failed to parse XML 
> document with handler class 
> com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$CompleteMultipartUploadHandler
>               at 
> com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:150)
>               at 
> com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseCompleteMultipartUploadResponse(XmlResponsesSaxParser.java:425)
>               at 
> com.amazonaws.services.s3.model.transform.Unmarshallers$CompleteMultipartUploadResultUnmarshaller.unmarshall(Unmarshallers.java:200)
>               at 
> com.amazonaws.services.s3.model.transform.Unmarshallers$CompleteMultipartUploadResultUnmarshaller.unmarshall(Unmarshallers.java:197)
>               at 
> com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
>               at 
> com.amazonaws.services.s3.internal.ResponseHeaderHandlerChain.handle(ResponseHeaderHandlerChain.java:44)
>               at 
> com.amazonaws.services.s3.internal.ResponseHeaderHandlerChain.handle(ResponseHeaderHandlerChain.java:30)
>               at 
> com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:712)
>               ... 12 more
>       Caused by: java.net.SocketTimeoutException: Read timed out
>               at java.net.SocketInputStream.socketRead0(Native Method)
>               at 
> java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>               at java.net.SocketInputStream.read(SocketInputStream.java:171)
>               at java.net.SocketInputStream.read(SocketInputStream.java:141)
>               at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
>               at sun.security.ssl.InputRecord.read(InputRecord.java:503)
>               at 
> sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
>               at 
> sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
>               at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
>               at 
> org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
>               at 
> org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
>               at 
> org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
>               at 
> org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:266)
>               at 
> org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:227)
>               at 
> org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:186)
>               at 
> org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
>               at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
>               at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
>               at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>               at java.io.InputStreamReader.read(InputStreamReader.java:184)
>               at java.io.BufferedReader.fill(BufferedReader.java:161)
>               at java.io.BufferedReader.read1(BufferedReader.java:212)
>               at java.io.BufferedReader.read(BufferedReader.java:286)
>               at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
>               at org.apache.xerces.impl.XMLEntityScanner.skipSpaces(Unknown 
> Source)
>               at 
> org.apache.xerces.impl.XMLDocumentScannerImpl$TrailingMiscDispatcher.dispatch(Unknown
>  Source)
>               at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>               at org.apache.xerces.parsers.XML11Configuration.parse(Unknown 
> Source)
>               at org.apache.xerces.parsers.XML11Configuration.parse(Unknown 
> Source)
>               at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>               at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown 
> Source)
>               at 
> com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:141)
>               ... 19 more
>       [CIRCULAR REFERENCE:java.io.IOException: Could not flush and close the 
> file system output stream to 
> s3://xxx/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa in order to obtain the 
> stream state handle]
> 2017-09-03 13:18:39,000 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job 
> com.offerup.stream_processing.item_view_stats.ItemViewStatsStreamingApp 
> (aac822203a47d504ecd9b73a77c60cd5) switched from state RUNNING to FAILING.
> AsynchronousException{java.lang.Exception: Could not materialize checkpoint 
> 163 for operator reduce (14/18).}
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:970)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.Exception: Could not materialize checkpoint 163 for 
> operator reduce (14/18).
>       ... 6 more
> Caused by: java.util.concurrent.ExecutionException: java.io.IOException: 
> Could not flush and close the file system output stream to 
> s3://xxx/aac822203a47d504ecd9b73a77c60cd5/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa
>  in order to obtain the stream state handle
>       at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>       at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>       at 
> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:897)
>       ... 5 more
>       Suppressed: java.lang.Exception: Could not properly cancel managed 
> keyed state future.
>               at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:90)
>               at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:1023)
>               at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:961)
>               ... 5 more
>       Caused by: java.util.concurrent.ExecutionException: 
> java.io.IOException: Could not flush and close the file system output stream 
> to s3://xxxx/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa in order to obtain 
> the stream state handle
>               at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>               at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>               at 
> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>               at 
> org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.java:85)
>               at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:88)
>               ... 7 more
>       Caused by: java.io.IOException: Could not flush and close the file 
> system output stream to 
> s3://xxx/aac822203a47d504ecd9b73a77c60cd5/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa
>  in order to obtain the stream state handle
>               at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:336)
>               at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBFullSnapshotOperation.closeSnapshotStreamAndGetHandle(RocksDBKeyedStateBackend.java:693)
>               at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBFullSnapshotOperation.closeCheckpointStream(RocksDBKeyedStateBackend.java:531)
>               at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$3.performOperation(RocksDBKeyedStateBackend.java:420)
>               at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$3.performOperation(RocksDBKeyedStateBackend.java:399)
>               at 
> org.apache.flink.runtime.io.async.AbstractAsyncIOCallable.call(AbstractAsyncIOCallable.java:72)
>               at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>               at 
> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40)
>               at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:897)
>               ... 5 more
>       Caused by: com.amazonaws.AmazonClientException: Unable to unmarshall 
> response (Failed to parse XML document with handler class 
> com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$CompleteMultipartUploadHandler).
>  Response Code: 200, Response Text: OK
>               at 
> com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738)
>               at 
> com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399)
>               at 
> com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
>               at 
> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
>               at 
> com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:2524)
>               at 
> com.amazonaws.services.s3.transfer.internal.UploadMonitor.completeMultipartUpload(UploadMonitor.java:236)
>               at 
> com.amazonaws.services.s3.transfer.internal.UploadMonitor.poll(UploadMonitor.java:183)
>               at 
> com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:152)
>               at 
> com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:50)
>               ... 4 more
>       Caused by: com.amazonaws.AmazonClientException: Failed to parse XML 
> document with handler class 
> com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$CompleteMultipartUploadHandler
>               at 
> com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:150)
>               at 
> com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseCompleteMultipartUploadResponse(XmlResponsesSaxParser.java:425)
>               at 
> com.amazonaws.services.s3.model.transform.Unmarshallers$CompleteMultipartUploadResultUnmarshaller.unmarshall(Unmarshallers.java:200)
>               at 
> com.amazonaws.services.s3.model.transform.Unmarshallers$CompleteMultipartUploadResultUnmarshaller.unmarshall(Unmarshallers.java:197)
>               at 
> com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
>               at 
> com.amazonaws.services.s3.internal.ResponseHeaderHandlerChain.handle(ResponseHeaderHandlerChain.java:44)
>               at 
> com.amazonaws.services.s3.internal.ResponseHeaderHandlerChain.handle(ResponseHeaderHandlerChain.java:30)
>               at 
> com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:712)
>               ... 12 more
>       Caused by: java.net.SocketTimeoutException: Read timed out
>               at java.net.SocketInputStream.socketRead0(Native Method)
>               at 
> java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>               at java.net.SocketInputStream.read(SocketInputStream.java:171)
>               at java.net.SocketInputStream.read(SocketInputStream.java:141)
>               at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
>               at sun.security.ssl.InputRecord.read(InputRecord.java:503)
>               at 
> sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
>               at 
> sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
>               at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
>               at 
> org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
>               at 
> org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
>               at 
> org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
>               at 
> org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:266)
>               at 
> org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:227)
>               at 
> org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:186)
>               at 
> org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
>               at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
>               at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
>               at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>               at java.io.InputStreamReader.read(InputStreamReader.java:184)
>               at java.io.BufferedReader.fill(BufferedReader.java:161)
>               at java.io.BufferedReader.read1(BufferedReader.java:212)
>               at java.io.BufferedReader.read(BufferedReader.java:286)
>               at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
>               at org.apache.xerces.impl.XMLEntityScanner.skipSpaces(Unknown 
> Source)
>               at 
> org.apache.xerces.impl.XMLDocumentScannerImpl$TrailingMiscDispatcher.dispatch(Unknown
>  Source)
>               at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>               at org.apache.xerces.parsers.XML11Configuration.parse(Unknown 
> Source)
>               at org.apache.xerces.parsers.XML11Configuration.parse(Unknown 
> Source)
>               at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>               at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown 
> Source)
>               at 
> com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:141)
>               ... 19 more
>       [CIRCULAR REFERENCE:java.io.IOException: Could not flush and close the 
> file system output stream to 
> s3://xxx/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa in order to obtain the 
> stream state handle]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
