[jira] [Commented] (SPARK-26359) Spark checkpoint restore fails after query restart
[ https://issues.apache.org/jira/browse/SPARK-26359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733935#comment-16733935 ]

Kaspar Tint commented on SPARK-26359:
-------------------------------------

[~gsomogyi]: Yes, this ticket can be closed. The solution mentioned here should be enough to work around the issue.

> Spark checkpoint restore fails after query restart
> --------------------------------------------------
>
>                 Key: SPARK-26359
>                 URL: https://issues.apache.org/jira/browse/SPARK-26359
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit, Structured Streaming
>    Affects Versions: 2.4.0
>         Environment: Spark 2.4.0 deployed in standalone-client mode
> Checkpointing is done to S3
> The Spark application in question is responsible for running 4 different queries
> Queries are written using Structured Streaming
> We are using the following algorithm in hopes of better performance:
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2" # When the default is 1
>            Reporter: Kaspar Tint
>            Priority: Major
>         Attachments: driver-redacted, metadata, redacted-offsets, state-redacted, worker-redacted
>
> We had an incident where one of our structured streaming queries could not be restarted after a usual S3 checkpointing failure. Now to clarify before everything else - we are aware of the issues with S3 and are working towards moving to HDFS, but this will take time. S3 will cause queries to fail quite often during peak hours, and we have separate logic to handle this that will attempt to restart the failed queries if possible.
> In this particular case we could not restart one of the failed queries. It seems like between detecting a failure in the query and starting it up again something went really wrong with Spark, and the state in the checkpoint folder got corrupted for some reason.
> The issue starts with the usual *FileNotFoundException* that happens with S3
> {code:java}
> 2018-12-10 21:09:25.785 ERROR MicroBatchExecution: Query feedback [id = c074233a-2563-40fc-8036-b5e38e2e2c42, runId = e607eb6e-8431-4269-acab-cc2c1f9f09dd] terminated with error
> java.io.FileNotFoundException: No such file or directory: s3a://some.domain/spark/checkpoints/49/feedback/offsets/.37348.8227943f-a848-4af5-b5bf-1fea81775b24.tmp
>         at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255)
>         at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149)
>         at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088)
>         at org.apache.hadoop.fs.FileSystem.getFileLinkStatus(FileSystem.java:2715)
>         at org.apache.hadoop.fs.DelegateToFileSystem.getFileLinkStatus(DelegateToFileSystem.java:131)
>         at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:726)
>         at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:699)
>         at org.apache.hadoop.fs.FileContext.rename(FileContext.java:965)
>         at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.renameTempFile(CheckpointFileManager.scala:331)
>         at org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:147)
>         at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataLog.scala:126)
>         at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112)
>         at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
>         at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
>         at scala.Option.getOrElse(Option.scala:121)
>         at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
>         at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply$mcV$sp(MicroBatchExecution.scala:382)
>         at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381)
>         at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381)
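For context on the committer setting quoted in the environment above: to my understanding, algorithm version 2 moves task output into the final location at task commit rather than renaming it during job commit, which is what makes it faster but gives up some of the atomicity that the rename-based checkpoint pattern relies on. A minimal conf sketch, using only values that appear in this report:

```
# Hadoop's default is "1"; "2" trades commit atomicity for speed.
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2"
```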
[jira] [Commented] (SPARK-26396) Kafka consumer cache overflow since 2.4.x
[ https://issues.apache.org/jira/browse/SPARK-26396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725685#comment-16725685 ]

Kaspar Tint commented on SPARK-26396:
-------------------------------------

Yes, the issue can be closed. I do think the documentation about this setting could clarify what was said here, though.

> Kafka consumer cache overflow since 2.4.x
> -----------------------------------------
>
>                 Key: SPARK-26396
>                 URL: https://issues.apache.org/jira/browse/SPARK-26396
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.0
>         Environment: Spark 2.4 standalone client mode
>            Reporter: Kaspar Tint
>            Priority: Major
>
> We are experiencing an issue where the Kafka consumer cache seems to overflow constantly upon starting the application. This issue appeared after upgrading to Spark 2.4.
> We would get constant warnings like this:
> {code:java}
> 18/12/18 07:03:29 WARN KafkaDataConsumer: KafkaConsumer cache hitting max capacity of 180, removing consumer for CacheKey(spark-kafka-source-6f66e0d2-beaf-4ff2-ade8-8996611de6ae--1081651087-executor,kafka-topic-76)
> 18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max capacity of 180, removing consumer for CacheKey(spark-kafka-source-6f66e0d2-beaf-4ff2-ade8-8996611de6ae--1081651087-executor,kafka-topic-30)
> 18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max capacity of 180, removing consumer for CacheKey(spark-kafka-source-f41d1f9e-1700-4994-9d26-2b9c0ee57881--215746753-executor,kafka-topic-57)
> 18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max capacity of 180, removing consumer for CacheKey(spark-kafka-source-f41d1f9e-1700-4994-9d26-2b9c0ee57881--215746753-executor,kafka-topic-43)
> {code}
> This application is running 4 different Spark Structured Streaming queries against the same Kafka topic that has 90 partitions. We used to run it with just the default settings, so it defaulted to cache size 64 on Spark 2.3, but now we tried to put it to 180 or 360. With 360 we get a lot less noise about the overflow, but the resource need increases substantially.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26396) Kafka consumer cache overflow since 2.4.x
[ https://issues.apache.org/jira/browse/SPARK-26396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725104#comment-16725104 ]

Kaspar Tint commented on SPARK-26396:
-------------------------------------

So in case there are four group ids all consuming the same topic with 90 partitions, 360 for one JVM should be correct, right?
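The sizing question above can be sketched as simple arithmetic. A minimal sketch, assuming (as the discussion suggests) that each Structured Streaming query can hold one cached consumer per topic partition within a single executor JVM; the function name is hypothetical:

```python
def required_consumer_cache_capacity(num_queries: int, partitions_per_topic: int) -> int:
    """Rough upper bound on cached Kafka consumers in one executor JVM.

    Assumption (from this thread): each Structured Streaming query keeps
    at most one cached consumer per topic partition, so the worst case
    for a single JVM is queries * partitions.
    """
    return num_queries * partitions_per_topic

# Four queries over the same 90-partition topic: the 360 figure from
# this thread.
print(required_consumer_cache_capacity(4, 90))
```

With this estimate, the reporter's observation makes sense: a capacity of 180 covers only half the worst case, so evictions (and the WARN noise) persist, while 360 silences them at the cost of keeping more consumers alive.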
[jira] [Comment Edited] (SPARK-26396) Kafka consumer cache overflow since 2.4.x
[ https://issues.apache.org/jira/browse/SPARK-26396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725050#comment-16725050 ]

Kaspar Tint edited comment on SPARK-26396 at 12/19/18 2:39 PM:
---------------------------------------------------------------

Any exact formula to use for this when considering that the application can have many different queries? We don't need that many executors in dev, for instance, but in production we indeed have plenty of them.

was (Author: tint):
Any exact formula to use for this when considering that the application can have many different queries?
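On the formula question asked above: the rule of thumb that emerges in this thread is queries × partitions per executor JVM, independent of how many executors you run, since the cache is per JVM. The capacity itself is a Spark conf; the key below is my understanding of what the Spark 2.4 Kafka source reads (its default of 64 matches the report), but treat it as an assumption and verify it against your Spark version:

```
# Assumed conf key for the executor-side Kafka consumer cache in Spark 2.4.
# 4 queries x 90 partitions on one topic -> up to 360 cached consumers per JVM.
spark.sql.kafkaConsumerCache.capacity: "360"
```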
[jira] [Commented] (SPARK-26396) Kafka consumer cache overflow since 2.4.x
[ https://issues.apache.org/jira/browse/SPARK-26396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725050#comment-16725050 ]

Kaspar Tint commented on SPARK-26396:
-------------------------------------

Any exact formula to use for this when considering that the application can have many different queries?
[jira] [Created] (SPARK-26396) Kafka consumer cache overflow since 2.4.x
Kaspar Tint created SPARK-26396:
-----------------------------------

             Summary: Kafka consumer cache overflow since 2.4.x
                 Key: SPARK-26396
                 URL: https://issues.apache.org/jira/browse/SPARK-26396
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 2.4.0
         Environment: Spark 2.4 standalone client mode
            Reporter: Kaspar Tint

We are experiencing an issue where the Kafka consumer cache seems to overflow constantly upon starting the application. This issue appeared after upgrading to Spark 2.4.

We would get constant warnings like this:

{code:java}
18/12/18 07:03:29 WARN KafkaDataConsumer: KafkaConsumer cache hitting max capacity of 180, removing consumer for CacheKey(spark-kafka-source-6f66e0d2-beaf-4ff2-ade8-8996611de6ae--1081651087-executor,kafka-topic-76)
18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max capacity of 180, removing consumer for CacheKey(spark-kafka-source-6f66e0d2-beaf-4ff2-ade8-8996611de6ae--1081651087-executor,kafka-topic-30)
18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max capacity of 180, removing consumer for CacheKey(spark-kafka-source-f41d1f9e-1700-4994-9d26-2b9c0ee57881--215746753-executor,kafka-topic-57)
18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max capacity of 180, removing consumer for CacheKey(spark-kafka-source-f41d1f9e-1700-4994-9d26-2b9c0ee57881--215746753-executor,kafka-topic-43)
{code}

This application is running 4 different Spark Structured Streaming queries against the same Kafka topic that has 90 partitions. We used to run it with just the default settings, so it defaulted to cache size 64 on Spark 2.3, but now we tried to put it to 180 or 360. With 360 we get a lot less noise about the overflow, but the resource need increases substantially.
[jira] [Commented] (SPARK-26359) Spark checkpoint restore fails after query restart
[ https://issues.apache.org/jira/browse/SPARK-26359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721213#comment-16721213 ]

Kaspar Tint commented on SPARK-26359:
-------------------------------------

It means we changed the configuration *spark.sql.streaming.checkpointLocation: "s3a://some.domain/spark/checkpoints/49"* to *spark.sql.streaming.checkpointLocation: "s3a://some.domain/spark/checkpoints/50"* and restarted the whole Spark application. This obviously results in data loss and is not a preferred solution.
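The workaround described above amounts to pointing the application at a fresh, empty checkpoint root and accepting the loss of accumulated offsets and state. In the doc's own conf style, with the paths taken from this thread:

```
# Old root containing the corrupted checkpoint state:
# spark.sql.streaming.checkpointLocation: "s3a://some.domain/spark/checkpoints/49"
# New empty root; all queries restart from scratch (hence the data loss):
spark.sql.streaming.checkpointLocation: "s3a://some.domain/spark/checkpoints/50"
```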
[jira] [Commented] (SPARK-26359) Spark checkpoint restore fails after query restart
[ https://issues.apache.org/jira/browse/SPARK-26359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721107#comment-16721107 ]

Kaspar Tint commented on SPARK-26359:
-------------------------------------

This seems to be related to the fact that we decided to try *spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version* 2 instead of the default 1. It gives much better performance, but it looks like this issue is the tradeoff then? Do you have any suggestions on how to recover from this in a sane way?
[jira] [Updated] (SPARK-26359) Spark checkpoint restore fails after query restart
[ https://issues.apache.org/jira/browse/SPARK-26359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kaspar Tint updated SPARK-26359:
--------------------------------
    Environment:
Spark 2.4.0 deployed in standalone-client mode
Checkpointing is done to S3
The Spark application in question is responsible for running 4 different queries
Queries are written using Structured Streaming
We are using the following algorithm in hopes of better performance:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2" # When the default is 1

  was:
Spark 2.4.0 deployed in standalone-client mode
Checkpointing is done to S3
The Spark application in question is responsible for running 4 different queries
Queries are written using Structured Streaming
[jira] [Updated] (SPARK-26359) Spark checkpoint restore fails after query restart
[ https://issues.apache.org/jira/browse/SPARK-26359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kaspar Tint updated SPARK-26359:
--------------------------------
    Attachment: worker-redacted
                state-redacted
                redacted-offsets
                metadata
                driver-redacted
[jira] [Created] (SPARK-26359) Spark checkpoint restore fails after query restart
Kaspar Tint created SPARK-26359:
---

             Summary: Spark checkpoint restore fails after query restart
                 Key: SPARK-26359
                 URL: https://issues.apache.org/jira/browse/SPARK-26359
             Project: Spark
          Issue Type: Bug
          Components: Spark Submit, Structured Streaming
    Affects Versions: 2.4.0
         Environment: Spark 2.4.0 deployed in standalone-client mode. Checkpointing is done to S3. The Spark application in question is responsible for running 4 different queries, all written using Structured Streaming.
            Reporter: Kaspar Tint

We had an incident where one of our structured streaming queries could not be restarted after a usual S3 checkpointing failure. To clarify up front: we are aware of the issues with S3 and are working towards moving to HDFS, but this will take time. S3 causes queries to fail quite often during peak hours, and we have separate logic that attempts to restart the failed queries when possible.

In this particular case we could not restart one of the failed queries. It seems that between detecting the failure and starting the query up again, something went wrong inside Spark and the state in the checkpoint folder got corrupted.
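The restart logic the reporter describes (detect a failed query, try to restart it, give up after repeated failures so an engineer can step in) can be sketched roughly as follows. This is a hypothetical illustration, not the reporter's actual code; `supervise`, `run_query`, and the retry budget are assumed names.

```python
import time

def supervise(run_query, max_restarts=5, backoff_seconds=0):
    """Run `run_query` until it succeeds or the restart budget is exhausted.

    `run_query` is expected to raise on failure (e.g. a transient S3 error).
    Returns the number of attempts made; re-raises once the budget runs out.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            run_query()
            return attempts          # query finished cleanly
        except Exception:
            if attempts > max_restarts:
                raise                # give up: needs manual intervention
            time.sleep(backoff_seconds)
```

Note the failure mode in this ticket: when the checkpoint itself is corrupted, every restart attempt fails the same way, so a loop like this spins until the budget is exhausted.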
The issue starts with the usual *FileNotFoundException* that happens with S3:

{code:java}
2018-12-10 21:09:25.785 ERROR MicroBatchExecution: Query feedback [id = c074233a-2563-40fc-8036-b5e38e2e2c42, runId = e607eb6e-8431-4269-acab-cc2c1f9f09dd] terminated with error
java.io.FileNotFoundException: No such file or directory: s3a://some.domain/spark/checkpoints/49/feedback/offsets/.37348.8227943f-a848-4af5-b5bf-1fea81775b24.tmp
	at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088)
	at org.apache.hadoop.fs.FileSystem.getFileLinkStatus(FileSystem.java:2715)
	at org.apache.hadoop.fs.DelegateToFileSystem.getFileLinkStatus(DelegateToFileSystem.java:131)
	at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:726)
	at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:699)
	at org.apache.hadoop.fs.FileContext.rename(FileContext.java:965)
	at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.renameTempFile(CheckpointFileManager.scala:331)
	at org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:147)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataLog.scala:126)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply$mcV$sp(MicroBatchExecution.scala:382)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381)
	at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
	at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply$mcZ$sp(MicroBatchExecution.scala:381)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:337)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:337)
{code}
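The trace above fails inside {{renameTempFile}}: Spark's checkpoint file manager writes each batch to a hidden dot-prefixed .tmp file and then renames it into place, so readers only ever see fully written batch files. Rename is atomic on HDFS and local filesystems, but S3 implements it as a copy plus delete, so a failure mid-commit can leave the .tmp file behind with no final file. A minimal, hypothetical simulation of that commit pattern on a local filesystem (the function and parameter names are assumptions, not Spark's API):

```python
import os

def commit_batch(directory, batch_id, payload, fail_rename=False):
    """Write-temp-then-rename commit: readers never see a half-written file.

    `fail_rename=True` simulates a failure at the rename step, the situation
    the reporter hit on S3: the hidden .tmp remains, the final file is absent.
    """
    tmp = os.path.join(directory, ".%d.tmp" % batch_id)  # hidden temp file
    final = os.path.join(directory, "%d" % batch_id)
    with open(tmp, "w") as f:
        f.write(payload)
    if fail_rename:
        raise IOError("simulated S3 failure during rename")
    os.rename(tmp, final)  # atomic on POSIX filesystems
    return final
```

This is why the checkpoint directory in the report contained an orphaned `.37348.….tmp` but no committed offsets file for that batch.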
[jira] [Comment Edited] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart
[ https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718912#comment-16718912 ] Kaspar Tint edited comment on SPARK-25136 at 12/12/18 12:26 PM:

Does it make sense for our S3 case to be a separate issue then?

was (Author: tint): Does it make sense for our S3 case to be a separate issue then? And if so, is it possible to provide logs in some safe manner, or do I need to redact all sensitive information beforehand?

> unable to use HDFS checkpoint directories after driver restart
> --
>
> Key: SPARK-25136
> URL: https://issues.apache.org/jira/browse/SPARK-25136
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.3.0
> Reporter: Robert Reid
> Priority: Major
>
> I have been attempting to work around the issue first discussed at the end of SPARK-20894. The problem we are encountering is the inability of the Spark Driver to process additional data after restart. Upon restart it reprocesses batches from the initial run, but when processing the first batch it crashes because of missing checkpoint files.
> We have a structured streaming job running on a Spark Standalone cluster. We restart the Spark Driver to renew the Kerberos token used to access HDFS. Excerpts of log lines from a run are included below. In short, when the new driver starts it reports that it is going to use a new checkpoint directory in HDFS. Workers report that they have loaded a StateStoreProvider that matches the directory, but then a worker reports that it cannot read the delta files. This causes the driver to crash.
> The HDFS directories are present but empty. Further, the directory permissions for the original and new checkpoint directories are the same. The worker never crashes.
> As mentioned in SPARK-20894, deleting the _spark_metadata directory makes subsequent restarts succeed.
> Here is a timeline of log records from a recent run.
> A new run began at 00:29:21. These entries from a worker log look good.
> {{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of HDFSStateStoreProvider[id = (op=0,part=0),dir = hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0] for update}}
> {{18/08/16 00:30:23 INFO HDFSBackedStateStoreProvider: Committed version 1 for HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0] to file hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0/1.delta}}
> As the shutdown is occurring the worker reports:
> {{18/08/16 00:39:11 INFO HDFSBackedStateStoreProvider: Aborted version 29 for HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]}}
> The restart began at 00:39:38.
> Driver log entries:
> {{18/08/16 00:39:51 INFO MicroBatchExecution: Starting [id = e188d15f-e26a-48fd-9ce6-8c57ce53c2c1, runId = b7ee0163-47db-4392-ab66-94d36ce63074]. Use hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040 to store the query checkpoint.}}
> {{18/08/16 00:40:26 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 10, 10.251.104.164, executor 3): java.lang.IllegalStateException: Error reading delta file hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0,part=0),dir = hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0]: hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta does not exist}}
> {{Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta}}
> Worker log entries:
> {{18/08/16 00:40:26 INFO StateStore: Reported that the loaded instance
[jira] [Commented] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart
[ https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718912#comment-16718912 ] Kaspar Tint commented on SPARK-25136:
-
Does it make sense for our S3 case to be a separate issue then? And if so is it possible to provide logs in some safe manner or do I need to redact all sensitive information before.
[jira] [Comment Edited] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart
[ https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718732#comment-16718732 ] Kaspar Tint edited comment on SPARK-25136 at 12/12/18 10:18 AM:

Yes... we are sadly aware of this also. Slowly transitioning to HDFS. But the specific issue at hand here seems like it is not S3-specific?

was (Author: tint): Yes... we are sadly aware of this also. Slowly transitioning to HDFS. But the issue seems like not S3 specific here ?
[jira] [Commented] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart
[ https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718732#comment-16718732 ] Kaspar Tint commented on SPARK-25136:
-
Yes... we are sadly aware of this also. Slowly transitioning to HDFS. But the issue seems like not S3 specific here ?
[jira] [Comment Edited] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart
[ https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718719#comment-16718719 ] Kaspar Tint edited comment on SPARK-25136 at 12/12/18 10:09 AM:

Looks like we bumped into a similar issue with S3 checkpointing. A bunch of queries failed because of the usual S3 issues, and when restarting one of the failed queries it ran into a problem fetching the delta file required to continue processing. It threw a

{code:java}
IllegalStateException: Error reading delta file s3://some.domain/spark/checkpoints/49/feedback/state/2/11/36870.delta
{code}

But when looking into the S3 bucket, it looks like there actually was a delta file, but I guess it was not finished yet? So it was a dot file with a .tmp suffix that had not been renamed to 36870.delta yet. The query would never be able to recover from this, and our application logic would just keep restarting it until an engineer stepped in and bumped the checkpoint version manually. We can create a safety net for this issue meanwhile and delete the metadata manually if it appears, but it would be nice to have more clarity into what is going on here, why it happened, and whether it can be fixed on the Spark side.

was (Author: tint): Looks like we bumped into a similar issue with S3 checkpointing. A bunch of queries failed because of usual S3 issues and when restarting one of the failed queries... it ran into a problem fetching the deltafile required to continue processing. It threw a {code:java} IllegalStateException: Error reading delta file s3://some.domain/spark/checkpoints/49/feedback/state/2/11/36870.delta{code} But when looking into the S3 bucket, it looks like there actually was a delta file but I guess it was not finished yet? [s3://some.domain/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp|s3://com.twilio.prod.cops-tooling/supernetwork/insights/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp] So it was a dot file with a .tmp suffix and had not been renamed to 36870.delta yet. The query would never be able to recover from this and our application logic would just keep restarting this query until an engineer stepped in and bumped the checkpoint version manually. We can create a safety net for this issue meanwhile and delete the metadata manually if such issue appears... but it would be nice to have more clarity into what is going on here, why it happened and if it can be fixed on Spark side?
[jira] [Comment Edited] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart
[ https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718719#comment-16718719 ] Kaspar Tint edited comment on SPARK-25136 at 12/12/18 10:10 AM: Looks like we bumped into a similar issue with S3 checkpointing. A bunch of queries failed because of usual S3 issues and when restarting one of the failed queries... it ran into a problem fetching the deltafile required to continue processing. It threw a {code:java} IllegalStateException: Error reading delta file s3://some.domain/spark/checkpoints/49/feedback/state/2/11/36870.delta{code} But when looking into the S3 bucket, it looks like there actually was a delta file but I guess it was not finished yet? {code:java} s3://some.domain/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp {code} So it was a dot file with a .tmp suffix and had not been renamed to 36870.delta yet. The query would never be able to recover from this and our application logic would just keep restarting this query until an engineer stepped in and bumped the checkpoint version manually. We can create a safety net for this issue meanwhile and delete the metadata manually if such issue appears... but it would be nice to have more clarity into what is going on here, why it happened and if it can be fixed on Spark side? was (Author: tint): Looks like we bumped into a similar issue with S3 checkpointing. A bunch of queries failed because of usual S3 issues and when restarting one of the failed queries... it ran into a problem fetching the deltafile required to continue processing. It threw a {code:java} IllegalStateException: Error reading delta file s3://some.domain/spark/checkpoints/49/feedback/state/2/11/36870.delta{code} But when looking into the S3 bucket, it looks like there actually was a delta file but I guess it was not finished yet? 
{code} s3://some.domain/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp {/code} So it was a dot file with a .tmp suffix and had not been renamed to 36870.delta yet. The query would never be able to recover from this and our application logic would just keep restarting this query until an engineer stepped in and bumped the checkpoint version manually. We can create a safety net for this issue meanwhile and delete the metadata manually if such issue appears... but it would be nice to have more clarity into what is going on here, why it happened and if it can be fixed on Spark side? > unable to use HDFS checkpoint directories after driver restart > -- > > Key: SPARK-25136 > URL: https://issues.apache.org/jira/browse/SPARK-25136 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Robert Reid >Priority: Major > > I have been attempting to work around the issue first discussed at the end of > SPARK-20894. The problem that we are encountering is the inability of the > Spark Driver to process additional data after restart. Upon restart it > reprocesses batches from the initial run. But when processing the first batch > it crashes because of missing checkpoint files. > We have a structured streaming job running on a Spark Standalone cluster. We > restart the Spark Driver to renew the Kerberos token in use to access HDFS. > Excerpts of log lines from a run are included below. In short, when the new > driver starts it reports that it is going to use a new checkpoint directory > in HDFS. Workers report that they have loaded a StateStoreProviders that > matches the directory. But then the worker reports that it cannot read the > delta files. This causes the driver to crash. > The HDFS directories are present but empty. Further, the directory > permissions for the original and new checkpoint directories are the same. The > worker never crashes. 
> As mentioned in SPARK-20894, deleting the _spark_metadata directory makes > subsequent restarts succeed. > Here is a timeline of log records from a recent run. > A new run began at 00:29:21. These entries from a worker log look good. > {{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of > HDFSStateStoreProvider[id = (op=0,part=0),dir = > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0] > for update}} > {{18/08/16 00:30:23 INFO HDFSBackedStateStoreProvider: Committed version 1 > for >
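The stranded `.36870.delta.<uuid>.tmp` object matches the write-to-temp-then-rename commit pattern visible in the stack trace above (`RenameBasedFSDataOutputStream` / `renameTempFile`): state is written to a hidden dot `.tmp` file and only renamed to `N.delta` when the commit succeeds. A minimal local-filesystem sketch of that pattern, not Spark's actual `CheckpointFileManager` code and with illustrative file names:

```python
import os
import tempfile
import uuid

def commit_delta(checkpoint_dir: str, version: int, data: bytes,
                 fail_before_rename: bool = False) -> str:
    """Write state to a hidden temp file, then rename it into place.

    The committed name (e.g. 36870.delta) only appears once the rename
    succeeds. On a filesystem with atomic rename (HDFS, local disk) a
    crash leaves at most a stray .tmp dot file; on S3 the "rename" is a
    copy+delete, so a failure can leave the .tmp object behind with no
    committed delta at all - which is the state observed in the bucket.
    """
    final_path = os.path.join(checkpoint_dir, f"{version}.delta")
    tmp_path = os.path.join(checkpoint_dir, f".{version}.delta.{uuid.uuid4()}.tmp")
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    if fail_before_rename:
        # Simulate the failure from the report: the temp file exists but
        # was never renamed, so a reader of {version}.delta finds nothing.
        return tmp_path
    os.replace(tmp_path, final_path)  # atomic on POSIX local filesystems
    return final_path

if __name__ == "__main__":
    d = tempfile.mkdtemp()
    ok = commit_delta(d, 36869, b"state")
    stuck = commit_delta(d, 36870, b"state", fail_before_rename=True)
    print(os.path.basename(ok))                       # 36869.delta
    print(os.path.basename(stuck).endswith(".tmp"))   # True
```

Under this pattern a reader asking for version 36870 fails exactly as in the `IllegalStateException` above: the temp object exists, but the committed name was never created.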
[jira] [Commented] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart
[ https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718719#comment-16718719 ] Kaspar Tint commented on SPARK-25136: - Looks like we bumped into a similar issue with S3 checkpointing. A bunch of queries failed because of the usual S3 issues, and when restarting one of the failed queries it ran into a problem fetching the delta file required to continue processing. It threw {code:java} IllegalStateException: Error reading delta file s3://some.domain/spark/checkpoints/49/feedback/state/2/11/36870.delta{code} But when looking into the S3 bucket, a delta file actually was there - it just had not been finished yet: {code} s3://some.domain/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp {code} So it was a dot file with a .tmp suffix that had not been renamed to 36870.delta yet. The query could never recover from this, and our application logic would just keep restarting it until an engineer stepped in and bumped the checkpoint version manually. We can build a safety net for this in the meantime and delete the metadata manually when the issue appears, but it would be nice to have more clarity into what is going on here, why it happened, and whether it can be fixed on the Spark side. > unable to use HDFS checkpoint directories after driver restart > -- > > Key: SPARK-25136 > URL: https://issues.apache.org/jira/browse/SPARK-25136 > Project: Spark > Issue Type: Bug > Components: Structured Streaming > Affects Versions: 2.3.0 > Reporter: Robert Reid > Priority: Major > > I have been attempting to work around the issue first discussed at the end of SPARK-20894. The problem that we are encountering is the inability of the Spark Driver to process additional data after restart. Upon restart it reprocesses batches from the initial run. But when processing the first batch it crashes because of missing checkpoint files. > We have a structured streaming job running on a Spark Standalone cluster. We restart the Spark Driver to renew the Kerberos token in use to access HDFS. Excerpts of log lines from a run are included below. In short, when the new driver starts it reports that it is going to use a new checkpoint directory in HDFS. Workers report that they have loaded a StateStoreProvider that matches the directory. But then the worker reports that it cannot read the delta files. This causes the driver to crash. > The HDFS directories are present but empty. Further, the directory permissions for the original and new checkpoint directories are the same. The worker never crashes. > As mentioned in SPARK-20894, deleting the _spark_metadata directory makes subsequent restarts succeed. > Here is a timeline of log records from a recent run. > A new run began at 00:29:21. These entries from a worker log look good. 
> {{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of > HDFSStateStoreProvider[id = (op=0,part=0),dir = > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0] > for update}} > {{18/08/16 00:30:23 INFO HDFSBackedStateStoreProvider: Committed version 1 > for > HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0] > to file > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0/1.delta}} > As the shutdown is occurring the worker reports > {{18/08/16 00:39:11 INFO HDFSBackedStateStoreProvider: Aborted version 29 for > HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]}} > The restart began at 00:39:38. > Driver log entries > {{18/08/16 00:39:51 INFO MicroBatchExecution: Starting [id = > e188d15f-e26a-48fd-9ce6-8c57ce53c2c1, runId = > b7ee0163-47db-4392-ab66-94d36ce63074]. Use > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040 > to store the query checkpoint.}} > {{18/08/16 00:40:26 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 10, >
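The "safety net" mentioned in the comment - spotting checkpoint metadata left behind by an interrupted commit before the query restarts - could be sketched roughly as below. This is a hypothetical helper, not part of Spark; only the `.N.delta.<uuid>.tmp` file-name shape is taken from the path observed in the bucket.

```python
import os
import re

# Matches the hidden temp files an interrupted commit leaves behind,
# e.g. ".36870.delta.02e75d1f-....TID2469780.tmp" (pattern assumed from
# the path seen in the S3 bucket; capture group 1 is the version number).
TMP_DELTA = re.compile(r"^\.(\d+)\.delta\..*\.tmp$")

def find_orphaned_tmp_deltas(state_dir: str) -> list[str]:
    """Return temp delta files whose committed N.delta never appeared.

    If version N exists only as a dot .tmp file, the rename-based commit
    was interrupted and the checkpoint cannot serve that version; a
    pre-restart sweep like this lets an operator (or automation) clean
    up instead of looping on restart failures.
    """
    names = set(os.listdir(state_dir))
    orphans = []
    for name in names:
        m = TMP_DELTA.match(name)
        if m and f"{m.group(1)}.delta" not in names:
            orphans.append(os.path.join(state_dir, name))
    return orphans
```

Whether the right remediation is deleting the orphan and replaying from the last committed batch, or failing fast for an engineer, depends on the pipeline; the manual fix described above was to bump the checkpoint version.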
[jira] [Commented] (SPARK-10881) Unable to use custom log4j appender in spark executor
[ https://issues.apache.org/jira/browse/SPARK-10881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477398#comment-16477398 ] Kaspar Tint commented on SPARK-10881: - This seems to be an issue with Spark 2.2.X as well. We did not experience it with Spark 2.1.X, though. > Unable to use custom log4j appender in spark executor > - > > Key: SPARK-10881 > URL: https://issues.apache.org/jira/browse/SPARK-10881 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 1.3.1 > Reporter: David Moravek > Priority: Minor > > In CoarseGrainedExecutorBackend, log4j is initialized before the user classpath gets registered: > https://github.com/apache/spark/blob/v1.3.1/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L126 > In order to use a custom appender, one has to distribute it using `spark-submit --files` and set it via spark.executor.extraClassPath -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org