[jira] [Commented] (SPARK-26359) Spark checkpoint restore fails after query restart

2019-01-04 Thread Kaspar Tint (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733935#comment-16733935
 ] 

Kaspar Tint commented on SPARK-26359:
-

[~gsomogyi]: Yes, this ticket can be closed. The solution mentioned here should 
be enough to work around the issue.

> Spark checkpoint restore fails after query restart
> --
>
> Key: SPARK-26359
> URL: https://issues.apache.org/jira/browse/SPARK-26359
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 deployed in standalone-client mode
> Checkpointing is done to S3
> The Spark application in question is responsible for running 4 different 
> queries
> Queries are written using Structured Streaming
> We are using the following algorithm in hopes of better performance:
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2" # When the 
> default is 1
>Reporter: Kaspar Tint
>Priority: Major
> Attachments: driver-redacted, metadata, redacted-offsets, 
> state-redacted, worker-redacted
>
>
> We had an incident where one of our structured streaming queries could not be 
> restarted after a usual S3 checkpointing failure. To clarify before anything 
> else: we are aware of the issues with S3 and are working towards moving to 
> HDFS, but this will take time. S3 causes queries to fail quite often during 
> peak hours, and we have separate logic that will attempt to restart the 
> failed queries if possible.
> In this particular case we could not restart one of the failed queries. It 
> seems that between detecting the failure in the query and starting it up 
> again, something went really wrong with Spark and the state in the checkpoint 
> folder got corrupted for some reason.
> The issue starts with the usual *FileNotFoundException* that happens with S3
> {code:java}
> 2018-12-10 21:09:25.785 ERROR MicroBatchExecution: Query feedback [id = 
> c074233a-2563-40fc-8036-b5e38e2e2c42, runId = 
> e607eb6e-8431-4269-acab-cc2c1f9f09dd]
> terminated with error
> java.io.FileNotFoundException: No such file or directory: 
> s3a://some.domain/spark/checkpoints/49/feedback/offsets/.37
> 348.8227943f-a848-4af5-b5bf-1fea81775b24.tmp
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088)
> at 
> org.apache.hadoop.fs.FileSystem.getFileLinkStatus(FileSystem.java:2715)
> at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileLinkStatus(DelegateToFileSystem.java:131)
> at 
> org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:726)
> at 
> org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:699)
> at org.apache.hadoop.fs.FileContext.rename(FileContext.java:965)
> at 
> org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.renameTempFile(CheckpointFileManager.scala:331)
> at 
> org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:147)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataL
> og.scala:126)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply$mcV$sp(MicroBatchExecution.scala:382)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381)
> at 
> 

[jira] [Commented] (SPARK-26396) Kafka consumer cache overflow since 2.4.x

2018-12-20 Thread Kaspar Tint (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725685#comment-16725685
 ] 

Kaspar Tint commented on SPARK-26396:
-

Yes, the issue can be closed. I do think the documentation about this setting 
could clarify what was said here, though.
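
For reference, the knob being discussed appears to be the consumer cache of the 
Structured Streaming Kafka source. Below is a minimal sketch of raising it, 
assuming the setting involved is *spark.sql.kafkaConsumer.cache.capacity* (the 
soft limit that defaults to 64 in Spark 2.4); the application name and local 
master are illustrative only:

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: raise the soft cap on cached Kafka consumers per executor JVM.
// 360 mirrors the 4-queries-times-90-partitions case discussed in this ticket.
val spark = SparkSession.builder()
  .appName("structured-streaming-app")  // hypothetical application name
  .master("local[*]")                   // the real app gets its master from spark-submit
  .config("spark.sql.kafkaConsumer.cache.capacity", "360")
  .getOrCreate()
{code}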

> Kafka consumer cache overflow since 2.4.x
> -
>
> Key: SPARK-26396
> URL: https://issues.apache.org/jira/browse/SPARK-26396
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark 2.4 standalone client mode
>Reporter: Kaspar Tint
>Priority: Major
>
> We are experiencing an issue where the Kafka consumer cache seems to overflow 
> constantly upon starting the application. This issue appeared after upgrading 
> to Spark 2.4.
> We would get constant warnings like this:
> {code:java}
> 18/12/18 07:03:29 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-6f66e0d2-beaf-4ff2-ade8-8996611de6ae--1081651087-executor,kafka-topic-76)
> 18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-6f66e0d2-beaf-4ff2-ade8-8996611de6ae--1081651087-executor,kafka-topic-30)
> 18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-f41d1f9e-1700-4994-9d26-2b9c0ee57881--215746753-executor,kafka-topic-57)
> 18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-f41d1f9e-1700-4994-9d26-2b9c0ee57881--215746753-executor,kafka-topic-43)
> {code}
> This application runs 4 different Spark Structured Streaming queries against 
> the same Kafka topic, which has 90 partitions. We used to run it with just 
> the default settings, so it defaulted to a cache size of 64 on Spark 2.3, but 
> now we have tried setting it to 180 or 360. With 360 there is a lot less 
> noise about the overflow, but the resource need increases substantially.






[jira] [Commented] (SPARK-26396) Kafka consumer cache overflow since 2.4.x

2018-12-19 Thread Kaspar Tint (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725104#comment-16725104
 ] 

Kaspar Tint commented on SPARK-26396:
-

So in case there are four group ids, all consuming the same topic with 90 
partitions, 360 for one JVM should be correct, right?
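
As a rough sanity check of that arithmetic, here is a sketch under the 
assumption that, in the worst case, a single executor JVM ends up reading every 
partition for every query's consumer group:

{code:scala}
// Each query uses its own consumer group, and consumers are cached per
// (group id, topic partition), so the worst case for a single JVM is:
val queries    = 4
val partitions = 90
val worstCasePerJvm = queries * partitions  // = 360
{code}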

> Kafka consumer cache overflow since 2.4.x
> -
>
> Key: SPARK-26396
> URL: https://issues.apache.org/jira/browse/SPARK-26396
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark 2.4 standalone client mode
>Reporter: Kaspar Tint
>Priority: Major
>
> We are experiencing an issue where the Kafka consumer cache seems to overflow 
> constantly upon starting the application. This issue appeared after upgrading 
> to Spark 2.4.
> We would get constant warnings like this:
> {code:java}
> 18/12/18 07:03:29 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-6f66e0d2-beaf-4ff2-ade8-8996611de6ae--1081651087-executor,kafka-topic-76)
> 18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-6f66e0d2-beaf-4ff2-ade8-8996611de6ae--1081651087-executor,kafka-topic-30)
> 18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-f41d1f9e-1700-4994-9d26-2b9c0ee57881--215746753-executor,kafka-topic-57)
> 18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-f41d1f9e-1700-4994-9d26-2b9c0ee57881--215746753-executor,kafka-topic-43)
> {code}
> This application runs 4 different Spark Structured Streaming queries against 
> the same Kafka topic, which has 90 partitions. We used to run it with just 
> the default settings, so it defaulted to a cache size of 64 on Spark 2.3, but 
> now we have tried setting it to 180 or 360. With 360 there is a lot less 
> noise about the overflow, but the resource need increases substantially.






[jira] [Comment Edited] (SPARK-26396) Kafka consumer cache overflow since 2.4.x

2018-12-19 Thread Kaspar Tint (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725050#comment-16725050
 ] 

Kaspar Tint edited comment on SPARK-26396 at 12/19/18 2:39 PM:
---

Is there an exact formula to use for this, considering that the application can 
have many different queries? We don't need that many executors in dev, for 
instance, but in production we indeed have plenty of them.
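
There does not appear to be an exact documented formula; below is a rough 
sketch of the reasoning, assuming each executor only caches consumers for the 
partitions it is currently assigned. The executor counts are hypothetical:

{code:scala}
// Assumed rule of thumb, not an official formula: per-JVM demand grows with the
// number of partitions that land on that executor, so a small dev cluster can
// need a cache nearly as large as production even though it runs fewer JVMs.
val queries         = 4
val totalPartitions = 90

def perJvmNeed(executors: Int): Int =
  queries * math.ceil(totalPartitions.toDouble / executors).toInt

val devNeed  = perJvmNeed(3)   // hypothetical dev cluster:  4 * 30 = 120
val prodNeed = perJvmNeed(30)  // hypothetical prod cluster: 4 * 3  = 12
{code}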


was (Author: tint):
Any exact formula to use for this when considering that the application can 
have many different queries? 

> Kafka consumer cache overflow since 2.4.x
> -
>
> Key: SPARK-26396
> URL: https://issues.apache.org/jira/browse/SPARK-26396
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark 2.4 standalone client mode
>Reporter: Kaspar Tint
>Priority: Major
>
> We are experiencing an issue where the Kafka consumer cache seems to overflow 
> constantly upon starting the application. This issue appeared after upgrading 
> to Spark 2.4.
> We would get constant warnings like this:
> {code:java}
> 18/12/18 07:03:29 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-6f66e0d2-beaf-4ff2-ade8-8996611de6ae--1081651087-executor,kafka-topic-76)
> 18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-6f66e0d2-beaf-4ff2-ade8-8996611de6ae--1081651087-executor,kafka-topic-30)
> 18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-f41d1f9e-1700-4994-9d26-2b9c0ee57881--215746753-executor,kafka-topic-57)
> 18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-f41d1f9e-1700-4994-9d26-2b9c0ee57881--215746753-executor,kafka-topic-43)
> {code}
> This application runs 4 different Spark Structured Streaming queries against 
> the same Kafka topic, which has 90 partitions. We used to run it with just 
> the default settings, so it defaulted to a cache size of 64 on Spark 2.3, but 
> now we have tried setting it to 180 or 360. With 360 there is a lot less 
> noise about the overflow, but the resource need increases substantially.






[jira] [Commented] (SPARK-26396) Kafka consumer cache overflow since 2.4.x

2018-12-19 Thread Kaspar Tint (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725050#comment-16725050
 ] 

Kaspar Tint commented on SPARK-26396:
-

Is there an exact formula to use for this, considering that the application can 
have many different queries?

> Kafka consumer cache overflow since 2.4.x
> -
>
> Key: SPARK-26396
> URL: https://issues.apache.org/jira/browse/SPARK-26396
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark 2.4 standalone client mode
>Reporter: Kaspar Tint
>Priority: Major
>
> We are experiencing an issue where the Kafka consumer cache seems to overflow 
> constantly upon starting the application. This issue appeared after upgrading 
> to Spark 2.4.
> We would get constant warnings like this:
> {code:java}
> 18/12/18 07:03:29 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-6f66e0d2-beaf-4ff2-ade8-8996611de6ae--1081651087-executor,kafka-topic-76)
> 18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-6f66e0d2-beaf-4ff2-ade8-8996611de6ae--1081651087-executor,kafka-topic-30)
> 18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-f41d1f9e-1700-4994-9d26-2b9c0ee57881--215746753-executor,kafka-topic-57)
> 18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-f41d1f9e-1700-4994-9d26-2b9c0ee57881--215746753-executor,kafka-topic-43)
> {code}
> This application runs 4 different Spark Structured Streaming queries against 
> the same Kafka topic, which has 90 partitions. We used to run it with just 
> the default settings, so it defaulted to a cache size of 64 on Spark 2.3, but 
> now we have tried setting it to 180 or 360. With 360 there is a lot less 
> noise about the overflow, but the resource need increases substantially.






[jira] [Created] (SPARK-26396) Kafka consumer cache overflow since 2.4.x

2018-12-18 Thread Kaspar Tint (JIRA)
Kaspar Tint created SPARK-26396:
---

 Summary: Kafka consumer cache overflow since 2.4.x
 Key: SPARK-26396
 URL: https://issues.apache.org/jira/browse/SPARK-26396
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
 Environment: Spark 2.4 standalone client mode
Reporter: Kaspar Tint


We are experiencing an issue where the Kafka consumer cache seems to overflow 
constantly upon starting the application. This issue appeared after upgrading 
to Spark 2.4.

We would get constant warnings like this:
{code:java}
18/12/18 07:03:29 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
capacity of 180, removing consumer for 
CacheKey(spark-kafka-source-6f66e0d2-beaf-4ff2-ade8-8996611de6ae--1081651087-executor,kafka-topic-76)
18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
capacity of 180, removing consumer for 
CacheKey(spark-kafka-source-6f66e0d2-beaf-4ff2-ade8-8996611de6ae--1081651087-executor,kafka-topic-30)
18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
capacity of 180, removing consumer for 
CacheKey(spark-kafka-source-f41d1f9e-1700-4994-9d26-2b9c0ee57881--215746753-executor,kafka-topic-57)
18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
capacity of 180, removing consumer for 
CacheKey(spark-kafka-source-f41d1f9e-1700-4994-9d26-2b9c0ee57881--215746753-executor,kafka-topic-43)
{code}

This application runs 4 different Spark Structured Streaming queries against 
the same Kafka topic, which has 90 partitions. We used to run it with just the 
default settings, so it defaulted to a cache size of 64 on Spark 2.3, but now 
we have tried setting it to 180 or 360. With 360 there is a lot less noise 
about the overflow, but the resource need increases substantially.







[jira] [Commented] (SPARK-26359) Spark checkpoint restore fails after query restart

2018-12-14 Thread Kaspar Tint (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721213#comment-16721213
 ] 

Kaspar Tint commented on SPARK-26359:
-

It means we changed the configuration *spark.sql.streaming.checkpointLocation: 
"s3a://some.domain/spark/checkpoints/49"* to 
*spark.sql.streaming.checkpointLocation: 
"s3a://some.domain/spark/checkpoints/50"* and restarted the whole Spark 
application. This obviously results in data loss and is not a preferred solution.
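
In other words, the query was restarted against a fresh checkpoint root so that 
the corrupted offsets and state were simply abandoned. A minimal sketch of that 
workaround, assuming the location is supplied through the global 
*spark.sql.streaming.checkpointLocation* setting as in the configuration quoted 
above (the application name and local master are illustrative only):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: pointing the application at a new checkpoint root ("/50" instead
// of "/49") discards the corrupted offsets/state, which is why progress is lost.
val spark = SparkSession.builder()
  .appName("structured-streaming-app")  // hypothetical application name
  .master("local[*]")                   // the real app gets its master from spark-submit
  .config("spark.sql.streaming.checkpointLocation",
          "s3a://some.domain/spark/checkpoints/50")
  .getOrCreate()
{code}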

> Spark checkpoint restore fails after query restart
> --
>
> Key: SPARK-26359
> URL: https://issues.apache.org/jira/browse/SPARK-26359
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 deployed in standalone-client mode
> Checkpointing is done to S3
> The Spark application in question is responsible for running 4 different 
> queries
> Queries are written using Structured Streaming
> We are using the following algorithm in hopes of better performance:
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2" # When the 
> default is 1
>Reporter: Kaspar Tint
>Priority: Major
> Attachments: driver-redacted, metadata, redacted-offsets, 
> state-redacted, worker-redacted
>
>
> We had an incident where one of our structured streaming queries could not be 
> restarted after a usual S3 checkpointing failure. To clarify before anything 
> else: we are aware of the issues with S3 and are working towards moving to 
> HDFS, but this will take time. S3 causes queries to fail quite often during 
> peak hours, and we have separate logic that will attempt to restart the 
> failed queries if possible.
> In this particular case we could not restart one of the failed queries. It 
> seems that between detecting the failure in the query and starting it up 
> again, something went really wrong with Spark and the state in the checkpoint 
> folder got corrupted for some reason.
> The issue starts with the usual *FileNotFoundException* that happens with S3
> {code:java}
> 2018-12-10 21:09:25.785 ERROR MicroBatchExecution: Query feedback [id = 
> c074233a-2563-40fc-8036-b5e38e2e2c42, runId = 
> e607eb6e-8431-4269-acab-cc2c1f9f09dd]
> terminated with error
> java.io.FileNotFoundException: No such file or directory: 
> s3a://some.domain/spark/checkpoints/49/feedback/offsets/.37
> 348.8227943f-a848-4af5-b5bf-1fea81775b24.tmp
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088)
> at 
> org.apache.hadoop.fs.FileSystem.getFileLinkStatus(FileSystem.java:2715)
> at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileLinkStatus(DelegateToFileSystem.java:131)
> at 
> org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:726)
> at 
> org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:699)
> at org.apache.hadoop.fs.FileContext.rename(FileContext.java:965)
> at 
> org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.renameTempFile(CheckpointFileManager.scala:331)
> at 
> org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:147)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataL
> og.scala:126)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply$mcV$sp(MicroBatchExecution.scala:382)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381)
> at 
> 

[jira] [Commented] (SPARK-26359) Spark checkpoint restore fails after query restart

2018-12-14 Thread Kaspar Tint (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721107#comment-16721107
 ] 

Kaspar Tint commented on SPARK-26359:
-

This seems to be related to the fact that we decided to try 
*spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version* 2 instead of the 
default 1. It gives much better performance, but it looks like this issue is 
the tradeoff then?

Do you have any suggestions on how to recover from this in a sane way?
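
For context, this is ordinary Hadoop output-committer configuration forwarded 
by Spark: with algorithm version 2, task commits move output directly to the 
final destination instead of relying on the serial renames done at job commit 
with version 1, which is where the performance difference comes from. A sketch 
of how the two variants would be selected (application name and local master 
are illustrative only):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: the committer algorithm is plain Hadoop configuration, passed
// through Spark via the "spark.hadoop." prefix.
val spark = SparkSession.builder()
  .appName("structured-streaming-app")  // hypothetical application name
  .master("local[*]")                   // the real app gets its master from spark-submit
  // Default behaviour: output is renamed into place during job commit.
  //.config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
  // Variant used in this ticket: task commit moves files straight to the destination.
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()
{code}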

> Spark checkpoint restore fails after query restart
> --
>
> Key: SPARK-26359
> URL: https://issues.apache.org/jira/browse/SPARK-26359
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 deployed in standalone-client mode
> Checkpointing is done to S3
> The Spark application in question is responsible for running 4 different 
> queries
> Queries are written using Structured Streaming
> We are using the following algorithm in hopes of better performance:
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2" # When the 
> default is 1
>Reporter: Kaspar Tint
>Priority: Major
> Attachments: driver-redacted, metadata, redacted-offsets, 
> state-redacted, worker-redacted
>
>
> We had an incident where one of our structured streaming queries could not be 
> restarted after a usual S3 checkpointing failure. To clarify before anything 
> else: we are aware of the issues with S3 and are working towards moving to 
> HDFS, but this will take time. S3 causes queries to fail quite often during 
> peak hours, and we have separate logic that will attempt to restart the 
> failed queries if possible.
> In this particular case we could not restart one of the failed queries. It 
> seems that between detecting the failure in the query and starting it up 
> again, something went really wrong with Spark and the state in the checkpoint 
> folder got corrupted for some reason.
> The issue starts with the usual *FileNotFoundException* that happens with S3
> {code:java}
> 2018-12-10 21:09:25.785 ERROR MicroBatchExecution: Query feedback [id = 
> c074233a-2563-40fc-8036-b5e38e2e2c42, runId = 
> e607eb6e-8431-4269-acab-cc2c1f9f09dd]
> terminated with error
> java.io.FileNotFoundException: No such file or directory: 
> s3a://some.domain/spark/checkpoints/49/feedback/offsets/.37
> 348.8227943f-a848-4af5-b5bf-1fea81775b24.tmp
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088)
> at 
> org.apache.hadoop.fs.FileSystem.getFileLinkStatus(FileSystem.java:2715)
> at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileLinkStatus(DelegateToFileSystem.java:131)
> at 
> org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:726)
> at 
> org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:699)
> at org.apache.hadoop.fs.FileContext.rename(FileContext.java:965)
> at 
> org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.renameTempFile(CheckpointFileManager.scala:331)
> at 
> org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:147)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataL
> og.scala:126)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply$mcV$sp(MicroBatchExecution.scala:382)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381)
> at 
> 

[jira] [Updated] (SPARK-26359) Spark checkpoint restore fails after query restart

2018-12-13 Thread Kaspar Tint (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaspar Tint updated SPARK-26359:

Environment: 
Spark 2.4.0 deployed in standalone-client mode
Checkpointing is done to S3
The Spark application in question is responsible for running 4 different queries
Queries are written using Structured Streaming

We are using the following algorithm in hopes of better performance:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2" # When the 
default is 1

  was:
Spark 2.4.0 deployed in standalone-client mode
Checkpointing is done to S3
The Spark application in question is responsible for running 4 different queries
Queries are written using Structured Streaming


> Spark checkpoint restore fails after query restart
> --
>
> Key: SPARK-26359
> URL: https://issues.apache.org/jira/browse/SPARK-26359
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 deployed in standalone-client mode
> Checkpointing is done to S3
> The Spark application in question is responsible for running 4 different 
> queries
> Queries are written using Structured Streaming
> We are using the following algorithm in hopes of better performance:
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2" # When the 
> default is 1
>Reporter: Kaspar Tint
>Priority: Major
> Attachments: driver-redacted, metadata, redacted-offsets, 
> state-redacted, worker-redacted
>
>
> We had an incident where one of our structured streaming queries could not be 
> restarted after a usual S3 checkpointing failure. To clarify before anything 
> else: we are aware of the issues with S3 and are working towards moving to 
> HDFS, but this will take time. S3 causes queries to fail quite often during 
> peak hours, and we have separate logic that will attempt to restart the 
> failed queries if possible.
> In this particular case we could not restart one of the failed queries. It 
> seems that between detecting the failure in the query and starting it up 
> again, something went really wrong with Spark and the state in the checkpoint 
> folder got corrupted for some reason.
> The issue starts with the usual *FileNotFoundException* that happens with S3
> {code:java}
> 2018-12-10 21:09:25.785 ERROR MicroBatchExecution: Query feedback [id = 
> c074233a-2563-40fc-8036-b5e38e2e2c42, runId = 
> e607eb6e-8431-4269-acab-cc2c1f9f09dd]
> terminated with error
> java.io.FileNotFoundException: No such file or directory: 
> s3a://some.domain/spark/checkpoints/49/feedback/offsets/.37
> 348.8227943f-a848-4af5-b5bf-1fea81775b24.tmp
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088)
> at 
> org.apache.hadoop.fs.FileSystem.getFileLinkStatus(FileSystem.java:2715)
> at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileLinkStatus(DelegateToFileSystem.java:131)
> at 
> org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:726)
> at 
> org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:699)
> at org.apache.hadoop.fs.FileContext.rename(FileContext.java:965)
> at 
> org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.renameTempFile(CheckpointFileManager.scala:331)
> at 
> org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:147)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataL
> og.scala:126)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply$mcV$sp(MicroBatchExecution.scala:382)
> at 
> 

[jira] [Updated] (SPARK-26359) Spark checkpoint restore fails after query restart

2018-12-13 Thread Kaspar Tint (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaspar Tint updated SPARK-26359:

Attachment: worker-redacted
state-redacted
redacted-offsets
metadata
driver-redacted

> Spark checkpoint restore fails after query restart
> --
>
> Key: SPARK-26359
> URL: https://issues.apache.org/jira/browse/SPARK-26359
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 deployed in standalone-client mode
> Checkpointing is done to S3
> The Spark application in question is responsible for running 4 different 
> queries
> Queries are written using Structured Streaming
>Reporter: Kaspar Tint
>Priority: Major
> Attachments: driver-redacted, metadata, redacted-offsets, 
> state-redacted, worker-redacted
>
>
> We had an incident where one of our structured streaming queries could not be 
> restarted after a usual S3 checkpointing failure. To clarify before anything 
> else: we are aware of the issues with S3 and are working towards moving to 
> HDFS, but this will take time. S3 causes queries to fail quite often during 
> peak hours, and we have separate logic that will attempt to restart the 
> failed queries if possible.
> In this particular case we could not restart one of the failed queries. It 
> seems that between detecting the failure in the query and starting it up 
> again, something went really wrong with Spark and the state in the checkpoint 
> folder got corrupted for some reason.
> The issue starts with the usual *FileNotFoundException* that happens with S3
> {code:java}
> 2018-12-10 21:09:25.785 ERROR MicroBatchExecution: Query feedback [id = 
> c074233a-2563-40fc-8036-b5e38e2e2c42, runId = 
> e607eb6e-8431-4269-acab-cc2c1f9f09dd]
> terminated with error
> java.io.FileNotFoundException: No such file or directory: 
> s3a://some.domain/spark/checkpoints/49/feedback/offsets/.37
> 348.8227943f-a848-4af5-b5bf-1fea81775b24.tmp
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088)
> at 
> org.apache.hadoop.fs.FileSystem.getFileLinkStatus(FileSystem.java:2715)
> at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileLinkStatus(DelegateToFileSystem.java:131)
> at 
> org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:726)
> at 
> org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:699)
> at org.apache.hadoop.fs.FileContext.rename(FileContext.java:965)
> at 
> org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.renameTempFile(CheckpointFileManager.scala:331)
> at 
> org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:147)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataL
> og.scala:126)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply$mcV$sp(MicroBatchExecution.scala:382)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381)
> at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
> at 
> 

[jira] [Created] (SPARK-26359) Spark checkpoint restore fails after query restart

2018-12-13 Thread Kaspar Tint (JIRA)
Kaspar Tint created SPARK-26359:
---

 Summary: Spark checkpoint restore fails after query restart
 Key: SPARK-26359
 URL: https://issues.apache.org/jira/browse/SPARK-26359
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit, Structured Streaming
Affects Versions: 2.4.0
 Environment: Spark 2.4.0 deployed in standalone-client mode
Checkpointing is done to S3
The Spark application in question is responsible for running 4 different queries
Queries are written using Structured Streaming
Reporter: Kaspar Tint


We had an incident where one of our structured streaming queries could not be 
restarted after a usual S3 checkpointing failure. To clarify before anything 
else: we are aware of the issues with S3 and are working towards moving to 
HDFS, but this will take time. S3 causes queries to fail quite often during 
peak hours, and we have separate logic that will attempt to restart the failed 
queries if possible.

In this particular case we could not restart one of the failed queries. It 
seems that between detecting the failure in the query and starting it up again, 
something went really wrong with Spark and the state in the checkpoint folder 
got corrupted for some reason.

The issue starts with the usual *FileNotFoundException* that happens with S3
{code:java}
2018-12-10 21:09:25.785 ERROR MicroBatchExecution: Query feedback [id = 
c074233a-2563-40fc-8036-b5e38e2e2c42, runId = 
e607eb6e-8431-4269-acab-cc2c1f9f09dd]
terminated with error
java.io.FileNotFoundException: No such file or directory: 
s3a://some.domain/spark/checkpoints/49/feedback/offsets/.37
348.8227943f-a848-4af5-b5bf-1fea81775b24.tmp
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088)
at 
org.apache.hadoop.fs.FileSystem.getFileLinkStatus(FileSystem.java:2715)
at 
org.apache.hadoop.fs.DelegateToFileSystem.getFileLinkStatus(DelegateToFileSystem.java:131)
at 
org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:726)
at 
org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:699)
at org.apache.hadoop.fs.FileContext.rename(FileContext.java:965)
at 
org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.renameTempFile(CheckpointFileManager.scala:331)
at 
org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:147)
at 
org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataL
og.scala:126)
at 
org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112)
at 
org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
at 
org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply$mcV$sp(MicroBatchExecution.scala:382)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381)
at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply$mcZ$sp(MicroBatchExecution.scala:381)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:337)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:337)
at 

[jira] [Comment Edited] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart

2018-12-12 Thread Kaspar Tint (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718912#comment-16718912
 ] 

Kaspar Tint edited comment on SPARK-25136 at 12/12/18 12:26 PM:


Does it make sense for our S3 case to be a separate issue then?


was (Author: tint):
Does it make sense for our S3 case to be a separate issue then? And if so is it 
possible to provide logs in some safe manner or do I need to redact all 
sensitive information before.

> unable to use HDFS checkpoint directories after driver restart
> --
>
> Key: SPARK-25136
> URL: https://issues.apache.org/jira/browse/SPARK-25136
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Robert Reid
>Priority: Major
>
> I have been attempting to work around the issue first discussed at the end of 
> SPARK-20894. The problem that we are encountering is the inability of the 
> Spark Driver to process additional data after restart. Upon restart it 
> reprocesses batches from the initial run. But when processing the first batch 
> it crashes because of missing checkpoint files.
> We have a structured streaming job running on a Spark Standalone cluster. We 
> restart the Spark Driver to renew the Kerberos token in use to access HDFS. 
> Excerpts of log lines from a run are included below. In short, when the new 
> driver starts it reports that it is going to use a new checkpoint directory 
> in HDFS. Workers report that they have loaded a StateStoreProvider that 
> matches the directory. But then the worker reports that it cannot read the 
> delta files. This causes the driver to crash.
> The HDFS directories are present but empty. Further, the directory 
> permissions for the original and new checkpoint directories are the same. The 
> worker never crashes.
> As mentioned in SPARK-20894, deleting the _spark_metadata directory makes 
> subsequent restarts succeed.
> Here is a timeline of log records from a recent run.
> A new run began at 00:29:21. These entries from a worker log look good.
> {{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of 
> HDFSStateStoreProvider[id = (op=0,part=0),dir = 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]
>  for update}}
> {{18/08/16 00:30:23 INFO HDFSBackedStateStoreProvider: Committed version 1 
> for 
> HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]
>  to file 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0/1.delta}}
> As the shutdown is occurring the worker reports
> {{18/08/16 00:39:11 INFO HDFSBackedStateStoreProvider: Aborted version 29 for 
> HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]}}
> The restart began at 00:39:38.
> Driver log entries
> {{18/08/16 00:39:51 INFO MicroBatchExecution: Starting [id = 
> e188d15f-e26a-48fd-9ce6-8c57ce53c2c1, runId = 
> b7ee0163-47db-4392-ab66-94d36ce63074]. Use 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040
>  to store the query checkpoint.}}
> {{18/08/16 00:40:26 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 10, 
> 10.251.104.164, executor 3): java.lang.IllegalStateException: Error reading 
> delta file 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta
>  of HDFSStateStoreProvider[id = (op=0,part=0),dir = 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0]:
>  
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta
>  does not exist}}
> {{Caused by: 
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File 
> does not exist: 
> /projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta}}
> 

[jira] [Commented] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart

2018-12-12 Thread Kaspar Tint (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718912#comment-16718912
 ] 

Kaspar Tint commented on SPARK-25136:
-

Does it make sense for our S3 case to be a separate issue then? And if so, is 
it possible to provide logs in some safe manner, or do I need to redact all 
sensitive information first?

> unable to use HDFS checkpoint directories after driver restart
> --
>
> Key: SPARK-25136
> URL: https://issues.apache.org/jira/browse/SPARK-25136
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Robert Reid
>Priority: Major
>
> I have been attempting to work around the issue first discussed at the end of 
> SPARK-20894. The problem that we are encountering is the inability of the 
> Spark Driver to process additional data after restart. Upon restart it 
> reprocesses batches from the initial run. But when processing the first batch 
> it crashes because of missing checkpoint files.
> We have a structured streaming job running on a Spark Standalone cluster. We 
> restart the Spark Driver to renew the Kerberos token in use to access HDFS. 
> Excerpts of log lines from a run are included below. In short, when the new 
> driver starts it reports that it is going to use a new checkpoint directory 
> in HDFS. Workers report that they have loaded a StateStoreProvider that 
> matches the directory. But then the worker reports that it cannot read the 
> delta files. This causes the driver to crash.
> The HDFS directories are present but empty. Further, the directory 
> permissions for the original and new checkpoint directories are the same. The 
> worker never crashes.
> As mentioned in SPARK-20894, deleting the _spark_metadata directory makes 
> subsequent restarts succeed.
> Here is a timeline of log records from a recent run.
> A new run began at 00:29:21. These entries from a worker log look good.
> {{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of 
> HDFSStateStoreProvider[id = (op=0,part=0),dir = 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]
>  for update}}
> {{18/08/16 00:30:23 INFO HDFSBackedStateStoreProvider: Committed version 1 
> for 
> HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]
>  to file 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0/1.delta}}
> As the shutdown is occurring the worker reports
> {{18/08/16 00:39:11 INFO HDFSBackedStateStoreProvider: Aborted version 29 for 
> HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]}}
> The restart began at 00:39:38.
> Driver log entries
> {{18/08/16 00:39:51 INFO MicroBatchExecution: Starting [id = 
> e188d15f-e26a-48fd-9ce6-8c57ce53c2c1, runId = 
> b7ee0163-47db-4392-ab66-94d36ce63074]. Use 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040
>  to store the query checkpoint.}}
> {{18/08/16 00:40:26 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 10, 
> 10.251.104.164, executor 3): java.lang.IllegalStateException: Error reading 
> delta file 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta
>  of HDFSStateStoreProvider[id = (op=0,part=0),dir = 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0]:
>  
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta
>  does not exist}}
> {{Caused by: 
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File 
> does not exist: 
> /projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta}}
>  
> Worker log entries
> {{18/08/16 00:40:26 INFO StateStore: Reported that the loaded instance 
> 

[jira] [Comment Edited] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart

2018-12-12 Thread Kaspar Tint (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718732#comment-16718732
 ] 

Kaspar Tint edited comment on SPARK-25136 at 12/12/18 10:18 AM:


Yes... we are sadly aware of this as well and are slowly transitioning to HDFS. 
But the specific issue at hand here does not seem to be S3 specific?


was (Author: tint):
Yes... we are sadly aware of this also. Slowly transitioning to HDFS. But the 
issue seems like not S3 specific here ?

> unable to use HDFS checkpoint directories after driver restart
> --
>
> Key: SPARK-25136
> URL: https://issues.apache.org/jira/browse/SPARK-25136
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Robert Reid
>Priority: Major
>
> I have been attempting to work around the issue first discussed at the end of 
> SPARK-20894. The problem that we are encountering is the inability of the 
> Spark Driver to process additional data after restart. Upon restart it 
> reprocesses batches from the initial run. But when processing the first batch 
> it crashes because of missing checkpoint files.
> We have a structured streaming job running on a Spark Standalone cluster. We 
> restart the Spark Driver to renew the Kerberos token in use to access HDFS. 
> Excerpts of log lines from a run are included below. In short, when the new 
> driver starts it reports that it is going to use a new checkpoint directory 
> in HDFS. Workers report that they have loaded a StateStoreProvider that 
> matches the directory. But then the worker reports that it cannot read the 
> delta files. This causes the driver to crash.
> The HDFS directories are present but empty. Further, the directory 
> permissions for the original and new checkpoint directories are the same. The 
> worker never crashes.
> As mentioned in SPARK-20894, deleting the _spark_metadata directory makes 
> subsequent restarts succeed.
> Here is a timeline of log records from a recent run.
> A new run began at 00:29:21. These entries from a worker log look good.
> {{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of 
> HDFSStateStoreProvider[id = (op=0,part=0),dir = 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]
>  for update}}
> {{18/08/16 00:30:23 INFO HDFSBackedStateStoreProvider: Committed version 1 
> for 
> HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]
>  to file 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0/1.delta}}
> As the shutdown is occurring the worker reports
> {{18/08/16 00:39:11 INFO HDFSBackedStateStoreProvider: Aborted version 29 for 
> HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]}}
> The restart began at 00:39:38.
> Driver log entries
> {{18/08/16 00:39:51 INFO MicroBatchExecution: Starting [id = 
> e188d15f-e26a-48fd-9ce6-8c57ce53c2c1, runId = 
> b7ee0163-47db-4392-ab66-94d36ce63074]. Use 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040
>  to store the query checkpoint.}}
> {{18/08/16 00:40:26 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 10, 
> 10.251.104.164, executor 3): java.lang.IllegalStateException: Error reading 
> delta file 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta
>  of HDFSStateStoreProvider[id = (op=0,part=0),dir = 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0]:
>  
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta
>  does not exist}}
> {{Caused by: 
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File 
> does not exist: 
> 

[jira] [Commented] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart

2018-12-12 Thread Kaspar Tint (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718732#comment-16718732
 ] 

Kaspar Tint commented on SPARK-25136:
-

Yes... we are sadly aware of this as well and are slowly transitioning to HDFS. 
But the issue here does not seem to be S3 specific?

> unable to use HDFS checkpoint directories after driver restart
> --
>
> Key: SPARK-25136
> URL: https://issues.apache.org/jira/browse/SPARK-25136
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Robert Reid
>Priority: Major
>
> I have been attempting to work around the issue first discussed at the end of 
> SPARK-20894. The problem that we are encountering is the inability of the 
> Spark Driver to process additional data after restart. Upon restart it 
> reprocesses batches from the initial run. But when processing the first batch 
> it crashes because of missing checkpoint files.
> We have a structured streaming job running on a Spark Standalone cluster. We 
> restart the Spark Driver to renew the Kerberos token in use to access HDFS. 
> Excerpts of log lines from a run are included below. In short, when the new 
> driver starts it reports that it is going to use a new checkpoint directory 
> in HDFS. Workers report that they have loaded a StateStoreProvider that 
> matches the directory. But then the worker reports that it cannot read the 
> delta files. This causes the driver to crash.
> The HDFS directories are present but empty. Further, the directory 
> permissions for the original and new checkpoint directories are the same. The 
> worker never crashes.
> As mentioned in SPARK-20894, deleting the _spark_metadata directory makes 
> subsequent restarts succeed.
> Here is a timeline of log records from a recent run.
> A new run began at 00:29:21. These entries from a worker log look good.
> {{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of 
> HDFSStateStoreProvider[id = (op=0,part=0),dir = 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]
>  for update}}
> {{18/08/16 00:30:23 INFO HDFSBackedStateStoreProvider: Committed version 1 
> for 
> HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]
>  to file 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0/1.delta}}
> As the shutdown is occurring the worker reports
> {{18/08/16 00:39:11 INFO HDFSBackedStateStoreProvider: Aborted version 29 for 
> HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]}}
> The restart began at 00:39:38.
> Driver log entries
> {{18/08/16 00:39:51 INFO MicroBatchExecution: Starting [id = 
> e188d15f-e26a-48fd-9ce6-8c57ce53c2c1, runId = 
> b7ee0163-47db-4392-ab66-94d36ce63074]. Use 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040
>  to store the query checkpoint.}}
> {{18/08/16 00:40:26 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 10, 
> 10.251.104.164, executor 3): java.lang.IllegalStateException: Error reading 
> delta file 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta
>  of HDFSStateStoreProvider[id = (op=0,part=0),dir = 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0]:
>  
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta
>  does not exist}}
> {{Caused by: 
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File 
> does not exist: 
> /projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta}}
>  
> Worker log entries
> {{18/08/16 00:40:26 INFO StateStore: Reported that the loaded instance 
> 

[jira] [Comment Edited] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart

2018-12-12 Thread Kaspar Tint (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718719#comment-16718719
 ] 

Kaspar Tint edited comment on SPARK-25136 at 12/12/18 10:09 AM:


Looks like we bumped into a similar issue with S3 checkpointing. A bunch of 
queries failed because of the usual S3 issues, and when restarting one of the 
failed queries it ran into a problem fetching the delta file required to 
continue processing.

It threw a
{code:java}
IllegalStateException: Error reading delta file 
s3://some.domain/spark/checkpoints/49/feedback/state/2/11/36870.delta{code}
But when looking into the S3 bucket, it looks like there actually was a delta 
file; I guess it just had not been finished yet.

So it was a dot file with a .tmp suffix that had not been renamed to 
36870.delta yet. The query was never able to recover from this, and our 
application logic would just keep restarting it until an engineer stepped in 
and bumped the checkpoint version manually.

We can build a safety net for this in the meantime and delete the metadata 
manually when such an issue appears, but it would be nice to have more clarity 
into what is going on here, why it happened and whether it can be fixed on the 
Spark side.
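
Roughly, the restart logic mentioned above amounts to a loop like the 
following (a minimal sketch, not the actual application code; 
{{startFeedbackQuery}} is a placeholder for however the query gets started):
{code:scala}
import org.apache.spark.sql.streaming.{StreamingQuery, StreamingQueryException}

// Restart the query whenever it dies with an exception. When the checkpoint
// itself is unreadable, as described in this ticket, this loops forever
// until an engineer intervenes.
def runWithRestarts(startFeedbackQuery: () => StreamingQuery): Unit = {
  var keepRunning = true
  while (keepRunning) {
    val query = startFeedbackQuery()
    try {
      query.awaitTermination() // returns normally only after query.stop()
      keepRunning = false      // clean shutdown requested, do not restart
    } catch {
      case _: StreamingQueryException =>
        // e.g. the usual S3 FileNotFoundException; start the query again
        // from the same checkpoint location.
    }
  }
}
{code}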


was (Author: tint):
Looks like we bumped into a similar issue with S3 checkpointing. A bunch of 
queries failed because of the usual S3 issues, and when restarting one of the 
failed queries it ran into a problem fetching the delta file required to 
continue processing.

It threw a
{code:java}
IllegalStateException: Error reading delta file 
s3://some.domain/spark/checkpoints/49/feedback/state/2/11/36870.delta{code}
But when looking into the S3 bucket, it looks like there actually was a delta 
file; I guess it just had not been finished yet:

[s3://some.domain/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp|s3://com.twilio.prod.cops-tooling/supernetwork/insights/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp]

So it was a dot file with a .tmp suffix that had not been renamed to 
36870.delta yet. The query was never able to recover from this, and our 
application logic would just keep restarting it until an engineer stepped in 
and bumped the checkpoint version manually.

We can build a safety net for this in the meantime and delete the metadata 
manually when such an issue appears, but it would be nice to have more clarity 
into what is going on here, why it happened and whether it can be fixed on the 
Spark side.

> unable to use HDFS checkpoint directories after driver restart
> --
>
> Key: SPARK-25136
> URL: https://issues.apache.org/jira/browse/SPARK-25136
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Robert Reid
>Priority: Major
>
> I have been attempting to work around the issue first discussed at the end of 
> SPARK-20894. The problem that we are encountering is the inability of the 
> Spark Driver to process additional data after restart. Upon restart it 
> reprocesses batches from the initial run. But when processing the first batch 
> it crashes because of missing checkpoint files.
> We have a structured streaming job running on a Spark Standalone cluster. We 
> restart the Spark Driver to renew the Kerberos token in use to access HDFS. 
> Excerpts of log lines from a run are included below. In short, when the new 
> driver starts it reports that it is going to use a new checkpoint directory 
> in HDFS. Workers report that they have loaded a StateStoreProvider that 
> matches the directory. But then the worker reports that it cannot read the 
> delta files. This causes the driver to crash.
> The HDFS directories are present but empty. Further, the directory 
> permissions for the original and new checkpoint directories are the same. The 
> worker never crashes.
> As mentioned in SPARK-20894, deleting the _spark_metadata directory makes 
> subsequent restarts succeed.
> Here is a timeline of log records from a recent run.
> A new run began at 00:29:21. These entries from a worker log look good.
> {{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of 
> HDFSStateStoreProvider[id = (op=0,part=0),dir = 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]
>  for update}}
> {{18/08/16 00:30:23 INFO HDFSBackedStateStoreProvider: Committed version 1 
> for 
> 

[jira] [Comment Edited] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart

2018-12-12 Thread Kaspar Tint (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718719#comment-16718719
 ] 

Kaspar Tint edited comment on SPARK-25136 at 12/12/18 10:10 AM:


Looks like we bumped into a similar issue with S3 checkpointing. A bunch of 
queries failed because of the usual S3 issues, and when restarting one of the 
failed queries it ran into a problem fetching the delta file required to 
continue processing.

It threw a
{code:java}
IllegalStateException: Error reading delta file 
s3://some.domain/spark/checkpoints/49/feedback/state/2/11/36870.delta{code}
But when looking into the S3 bucket, it looks like there actually was a delta 
file; I guess it just had not been finished yet:

{code:java}
s3://some.domain/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp
{code}

So it was a dot file with a .tmp suffix that had not been renamed to 
36870.delta yet. The query was never able to recover from this, and our 
application logic would just keep restarting it until an engineer stepped in 
and bumped the checkpoint version manually.

We can build a safety net for this in the meantime and delete the metadata 
manually when such an issue appears, but it would be nice to have more clarity 
into what is going on here, why it happened and whether it can be fixed on the 
Spark side.
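
To make the safety net idea concrete, a check along these lines could flag the 
situation before blindly restarting the query (a sketch only, using the Hadoop 
FileSystem API; {{deltaLooksOrphaned}} is a hypothetical helper, not existing 
Spark code):
{code:scala}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// True when version N of a state store partition has no N.delta file but a
// leftover ".N.delta.<uuid>.TID<id>.tmp" rename artifact is still present,
// which is the corrupted situation described above.
def deltaLooksOrphaned(stateDir: String, version: Long, conf: Configuration): Boolean = {
  val fs        = FileSystem.get(URI.create(stateDir), conf)
  val dir       = new Path(stateDir)
  val finalFile = new Path(dir, s"$version.delta")
  if (fs.exists(finalFile)) {
    false
  } else {
    val leftovers = fs.globStatus(new Path(dir, s".$version.delta.*.tmp"))
    leftovers != null && leftovers.nonEmpty
  }
}
{code}
A monitoring job could run this for the state directories referenced by the 
latest offsets and alert (or wipe the affected state) instead of letting the 
query flap indefinitely.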


was (Author: tint):
Looks like we bumped into a similar issue with S3 checkpointing. A bunch of 
queries failed because of the usual S3 issues, and when restarting one of the 
failed queries it ran into a problem fetching the delta file required to 
continue processing.

It threw a
{code:java}
IllegalStateException: Error reading delta file 
s3://some.domain/spark/checkpoints/49/feedback/state/2/11/36870.delta{code}
But when looking into the S3 bucket, it looks like there actually was a delta 
file; I guess it just had not been finished yet:

{code}
s3://some.domain/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp
{code}

So it was a dot file with a .tmp suffix that had not been renamed to 
36870.delta yet. The query was never able to recover from this, and our 
application logic would just keep restarting it until an engineer stepped in 
and bumped the checkpoint version manually.

We can build a safety net for this in the meantime and delete the metadata 
manually when such an issue appears, but it would be nice to have more clarity 
into what is going on here, why it happened and whether it can be fixed on the 
Spark side.

> unable to use HDFS checkpoint directories after driver restart
> --
>
> Key: SPARK-25136
> URL: https://issues.apache.org/jira/browse/SPARK-25136
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Robert Reid
>Priority: Major
>
> I have been attempting to work around the issue first discussed at the end of 
> SPARK-20894. The problem that we are encountering is the inability of the 
> Spark Driver to process additional data after restart. Upon restart it 
> reprocesses batches from the initial run. But when processing the first batch 
> it crashes because of missing checkpoint files.
> We have a structured streaming job running on a Spark Standalone cluster. We 
> restart the Spark Driver to renew the Kerberos token in use to access HDFS. 
> Excerpts of log lines from a run are included below. In short, when the new 
> driver starts it reports that it is going to use a new checkpoint directory 
> in HDFS. Workers report that they have loaded a StateStoreProvider that 
> matches the directory. But then the worker reports that it cannot read the 
> delta files. This causes the driver to crash.
> The HDFS directories are present but empty. Further, the directory 
> permissions for the original and new checkpoint directories are the same. The 
> worker never crashes.
> As mentioned in SPARK-20894, deleting the _spark_metadata directory makes 
> subsequent restarts succeed.
> Here is a timeline of log records from a recent run.
> A new run began at 00:29:21. These entries from a worker log look good.
> {{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of 
> HDFSStateStoreProvider[id = (op=0,part=0),dir = 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]
>  for update}}
> {{18/08/16 00:30:23 INFO HDFSBackedStateStoreProvider: Committed version 1 
> for 
> 

[jira] [Comment Edited] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart

2018-12-12 Thread Kaspar Tint (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718719#comment-16718719
 ] 

Kaspar Tint edited comment on SPARK-25136 at 12/12/18 10:09 AM:


Looks like we bumped into a similar issue with S3 checkpointing. A bunch of 
queries failed because of the usual S3 issues, and when restarting one of the 
failed queries it ran into a problem fetching the delta file required to 
continue processing.

It threw a
{code:java}
IllegalStateException: Error reading delta file 
s3://some.domain/spark/checkpoints/49/feedback/state/2/11/36870.delta{code}
But when looking into the S3 bucket, it looks like there actually was a delta 
file; I guess it just had not been finished yet:

{code}
s3://some.domain/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp
{code}

So it was a dot file with a .tmp suffix that had not been renamed to 
36870.delta yet. The query was never able to recover from this, and our 
application logic would just keep restarting it until an engineer stepped in 
and bumped the checkpoint version manually.

We can build a safety net for this in the meantime and delete the metadata 
manually when such an issue appears, but it would be nice to have more clarity 
into what is going on here, why it happened and whether it can be fixed on the 
Spark side.


was (Author: tint):
Looks like we bumped into a similar issue with S3 checkpointing. A bunch of 
queries failed because of the usual S3 issues, and when restarting one of the 
failed queries it ran into a problem fetching the delta file required to 
continue processing.

It threw a
{code:java}
IllegalStateException: Error reading delta file 
s3://some.domain/spark/checkpoints/49/feedback/state/2/11/36870.delta{code}
But when looking into the S3 bucket, it looks like there actually was a delta 
file; I guess it just had not been finished yet:

s3://some.domain/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp

So it was a dot file with a .tmp suffix that had not been renamed to 
36870.delta yet. The query was never able to recover from this, and our 
application logic would just keep restarting it until an engineer stepped in 
and bumped the checkpoint version manually.

We can build a safety net for this in the meantime and delete the metadata 
manually when such an issue appears, but it would be nice to have more clarity 
into what is going on here, why it happened and whether it can be fixed on the 
Spark side.

> unable to use HDFS checkpoint directories after driver restart
> --
>
> Key: SPARK-25136
> URL: https://issues.apache.org/jira/browse/SPARK-25136
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Robert Reid
>Priority: Major
>
> I have been attempting to work around the issue first discussed at the end of 
> SPARK-20894. The problem that we are encountering is the inability of the 
> Spark Driver to process additional data after restart. Upon restart it 
> reprocesses batches from the initial run. But when processing the first batch 
> it crashes because of missing checkpoint files.
> We have a structured streaming job running on a Spark Standalone cluster. We 
> restart the Spark Driver to renew the Kerberos token in use to access HDFS. 
> Excerpts of log lines from a run are included below. In short, when the new 
> driver starts it reports that it is going to use a new checkpoint directory 
> in HDFS. Workers report that they have loaded a StateStoreProvider that 
> matches the directory. But then the worker reports that it cannot read the 
> delta files. This causes the driver to crash.
> The HDFS directories are present but empty. Further, the directory 
> permissions for the original and new checkpoint directories are the same. The 
> worker never crashes.
> As mentioned in SPARK-20894, deleting the _spark_metadata directory makes 
> subsequent restarts succeed.
> Here is a timeline of log records from a recent run.
> A new run began at 00:29:21. These entries from a worker log look good.
> {{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of 
> HDFSStateStoreProvider[id = (op=0,part=0),dir = 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]
>  for update}}
> {{18/08/16 00:30:23 INFO HDFSBackedStateStoreProvider: Committed version 1 
> for 
> 

[jira] [Comment Edited] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart

2018-12-12 Thread Kaspar Tint (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718719#comment-16718719
 ] 

Kaspar Tint edited comment on SPARK-25136 at 12/12/18 10:09 AM:


Looks like we bumped into a similar issue with S3 checkpointing. A bunch of 
queries failed because of the usual S3 issues, and when restarting one of the 
failed queries it ran into a problem fetching the delta file required to 
continue processing.

It threw a
{code:java}
IllegalStateException: Error reading delta file 
s3://some.domain/spark/checkpoints/49/feedback/state/2/11/36870.delta{code}
But when looking into the S3 bucket, it looks like there actually was a delta 
file; I guess it just had not been finished yet:

s3://some.domain/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp

So it was a dot file with a .tmp suffix that had not been renamed to 
36870.delta yet. The query was never able to recover from this, and our 
application logic would just keep restarting it until an engineer stepped in 
and bumped the checkpoint version manually.

We can build a safety net for this in the meantime and delete the metadata 
manually when such an issue appears, but it would be nice to have more clarity 
into what is going on here, why it happened and whether it can be fixed on the 
Spark side.


was (Author: tint):
Looks like we bumped into a similar issue with S3 checkpointing. A bunch of 
queries failed because of the usual S3 issues, and when restarting one of the 
failed queries it ran into a problem fetching the delta file required to 
continue processing.

It threw a
{code:java}
IllegalStateException: Error reading delta file 
s3://some.domain/spark/checkpoints/49/feedback/state/2/11/36870.delta{code}
But when looking into the S3 bucket, it looks like there actually was a delta 
file; I guess it just had not been finished yet.

So it was a dot file with a .tmp suffix that had not been renamed to 
36870.delta yet. The query was never able to recover from this, and our 
application logic would just keep restarting it until an engineer stepped in 
and bumped the checkpoint version manually.

We can build a safety net for this in the meantime and delete the metadata 
manually when such an issue appears, but it would be nice to have more clarity 
into what is going on here, why it happened and whether it can be fixed on the 
Spark side.

> unable to use HDFS checkpoint directories after driver restart
> --
>
> Key: SPARK-25136
> URL: https://issues.apache.org/jira/browse/SPARK-25136
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Robert Reid
>Priority: Major
>
> I have been attempting to work around the issue first discussed at the end of 
> SPARK-20894. The problem that we are encountering is the inability of the 
> Spark Driver to process additional data after restart. Upon restart it 
> reprocesses batches from the initial run. But when processing the first batch 
> it crashes because of missing checkpoint files.
> We have a structured streaming job running on a Spark Standalone cluster. We 
> restart the Spark Driver to renew the Kerberos token in use to access HDFS. 
> Excerpts of log lines from a run are included below. In short, when the new 
> driver starts it reports that it is going to use a new checkpoint directory 
> in HDFS. Workers report that they have loaded a StateStoreProvider that 
> matches the directory. But then the worker reports that it cannot read the 
> delta files. This causes the driver to crash.
> The HDFS directories are present but empty. Further, the directory 
> permissions for the original and new checkpoint directories are the same. The 
> worker never crashes.
> As mentioned in SPARK-20894, deleting the _spark_metadata directory makes 
> subsequent restarts succeed.
> Here is a timeline of log records from a recent run.
> A new run began at 00:29:21. These entries from a worker log look good.
> {{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of 
> HDFSStateStoreProvider[id = (op=0,part=0),dir = 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]
>  for update}}
> {{18/08/16 00:30:23 INFO HDFSBackedStateStoreProvider: Committed version 1 
> for 
> HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]
>  to file 
> 

[jira] [Comment Edited] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart

2018-12-12 Thread Kaspar Tint (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718719#comment-16718719
 ] 

Kaspar Tint edited comment on SPARK-25136 at 12/12/18 10:08 AM:


Looks like we bumped into a similar issue with S3 checkpointing. A bunch of 
queries failed because of the usual S3 issues, and when restarting one of the 
failed queries it ran into a problem fetching the delta file required to 
continue processing.

It threw a
{code:java}
IllegalStateException: Error reading delta file 
s3://some.domain/spark/checkpoints/49/feedback/state/2/11/36870.delta{code}
But when looking into the S3 bucket, it looks like there actually was a delta 
file; I guess it just had not been finished yet:

[s3://some.domain/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp|s3://com.twilio.prod.cops-tooling/supernetwork/insights/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp]

So it was a dot file with a .tmp suffix that had not been renamed to 
36870.delta yet. The query was never able to recover from this, and our 
application logic would just keep restarting it until an engineer stepped in 
and bumped the checkpoint version manually.

We can build a safety net for this in the meantime and delete the metadata 
manually when such an issue appears, but it would be nice to have more clarity 
into what is going on here, why it happened and whether it can be fixed on the 
Spark side.


was (Author: tint):
Looks like we bumped into a similar issue with S3 checkpointing. A bunch of 
queries failed because of the usual S3 issues, and when restarting one of the 
failed queries it ran into a problem fetching the delta file required to 
continue processing.

It threw a
{code:java}
IllegalStateException: Error reading delta file 
s3://some.domain/spark/checkpoints/49/feedback/state/2/11/36870.delta{code}
But when looking into the S3 bucket, it looks like there actually was a delta 
file; I guess it just had not been finished yet:

*{{[s3://some.domain/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp|s3://com.twilio.prod.cops-tooling/supernetwork/insights/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp]}}*

So it was a dot file with a .tmp suffix that had not been renamed to 
36870.delta yet. The query was never able to recover from this, and our 
application logic would just keep restarting it until an engineer stepped in 
and bumped the checkpoint version manually.

We can build a safety net for this in the meantime and delete the metadata 
manually when such an issue appears, but it would be nice to have more clarity 
into what is going on here, why it happened and whether it can be fixed on the 
Spark side.

> unable to use HDFS checkpoint directories after driver restart
> --
>
> Key: SPARK-25136
> URL: https://issues.apache.org/jira/browse/SPARK-25136
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Robert Reid
>Priority: Major
>
> I have been attempting to work around the issue first discussed at the end of 
> SPARK-20894. The problem that we are encountering is the inability of the 
> Spark Driver to process additional data after restart. Upon restart it 
> reprocesses batches from the initial run. But when processing the first batch 
> it crashes because of missing checkpoint files.
> We have a structured streaming job running on a Spark Standalone cluster. We 
> restart the Spark Driver to renew the Kerberos token in use to access HDFS. 
> Excerpts of log lines from a run are included below. In short, when the new 
> driver starts it reports that it is going to use a new checkpoint directory 
> in HDFS. Workers report that they have loaded a StateStoreProvider that 
> matches the directory. But then the worker reports that it cannot read the 
> delta files. This causes the driver to crash.
> The HDFS directories are present but empty. Further, the directory 
> permissions for the original and new checkpoint directories are the same. The 
> worker never crashes.
> As mentioned in SPARK-20894, deleting the _spark_metadata directory makes 
> subsequent restarts succeed.
> Here is a timeline of log records from a recent run.
> A new run began at 00:29:21. These entries from a worker log look good.
> {{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of 
> HDFSStateStoreProvider[id = (op=0,part=0),dir = 
> 

[jira] [Comment Edited] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart

2018-12-12 Thread Kaspar Tint (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718719#comment-16718719
 ] 

Kaspar Tint edited comment on SPARK-25136 at 12/12/18 10:08 AM:


Looks like we bumped into a similar issue with S3 checkpointing. A bunch of 
queries failed because of the usual S3 issues, and when restarting one of the 
failed queries it ran into a problem fetching the delta file required to 
continue processing.

It threw a
{code:java}
IllegalStateException: Error reading delta file 
s3://some.domain/spark/checkpoints/49/feedback/state/2/11/36870.delta{code}
But when looking into the S3 bucket, it looks like there actually was a delta 
file; I guess it just had not been finished yet:

[s3://some.domain/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp|s3://com.twilio.prod.cops-tooling/supernetwork/insights/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp]

So it was a dot file with a .tmp suffix that had not been renamed to 
36870.delta yet. The query was never able to recover from this, and our 
application logic would just keep restarting it until an engineer stepped in 
and bumped the checkpoint version manually.

We can build a safety net for this in the meantime and delete the metadata 
manually when such an issue appears, but it would be nice to have more clarity 
into what is going on here, why it happened and whether it can be fixed on the 
Spark side.


was (Author: tint):
Looks like we bumped into a similar issue with S3 checkpointing. A bunch of 
queries failed because of the usual S3 issues, and when restarting one of the 
failed queries it ran into a problem fetching the delta file required to 
continue processing.

It threw a
{code:java}
IllegalStateException: Error reading delta file 
s3://some.domain/spark/checkpoints/49/feedback/state/2/11/36870.delta{code}
But when looking into the S3 bucket, it looks like there actually was a delta 
file; I guess it just had not been finished yet:

[s3://some.domain/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp|s3://com.twilio.prod.cops-tooling/supernetwork/insights/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp]

So it was a dot file with a .tmp suffix that had not been renamed to 
36870.delta yet. The query was never able to recover from this, and our 
application logic would just keep restarting it until an engineer stepped in 
and bumped the checkpoint version manually.

We can build a safety net for this in the meantime and delete the metadata 
manually when such an issue appears, but it would be nice to have more clarity 
into what is going on here, why it happened and whether it can be fixed on the 
Spark side.

> unable to use HDFS checkpoint directories after driver restart
> --
>
> Key: SPARK-25136
> URL: https://issues.apache.org/jira/browse/SPARK-25136
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Robert Reid
>Priority: Major
>
> I have been attempting to work around the issue first discussed at the end of 
> SPARK-20894. The problem that we are encountering is the inability of the 
> Spark Driver to process additional data after restart. Upon restart it 
> reprocesses batches from the initial run. But when processing the first batch 
> it crashes because of missing checkpoint files.
> We have a structured streaming job running on a Spark Standalone cluster. We 
> restart the Spark Driver to renew the Kerberos token in use to access HDFS. 
> Excerpts of log lines from a run are included below. In short, when the new 
> driver starts it reports that it is going to use a new checkpoint directory 
> in HDFS. Workers report that they have loaded a StateStoreProvider that 
> matches the directory. But then the worker reports that it cannot read the 
> delta files. This causes the driver to crash.
> The HDFS directories are present but empty. Further, the directory 
> permissions for the original and new checkpoint directories are the same. The 
> worker never crashes.
> As mentioned in SPARK-20894, deleting the _spark_metadata directory makes 
> subsequent restarts succeed.
> Here is a timeline of log records from a recent run.
> A new run began at 00:29:21. These entries from a worker log look good.
> {{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of 
> HDFSStateStoreProvider[id = (op=0,part=0),dir = 
> 

[jira] [Commented] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart

2018-12-12 Thread Kaspar Tint (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718719#comment-16718719
 ] 

Kaspar Tint commented on SPARK-25136:
-

Looks like we bumped into a similar issue with S3 checkpointing. A bunch of 
queries failed because of the usual S3 issues, and when restarting one of the 
failed queries it ran into a problem fetching the delta file required to 
continue processing.

It threw a
{code:java}
IllegalStateException: Error reading delta file 
s3://some.domain/spark/checkpoints/49/feedback/state/2/11/36870.delta{code}
But when looking into the S3 bucket, it looks like there actually was a delta 
file; I guess it just had not been finished yet:

*{{[s3://some.domain/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp|s3://com.twilio.prod.cops-tooling/supernetwork/insights/spark/checkpoints/49/feedback/state/2/11/.36870.delta.02e75d1f-aa48-4c61-9257-a5928b32e22e.TID2469780.tmp]}}*

So it was a dot file with a .tmp suffix that had not been renamed to 
36870.delta yet. The query was never able to recover from this, and our 
application logic would just keep restarting it until an engineer stepped in 
and bumped the checkpoint version manually.

We can build a safety net for this in the meantime and delete the metadata 
manually when such an issue appears, but it would be nice to have more clarity 
into what is going on here, why it happened and whether it can be fixed on the 
Spark side.
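
For clarity, "bumping the checkpoint version" just means pointing the 
restarted query at a fresh checkpoint directory so it stops trying to read the 
corrupted state, at the cost of losing the accumulated state. A minimal sketch 
with placeholder source, sink and paths:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("feedback").getOrCreate()

// Placeholder source; the real query reads from the actual input stream.
val df = spark.readStream.format("rate").load()

// The previous run used .../checkpoints/49/feedback, which is now unreadable.
val checkpointVersion = 50

val query = df.writeStream
  .queryName("feedback")
  .format("console") // placeholder sink
  .option("checkpointLocation",
    s"s3a://some.domain/spark/checkpoints/$checkpointVersion/feedback")
  .start()
{code}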

> unable to use HDFS checkpoint directories after driver restart
> --
>
> Key: SPARK-25136
> URL: https://issues.apache.org/jira/browse/SPARK-25136
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Robert Reid
>Priority: Major
>
> I have been attempting to work around the issue first discussed at the end of 
> SPARK-20894. The problem that we are encountering is the inability of the 
> Spark Driver to process additional data after restart. Upon restart it 
> reprocesses batches from the initial run. But when processing the first batch 
> it crashes because of missing checkpoint files.
> We have a structured streaming job running on a Spark Standalone cluster. We 
> restart the Spark Driver to renew the Kerberos token in use to access HDFS. 
> Excerpts of log lines from a run are included below. In short, when the new 
> driver starts it reports that it is going to use a new checkpoint directory 
> in HDFS. Workers report that they have loaded a StateStoreProvider that 
> matches the directory. But then the worker reports that it cannot read the 
> delta files. This causes the driver to crash.
> The HDFS directories are present but empty. Further, the directory 
> permissions for the original and new checkpoint directories are the same. The 
> worker never crashes.
> As mentioned in SPARK-20894, deleting the _spark_metadata directory makes 
> subsequent restarts succeed.
> Here is a timeline of log records from a recent run.
> A new run began at 00:29:21. These entries from a worker log look good.
> {{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of 
> HDFSStateStoreProvider[id = (op=0,part=0),dir = 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]
>  for update}}
> {{18/08/16 00:30:23 INFO HDFSBackedStateStoreProvider: Committed version 1 
> for 
> HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]
>  to file 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0/1.delta}}
> As the shutdown is occurring the worker reports
> {{18/08/16 00:39:11 INFO HDFSBackedStateStoreProvider: Aborted version 29 for 
> HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]}}
> The restart began at 00:39:38.
> Driver log entries
> {{18/08/16 00:39:51 INFO MicroBatchExecution: Starting [id = 
> e188d15f-e26a-48fd-9ce6-8c57ce53c2c1, runId = 
> b7ee0163-47db-4392-ab66-94d36ce63074]. Use 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040
>  to store the query checkpoint.}}
> {{18/08/16 00:40:26 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 10, 
> 

[jira] [Commented] (SPARK-10881) Unable to use custom log4j appender in spark executor

2018-05-16 Thread Kaspar Tint (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477398#comment-16477398
 ] 

Kaspar Tint commented on SPARK-10881:
-

This seems to be an issue with Spark 2.2.x also. We did not experience the 
issue with Spark 2.1.x, though.

> Unable to use custom log4j appender in spark executor
> -
>
> Key: SPARK-10881
> URL: https://issues.apache.org/jira/browse/SPARK-10881
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1
>Reporter: David Moravek
>Priority: Minor
>
> In CoarseGrainedExecutorBackend, log4j is initialized before the user classpath 
> gets registered:
> https://github.com/apache/spark/blob/v1.3.1/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L126
> In order to use a custom appender, one has to distribute it using `spark-submit 
> --files` and set it via spark.executor.extraClassPath
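
A minimal sketch of that workaround (file and class names below are 
placeholders only; the appender jar and the log4j properties file are assumed 
to be shipped with spark-submit, e.g. via 
{{--files custom-log4j-appender.jar,log4j-executor.properties}}):
{code:scala}
import org.apache.spark.sql.SparkSession

// Files distributed with --files land in the executor working directory, so
// they can be referenced by plain file name on the executor classpath.
val spark = SparkSession.builder()
  .appName("custom-appender-example")
  .config("spark.executor.extraClassPath", "custom-log4j-appender.jar")
  .config("spark.executor.extraJavaOptions",
    "-Dlog4j.configuration=log4j-executor.properties")
  .getOrCreate()
{code}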



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org