Re: SPARK-20325 - Spark Structured Streaming documentation Update: checkpoint configuration
Thank you for your reply, I will open a pull request for this doc issue. The logic is clear.

On Fri, 14 Apr 2017 at 23:34, Michael Armbrust wrote:

>> 1) Could we update the documentation for Structured Streaming to describe
>> that checkpointing can be specified via
>> spark.sql.streaming.checkpointLocation at the SparkSession level, so that
>> checkpoint dirs are created automatically per foreach query?
>
> Sure, please open a pull request.
>
>> 2) Do we really need to specify the checkpoint dir per query? What is the
>> reason for this? In the end we will be forced to write some checkpointDir
>> name generator, for example associating it with a particular named query,
>> and so on.
>
> Every query needs to have a unique checkpoint as this is how we track what
> has been processed. If we don't have this, we can't restart the query
> where it left off. In your example, I would suggest including the metric
> name in the checkpoint location path.

--
Yours faithfully,
Kate Eri.
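Michael's suggestion above (a unique checkpoint location per query, keyed by the metric name) could be sketched roughly as follows. This is only an illustration: baseDir, the metric names, and kafkaWriter are placeholders, and streamToProcess is the source DataFrame from the original message.

```scala
import org.apache.spark.sql.functions.count

// Sketch: derive a unique checkpoint dir per metric so each query
// tracks its own progress independently. Paths and metric names
// here are hypothetical.
val baseDir = "/tmp/checkpoints"
val metrics = Seq("requests", "errors")

metrics.foreach { metric =>
  streamToProcess
    .filter($"metric" === metric)
    .groupBy($"metric")
    .agg(count($"metric"))
    .writeStream
    .outputMode("complete")
    // a distinct location per query avoids the duplicate-id error
    .option("checkpointLocation", s"$baseDir/$metric")
    .foreach(kafkaWriter)
    .start()
}
```

With distinct checkpoint paths, each query can also be restarted later from its own offsets rather than colliding with a sibling query.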
SPARK-20325 - Spark Structured Streaming documentation Update: checkpoint configuration
Hello, guys. I have initiated the ticket https://issues.apache.org/jira/browse/SPARK-20325

My case was: I launch two streams from one source stream streamToProcess like this:

    streamToProcess
      .groupBy(metric)
      .agg(count(metric))
      .writeStream
      .outputMode("complete")
      .option("checkpointLocation", checkpointDir)
      .foreach(kafkaWriter)
      .start()

After that I got the following exception:

    Cannot start query with id bf6a1003-6252-4c62-8249-c6a189701255 as
    another query with same id is already active. Perhaps you are attempting
    to restart a query from checkpoint that is already active.

It is caused by the fact that StreamingQueryManager.scala gets the checkpoint dir from the stream's configuration, and because my streams have equal checkpointDirs, the second stream tries to recover from the checkpoint instead of creating a new one. For more details see the ticket: SPARK-20325

1) Could we update the documentation for Structured Streaming to describe that checkpointing can be specified via spark.sql.streaming.checkpointLocation at the SparkSession level, so that checkpoint dirs are created automatically per foreach query?

2) Do we really need to specify the checkpoint dir per query? What is the reason for this? In the end we will be forced to write some checkpointDir name generator, for example associating it with a particular named query, and so on.

--
Yours faithfully,
Kate Eri.
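For reference, question (1) refers to setting a checkpoint root once on the session, after which Spark generates a per-query checkpoint subdirectory instead of reusing one shared path. A minimal sketch (the app name and path are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: a session-level checkpoint root. Queries started without an
// explicit "checkpointLocation" option get their own subdirectory
// under this path, so two queries no longer collide.
val spark = SparkSession.builder()
  .appName("checkpoint-demo")          // hypothetical app name
  .config("spark.sql.streaming.checkpointLocation", "/tmp/checkpoints")
  .getOrCreate()
```

Under this setup the per-query `.option("checkpointLocation", checkpointDir)` from the code above would simply be omitted.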