Re: SPARK-20325 - Spark Structured Streaming documentation Update: checkpoint configuration

2017-04-14 Thread Katherin Eri
Thank you for your reply; I will open a pull request for this doc issue. The
logic is clear.

Fri, 14 Apr 2017, 23:34 Michael Armbrust:

>> 1) Could we update the documentation for Structured Streaming to describe
>> that the checkpoint location can be specified via
>> spark.sql.streaming.checkpointLocation at the SparkSession level, so that
>> checkpoint dirs are created automatically for each query?
>>
>>
> Sure, please open a pull request.
>
>
>> 2) Do we really need to specify the checkpoint dir per query? What is the
>> reason for this? Otherwise we will be forced to write some checkpointDir
>> name generator, for example associating it with a particular named query,
>> and so on?
>>
>
> Every query needs to have a unique checkpoint, as this is how we track what
> has been processed.  If we don't have this, we can't restart the query
> where it left off.  In your example, I would suggest including the metric
> name in the checkpoint location path.
>
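For reference, a minimal sketch of that suggestion (checkpointBase is a
hypothetical base path, metric is assumed to be a column-name string, and
streamToProcess and kafkaWriter are the names from the example in the
original message below):

    import org.apache.spark.sql.functions.count

    // Derive a unique checkpoint directory per query from the metric name,
    // so the two queries no longer collide on the same checkpoint.
    val checkpointDir = s"$checkpointBase/$metric"  // checkpointBase is hypothetical

    streamToProcess
      .groupBy(metric)
      .agg(count(metric))
      .writeStream
      .outputMode("complete")
      .option("checkpointLocation", checkpointDir)
      .foreach(kafkaWriter)
      .start()
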
-- 

*Yours faithfully, *

*Kate Eri.*


SPARK-20325 - Spark Structured Streaming documentation Update: checkpoint configuration

2017-04-14 Thread Katherin Eri
Hello, guys.

I have initiated the ticket
https://issues.apache.org/jira/browse/SPARK-20325 ,

My case was: I launch two streams from one source stream *streamToProcess*,
like this:


streamToProcess
  .groupBy(metric)
  .agg(count(metric))
  .writeStream
  .outputMode("complete")
  .option("checkpointLocation", checkpointDir)
  .foreach(kafkaWriter)
  .start()
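
The second query is started the same way against the same checkpointDir,
roughly like this (anotherMetric stands for a hypothetical second metric
column):

    streamToProcess
      .groupBy(anotherMetric)
      .agg(count(anotherMetric))
      .writeStream
      .outputMode("complete")
      .option("checkpointLocation", checkpointDir)  // same checkpointDir as the first query
      .foreach(kafkaWriter)
      .start()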


After that I got the following exception:

Cannot start query with id bf6a1003-6252-4c62-8249-c6a189701255 as
another query with same id is already active. Perhaps you are attempting to
restart a query from checkpoint that is already active.


It is caused by *StreamingQueryManager.scala* getting the checkpoint dir
from the stream’s configuration: because my streams have equal
checkpointDirs, the second stream tries to recover from the first one’s
checkpoint instead of creating a new one. For more details see the ticket:
SPARK-20325


1) Could we update the documentation for Structured Streaming to describe
that the checkpoint location can be specified via
spark.sql.streaming.checkpointLocation at the SparkSession level, so that
checkpoint dirs are created automatically for each query? (A sketch of what I
mean follows at the end of this message.)

2) Do we really need to specify the checkpoint dir per query? What is the
reason for this? Otherwise we will be forced to write some checkpointDir name
generator, for example associating it with a particular named query, and so
on?
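
Regarding 1), a minimal sketch of what I mean (spark is the SparkSession;
/tmp/checkpoints and the query name are hypothetical; as far as I understand,
each query then gets its own checkpoint subdirectory under the base path,
named after queryName or a random UUID if no name is given):

    import org.apache.spark.sql.functions.count

    // Set the checkpoint location once at the SparkSession level;
    // no per-query .option("checkpointLocation", ...) is needed then.
    spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/checkpoints")

    streamToProcess
      .groupBy(metric)
      .agg(count(metric))
      .writeStream
      .queryName("metric-count")  // hypothetical name; also used for the checkpoint subdir
      .outputMode("complete")
      .foreach(kafkaWriter)
      .start()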

-- 

*Yours faithfully, *

*Kate Eri.*