[GitHub] [spark] viirya commented on pull request #32702: [SPARK-35565][SS] Add config for ignoring metadata directory of FileStreamSink

2021-06-18 Thread GitBox


viirya commented on pull request #32702:
URL: https://github.com/apache/spark/pull/32702#issuecomment-864207085


   Thanks @HeartSaVioR @xuanyuanking! I've updated the change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on pull request #32702: [SPARK-35565][SS] Add config for ignoring metadata directory of FileStreamSink

2021-06-10 Thread GitBox


viirya commented on pull request #32702:
URL: https://github.com/apache/spark/pull/32702#issuecomment-858953384


   @HeartSaVioR @xuanyuanking Can we move forward with this? 





[GitHub] [spark] viirya commented on pull request #32702: [SPARK-35565][SS] Add config for ignoring metadata directory of FileStreamSink

2021-06-04 Thread GitBox


viirya commented on pull request #32702:
URL: https://github.com/apache/spark/pull/32702#issuecomment-854377685


   Oh, no need to apologize. I've not updated this yet. :) This is a SQL 
config now. Please help review if you find some time. Thanks!







[GitHub] [spark] viirya commented on pull request #32702: [SPARK-35565][SS] Add config for ignoring metadata directory of FileStreamSink

2021-05-31 Thread GitBox


viirya commented on pull request #32702:
URL: https://github.com/apache/spark/pull/32702#issuecomment-851802006


   Okay, sounds good. Let me change it to use a source option.





[GitHub] [spark] viirya commented on pull request #32702: [SPARK-35565][SS] Add config for ignoring metadata directory of FileStreamSink

2021-05-31 Thread GitBox


viirya commented on pull request #32702:
URL: https://github.com/apache/spark/pull/32702#issuecomment-851618733


   @HeartSaVioR Does that sound okay to you? If so, do you still prefer an 
option over a config? If you do, please let me know so I can change it to use 
an option.





[GitHub] [spark] viirya commented on pull request #32702: [SPARK-35565][SS] Add config for ignoring metadata directory of FileStreamSink

2021-05-30 Thread GitBox


viirya commented on pull request #32702:
URL: https://github.com/apache/spark/pull/32702#issuecomment-851107724


   > Which steps do end users need to take to resolve such a case with your PR? 
Deleting the metadata directory and letting the read path ignore the metadata?
   
   In the current use case, when users change the query and the old checkpoint 
no longer works, they clean up the metadata directory and run the changed 
query with a new checkpoint.
   
   They have another Spark app that reads the streaming query output. But 
because Spark respects the metadata, that app can only read the files written 
by the changed streaming query (i.e. the files recorded in the metadata). The 
files written before the query was changed are now ignored by Spark.
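
   The scenario above can be sketched with a small toy model. This is not 
Spark code: `commit_batch`, `list_readable_files`, the `ignore_metadata` 
parameter, and the file names are hypothetical stand-ins chosen for 
illustration, and the log format is simplified. Only the `_spark_metadata` 
directory name matches what the file stream sink uses. The sketch assumes a 
reader that trusts the metadata log whenever it exists, unless it is told to 
ignore it.

```python
import json
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple

METADATA_DIR = "_spark_metadata"  # directory name used by the file stream sink


def commit_batch(output_dir: Path, batch_id: int, files: Dict[str, str]) -> None:
    """Toy sink (hypothetical): write data files, then log their names per batch."""
    log_dir = output_dir / METADATA_DIR
    log_dir.mkdir(parents=True, exist_ok=True)
    for name, content in files.items():
        (output_dir / name).write_text(content)
    (log_dir / str(batch_id)).write_text(json.dumps(sorted(files)))


def list_readable_files(output_dir: Path, ignore_metadata: bool = False) -> List[str]:
    """Toy reader (hypothetical): trust the log if present, unless told to ignore it."""
    log_dir = output_dir / METADATA_DIR
    if log_dir.is_dir() and not ignore_metadata:
        names = set()
        for entry in log_dir.iterdir():
            names.update(json.loads(entry.read_text()))
        return sorted(names)
    # No log, or the log is ignored: fall back to listing the data files.
    return sorted(p.name for p in output_dir.iterdir() if p.is_file())


def demo() -> Tuple[List[str], List[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        commit_batch(out, 0, {"part-old-0.json": "a"})  # original query
        shutil.rmtree(out / METADATA_DIR)               # user cleans up the metadata
        commit_batch(out, 0, {"part-new-0.json": "b"})  # changed query, new checkpoint
        return (
            list_readable_files(out),                        # respects the metadata
            list_readable_files(out, ignore_metadata=True),  # ignores the metadata
        )


if __name__ == "__main__":
    default_view, full_view = demo()
    print(default_view)
    print(full_view)
```

   In this model the default read returns only the file recorded by the 
changed query, while the read that ignores the metadata also returns the older 
file, which is the gap the requested flag is meant to close.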
   
   > I know this is a valid workaround to unblock end users who would otherwise 
be stuck reusing the directory, but they should be quite cautious, as they must 
remember the state of the directory; the metadata won't cover some parts of the 
output, which is easy to forget. Once they forget that and also forget to set 
the flag on the read query, only part of the output will be read, and they will 
complain about the result of the read query without mentioning what they did.
   > 
   > So just allowing end users to ignore metadata is simple, but the risks of 
turning on the flag are not that simple. Let's take responsibility for 
explaining what ignoring metadata means and try to document the possible risks.
   
   I agree. That is why this config is internal-only so far. I should also add 
more cautionary wording to the config doc. I have discussed this with the 
users; it seems to me they should know what they are asking for and be cautious 
about the effect of this config.
   
   
   





[GitHub] [spark] viirya commented on pull request #32702: [SPARK-35565][SS] Add config for ignoring metadata directory of FileStreamSink

2021-05-30 Thread GitBox


viirya commented on pull request #32702:
URL: https://github.com/apache/spark/pull/32702#issuecomment-851067116


   > What's the solution of this? Doesn't it mean you want to make the 
directory be writable from multiple queries?
   
   The use case looks like this. The user wants to write to the same output 
directory. But once they change something in the query and the previous 
checkpoint can no longer be used, they need to use a new checkpoint directory 
(and metadata directory). They don't write to the output from multiple queries 
at the same time.
   
   
   
   





[GitHub] [spark] viirya commented on pull request #32702: [SPARK-35565][SS] Add config for ignoring metadata directory of FileStreamSink

2021-05-30 Thread GitBox


viirya commented on pull request #32702:
URL: https://github.com/apache/spark/pull/32702#issuecomment-851065018


   > And one more thing: I think letting the file stream sink ignore the 
metadata directory when reading existing metadata while still writing to the 
metadata directory is odd and error-prone. The metadata is no longer valid once 
Spark starts to write new metadata to the same directory, and the option must 
be set to true for such a directory to be read properly even though Spark 
writes the metadata. There's no indication, and end users have to memorize it.
   
   I don't know if it is a typo, but this doesn't affect the file stream sink; 
it actually lets the file stream source (and the batch read path) ignore the 
metadata directory when reading the output of a file stream sink. It doesn't 
change how the file stream sink reads or writes the metadata directory. Is it 
possible we are talking about two different things?
   
   





[GitHub] [spark] viirya commented on pull request #32702: [SPARK-35565][SS] Add config for ignoring metadata directory of FileStreamSink

2021-05-30 Thread GitBox


viirya commented on pull request #32702:
URL: https://github.com/apache/spark/pull/32702#issuecomment-851061991


   > This was already proposed before as part of #31638, though I'm not sure 
you were aware of it.
   
   Oh, this comes from an internal customer request. It seems hard to work 
around, so I think it makes sense to support such a use case. I wasn't aware 
that the previous PR included it.
   
   I'm okay if you think an option is better than a config.
   
   





[GitHub] [spark] viirya commented on pull request #32702: [SPARK-35565][SS] Add config for ignoring metadata directory of FileStreamSink

2021-05-30 Thread GitBox


viirya commented on pull request #32702:
URL: https://github.com/apache/spark/pull/32702#issuecomment-851028415


   cc @HeartSaVioR @xuanyuanking 

