[GitHub] [spark] zsxwing commented on issue #26590: [SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink

2019-12-05 Thread GitBox
zsxwing commented on issue #26590: [SPARK-29953][SS] Don't clean up source 
files for FileStreamSource if the files belong to the output of FileStreamSink
URL: https://github.com/apache/spark/pull/26590#issuecomment-562441349
 
 
   Thanks! Merging to master.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zsxwing commented on issue #26590: [SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink

2019-12-05 Thread GitBox
zsxwing commented on issue #26590: [SPARK-29953][SS] Don't clean up source 
files for FileStreamSource if the files belong to the output of FileStreamSink
URL: https://github.com/apache/spark/pull/26590#issuecomment-562346871
 
 
   LGTM.
   
   retest this please. Triggering another test since the last run was 3 days 
ago.





[GitHub] [spark] zsxwing commented on issue #26590: [SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink

2019-11-21 Thread GitBox
zsxwing commented on issue #26590: [SPARK-29953][SS] Don't clean up source 
files for FileStreamSource if the files belong to the output of FileStreamSink
URL: https://github.com/apache/spark/pull/26590#issuecomment-557216576
 
 
   > Checking all the files in all the directories in each micro-batch is 
definitely an overkill.
   
   +1.
   
   I think the fundamental issue is that the FileIndex interface doesn't work 
for complicated cases. There are multiple issues here. Another example: if a 
user is using a glob path in `FileStreamSource`, we always go to 
`InMemoryFileIndex`, even if some of the matched paths were created by 
`FileStreamSink`. `InMemoryFileIndex` knows nothing about 
`MetadataLogFileIndex` and uses its own logic to list files.
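   
   A minimal sketch of the selection behavior described above. The types, the 
`isGlob` helper, and the `sinkOutputDirs` parameter are hypothetical stand-ins 
for illustration, not the real Spark classes or signatures:

```scala
// Simplified stand-ins for illustration only; NOT the real Spark classes.
sealed trait FileIndexKind
case object MetadataLogIndex extends FileIndexKind // stands in for MetadataLogFileIndex
case object InMemoryIndex extends FileIndexKind    // stands in for InMemoryFileIndex

object IndexSelection {
  // Hypothetical helper: any of these characters marks the path as a glob pattern.
  def isGlob(path: String): Boolean =
    path.exists(c => "{}[]*?\\".indexOf(c) >= 0)

  // sinkOutputDirs (assumed input): directories known to hold a FileStreamSink
  // metadata log. A glob path never consults it, which is the problem described
  // above: matched sink output is listed with plain file-listing logic.
  def choose(path: String, sinkOutputDirs: Set[String]): FileIndexKind =
    if (!isGlob(path) && sinkOutputDirs.contains(path)) MetadataLogIndex
    else InMemoryIndex
}
```

   With this sketch, `IndexSelection.choose("/data/*", Set("/data/out"))` picks 
the in-memory index even though `/data/out` matches the glob and was written by 
a sink.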
   
   Ideally, if we want to prevent such cases, the defensive code should be 
added where the file listing happens, because that would also prevent reading 
incorrect files. However, I think that's a pretty large change and probably not 
worth it (I have not yet figured out how to make Hadoop's glob pattern code 
understand `MetadataLogFileIndex`; it may be impossible).
   
   Hence I suggest we just block the `cleanSource` option when listing files 
using `MetadataLogFileIndex`.
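   
   A rough sketch of what such a guard could look like. `CleanSourceGuard`, the 
mode objects, and the `usesMetadataLogIndex` flag are hypothetical names used 
only for illustration; the mode values are assumed to mirror `off`, `archive`, 
and `delete`, and this is not the code in the PR:

```scala
// Hypothetical model of the `cleanSource` option values (assumed: off/archive/delete).
sealed trait CleanSourceMode
case object Off extends CleanSourceMode
case object Archive extends CleanSourceMode
case object Delete extends CleanSourceMode

object CleanSourceGuard {
  // usesMetadataLogIndex (assumed flag): true when the source lists files through
  // the sink's metadata log, i.e. the input is the output of a FileStreamSink.
  // Returns Left with an error message when cleanup must be refused.
  def validate(
      usesMetadataLogIndex: Boolean,
      mode: CleanSourceMode): Either[String, CleanSourceMode] =
    if (usesMetadataLogIndex && mode != Off)
      Left("cleanSource is not supported when reading the output of a file sink")
    else
      Right(mode)
}
```

   The check depends only on which index is in use, so no per-batch scan of the 
directories is needed.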





[GitHub] [spark] zsxwing commented on issue #26590: [SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink

2019-11-20 Thread GitBox
zsxwing commented on issue #26590: [SPARK-29953][SS] Don't clean up source 
files for FileStreamSource if the files belong to the output of FileStreamSink
URL: https://github.com/apache/spark/pull/26590#issuecomment-556950146
 
 
   @HeartSaVioR I think we can simply detect whether we are using 
`MetadataLogFileIndex` here: 
https://github.com/apache/spark/blob/ba2bc4b0e0eea0c1b6732a18cb20e61e4f693156/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L205
   
   We don't need such a complicated check: in the cases you are checking, we 
won't go through `MetadataLogFileIndex`, so the result is not correct anyway 
and the user should not use such a path.

