[GitHub] [spark] HeartSaVioR opened a new pull request #26920: [SPARK-30281][SS] Consider partitioned/recursive option while verifying archive path on FileStreamSource

GitBox Mon, 16 Dec 2019 22:42:28 -0800

HeartSaVioR opened a new pull request #26920: [SPARK-30281][SS] Consider 
partitioned/recursive option while verifying archive path on FileStreamSource
URL: https://github.com/apache/spark/pull/26920
 
 
   ### What changes were proposed in this pull request?
   
   This patch renews the verification logic of archive path for 
FileStreamSource, as we found the logic doesn't take partitioned/recursive 
options into account.
   
   Before the patch, it only requires the archive path to have depth more than 
2 (two subdirectories from root), leveraging the fact FileStreamSource normally 
reads the files where the parent directory matches the pattern or the file 
itself matches the pattern. Given 'archive' operation moves the files to the 
base archive path with retaining the full path, archive path is tend to be safe 
if the depth is more than 2, meaning FileStreamSource doesn't re-read archived 
files as new source files.
   
   WIth partitioned/recursive options, the fact is invalid, as FileStreamSource 
can read any files in any depth of subdirectories for source pattern. To deal 
with this correctly, we have to renew the verification logic, which may not 
intuitive and simple but works for all cases.
   
   The new verification logic prevents both cases:
   
   1) archive path matches with source pattern as "prefix" (the depth of 
archive path > the depth of source pattern)
   
   e.g.
   * source pattern: `/hello*/spar?`
   * archive path: `/hello/spark/structured/streaming`
   
   Any files in archive path will match with source pattern when recursive 
option is enabled.
   
   2) source pattern matches with archive path as "prefix" (the depth of source 
pattern > the depth of archive path)
   
   e.g.
   * source pattern: `/hello*/spar?/structured/hello2*`
   * archive path: `/hello/spark/structured`
   
   Some archive files will not match with source pattern, e.g. file path:  
`/hello/spark/structured/hello2`, then final archived path: 
`/hello/spark/structured/hello/spark/structured/hello2`.
   
   But some other archive files will still match with source pattern, e.g. file 
path: `/hello2/spark/structured/hello2`, then final archived path: 
`/hello/spark/structured/hello2/spark/structured/hello2` which matches with 
source pattern when recursive is enabled.
   
   Implicitly it also prevents archive path matches with source pattern as full 
match (same depth).
   
   We would want to prevent any source files to be archived and added to new 
source files again, so the patch takes most restrictive approach to prevent the 
possible cases.
   
   ### Why are the changes needed?
   
   Without this patch, there's a chance archived files are included as new 
source files when partitioned/recursive option is enabled, as current condition 
doesn't take these options into account.
   
   ### Does this PR introduce any user-facing change?
   
   Only for Spark 3.0.0-preview 1 - end users are required to provide archive 
path with ensuring a bit complicated conditions, instead of simply higher than 
2 depths.
   
   ### How was this patch tested?
   
   New UT.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HeartSaVioR opened a new pull request #26920: [SPARK-30281][SS] Consider partitioned/recursive option while verifying archive path on FileStreamSource

Reply via email to