zsxwing commented on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-651454246


   The numbers are pretty impressive. Thanks a lot for your work.
   
   My high level comments regarding the PR:
   
   - The compression codec should not be hardcoded. It's better to let the 
user configure the writer codec like other codec configs in Spark, and the 
reader should be able to identify the codec and load the files correctly. 
Otherwise, we would need to bump the version again whenever we add support for 
another codec (see the codec sketch after this list).
   - The cost of bumping the file sink version is much higher than bumping the 
checkpoint version. The file sink metadata should stay compatible with old 
versions if possible, like other storage formats such as Parquet and ORC.
     - For example, in some companies, the team that generates the output may 
not be able to convince downstream teams to upgrade their Spark version.
     - The file sink metadata can be read by external systems. Bumping the 
version would require the ecosystem to build a new reader.
     - If a user hits a critical issue in a new Spark version, they may want to 
roll back. Bumping the version would break this workflow.
   - It's better to make the default behavior like this:
     - Use v1 if an existing directory is already using v1, so that we can 
still support rollback. The user can set a flag to enable v2 explicitly.
     - Use v2 if it's a new directory.
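
   To make the codec point above concrete, here is a rough Scala sketch of what 
codec selection and detection could look like. The config key and the 
file-name suffix scheme are made up for illustration; only 
`CompressionCodec.createCodec` is an existing Spark API, and the real design 
might encode the codec differently (e.g. in a file header).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

// Hypothetical config key; a real one would follow Spark's naming conventions.
val compressionConf = "spark.sql.streaming.fileSink.log.compression.codec"

// Writer side: pick the codec from the user config and encode its short name
// in the log file name (e.g. "123.compact.zstd") so readers can detect it.
def logFileNameFor(batchId: Long, conf: SparkConf): String = {
  val shortName = conf.get(compressionConf, "lz4")
  s"$batchId.compact.$shortName"
}

// Reader side: infer the codec from the file name instead of hardcoding it,
// so adding another codec later does not require another version bump.
def codecFor(fileName: String, conf: SparkConf): Option[CompressionCodec] = {
  fileName.split('.').lastOption match {
    case Some(ext) if ext != "compact" => Some(CompressionCodec.createCodec(conf, ext))
    case _ => None // plain, uncompressed v1-style log file
  }
}
```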
   
   I'm wondering if we can step back and think about whether it is possible to 
solve the issue without bumping the version. IMO, the major issue is that the 
number of files grows without bound over time. This PR increases the upper 
bound but doesn't change the complexity. Maybe we should instead look at how to 
reduce the number of files, such as by supporting data compaction. You may 
notice that `FileStreamSinkLog.DELETE_ACTION` is never used: it was added to 
support data compaction, but we never implemented that.
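
   For reference, below is a rough sketch of how metadata-level compaction 
could use those actions. The entry class and the `rewriteAsOneFile` helper are 
illustrative stand-ins, not the actual `SinkFileStatus` / `FileStreamSinkLog` 
classes:

```scala
// Illustrative stand-in for the sink log entry; the real class is SinkFileStatus.
case class SinkLogEntry(path: String, size: Long, action: String)

val AddAction = "add"       // the action values FileStreamSinkLog already defines
val DeleteAction = "delete" // defined today but never emitted

// Merge many small output files into one larger file and record the change in
// the sink log: DELETE entries for the old files, one ADD entry for the new one.
// Readers that replay the log then only see the compacted file.
def compact(entries: Seq[SinkLogEntry], smallFileThreshold: Long): Seq[SinkLogEntry] = {
  val (small, _) = entries.partition(_.size < smallFileThreshold)
  if (small.size < 2) {
    Seq.empty
  } else {
    val (mergedPath, mergedSize) = rewriteAsOneFile(small.map(_.path))
    small.map(_.copy(action = DeleteAction)) :+
      SinkLogEntry(mergedPath, mergedSize, AddAction)
  }
}

// Hypothetical helper that physically rewrites the given files into one file
// and returns the new path and size.
def rewriteAsOneFile(paths: Seq[String]): (String, Long) = ???
```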
   
   I totally understand that we may not be able to solve the metadata issue 
without bumping the file metadata version. If so, it's better to design a new 
file sink log format that addresses all of the following issues, so that we 
don't need to bump the version again in the future:
   
   - Data compaction support.
   - Open source format. It would be great if the new file sink log format 
could be read by both Spark and other systems. An extra benefit is that we 
could use Spark itself to process the metadata, so that we don't have to keep 
all of the metadata in driver memory.
   - Better file stream source support. Since we always append files to the 
file sink, when reading a file sink with a streaming query it would be great if 
we could locate newly appended files quickly. For example, we could use `log 
file name + line number` as the streaming source offset instead, so we can jump 
to the right log file directly (see the offset sketch after this list).
   - Filter push down support when reading a file sink directory. For example, 
if a user queries only one partition, we don't need to load the metadata of 
other partitions into driver memory.
   - Decouple the file sink from the streaming query. Currently, one file sink 
output always maps to exactly one streaming query. A user may want to use 
another streaming query, or even a batch query, to write to the same file sink, 
but because we use the batch id as the file name this is not supported.
   - Use relative paths. Copying the file sink directory breaks it today. It 
would be great if we stored relative paths so that people can move or copy the 
directory.
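
   On the streaming source point above, here is a simplified sketch of what a 
log-position offset could look like. The names are made up, and a real 
implementation would extend Spark's streaming `Offset` class and serialize to 
JSON:

```scala
// An offset that points directly into the sink log, so a streaming reader of
// the file sink can resume at "the next line of log file X" instead of
// rescanning and diffing whole batches.
case class FileSinkLogOffset(logFileName: String, lineNumber: Long) {
  def json: String = s"""{"logFile":"$logFileName","line":$lineNumber}"""
}

// Resuming: open the named log file, skip the lines already processed, and
// every remaining line describes a newly appended data file.
def newEntriesSince(
    offset: FileSinkLogOffset,
    readLines: String => Iterator[String]): Iterator[String] = {
  readLines(offset.logFileName).drop(offset.lineNumber.toInt)
}
```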

