zsxwing commented on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-651454246
The numbers are pretty impressive. Thanks a lot for your work. My high-level comments on the PR:

- The compression codec should not be hardcoded. It's better to let the user configure the writer codec like other codec configs in Spark, and the reader should be able to identify the codec and load files correctly. Otherwise, we would need to bump the version again whenever we add support for another codec.
- The cost of bumping the file sink version is much higher than for the checkpoint. The file sink metadata should stay compatible with old versions if possible, like other storage formats such as Parquet and ORC:
  - For example, in some companies, the team that generates the output may not be able to convince downstream teams to upgrade their Spark version.
  - The file sink metadata can be read by external systems. Bumping the version would require the ecosystem to build a new reader.
  - If a user hits a critical issue in a new Spark version, they may want to roll back. Bumping the version would break this workflow.
- It's better to make the default behavior like this:
  - Use v1 if an existing directory is using v1, so that we can still support rollback. The user can set a flag to enable v2 explicitly.
  - Use v2 if it's a new directory.

I'm wondering if we can step back and think about whether it is possible to solve the issue without bumping the version. IMO, the major issue is that the number of files grows without bound over time. This PR raises the upper bound but doesn't change the complexity. Maybe we should instead try to reduce the number of files, such as by supporting data compaction. You may notice that `FileStreamSinkLog.DELETE_ACTION` is never used: it was added to support data compaction, but we never implemented it. I totally understand that we may not be able to solve the metadata issue without bumping the file metadata version.
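To make the compaction idea concrete, here is a minimal, language-agnostic sketch of what folding the log could look like: replay ADD/DELETE actions in order and keep only files that are still live, so the compacted snapshot stops growing with history. The action names mirror `FileStreamSinkLog`'s `ADD_ACTION` and the currently unused `DELETE_ACTION`; everything else (the `compact` function, the `(action, path)` tuples) is illustrative, not Spark's actual implementation.

```python
# Hypothetical sketch of metadata log compaction. Assumes the log is a
# sequence of (action, path) pairs; names mirror FileStreamSinkLog's
# ADD_ACTION / DELETE_ACTION constants but the rest is illustrative.
ADD_ACTION = "add"
DELETE_ACTION = "delete"

def compact(actions):
    """Replay (action, path) pairs in order; keep only files still live."""
    live = {}
    for action, path in actions:
        if action == ADD_ACTION:
            live[path] = (ADD_ACTION, path)
        elif action == DELETE_ACTION:
            # a later delete cancels the earlier add, shrinking the snapshot
            live.pop(path, None)
    return list(live.values())
```

With such a fold, the compacted file's size tracks the number of *live* output files rather than the total number of batches ever written, which is the complexity change the PR alone doesn't provide.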
Then it's better to design a better file sink log format that solves all of the following issues, so that we don't need to bump the version again in the future:

- Data compaction support.
- An open format. It would be great if the new file sink log format could be read by Spark and other systems. An extra benefit is that we could use Spark itself to process the metadata, so we wouldn't have to hold all of it in driver memory.
- Better file stream source support. Since we always append files to the file sink, when reading a file sink with a streaming query it would be great if we could locate newly appended files quickly. For example, we could use `log file name + line number` as the streaming source offset, so we can jump to the log file directly.
- Filter pushdown support when reading a file sink directory. For example, if a user queries only one partition, we don't need to load the metadata of other partitions into driver memory.
- Decoupling the file sink from the streaming query. Currently, one file sink output always maps to one streaming query. A user may want another streaming query, or even a batch query, to write to the same file sink, but because we use the batch id as the file name, this is not supported.
- Relative paths. Copying the file sink directory breaks it today. It would be great if we remembered relative paths so that people could move or copy the directory.
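The `log file name + line number` offset idea above can be sketched as follows: the streaming reader remembers exactly where it stopped in the ordered log files and resumes from that position, instead of re-scanning every batch's metadata. `FileEntryOffset` and `read_new_entries` are hypothetical names for illustration only, not Spark's streaming source API.

```python
# Illustrative sketch of a "log file name + line number" streaming offset.
# Log files are assumed to sort lexicographically in commit order (as
# zero-padded batch-id file names do); entries within a file are lines.
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class FileEntryOffset:
    log_file: str   # e.g. a zero-padded log file name
    line: int       # number of lines already consumed in that file

def read_new_entries(lines_by_file, offset):
    """Return entries appended after `offset`, plus the advanced offset."""
    new_entries = []
    last = offset
    for name in sorted(lines_by_file):
        if name < offset.log_file:
            continue  # fully consumed in an earlier micro-batch
        start = offset.line if name == offset.log_file else 0
        fresh = lines_by_file[name][start:]
        new_entries.extend(fresh)
        last = FileEntryOffset(name, start + len(fresh))
    return new_entries, last
```

Because the offset pins both the file and the position inside it, a restart jumps straight to the tail of the log rather than replaying it from the beginning.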