Attila Zsolt Piros created SPARK-40039:
------------------------------------------
Summary: Introducing checkpoint file manager based on Hadoop's Abortable interface
Key: SPARK-40039
URL: https://issues.apache.org/jira/browse/SPARK-40039
Project: Spark
Issue Type: Improvement
Components: Structured Streaming
Affects Versions: 3.4.0
Reporter: Attila Zsolt Piros
Assignee: Attila Zsolt Piros

Currently on S3 the checkpoint file manager (called FileContextBasedCheckpointFileManager) is based on rename. When a file is opened for an atomic stream, a temporary file is used instead, and when the stream is committed the temporary file is renamed to the final name.

But on S3 a rename is a file copy, so this has serious performance implications.

Hadoop 3, however, introduces a new interface called *Abortable*, and *S3AFileSystem* has this capability, which is implemented on top of S3's multipart upload. So when the file is committed a POST is sent ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) and when it is aborted a DELETE is sent ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]).
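
The difference between the two commit strategies described in this issue can be illustrated with a small, self-contained Java sketch. It is a toy in-memory model only, not the Hadoop, Spark, or AWS API: `ToyObjectStore`, `commitViaRename`, `writeViaMultipart`, and the `bytesCopied` counter are hypothetical names introduced here for illustration, and the sketch assumes that a rename on an object store costs one full copy of the data while completing a multipart upload costs none.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these types are Hadoop, Spark, or AWS APIs.
class CheckpointCommitSketch {

    /** In-memory stand-in for an object store such as S3 (hypothetical). */
    static final class ToyObjectStore {
        final Map<String, byte[]> visible = new HashMap<>(); // committed objects
        final Map<String, byte[]> pending = new HashMap<>(); // in-flight multipart uploads
        long bytesCopied = 0;                                 // cost paid by renames

        void put(String key, byte[] data) {
            visible.put(key, data.clone());
        }

        // Assumption: the store has no real rename, so it copies and then deletes.
        void rename(String src, String dst) {
            byte[] data = visible.remove(src);
            if (data == null) {
                throw new IllegalStateException("no such object: " + src);
            }
            bytesCopied += data.length;
            visible.put(dst, data.clone());
        }

        void startMultipartUpload(String key, byte[] data) {
            pending.put(key, data.clone());
        }

        // Models CompleteMultipartUpload: the object becomes visible without a copy.
        void completeMultipartUpload(String key) {
            byte[] data = pending.remove(key);
            if (data == null) {
                throw new IllegalStateException("no pending upload: " + key);
            }
            visible.put(key, data);
        }

        // Models AbortMultipartUpload: the parts are discarded, nothing becomes visible.
        void abortMultipartUpload(String key) {
            pending.remove(key);
        }
    }

    /** Rename-based commit: write a temporary object, then rename it to the final key. */
    static void commitViaRename(ToyObjectStore store, String key, byte[] data) {
        String tmp = key + ".tmp";
        store.put(tmp, data);
        store.rename(tmp, key);
    }

    /** Abortable-style write: upload to the final key, then either complete or abort. */
    static void writeViaMultipart(ToyObjectStore store, String key, byte[] data, boolean commit) {
        store.startMultipartUpload(key, data);
        if (commit) {
            store.completeMultipartUpload(key);
        } else {
            store.abortMultipartUpload(key);
        }
    }

    public static void main(String[] args) {
        byte[] payload = new byte[1024];

        ToyObjectStore renameStore = new ToyObjectStore();
        commitViaRename(renameStore, "checkpoint/offsets/0", payload);

        ToyObjectStore abortableStore = new ToyObjectStore();
        writeViaMultipart(abortableStore, "checkpoint/offsets/0", payload, true);
        writeViaMultipart(abortableStore, "checkpoint/offsets/1", payload, false);

        System.out.println("rename-based bytes copied: " + renameStore.bytesCopied);
        System.out.println("abortable bytes copied: " + abortableStore.bytesCopied);
        System.out.println("aborted object visible: "
                + abortableStore.visible.containsKey("checkpoint/offsets/1"));
    }
}
```

In this model the rename-based path pays for a copy proportional to the file size on every commit, whereas the multipart path makes the object visible only on completion and leaves nothing behind on abort. A real implementation would go through the stream returned by the file system rather than a map, so the sketch shows only the shape of the trade-off.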