[jira] [Updated] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40039: - Fix Version/s: (was: 3.4.0) > Introducing a streaming checkpoint file manager based on Hadoop's Abortable > interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename. So when a file is > opened for an atomic stream a temporary file used instead and when the stream > is committed the file is renamed. > But on S3 a rename will be a file copy. So it has some serious performance > implication. > But on Hadoop 3 there is new interface introduce called *Abortable* and > *S3AFileSystem* has this capability which is implemented by on top S3's > multipart upload. So when the file is committed a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) > and when aborted a DELETE will be send > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-40039: --- Summary: Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface (was: Introducing checkpoint file manager based on Hadoop's Abortable interface) > Introducing a streaming checkpoint file manager based on Hadoop's Abortable > interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename. So when a file is > opened for an atomic stream a temporary file used instead and when the stream > is committed the file is renamed. > But on S3 a rename will be a file copy. So it has some serious performance > implication. > But on Hadoop 3 there is new interface introduce called *Abortable* and > *S3AFileSystem* has this capability which is implemented by on top S3's > multipart upload. So when the file is committed a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) > and when aborted a DELETE will be send > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org