[ https://issues.apache.org/jira/browse/SPARK-31911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Loughran resolved SPARK-31911.
------------------------------------
    Fix Version/s: 3.0.1
                   2.4.7
       Resolution: Fixed

> Using S3A staging committer, pending uploads are committed more than once and
> listed incorrectly in _SUCCESS data
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-31911
>                 URL: https://issues.apache.org/jira/browse/SPARK-31911
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.4.4
>            Reporter: Brandon
>            Priority: Major
>             Fix For: 3.0.1, 2.4.7
>
>
> First of all, thanks for the great work on the S3 committers. I was able to
> set up the directory staging committer in my environment following the docs at
> [https://github.com/apache/spark/blob/master/docs/cloud-integration.md#committing-work-into-cloud-storage-safely-and-fast]
> and tested one of my Spark applications with it. The Spark version is 2.4.4
> with Hadoop 3.2.1 and the cloud committer bindings. The application writes
> multiple DataFrames to ORC/Parquet in S3, submitting them to Spark as
> parallel write jobs.
>
> I think I'm seeing a bug where the staging committer completes pending
> uploads more than once. The main symptom, and how I discovered this, is that
> the _SUCCESS data files under each table contain overlapping file names that
> belong to separate tables. From my reading of the code, that's because the
> filenames in _SUCCESS reflect which multipart uploads were completed in the
> commit for that particular table.
>
> An example: concurrently fire off DataFrame.write.orc("s3a://bucket/a") and
> DataFrame.write.orc("s3a://bucket/b"). Suppose each table has one partition,
> so each writes one partition file. When the two writes are done:
> * /a/_SUCCESS contains two filenames: /a/part-0000 and /b/part-0000.
> * /b/_SUCCESS contains the same two filenames.
> Setting S3A logs to debug, I see that the commitJob operation belonging to
> table a includes completing the uploads of both /a/part-0000 and
> /b/part-0000. Then commitJob for table b includes the same completions. I
> haven't had a problem yet, but I wonder whether these extra requests would
> become an issue at higher scale, where dozens of commits with hundreds of
> files may be happening concurrently in the application.
>
> I believe this may be caused by the way the pendingSet files are stored in
> the staging directory: in the Hadoop code, they are all stored under one
> directory named by the jobID. However, for all write jobs executed by the
> Spark application, the jobID passed to Hadoop is the same - the application
> ID. Perhaps the staging commit algorithm was built on the assumption that
> each instance of the algorithm would use a unique random jobID.
>
> [~ste...@apache.org], [~rdblue] Having seen your names on most of this work
> (thank you), I would be interested in your thoughts on this. Also, it's my
> first time opening a bug here, so let me know if there's anything else I can
> do to help report the issue.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
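The collision the reporter describes can be sketched in plain Python. This is a minimal simulation, not the actual Hadoop committer code: `staging`, `add_pendingset`, and `commit_job` are hypothetical stand-ins for the staging-directory layout and the task/job commit steps, assuming (as the report suggests) that pendingset files are grouped solely by jobID.

```python
# Sketch of the suspected bug: pendingset files are keyed only by jobID,
# so two write jobs that share one jobID each "complete" the other's uploads.

staging = {}  # models the staging directory: jobID -> list of pending uploads


def add_pendingset(job_id, filename):
    """Task commit: record a pending multipart upload under the job's directory."""
    staging.setdefault(job_id, []).append(filename)


def commit_job(job_id):
    """Job commit: complete every upload found under the jobID directory,
    returning the filenames that would be listed in that table's _SUCCESS."""
    return list(staging.get(job_id, []))


# Two concurrent writes in one Spark application: both jobs pass the same
# jobID (the application ID) to Hadoop, so their pendingsets share a directory.
app_id = "application_123"
add_pendingset(app_id, "/a/part-0000")  # from the write to s3a://bucket/a
add_pendingset(app_id, "/b/part-0000")  # from the write to s3a://bucket/b

# Each table's commitJob now completes BOTH uploads, and both _SUCCESS
# manifests list both filenames - the reported symptom.
success_a = commit_job(app_id)  # ['/a/part-0000', '/b/part-0000']
success_b = commit_job(app_id)  # ['/a/part-0000', '/b/part-0000']

# With a unique jobID per write job, each commit sees only its own uploads.
staging.clear()
add_pendingset("job-a", "/a/part-0000")
add_pendingset("job-b", "/b/part-0000")
only_a = commit_job("job-a")  # ['/a/part-0000']
only_b = commit_job("job-b")  # ['/b/part-0000']
```

The second half of the sketch shows why a unique per-job ID (the direction the fix took) makes the overlap disappear: each commitJob then lists only its own pendingsets.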