[ https://issues.apache.org/jira/browse/SPARK-31911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Loughran resolved SPARK-31911.
------------------------------------
    Fix Version/s: 3.0.1
                   2.4.7
       Resolution: Fixed

> Using S3A staging committer, pending uploads are committed more than once and
> listed incorrectly in _SUCCESS data
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-31911
>                 URL: https://issues.apache.org/jira/browse/SPARK-31911
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.4.4
>            Reporter: Brandon
>            Priority: Major
>             Fix For: 3.0.1, 2.4.7
>
>
> First of all, thanks for the great work on the S3 committers. I was able to
> set up the directory staging committer in my environment following the docs at
> [https://github.com/apache/spark/blob/master/docs/cloud-integration.md#committing-work-into-cloud-storage-safely-and-fast]
> and tested one of my Spark applications with it. The Spark version is 2.4.4
> with Hadoop 3.2.1 and the cloud committer bindings. The application writes
> multiple DataFrames to ORC/Parquet in S3, submitting them to Spark as
> parallel write jobs.
>
> I think I'm seeing a bug where the staging committer completes pending
> uploads more than once. The main symptom, and how I discovered this, is that
> the _SUCCESS data files under each table contain overlapping file names that
> belong to separate tables. From my reading of the code, that's because the
> filenames in _SUCCESS reflect which multipart uploads were completed in the
> commit for that particular table.
>
> An example: concurrently fire off DataFrame.write.orc("s3a://bucket/a") and
> DataFrame.write.orc("s3a://bucket/b"). Suppose each table has one partition,
> so each writes one partition file. When the two writes are done:
> * /a/_SUCCESS contains two filenames: /a/part-0000 and /b/part-0000.
> * /b/_SUCCESS contains the same two filenames.
> Setting S3A logs to debug, I see that the commitJob operation belonging to
> table a includes completing the uploads of both /a/part-0000 and
> /b/part-0000. Then commitJob for table b includes the same completions. I
> haven't had a problem yet, but I wonder whether these extra requests would
> become an issue at higher scale, where dozens of commits with hundreds of
> files may be happening concurrently in the application.
>
> I believe this may be caused by the way the pendingSet files are stored in
> the staging directory: in the Hadoop code, they are all stored under one
> directory named by the jobID. However, for all write jobs executed by the
> Spark application, the jobID passed to Hadoop is the same - the application
> ID. Perhaps the staging commit algorithm was built on the assumption that
> each instance of the algorithm would use a unique random jobID.
>
> [~ste...@apache.org], [~rdblue] Having seen your names on most of this work
> (thank you), I would be interested in your thoughts on this. Also, it's my
> first time opening a bug here, so let me know if there's anything else I can
> do to help report the issue.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
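The collision the reporter describes can be sketched in plain Python. This is a minimal simulation, not the actual Hadoop committer code: `staging`, `add_pendingset`, and `commit_job` are hypothetical stand-ins for the staging-directory layout and the task/job commit steps, assuming (as the report suggests) that pendingset files are grouped solely by jobID.

```python
# Sketch of the suspected bug: pendingset files are keyed only by jobID,
# so two write jobs that share one jobID each "complete" the other's uploads.

staging = {}  # models the staging directory: jobID -> list of pending uploads


def add_pendingset(job_id, filename):
    """Task commit: record a pending multipart upload under the job's directory."""
    staging.setdefault(job_id, []).append(filename)


def commit_job(job_id):
    """Job commit: complete every upload found under the jobID directory,
    returning the filenames that would be listed in that table's _SUCCESS."""
    return list(staging.get(job_id, []))


# Two concurrent writes in one Spark application: both jobs pass the same
# jobID (the application ID) to Hadoop, so their pendingsets share a directory.
app_id = "application_123"
add_pendingset(app_id, "/a/part-0000")  # from the write to s3a://bucket/a
add_pendingset(app_id, "/b/part-0000")  # from the write to s3a://bucket/b

# Each table's commitJob now completes BOTH uploads, and both _SUCCESS
# manifests list both filenames - the reported symptom.
success_a = commit_job(app_id)  # ['/a/part-0000', '/b/part-0000']
success_b = commit_job(app_id)  # ['/a/part-0000', '/b/part-0000']

# With a unique jobID per write job, each commit sees only its own uploads.
staging.clear()
add_pendingset("job-a", "/a/part-0000")
add_pendingset("job-b", "/b/part-0000")
only_a = commit_job("job-a")  # ['/a/part-0000']
only_b = commit_job("job-b")  # ['/b/part-0000']
```

The second half of the sketch shows why a unique per-job ID (the direction the fix took) makes the overlap disappear: each commitJob then lists only its own pendingsets.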