Hi, per https://spark.apache.org/docs/latest/cloud-integration.html, when using S3 storage one is advised to set these options:

    spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
    spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter

However, looking at the code and running simple tests suggests that BindingParquetOutputCommitter is not used at all. Specifically, I used this code:

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.sql.SparkSession

    Logger.getLogger("org.apache.spark.internal.io.cloud").setLevel(Level.TRACE)
    Logger.getLogger("org.apache.hadoop.mapreduce.lib.output").setLevel(Level.DEBUG)

    val spark = SparkSession.builder().master("local[*]")
      .config("spark.sql.sources.outputCommitterClass", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
      .config("spark.sql.parquet.output.committer.class", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
      .config("spark.sql.sources.commitProtocolClass", "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      .config("fs.s3a.committer.magic.enabled", "true")
      .config("fs.s3.committer.magic.enabled", "true")
      .config("spark.hadoop.fs.s3a.committer.name", "magic")
      .config("spark.hadoop.fs.s3.committer.name", "magic")
      .getOrCreate()

    import spark.implicits._

    val df = Seq("foo", "bar").toDF("s")
    df.write.mode("overwrite").parquet("s3://<some-s3-bucket>/2021-09-07-parquet")

I observe that the magic committer is used, and I get trace log messages from PathOutputCommitProtocol, but none from BindingParquetOutputCommitter. If I remove the configuration options that set BindingParquetOutputCommitter, I still see the magic committer used.

The spark.sql.parquet.output.committer.class option is only read in ParquetFileFormat, where it is copied to spark.sql.sources.outputCommitterClass. That option, in turn, is only read by SQLHadoopMapReduceCommitProtocol, which we don't use here.

So, it sounds like setting spark.sql.parquet.output.committer.class to org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter is no longer necessary?
Or is there some code path where it matters?

-- 
Vladimir Prus
http://vladimirprus.com