[GitHub] [spark] dongjoon-hyun commented on pull request #32518: [SPARK-35383][CORE] Improve s3a magic committer support by inferring missing configs

2022-02-24 Thread GitBox


dongjoon-hyun commented on pull request #32518:
URL: https://github.com/apache/spark/pull/32518#issuecomment-1050425417


   Hi, @itayB . If you are using Apache Spark 3.2.1 with Hadoop 3.3.1, you 
don't need the first one. However, you still need the Parquet 
recommendation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #32518: [SPARK-35383][CORE] Improve s3a magic committer support by inferring missing configs

2021-05-12 Thread GitBox


dongjoon-hyun commented on pull request #32518:
URL: https://github.com/apache/spark/pull/32518#issuecomment-840019899


   Although the AppVeyor build failed due to a timeout, Jenkins passed. Merged 
to master.
   Thank you, @dbtsai , @HyukjinKwon , @steveloughran . This is part of the 
effort to give Apache Spark 3.2.0 better cloud support.





[GitHub] [spark] dongjoon-hyun commented on pull request #32518: [SPARK-35383][CORE] Improve s3a magic committer support by inferring missing configs

2021-05-12 Thread GitBox


dongjoon-hyun commented on pull request #32518:
URL: https://github.com/apache/spark/pull/32518#issuecomment-839992094


   Here are my thoughts.
   
   > Be aware that https://issues.apache.org/jira/browse/HADOOP-17483 turns the 
magic committer on everywhere
   
   Of course, we've been waiting for Apache Hadoop 3.3.1 as the next step 
after Hadoop 3.2.2. We will gladly upgrade.
   
   > so this patch will make the magic committer the default on s3.
   
   This patch fills in the missing settings only when Spark's configuration 
`spark.hadoop.fs.s3a.bucket..committer.magic.enabled=true` is provided. So, 
it is orthogonal to the Hadoop default configuration.
   
   > I am perfectly happy with this.
   
   Thank you. Yes, for S3, this is the correct and better direction, and it is 
especially useful when we build Apache Spark from source against a provided 
Hadoop version such as 3.2.x or 3.3.0.
   
   > Note also that MAPREDUCE-7431 is adding a committer for ABFS and GCS for 
max performance on abfs and performance and correctness on gcs. (it'll work on 
HDFS too, FWIW). Those changes needed in the spark config will be needed there 
too.
   
   Also, thank you for the heads-up. Yep, definitely, we are looking forward 
to seeing it. In addition to S3's offset bug, those will be beneficial to the 
end users.
   
   > Now, one of the reasons that binding factory stuff is in the spark 
codebase is that it was still using some of the old MRv1 algorithms to create 
and invoke committers, rather than the V2 APIs, which automatically go through 
the factory mechanism. So the real solution here would be to find those bits of 
the spark code which use org.apache.hadoop.mapred.FileOutputCommitter and 
other stuff in the same package and see if they can be replaced with a move to 
the stuff in org.apache.hadoop.mapreduce.lib.output.
   
   Yes, that touches a non-trivial code path at this stage and may cause 
another regression. I hope we can revisit it later.





[GitHub] [spark] dongjoon-hyun commented on pull request #32518: [SPARK-35383][CORE] Improve s3a magic committer support by inferring missing configs

2021-05-12 Thread GitBox


dongjoon-hyun commented on pull request #32518:
URL: https://github.com/apache/spark/pull/32518#issuecomment-839980053


   Thank you so much for the review and comments, @steveloughran !





[GitHub] [spark] dongjoon-hyun commented on pull request #32518: [SPARK-35383][CORE] Improve s3a magic committer support by inferring missing configs

2021-05-12 Thread GitBox


dongjoon-hyun commented on pull request #32518:
URL: https://github.com/apache/spark/pull/32518#issuecomment-839841587


   > LGTM. Should we eventually do this in Hadoop, cc @steveloughran and 
@dongjoon-hyun ?
   
   Thank you for the review, @dbtsai . The following two are Spark configurations.
   ```
   spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
   spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
   ```
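
For illustration only, here is a minimal Python sketch of how the two settings quoted above could be applied as defaults without overriding anything the user has already set. It is not the code from this PR: a plain dict stands in for the Spark configuration, and the helper names `with_cloud_committer_defaults` and `to_submit_args` are hypothetical. The keys and class names are the ones quoted above.

```python
# Sketch only: a plain dict stands in for the Spark configuration.
# The two keys and class names are the ones quoted in the comment above;
# both helper names below are hypothetical.

CLOUD_COMMITTER_DEFAULTS = {
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
}


def with_cloud_committer_defaults(user_conf):
    """Return a copy of user_conf with the two settings added if absent."""
    merged = dict(user_conf)
    for key, value in CLOUD_COMMITTER_DEFAULTS.items():
        # setdefault leaves any value the user already chose untouched.
        merged.setdefault(key, value)
    return merged


def to_submit_args(conf):
    """Render the configuration as spark-submit style `--conf key=value` pairs."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    for arg in to_submit_args(with_cloud_committer_defaults({})):
        print(arg)
```

Using set-if-absent semantics means an explicitly configured commit protocol or committer class is never replaced by the defaults.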





[GitHub] [spark] dongjoon-hyun commented on pull request #32518: [SPARK-35383][CORE] Improve s3a magic committer support by inferring missing configs

2021-05-11 Thread GitBox


dongjoon-hyun commented on pull request #32518:
URL: https://github.com/apache/spark/pull/32518#issuecomment-839470621


   Hi, @steveloughran . Could you review this PR?

