Hey Spark user community, I am writing Parquet files from Spark to S3 using s3a. I was reading the article below about improving S3 bucket performance, specifically the suggestion to introduce randomness into your key names so that objects are spread across different S3 partitions.
https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/

Is there a straightforward way to accomplish this randomness in Spark via the Dataset API? The only thing I could think of would be to split the large Dataset into several smaller ones (based on row boundaries) and then write each one out under its own random key prefix; I've put a rough sketch of that idea at the bottom of this mail. Is there an easier way that I am missing?

Thanks in advance!
Subhash
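For reference, here is roughly what I was picturing. This is only a sketch of my idea, not working production code: the bucket names, paths, and number of splits are made-up placeholders.

import org.apache.spark.sql.SparkSession
import scala.util.Random

val spark = SparkSession.builder().appName("randomized-prefix-write").getOrCreate()

// Placeholder source; in reality this is the large Dataset I want to persist.
val df = spark.read.parquet("s3a://my-source-bucket/input/")

// Split into roughly equal pieces along row boundaries.
val numSplits = 8
val splits = df.randomSplit(Array.fill(numSplits)(1.0))

// Write each piece under its own short random key prefix, so the keys land in
// different S3 partitions as described in the AWS article above.
splits.foreach { part =>
  val prefix = Random.alphanumeric.take(4).mkString.toLowerCase
  part.write
    .mode("append")
    .parquet(s"s3a://my-target-bucket/$prefix/data/")
}

This works in principle, but it means one write job per split and the random prefixes make the output layout hard to read back later, which is why I'm hoping there is something simpler that I'm missing.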