Hi,

We want to enable `spark.rdd.compress` for the Spark applications at our company. For applications where the storage level is `MEMORY_AND_DISK` and the cached RDDs spilled to disk, we are seeing an overall increase of around 7% in aggregated task duration (the sum of the runtimes of all successful tasks across all stages), tested on 5 Spark applications with overall runtimes ranging from 10 minutes to 2 hours. We are using lz4 as the compression codec.
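For reference, here is a minimal sketch of the kind of setup being compared. The input path and job logic are placeholders; only the two config settings and the storage level reflect what we are actually toggling:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Sketch only: placeholder input path and job logic.
object RddCompressSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-compress-benchmark")
      // The setting being compared: false in the baseline run, true in the test run.
      .config("spark.rdd.compress", "true")
      // Codec used for compressed RDD blocks (lz4 in our case).
      .config("spark.io.compression.codec", "lz4")
      .getOrCreate()

    val rdd = spark.sparkContext
      .textFile("hdfs:///data/placeholder-input") // placeholder path
      .map(_.length)

    // Cached with MEMORY_AND_DISK, so blocks that do not fit in memory are
    // serialized and written to disk; with spark.rdd.compress=true those
    // on-disk blocks are compressed with the codec above.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

    println(rdd.count())
    spark.stop()
  }
}
```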
Is this increase expected? For `MEMORY_AND_DISK` with no spill to disk, we saw task-duration differences ranging from -10% to +10% between runs with RDD compression disabled and enabled. Is that expected? My understanding is that if there is no disk spill when caching RDDs with `MEMORY_AND_DISK`, there should be no change in performance.

Has anyone tried enabling RDD compression and benchmarked the overall increase or decrease in task duration / vcore-hours? Or is there a better metric for measuring the impact, to help us decide whether or not to enable this feature?

We use Spark 3.3.2 and YARN 3.3.4, with Celeborn as the external shuffle service, on a self-hosted cluster of AWS EC2 instances.

Test setup: I ran the Spark applications on a dedicated YARN queue, once with `spark.rdd.compress=false` and again with `spark.rdd.compress=true`, on the same input data.

Since our company is sensitive to cost increases, we don't want to enable this if the cost grows too much for us, and we use aggregated task duration as the metric for measuring cost (see the sketch at the end of this message for what exactly is being summed).

Let me know if anyone has any opinions or suggestions.

Thank you,
Guruprasad Veerannavaru
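For clarity on the metric, the sketch below illustrates what I mean by "aggregated task duration": the sum of the durations of all successfully finished tasks across all stages. This is illustrative only, not our actual measurement tooling:

```scala
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Illustrative listener: accumulates the duration of every task that finished
// successfully, i.e. the "aggregated task duration" metric referred to above.
class AggregatedTaskDurationListener extends SparkListener {
  val totalTaskTimeMs = new AtomicLong(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // Only count tasks that completed successfully.
    if (taskEnd.taskInfo != null && taskEnd.taskInfo.successful) {
      totalTaskTimeMs.addAndGet(taskEnd.taskInfo.duration)
    }
  }
}

// Registered on the SparkContext before the job runs:
// spark.sparkContext.addSparkListener(new AggregatedTaskDurationListener)
```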