Hi,

We are looking to enable `spark.rdd.compress` for Spark applications at
our company. For applications that cache RDDs with the `MEMORY_AND_DISK`
storage level and spill to disk, we are seeing an overall increase of
around 7% in aggregated task duration (the sum of the runtimes of all
successful tasks across all stages), tested on 5 Spark applications with
overall runtimes ranging from 10 minutes to 2 hours. We are using lz4 as
the compression codec.
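
For context, the change boils down to something like the following (a
simplified sketch, not our full job configuration; lz4 is also Spark's
default codec):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.rdd.compress", "true")        // compress serialized RDD partitions
      .set("spark.io.compression.codec", "lz4") // codec for internal data (RDD blocks, shuffle, spills)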

Is this increase expected?

For `MEMORY_AND_DISK` with no spill to disk, we saw task duration
differences ranging from -10% to +10% between runs without and with RDD
compression. Is this expected? My understanding is that if there is no
disk spill when caching RDDs with `MEMORY_AND_DISK`, there should be no
change in performance.
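
To illustrate the caching pattern I mean, here is a minimal sketch (not
our actual job; the dataset is just a placeholder):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("rdd-compress-test").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 1000000)

    // MEMORY_AND_DISK keeps in-memory partitions deserialized, so my understanding
    // is that spark.rdd.compress should only affect partitions that spill to disk.
    // (A serialized level such as MEMORY_AND_DISK_SER would be compressed in memory
    // as well, even without any spill.)
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count()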

I wanted to know whether anyone has tried enabling RDD compression and
benchmarked the overall increase/decrease in task duration or
vcore-hours.

Alternatively, is there a better metric for measuring the impact that
would help us decide whether or not to enable this feature?

We use Spark 3.3.2 and YARN 3.3.4, with Celeborn as the external shuffle
service. We self-host our cluster on AWS EC2 instances.

Test Setup
I ran the Spark applications on a dedicated YARN queue, once with
`spark.rdd.compress=false` and again with `spark.rdd.compress=true`, on
the same input data.

Since our company is sensitive to cost increases, we do not want to
enable this if the cost goes up too much, and we use aggregated task
duration as the metric for measuring cost.
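
For concreteness, the number we compare between the two runs is
essentially the following (a simplified sketch using a SparkListener,
not our exact tooling):

    import java.util.concurrent.atomic.AtomicLong
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Sum the duration of every successful task across all stages: the
    // "aggregated task duration" we use as the cost metric.
    class TaskDurationListener extends SparkListener {
      val totalMs = new AtomicLong(0L)
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        if (taskEnd.taskInfo != null && taskEnd.taskInfo.successful) {
          totalMs.addAndGet(taskEnd.taskInfo.duration)
        }
      }
    }

    // Registered with spark.sparkContext.addSparkListener(new TaskDurationListener())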

Let me know if anyone has any opinions or suggestions.

Thank you
Guruprasad Veerannavaru
