Hello experts,

I am trying to maximise resource utilisation on my 3-node Spark cluster (2 data nodes and 1 driver) so that the job finishes as quickly as possible. The goal is a benchmark I can use to recommend an optimal pod size for the job; each node has 128 GB of memory and 16 cores, and I am running Spark 2.4.0 in standalone mode.

htop shows only half of the memory in use, while CPU sits at 100% for the allocated resources. What alternatives can I try? Can I reduce per-executor memory to 32 GB and increase the number of executors?
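A minimal sketch of that 32 GB variant, written as changed properties in the same spark-defaults style as the list below (only the changed lines are shown; the values are illustrative and assume the two worker nodes are the 128 GB / 16-core machines described above):

spark.executor.memory 32g
spark.executor.cores 4
spark.executor.instances 8
# illustrative sizing: 8 executors x 4 cores = 32 cores and 8 x 32g = 256g,
# i.e. 4 executors per node at 4 cores / 32g each across the 2 data nodes
spark.dynamicAllocation.enabled false
# fixed executor counts are easier to reason about while benchmarking

Note that htop can still report less than the full heap as resident: the JVM reserves -Xmx up front but only touches heap pages as they are actually used, unless it is started with -XX:+AlwaysPreTouch.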
For reference, my current properties are:

spark.driver.maxResultSize 64g
spark.driver.memory 100g
spark.driver.port 33631
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.executorIdleTimeout 60s
spark.executor.cores 8
spark.executor.id driver
spark.executor.instances 4
spark.executor.memory 64g
spark.files file://dist/xxxx-0.0.1-py3.7.egg
spark.locality.wait 10s
spark.shuffle.service.enabled true

--
-Sriram

On Fri, Dec 20, 2019 at 10:56 AM zhangliyun <kelly...@126.com> wrote:

> Hi all:
>
> I want to ask how to estimate the size of an RDD (in bytes) when it is
> not saved to disk, because the job takes a long time when the output is
> very large and the output partition count is small.
>
> These are the steps I came up with for this problem:
>
> 1. sample 1% (fraction 0.01) of the original data
> 2. compute the sample data count
> 3. if the sample data count > 0, cache the sample data and compute the
> sample data size
> 4. compute the original RDD's total count
> 5. estimate the RDD size as ${total count} * ${sample data size} /
> ${sample rdd count}
>
> The code is here
> <https://github.com/kellyzly/sparkcode/blob/master/EstimateDataSetSize.scala#L24>.
>
> My questions:
> 1. Can I use the above approach to solve the problem? If not, where is
> it wrong?
> 2. Is there an existing solution (an existing API in Spark) for this?
>
> Best Regards
> Kelly Zhang
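On the quoted approach: a minimal Scala sketch of those five steps, for illustration only (this is not the linked code; the input path is a placeholder, and the cached sample's in-memory size is read via sc.getRDDStorageInfo):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: estimate an RDD's in-memory size from a 1% sample.
val sc = new SparkContext(new SparkConf().setAppName("EstimateRddSize"))
val rdd = sc.textFile("hdfs:///path/to/input")          // placeholder input

val sample = rdd.sample(withReplacement = false, 0.01)  // step 1
val sampleCount = sample.count()                        // step 2
if (sampleCount > 0) {
  sample.cache()
  sample.count()                                        // step 3: materialize the cache
  val sampleBytes = sc.getRDDStorageInfo
    .find(_.id == sample.id)                            // storage info for the cached sample
    .map(_.memSize)
    .getOrElse(0L)
  val totalCount = rdd.count()                          // step 4
  val estimate = totalCount * sampleBytes / sampleCount // step 5: scale up to the full RDD
  println(s"estimated size: $estimate bytes")
}

One caveat on question 1: the scaling in step 5 assumes records are roughly uniform in size, so heavy skew in record sizes will bias the estimate, and the cached (deserialized) size can differ from the serialized size the data would occupy on disk.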