Hi,

Can you please provide some more info about your Spark cluster setup? You
mentioned Hadoop as the underlying storage, so I assume there is data
locality between your Spark cluster and the underlying Hadoop nodes.


In your SQL statement below

select count(*) from (
        select distinct c_last_name, c_first_name, d_date

you will have to shuffle because of the DISTINCT: each worker reads its
portion of the data and performs a partial (map-side) aggregation to get the
local distinct values, and then one final shuffle brings equal keys together
to compute the actual distinct across all the data.
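
If you want to see this, a minimal sketch is below (run in spark-shell, where
spark is predefined); it prints the physical plan, and you should see a
partial HashAggregate, then an Exchange (the shuffle), then a final
HashAggregate. It assumes the TPC-DS tables are already registered under the
names used in your query.

    // Sketch: inspect the physical plan of the DISTINCT query (spark-shell).
    // Assumes store_sales, date_dim and customer are registered tables/views.
    val q = """
      select count(*) from (
        select distinct c_last_name, c_first_name, d_date
        from store_sales, date_dim, customer
        where store_sales.ss_sold_date_sk = date_dim.d_date_sk
          and store_sales.ss_customer_sk = customer.c_customer_sk
          and d_month_seq between 1200 and 1200 + 11
      ) hot_cust
      limit 100
    """
    // Look for: HashAggregate ... Exchange hashpartitioning ... HashAggregate
    spark.sql(q).explain()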

Increasing the number of executor cores means more tasks run in parallel on
the underlying data within the same executor, so there is more concurrent
work per node. My view would be that you plot the execution time of tasks
(y-axis) against the parameter of interest (x-axis), so the trend is easier
to grasp. You can use Excel for it.
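
If it helps, here is a hedged sketch (again spark-shell Scala; the variable
names and output format are just illustrative) that registers a SparkListener
and records the duration of every task, so you can dump the numbers as CSV and
plot them against executor-cores.

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
    import scala.collection.mutable.ArrayBuffer

    // (stageId, task duration in ms), filled in as tasks finish.
    val taskTimes = ArrayBuffer[(Int, Long)]()

    // The listener is invoked on Spark's event thread after each task ends.
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        taskTimes += ((taskEnd.stageId, taskEnd.taskInfo.duration))
      }
    })

    // ... run the query here, then print CSV lines (stageId,durationMs):
    taskTimes.foreach { case (stage, ms) => println(s"$stage,$ms") }

Bear in mind that with spark.sql.shuffle.partitions=100, a shuffle stage runs
roughly 100 / total-cores waves of tasks, so per-task time and overall stage
time can move in opposite directions as you change the core count.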

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 10 Feb 2022 at 13:16, Sean Owen <sro...@gmail.com> wrote:

> More cores finishing faster is as expected. My guess is that you are
> getting more parallelism overall and that speeds things up. However with
> more tasks executing concurrently on one machine, you are getting some
> contention, so it's possible more tasks are taking longer - a little I/O
> contention, CPU throttling under load, etc. Could be other reasons but this
> is possibly quite normal.
>
> On Thu, Feb 10, 2022 at 7:02 AM 15927907...@163.com <15927907...@163.com>
> wrote:
>
>> Hello,
>>     I recently used Spark 3.2 to run a test based on the TPC-DS dataset;
>> the entire TPC-DS data scale is 1TB (located in HDFS). But I encountered
>> a problem that I couldn't understand, and I hope to get your help.
>>     The SQL statement tested is as follows:
>>      select count(*) from (
>>         select distinct c_last_name, c_first_name, d_date
>>          from store_sales, date_dim, customer
>>                     where store_sales.ss_sold_date_sk = date_dim.d_date_sk
>>              and store_sales.ss_customer_sk = customer.c_customer_sk
>>              and d_month_seq between 1200 and 1200 + 11
>>     )hot_cust
>>      limit 100;
>>
>> Three tables are used here, and their sizes are: store_sales (387.97GB),
>> date_dim (9.84GB), customer (1.52GB)
>>
>> When submitting the application, the driver and executor parameters are
>> set as follows:
>> DRIVER_OPTIONS="--master spark://xx.xx.xx.xx:7078 --total-executor-cores
>> 48 --conf spark.driver.memory=100g --conf spark.default.parallelism=100
>> --conf spark.sql.adaptive.enabled=false"
>>  EXECUTOR_OPTIONS="--executor-cores 12 --conf spark.executor.memory=50g
>> --conf spark.task.cpus=1 --conf spark.sql.shuffle.partitions=100 --conf
>> spark.sql.crossJoin.enabled=true"
>>
>> The Spark cluster is built on a single physical node with only one worker,
>> and the application is submitted in standalone mode. The node has a total
>> of 56 cores and 376GB of memory. I kept the number of executors fixed at 4
>> (that is, the total executor count was unchanged) and adjusted
>> executor-cores = 2, 4, 8, 12. The corresponding application times were
>> 96s, 66s, 50s, 49s. It was found that once total-executor-cores >= 32, the
>> elapsed time is almost unchanged. I monitored CPU utilization and IO and
>> found no bottlenecks. Then I analyzed the task times of the longest stage
>> of each application in the Spark web UI (the longest stage in all four
>> test cases is stage 4, with corresponding durations of 48s, 28s, 22s,
>> 23s). We see that as the number of CPU cores increases, the task time in
>> stage 4 also increases; the average task times are 3s, 4s, 5s, 7s. Also,
>> each executor takes longer to run the first batch of tasks than subsequent
>> tasks, and the more cores there are, the greater the gap:
>>
>> Test-Case-ID | total-executor-cores | executor-cores | total application time | stage4 time | average task time | first batch task time | remaining batch task time
>> 1            | 8                    | 2              | 96s                    | 48s         | 7s                | 5-6s                  | 3-4s
>> 2            | 16                   | 4              | 66s                    | 28s         | 5s                | 7-8s                  | ~4s
>> 3            | 32                   | 8              | 50s                    | 23s         | 4s                | 10-12s                | ~5s
>> 4            | 48                   | 12             | 49s                    | 22s         | 3s                | 14-15s                | ~6s
>>
>> This result confuses me: why does the execution time of individual tasks
>> become longer after increasing the number of cores? I compared the data
>> size processed by each task across the different cases and it is
>> consistent. And why does the overall execution time stop improving once
>> the number of cores increases beyond a certain point?
>> Looking forward to your reply.
>>
>> ------------------------------
>> 15927907...@163.com
>>
>
