Re: Performance of Query 4 on TPC-DS Benchmark

2024-11-12 Thread Andrei Lepikhov

On 11/11/24 17:49, Ba Jinsheng wrote:


 >It is all the time a challenge for PostgreSQL to estimate such a filter
 >because of absent information on joint column distribution.
 >Can you research this way by building extended statistics on these
 >clauses? It could move the plan to the more optimal direction.

Thanks a lot for your effort to analyze this issue, and we really 
appreciate your suggestions!  Currently, we focus on exposing these 
issues that affect performance. In the future, we may consider to look 
into such a direction as you suggested.
It would be better to participate in further analysis - at least, to 
find out general solution and classify your findings.
For example in your query, if you replace CTE with a table (see in the 
attachment) and execute vacuum analyze on this table you will have 
well-estimated query which executes fast.





 > Have you tried any tools to improve the cardinality yet, like aqo [0]?
Yes, but it takes nearly 1 hour to run this query at a time, so I only 
run "EXPLAIN ANALYZE" once, and the performance seems slightly improved.
Hmm, AQO iteratively approaches optimal decision. Sometimes it is a 
local optimum, but we still don't have a method practical enough to kick 
its out of the trap of local optimum.


--
regards, Andrei Lepikhov

tpcds-1.sql
Description: application/sql


year_total.sql
Description: application/sql


Re: Performance of Query 4 on TPC-DS Benchmark

2024-11-11 Thread Ba Jinsheng

>> The default configurations of PostgreSQL incur the error: "ERROR:  could not 
>> resize shared memory segment "/PostgreSQL.3539600020" to 2097152 bytes: No 
>> space left on device"

>No comment on your optimiser experiments for now, but for this error:
>it reminds me of a low/default --shm-size limit from Docker, or other
>similar container stuff?


Aha, you are right. The error disappears if running PostgreSQL out of docker 
container.
Notice: This email is generated from the account of an NUS alumnus. Contents, 
views, and opinions therein are solely those of the sender.


Re: Performance of Query 4 on TPC-DS Benchmark

2024-11-11 Thread Ba Jinsheng

>It is all the time a challenge for PostgreSQL to estimate such a filter
>because of absent information on joint column distribution.
>Can you research this way by building extended statistics on these
>clauses? It could move the plan to the more optimal direction.

Thanks a lot for your effort to analyze this issue, and we really appreciate 
your suggestions!  Currently, we focus on exposing these issues that affect 
performance. In the future, we may consider to look into such a direction as 
you suggested.


> Have you tried any tools to improve the cardinality yet, like aqo [0]?
Yes, but it takes nearly 1 hour to run this query at a time, so I only run 
"EXPLAIN ANALYZE" once, and the performance seems slightly improved.


  QUERY PLAN

 Limit  (cost=293880.50..293880.50 rows=1 width=132) (actual 
time=2527921.078..2527921.233 rows=8 loops=1)
   CTE year_total
 ->  Gather  (cost=115049.92..233367.07 rows=384208 width=216) (actual 
time=1116.139..4005.105 rows=384208 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Append  (cost=114049.92..193946.27 rows=160087 
width=216) (actual time=2430.791..2510.131 rows=128069 loops=3)
 ->  HashAggregate  (cost=190763.57..193145.83 rows=190581 
width=216) (actual time=3977.521..4070.200 rows=190581 loops=1)
   Group Key: customer.c_customer_id, 
customer.c_first_name, customer.c_last_name, customer.c_preferred_cust_flag, 
customer.c_birth_country, customer.c_login, customer.c_email_address, 
date_dim.d_year
   Worker 1:  Batches: 1  Memory Usage: 120857kB
   ->  Hash Join  (cost=8151.60..103486.35 rows=2685453 
width=174) (actual time=64.667..1605.601 rows=2685453 loops=1)
 Hash Cond: (store_sales.ss_sold_date_sk = 
date_dim.d_date_sk)
 ->  Hash Join  (cost=5103.00..93216.88 
rows=2750652 width=174) (actual time=48.111..1121.801 rows=2750652 loops=1)
   Hash Cond: (store_sales.ss_customer_sk = 
customer.c_customer_sk)
   ->  Seq Scan on store_sales  
(cost=0.00..80552.52 rows=2880404 width=30) (actual time=0.068..230.529 
rows=2880404 loops=1)
   ->  Hash  (cost=3853.00..3853.00 rows=10 
width=152) (actual time=47.735..47.735 rows=10 loops=1)
 Buckets: 131072  Batches: 1  Memory 
Usage: 17161kB
 ->  Seq Scan on customer  
(cost=0.00..3853.00 rows=10 width=152) (actual time=0.012..25.023 
rows=10 loops=1)
 ->  Hash  (cost=2135.49..2135.49 rows=73049 
width=8) (actual time=16.242..16.242 rows=73049 loops=1)
   Buckets: 131072  Batches: 1  Memory Usage: 
3878kB
   ->  Seq Scan on date_dim  
(cost=0.00..2135.49 rows=73049 width=8) (actual time=0.074..8.744 rows=73049 
loops=1)
 ->  HashAggregate  (cost=114049.92..115762.15 rows=136978 
width=216) (actual time=2199.723..2268.851 rows=136978 loops=1)
   Group Key: customer_1.c_customer_id, 
customer_1.c_first_name, customer_1.c_last_name, 
customer_1.c_preferred_cust_flag, customer_1.c_birth_country, 
customer_1.c_login, customer_1.c_email_address, date_dim_1.d_year
   Worker 0:  Batches: 1  Memory Usage: 88089kB
   ->  Hash Join  (cost=8151.60..67544.41 rows=1430939 
width=177) (actual time=81.920..911.231 rows=1430939 loops=1)
 Hash Cond: (catalog_sales.cs_sold_date_sk = 
date_dim_1.d_date_sk)
 ->  Hash Join  (cost=5103.00..60729.97 
rows=1434519 width=177) (actual time=53.469..638.140 rows=1434519 loops=1)
   Hash Cond: 
(catalog_sales.cs_bill_customer_sk = customer_1.c_customer_sk)
   ->  Seq Scan on catalog_sales  
(cost=0.00..51842.75 rows=1441548 width=33) (actual time=0.066..134.023 
rows=1441548 loops=1)
   ->  Hash  (cost=3853.00..3853.00 rows=10 
width=152) (actual time=52.937..52.937 rows=10 loops=1)
 Buckets: 131072  Batches: 1  Memory 
Usage: 17161kB
 ->  Seq Scan on customer customer_1  
(cost=0.00..3853.00 rows=10 width=152) (actual time=0.019..27.549 
rows=10 loops=1)
 ->  Hash  (cost=2135.49..2135.49 rows=73049 
width=8) (actual time=27.968..27.968 rows=73049 loops=1)
   Buckets: 131072  Batches: 1  Memory Usage: 
3878kB
   ->  Seq Scan on date_dim date_dim_1  
(cost=0.00..2135.49 rows=73049 width=8) (actual time=0.099..14.115 rows=73049 
loops=1)
 ->  HashAggregate  (cost=61268.33..61976.44 rows=56649 
width=

Re: Performance of Query 4 on TPC-DS Benchmark

2024-11-11 Thread Andrei Lepikhov

On 11/11/24 02:35, Ba Jinsheng wrote:

Hi all,

Please see this case:


Query 4 on TPC-DS benchmark:

Thank you for interesting example!
Looking into explains I see two sortings:
->  Sort  (cost=794037.94..794037.95 rows=1 width=132)
   (actual time=3024403.310..3024403.313 rows=8 loops=1)
->  Sort  (cost=794033.93..794033.94 rows=1 width=132)
   (actual time=8068.869..8068.872 rows=8 loops=1)

Almost the same cost and different execution time. So, I think, the core 
of the problem in accuracy of selectivity estimation.

In this specific example I see lots of composite scan filters:
- ((sale_type = 'w'::text) AND (dyear = 2002))
- ((year_total > '0'::numeric) AND (sale_type = 'w'::text) AND (dyear = 
2001))
- ((year_total > '0'::numeric) AND (sale_type = 's'::text) AND (dyear = 
2001))


It is all the time a challenge for PostgreSQL to estimate such a filter 
because of absent information on joint column distribution.
Can you research this way by building extended statistics on these 
clauses? It could move the plan to the more optimal direction.


--
regards, Andrei Lepikhov




Re: Performance of Query 4 on TPC-DS Benchmark

2024-11-10 Thread Alena Rybakina

On 10.11.2024 23:16, Alena Rybakina wrote:


Hi!

On 10.11.2024 22:35, Ba Jinsheng wrote:

Hi all,

Please see this case:


Query 4 on TPC-DS benchmark:

with year_total as (
 select c_customer_id customer_id
       ,c_first_name customer_first_name
       ,c_last_name customer_last_name
       ,c_preferred_cust_flag customer_preferred_cust_flag
       ,c_birth_country customer_birth_country
       ,c_login customer_login
       ,c_email_address customer_email_address
       ,d_year dyear
 ,sum(((ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt)+ss_ext_sales_price)/2) 
year_total

       ,'s' sale_type
 from customer
     ,store_sales
     ,date_dim
 where c_customer_sk = ss_customer_sk
   and ss_sold_date_sk = d_date_sk
 group by c_customer_id
         ,c_first_name
         ,c_last_name
         ,c_preferred_cust_flag
         ,c_birth_country
         ,c_login
         ,c_email_address
         ,d_year
 union all
 select c_customer_id customer_id
       ,c_first_name customer_first_name
       ,c_last_name customer_last_name
       ,c_preferred_cust_flag customer_preferred_cust_flag
       ,c_birth_country customer_birth_country
       ,c_login customer_login
       ,c_email_address customer_email_address
       ,d_year dyear
 ,sumcs_ext_list_price-cs_ext_wholesale_cost-cs_ext_discount_amt)+cs_ext_sales_price)/2) 
) year_total

       ,'c' sale_type
 from customer
     ,catalog_sales
     ,date_dim
 where c_customer_sk = cs_bill_customer_sk
   and cs_sold_date_sk = d_date_sk
 group by c_customer_id
         ,c_first_name
         ,c_last_name
         ,c_preferred_cust_flag
         ,c_birth_country
         ,c_login
         ,c_email_address
         ,d_year
union all
 select c_customer_id customer_id
       ,c_first_name customer_first_name
       ,c_last_name customer_last_name
       ,c_preferred_cust_flag customer_preferred_cust_flag
       ,c_birth_country customer_birth_country
       ,c_login customer_login
       ,c_email_address customer_email_address
       ,d_year dyear
 ,sumws_ext_list_price-ws_ext_wholesale_cost-ws_ext_discount_amt)+ws_ext_sales_price)/2) 
) year_total

       ,'w' sale_type
 from customer
     ,web_sales
     ,date_dim
 where c_customer_sk = ws_bill_customer_sk
   and ws_sold_date_sk = d_date_sk
 group by c_customer_id
         ,c_first_name
         ,c_last_name
         ,c_preferred_cust_flag
         ,c_birth_country
         ,c_login
         ,c_email_address
         ,d_year
         )
  select
                  t_s_secyear.customer_id
                 ,t_s_secyear.customer_first_name
                 ,t_s_secyear.customer_last_name
                 ,t_s_secyear.customer_email_address
 from year_total t_s_firstyear
     ,year_total t_s_secyear
     ,year_total t_c_firstyear
     ,year_total t_c_secyear
     ,year_total t_w_firstyear
     ,year_total t_w_secyear
 where t_s_secyear.customer_id = t_s_firstyear.customer_id
   and t_s_firstyear.customer_id = t_c_secyear.customer_id
   and t_s_firstyear.customer_id = t_c_firstyear.customer_id
   and t_s_firstyear.customer_id = t_w_firstyear.customer_id
   and t_s_firstyear.customer_id = t_w_secyear.customer_id
   and t_s_firstyear.sale_type = 's'
   and t_c_firstyear.sale_type = 'c'
   and t_w_firstyear.sale_type = 'w'
   and t_s_secyear.sale_type = 's'
   and t_c_secyear.sale_type = 'c'
   and t_w_secyear.sale_type = 'w'
   and t_s_firstyear.dyear =  2001
   and t_s_secyear.dyear = 2001+1
   and t_c_firstyear.dyear =  2001
   and t_c_secyear.dyear =  2001+1
   and t_w_firstyear.dyear = 2001
   and t_w_secyear.dyear = 2001+1
   and t_s_firstyear.year_total > 0
   and t_c_firstyear.year_total > 0
   and t_w_firstyear.year_total > 0
   and case when t_c_firstyear.year_total > 0 then 
t_c_secyear.year_total / t_c_firstyear.year_total else null end
           > case when t_s_firstyear.year_total > 0 then 
t_s_secyear.year_total / t_s_firstyear.year_total else null end
   and case when t_c_firstyear.year_total > 0 then 
t_c_secyear.year_total / t_c_firstyear.year_total else null end
           > case when t_w_firstyear.year_total > 0 then 
t_w_secyear.year_total / t_w_firstyear.year_total else null end

 order by t_s_secyear.customer_id
         ,t_s_secyear.customer_first_name
         ,t_s_secyear.customer_last_name
         ,t_s_secyear.customer_email_address
limit 100;





The execution time is more than 50 minutes:
                                                   QUERY PLAN

 Limit  (cost=1255378.56..1255378.57 rows=1 width=132) (actual 
time=3024403.311..3024403.342 rows=8 loops=1)

   CTE year_total
     ->  Append  (cost=197433.23..461340.62 ro

Re: Performance of Query 4 on TPC-DS Benchmark

2024-11-10 Thread Alena Rybakina

Hi!

On 10.11.2024 22:35, Ba Jinsheng wrote:

Hi all,

Please see this case:


Query 4 on TPC-DS benchmark:

with year_total as (
 select c_customer_id customer_id
       ,c_first_name customer_first_name
       ,c_last_name customer_last_name
       ,c_preferred_cust_flag customer_preferred_cust_flag
       ,c_birth_country customer_birth_country
       ,c_login customer_login
       ,c_email_address customer_email_address
       ,d_year dyear
 ,sum(((ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt)+ss_ext_sales_price)/2) 
year_total

       ,'s' sale_type
 from customer
     ,store_sales
     ,date_dim
 where c_customer_sk = ss_customer_sk
   and ss_sold_date_sk = d_date_sk
 group by c_customer_id
         ,c_first_name
         ,c_last_name
         ,c_preferred_cust_flag
         ,c_birth_country
         ,c_login
         ,c_email_address
         ,d_year
 union all
 select c_customer_id customer_id
       ,c_first_name customer_first_name
       ,c_last_name customer_last_name
       ,c_preferred_cust_flag customer_preferred_cust_flag
       ,c_birth_country customer_birth_country
       ,c_login customer_login
       ,c_email_address customer_email_address
       ,d_year dyear
 ,sumcs_ext_list_price-cs_ext_wholesale_cost-cs_ext_discount_amt)+cs_ext_sales_price)/2) 
) year_total

       ,'c' sale_type
 from customer
     ,catalog_sales
     ,date_dim
 where c_customer_sk = cs_bill_customer_sk
   and cs_sold_date_sk = d_date_sk
 group by c_customer_id
         ,c_first_name
         ,c_last_name
         ,c_preferred_cust_flag
         ,c_birth_country
         ,c_login
         ,c_email_address
         ,d_year
union all
 select c_customer_id customer_id
       ,c_first_name customer_first_name
       ,c_last_name customer_last_name
       ,c_preferred_cust_flag customer_preferred_cust_flag
       ,c_birth_country customer_birth_country
       ,c_login customer_login
       ,c_email_address customer_email_address
       ,d_year dyear
 ,sumws_ext_list_price-ws_ext_wholesale_cost-ws_ext_discount_amt)+ws_ext_sales_price)/2) 
) year_total

       ,'w' sale_type
 from customer
     ,web_sales
     ,date_dim
 where c_customer_sk = ws_bill_customer_sk
   and ws_sold_date_sk = d_date_sk
 group by c_customer_id
         ,c_first_name
         ,c_last_name
         ,c_preferred_cust_flag
         ,c_birth_country
         ,c_login
         ,c_email_address
         ,d_year
         )
  select
                  t_s_secyear.customer_id
                 ,t_s_secyear.customer_first_name
                 ,t_s_secyear.customer_last_name
                 ,t_s_secyear.customer_email_address
 from year_total t_s_firstyear
     ,year_total t_s_secyear
     ,year_total t_c_firstyear
     ,year_total t_c_secyear
     ,year_total t_w_firstyear
     ,year_total t_w_secyear
 where t_s_secyear.customer_id = t_s_firstyear.customer_id
   and t_s_firstyear.customer_id = t_c_secyear.customer_id
   and t_s_firstyear.customer_id = t_c_firstyear.customer_id
   and t_s_firstyear.customer_id = t_w_firstyear.customer_id
   and t_s_firstyear.customer_id = t_w_secyear.customer_id
   and t_s_firstyear.sale_type = 's'
   and t_c_firstyear.sale_type = 'c'
   and t_w_firstyear.sale_type = 'w'
   and t_s_secyear.sale_type = 's'
   and t_c_secyear.sale_type = 'c'
   and t_w_secyear.sale_type = 'w'
   and t_s_firstyear.dyear =  2001
   and t_s_secyear.dyear = 2001+1
   and t_c_firstyear.dyear =  2001
   and t_c_secyear.dyear =  2001+1
   and t_w_firstyear.dyear = 2001
   and t_w_secyear.dyear = 2001+1
   and t_s_firstyear.year_total > 0
   and t_c_firstyear.year_total > 0
   and t_w_firstyear.year_total > 0
   and case when t_c_firstyear.year_total > 0 then 
t_c_secyear.year_total / t_c_firstyear.year_total else null end
           > case when t_s_firstyear.year_total > 0 then 
t_s_secyear.year_total / t_s_firstyear.year_total else null end
   and case when t_c_firstyear.year_total > 0 then 
t_c_secyear.year_total / t_c_firstyear.year_total else null end
           > case when t_w_firstyear.year_total > 0 then 
t_w_secyear.year_total / t_w_firstyear.year_total else null end

 order by t_s_secyear.customer_id
         ,t_s_secyear.customer_first_name
         ,t_s_secyear.customer_last_name
         ,t_s_secyear.customer_email_address
limit 100;





The execution time is more than 50 minutes:
                                               QUERY PLAN

 Limit  (cost=1255378.56..1255378.57 rows=1 width=132) (actual 
time=3024403.311..3024403.342 rows=8 loops=1)

   CTE year_total
     ->  Append  (cost=197433.23..461340.62 rows=5041142 width=216) 
(actual time=4126.043..789

Re: Performance of Query 4 on TPC-DS Benchmark

2024-11-10 Thread Thomas Munro
On Mon, Nov 11, 2024 at 8:36 AM Ba Jinsheng  wrote:
> The default configurations of PostgreSQL incur the error: "ERROR:  could not 
> resize shared memory segment "/PostgreSQL.3539600020" to 2097152 bytes: No 
> space left on device"

No comment on your optimiser experiments for now, but for this error:
it reminds me of a low/default --shm-size limit from Docker, or other
similar container stuff?