Re: [PERFORM] query performance issue

2017-11-15 Thread Pavel Stehule
2017-11-15 20:58 GMT+01:00 Gunther :

>
> On 11/15/2017 8:12, Pavel Stehule wrote:
>
> There is a wrong plan due to a wrong estimation.
>
> For this query you should penalize nested loops:
>
> set enable_nestloop to off;
>
> before running this query.
>
>
> You are not the only one with this issue. May I suggest looking at this
> thread from a little earlier this month:
>
> http://www.postgresql-archive.org/OLAP-reporting-queries-fall-into-nested-loops-over-seq-scans-or-other-horrible-planner-choices-tp5990160.html
>
> where this has been discussed at some length.
>

It is a typical issue. The source of these problems is correlation between
columns (it can be fixed partially by multicolumn statistics in PostgreSQL
10). Another problem is missing multi-table statistics - the PostgreSQL
planner assumes that every value from a dictionary table has the same
probability, which is usually not true. Some OLAP techniques, like calendar
tables, usually have a very bad impact on estimations as a result.
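
For illustration, a minimal sketch of the PostgreSQL 10 feature mentioned
above (column names borrowed from the query discussed in this thread; whether
it actually helps here depends on which clauses are involved):

CREATE STATISTICS dlr_qlfy_grp_stats (dependencies)
    ON qlfy_grp_id, cog_grp_id
    FROM test.dlr_qlfy;
ANALYZE test.dlr_qlfy;   -- re-collect statistics so the planner can use them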

Regards

Pavel


> regards,
> -Gunther
>
>
>


Re: [PERFORM] query performance issue

2017-11-15 Thread Gunther


On 11/15/2017 8:12, Pavel Stehule wrote:

There is a wrong plan due to a wrong estimation.

For this query you should penalize nested loops:

set enable_nestloop to off;

before running this query.


You are not the only one with this issue. May I suggest looking at this
thread from a little earlier this month.


http://www.postgresql-archive.org/OLAP-reporting-queries-fall-into-nested-loops-over-seq-scans-or-other-horrible-planner-choices-tp5990160.html

where this has been discussed at some length.

regards,
-Gunther




Re: [PERFORM] query performance issue

2017-11-15 Thread Justin Pryzby
On Wed, Nov 15, 2017 at 03:03:39PM +0530, Samir Magar wrote:
> I am having performance issues with one of the queries.
> The query is taking 39 min to fetch 3.5 million records.
> 
> I want to reduce that time to 15 mins.
> Could you please suggest something to improve its performance?

> "HashAggregate  (cost=4459.68..4459.69 rows=1 width=27) (actual 
> time=2890035.403..2892173.601 rows=3489861 loops=1)"

Looks to me like the problem is here:

> "  ->  Index Only Scan using idxdq7 on dlr_qlfy  (cost=0.43..4.45 ROWS=1 
> width=16) (actual time=0.009..0.066 ROWS=121 loops=103987)"
> "Index Cond: ((qlfy_grp_id = dlr_grp.dlr_grp_id) AND (qlf_flg = 
> 'N'::bpchar) AND (cog_grp_id = dlr_grp_dlr_xref_1.dlr_grp_id))"
> "Heap Fetches: 0"

Returning 100x more rows than expected and bubbling up through a cascade of
nested loops.

Are those 3 conditions independent?  Or, perhaps, are rows for which
"qlfy_grp_id = dlr_grp.dlr_grp_id" is true always going to have
"cog_grp_id = dlr_grp_dlr_xref_1.dlr_grp_id"?

Even if it's not "always" true, if rows which pass the one condition are more
likely to pass the other condition, this will cause an underestimate, as
observed.

You can do an experiment SELECTing just from those two tables joined and see if
you can reproduce the problem with poor rowcount estimate (hopefully in much
less than 15min).

If you can't drop one of the two conditions, you can make PG treat them as a
single condition for the purpose of determining expected selectivity, using a
ROW() comparison like:

ROW(qlfy_grp_id, cog_grp_id) = ROW(dlr_grp.dlr_grp_id, dlr_grp_dlr_xref_1.dlr_grp_id)

If you're running PG96+ you may also be able to work around this by adding FKs.
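
For example (hypothetical - this assumes dlr_grp.dlr_grp_id is unique or the
primary key, and that the data really satisfies the constraint), something
like:

ALTER TABLE test.dlr_qlfy
    ADD CONSTRAINT dlr_qlfy_cog_grp_fk
    FOREIGN KEY (cog_grp_id) REFERENCES test.dlr_grp (dlr_grp_id);

Since 9.6 the planner can use declared foreign keys to avoid multiplying the
selectivities of join clauses that are implied by the constraint.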

Justin




Re: [PERFORM] query performance issue

2017-11-15 Thread Pavel Stehule
2017-11-15 13:54 GMT+01:00 Samir Magar :

> please find the EXPLAIN ANALYZE output.
>
> On Wed, Nov 15, 2017 at 3:13 PM, Pavel Stehule 
> wrote:
>
>> Hi
>>
>> please send EXPLAIN ANALYZE output.
>>
>> Regards
>>
>> Pavel
>>
>> 2017-11-15 10:33 GMT+01:00 Samir Magar :
>>
>>> Hello,
>>> I am having performance issues with one of the queries.
>>> The query is taking 39 min to fetch 3.5 million records.
>>>
>>> I want to reduce that time to 15 mins.
>>> Could you please suggest something to improve its performance?
>>>
>>> server configuration:
>>> CPUs = 4
>>> memory = 16 GB
>>> shared_buffers = 3 GB
>>> work_mem = 100MB
>>> effective_cache_size = 12 GB
>>>
>>> We are doing vacuum/analyze regularly on the database.
>>>
>>> Attached is the query with its explain plan.
>>>
>>>

There is a wrong plan due to a wrong estimation.

For this query you should penalize nested loops:

set enable_nestloop to off;

before running this query.
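
For example, scoped so it does not affect the rest of the session (a sketch -
the problem query itself is in the attachment to the original mail):

SET enable_nestloop = off;        -- penalize nested loops
-- EXPLAIN (ANALYZE, BUFFERS) <the attached query>;
RESET enable_nestloop;            -- restore the default afterwards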


Thanks,
>>> Samir Magar
>>>
>>>
>>>
>>>
>>
>


Re: [PERFORM] query performance issue

2017-11-15 Thread Samir Magar
please find the EXPLAIN ANALYZE output.

On Wed, Nov 15, 2017 at 3:13 PM, Pavel Stehule 
wrote:

> Hi
>
> please send EXPLAIN ANALYZE output.
>
> Regards
>
> Pavel
>
> 2017-11-15 10:33 GMT+01:00 Samir Magar :
>
>> Hello,
>> I am having performance issues with one of the queries.
>> The query is taking 39 min to fetch 3.5 million records.
>>
>> I want to reduce that time to 15 mins.
>> Could you please suggest something to improve its performance?
>>
>> server configuration:
>> CPUs = 4
>> memory = 16 GB
>> shared_buffers = 3 GB
>> work_mem = 100MB
>> effective_cache_size = 12 GB
>>
>> We are doing vacuum/analyze regularly on the database.
>>
>> Attached is the query with its explain plan.
>>
>> Thanks,
>> Samir Magar
>>
>>
>>
>>
>
"HashAggregate  (cost=4459.68..4459.69 rows=1 width=27) (actual 
time=2890035.403..2892173.601 rows=3489861 loops=1)"
"  Group Key: dlr_qlfy.dlr_qlfy_id, nmq_req.grace_prd, nmq_req.hide_prg_flg, 
nmq_req.ntfy_dlr_flg, dlr_loc.acct_num, nmq_req.nmq_req_id, new_mdl.pi_mdl_id"
"  ->  Nested Loop  (cost=3.59..4459.67 rows=1 width=27) (actual 
time=0.228..2864594.177 rows=12321289 loops=1)"
"->  Nested Loop  (cost=3.31..4459.29 rows=1 width=27) (actual 
time=0.221..2819927.249 rows=12321289 loops=1)"
"  ->  Nested Loop  (cost=3.03..4451.45 rows=1 width=15) (actual 
time=0.158..36816.304 rows=12612983 loops=1)"
"Join Filter: (lead_loc.dlr_loc_id = 
dlr_grp_1.lead_dlr_loc_id)"
"->  Nested Loop  (cost=0.58..1358.94 rows=263 width=15) 
(actual time=0.046..363.150 rows=52261 loops=1)"
"  ->  Nested Loop  (cost=0.29..1227.46 rows=169 
width=15) (actual time=0.024..86.909 rows=12151 loops=1)"
"->  Seq Scan on dlr_loc lead_loc  
(cost=0.00..757.80 rows=169 width=4) (actual time=0.010..31.028 rows=12151 
loops=1)"
"  Filter: (acct_num = cog_parnt_acct)"
"  Rows Removed by Filter: 21593"
"->  Index Only Scan using "IDX_101" on dlr_loc 
cog_lead  (cost=0.29..2.77 rows=1 width=11) (actual time=0.003..0.004 rows=1 
loops=12151)"
"  Index Cond: (dlr_loc_id = 
lead_loc.dlr_loc_id)"
"  Heap Fetches: 0"
"  ->  Index Scan using idx_14 on stg_acflx_nmq_dlrs  
(cost=0.29..0.63 rows=15 width=14) (actual time=0.008..0.019 rows=4 
loops=12151)"
"Index Cond: (rltnp_lead_acct = 
cog_lead.acct_num)"
"->  Nested Loop  (cost=2.45..11.75 rows=1 width=33) 
(actual time=0.058..0.615 rows=241 loops=52261)"
"  ->  Index Only Scan using idx3 on dlr_grp dlr_grp_1  
(cost=0.29..0.32 rows=1 width=8) (actual time=0.005..0.005 rows=1 loops=52261)"
"Index Cond: ((lead_dlr_loc_id = 
cog_lead.dlr_loc_id) AND (dlr_grp_typ = 'COG'::bpchar))"
"Heap Fetches: 0"
"  ->  Nested Loop  (cost=2.17..11.42 rows=1 width=37) 
(actual time=0.051..0.530 rows=236 loops=53436)"
"Join Filter: (dlr_loc_2.acct_num = 
dlr_loc.acct_num)"
"->  Nested Loop  (cost=0.58..0.77 rows=1 
width=11) (actual time=0.015..0.016 rows=1 loops=53436)"
"  ->  Index Only Scan using idx6 on 
dlr_loc dlr_loc_2  (cost=0.29..0.32 rows=1 width=11) (actual time=0.009..0.009 
rows=1 loops=53436)"
"Index Cond: ((acct_num = 
stg_acflx_nmq_dlrs.acct_id) AND (is_actv = 'Y'::bpchar))"
"Heap Fetches: 0"
"  ->  Index Only Scan using idx7 on 
dlr_grp_dlr_xref dlr_grp_dlr_xref_1  (cost=0.29..0.43 rows=1 width=8) (actual 
time=0.004..0.005 rows=1 loops=53402)"
"Index Cond: ((dlr_loc_id = 
dlr_loc_2.dlr_loc_id) AND (dlr_grp_id = dlr_grp_1.dlr_grp_id))"
"Heap Fetches: 0"
"->  Nested Loop  (cost=1.58..10.64 rows=1 
width=26) (actual time=0.036..0.425 rows=243 loops=51988)"
"  ->  Index Only Scan using idx10 on 
dlr_loc  (cost=0.29..0.32 rows=1 width=7) (actual time=0.009..0.009 rows=1 
loops=51988)"
"Index Cond: ((is_actv = 
'Y'::bpchar) AND (acct_num = stg_acflx_nmq_dlrs.acct_id))"
"Heap Fetches: 0"
"  ->  Nested Loop  (cost=1.29..10.30 
rows=1 width=19) (actual time=0.026..0.354 rows=243 loops=51988)"
"

Re: [PERFORM] query performance issue

2017-11-15 Thread Pavel Stehule
Hi

please send EXPLAIN ANALYZE output.

Regards

Pavel

2017-11-15 10:33 GMT+01:00 Samir Magar :

> Hello,
> I am having performance issues with one of the queries.
> The query is taking 39 min to fetch 3.5 million records.
>
> I want to reduce that time to 15 mins.
> Could you please suggest something to improve its performance?
>
> server configuration:
> CPUs = 4
> memory = 16 GB
> shared_buffers = 3 GB
> work_mem = 100MB
> effective_cache_size = 12 GB
>
> We are doing vacuum/analyze regularly on the database.
>
> Attached is the query with its explain plan.
>
> Thanks,
> Samir Magar
>
>
>
>


[PERFORM] query performance issue

2017-11-15 Thread Samir Magar
Hello,
I am having performance issues with one of the queries.
The query is taking 39 min to fetch 3.5 million records.

I want to reduce that time to 15 mins.
Could you please suggest something to improve its performance?

server configuration:
CPUs = 4
memory = 16 GB
shared_buffers = 3 GB
work_mem = 100MB
effective_cache_size = 12 GB

We are doing vacuum/analyze regularly on the database.

Attached is the query with its explain plan.

Thanks,
Samir Magar
query:

SELECT 
DISTINCT
DLR_QLFY.DLR_QLFY_ID as DLR_QLFY_ID, NMQ_REQ.GRACE_PRD as GRACE_PRD, 
NMQ_REQ.HIDE_PRG_FLG as HIDE_PRG_FLG, NMQ_REQ.NTFY_DLR_FLG as NTFY_DLR_FLG, 
DLR_LOC.ACCT_NUM as ACCT_NUM, NMQ_REQ.NMQ_REQ_ID as NMQ_REQ_ID, 
NEW_MDL.PI_MDL_ID as PI_MDL_ID 

 
FROM test.DLR_QLFY INNER JOIN 
(SELECT DLR_GRP.DLR_GRP_ID AS LOC_GRP_ID,LEAD_DLR_LOC_ID,DLR_LOC.ACCT_NUM AS 
LOC_ACCT_NUM FROM test.DLR_GRP, test.DLR_GRP_DLR_XREF, test.DLR_LOC WHERE 
DLR_GRP.DLR_GRP_ID=DLR_GRP_DLR_XREF.DLR_GRP_ID AND 
DLR_GRP_DLR_XREF.DLR_LOC_ID=DLR_LOC.DLR_LOC_ID AND (DLR_GRP.DLR_GRP_TYP='LOC' 
OR DLR_GRP.DLR_GRP_TYP='COG') AND DLR_LOC.IS_ACTV='Y' ) LOC_GRP
ON DLR_QLFY.QLFY_GRP_ID=LOC_GRP.LOC_GRP_ID
INNER JOIN (SELECT DLR_GRP.DLR_GRP_ID AS 
COG_GRP_ID,LEAD_DLR_LOC_ID,DLR_LOC.ACCT_NUM AS COG_ACCT_NUM FROM 
test.DLR_GRP,test.DLR_GRP_DLR_XREF,test.DLR_LOC WHERE 
DLR_GRP.DLR_GRP_ID=DLR_GRP_DLR_XREF.DLR_GRP_ID AND 
DLR_GRP_DLR_XREF.DLR_LOC_ID=DLR_LOC.DLR_LOC_ID AND DLR_GRP.DLR_GRP_TYP='COG' 
AND DLR_LOC.IS_ACTV='Y' ) COG_GRP
ON DLR_QLFY.COG_GRP_ID=COG_GRP.COG_GRP_ID
INNER JOIN test.DLR_LOC
ON DLR_LOC.ACCT_NUM=LOC_GRP.LOC_ACCT_NUM
AND  DLR_LOC.ACCT_NUM=COG_GRP.COG_ACCT_NUM
INNER JOIN test.DLR_LOC LEAD_LOC
ON LEAD_LOC.DLR_LOC_ID=COG_GRP.LEAD_DLR_LOC_ID
AND  LEAD_LOC.ACCT_NUM=LEAD_LOC.COG_PARNT_ACCT
INNER JOIN test.DLR_LOC COG_LEAD
ON COG_LEAD.DLR_LOC_ID=COG_GRP.LEAD_DLR_LOC_ID
INNER JOIN test.NMQ_REQ
ON DLR_QLFY.NMQ_REQ_ID=NMQ_REQ.NMQ_REQ_ID
INNER JOIN test.NEW_MDL
ON NMQ_REQ.NEW_MDL_ID = NEW_MDL.NEW_MDL_ID
INNER JOIN test.STG_ACFLX_NMQ_DLRS
ON COG_LEAD.ACCT_NUM=STG_ACFLX_NMQ_DLRS.RLTNP_LEAD_ACCT
AND  STG_ACFLX_NMQ_DLRS.ACCT_ID=DLR_LOC.ACCT_NUM

WHERE 
 DLR_LOC.IS_ACTV='Y'
AND  DLR_QLFY.QLF_FLG='N'
AND  NMQ_REQ.PGM_DSBL_FLG != 'Y'
AND  (NMQ_REQ.PGM_START_DT <= CURRENT_DATE
 AND NMQ_REQ.PGM_END_DT > CURRENT_DATE)
 
AND   DLR_QLFY.DLR_QLFY_ID NOT IN (SELECT DLR_QLFY.DLR_QLFY_ID FROM 
test.DLR_QLFY WHERE QLF_FLG='Y' AND DLR_QLFY.NMQ_REQ_ID=NMQ_REQ.NMQ_REQ_ID);


---

access plan




"HashAggregate  (cost=4538.33..4538.34 rows=1 width=27)"
"  Group Key: dlr_qlfy.dlr_qlfy_id, nmq_req.grace_prd, nmq_req.hide_prg_flg, 
nmq_req.ntfy_dlr_flg, dlr_loc.acct_num, nmq_req.nmq_req_id, new_mdl.pi_mdl_id"
"  ->  Nested Loop  (cost=3.59..4538.31 rows=1 width=27)"
"->  Nested Loop  (cost=3.31..4537.94 rows=1 width=27)"
"  ->  Nested Loop  (cost=3.03..4530.16 rows=1 width=15)"
"Join Filter: (lead_loc.dlr_loc_id = 
dlr_grp_1.lead_dlr_loc_id)"
"->  Nested Loop  (cost=0.58..1438.27 rows=263 width=15)"
"  ->  Nested Loop  (cost=0.29..1306.78 rows=169 
width=15)"
"->  Seq Scan on dlr_loc lead_loc  
(cost=0.00..757.12 rows=169 width=4)"
"  Filter: (acct_num = cog_parnt_acct)"
"->  Index Only Scan using "IDX_101" on dlr_loc 
cog_lead  (cost=0.29..3.24 rows=1 width=11)"
"  Index Cond: (dlr_loc_id = 
lead_loc.dlr_loc_id)"
"  ->  Index Scan using idx_14 on stg_acflx_nmq_dlrs  
(cost=0.29..0.63 rows=15 width=14)"
"Index Cond: (rltnp_lead_acct = 
cog_lead.acct_num)"
"->  Nested Loop  (cost=2.45..11.74 rows=1 width=33)"
"  ->  Index Only Scan using idx3 on dlr_grp dlr_grp_1  
(cost=0.29..0.32 rows=1 width=8)"
"Index Cond: ((lead_dlr_loc_id = 
cog_lead.dlr_loc_id) AND (dlr_grp_typ = 'COG'::bpchar))"
"  ->  Nested Loop  (cost=2.17..11.41 rows=1 width=37)"
"Join Filter: (dlr_loc_2.acct_num = 
dlr_loc.acct_num)"
"->  Nested Loop  (cost=0.58..0.77 rows=1 
width=11)"
"  ->  Index Only Scan using idx6 on 
dlr_loc dlr_loc_2  (cost=0.29..0.32 rows=1 width=11)"
"Index Cond: ((acct_num = 
stg_acflx_nmq_dlrs.acct_id) AND (is_actv = 'Y'::bpchar))"
"  ->  Index Only Scan using idx7 on 
dlr_grp_dlr_xref dlr_grp_dlr_xref_1  (cost=0.29..0.44 rows=1 width=8)"
"Index Cond: ((dlr_loc_id = 
dlr_loc_2.dlr_loc_id) AND (dlr_grp_id = dlr_grp_1.dlr_grp_id))"
"->  Nested Loop  (cost=1.58..10.63 rows=1 
width=26)"

Re: [PERFORM] Query performance changes significantly depending on limit value

2017-02-23 Thread Pavel Stehule
2017-02-23 17:45 GMT+01:00 Rowan Seymour :

> Not sure what other options we have other than an EAV approach since we
> allow users to define their own attribute types (attribute type is in
> contacts_contactfield, attribute value is in values_value). Would you
> expect modelling that with a JSON column to perform better?
>

It should be - maybe hstore, or jsonb with a suitable index. EAV works if you
don't do massive operations.

Usually the best approach is a mixed design - whatever can be relational
should be; attributes frequently used in filters should be relational
(columnar), and the others can go into some non-relational type - XML, JSON, ...
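
A minimal sketch of the jsonb route (hypothetical column and attribute names;
jsonb and jsonb_path_ops need PostgreSQL 9.4+, so this would mean upgrading
from 9.3):

ALTER TABLE contacts_contact ADD COLUMN attrs jsonb;
CREATE INDEX contacts_contact_attrs_gin
    ON contacts_contact USING gin (attrs jsonb_path_ops);

-- containment queries like this one can use the GIN index
SELECT id FROM contacts_contact WHERE attrs @> '{"gender": "F"}';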

Regards

Pavel


>
> Thanks for the tips!
>
> On 23 February 2017 at 17:35, Pavel Stehule 
> wrote:
>
>>
>>
>> 2017-02-23 15:02 GMT+01:00 Rowan Seymour :
>>
>>> Hi Pavel. That suggestion gets me as far as LIMIT 694 with the fast plan
>>> then things get slow again. This is now what happens at LIMIT 695:
>>>
>>> Limit  (cost=35945.78..50034.52 rows=695 width=88) (actual
>>> time=12852.580..12854.382 rows=695 loops=1)
>>>   Buffers: shared hit=6 read=66689
>>>   ->  Merge Join  (cost=35945.78..56176.80 rows=998 width=88) (actual
>>> time=12852.577..12854.271 rows=695 loops=1)
>>> Merge Cond: (contacts_contact.id =
>>> contacts_contactgroup_contacts.contact_id)
>>> Buffers: shared hit=6 read=66689
>>> ->  Sort  (cost=35944.53..35949.54 rows=2004 width=92) (actual
>>> time=12852.486..12852.577 rows=710 loops=1)
>>>   Sort Key: contacts_contact.id
>>>   Sort Method: quicksort  Memory: 34327kB
>>>   Buffers: shared hit=6 read=66677
>>>   ->  Hash Join  (cost=6816.19..35834.63 rows=2004 width=92)
>>> (actual time=721.293..12591.204 rows=200412 loops=1)
>>> Hash Cond: (contacts_contact.id = u0.contact_id)
>>> Buffers: shared hit=6 read=66677
>>> ->  Seq Scan on contacts_contact
>>>  (cost=0.00..25266.00 rows=100 width=88) (actual time=0.003..267.258
>>> rows=100 loops=1)
>>>   Buffers: shared hit=1 read=15265
>>> ->  Hash  (cost=6813.14..6813.14 rows=244 width=4)
>>> (actual time=714.373..714.373 rows=200412 loops=1)
>>>   Buckets: 1024  Batches: 1  Memory Usage: 7046kB
>>>   Buffers: shared hit=5 read=51412
>>>   ->  HashAggregate  (cost=6810.70..6813.14
>>> rows=244 width=4) (actual time=561.099..644.822 rows=200412 loops=1)
>>> Buffers: shared hit=5 read=51412
>>> ->  Bitmap Heap Scan on values_value u0
>>>  (cost=60.98..6805.69 rows=2004 width=4) (actual time=75.410..364.976
>>> rows=200412 loops=1)
>>>   Recheck Cond: ((contact_field_id =
>>> 1) AND (upper(string_value) = 'F'::text))
>>>   Buffers: shared hit=5 read=51412
>>>   ->  Bitmap Index Scan on
>>> values_value_field_string_value_contact  (cost=0.00..60.47 rows=2004
>>> width=0) (actual time=57.642..57.642 rows=200412 loops=1)
>>> Index Cond:
>>> ((contact_field_id = 1) AND (upper(string_value) = 'F'::text))
>>> Buffers: shared hit=5
>>> read=765
>>> ->  Index Only Scan Backward using 
>>> contacts_contactgroup_contacts_contactgroup_id_0f909f73_uniq
>>> on contacts_contactgroup_contacts  (cost=0.43..18967.29 rows=497992
>>> width=4) (actual time=0.080..0.651 rows=1707 loops=1)
>>>   Index Cond: (contactgroup_id = 1)
>>>   Heap Fetches: 0
>>>   Buffers: shared read=12
>>> Total runtime: 12863.938 ms
>>>
>>> https://explain.depesz.com/s/nfw1
>>>
>>> Can you explain a bit more about what you mean about " dependency
>>> between contact_field_id = 1 and upper(string_value) = 'F'::text"?
>>>
>>
>> look to related node in plan
>>
>>
>> ->  Hash  (cost=6813.14..6813.14 rows=244 width=4)
>> (actual time=714.373..714.373 rows=200412 loops=1)
>>   Buckets: 1024  Batches: 1  Memory Usage: 7046kB
>>   Buffers: shared hit=5 read=51412
>>   ->  HashAggregate  (cost=6810.70..6813.14
>> rows=244 width=4) (actual time=561.099..644.822 rows=200412 loops=1)
>> Buffers: shared hit=5 read=51412
>> ->  Bitmap Heap Scan on values_value u0
>>  (cost=60.98..6805.69 rows=2004 width=4) (actual time=75.410..364.976
>> rows=200412 loops=1)
>>   Recheck Cond: ((contact_field_id =
>> 1) AND (upper(string_value) = 'F'::text))
>>   Buffers: shared hit=5 read=51412
>>
>> There is a significant difference between the estimate (2004) and reality
>> (200412) - two orders of magnitude - so the plan must be suboptimal
>>

Re: [PERFORM] Query performance changes significantly depending on limit value

2017-02-23 Thread Rowan Seymour
Not sure what other options we have other than an EAV approach since we
allow users to define their own attribute types (attribute type is in
contacts_contactfield, attribute value is in values_value). Would you
expect modelling that with a JSON column to perform better?

Thanks for the tips!

On 23 February 2017 at 17:35, Pavel Stehule  wrote:

>
>
> 2017-02-23 15:02 GMT+01:00 Rowan Seymour :
>
>> Hi Pavel. That suggestion gets me as far as LIMIT 694 with the fast plan
>> then things get slow again. This is now what happens at LIMIT 695:
>>
>> Limit  (cost=35945.78..50034.52 rows=695 width=88) (actual
>> time=12852.580..12854.382 rows=695 loops=1)
>>   Buffers: shared hit=6 read=66689
>>   ->  Merge Join  (cost=35945.78..56176.80 rows=998 width=88) (actual
>> time=12852.577..12854.271 rows=695 loops=1)
>> Merge Cond: (contacts_contact.id = contacts_contactgroup_contacts
>> .contact_id)
>> Buffers: shared hit=6 read=66689
>> ->  Sort  (cost=35944.53..35949.54 rows=2004 width=92) (actual
>> time=12852.486..12852.577 rows=710 loops=1)
>>   Sort Key: contacts_contact.id
>>   Sort Method: quicksort  Memory: 34327kB
>>   Buffers: shared hit=6 read=66677
>>   ->  Hash Join  (cost=6816.19..35834.63 rows=2004 width=92)
>> (actual time=721.293..12591.204 rows=200412 loops=1)
>> Hash Cond: (contacts_contact.id = u0.contact_id)
>> Buffers: shared hit=6 read=66677
>> ->  Seq Scan on contacts_contact
>>  (cost=0.00..25266.00 rows=100 width=88) (actual time=0.003..267.258
>> rows=100 loops=1)
>>   Buffers: shared hit=1 read=15265
>> ->  Hash  (cost=6813.14..6813.14 rows=244 width=4)
>> (actual time=714.373..714.373 rows=200412 loops=1)
>>   Buckets: 1024  Batches: 1  Memory Usage: 7046kB
>>   Buffers: shared hit=5 read=51412
>>   ->  HashAggregate  (cost=6810.70..6813.14
>> rows=244 width=4) (actual time=561.099..644.822 rows=200412 loops=1)
>> Buffers: shared hit=5 read=51412
>> ->  Bitmap Heap Scan on values_value u0
>>  (cost=60.98..6805.69 rows=2004 width=4) (actual time=75.410..364.976
>> rows=200412 loops=1)
>>   Recheck Cond: ((contact_field_id =
>> 1) AND (upper(string_value) = 'F'::text))
>>   Buffers: shared hit=5 read=51412
>>   ->  Bitmap Index Scan on
>> values_value_field_string_value_contact  (cost=0.00..60.47 rows=2004
>> width=0) (actual time=57.642..57.642 rows=200412 loops=1)
>> Index Cond:
>> ((contact_field_id = 1) AND (upper(string_value) = 'F'::text))
>> Buffers: shared hit=5 read=765
>> ->  Index Only Scan Backward using 
>> contacts_contactgroup_contacts_contactgroup_id_0f909f73_uniq
>> on contacts_contactgroup_contacts  (cost=0.43..18967.29 rows=497992
>> width=4) (actual time=0.080..0.651 rows=1707 loops=1)
>>   Index Cond: (contactgroup_id = 1)
>>   Heap Fetches: 0
>>   Buffers: shared read=12
>> Total runtime: 12863.938 ms
>>
>> https://explain.depesz.com/s/nfw1
>>
>> Can you explain a bit more about what you mean about " dependency
>> between contact_field_id = 1 and upper(string_value) = 'F'::text"?
>>
>
> look to related node in plan
>
>
> ->  Hash  (cost=6813.14..6813.14 rows=244 width=4)
> (actual time=714.373..714.373 rows=200412 loops=1)
>   Buckets: 1024  Batches: 1  Memory Usage: 7046kB
>   Buffers: shared hit=5 read=51412
>   ->  HashAggregate  (cost=6810.70..6813.14
> rows=244 width=4) (actual time=561.099..644.822 rows=200412 loops=1)
> Buffers: shared hit=5 read=51412
> ->  Bitmap Heap Scan on values_value u0
>  (cost=60.98..6805.69 rows=2004 width=4) (actual time=75.410..364.976
> rows=200412 loops=1)
>   Recheck Cond: ((contact_field_id =
> 1) AND (upper(string_value) = 'F'::text))
>   Buffers: shared hit=5 read=51412
>
> There is a significant difference between the estimate (2004) and reality
> (200412) - two orders of magnitude - so the plan must be suboptimal
>
> I am looking at your schema - it is a variant of an EAV table - this is an
> antipattern, and for anything more than a small number of returned rows it
> will be slow.
>
> Regards
>
> Pavel
>
>
>
>> Btw I created the index values_value_field_string_value_contact as
>>
>> CREATE INDEX values_value_field_string_value_contact
>> ON values_value(contact_field_id, UPPER(string_value), contact_id DESC)
>> WHERE contact_field_id IS NOT NULL;
>>
>> I'm not sure why it needs the conta

Re: [PERFORM] Query performance changes significantly depending on limit value

2017-02-23 Thread Pavel Stehule
2017-02-23 15:02 GMT+01:00 Rowan Seymour :

> Hi Pavel. That suggestion gets me as far as LIMIT 694 with the fast plan
> then things get slow again. This is now what happens at LIMIT 695:
>
> Limit  (cost=35945.78..50034.52 rows=695 width=88) (actual
> time=12852.580..12854.382 rows=695 loops=1)
>   Buffers: shared hit=6 read=66689
>   ->  Merge Join  (cost=35945.78..56176.80 rows=998 width=88) (actual
> time=12852.577..12854.271 rows=695 loops=1)
> Merge Cond: (contacts_contact.id = contacts_contactgroup_
> contacts.contact_id)
> Buffers: shared hit=6 read=66689
> ->  Sort  (cost=35944.53..35949.54 rows=2004 width=92) (actual
> time=12852.486..12852.577 rows=710 loops=1)
>   Sort Key: contacts_contact.id
>   Sort Method: quicksort  Memory: 34327kB
>   Buffers: shared hit=6 read=66677
>   ->  Hash Join  (cost=6816.19..35834.63 rows=2004 width=92)
> (actual time=721.293..12591.204 rows=200412 loops=1)
> Hash Cond: (contacts_contact.id = u0.contact_id)
> Buffers: shared hit=6 read=66677
> ->  Seq Scan on contacts_contact  (cost=0.00..25266.00
> rows=100 width=88) (actual time=0.003..267.258 rows=100 loops=1)
>   Buffers: shared hit=1 read=15265
> ->  Hash  (cost=6813.14..6813.14 rows=244 width=4)
> (actual time=714.373..714.373 rows=200412 loops=1)
>   Buckets: 1024  Batches: 1  Memory Usage: 7046kB
>   Buffers: shared hit=5 read=51412
>   ->  HashAggregate  (cost=6810.70..6813.14
> rows=244 width=4) (actual time=561.099..644.822 rows=200412 loops=1)
> Buffers: shared hit=5 read=51412
> ->  Bitmap Heap Scan on values_value u0
>  (cost=60.98..6805.69 rows=2004 width=4) (actual time=75.410..364.976
> rows=200412 loops=1)
>   Recheck Cond: ((contact_field_id =
> 1) AND (upper(string_value) = 'F'::text))
>   Buffers: shared hit=5 read=51412
>   ->  Bitmap Index Scan on
> values_value_field_string_value_contact  (cost=0.00..60.47 rows=2004
> width=0) (actual time=57.642..57.642 rows=200412 loops=1)
> Index Cond: ((contact_field_id
> = 1) AND (upper(string_value) = 'F'::text))
> Buffers: shared hit=5 read=765
> ->  Index Only Scan Backward using contacts_contactgroup_
> contacts_contactgroup_id_0f909f73_uniq on contacts_contactgroup_contacts
>  (cost=0.43..18967.29 rows=497992 width=4) (actual time=0.080..0.651
> rows=1707 loops=1)
>   Index Cond: (contactgroup_id = 1)
>   Heap Fetches: 0
>   Buffers: shared read=12
> Total runtime: 12863.938 ms
>
> https://explain.depesz.com/s/nfw1
>
> Can you explain a bit more about what you mean about " dependency between
> contact_field_id = 1 and upper(string_value) = 'F'::text"?
>

look to related node in plan


->  Hash  (cost=6813.14..6813.14 rows=244 width=4)
(actual time=714.373..714.373 rows=200412 loops=1)
  Buckets: 1024  Batches: 1  Memory Usage: 7046kB
  Buffers: shared hit=5 read=51412
  ->  HashAggregate  (cost=6810.70..6813.14
rows=244 width=4) (actual time=561.099..644.822 rows=200412 loops=1)
Buffers: shared hit=5 read=51412
->  Bitmap Heap Scan on values_value u0
 (cost=60.98..6805.69 rows=2004 width=4) (actual time=75.410..364.976
rows=200412 loops=1)
  Recheck Cond: ((contact_field_id = 1)
AND (upper(string_value) = 'F'::text))
  Buffers: shared hit=5 read=51412

There is a significant difference between the estimate (2004) and reality
(200412) - two orders of magnitude - so the plan must be suboptimal.

I am looking at your schema - it is a variant of an EAV table - this is an
antipattern, and for anything more than a small number of returned rows it
will be slow.

Regards

Pavel



> Btw I created the index values_value_field_string_value_contact as
>
> CREATE INDEX values_value_field_string_value_contact
> ON values_value(contact_field_id, UPPER(string_value), contact_id DESC)
> WHERE contact_field_id IS NOT NULL;
>
> I'm not sure why it needs the contact_id column but without it the planner
> picks a slow approach for even smaller LIMIT values.
>
>
> On 23 February 2017 at 15:32, Pavel Stehule 
> wrote:
>
>>
>>
>> 2017-02-23 14:11 GMT+01:00 Rowan Seymour :
>>
>>> Hi guys
>>>
>>> I'm a bit stuck on a query that performs fantastically up to a certain
>>> limit value, after which the planner goes off in a completely different
>>> direction and performance gets dramatically worse. Am using Postgresql 9.3
>>>
>>> You c

Re: [PERFORM] Query performance changes significantly depending on limit value

2017-02-23 Thread Rowan Seymour
Hi Pavel. That suggestion gets me as far as LIMIT 694 with the fast plan
then things get slow again. This is now what happens at LIMIT 695:

Limit  (cost=35945.78..50034.52 rows=695 width=88) (actual
time=12852.580..12854.382 rows=695 loops=1)
  Buffers: shared hit=6 read=66689
  ->  Merge Join  (cost=35945.78..56176.80 rows=998 width=88) (actual
time=12852.577..12854.271 rows=695 loops=1)
Merge Cond: (contacts_contact.id =
contacts_contactgroup_contacts.contact_id)
Buffers: shared hit=6 read=66689
->  Sort  (cost=35944.53..35949.54 rows=2004 width=92) (actual
time=12852.486..12852.577 rows=710 loops=1)
  Sort Key: contacts_contact.id
  Sort Method: quicksort  Memory: 34327kB
  Buffers: shared hit=6 read=66677
  ->  Hash Join  (cost=6816.19..35834.63 rows=2004 width=92)
(actual time=721.293..12591.204 rows=200412 loops=1)
Hash Cond: (contacts_contact.id = u0.contact_id)
Buffers: shared hit=6 read=66677
->  Seq Scan on contacts_contact  (cost=0.00..25266.00
rows=100 width=88) (actual time=0.003..267.258 rows=100 loops=1)
  Buffers: shared hit=1 read=15265
->  Hash  (cost=6813.14..6813.14 rows=244 width=4)
(actual time=714.373..714.373 rows=200412 loops=1)
  Buckets: 1024  Batches: 1  Memory Usage: 7046kB
  Buffers: shared hit=5 read=51412
  ->  HashAggregate  (cost=6810.70..6813.14
rows=244 width=4) (actual time=561.099..644.822 rows=200412 loops=1)
Buffers: shared hit=5 read=51412
->  Bitmap Heap Scan on values_value u0
 (cost=60.98..6805.69 rows=2004 width=4) (actual time=75.410..364.976
rows=200412 loops=1)
  Recheck Cond: ((contact_field_id = 1)
AND (upper(string_value) = 'F'::text))
  Buffers: shared hit=5 read=51412
  ->  Bitmap Index Scan on
values_value_field_string_value_contact  (cost=0.00..60.47 rows=2004
width=0) (actual time=57.642..57.642 rows=200412 loops=1)
Index Cond: ((contact_field_id
= 1) AND (upper(string_value) = 'F'::text))
Buffers: shared hit=5 read=765
->  Index Only Scan Backward using
contacts_contactgroup_contacts_contactgroup_id_0f909f73_uniq on
contacts_contactgroup_contacts  (cost=0.43..18967.29 rows=497992 width=4)
(actual time=0.080..0.651 rows=1707 loops=1)
  Index Cond: (contactgroup_id = 1)
  Heap Fetches: 0
  Buffers: shared read=12
Total runtime: 12863.938 ms

https://explain.depesz.com/s/nfw1

Can you explain a bit more about what you mean by "dependency between
contact_field_id = 1 and upper(string_value) = 'F'::text"?

Btw I created the index values_value_field_string_value_contact as

CREATE INDEX values_value_field_string_value_contact
ON values_value(contact_field_id, UPPER(string_value), contact_id DESC)
WHERE contact_field_id IS NOT NULL;

I'm not sure why it needs the contact_id column but without it the planner
picks a slow approach for even smaller LIMIT values.


On 23 February 2017 at 15:32, Pavel Stehule  wrote:

>
>
> 2017-02-23 14:11 GMT+01:00 Rowan Seymour :
>
>> Hi guys
>>
>> I'm a bit stuck on a query that performs fantastically up to a certain
>> limit value, after which the planner goes off in a completely different
>> direction and performance gets dramatically worse. Am using Postgresql 9.3
>>
>> You can see all the relevant schemas at http://pastebin.com/PNEqw2id and
>> in the test database there are 1,000,000 records in contacts_contact, and
>> about half of those will match the subquery on values_value.
>>
>> The query in question is:
>>
>> SELECT "contacts_contact".* FROM "contacts_contact"
>> INNER JOIN "contacts_contactgroup_contacts" ON ("contacts_contact"."id"
>> = "contacts_contactgroup_contacts"."contact_id")
>> WHERE ("contacts_contactgroup_contacts"."contactgroup_id" = 1
>>AND "contacts_contact"."id" IN (
>>  SELECT U0."contact_id" FROM "values_value" U0 WHERE
>> (U0."contact_field_id" = 1 AND UPPER(U0."string_value"::text) = UPPER('F'))
>>)
>> ) ORDER BY "contacts_contact"."id" DESC LIMIT 222;
>>
>> With that limit of 222, it performs like:
>>
>> Limit  (cost=3.09..13256.36 rows=222 width=88) (actual time=0.122..3.358
>> rows=222 loops=1)
>>   Buffers: shared hit=708 read=63
>>   ->  Nested Loop  (cost=3.09..59583.10 rows=998 width=88) (actual
>> time=0.120..3.304 rows=222 loops=1)
>> Buffers: shared hit=708 read=63
>> ->  Merge Semi Join  (cost=2.65..51687.89 rows=2004 width=92)
>> (actual time=0.103..1.968 rows=227 loops=1)
>>   Merge Cond: (contacts_contact.id = u0.contact_id)
>>   Buffers: shared hit=24 read=63
>> 

Re: [PERFORM] Query performance changes significantly depending on limit value

2017-02-23 Thread Pavel Stehule
2017-02-23 14:11 GMT+01:00 Rowan Seymour :

> Hi guys
>
> I'm a bit stuck on a query that performs fantastically up to a certain
> limit value, after which the planner goes off in a completely different
> direction and performance gets dramatically worse. Am using Postgresql 9.3
>
> You can see all the relevant schemas at http://pastebin.com/PNEqw2id and
> in the test database there are 1,000,000 records in contacts_contact, and
> about half of those will match the subquery on values_value.
>
> The query in question is:
>
> SELECT "contacts_contact".* FROM "contacts_contact"
> INNER JOIN "contacts_contactgroup_contacts" ON ("contacts_contact"."id" =
> "contacts_contactgroup_contacts"."contact_id")
> WHERE ("contacts_contactgroup_contacts"."contactgroup_id" = 1
>AND "contacts_contact"."id" IN (
>  SELECT U0."contact_id" FROM "values_value" U0 WHERE
> (U0."contact_field_id" = 1 AND UPPER(U0."string_value"::text) = UPPER('F'))
>)
> ) ORDER BY "contacts_contact"."id" DESC LIMIT 222;
>
> With that limit of 222, it performs like:
>
> Limit  (cost=3.09..13256.36 rows=222 width=88) (actual time=0.122..3.358
> rows=222 loops=1)
>   Buffers: shared hit=708 read=63
>   ->  Nested Loop  (cost=3.09..59583.10 rows=998 width=88) (actual
> time=0.120..3.304 rows=222 loops=1)
> Buffers: shared hit=708 read=63
> ->  Merge Semi Join  (cost=2.65..51687.89 rows=2004 width=92)
> (actual time=0.103..1.968 rows=227 loops=1)
>   Merge Cond: (contacts_contact.id = u0.contact_id)
>   Buffers: shared hit=24 read=63
>   ->  Index Scan Backward using contacts_contact_pkey on
> contacts_contact  (cost=0.42..41249.43 rows=100 width=88) (actual
> time=0.008..0.502 rows=1117 loops=1)
> Buffers: shared hit=22 read=2
>   ->  Index Scan using values_value_field_string_value_contact
> on values_value u0  (cost=0.43..7934.72 rows=2004 width=4) (actual
> time=0.086..0.857 rows=227 loops=1)
> Index Cond: ((contact_field_id = 1) AND
> (upper(string_value) = 'F'::text))
> Buffers: shared hit=2 read=61
> ->  Index Only Scan using contacts_contactgroup_
> contacts_contactgroup_id_0f909f73_uniq on contacts_contactgroup_contacts
>  (cost=0.43..3.93 rows=1 width=4) (actual time=0.005..0.005 rows=1
> loops=227)
>   Index Cond: ((contactgroup_id = 1) AND (contact_id =
> contacts_contact.id))
>   Heap Fetches: 0
>   Buffers: shared hit=684
> Total runtime: 3.488 ms
>
> https://explain.depesz.com/s/iPPJ
>
> But if increase the limit to 223 then it performs like:
>
> Limit  (cost=8785.68..13306.24 rows=223 width=88) (actual
> time=2685.830..2686.534 rows=223 loops=1)
>   Buffers: shared hit=767648 read=86530
>   ->  Merge Join  (cost=8785.68..29016.70 rows=998 width=88) (actual
> time=2685.828..2686.461 rows=223 loops=1)
> Merge Cond: (contacts_contact.id = contacts_contactgroup_
> contacts.contact_id)
> Buffers: shared hit=767648 read=86530
> ->  Sort  (cost=8784.44..8789.45 rows=2004 width=92) (actual
> time=2685.742..2685.804 rows=228 loops=1)
>   Sort Key: contacts_contact.id
>   Sort Method: quicksort  Memory: 34327kB
>   Buffers: shared hit=767648 read=86524
>   ->  Nested Loop  (cost=6811.12..8674.53 rows=2004 width=92)
> (actual time=646.573..2417.291 rows=200412 loops=1)
>

There is a pretty bad estimation, probably due to a dependency between
contact_field_id = 1 and upper(string_value) = 'F'::text.

The simplest solution is to disable nested loops - set enable_nestloop to off.
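
A sketch of doing that for just the one problem transaction, so the override
does not leak into the rest of the session:

BEGIN;
SET LOCAL enable_nestloop = off;   -- only affects this transaction
-- run the SELECT ... ORDER BY "contacts_contact"."id" DESC LIMIT 223 query here
COMMIT;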

Regards

Pavel


> Buffers: shared hit=767648 read=86524
> ->  HashAggregate  (cost=6810.70..6813.14 rows=244
> width=4) (actual time=646.532..766.200 rows=200412 loops=1)
>   Buffers: shared read=51417
>   ->  Bitmap Heap Scan on values_value u0
>  (cost=60.98..6805.69 rows=2004 width=4) (actual time=92.016..433.709
> rows=200412 loops=1)
> Recheck Cond: ((contact_field_id = 1) AND
> (upper(string_value) = 'F'::text))
> Buffers: shared read=51417
> ->  Bitmap Index Scan on
> values_value_field_string_value_contact  (cost=0.00..60.47 rows=2004
> width=0) (actual time=70.647..70.647 rows=200412 loops=1)
>   Index Cond: ((contact_field_id = 1)
> AND (upper(string_value) = 'F'::text))
>   Buffers: shared read=770
> ->  Index Scan using contacts_contact_pkey on
> contacts_contact  (cost=0.42..7.62 rows=1 width=88) (actual
> time=0.007..0.007 rows=1 loops=200412)
>   Index Cond: (id = u0.contact_id)
>   Buffers: shared hit=767648 read=35107
> ->  Index Only Scan Backward using contacts_contactgroup_
>

[PERFORM] Query performance changes significantly depending on limit value

2017-02-23 Thread Rowan Seymour
Hi guys

I'm a bit stuck on a query that performs fantastically up to a certain
limit value, after which the planner goes off in a completely different
direction and performance gets dramatically worse. I'm using PostgreSQL 9.3.

You can see all the relevant schemas at http://pastebin.com/PNEqw2id and in
the test database there are 1,000,000 records in contacts_contact, and
about half of those will match the subquery on values_value.

The query in question is:

SELECT "contacts_contact".* FROM "contacts_contact"
INNER JOIN "contacts_contactgroup_contacts" ON ("contacts_contact"."id" =
"contacts_contactgroup_contacts"."contact_id")
WHERE ("contacts_contactgroup_contacts"."contactgroup_id" = 1
   AND "contacts_contact"."id" IN (
 SELECT U0."contact_id" FROM "values_value" U0 WHERE
(U0."contact_field_id" = 1 AND UPPER(U0."string_value"::text) = UPPER('F'))
   )
) ORDER BY "contacts_contact"."id" DESC LIMIT 222;

With that limit of 222, it performs like:

Limit  (cost=3.09..13256.36 rows=222 width=88) (actual time=0.122..3.358
rows=222 loops=1)
  Buffers: shared hit=708 read=63
  ->  Nested Loop  (cost=3.09..59583.10 rows=998 width=88) (actual
time=0.120..3.304 rows=222 loops=1)
Buffers: shared hit=708 read=63
->  Merge Semi Join  (cost=2.65..51687.89 rows=2004 width=92)
(actual time=0.103..1.968 rows=227 loops=1)
  Merge Cond: (contacts_contact.id = u0.contact_id)
  Buffers: shared hit=24 read=63
  ->  Index Scan Backward using contacts_contact_pkey on
contacts_contact  (cost=0.42..41249.43 rows=100 width=88) (actual
time=0.008..0.502 rows=1117 loops=1)
Buffers: shared hit=22 read=2
  ->  Index Scan using values_value_field_string_value_contact
on values_value u0  (cost=0.43..7934.72 rows=2004 width=4) (actual
time=0.086..0.857 rows=227 loops=1)
Index Cond: ((contact_field_id = 1) AND
(upper(string_value) = 'F'::text))
Buffers: shared hit=2 read=61
->  Index Only Scan using
contacts_contactgroup_contacts_contactgroup_id_0f909f73_uniq on
contacts_contactgroup_contacts  (cost=0.43..3.93 rows=1 width=4) (actual
time=0.005..0.005 rows=1 loops=227)
  Index Cond: ((contactgroup_id = 1) AND (contact_id =
contacts_contact.id))
  Heap Fetches: 0
  Buffers: shared hit=684
Total runtime: 3.488 ms

https://explain.depesz.com/s/iPPJ

But if increase the limit to 223 then it performs like:

Limit  (cost=8785.68..13306.24 rows=223 width=88) (actual
time=2685.830..2686.534 rows=223 loops=1)
  Buffers: shared hit=767648 read=86530
  ->  Merge Join  (cost=8785.68..29016.70 rows=998 width=88) (actual
time=2685.828..2686.461 rows=223 loops=1)
Merge Cond: (contacts_contact.id =
contacts_contactgroup_contacts.contact_id)
Buffers: shared hit=767648 read=86530
->  Sort  (cost=8784.44..8789.45 rows=2004 width=92) (actual
time=2685.742..2685.804 rows=228 loops=1)
  Sort Key: contacts_contact.id
  Sort Method: quicksort  Memory: 34327kB
  Buffers: shared hit=767648 read=86524
  ->  Nested Loop  (cost=6811.12..8674.53 rows=2004 width=92)
(actual time=646.573..2417.291 rows=200412 loops=1)
Buffers: shared hit=767648 read=86524
->  HashAggregate  (cost=6810.70..6813.14 rows=244
width=4) (actual time=646.532..766.200 rows=200412 loops=1)
  Buffers: shared read=51417
  ->  Bitmap Heap Scan on values_value u0
 (cost=60.98..6805.69 rows=2004 width=4) (actual time=92.016..433.709
rows=200412 loops=1)
Recheck Cond: ((contact_field_id = 1) AND
(upper(string_value) = 'F'::text))
Buffers: shared read=51417
->  Bitmap Index Scan on
values_value_field_string_value_contact  (cost=0.00..60.47 rows=2004
width=0) (actual time=70.647..70.647 rows=200412 loops=1)
  Index Cond: ((contact_field_id = 1)
AND (upper(string_value) = 'F'::text))
  Buffers: shared read=770
->  Index Scan using contacts_contact_pkey on
contacts_contact  (cost=0.42..7.62 rows=1 width=88) (actual
time=0.007..0.007 rows=1 loops=200412)
  Index Cond: (id = u0.contact_id)
  Buffers: shared hit=767648 read=35107
->  Index Only Scan Backward using
contacts_contactgroup_contacts_contactgroup_id_0f909f73_uniq on
contacts_contactgroup_contacts  (cost=0.43..18967.29 rows=497992 width=4)
(actual time=0.073..0.273 rows=550 loops=1)
  Index Cond: (contactgroup_id = 1)
  Heap Fetches: 0
  Buffers: shared read=6
Total runtime: 2695.301 ms

https://explain.depesz.com/s/gXS

I've tried running ANALYZE but that actually reduced the limit at which
things get worse. Any insight into the 

[PERFORM] Query Performance

2017-02-20 Thread Diego Vargas
Hi All,

I'm having some trouble improving the timing of a set of queries to a
partitioned table.
Basically, I'm trying to find an index that would be used instead of a
bitmap heap scan when the data is taken from disk. Or in any case,
something that would make the process of retrieving the data from disk
faster.

I've installed postgreSQL compiling the source: PostgreSQL 9.2.20 on
x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat
4.4.7-17), 64-bit
And these are the current changes on the configuration file:
name|  current_setting   |source
++--
 application_name   | psql   | client
 client_encoding| UTF8   | client
 DateStyle  | ISO, MDY   | configuration file
 default_text_search_config | pg_catalog.english | configuration file
 lc_messages| en_US.UTF-8| configuration file
 lc_monetary| en_US.UTF-8| configuration file
 lc_numeric | en_US.UTF-8| configuration file
 lc_time| en_US.UTF-8| configuration file
 log_destination| stderr | configuration file
 log_directory  | pg_log | configuration file
 log_filename   | postgresql-%a.log  | configuration file
 log_rotation_age   | 1d | configuration file
 log_rotation_size  | 0  | configuration file
 log_timezone   | UTC| configuration file
 log_truncate_on_rotation   | on | configuration file
 logging_collector  | on | configuration file
 max_connections| 100| configuration file
 max_stack_depth| 2MB| environment variable
 shared_buffers | 6GB| configuration file
 TimeZone   | UTC| configuration file
 work_mem   | 50MB   | configuration file

I'm running on CentOS 6.8, and all the tests are being done through psql.

Now, this is the table in question:
lportal=# \d+ data_jsons_partition
 Table "data_jsons_partition"
 Column  |Type | Modifiers | Storage  |
Stats target | Description
-+-+---+
--+--+-
 id  | integer |   | plain
|  |
 site_id | integer |   | plain
|  |
 site_name   | character varying(255)  |   | extended
|  |
 measured_on | date|   | plain
|  |
 protocol| text|   | extended
|  |
 data| json|   | extended
|  |
 created_at  | timestamp without time zone |   | plain
|  |
 updated_at  | timestamp without time zone |   | plain
|  |
 org_name| character varying   |   | extended
|  |
 org_id  | integer |   | plain
|  |
 lat | double precision|   | plain
|  |
 long| double precision|   | plain
|  |
 elev| double precision|   | plain
|  |
Triggers:
insert_measurement_trigger BEFORE INSERT ON data_jsons_partition FOR
EACH ROW EXECUTE PROCEDURE data_insert_trigger()
Child tables: partitions.partition_a_data_jsons_part,
  partitions.partition_b_data_jsons_part,
  ...
  partitions.partition_aa_data_jsons_part,
  partitions.partition_ab_data_jsons_part


The child tables exists based on the protocol column. Now, each partition
looks like this:

lportal=# \d+ partitions.partition_ab_data_jsons_part
Table "partitions.partition_ab_data_jsons_part"
 Column  |Type | Modifiers | Storage  |
Stats target | Description
-+-+---+
--+--+-
 id  | integer | not null  | plain
|  |
 site_id | integer |   | plain
|  |
 site_name   | character varying(255)  |   | extended
|  |
 measured_on | date|   | plain
|  |
 protocol| text|   | extended
|  |
 data| json|   | extended
|  |
 created_at  | timestamp without time zone |   | plain
| 

Re: [PERFORM] Query performance

2015-01-30 Thread Pavel Stehule
2015-01-31 2:40 GMT+01:00 Jim Nasby :

> On 1/25/15 2:03 AM, Pavel Stehule wrote:
>
>> It might not always be an integer, just happens to be so here.
>> Should I try text instead? I don't have to have the case-insensitive
>> matching.
>>
>>
>> text can be better
>>
>
> bytea would be even better yet, because that will always be a straight
> binary comparison. text will worry about conversion and what not (though,
> perhaps there's a way to force that to use C or SQL instead of something
> like UTF8, short of changing the encoding of the whole database).
>

true,

good idea

Regards

Pavel

> --
> Jim Nasby, Data Architect, Blue Treble Consulting
> Data in Trouble? Get it in Treble! http://BlueTreble.com
>


Re: [PERFORM] Query performance

2015-01-30 Thread Jim Nasby

On 1/25/15 2:03 AM, Pavel Stehule wrote:

It might not always be an integer, just happens to be so here.
Should I try text instead? I don't have to have the case-insensitive
matching.


text can be better


bytea would be even better yet, because that will always be a straight 
binary comparison. text will worry about conversion and what not 
(though, perhaps there's a way to force that to use C or SQL instead of 
something like UTF8, short of changing the encoding of the whole database).

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [PERFORM] Query performance

2015-01-25 Thread Marc Mamin

>I have an events table that records page views and purchases (type = 'viewed' 
>or type='purchased'). I have a query that figures out "people who 
>bought/viewed this also bought/viewed that".
>
>It worked fine, taking about 0.1 seconds to complete, until a few hours ago 
>when it started taking hours to complete. Vacuum/analyze didn't help.  Turned 
>out there was one session_id that had 400k rows in the system. Deleting that 
>made the query performant again.
>
>Is there anything I can do to make the query work better in cases like that? 
>Missing index, or better query?
>
>This is on 9.3.5.
>
>The below is reproduced at the following URL if it's not formatted correctly 
>in the email. 
>https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt

Hello,

Here are 2 variations that should be somewhat faster.

It seems you may have duplicate (site_id, session_id, product_id) rows,
which would falsify the result. In that case you'll need some more logic in
the query.

 select
'82503' as product_id,
e2.site_id,
e2.product_id,
count(nullif(e2.type='viewed', false)) view_count,
count(nullif(e2.type='purchased', false)) purchase_count
  from events e1
  join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
  where
e1.product_id = '82503' and
e2.product_id != '82503'
 group by  e2.product_id, e2.site_id;


 OR:

 WITH SALL as(
  select
 e2.site_id,
 e2.product_id,
 count(nullif(e2.type='viewed', false)) view_count,
 count(nullif(e2.type='purchased', false)) purchase_count
   from events e1
   join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
   where
 e1.product_id = '82503'
  group by  e2.product_id, e2.site_id
 )
 SELECT
'82503' as product_id_1,
site_id,
product_id,
view_count,
purchase_count
 FROM SALL
 WHERE product_id != '82503';


 regards,
 Marc Mamin



>explain  select
>   e1.product_id,
>   e2.site_id,
>   e2.product_id,
>   count(nullif(e2.type='viewed', false)) view_count,
>   count(nullif(e2.type='purchased', false)) purchase_count
> from events e1
> join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
> where
>   e1.product_id = '82503' and
>   e1.product_id != e2.product_id
> group by e1.product_id, e2.product_id, e2.site_id;
> QUERY PLAN
>
> GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
>   ->  Sort  (cost=828395.67..840117.89 rows=465 width=19)
> Sort Key: e1.product_id, e2.product_id, e2.site_id
> ->  Nested Loop  (cost=11.85..20371.14 rows=465 width=19)
>   ->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 
> rows=369 width=49)
> Recheck Cond: (product_id = '82503'::citext)
> ->  Bitmap Index Scan on events_product_id_site_id_idx  
> (cost=0.00..11.20 rows=369 width=0)
>   Index Cond: (product_id = '82503'::citext)
>   ->  Index Scan using events_session_id_type_product_id_idx on 
> events e2  (cost=0.56..51.28 rows=12 width=51)
> Index Cond: ((session_id = e1.session_id) AND (type = 
> e1.type))
> Filter: (e1.product_id <> product_id)
>(11 rows)
>
>recommender_production=> \d events
>Table "public.events"
>   Column|   Type   |  Modifiers
>-+--+-
> id  | bigint   | not null default 
> nextval('events_id_seq'::regclass)
> user_id | citext   |
> session_id  | citext   | not null
> product_id  | citext   | not null
> site_id | citext   | not null
> type| text | not null
> happened_at | timestamp with time zone | not null
> created_at  | timestamp with time zone | not null
>Indexes:
>"events_pkey" PRIMARY KEY, btree (id)
>"events_product_id_site_id_idx" btree (product_id, site_id)
>"events_session_id_type_product_id_idx" btree (session_id, type, 
> product_id)
>Check constraints:
>"events_session_id_check" CHECK (length(session_id::text) < 255)
>"events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 
> 'viewed'::text]))
>"events_user_id_check" CHECK (length(user_id::text) < 255)
>
>
>
>


Re: [PERFORM] Query performance

2015-01-25 Thread Tomas Vondra
Hi,

On 25.1.2015 07:38, Joe Van Dyk wrote:
> 
> Here's one that's not quite as well: http://explain.depesz.com/s/SgT

As Pavel already pointed out, the first problem is this part of the plan:

Seq Scan on events e2 (cost=0.00..120,179.60 rows=4,450,241 width=51)
(actual time=0.014..33,773.370 rows=4,450,865 loops=1)
Filter: (product_id <> '81716'::citext)

Consuming ~33 seconds of the runtime. If you can make this faster
somehow (e.g. by getting rid of the citext cast), that'd be nice.

Another issue is that the hashjoin is batched:

   Buckets: 65536 Batches: 8 Memory Usage: 46085kB

The hash preparation takes ~40 seconds, so maybe try to give it a bit
more memory - I assume you have work_mem=64MB, so try doubling that
(ISTM 512MB should work with a single batch). Maybe this won't really
improve the performance, though. It still has to process ~4.5M rows.

Increasing the work mem could also result in switching to hash
aggregate, making the sort (~30 seconds) unnecessary.
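
A quick way to test that without touching postgresql.conf (512MB is just the
guess from above, not a general recommendation):

SET work_mem = '512MB';   -- session-local override
-- EXPLAIN (ANALYZE, BUFFERS) <the slow query>;  then compare the plans
RESET work_mem;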

Anyway, ISTM this works as expected, i.e.

(a) with rare product_id values the queries are fast
(b) with common product_id values the queries are slow

That's expected, because (b) needs to process much more data. I don't
think you can magically make it run as fast as (a). The best solution
might be to keep pre-aggregated results - I don't think you really
need exact answers when recommending "similar" products.

I also wonder if you really need to join the tables? I mean, what if you
do something like this:

CREATE TABLE events_aggregated AS
SELECT
   site_id,
   array_agg(product_id) AS product_ids,
   count(nullif(type = 'viewed', false)) AS view_count,
   count(nullif(type = 'purchased', false)) AS purchase_count
FROM events
GROUP BY 1;

and then using intarray with GIN indexes to query this table?
Something like this:

  CREATE INDEX products_agg_idx ON events_aggregated
   USING GIN (product_ids gin__int_ops);

  SELECT * FROM events_aggregated WHERE product_ids @> ARRAY['82503'];

regards

-- 
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] Query performance

2015-01-25 Thread Pavel Stehule
2015-01-25 8:20 GMT+01:00 Joe Van Dyk :

> On Sat, Jan 24, 2015 at 11:14 PM, Pavel Stehule 
> wrote:
>
>>
>>
>> 2015-01-25 7:38 GMT+01:00 Joe Van Dyk :
>>
>>>
>>>
>>> On Sat, Jan 24, 2015 at 10:12 PM, Pavel Stehule wrote:
>>>
 Hi

 this plan looks well

 Regards

 Pavel

>>>
>>> Here's one that's not quite as well: http://explain.depesz.com/s/SgT
>>>
>>
>> I see a possible issue
>>
>> (product_id <> '81716'::citext) .. this operation is CPU expensive and
>> maybe unnecessary
>>
>> product_id should be an integer -- and if it isn't, this will not be
>> extremely fast on 4M rows - especially with citext
>>
>> try to force the opposite cast - you will save a case-insensitive text
>> comparison
>>
>> product_id::int <> 81716
>>
>
> It might not always be an integer, just happens to be so here. Should I
> try text instead? I don't have to have the case-insensitive matching.
>

text can be better

this design is unfortunate, but you probably cannot change it



>
> Joe
>
>
>>
>> Regards
>>
>> Pavel
>>
>>
>>
>>
>>>
>>> Joe
>>>
>>>

 2015-01-25 6:45 GMT+01:00 Joe Van Dyk :

> Oops, didn't run vacuum analyze after deleting the events. Here is
> another 'explain analyze': http://explain.depesz.com/s/AviN
>
> On Sat, Jan 24, 2015 at 9:43 PM, Joe Van Dyk  wrote:
>
>> On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk  wrote:
>>
>>> I have an events table that records page views and purchases (type =
>>> 'viewed' or type='purchased'). I have a query that figures out "people 
>>> who
>>> bought/viewed this also bought/viewed that".
>>>
>>> It worked fine, taking about 0.1 seconds to complete, until a few
>>> hours ago when it started taking hours to complete. Vacuum/analyze 
>>> didn't
>>> help.  Turned out there was one session_id that had 400k rows in the
>>> system. Deleting that made the query performant again.
>>>
>>> Is there anything I can do to make the query work better in cases
>>> like that? Missing index, or better query?
>>>
>>> This is on 9.3.5.
>>>
>>> The below is reproduced at the following URL if it's not formatted
>>> correctly in the email.
>>> https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt
>>>
>>> explain  select
>>>e1.product_id,
>>>e2.site_id,
>>>e2.product_id,
>>>count(nullif(e2.type='viewed', false)) view_count,
>>>count(nullif(e2.type='purchased', false)) purchase_count
>>>  from events e1
>>>  join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
>>>  where
>>>e1.product_id = '82503' and
>>>e1.product_id != e2.product_id
>>>  group by e1.product_id, e2.product_id, e2.site_id;
>>>  QUERY PLAN
>>> 
>>>  GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
>>>->  Sort  (cost=828395.67..840117.89 rows=465 width=19)
>>>  Sort Key: e1.product_id, e2.product_id, e2.site_id
>>>  ->  Nested Loop  (cost=11.85..20371.14 rows=465 width=19)
>>>->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 
>>> rows=369 width=49)
>>>  Recheck Cond: (product_id = '82503'::citext)
>>>  ->  Bitmap Index Scan on 
>>> events_product_id_site_id_idx  (cost=0.00..11.20 rows=369 width=0)
>>>Index Cond: (product_id = '82503'::citext)
>>>->  Index Scan using 
>>> events_session_id_type_product_id_idx on events e2  (cost=0.56..51.28 
>>> rows=12 width=51)
>>>  Index Cond: ((session_id = e1.session_id) AND 
>>> (type = e1.type))
>>>  Filter: (e1.product_id <> product_id)
>>> (11 rows)
>>>
>>> recommender_production=> \d events
>>> Table "public.events"
>>>Column|   Type   |  Modifiers
>>> -+--+-
>>>  id  | bigint   | not null default 
>>> nextval('events_id_seq'::regclass)
>>>  user_id | citext   |
>>>  session_id  | citext   | not null
>>>  product_id  | citext   | not null
>>>  site_id | citext   | not null
>>>  type| text | not null
>>>  happened_at | timestamp with time zone | not null
>>>  created_at  | timestamp with time zone | not null
>>> Indexes:
>>> "events_pkey" PRIMARY KEY, btree (id)
>>> "events_product_id_site_id_idx" btree

Re: [PERFORM] Query performance

2015-01-24 Thread Joe Van Dyk
On Sat, Jan 24, 2015 at 11:14 PM, Pavel Stehule 
wrote:

>
>
> 2015-01-25 7:38 GMT+01:00 Joe Van Dyk :
>
>>
>>
>> On Sat, Jan 24, 2015 at 10:12 PM, Pavel Stehule 
>> wrote:
>>
>>> Hi
>>>
>>> this plan looks well
>>>
>>> Regards
>>>
>>> Pavel
>>>
>>
>> Here's one that's not quite as well: http://explain.depesz.com/s/SgT
>>
>
> I see a possible issue
>
> (product_id <> '81716'::citext) .. this operation is CPU expensive and
> maybe unnecessary
>
> product_id should be an integer -- and if it isn't, this will not be
> extremely fast on 4M rows - especially with citext
>
> try to force the opposite cast - you will save a case-insensitive text
> comparison
>
> product_id::int <> 81716
>

It might not always be an integer, just happens to be so here. Should I try
text instead? I don't have to have the case-insensitive matching.

Joe


>
> Regards
>
> Pavel
>
>
>
>
>>
>> Joe
>>
>>
>>>
>>> 2015-01-25 6:45 GMT+01:00 Joe Van Dyk :
>>>
 Oops, didn't run vacuum analyze after deleting the events. Here is
 another 'explain analyze': http://explain.depesz.com/s/AviN

 On Sat, Jan 24, 2015 at 9:43 PM, Joe Van Dyk  wrote:

> On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk  wrote:
>
>> I have an events table that records page views and purchases (type =
>> 'viewed' or type='purchased'). I have a query that figures out "people 
>> who
>> bought/viewed this also bought/viewed that".
>>
>> It worked fine, taking about 0.1 seconds to complete, until a few
>> hours ago when it started taking hours to complete. Vacuum/analyze didn't
>> help.  Turned out there was one session_id that had 400k rows in the
>> system. Deleting that made the query performant again.
>>
>> Is there anything I can do to make the query work better in cases
>> like that? Missing index, or better query?
>>
>> This is on 9.3.5.
>>
>> The below is reproduced at the following URL if it's not formatted
>> correctly in the email.
>> https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt
>>
>> explain  select
>>e1.product_id,
>>e2.site_id,
>>e2.product_id,
>>count(nullif(e2.type='viewed', false)) view_count,
>>count(nullif(e2.type='purchased', false)) purchase_count
>>  from events e1
>>  join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
>>  where
>>e1.product_id = '82503' and
>>e1.product_id != e2.product_id
>>  group by e1.product_id, e2.product_id, e2.site_id;
>>  QUERY PLAN
>> 
>>  GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
>>->  Sort  (cost=828395.67..840117.89 rows=465 width=19)
>>  Sort Key: e1.product_id, e2.product_id, e2.site_id
>>  ->  Nested Loop  (cost=11.85..20371.14 rows=465 width=19)
>>->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 
>> rows=369 width=49)
>>  Recheck Cond: (product_id = '82503'::citext)
>>  ->  Bitmap Index Scan on 
>> events_product_id_site_id_idx  (cost=0.00..11.20 rows=369 width=0)
>>Index Cond: (product_id = '82503'::citext)
>>->  Index Scan using 
>> events_session_id_type_product_id_idx on events e2  (cost=0.56..51.28 
>> rows=12 width=51)
>>  Index Cond: ((session_id = e1.session_id) AND (type 
>> = e1.type))
>>  Filter: (e1.product_id <> product_id)
>> (11 rows)
>>
>> recommender_production=> \d events
>> Table "public.events"
>>Column|   Type   |  Modifiers
>> -+--+-
>>  id  | bigint   | not null default 
>> nextval('events_id_seq'::regclass)
>>  user_id | citext   |
>>  session_id  | citext   | not null
>>  product_id  | citext   | not null
>>  site_id | citext   | not null
>>  type| text | not null
>>  happened_at | timestamp with time zone | not null
>>  created_at  | timestamp with time zone | not null
>> Indexes:
>> "events_pkey" PRIMARY KEY, btree (id)
>> "events_product_id_site_id_idx" btree (product_id, site_id)
>> "events_session_id_type_product_id_idx" btree (session_id, type, 
>> product_id)
>> Check constraints:
>> "events_session_id_check" CHECK (length(session_id::text) < 255)
>> "events_type_check" CHECK (type = ANY (ARRA

Re: [PERFORM] Query performance

2015-01-24 Thread Pavel Stehule
2015-01-25 7:38 GMT+01:00 Joe Van Dyk :

>
>
> On Sat, Jan 24, 2015 at 10:12 PM, Pavel Stehule 
> wrote:
>
>> Hi
>>
>> this plan looks well
>>
>> Regards
>>
>> Pavel
>>
>
> Here's one that's not quite as well: http://explain.depesz.com/s/SgT
>

I see a possible issue

(product_id <> '81716'::citext) .. this operation is CPU expensive and
maybe nonsense

product_id should be an integer -- and if it isn't, this comparison will not
be extremely fast on 4M rows - mainly on citext

try to force an opposite cast - you will save a case-insensitive text
comparison

product_id::int <> 81716
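
(For illustration, a minimal sketch of that rewrite against the query quoted
below - it assumes product_id only ever holds numeric strings, otherwise the
cast will raise an error:)

 ...
 where
   e1.product_id = '82503' and
   e1.product_id::int <> e2.product_id::int  -- integer comparison instead of citext
 ...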

Regards

Pavel




>
> Joe
>
>
>>
>> 2015-01-25 6:45 GMT+01:00 Joe Van Dyk :
>>
>>> Oops, didn't run vacuum analyze after deleting the events. Here is
>>> another 'explain analyze': http://explain.depesz.com/s/AviN
>>>
>>> On Sat, Jan 24, 2015 at 9:43 PM, Joe Van Dyk  wrote:
>>>
 On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk  wrote:

> I have an events table that records page views and purchases (type =
> 'viewed' or type='purchased'). I have a query that figures out "people who
> bought/viewed this also bought/viewed that".
>
> It worked fine, taking about 0.1 seconds to complete, until a few
> hours ago when it started taking hours to complete. Vacuum/analyze didn't
> help.  Turned out there was one session_id that had 400k rows in the
> system. Deleting that made the query performant again.
>
> Is there anything I can do to make the query work better in cases like
> that? Missing index, or better query?
>
> This is on 9.3.5.
>
> The below is reproduced at the following URL if it's not formatted
> correctly in the email.
> https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt
>
> explain  select
>e1.product_id,
>e2.site_id,
>e2.product_id,
>count(nullif(e2.type='viewed', false)) view_count,
>count(nullif(e2.type='purchased', false)) purchase_count
>  from events e1
>  join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
>  where
>e1.product_id = '82503' and
>e1.product_id != e2.product_id
>  group by e1.product_id, e2.product_id, e2.site_id;
>  QUERY PLAN
> 
>  GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
>->  Sort  (cost=828395.67..840117.89 rows=465 width=19)
>  Sort Key: e1.product_id, e2.product_id, e2.site_id
>  ->  Nested Loop  (cost=11.85..20371.14 rows=465 width=19)
>->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 
> rows=369 width=49)
>  Recheck Cond: (product_id = '82503'::citext)
>  ->  Bitmap Index Scan on 
> events_product_id_site_id_idx  (cost=0.00..11.20 rows=369 width=0)
>Index Cond: (product_id = '82503'::citext)
>->  Index Scan using events_session_id_type_product_id_idx 
> on events e2  (cost=0.56..51.28 rows=12 width=51)
>  Index Cond: ((session_id = e1.session_id) AND (type 
> = e1.type))
>  Filter: (e1.product_id <> product_id)
> (11 rows)
>
> recommender_production=> \d events
> Table "public.events"
>Column|   Type   |  Modifiers
> -+--+-
>  id  | bigint   | not null default 
> nextval('events_id_seq'::regclass)
>  user_id | citext   |
>  session_id  | citext   | not null
>  product_id  | citext   | not null
>  site_id | citext   | not null
>  type| text | not null
>  happened_at | timestamp with time zone | not null
>  created_at  | timestamp with time zone | not null
> Indexes:
> "events_pkey" PRIMARY KEY, btree (id)
> "events_product_id_site_id_idx" btree (product_id, site_id)
> "events_session_id_type_product_id_idx" btree (session_id, type, 
> product_id)
> Check constraints:
> "events_session_id_check" CHECK (length(session_id::text) < 255)
> "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 
> 'viewed'::text]))
> "events_user_id_check" CHECK (length(user_id::text) < 255)
>
>
>
>
 After removing the session with 400k events, I was able to do an
 explain analyze, here is one of them:
 http://explain.depesz.com/s/PFNk

>>>
>>>
>>
>


Re: [PERFORM] Query performance

2015-01-24 Thread Joe Van Dyk
On Sat, Jan 24, 2015 at 10:12 PM, Pavel Stehule 
wrote:

> Hi
>
> this plan looks well
>
> Regards
>
> Pavel
>

Here's one that's not quite as well: http://explain.depesz.com/s/SgT

Joe


>
> 2015-01-25 6:45 GMT+01:00 Joe Van Dyk :
>
>> Oops, didn't run vacuum analyze after deleting the events. Here is
>> another 'explain analyze': http://explain.depesz.com/s/AviN
>>
>> On Sat, Jan 24, 2015 at 9:43 PM, Joe Van Dyk  wrote:
>>
>>> On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk  wrote:
>>>
 I have an events table that records page views and purchases (type =
 'viewed' or type='purchased'). I have a query that figures out "people who
 bought/viewed this also bought/viewed that".

 It worked fine, taking about 0.1 seconds to complete, until a few hours
 ago when it started taking hours to complete. Vacuum/analyze didn't help.
 Turned out there was one session_id that had 400k rows in the system.
 Deleting that made the query performant again.

 Is there anything I can do to make the query work better in cases like
 that? Missing index, or better query?

 This is on 9.3.5.

 The below is reproduced at the following URL if it's not formatted
 correctly in the email.
 https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt

 explain  select
e1.product_id,
e2.site_id,
e2.product_id,
count(nullif(e2.type='viewed', false)) view_count,
count(nullif(e2.type='purchased', false)) purchase_count
  from events e1
  join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
  where
e1.product_id = '82503' and
e1.product_id != e2.product_id
  group by e1.product_id, e2.product_id, e2.site_id;
  QUERY PLAN
 
  GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
->  Sort  (cost=828395.67..840117.89 rows=465 width=19)
  Sort Key: e1.product_id, e2.product_id, e2.site_id
  ->  Nested Loop  (cost=11.85..20371.14 rows=465 width=19)
->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 
 rows=369 width=49)
  Recheck Cond: (product_id = '82503'::citext)
  ->  Bitmap Index Scan on 
 events_product_id_site_id_idx  (cost=0.00..11.20 rows=369 width=0)
Index Cond: (product_id = '82503'::citext)
->  Index Scan using events_session_id_type_product_id_idx 
 on events e2  (cost=0.56..51.28 rows=12 width=51)
  Index Cond: ((session_id = e1.session_id) AND (type = 
 e1.type))
  Filter: (e1.product_id <> product_id)
 (11 rows)

 recommender_production=> \d events
 Table "public.events"
Column|   Type   |  Modifiers
 -+--+-
  id  | bigint   | not null default 
 nextval('events_id_seq'::regclass)
  user_id | citext   |
  session_id  | citext   | not null
  product_id  | citext   | not null
  site_id | citext   | not null
  type| text | not null
  happened_at | timestamp with time zone | not null
  created_at  | timestamp with time zone | not null
 Indexes:
 "events_pkey" PRIMARY KEY, btree (id)
 "events_product_id_site_id_idx" btree (product_id, site_id)
 "events_session_id_type_product_id_idx" btree (session_id, type, 
 product_id)
 Check constraints:
 "events_session_id_check" CHECK (length(session_id::text) < 255)
 "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 
 'viewed'::text]))
 "events_user_id_check" CHECK (length(user_id::text) < 255)




>>> After removing the session with 400k events, I was able to do an explain
>>> analyze, here is one of them:
>>> http://explain.depesz.com/s/PFNk
>>>
>>
>>
>


Re: [PERFORM] Query performance

2015-01-24 Thread Pavel Stehule
Hi

this plan looks well

Regards

Pavel

2015-01-25 6:45 GMT+01:00 Joe Van Dyk :

> Oops, didn't run vacuum analyze after deleting the events. Here is another
> 'explain analyze': http://explain.depesz.com/s/AviN
>
> On Sat, Jan 24, 2015 at 9:43 PM, Joe Van Dyk  wrote:
>
>> On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk  wrote:
>>
>>> I have an events table that records page views and purchases (type =
>>> 'viewed' or type='purchased'). I have a query that figures out "people who
>>> bought/viewed this also bought/viewed that".
>>>
>>> It worked fine, taking about 0.1 seconds to complete, until a few hours
>>> ago when it started taking hours to complete. Vacuum/analyze didn't help.
>>> Turned out there was one session_id that had 400k rows in the system.
>>> Deleting that made the query performant again.
>>>
>>> Is there anything I can do to make the query work better in cases like
>>> that? Missing index, or better query?
>>>
>>> This is on 9.3.5.
>>>
>>> The below is reproduced at the following URL if it's not formatted
>>> correctly in the email.
>>> https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt
>>>
>>> explain  select
>>>e1.product_id,
>>>e2.site_id,
>>>e2.product_id,
>>>count(nullif(e2.type='viewed', false)) view_count,
>>>count(nullif(e2.type='purchased', false)) purchase_count
>>>  from events e1
>>>  join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
>>>  where
>>>e1.product_id = '82503' and
>>>e1.product_id != e2.product_id
>>>  group by e1.product_id, e2.product_id, e2.site_id;
>>>  QUERY PLAN
>>> 
>>>  GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
>>>->  Sort  (cost=828395.67..840117.89 rows=465 width=19)
>>>  Sort Key: e1.product_id, e2.product_id, e2.site_id
>>>  ->  Nested Loop  (cost=11.85..20371.14 rows=465 width=19)
>>>->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 
>>> rows=369 width=49)
>>>  Recheck Cond: (product_id = '82503'::citext)
>>>  ->  Bitmap Index Scan on events_product_id_site_id_idx 
>>>  (cost=0.00..11.20 rows=369 width=0)
>>>Index Cond: (product_id = '82503'::citext)
>>>->  Index Scan using events_session_id_type_product_id_idx 
>>> on events e2  (cost=0.56..51.28 rows=12 width=51)
>>>  Index Cond: ((session_id = e1.session_id) AND (type = 
>>> e1.type))
>>>  Filter: (e1.product_id <> product_id)
>>> (11 rows)
>>>
>>> recommender_production=> \d events
>>> Table "public.events"
>>>Column|   Type   |  Modifiers
>>> -+--+-
>>>  id  | bigint   | not null default 
>>> nextval('events_id_seq'::regclass)
>>>  user_id | citext   |
>>>  session_id  | citext   | not null
>>>  product_id  | citext   | not null
>>>  site_id | citext   | not null
>>>  type| text | not null
>>>  happened_at | timestamp with time zone | not null
>>>  created_at  | timestamp with time zone | not null
>>> Indexes:
>>> "events_pkey" PRIMARY KEY, btree (id)
>>> "events_product_id_site_id_idx" btree (product_id, site_id)
>>> "events_session_id_type_product_id_idx" btree (session_id, type, 
>>> product_id)
>>> Check constraints:
>>> "events_session_id_check" CHECK (length(session_id::text) < 255)
>>> "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 
>>> 'viewed'::text]))
>>> "events_user_id_check" CHECK (length(user_id::text) < 255)
>>>
>>>
>>>
>>>
>> After removing the session with 400k events, I was able to do an explain
>> analyze, here is one of them:
>> http://explain.depesz.com/s/PFNk
>>
>
>


Re: [PERFORM] Query performance

2015-01-24 Thread Joe Van Dyk
Oops, didn't run vacuum analyze after deleting the events. Here is another
'explain analyze': http://explain.depesz.com/s/AviN

On Sat, Jan 24, 2015 at 9:43 PM, Joe Van Dyk  wrote:

> On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk  wrote:
>
>> I have an events table that records page views and purchases (type =
>> 'viewed' or type='purchased'). I have a query that figures out "people who
>> bought/viewed this also bought/viewed that".
>>
>> It worked fine, taking about 0.1 seconds to complete, until a few hours
>> ago when it started taking hours to complete. Vacuum/analyze didn't help.
>> Turned out there was one session_id that had 400k rows in the system.
>> Deleting that made the query performant again.
>>
>> Is there anything I can do to make the query work better in cases like
>> that? Missing index, or better query?
>>
>> This is on 9.3.5.
>>
>> The below is reproduced at the following URL if it's not formatted
>> correctly in the email.
>> https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt
>>
>> explain  select
>>e1.product_id,
>>e2.site_id,
>>e2.product_id,
>>count(nullif(e2.type='viewed', false)) view_count,
>>count(nullif(e2.type='purchased', false)) purchase_count
>>  from events e1
>>  join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
>>  where
>>e1.product_id = '82503' and
>>e1.product_id != e2.product_id
>>  group by e1.product_id, e2.product_id, e2.site_id;
>>  QUERY PLAN
>> 
>>  GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
>>->  Sort  (cost=828395.67..840117.89 rows=465 width=19)
>>  Sort Key: e1.product_id, e2.product_id, e2.site_id
>>  ->  Nested Loop  (cost=11.85..20371.14 rows=465 width=19)
>>->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 
>> rows=369 width=49)
>>  Recheck Cond: (product_id = '82503'::citext)
>>  ->  Bitmap Index Scan on events_product_id_site_id_idx  
>> (cost=0.00..11.20 rows=369 width=0)
>>Index Cond: (product_id = '82503'::citext)
>>->  Index Scan using events_session_id_type_product_id_idx on 
>> events e2  (cost=0.56..51.28 rows=12 width=51)
>>  Index Cond: ((session_id = e1.session_id) AND (type = 
>> e1.type))
>>  Filter: (e1.product_id <> product_id)
>> (11 rows)
>>
>> recommender_production=> \d events
>> Table "public.events"
>>Column|   Type   |  Modifiers
>> -+--+-
>>  id  | bigint   | not null default 
>> nextval('events_id_seq'::regclass)
>>  user_id | citext   |
>>  session_id  | citext   | not null
>>  product_id  | citext   | not null
>>  site_id | citext   | not null
>>  type| text | not null
>>  happened_at | timestamp with time zone | not null
>>  created_at  | timestamp with time zone | not null
>> Indexes:
>> "events_pkey" PRIMARY KEY, btree (id)
>> "events_product_id_site_id_idx" btree (product_id, site_id)
>> "events_session_id_type_product_id_idx" btree (session_id, type, 
>> product_id)
>> Check constraints:
>> "events_session_id_check" CHECK (length(session_id::text) < 255)
>> "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 
>> 'viewed'::text]))
>> "events_user_id_check" CHECK (length(user_id::text) < 255)
>>
>>
>>
>>
> After removing the session with 400k events, I was able to do an explain
> analyze, here is one of them:
> http://explain.depesz.com/s/PFNk
>


Re: [PERFORM] Query performance

2015-01-24 Thread Joe Van Dyk
On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk  wrote:

> I have an events table that records page views and purchases (type =
> 'viewed' or type='purchased'). I have a query that figures out "people who
> bought/viewed this also bought/viewed that".
>
> It worked fine, taking about 0.1 seconds to complete, until a few hours
> ago when it started taking hours to complete. Vacuum/analyze didn't help.
> Turned out there was one session_id that had 400k rows in the system.
> Deleting that made the query performant again.
>
> Is there anything I can do to make the query work better in cases like
> that? Missing index, or better query?
>
> This is on 9.3.5.
>
> The below is reproduced at the following URL if it's not formatted
> correctly in the email.
> https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt
>
> explain  select
>e1.product_id,
>e2.site_id,
>e2.product_id,
>count(nullif(e2.type='viewed', false)) view_count,
>count(nullif(e2.type='purchased', false)) purchase_count
>  from events e1
>  join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
>  where
>e1.product_id = '82503' and
>e1.product_id != e2.product_id
>  group by e1.product_id, e2.product_id, e2.site_id;
>  QUERY PLAN
> 
>  GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
>->  Sort  (cost=828395.67..840117.89 rows=465 width=19)
>  Sort Key: e1.product_id, e2.product_id, e2.site_id
>  ->  Nested Loop  (cost=11.85..20371.14 rows=465 width=19)
>->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 
> rows=369 width=49)
>  Recheck Cond: (product_id = '82503'::citext)
>  ->  Bitmap Index Scan on events_product_id_site_id_idx  
> (cost=0.00..11.20 rows=369 width=0)
>Index Cond: (product_id = '82503'::citext)
>->  Index Scan using events_session_id_type_product_id_idx on 
> events e2  (cost=0.56..51.28 rows=12 width=51)
>  Index Cond: ((session_id = e1.session_id) AND (type = 
> e1.type))
>  Filter: (e1.product_id <> product_id)
> (11 rows)
>
> recommender_production=> \d events
> Table "public.events"
>Column|   Type   |  Modifiers
> -+--+-
>  id  | bigint   | not null default 
> nextval('events_id_seq'::regclass)
>  user_id | citext   |
>  session_id  | citext   | not null
>  product_id  | citext   | not null
>  site_id | citext   | not null
>  type| text | not null
>  happened_at | timestamp with time zone | not null
>  created_at  | timestamp with time zone | not null
> Indexes:
> "events_pkey" PRIMARY KEY, btree (id)
> "events_product_id_site_id_idx" btree (product_id, site_id)
> "events_session_id_type_product_id_idx" btree (session_id, type, 
> product_id)
> Check constraints:
> "events_session_id_check" CHECK (length(session_id::text) < 255)
> "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 
> 'viewed'::text]))
> "events_user_id_check" CHECK (length(user_id::text) < 255)
>
>
>
>
After removing the session with 400k events, I was able to do an explain
analyze, here is one of them:
http://explain.depesz.com/s/PFNk


[PERFORM] Query performance

2015-01-24 Thread Joe Van Dyk
I have an events table that records page views and purchases (type =
'viewed' or type='purchased'). I have a query that figures out "people who
bought/viewed this also bought/viewed that".

It worked fine, taking about 0.1 seconds to complete, until a few hours ago
when it started taking hours to complete. Vacuum/analyze didn't help.
Turned out there was one session_id that had 400k rows in the system.
Deleting that made the query performant again.

Is there anything I can do to make the query work better in cases like
that? Missing index, or better query?

This is on 9.3.5.

The below is reproduced at the following URL if it's not formatted
correctly in the email.
https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt

explain  select
   e1.product_id,
   e2.site_id,
   e2.product_id,
   count(nullif(e2.type='viewed', false)) view_count,
   count(nullif(e2.type='purchased', false)) purchase_count
 from events e1
 join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
 where
   e1.product_id = '82503' and
   e1.product_id != e2.product_id
 group by e1.product_id, e2.product_id, e2.site_id;
 QUERY PLAN

 GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
   ->  Sort  (cost=828395.67..840117.89 rows=465 width=19)
 Sort Key: e1.product_id, e2.product_id, e2.site_id
 ->  Nested Loop  (cost=11.85..20371.14 rows=465 width=19)
   ->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31
rows=369 width=49)
 Recheck Cond: (product_id = '82503'::citext)
 ->  Bitmap Index Scan on
events_product_id_site_id_idx  (cost=0.00..11.20 rows=369 width=0)
   Index Cond: (product_id = '82503'::citext)
   ->  Index Scan using
events_session_id_type_product_id_idx on events e2  (cost=0.56..51.28
rows=12 width=51)
 Index Cond: ((session_id = e1.session_id) AND
(type = e1.type))
 Filter: (e1.product_id <> product_id)
(11 rows)

recommender_production=> \d events
Table "public.events"
   Column|   Type   |  Modifiers
-+--+-
 id  | bigint   | not null default
nextval('events_id_seq'::regclass)
 user_id | citext   |
 session_id  | citext   | not null
 product_id  | citext   | not null
 site_id | citext   | not null
 type| text | not null
 happened_at | timestamp with time zone | not null
 created_at  | timestamp with time zone | not null
Indexes:
"events_pkey" PRIMARY KEY, btree (id)
"events_product_id_site_id_idx" btree (product_id, site_id)
"events_session_id_type_product_id_idx" btree (session_id, type, product_id)
Check constraints:
"events_session_id_check" CHECK (length(session_id::text) < 255)
"events_type_check" CHECK (type = ANY (ARRAY['purchased'::text,
'viewed'::text]))
"events_user_id_check" CHECK (length(user_id::text) < 255)


Re: [PERFORM] Query Performance Problem

2014-10-21 Thread Felipe Santos
2014-10-21 10:57 GMT-02:00 :

>
>
> Hi all,
>
> I'm experimenting with table partitioning through inheritance. I'm testing
> a query as follows:
>
> explain (analyze, buffers)
> select response.id
> from claim.response
> where response.account_id = 4766
> and response.expire_timestamp is null
> and response.create_timestamp >= DATE '2014-08-01'
> order by create_timestamp;
>
> The response table looks like this:
> "account_id";"integer"
> "file_type_id";"integer"
> "receiver_inbound_detail_id";"integer"
> "processing_status_id";"integer"
> "processing";"boolean"
> "expire_timestamp";"timestamp without time zone"
> "last_mod_timestamp";"timestamp without time zone"
> "create_timestamp";"timestamp without time zone"
> "response_trace_nbr";"character varying"
> "posted_timestamp";"timestamp without time zone"
> "need_to_post";"boolean"
> "response_message";"text"
> "worked";"boolean"
> "response_status_id";"integer"
> "response_type_id";"integer"
> "outbound_claim_detail_id";"bigint"
> "id";"bigint"
>
> Here are some rowcounts:
>
> SELECT count(*) from claim_response.response_201408;
>   count
> -
>  4585746
> (1 row)
>
> Time: 7271.054 ms
> SELECT count(*) from claim_response.response_201409;
>   count
> -
>  3523370
> (1 row)
>
> Time: 4341.116 ms
> SELECT count(*) from claim_response.response_201410;
>  count
> ---
>154
> (1 row)
>
> Time: 0.258 ms
>
> The entire table has 225,665,512 rows. I read that a partitioning rule of
> thumb is that benefits of partitioning occur starting around 100 million
> rows.
>
> SELECT count(*) from claim.response;
>count
> ---
>  225665512
> (1 row)
>
> Time: 685064.637 ms
>
>
> The partitioning is on the create_timestamp field.
>
> The server is Red Hat Enterprise Linux Server release 6.2 (Santiago) on a
> VM machine - 8 GB RAM with 2 CPUs:
>
> Architecture:  x86_64
> CPU op-mode(s):32-bit, 64-bit
> Byte Order:Little Endian
> CPU(s):2
> On-line CPU(s) list:   0,1
> Thread(s) per core:1
> Core(s) per socket:2
> CPU socket(s): 1
> NUMA node(s):  1
> Vendor ID: GenuineIntel
> CPU family:6
> Model: 44
> Stepping:  2
> CPU MHz:   2660.000
> BogoMIPS:  5320.00
> L1d cache: 32K
> L1i cache: 32K
> L2 cache:  256K
> L3 cache:  12288K
> NUMA node0 CPU(s): 0,1
>
>
>
> 2 users,  load average: 0.00, 0.12, 0.37
>
>
> Please see the following for the explain analysis :
>
> http://explain.depesz.com/s/I3SL
>
> I'm trying to understand why I'm getting the yellow, orange, and red on
> the inclusive, and the yellow on the exclusive. (referring to the
> explain.depesz.com/s/I3SL page.)
> I'm relatively new to PostgreSQL, but I've been an Oracle DBA for some
> time. I suspect the I/O may be dragging but I don't know how to dig that
> information out from here. Please point out anything else you can decipher
> from this.
>
> Thanks,
>
> John
>


Hi John,

Don't know about the colors, but the Stats tab looks fine. You've got
yourself 5 Index Scans, which are a very fast way to dig up data.

I noticed you've also cast your filter field "(create_timestamp >=
'2014-08-01'::date)". As far as I know, PostgreSQL doesn't need this kind
of explicit conversion. You would be fine with just "(create_timestamp >=
'2014-08-01')", as sketched below.

Regards,

Felipe


[PERFORM] Query Performance Problem

2014-10-21 Thread john
Hi all,

I'm experimenting with table partitioning through inheritance. I'm testing a query as follows:

explain (analyze, buffers)
select response.id
from claim.response
where response.account_id = 4766
and response.expire_timestamp is null
and response.create_timestamp >= DATE '2014-08-01'
order by create_timestamp;

The response table looks like this:
"account_id";"integer"
"file_type_id";"integer"
"receiver_inbound_detail_id";"integer"
"processing_status_id";"integer"
"processing";"boolean"
"expire_timestamp";"timestamp without time zone"
"last_mod_timestamp";"timestamp without time zone"
"create_timestamp";"timestamp without time zone"
"response_trace_nbr";"character varying"
"posted_timestamp";"timestamp without time zone"
"need_to_post";"boolean"
"response_message";"text"
"worked";"boolean"
"response_status_id";"integer"
"response_type_id";"integer"
"outbound_claim_detail_id";"bigint"
"id";"bigint"

Here are some rowcounts:

SELECT count(*) from claim_response.response_201408;
  count
-
 4585746
(1 row)

Time: 7271.054 ms

SELECT count(*) from claim_response.response_201409;
  count
-
 3523370
(1 row)

Time: 4341.116 ms

SELECT count(*) from claim_response.response_201410;
 count
---
   154
(1 row)

Time: 0.258 ms

The entire table has 225,665,512 rows. I read that a partitioning rule of thumb is that benefits of partitioning occur starting around 100 million rows.

SELECT count(*) from claim.response;
   count
---
 225665512
(1 row)

Time: 685064.637 ms

The partitioning is on the create_timestamp field.

The server is Red Hat Enterprise Linux Server release 6.2 (Santiago) on a VM machine - 8 GB RAM with 2 CPUs:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
CPU socket(s):         1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Stepping:              2
CPU MHz:               2660.000
BogoMIPS:              5320.00
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0,1

2 users,  load average: 0.00, 0.12, 0.37

Please see the following for the explain analysis:

http://explain.depesz.com/s/I3SL

I'm trying to understand why I'm getting the yellow, orange, and red on the inclusive, and the yellow on the exclusive. (referring to the explain.depesz.com/s/I3SL page.)
I'm relatively new to PostgreSQL, but I've been an Oracle DBA for some time. I suspect the I/O may be dragging but I don't know how to dig that information out from here. Please point out anything else you can decipher from this.

Thanks,

John
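
(An editorial aside: a minimal sketch of what the inheritance-based
partitioning described above typically looks like on pre-10 PostgreSQL. The
child table and CHECK constraint shown here are illustrative assumptions,
not John's actual DDL:)

-- one child table per month, attached to the parent via INHERITS
CREATE TABLE claim_response.response_201408 (
    CHECK (create_timestamp >= DATE '2014-08-01'
       AND create_timestamp <  DATE '2014-09-01')
) INHERITS (claim.response);

-- with constraint_exclusion = partition (the default), the planner can skip
-- children whose CHECK constraint contradicts the create_timestamp filter
SET constraint_exclusion = partition;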



Re: [PERFORM] query performance with hstore vs. non-hstore

2014-09-04 Thread Huang, Suya
Hi David,

Thanks for the reply.

>Calling 43s "close to" 70s doesn't sound right...

Oops, I'm not saying 43s is close to 70s... I mean that the plan generated
with timing disabled doesn't show an obvious difference compared to the
earlier plan I sent out with timing enabled.

>What version of PostgreSQL are you using?
>
>Two calls to each() and cast to numeric are not free.
>
>Your sequential scan savings is nearly 9 seconds but you lose all of that, and 
>more, when PostgreSQL evaluates the result of the scan and has to process the 
>each() and >the cast before it performs the join against the expanded result.  
>There is no planner node for this activity but it does cost time - in this 
>case more time than it >would take to simply store the native data types in 
>separate rows.
>
>You really should expand the hstore after the join (i.e., in the top-most
>select-list) but in this case since the join removed hardly any rows the gain 
>from doing so would be minimal.  The idea being you should not expand the 
>hstore of any row >that fails the join condition since it will not end up in 
>the final result anyway.
>
>Also, in this specific case, the call to each(...).key is pointless - you 
>never use the data.
>
>If you did need to use both columns, and are using 9.3, you should re-write 
>this to use LATERAL.
>
>In 9.2- you, possibly using a CTE, could do something like this:
>
>SELECT (each).* FROM (
>SELECT each(hs) FROM ( VALUES('k=>1'::hstore) ) h (hs)
>) src
>
>This is a single call to each(), in a subquery, which result is then expanded 
>using (col).* notation in the parent query.  This avoids calling each twice - 
>and note that >(each(...).*) does not work to avoid the double-call - you have 
>to use a subquery / a CTE one to ensure that it is not collapsed (offset 0 
>should work too but I find the >CTE one a little cleaner personally).
>

I'm using PostgreSQL 9.3.4.
I changed the query as you suggested. The execution time is still similar to
the original one.

dev=# explain analyze select (each).key as cha_type, sum((each).value::numeric) 
as visits from (select each(visits) from weekly_hstore a  join seg1 b on 
a.ref_id=b.ref_id )foo  group by cha_type  order by visits desc;
  QUERY 
PLAN
--
 Sort  (cost=9455046.69..9455047.19 rows=200 width=32) (actual 
time=70928.881..71425.833 rows=3639539 loops=1)
   Sort Key: (sum(((foo.each).value)::numeric))
   Sort Method: quicksort  Memory: 394779kB
   ->  HashAggregate  (cost=9455037.05..9455039.05 rows=200 width=32) (actual 
time=60077.937..61425.469 rows=3639539 loops=1)
 ->  Subquery Scan on foo  (cost=12029.58..5737447.05 rows=371759000 
width=32) (actual time=281.658..23912.400 rows=36962761 loops=1)
   ->  Hash Join  (cost=12029.58..2019857.05 rows=371759000 
width=186) (actual time=281.655..18759.265 rows=36962761 loops=1)
 Hash Cond: ((a.ref_id)::text = (b.ref_id)::text)
 ->  Seq Scan on weekly_hstore a  (cost=0.00..133321.14 
rows=1292314 width=232) (actual time=11.141..857.959 rows=1292314 loops=1)
 ->  Hash  (cost=7382.59..7382.59 rows=371759 width=47) 
(actual time=262.722..262.722 rows=371759 loops=1)
   Buckets: 65536  Batches: 1  Memory Usage: 28951kB
   ->  Seq Scan on seg1 b  (cost=0.00..7382.59 
rows=371759 width=47) (actual time=11.701..113.859 rows=371759 loops=1)
 Total runtime: 71626.871 ms
(12 rows)


-Original Message-
From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of David G Johnston
Sent: Tuesday, September 02, 2014 1:38 PM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] query performance with hstore vs. non-hstore

Huang, Suya wrote
> See output of explain (analyze,timing off), the total runtime is close 
> to the one enable timing.

Calling 43s "close to" 70s doesn't sound right...


> dev=# explain (analyze, timing off) select cha_type, sum(visits) from 
> (select (each(visits)).key as cha_type,(each(visits)).value::numeric 
> as visits from weekly_hstore a  join seg1 b on a.ref_id=b.ref_id )foo  
> group by cha_type  order by sum(visits) desc;

What version of PostgreSQL are you using?

Two calls to each() and cast to numeric are not free.

Your sequential scan savings is nearly 9 seconds but you lose all of that, and 
more, when PostgreSQL evaluates the result of the scan and has to process the 
each() and the cast before it performs the join against the expanded re

Re: [PERFORM] query performance with hstore vs. non-hstore

2014-09-01 Thread David G Johnston
Huang, Suya wrote
> See output of explain (analyze,timing off), the total runtime is close to
> the one enable timing.

Calling 43s "close to" 70s doesn't sound right...


> dev=# explain (analyze, timing off) select cha_type, sum(visits) from
> (select (each(visits)).key as cha_type,(each(visits)).value::numeric as
> visits from weekly_hstore a  join seg1 b on a.ref_id=b.ref_id )foo  group
> by cha_type  order by sum(visits) desc;

What version of PostgreSQL are you using?

Two calls to each() and cast to numeric are not free.

Your sequential scan savings is nearly 9 seconds but you lose all of that,
and more, when PostgreSQL evaluates the result of the scan and has to
process the each() and the cast before it performs the join against the
expanded result.  There is no planner node for this activity but it does
cost time - in this case more time than it would take to simply store the
native data types in separate rows.

You really should expand the hstore after the join (i.e., in the top-most
select-list) but in this case since the join removed hardly any rows the
gain from doing so would be minimal.  The idea being you should not expand
the hstore of any row that fails the join condition since it will not end up
in the final result anyway.

Also, in this specific case, the call to each(...).key is pointless - you
never use the data.

If you did need to use both columns, and are using 9.3, you should re-write
this to use LATERAL.

In 9.2- you, possibly using a CTE, could do something like this:

SELECT (each).* FROM (
SELECT each(hs) FROM ( VALUES('k=>1'::hstore) ) h (hs)
) src

This is a single call to each(), in a subquery, which result is then
expanded using (col).* notation in the parent query.  This avoids calling
each twice - and note that (each(...).*) does not work to avoid the
double-call - you have to use a subquery / a CTE one to ensure that it is
not collapsed (offset 0 should work too but I find the CTE one a little
cleaner personally).
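
(For reference, a rough sketch of the LATERAL form mentioned above, written
against the weekly_hstore/seg1 tables that appear elsewhere in this thread;
untested against the poster's data:)

SELECT kv.key AS cha_type,
       sum(kv.value::numeric) AS visits
FROM weekly_hstore a
JOIN seg1 b ON a.ref_id = b.ref_id
CROSS JOIN LATERAL each(a.visits) AS kv(key, value)
GROUP BY kv.key
ORDER BY visits DESC;

Here each() is called once per joined row, and both the key and the value are
available without a second evaluation.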

David J.





--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/query-performance-with-hstore-vs-non-hstore-tp5817109p5817281.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] query performance with hstore vs. non-hstore

2014-09-01 Thread Huang, Suya
Hi Pavel,

See output of explain (analyze, timing off); the total runtime is close to the
one with timing enabled.

dev=# EXPLAIN (ANALYZE, TIMING OFF)  select cha_type,sum(visits) from 
weekly_non_hstore a  join seg1 b on a.ref_id=b.ref_id group by cha_type order 
by sum(visits) desc;
QUERY PLAN
---
Sort  (cost=3674118.09..3674476.91 rows=143528 width=27) (actual rows=3639539 
loops=1)
   Sort Key: (sum(a.visits))
   Sort Method: quicksort  Memory: 391723kB
   ->  HashAggregate  (cost=3660388.94..3661824.22 rows=143528 width=27) 
(actual rows=3639539 loops=1)
 ->  Hash Join  (cost=12029.58..3301288.46 rows=71820096 width=27) 
(actual rows=36962761 loops=1)
   Hash Cond: ((a.ref_id)::text = (b.ref_id)::text)
   ->  Seq Scan on weekly_non_hstore a  (cost=0.00..1852856.96 
rows=71820096 width=75) (actual rows=71818882 loops=1)
   ->  Hash  (cost=7382.59..7382.59 rows=371759 width=47) (actual 
rows=371759 loops=1)
 Buckets: 65536  Batches: 1  Memory Usage: 28951kB
 ->  Seq Scan on seg1 b  (cost=0.00..7382.59 rows=371759 
width=47) (actual rows=371759 loops=1)
Total runtime: 42914.194 ms
(11 rows)


dev=# explain (analyze, timing off) select cha_type, sum(visits) from (select 
(each(visits)).key as cha_type,(each(visits)).value::numeric as visits from 
weekly_hstore a  join seg1 b on a.ref_id=b.ref_id )foo  group by cha_type  
order by sum(visits) desc;
 QUERY PLAN
-
Sort  (cost=7599039.89..7599040.39 rows=200 width=64) (actual rows=3639539 
loops=1)
   Sort Key: (sum((((each(a.visits)).value)::numeric)))
   Sort Method: quicksort  Memory: 394779kB
   ->  HashAggregate  (cost=7599030.24..7599032.24 rows=200 width=64) (actual 
rows=3639539 loops=1)
 ->  Hash Join  (cost=12029.58..2022645.24 rows=371759000 width=186) 
(actual rows=36962761 loops=1)
   Hash Cond: ((a.ref_id)::text = (b.ref_id)::text)
   ->  Seq Scan on weekly_hstore a  (cost=0.00..133321.14 
rows=1292314 width=232) (actual rows=1292314 loops=1)
   ->  Hash  (cost=7382.59..7382.59 rows=371759 width=47) (actual 
rows=371759 loops=1)
 Buckets: 65536  Batches: 1  Memory Usage: 28951kB
 ->  Seq Scan on seg1 b  (cost=0.00..7382.59 rows=371759 
width=47) (actual rows=371759 loops=1)
Total runtime: 69521.570 ms
(11 rows)

Thanks,
Suya

From: Pavel Stehule [mailto:pavel.steh...@gmail.com]
Sent: Monday, September 01, 2014 5:07 PM
To: Huang, Suya
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] query performance with hstore vs. non-hstore



2014-09-01 8:54 GMT+02:00 Huang, Suya 
mailto:suya.hu...@au.experian.com>>:
Thank you Pavel.

The cost of unpacking hstore comparing to non-hstore could be calculated by:
Seq scan on hstore table + hash join with seg1 table:
Hstore: 416.741+ 34619.879 =~34 seconds
Non-hstore: 8858.594 +26477.652 =~ 34 seconds

The subsequent hash-aggregate and sort operation should be working on the 
unpacked hstore rows which has same row counts as non-hstore table. however, 
timing on those operations actually makes the big difference.

I don’t quite get why…

These values can be messy -- timing in EXPLAIN ANALYZE has relative big impact 
but different for some methods
try to watch complete time for EXPLAIN (ANALYZE, TIMING OFF)


Thanks,
Suya

From: Pavel Stehule 
[mailto:pavel.steh...@gmail.com<mailto:pavel.steh...@gmail.com>]
Sent: Monday, September 01, 2014 4:22 PM
To: Huang, Suya
Cc: pgsql-performance@postgresql.org<mailto:pgsql-performance@postgresql.org>
Subject: Re: [PERFORM] query performance with hstore vs. non-hstore

Hi
In this use case hstore should not help .. there is relatively high overhead
related to unpacking hstore -- so the classic schema is better.
Hstore should not replace a well-normalized schema - it should be a replacement
for semi-normalized structures such as EAV.
Hstore can have some profit from TOAST .. compression, less system data
overhead, but this advantage only starts from a certain data length. You should
see this benefit in the table size. When the table with hstore is smaller than
the one without, then there is a benefit from hstore. The last benefit of hstore
is indexes over (key, value) tuples .. but you don't use them.
Regards

Pavel

2014-09-01 8:10 GMT+02:00 Huang, Suya 
mailto:suya.hu...@au.experian.com>>:
Hi ,

I’m tweaking table layout to get better performance of query. One table doesn’t 
use hstore but expand all metrics of cha_type to different rows. The other 
table has hstore for metrics column as cha_t

Re: [PERFORM] query performance with hstore vs. non-hstore

2014-09-01 Thread Pavel Stehule
2014-09-01 8:54 GMT+02:00 Huang, Suya :

>  Thank you Pavel.
>
>
>
> The cost of unpacking hstore comparing to non-hstore could be calculated
> by:
>
> Seq scan on hstore table + hash join with seg1 table:
>
> Hstore: 416.741+ 34619.879 =~34 seconds
>
> Non-hstore: 8858.594 +26477.652 =~ 34 seconds
>
>
>
> The subsequent hash-aggregate and sort operation should be working on the
> unpacked hstore rows which has same row counts as non-hstore table.
> however, timing on those operations actually makes the big difference.
>

>
> I don’t quite get why…
>

These values can be messy -- timing in EXPLAIN ANALYZE has relative big
impact but different for some methods

try to watch complete time for EXPLAIN (ANALYZE, TIMING OFF)


>
>
> Thanks,
>
> Suya
>
>
>
> *From:* Pavel Stehule [mailto:pavel.steh...@gmail.com]
> *Sent:* Monday, September 01, 2014 4:22 PM
> *To:* Huang, Suya
> *Cc:* pgsql-performance@postgresql.org
> *Subject:* Re: [PERFORM] query performance with hstore vs. non-hstore
>
>
>
> Hi
>
> In this use case hstore should not help .. there is relatively high overhead
> related to unpacking hstore -- so the classic schema is better.
>
> Hstore should not replace a well-normalized schema - it should be a
> replacement for semi-normalized structures such as EAV.
>
> Hstore can have some profit from TOAST .. compression, less system data
> overhead, but this advantage only starts from a certain data length. You
> should see this benefit in the table size. When the table with hstore is
> smaller than the one without, then there is a benefit from hstore. The last
> benefit of hstore is indexes over (key, value) tuples .. but you don't use them.
>
> Regards
>
> Pavel
>
>
>
> 2014-09-01 8:10 GMT+02:00 Huang, Suya :
>
> Hi ,
>
>
>
> I’m tweaking table layout to get better performance of query. One table
> doesn’t use hstore but expand all metrics of cha_type to different rows.
> The other table has hstore for metrics column as cha_type->metrics so it
> has less records than the first one.
>
>
>
> I would be expecting the query on seconds table has better performance
> than the first one. However, it’s not the case at all. I’m wondering if
> there’s something wrong with my execution plan? With the hstore table, the
> optimizer has totally wrong estimation on row counts at hash aggregate
> stage and it takes 34 seconds on hash-join,25 seconds on hash-aggregate, 10
> seconds on sort. However, with non-hstore table, it takes 17 seconds on
> hash join, 18 seconds on hashaggregate and 2 seconds on sort.
>
>
>
> Can someone help me to explain why this is happening? And is there a way
> to fine-tune the query?
>
>
>
> Table structure
>
>
>
> dev=# \d+ weekly_non_hstore
>
>  Table "test.weekly_non_hstore"
>
>   Column  |  Type  | Modifiers | Storage  | Stats target |
> Description
>
>
> --++---+--+--+-
>
> date | date   |   | plain|  |
>
> ref_id| character varying(256) |   | extended |  |
>
> cha_typel  | text   |   | extended |  |
>
> visits   | double precision   |   | plain|  |
>
> pages| double precision   |   | plain|  |
>
> duration | double precision   |   | plain|  |
>
> Has OIDs: no
>
> Tablespace: "tbs_data"
>
>
>
> dev=# \d+ weekly_hstore
>
>Table "test.weekly_hstore"
>
>   Column  |  Type  | Modifiers | Storage  | Stats target |
> Description
>
>
> --++---+--+--+-
>
> date | date   |   | plain|  |
>
> ref_id| character varying(256) |   | extended |  |
>
> visits   | hstore |   | extended |  |
>
> pages| hstore |   | extended |  |
>
> duration | hstore |   | extended |  |
>
> Has OIDs: no
>
> Tablespace: "tbs_data"
>
>
>
> dev=# select count(*) from weekly_non_hstore;
>
>   count
>
> --
>
> 71818882
>
> (1 row)
>
>
>
>
>
> dev=# select count(*) from weekly_hstore;
>
>   count
>
> -
>
> 1292314
>
> (1 row)
>
>
>
>
>
> Query
>
> dev=# explain analyze select cha_type,sum(visits) from we

Re: [PERFORM] query performance with hstore vs. non-hstore

2014-08-31 Thread Huang, Suya
Thank you Pavel.

The cost of unpacking hstore compared to non-hstore could be calculated by:
Seq scan on hstore table + hash join with seg1 table:
Hstore: 416.741+ 34619.879 =~34 seconds
Non-hstore: 8858.594 +26477.652 =~ 34 seconds

The subsequent hash-aggregate and sort operations should be working on the
unpacked hstore rows, which have the same row count as the non-hstore table.
However, the timing of those operations is what actually makes the big difference.

I don’t quite get why…

Thanks,
Suya

From: Pavel Stehule [mailto:pavel.steh...@gmail.com]
Sent: Monday, September 01, 2014 4:22 PM
To: Huang, Suya
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] query performance with hstore vs. non-hstore

Hi
In this use case hstore should not help .. there is relatively high overhead
related to unpacking hstore -- so the classic schema is better.
Hstore should not replace a well-normalized schema - it should be a replacement
for semi-normalized structures such as EAV.
Hstore can have some profit from TOAST .. compression, less system data
overhead, but this advantage only starts from a certain data length. You should
see this benefit in the table size. When the table with hstore is smaller than
the one without, then there is a benefit from hstore. The last benefit of hstore
is indexes over (key, value) tuples .. but you don't use them.
Regards

Pavel

2014-09-01 8:10 GMT+02:00 Huang, Suya 
mailto:suya.hu...@au.experian.com>>:
Hi ,

I’m tweaking table layout to get better performance of query. One table doesn’t 
use hstore but expand all metrics of cha_type to different rows. The other 
table has hstore for metrics column as cha_type->metrics so it has less records 
than the first one.

I would be expecting the query on seconds table has better performance than the 
first one. However, it’s not the case at all. I’m wondering if there’s 
something wrong with my execution plan? With the hstore table, the optimizer 
has totally wrong estimation on row counts at hash aggregate stage and it takes 
34 seconds on hash-join,25 seconds on hash-aggregate, 10 seconds on sort. 
However, with non-hstore table, it takes 17 seconds on hash join, 18 seconds on 
hashaggregate and 2 seconds on sort.

Can someone help me to explain why this is happening? And is there a way to 
fine-tune the query?

Table structure

dev=# \d+ weekly_non_hstore
 Table "test.weekly_non_hstore"
  Column  |  Type  | Modifiers | Storage  | Stats target | 
Description
--++---+--+--+-
date | date   |   | plain|  |
ref_id| character varying(256) |   | extended |  |
cha_typel  | text   |   | extended |  |
visits   | double precision   |   | plain|  |
pages| double precision   |   | plain|  |
duration | double precision   |   | plain|  |
Has OIDs: no
Tablespace: "tbs_data"

dev=# \d+ weekly_hstore
   Table "test.weekly_hstore"
  Column  |  Type  | Modifiers | Storage  | Stats target | 
Description
--++---+--+--+-
date | date   |   | plain|  |
ref_id| character varying(256) |   | extended |  |
visits   | hstore |   | extended |  |
pages| hstore |   | extended |  |
duration | hstore |   | extended |  |
Has OIDs: no
Tablespace: "tbs_data"

dev=# select count(*) from weekly_non_hstore;
  count
--
71818882
(1 row)


dev=# select count(*) from weekly_hstore;
  count
-
1292314
(1 row)


Query
dev=# explain analyze select cha_type,sum(visits) from weekly_non_hstore a  
join seg1 b on a.ref_id=b.ref_id group by cha_type order by sum(visits) desc;
   
QUERY PLAN

Sort  (cost=3674073.37..3674431.16 rows=143115 width=27) (actual 
time=47520.637..47969.658 rows=3639539 loops=1)
   Sort Key: (sum(a.visits))
   Sort Method: quicksort  Memory: 391723kB
   ->  HashAggregate  (cost=3660386.70..3661817.85 rows=143115 width=27) 
(actual time=43655.637..44989.202 rows=3639539 loops=1)
 ->  Hash Join  (cost=12029.58..3301286.54 rows=71820032 width=27) 
(actual time=209.789..26477.652 rows=36962761 loops=1)
   Hash Cond: ((a.ref_id)::text = (b.ref_id)::text)
   ->  Seq Scan on weekly_non_hstore a  (cost=0.00..1852856.32 
rows=71820032 width=75) (actual time=0.053..8858.594 rows=71818882 loops=1)
 

Re: [PERFORM] query performance with hstore vs. non-hstore

2014-08-31 Thread Pavel Stehule
Hi

In this use case hstore should not help .. there is relatively high overhead
related to unpacking hstore -- so the classic schema is better.

Hstore should not replace a well-normalized schema - it should be a
replacement for semi-normalized structures such as EAV.

Hstore can have some profit from TOAST .. compression, less system data
overhead, but this advantage only starts from a certain data length. You
should see this benefit in the table size. When the table with hstore is
smaller than the one without, then there is a benefit from hstore. The last
benefit of hstore is indexes over (key, value) tuples .. but you don't use them.
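
(Editorial sketch of the (key, value) indexing mentioned above, using the
weekly_hstore table from this thread; the index name and the looked-up pair
are made up for illustration:)

CREATE INDEX weekly_hstore_visits_gin ON weekly_hstore USING gin (visits);

-- containment lookups such as this one can then use the GIN index
SELECT ref_id FROM weekly_hstore WHERE visits @> 'some_cha_type=>10'::hstore;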

Regards

Pavel


2014-09-01 8:10 GMT+02:00 Huang, Suya :

>  Hi ,
>
>
>
> I’m tweaking table layout to get better performance of query. One table
> doesn’t use hstore but expand all metrics of cha_type to different rows.
> The other table has hstore for metrics column as cha_type->metrics so it
> has less records than the first one.
>
>
>
> I would be expecting the query on seconds table has better performance
> than the first one. However, it’s not the case at all. I’m wondering if
> there’s something wrong with my execution plan? With the hstore table, the
> optimizer has totally wrong estimation on row counts at hash aggregate
> stage and it takes 34 seconds on hash-join,25 seconds on hash-aggregate, 10
> seconds on sort. However, with non-hstore table, it takes 17 seconds on
> hash join, 18 seconds on hashaggregate and 2 seconds on sort.
>
>
>
> Can someone help me to explain why this is happening? And is there a way
> to fine-tune the query?
>
>
>
> Table structure
>
>
>
> dev=# \d+ weekly_non_hstore
>
>  Table "test.weekly_non_hstore"
>
>   Column  |  Type  | Modifiers | Storage  | Stats target |
> Description
>
>
> --++---+--+--+-
>
> date | date   |   | plain|  |
>
> ref_id| character varying(256) |   | extended |  |
>
> cha_typel  | text   |   | extended |  |
>
> visits   | double precision   |   | plain|  |
>
> pages| double precision   |   | plain|  |
>
> duration | double precision   |   | plain|  |
>
> Has OIDs: no
>
> Tablespace: "tbs_data"
>
>
>
> dev=# \d+ weekly_hstore
>
>Table "test.weekly_hstore"
>
>   Column  |  Type  | Modifiers | Storage  | Stats target |
> Description
>
>
> --++---+--+--+-
>
> date | date   |   | plain|  |
>
> ref_id| character varying(256) |   | extended |  |
>
> visits   | hstore |   | extended |  |
>
> pages| hstore |   | extended |  |
>
> duration | hstore |   | extended |  |
>
> Has OIDs: no
>
> Tablespace: "tbs_data"
>
>
>
> dev=# select count(*) from weekly_non_hstore;
>
>   count
>
> --
>
> 71818882
>
> (1 row)
>
>
>
>
>
> dev=# select count(*) from weekly_hstore;
>
>   count
>
> -
>
> 1292314
>
> (1 row)
>
>
>
>
>
> Query
>
> dev=# explain analyze select cha_type,sum(visits) from weekly_non_hstore
> a  join seg1 b on a.ref_id=b.ref_id group by cha_type order by sum(visits)
> desc;
>
>
>  QUERY PLAN
>
>
> 
>
> Sort  (cost=3674073.37..3674431.16 rows=143115 width=27) (actual
> time=47520.637..47969.658 rows=3639539 loops=1)
>
>Sort Key: (sum(a.visits))
>
>Sort Method: quicksort  Memory: 391723kB
>
>->  HashAggregate  (cost=3660386.70..3661817.85 rows=143115 width=27)
> (actual time=43655.637..44989.202 rows=3639539 loops=1)
>
>  ->  Hash Join  (cost=12029.58..3301286.54 rows=71820032 width=27)
> (actual time=209.789..26477.652 rows=36962761 loops=1)
>
>Hash Cond: ((a.ref_id)::text = (b.ref_id)::text)
>
>->  Seq Scan on weekly_non_hstore a  (cost=0.00..1852856.32
> rows=71820032 width=75) (actual time=0.053..8858.594 rows=71818882 loops=1)
>
>->  Hash  (cost=7382.59..7382.59 rows=371759 width=47)
> (actual time=209.189..209.189 rows=371759 loops=1)
>
>  Buckets: 65536  Batches: 1  Memory Usage: 28951kB
>
>  ->  Seq Scan on seg1 b  (cost=0.00..7382.59
> rows=371759 width=47) (actual time=0.014..64.695 rows=371759 loops=1)
>
> Total runtime: 48172.405 ms
>
> (11 rows)
>
>
>
> Time: 48173.569 ms
>
>
>
> dev=# explain analyze select cha_type, sum(visits) from (select
> (each(visits)).key as cha_type,(each(visits)).value::numeric as visits from
> weekly_hstore a  join seg1 b on a.ref_id=b.ref_id )f

[PERFORM] query performance with hstore vs. non-hstore

2014-08-31 Thread Huang, Suya
Hi ,

I'm tweaking the table layout to get better query performance. One table doesn't
use hstore but expands all metrics of cha_type into different rows. The other
table has hstore for the metrics column as cha_type->metrics, so it has fewer
records than the first one.

I would expect the query on the second table to have better performance than the
first one. However, that's not the case at all. I'm wondering if there's
something wrong with my execution plan? With the hstore table, the optimizer
has a totally wrong estimation of row counts at the hash aggregate stage, and it
takes 34 seconds on hash join, 25 seconds on hash aggregate and 10 seconds on
sort. However, with the non-hstore table, it takes 17 seconds on hash join, 18
seconds on hash aggregate and 2 seconds on sort.

Can someone help me understand why this is happening? And is there a way to
fine-tune the query?

Table structure

dev=# \d+ weekly_non_hstore
 Table "test.weekly_non_hstore"
  Column  |  Type  | Modifiers | Storage  | Stats target | 
Description
--++---+--+--+-
date | date   |   | plain|  |
ref_id| character varying(256) |   | extended |  |
cha_typel  | text   |   | extended |  |
visits   | double precision   |   | plain|  |
pages| double precision   |   | plain|  |
duration | double precision   |   | plain|  |
Has OIDs: no
Tablespace: "tbs_data"

dev=# \d+ weekly_hstore
   Table "test.weekly_hstore"
  Column  |  Type  | Modifiers | Storage  | Stats target | 
Description
--++---+--+--+-
date | date   |   | plain|  |
ref_id| character varying(256) |   | extended |  |
visits   | hstore |   | extended |  |
pages| hstore |   | extended |  |
duration | hstore |   | extended |  |
Has OIDs: no
Tablespace: "tbs_data"

dev=# select count(*) from weekly_non_hstore;
  count
--
71818882
(1 row)


dev=# select count(*) from weekly_hstore;
  count
-
1292314
(1 row)


Query
dev=# explain analyze select cha_type,sum(visits) from weekly_non_hstore a  
join seg1 b on a.ref_id=b.ref_id group by cha_type order by sum(visits) desc;
   
QUERY PLAN

Sort  (cost=3674073.37..3674431.16 rows=143115 width=27) (actual 
time=47520.637..47969.658 rows=3639539 loops=1)
   Sort Key: (sum(a.visits))
   Sort Method: quicksort  Memory: 391723kB
   ->  HashAggregate  (cost=3660386.70..3661817.85 rows=143115 width=27) 
(actual time=43655.637..44989.202 rows=3639539 loops=1)
 ->  Hash Join  (cost=12029.58..3301286.54 rows=71820032 width=27) 
(actual time=209.789..26477.652 rows=36962761 loops=1)
   Hash Cond: ((a.ref_id)::text = (b.ref_id)::text)
   ->  Seq Scan on weekly_non_hstore a  (cost=0.00..1852856.32 
rows=71820032 width=75) (actual time=0.053..8858.594 rows=71818882 loops=1)
   ->  Hash  (cost=7382.59..7382.59 rows=371759 width=47) (actual 
time=209.189..209.189 rows=371759 loops=1)
 Buckets: 65536  Batches: 1  Memory Usage: 28951kB
 ->  Seq Scan on seg1 b  (cost=0.00..7382.59 rows=371759 
width=47) (actual time=0.014..64.695 rows=371759 loops=1)
Total runtime: 48172.405 ms
(11 rows)

Time: 48173.569 ms

dev=# explain analyze select cha_type, sum(visits) from (select 
(each(visits)).key as cha_type,(each(visits)).value::numeric as visits from 
weekly_hstore a  join seg1 b on a.ref_id=b.ref_id )foo  group by cha_type  
order by sum(visits) desc;
   QUERY 
PLAN
-
Sort  (cost=7599039.89..7599040.39 rows=200 width=64) (actual 
time=70424.561..70986.202 rows=3639539 loops=1)
   Sort Key: (sum((((each(a.visits)).value)::numeric)))
   Sort Method: quicksort  Memory: 394779kB
   ->  HashAggregate  (cost=7599030.24..7599032.24 rows=200 width=64) (actual 
time=59267.120..60502.647 rows=3639539 loops=1)
 ->  Hash Join  (cost=12029.58..2022645.24 rows=371759000 width=184) 
(actual time=186.140..34619.879 rows=36962761 loops=1)
   Hash Cond: ((a.ref_id)::text = (b.ref_id)::text)
   ->  Seq Scan on weekly_hstore a  (cost=0.00..133321.14 
row

Re: [PERFORM] Query Performance question

2014-07-14 Thread Magers, James
Thank you Tomas.  I did execute the queries against a dataset that was 
representative of what we expect the production dataset to be.  By higher 
volume I meant more transactions happening against the data.  We expect 
the data size to increase over time, and when we executed against a dataset that 
was about 4x larger, the index scan was selected to perform the lookup instead of 
the bitmap heap scan.  The scans of both the smaller and larger datasets 
returned in similar times between the two groups of tests.  This is part 
of the reason I was thinking the bitmap heap scan may not be as 
efficient, since 4 times the data was returned in just a little more time using the 
index scan.


Thank you,
James

-Original Message-
From: Tomas Vondra [mailto:t...@fuzzy.cz] 
Sent: Monday, July 14, 2014 12:29 PM
To: Magers, James
Cc: Tomas Vondra; Thomas Kellerer; pgsql-performance@postgresql.org
Subject: RE: [PERFORM] Query Performance question

On 14 Červenec 2014, 18:02, Magers, James wrote:
> Tomas,
>
> Thank you for the recommendation.  In this case, the bitmap scan runs
> quite quickly; however, in production, where data may or may not be cached
> and at higher volumes, I am trying to ensure the process will continue to
> execute efficiently and reduce the impact of the process on other
> processes running against the database.

That's why it's important to do the testing with a representative amount of
data. Testing the queries on a significantly reduced dataset is pointless,
because the optimizer will make different decisions.

> My assessment is based on my experiences with the scans.  Does your
> experience provide you with a different assessment of the scan types and
> how efficient they may be?

No. Because I don't have your data. And it seems that your assessment is
based on experience with a dataset that's very different from your expected
production dataset, which means the experience is not directly applicable.
The optimizer considers the size of the dataset when choosing the plan.

regards
Tomas




Re: [PERFORM] Query Performance question

2014-07-14 Thread Tomas Vondra
On 14 Červenec 2014, 18:02, Magers, James wrote:
> Tomas,
>
> Thank you for the recommendation.  In this case, the bitmap scan runs
> quite quickly; however, in production, where data may or may not be cached
> and at higher volumes, I am trying to ensure the process will continue to
> execute efficiently and reduce the impact of the process on other
> processes running against the database.

That's why it's important to do the testing with a representative amount of
data. Testing the queries on a significantly reduced dataset is pointless,
because the optimizer will make different decisions.

> My assessment is based on my experiences with the scans.  Does your
> experience provide you with a different assessment of the scan types and
> how efficient they may be?

No. Because I don't have your data. And it seems that your assessment is
based on experience with a dataset that's very different from your expected
production dataset, which means the experience is not directly applicable.
The optimizer considers the size of the dataset when choosing the plan.

regards
Tomas





Re: [PERFORM] Query Performance question

2014-07-14 Thread Magers, James
Tomas,

Thank you for the recommendation.  In this case, the bitmap scan runs quite 
quickly; however, in production, where data may or may not be cached and at higher 
volumes, I am trying to ensure the process will continue to execute efficiently 
and reduce the impact of the process on other processes running against the 
database.

My assessment is based on my experiences with the scans.  Does your experience 
provide you with a different assessment of the scan types and how efficient 
they may be?

Thank you,
James






Re: [PERFORM] Query Performance question

2014-07-14 Thread Tomas Vondra
On 14 Červenec 2014, 16:00, Magers, James wrote:
> Thomas,
>
> I would have to agree that the current results do indicate that.  However,
> I have run this explain analyze multiple times and the timing varies from
> about 4ms to 35ms using the Bitmap Heap Scan.  Here is an explain plan
> from Thursday of last week that shows about 21ms.  Part of the issue in
> trying to isolate whether the query can be faster is that, once the data is
> cached, any way the query is executed appears to be quicker.
>
> http://explain.depesz.com/s/SIX1

I think that judging the performance based on this limited number of
samples is futile, especially when the plans are this fast. The
measurements are easily influenced by other tasks running on the system,
OS process scheduling etc. Or it might be because of memory pressure on
the system, evicting the important data from the page cache (and thus
causing I/O for queries accessing it).

This might be the reason why you saw higher timings, and it's impossible
to say based solely on an explain plan from a single execution. To get
meaningful numbers it's necessary to execute the query repeatedly, to
eliminate caching effects. But the question is whether these caching
effects will happen in production or not. (Because what if you tweak the
configuration to get the best plan based on the assumption that everything
is cached, when it won't be in practice?)

That being said, the only plan that's actually faster than the bitmap
index scan (which you believe is inefficient) is this one

   http://explain.depesz.com/s/3wna

The reason why it's not selected by the optimizer is that the cost is
estimated to be 20.60, while the bitmap index scan cost is estimated as
20.38. So the optimizer decides that 20.38 is lower than 20.60, and thus
chooses the bitmap index scan.

What you may do is tweak cost constants, described here

www.postgresql.org/docs/9.4/static/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS

You need to increase the bitmap index scan cost estimate so that it's more
expensive than the index scan. I'd guess that increasing
cpu_tuple_cost and/or cpu_index_tuple_cost a bit should do the trick.
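
For example, a minimal session-level sketch (the numbers are purely
illustrative, not recommendations; the defaults are 0.01 and 0.005):

  SET cpu_tuple_cost = 0.02;
  SET cpu_index_tuple_cost = 0.01;
  -- re-run EXPLAIN ANALYZE on the original query here and compare the plans
  RESET cpu_tuple_cost;
  RESET cpu_index_tuple_cost;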

regards
Tomas






Re: [PERFORM] Query Performance question

2014-07-14 Thread Magers, James
Thomas,

I would have to agree that the current results do indicate that.  However, I 
have run this explain analyze multiple times and the timing varies from about 
4ms to 35ms using the Bitmap Heap Scan.  Here is an explain plan from Thursday 
of last week that shows about 21ms.  Part of the issue in trying to isolate 
whether the query can be faster is that, once the data is cached, any way the 
query is executed appears to be quicker.

http://explain.depesz.com/s/SIX1

Thank you,
James



Re: [PERFORM] Query Performance question

2014-07-14 Thread Thomas Kellerer
Magers, James, 14.07.2014 15:18:
> Thank you.  I executed the query this morning after disabling the scan types. 
>  
> I am including links to explain.depesz output for each of the three 
> variations that I executed.  
> 
> indexscan and bitmapscan off: http://explain.depesz.com/s/sIx
> seqscan and bitmapscan off: http://explain.depesz.com/s/GfM
> bitmapscan off: http://explain.depesz.com/s/3wna
> 

So the original query (using an "Index Scan" + "Bitmap Index Scan") is indeed 
the most efficient one: 4ms vs. 39ms vs. 64ms 










Re: [PERFORM] Query Performance question

2014-07-14 Thread Magers, James
Thomas,

Thank you.  I executed the query this morning after disabling the scan types.  
I am including links to explain.depesz output for each of the three variations 
that I executed.  

indexscan and bitmapscan off: http://explain.depesz.com/s/sIx
seqscan and bitmapscan off: http://explain.depesz.com/s/GfM
bitmapscan off: http://explain.depesz.com/s/3wna


Thank you,
James




Re: [PERFORM] Query Performance question

2014-07-14 Thread Thomas Kellerer
Magers, James, 14.07.2014 04:20:

> Thank you for your feedback.  I am attaching the requested information.  
> While I do not think the query is necessarily inefficient, I believe a
> sequential scan would be more efficient.

You can try

set enable_indexscan = off;
set enable_bitmapscan = off;

and then run your query. 

But I would be very surprised if a seq scan (which reads through the whole 
table) was faster than those 4ms you have now.






Re: [PERFORM] Query Performance question

2014-07-13 Thread Magers, James
Tomas,

Thank you for your feedback.  I am attaching the requested information.  While 
I do not think the query is necessarily inefficient, I believe a sequential 
scan would be more efficient.

\d member_subscription_d

Table "public.member_subscription_d"
Column |   Type   |
Modifiers
---+--+-
 subscription_key  | bigint   | not null default 
nextval('subscription_id_seq'::regclass)
 version   | integer  | not null
 date_from | timestamp with time zone |
 date_to   | timestamp with time zone |
 newsletter_nme| character varying(50)|
 subscription_platform | character varying(50)|
 subscription_value| character varying(255)   |
 subscription_status   | character varying(100)   |
 list_status   | character varying(25)|
 current_status| boolean  |
 unsubscribetoken  | character varying(200)   |
 transaction_date  | timestamp with time zone |
 newsletter_sts| integer  |
Indexes:
"member_subscription_key" PRIMARY KEY, btree (subscription_key)
"idx_member_subscription_d_list_status" btree (list_status)
"idx_member_subscription_d_newsletter_nme" btree (newsletter_nme)
"idx_member_subscription_d_subscription_status" btree (subscription_status)
"idx_member_subscription_d_subscription_value" btree (subscription_value)
"idx_member_subscription_d_tk" btree (subscription_key)
Referenced by:
TABLE "member_recipient_f" CONSTRAINT 
"member_subscription_d_recipient_f_fk" FOREIGN KEY (subscription_key) 
REFERENCES member_subscription_d(subscription_key)



pgahq_datamart=# SELECT relname, relpages, reltuples
pgahq_datamart-# FROM pg_class WHERE relname = 'member_subscription_d';
relname| relpages | reltuples
---+--+---
 member_subscription_d | 1383 | 63012
(1 row)


Explain output:
http://explain.depesz.com/s/OVK



Re: [PERFORM] Query Performance question

2014-07-13 Thread Tomas Vondra
On 14.7.2014 00:55, Magers, James wrote:
> I am using a Pentaho process to access the database and select the
> appropriate information to update the DB tables and records.  I am
> trying to select the previous subscription key in order to update the
> fact table, so that any records that carry the previous key are updated to the current
> subscription key. This query is intended to use the current subscription
> key and subscription info to select the previous subscription key to
> allow for the information to be updated.  I would like to optimize the
> query to execute more efficiently.
> 
> The database table has about 60K records in it and when I run an explain
> analyze it indicates that the query optimizer chooses to execute a
> bitmap heap scan; this seems like an inefficient method for this query.

Why do you think it's inefficient? The planner thinks it's efficient,
for some reason. And it's impossible to say if that's a good decision,
because we don't know (a) the explain plan, and (b) structure of the
table involved (indexes, ...).

Please post the explain analyze output to explain.depesz.com and post
the link here (it's more readable than posting it here directly).

Also, please do this:

  SELECT relname, relpages, reltuples
FROM pg_class WHERE relname = 'member_subscription_d'

and this

  \d member_subscription_d

and post the results here.

regards
Tomas




[PERFORM] Query Performance question

2014-07-13 Thread Magers, James
I am using a Pentaho process to access the database and select the appropriate 
information to update the DB tables and records.  I am trying to select the 
previous subscription key in order to update the fact table, so that any records 
that carry the previous key are updated to the current subscription key. This 
query is intended to use the current subscription key and subscription info to 
select the previous subscription key so the information can be updated.  I 
would like to optimize the query to execute more efficiently.

The database table has about 60K records in it, and when I run an explain 
analyze it indicates that the query optimizer chooses to execute a bitmap heap 
scan; this seems like an inefficient method for this query.

Query:
Select subscription_key as prev_sub_key
from member_subscription_d
where subscription_value 
='non...@mailinator.com'
and newsletter_nme = 'newsletter_member'
and subscription_platform = 'email'
and version = (select version -1 as mtch_vers
   from member_subscription_d
   where subscription_key = 4037516)

Current Data in Database for this address:
 subscription_key | version | date_from              | date_to                    | newsletter_nme    | subscription_platform | subscription_value    | subscription_status | list_status | current_status | unsubscribetoken | transaction_date       | newsletter_sts
------------------+---------+------------------------+----------------------------+-------------------+-----------------------+-----------------------+---------------------+-------------+----------------+------------------+------------------------+----------------
          4001422 |       1 | 2000-02-09 00:00:00-05 | 2014-04-19 09:57:24-04     | newsletter_member | email                 | non...@mailinator.com | VALID               | pending     | f              |                  | 2000-02-09 00:00:00-05 |              2
          4019339 |       2 | 2014-04-19 09:57:24-04 | 2014-06-04 12:27:34-04     | newsletter_member | email                 | non...@mailinator.com | VALID               | subscribe   | f              |                  | 2014-04-19 09:57:24-04 |              1
          4037516 |       3 | 2014-06-04 12:27:34-04 | 2199-12-31 23:59:59.999-05 | newsletter_member | email                 | non...@mailinator.com | VALID               | subscribe   | t              |                  | 2014-06-04 12:27:34-04 |              1
(3 rows)

System information:
Postgres Version: 9.2
OS : Linux cmprodpgsql1 3.2.0-37-virtual #58-Ubuntu SMP Thu Jan 24 15:48:03 UTC 
2013 x86_64 x86_64 x86_64 GNU/Linux
Pentaho: 5.0.1-stable

postgresql.conf
checkpoint_segments = '8'
data_directory = '/var/lib/postgresql/9.2/main'
datestyle = 'iso, mdy'
default_text_search_config = 'pg_catalog.english'
effective_cache_size = '2GB'
external_pid_file = '/var/run/postgresql/9.2-main.pid'
hba_file = '/etc/postgresql/9.2/main/pg_hba.conf'
ident_file = '/etc/postgresql/9.2/main/pg_ident.conf'
listen_addresses = '*'
log_line_prefix = '%t '
max_connections = '200'
max_wal_senders = '3'
port = 5432
shared_buffers = '1024MB'
ssl = off
ssl_cert_file = '/etc/ssl/certs/ssl-cert-snakeoil.pem'
ssl_key_file = '/etc/ssl/certs/ssl-cert-snakeoil.key'
timezone = 'localtime'
unix_socket_directory = '/var/run/postgresql'
wal_keep_segments = '8'
wal_level = 'hot_standby'
work_mem = '100MB'


Re: [PERFORM] Query performance

2013-06-13 Thread Sergey Konoplev
On Thu, Jun 13, 2013 at 12:49 AM, K P Manoj  wrote:
> One of my queries is having a performance issue on my production server.
> When I run the query on my parent table with a specific condition (hard-coded
> value), the explain plan shows that it uses only the proper child table and
> its index, but when I use a table condition (instead of a hard-coded value),
> the query planner goes through all the child tables. Can I know where I am
> wrong?

>From the docs:

"Constraint exclusion only works when the query's WHERE clause
contains constants (or externally supplied parameters). For example, a
comparison against a non-immutable function such as CURRENT_TIMESTAMP
cannot be optimized, since the planner cannot know which partition the
function value might fall into at run time."

http://www.postgresql.org/docs/9.2/static/ddl-partitioning.html#DDL-PARTITIONING-CAVEATS
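
For example, a rough sketch of the difference (simplified names from the
posted query; only the first form can be pruned at planning time):

  -- literal values in WHERE: constraint exclusion can skip child tables
  EXPLAIN SELECT * FROM xxx_parent_tables WHERE id IN (5, 45);

  -- values coming from another table: every child table is considered,
  -- so one workaround is to fetch the ids first and re-issue the query
  -- with them as literals
  EXPLAIN SELECT * FROM xxx_parent_tables
   WHERE id IN (SELECT id FROM xxx WHERE d_id = '5');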

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

Profile: http://www.linkedin.com/in/grayhemp
Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979
Skype: gray-hemp
Jabber: gray...@gmail.com




[PERFORM] Query performance

2013-06-13 Thread K P Manoj
Hi All

One of my queries is having a performance issue on my production server.
When I run the query on my parent table with a specific condition (hard-coded
value), the explain plan shows that it uses only the proper child table and
its index, but when I use a table condition (instead of a hard-coded value),
the query planner goes through all the child tables. Can I know where I am
wrong?

Postgresql version 9.2.2

Please find details below
==

XXX_db=> select id from xxx where d_id = '5';
 id
---
 5
45
(2 rows)


XXX_db=> explain  analyze  SELECT *  FROM xxx_parent_table WHERE id in
(5,45) and ( status = 1 or status is null ) order by creation_time limit 40 ;
  QUERY PLAN


--
-

 Limit  (cost=12.21..12.21 rows=3 width=251) (actual time=6.585..6.585
rows=0 loops=1)
   ->  Sort  (cost=12.21..12.21 rows=3 width=251) (actual time=6.582..6.582
rows=0 loops=1)
 Sort Key: public.xxx_parent_tables.creation_time
 Sort Method: quicksort  Memory: 25kB
 ->  Result  (cost=0.00..12.18 rows=3 width=251) (actual
time=6.571..6.571 rows=0 loops=1)
   ->  Append  (cost=0.00..12.18 rows=3 width=251) (actual
time=6.569..6.569 rows=0 loops=1)
 ->  Seq Scan on xxx_parent_tables  (cost=0.00..0.00
rows=1 width=324) (actual time=0.003..0.003 rows=0 loops=1)
   Filter: ((id = ANY ('{5,45}'::bigint[])) AND
((status = 1) OR (status IS NULL)))
 ->  Bitmap Heap Scan on
xxx_parent_tables_table_details_ xxx_parent_tables  (cost=4.52..6.53 rows=1
width=105) (actual ti
me=0.063..0.063 rows=0 loops=1)
   Recheck Cond: ((status = 1) OR (status IS NULL))
   Filter: (id = ANY ('{5,45}'::bigint[]))
   ->  BitmapOr  (cost=4.52..4.52 rows=1 width=0)
(actual time=0.059..0.059 rows=0 loops=1)
 ->  Bitmap Index Scan on
xxx_parent_tables_table_details__status_idx  (cost=0.00..2.26 rows=1
width=0)
 (actual time=0.038..0.038 rows=0 loops=1)
   Index Cond: (status = 1)
 ->  Bitmap Index Scan on
xxx_parent_tables_table_details__status_idx  (cost=0.00..2.26 rows=1
width=0)
 (actual time=0.019..0.019 rows=0 loops=1)
   Index Cond: (status IS NULL)
 ->  Bitmap Heap Scan on
xxx_parent_tables_table_details_det xxx_parent_tables  (cost=2.52..5.65
rows=1 width=324) (actual ti
me=6.502..6.502 rows=0 loops=1)
   Recheck Cond: (id = ANY ('{5,45}'::bigint[]))
   Filter: ((status = 1) OR (status IS NULL))
   ->  Bitmap Index Scan on
xxx_parent_tables_table_details_id_idx  (cost=0.00..2.52 rows=2 width=0)
(actua
l time=6.499..6.499 rows=0 loops=1)
 Index Cond: (id = ANY ('{5,45}'::bigint[]))
 Total runtime: 6.823 ms
(22 rows)


XXX_db=> explain analyze SELECT *  FROM xxx_parent_tables WHERE cp_id
in (select id from xxx where d_id = '5') and ( status = 1 or status is null
) order by creation_time limit 40 ;

  QUERY PLAN


--

 Limit  (cost=3.66..6067.89 rows=40 width=105) (actual
time=70479.596..70479.596 rows=0 loops=1)
   ->  Nested Loop Semi Join  (cost=3.66..4587291.92 rows=30258 width=105)
(actual time=70479.593..70479.593 rows=0 loops=1)
 Join Filter: (public.xxx_parent_tables.cp_id = cp_info.cp_id)
 Rows Removed by Join Filter: 1416520
 ->  Merge Append  (cost=3.66..4565956.68 rows=711059 width=105)
(actual time=67225.964..69635.016 rows=708260 loops=1)
   Sort Key: public.xxx_parent_tables.creation_time
   ->  Sort  (cost=0.01..0.02 rows=1 width=324) (actual
time=0.018..0.018 rows=0 loops=1)
 Sort Key: public.xxx_parent_tables.creation_time
 Sort Method: quicksort  Memory: 25kB
 ->  Seq Scan on xxx_parent_tables  (cost=0.00..0.00
rows=1 width=324) (actual time=0.011..0.011 rows=0 loops=1)
   Filter: ((status = 1) OR (status IS NULL))
   ->  Index Scan using
xxx_parent_tables_automobiles_carwale_creation_time_idx on
xxx_parent_tables_automobiles_carwale xxx_parent_tables  (co
st=0.00..649960.44 rows=17 width=105) (actual time=10219.559..10219.559
rows=0 loops=1)
 Filter: ((status = 1) OR (status IS NULL))
 Rows Removed by Filter: 3102241
   ->  Index Scan using
xxx_parent_tables_automobiles_s

Re: [PERFORM] query performance, where goes time?

2012-09-05 Thread Craig Ringer

On 09/06/2012 07:48 AM, Anibal David Acosta wrote:

Using explain analyze I saw that many of my queries run really fast,
less than 1 millisecond. For example, the analyze output of a simple
query over a table with 5 million records returns "Total runtime:
0.078 ms".

But the real time is a lot more, about 15 ms; in fact pgAdmin shows
this value.

So, where do the other 14.2 ms go?


Client-side latency, time spent transmitting query results, and network 
latency.


You'll see much less difference in queries that take more meaningful 
amounts of time. This query is so fast that timing accuracy will be an 
issue on some systems, and so will scheduler jitter etc.
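
A quick way to see the split (a sketch; the table name is just a
placeholder): psql's \timing reports the client-observed elapsed time,
including network and result transfer, while EXPLAIN ANALYZE's "Total
runtime" is measured on the server only.

  \timing on
  SELECT count(*) FROM some_table;                   -- client-side elapsed time
  EXPLAIN ANALYZE SELECT count(*) FROM some_table;   -- server-side runtime only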


--
Craig Ringer





[PERFORM] query performance, where goes time?

2012-09-05 Thread Anibal David Acosta
Using explain analyze I saw that many of my queries run really fast, less
than 1 millisecond. For example, the analyze output of a simple query over a
table with 5 million records returns "Total runtime: 0.078 ms".

But the real time is a lot more, about 15 ms; in fact pgAdmin shows this
value.

So, where do the other 14.2 ms go?

Network transfer (TCP)?

Or does the analyze "Total runtime" not represent the query runtime?

Thanks!



Re: [PERFORM] Query performance - normal on 9.0.4, slow from 9.0.5 onwards

2012-01-02 Thread Miguel Silva

On 30-12-2011 22:29, Tom Lane wrote:

I poked at this a little bit.  AFAICS the only potentially relevant
planner change between 9.0.4 and 9.0.5 was the removal of eqjoinsel's
ndistinct-clamping heuristic,
http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=3505862a8d3e3b389ab926346061b7135fa44f79

Now that's something we took out because it seemed to be making more
cases worse than better, but there were cases where it helped (for the
wrong reasons, but nonetheless it sometimes adjusted the estimates to be
closer to reality), and apparently you've got one such case.  However,
removing that logic just brought the behavior back to what it was
pre-8.4, so I'm a bit dubious of the claim that this query has worked
well for "over 7 years".  Perhaps you had lots fewer tables and/or FKs
back in pre-8.4 days?

Well, thanks, that clarifies the reason why this happens!
Perhaps you are right. I mean, that's what I've been told, and I believe 
it really worked well for all that time. But since this is an 
auto-generated query, maybe it hasn't always been exactly like this. Or 
maybe there really were fewer tables/FKs, back then.


I experimented with a toy database having 1000 tables of 30 columns
each, with one foreign key per table, all in the "public" schema, and
indeed this query is pretty slow on current releases.  A big part of the
problem is that the planner is unaware that the one row you're selecting
from pg_namespace will join to almost all the rows in pg_class; so it
underestimates the sizes of those join results, and that leads to
picking a nestloop plan style where it's not appropriate.

I tried removing these WHERE conditions:


  AND pkn.nspname = 'public'
  AND fkn.nspname = 'public'

and got a decently fast plan.  If those are, as I suspect, also no-ops
in your real database, perhaps that will do as a workaround.

regards, tom lane

I tried running the query with that change, but it still takes around 25 
seconds. What I did as a workaround was to use this query instead of the 
auto-generated one:


SELECT
tc.constraint_name AS FK_NAME,
tc.table_name AS PKTABLE_NAME,
kcu.column_name AS PKCOLUMN_NAME,
ccu.table_name AS FKTABLE_NAME,
ccu.column_name AS FKCOLUMN_NAME,
CASE con.confupdtype  WHEN 'c' THEN 0 WHEN 'n' THEN 2 WHEN 'd' THEN 
4 WHEN 'r' THEN 1 WHEN 'a' THEN 3 ELSE NULL END AS UPDATE_RULE,
CASE con.confdeltype  WHEN 'c' THEN 0 WHEN 'n' THEN 2 WHEN 'd' THEN 
4 WHEN 'r' THEN 1 WHEN 'a' THEN 3 ELSE NULL END AS DELETE_RULE


FROM information_schema.table_constraints AS tc
JOIN information_schema.key_column_usage AS kcu ON 
tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage AS ccu ON 
ccu.constraint_name = tc.constraint_name
JOIN pg_catalog.pg_constraint AS con ON con.conname = 
tc.constraint_name


WHERE constraint_type = 'FOREIGN KEY';

Thanks for looking into this!

Best regards,

Miguel Silva



Re: [PERFORM] Query performance - normal on 9.0.4, slow from 9.0.5 onwards

2012-01-02 Thread Miguel Silva

On 30-12-2011 19:35, Merlin Moncure wrote:

try this (curious):
create table pos as select n from generate_series(1,32) n;

and swap that for the in-query generate series call.  your statistics
in the query are completely off (not 100% sure why), so I'm thinking
to replace that since it lies to the planner about the # rows
returned.  also the join on the array element probably isn't helping.

merlin

Tried it. The query still takes around the same amount of time but, out 
of curiosity, here's the explain analyze of it:

http://explain.depesz.com/s/MvE .



Re: [PERFORM] Query performance - normal on 9.0.4, slow from 9.0.5 onwards

2011-12-30 Thread Tom Lane
Miguel Silva  writes:
> I work for a software company that has it's main program installed on 
> over 200 clients. This program uses a small local database in 
> postgresql. Always installed with the one-click installer and 
> postgresql.conf left on default settings. This structure allows us to 
> always install the latest version of postgresql both in new clients and 
> older clients (when they are updated). And all was well for over 7 years.
> But with postgresql version 9.0.5 (in version 9.0.4 all was fine), we 
> noticed the program was taking longer to start.

I poked at this a little bit.  AFAICS the only potentially relevant
planner change between 9.0.4 and 9.0.5 was the removal of eqjoinsel's
ndistinct-clamping heuristic,
http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=3505862a8d3e3b389ab926346061b7135fa44f79

Now that's something we took out because it seemed to be making more
cases worse than better, but there were cases where it helped (for the
wrong reasons, but nonetheless it sometimes adjusted the estimates to be
closer to reality), and apparently you've got one such case.  However,
removing that logic just brought the behavior back to what it was
pre-8.4, so I'm a bit dubious of the claim that this query has worked
well for "over 7 years".  Perhaps you had lots fewer tables and/or FKs
back in pre-8.4 days?

I experimented with a toy database having 1000 tables of 30 columns
each, with one foreign key per table, all in the "public" schema, and
indeed this query is pretty slow on current releases.  A big part of the
problem is that the planner is unaware that the one row you're selecting
from pg_namespace will join to almost all the rows in pg_class; so it
underestimates the sizes of those join results, and that leads to
picking a nestloop plan style where it's not appropriate.

I tried removing these WHERE conditions:

>  AND pkn.nspname = 'public'
>  AND fkn.nspname = 'public'

and got a decently fast plan.  If those are, as I suspect, also no-ops
in your real database, perhaps that will do as a workaround.

regards, tom lane



Re: [PERFORM] Query performance - normal on 9.0.4, slow from 9.0.5 onwards

2011-12-30 Thread Merlin Moncure
On Fri, Dec 30, 2011 at 10:39 AM, Miguel Silva  wrote:
> Hi all!
>
> I ran into a performance problem a while ago and I've been trying to
> figure out a solution ever since. But since I've failed to come up with
> anything conclusive, it's time to ask for help from people with more
> understanding of how postgresql works.
>
> Here's the big picture.
> I work for a software company that has its main program installed on over
> 200 clients. This program uses a small local database in postgresql. Always
> installed with the one-click installer and postgresql.conf left on default
> settings. This structure allows us to always install the latest version of
> postgresql both in new clients and older clients (when they are updated).
> And all was well for over 7 years.
> But with postgresql version 9.0.5 (in version 9.0.4 all was fine), we
> noticed the program was taking longer to start. In fact, in some clients
> that had older hardware, it could take around 20 minutes when it usually
> takes only a few seconds. To make a long story short, the problem was traced
> and narrowed down to a single auto generated query. Here it is:
>
> "SELECT
>    NULL::text AS PKTABLE_CAT,
>    pkn.nspname AS PKTABLE_SCHEM,
>    pkc.relname AS PKTABLE_NAME,
>    pka.attname AS PKCOLUMN_NAME,
>    NULL::text AS FKTABLE_CAT,
>    fkn.nspname AS FKTABLE_SCHEM,
>    fkc.relname AS FKTABLE_NAME,
>    fka.attname AS FKCOLUMN_NAME,
>    pos.n AS KEY_SEQ,
>    CASE con.confupdtype  WHEN 'c' THEN 0 WHEN 'n' THEN 2 WHEN 'd' THEN 4
> WHEN 'r' THEN 1 WHEN 'a' THEN 3 ELSE NULL END AS UPDATE_RULE,
>    CASE con.confdeltype  WHEN 'c' THEN 0 WHEN 'n' THEN 2 WHEN 'd' THEN 4
> WHEN 'r' THEN 1 WHEN 'a' THEN 3 ELSE NULL END AS DELETE_RULE,
>    con.conname AS FK_NAME,
>    pkic.relname AS PK_NAME,
>    CASE  WHEN con.condeferrable AND con.condeferred THEN 5 WHEN
> con.condeferrable THEN 6 ELSE 7 END AS DEFERRABILITY
> FROM
>    pg_catalog.pg_namespace pkn,
>    pg_catalog.pg_class pkc,
>    pg_catalog.pg_attribute pka,
>    pg_catalog.pg_namespace fkn,
>    pg_catalog.pg_class fkc,
>    pg_catalog.pg_attribute fka,
>    pg_catalog.pg_constraint con,
>    pg_catalog.generate_series(1, 32) pos(n),
>    pg_catalog.pg_depend dep,
>    pg_catalog.pg_class pkic
> WHERE pkn.oid = pkc.relnamespace
>    AND pkc.oid = pka.attrelid
>    AND pka.attnum = con.confkey[pos.n]
>    AND con.confrelid = pkc.oid
>    AND fkn.oid = fkc.relnamespace
>    AND fkc.oid = fka.attrelid
>    AND fka.attnum = con.conkey[pos.n]
>    AND con.conrelid = fkc.oid
>    AND con.contype = 'f'
>    AND con.oid = dep.objid
>    AND pkic.oid = dep.refobjid
>    AND pkic.relkind = 'i'
>    AND dep.classid = 'pg_constraint'::regclass::oid
>    AND dep.refclassid = 'pg_class'::regclass::oid
>    AND pkn.nspname = 'public'
>    AND fkn.nspname = 'public'
> ORDER BY
>    pkn.nspname,
>    pkc.relname,
>    pos.n;"
>
>
> From this point on, in all the tests I did, I directly typed this query at
> the psql command line. I tried everything: vacuuming and analyzing (although
> this is already automatic on postgresql 9.0), updating postgresql to version
> 9.1, tuning the database as explained in the postgresql.org documentation (with
> various values for every parameter, different possible combinations). Nothing
> worked, EXCEPT switching the "enable_material" parameter to OFF. That
> reduces the query time from around 25 seconds on my system (Intel Core2 Duo
> 2.93GHz 32bit running Windows 7 Enterprise Service Pack 1) to around 5
> seconds. Here are the explain analyzes.
>
> enable_material ON: http://explain.depesz.com/s/wen
> enable_material OFF: http://explain.depesz.com/s/Zaa
>
> Then, to narrow it down a bit further, I tried running the query on another
> database. It ran much faster.
> So I made a script that creates tables and foreign keys on a database, to
> find out at which number of tables/foreign keys the query started to slow
> down. I managed to get identically slow performance when I had 1000 tables
> and 5000 foreign keys. Which didn't help at all, since the database in which
> the problem occurs has only 292 tables and 521 foreign keys.
>
> Of course, it is possible to change the code and use a (different) manual
> query that does the same and runs perfectly fine; I've already done that.
> But why does this happen from 9.0.5 on? Does anyone have any idea? Is this
> situation already known?
> I hope someone can enlighten me on this subject...

try this (curious):
create table pos as select n from generate_series(1,32) n;

and swap that for the in-query generate series call.  your statistics
in the query are completely off (not 100% sure why), so I'm thinking
to replace that since it lies to the planner about the # rows
returned.  also the join on the array element probably isn't helping.

merlin



Re: [PERFORM] Query performance - normal on 9.0.4, slow from 9.0.5 onwards

2011-12-30 Thread Miguel Silva

On 30-12-2011 17:40, Tom Lane wrote:

Miguel Silva  writes:

But with postgresql version 9.0.5 (in version 9.0.4 all was fine), we
noticed the program was taking longer to start. In fact, in some clients
that had older hardware, it could take around 20 minutes when it usually
takes only a few seconds. To make a long story short, the problem was
traced and narrowed down to a single auto generated query. Here it is:
"SELECT [ snip ]"
... Here are the explain analyzes.
enable_material ON: http://explain.depesz.com/s/wen
enable_material OFF: http://explain.depesz.com/s/Zaa

It doesn't really accomplish anything to post anonymized explain output
when you've already shown us the actual query, does it?  Especially when
said query involves only the system catalogs and by no stretch of the
imagination could be thought to contain anything proprietary?

regards, tom lane

Indeed you are right. Those are explains I created some time ago, when I 
didn't really know what that webpage did. I just kept them, and used 
them now, didn't even think about that. But the explains are still 
there, still useful. Anyway, if it is really necessary, I can post new ones.


Best regards,

Miguel Silva



Re: [PERFORM] Query performance - normal on 9.0.4, slow from 9.0.5 onwards

2011-12-30 Thread Tom Lane
Miguel Silva  writes:
> But with postgresql version 9.0.5 (in version 9.0.4 all was fine), we 
> noticed the program was taking longer to start. In fact, in some clients 
> that had older hardware, it could take around 20 minutes when it usually 
> takes only a few seconds. To make a long story short, the problem was 
> traced and narrowed down to a single auto generated query. Here it is:

> "SELECT [ snip ]"

> ... Here are the explain analyzes.

> enable_material ON: http://explain.depesz.com/s/wen
> enable_material OFF: http://explain.depesz.com/s/Zaa

It doesn't really accomplish anything to post anonymized explain output
when you've already shown us the actual query, does it?  Especially when
said query involves only the system catalogs and by no stretch of the
imagination could be thought to contain anything proprietary?

regards, tom lane



[PERFORM] Query performance - normal on 9.0.4, slow from 9.0.5 onwards

2011-12-30 Thread Miguel Silva

Hi all!

I ran into a performance problem a while ago and I've been trying 
to figure out a solution ever since. But since I've failed to come up 
with anything conclusive, it's time to ask for help from people with 
more understanding of how postgresql works.


Here's the big picture.
I work for a software company that has its main program installed on 
over 200 clients. This program uses a small local database in 
postgresql. Always installed with the one-click installer and 
postgresql.conf left on default settings. This structure allows us to 
always install the latest version of postgresql both in new clients and 
older clients (when they are updated). And all was well for over 7 years.
But with postgresql version 9.0.5 (in version 9.0.4 all was fine), we 
noticed the program was taking longer to start. In fact, in some clients 
that had older hardware, it could take around 20 minutes when it usually 
takes only a few seconds. To make a long story short, the problem was 
traced and narrowed down to a single auto generated query. Here it is:


"SELECT
NULL::text AS PKTABLE_CAT,
pkn.nspname AS PKTABLE_SCHEM,
pkc.relname AS PKTABLE_NAME,
pka.attname AS PKCOLUMN_NAME,
NULL::text AS FKTABLE_CAT,
fkn.nspname AS FKTABLE_SCHEM,
fkc.relname AS FKTABLE_NAME,
fka.attname AS FKCOLUMN_NAME,
pos.n AS KEY_SEQ,
CASE con.confupdtype  WHEN 'c' THEN 0 WHEN 'n' THEN 2 WHEN 'd' THEN 
4 WHEN 'r' THEN 1 WHEN 'a' THEN 3 ELSE NULL END AS UPDATE_RULE,
CASE con.confdeltype  WHEN 'c' THEN 0 WHEN 'n' THEN 2 WHEN 'd' THEN 
4 WHEN 'r' THEN 1 WHEN 'a' THEN 3 ELSE NULL END AS DELETE_RULE,

con.conname AS FK_NAME,
pkic.relname AS PK_NAME,
CASE  WHEN con.condeferrable AND con.condeferred THEN 5 WHEN 
con.condeferrable THEN 6 ELSE 7 END AS DEFERRABILITY

FROM
pg_catalog.pg_namespace pkn,
pg_catalog.pg_class pkc,
pg_catalog.pg_attribute pka,
pg_catalog.pg_namespace fkn,
pg_catalog.pg_class fkc,
pg_catalog.pg_attribute fka,
pg_catalog.pg_constraint con,
pg_catalog.generate_series(1, 32) pos(n),
pg_catalog.pg_depend dep,
pg_catalog.pg_class pkic
WHERE pkn.oid = pkc.relnamespace
AND pkc.oid = pka.attrelid
AND pka.attnum = con.confkey[pos.n]
AND con.confrelid = pkc.oid
AND fkn.oid = fkc.relnamespace
AND fkc.oid = fka.attrelid
AND fka.attnum = con.conkey[pos.n]
AND con.conrelid = fkc.oid
AND con.contype = 'f'
AND con.oid = dep.objid
AND pkic.oid = dep.refobjid
AND pkic.relkind = 'i'
AND dep.classid = 'pg_constraint'::regclass::oid
AND dep.refclassid = 'pg_class'::regclass::oid
AND pkn.nspname = 'public'
AND fkn.nspname = 'public'
ORDER BY
pkn.nspname,
pkc.relname,
pos.n;"


From this point on, in all the tests I did, I directly typed this query 
at the psql command line. I tried everything: vacuuming and analyzing 
(although this is already automatic on postgresql 9.0), updating 
postgresql to version 9.1, tuning the database as explained in the 
postgresql.org documentation (with various values for every parameter, 
different possible combinations). Nothing worked, EXCEPT switching the 
"enable_material" parameter to OFF. That reduces the query time from 
around 25 seconds on my system (Intel Core2 Duo 2.93GHz 32bit running 
Windows 7 Enterprise Service Pack 1) to around 5 seconds. Here are the 
explain analyzes.


enable_material ON: http://explain.depesz.com/s/wen
enable_material OFF: http://explain.depesz.com/s/Zaa

Then, to narrow it down a bit further, I tried running the query on 
another database. It ran much faster.
So I made a script that creates tables and foreign keys on a database, 
to find out at which number of tables/foreign keys the query started to 
slow down. I managed to get identically slow performance when I had 1000 
tables and 5000 foreign keys. Which didn't help at all, since the 
database in which the problem occurs has only 292 tables and 521 foreign 
keys.
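
For reference, a script along these lines is enough to build such a toy
schema (just a sketch of the idea, not the actual script I used; the table
names are made up, and the number of tables and foreign keys can be varied):

  DO $$
  BEGIN
    FOR i IN 1..1000 LOOP
      EXECUTE 'CREATE TABLE t' || i || ' (id int PRIMARY KEY, parent_id int)';
      IF i > 1 THEN
        -- one foreign key per table, pointing at the previous table
        EXECUTE 'ALTER TABLE t' || i ||
                ' ADD FOREIGN KEY (parent_id) REFERENCES t' || (i - 1) || ' (id)';
      END IF;
    END LOOP;
  END $$;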


Of course, it is possible to change the code and use a (different) 
manual query that does the same and runs perfectly fine; I've already 
done that. But why does this happen from 9.0.5 on? Does anyone have any 
idea? Is this situation already known?

I hope someone can enlighten me on this subject...

Thanks in advance! Best regards,

Miguel Silva



Re: [PERFORM] Query performance issue

2011-09-05 Thread Jayadevan
Based on my initial hunch that something resulting from all the ALTERs was
making the PostgreSQL planner end up with bad plans, I tried a pg_dump and
pg_restore. Now the 'bad' query comes back in 70 seconds (compared to 20
minutes earlier) and the rewritten query still comes back in 2 seconds. So
we will stick with the rewritten query.
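
A dump and reload of this sort can be done roughly like this (a sketch only;
the database names are placeholders, not the actual commands used):

  pg_dump -Fc -f mydb.dump mydb
  createdb mydb_restored
  pg_restore -d mydb_restored mydb.dump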



Re: [PERFORM] Query performance issue

2011-09-04 Thread Jayadevan M
Hello,
> 
> If that does not help, you'll probably have to change the query. The
> problem is that the explain analyze you've provided
> (http://explain.depesz.com/s/MY1) does not match the query from
> yesterday's post, so we can't really help with it.
Thanks for the pointers. I think I posted the same plan; maybe the 
variable values changed. Anyway, I changed the query and now it comes back 
in 2 seconds. Here is the plan:
http://explain.depesz.com/s/n9S
An interesting observation - PostgreSQL takes anywhere from 2 seconds to 20 
minutes to fetch the same data set of 2212 records, with slightly modified 
queries. Oracle is consistent (taking under 1 minute in both cases), though 
not consistently faster. The modified query is 
SELECT PFLMST.MEMSHPNUM,
   PFLMST.MEMSHPTYP,
   ACCMST.PRGCOD,
   CNTINF.EMLADR,
   CNTINF.CELISDCOD,
   CNTINF.CELARACOD,
   CNTINF.CELNUM,
   CNTINF.ADRLINONE ,
   CNTINF.ZIPCOD,
   CNTINF.ADRTYP,
   (select ONE.FLDDES from COMONETIM ONE
 WHERE ONE.CMPCOD =ACCMST.CMPCOD
 AND ONE.FLDCOD='program.member.accountStatus'
 AND ONE.FLDVAL=ACCMST.ACCSTA)ACCSTA,
   (SELECT ONE1.FLDDES FROM COMONETIM ONE1
 WHERE ONE1.CMPCOD =ACCMST.CMPCOD
 AND ONE1.FLDCOD='common.member.membershipStatus'
 AND ONE1.FLDVAL=PFLMST.MEMSHPSTA )MEMSHPSTA,
   INDINF.CMPNAM EMPNAM,
   INDINF.PRFADR,
   INDINF.GVNNAM GVNNAM,
   INDINF.FAMNAM FAMNAM,
   INDINF.MEMDOB MEMDOB
 FROM PRGMEMACCMST ACCMST
 JOIN EAIMEMPFLMST PFLMST
 ON ACCMST.CMPCOD = PFLMST.CMPCOD
 AND ACCMST.MEMSHPNUM = PFLMST.MEMSHPNUM
 JOIN EAICUSPFLCNTINF CNTINF
 ON CNTINF.CMPCOD  = PFLMST.CMPCOD
 AND CNTINF.CUSNUM = PFLMST.CUSNUM
 LEFT JOIN EAICUSPFLINDINF INDINF
 ON INDINF.CMPCOD   = PFLMST.CMPCOD
 AND INDINF.CUSNUM  = PFLMST.CUSNUM
 WHERE ACCMST.CMPCOD= 'SA'
 AND UPPER(INDINF.FAMNAM) LIKE 'PRICE'
   || '%'
 ORDER BY UPPER(INDINF.GVNNAM),
   UPPER(INDINF.FAMNAM),
 UPPER(INDINF.CMPNAM) 

Regards,
Jayadevan











Re: [PERFORM] Query performance issue

2011-09-04 Thread Tomas Vondra
On 4 Září 2011, 20:06, Jayadevan wrote:
> I don't think I understood all that. Anyway, is there a way to fix this -
> either by rewriting the query or by creating an index? The output does
> match
> what I am expecting.  It does take more than 10 times the time taken by
> Oracle for the same result, with PostgreSQL taking more than 20 minutes. I
> am sort of stuck on this since this query does get executed often. By the
> way, changing the filter from FAMNAM to GIVENNAME fetches results in 90
> seconds. Probably there is a difference in the cardinality of values in
> these 2 columns.

Tom Lane explained why the sort produces more rows (2673340321) than it gets
on the input (1121), or why it seems like that - it's a bit complicated
because of the merge join.

I'd try increasing the statistics target - it's probably 100; change it to
1000, run ANALYZE and try the query again (it may improve the plan without
the need to mess with the query).
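
A minimal sketch of that (session-level; 1000 is just the value suggested
above, and the table/column names in the per-column variant are placeholders):

  SET default_statistics_target = 1000;  -- default is 100
  ANALYZE;                               -- re-gather statistics with the new target
  -- or per column:
  -- ALTER TABLE some_table ALTER COLUMN some_column SET STATISTICS 1000;
  -- ANALYZE some_table;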

If that does not help, you'll probably have to change the query. The
problem is that the explain analyze you've provided
(http://explain.depesz.com/s/MY1) does not match the query from
yesterday's post, so we can't really help with it. I do have some ideas of
how to change the query, but it's really wild guessing without the query
plan.

Tomas




Re: [PERFORM] Query performance issue

2011-09-04 Thread Jayadevan
I don't think I understood all that. Anyway, is there a way to fix this -
either by rewriting the query or by creating an index? The output does match
what I am expecting.  It does take more than 10 times the time taken by
Oracle for the same result, with PostgreSQL taking more than 20 minutes. I
am sort of stuck on this since this query does get executed often. By the
way, changing the filter from FAMNAM to GIVENNAME fetches results in 90
seconds. Probably there is a difference in the cardinality of values in
these 2 columns.



Re: [PERFORM] Query performance issue

2011-09-04 Thread Tom Lane
"Kevin Grittner"  writes:
> Thanks for posting the query and related schema.  I tried working
> through it, but I keep coming back to this sort, and wondering how a
> sort can have 1121 rows as input and 2673340321 rows as output.  Does
> anyone have any ideas on what could cause that?

Mergejoin rescan.  There really are only 1121 rows in the data, but
the parent merge join is pulling them over and over again --- evidently
there are a lot of equal keys in the data.  The EXPLAIN ANALYZE
machinery counts each fetch as a new row, even after a mark/restore.

The planner does know about that effect and will penalize merge joins
when it realizes there are a lot of duplicate keys in the input.  In
this case I'm thinking that the drastic underestimate of the size of the
other side of the join results in not penalizing the merge enough.

(On the other hand, hash joins don't like equal keys that much either...)

regards, tom lane



Re: [PERFORM] Query performance issue

2011-09-04 Thread Kevin Grittner
Jayadevan M  wrote:
 
> Here is the explain analyze
> http://explain.depesz.com/s/MY1
 
> PostgreSQL 9.0.4 on x86_64-pc-solaris2.10
 
> work_mem = 96MB
 
Thanks for posting the query and related schema.  I tried working
through it, but I keep coming back to this sort, and wondering how a
sort can have 1121 rows as input and 2673340321 rows as output.  Does
anyone have any ideas on what could cause that?
 
  ->  Sort  (cost=1895.95..1896.49 rows=215 width=61)
(actual time=25.926..711784.723
 rows=2673340321 loops=1)
Sort Key: memmst.memshpsta
Sort Method:  quicksort  Memory: 206kB
->  Nested Loop  (cost=0.01..1887.62 rows=215 width=61)
 (actual time=0.088..23.445
  rows=1121 loops=1)
 
-Kevin



Re: [PERFORM] Query performance issue

2011-09-04 Thread Grzegorz Jaśkiewicz
Order by ... upper(xyz) - do you have a functional index on these?
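
For example (a sketch; assuming EAICUSPFLINDINF from the posted query is a
plain table and not a view, and using text_pattern_ops so the LIKE 'PRICE%'
prefix filter can use the index):

  CREATE INDEX eaicuspflindinf_upper_famnam_idx
      ON eaicuspflindinf (upper(famnam) text_pattern_ops);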



Re: [PERFORM] Query performance issue

2011-09-02 Thread Jayadevan
Here goes. I think it might be difficult to go through all these
definitions...
PRGMEMACCMST 

  Table "public.prgmemaccmst" 
Column|Type | Modifiers 
--+-+--- 
 cmpcod   | character varying(5)| not null 
 prgcod   | character varying(5)| not null 
 memshpnum| character varying(30)   | not null 
 accsta   | character varying(1)| not null 
 accstachgdat | timestamp without time zone | not null 
 expdat   | timestamp without time zone | 
 tircod   | character varying(5)| 
 tirexpdat| timestamp without time zone | 
 crdexpdat| timestamp without time zone | 
 tiraltdat| timestamp without time zone | 
 crdlmtalwflg | boolean | 
 lstactdat| timestamp without time zone | 
 enrsrc   | character varying(1)| not null 
 enrsrccod| character varying(15)   | 
 enrdat   | timestamp without time zone | not null 
 acrpntflg| boolean | 
 usrcod   | character varying(25)   | 
 upddat   | timestamp without time zone | 
 erlrgn   | character varying(20)   | 
 susflg   | character varying(1)| 
 fstactdat| timestamp without time zone | 
 fstacractnum | character varying(12)   | 
 acccrtdat| timestamp without time zone | not null 
 lsttirprcdat | timestamp without time zone | 
 enrtircod| character varying(5)| 
Indexes: 
"prgmemaccmst_pkey" PRIMARY KEY, btree (cmpcod, prgcod, memshpnum) 
"prgmemaccmst_accsta_idx" btree (accsta) 
"prgmemaccmst_enrdat_idx" btree (enrdat) 
"prgmemaccmst_tircod_idx" btree (tircod) 
"prgmemaccmst_tirexpdat_ind" btree (tirexpdat) 



EAIMEMPFLMST 
  View "public.eaimempflmst" 
  Column   |Type | Modifiers | Storage  |
Description 
---+-+---+--+- 
 cmpcod| character varying(5)|   | extended | 
 memshpnum | character varying(30)   |   | extended | 
 memshptyp | character varying(1)|   | extended | 
 memshpsta | character varying(1)|   | extended | 
 pin   | character varying(50)   |   | extended | 
 sctqst| character varying(200)  |   | extended | 
 sctans| character varying(200)  |   | extended | 
 rtoclmcnt | smallint|   | plain| 
 usrcod| character varying(25)   |   | extended | 
 upddat| timestamp without time zone |   | plain| 
 cusnum| character varying(11)   |   | extended | 
View definition: 
 SELECT memmst.cmpcod, memmst.memshpnum, memmst.memshptyp, memmst.memshpsta,
memmst.pin, memmst.sctqst, memmst.sctans, memmst.rtoclmcnt, memmst.usrcod,
memmst.upddat, memmst.cusnum 
   FROM memmst; 

memmst 
Table "public.memmst" 
  Column   |Type | Modifiers 
---+-+--- 
 cmpcod| character varying(5)| not null 
 memshpnum | character varying(30)   | not null 
 memshptyp | character varying(1)| not null 
 memshpsta | character varying(1)| not null 
 pin   | character varying(50)   | not null 
 sctqst| character varying(200)  | 
 sctans| character varying(200)  | 
 rtoclmcnt | smallint| 
 usrcod| character varying(25)   | 
 upddat| timestamp without time zone | 
 cusnum| character varying(11)   | 
 weblgn| boolean | 
 rsncod| character varying(1)| 
 lgntrycnt | smallint| 
 lgntrytim | timestamp without time zone | 
 rempinchg | boolean | 
Indexes: 
"memmst_pkey" PRIMARY KEY, btree (cmpcod, memshpnum) 
"memmst_idx" UNIQUE, btree (cusnum, memshpnum, cmpcod) 
"memmst_upddat_idx" btree (upddat) 


   View "public.eaicuspflcntinf" 
  Column   |Type | Modifiers | Storage  |
Description 
---+-+---+--+- 
 cmpcod| character varying(5)|   | extended | 
 cusnum| character varying(11)   |   | extended | 
 adrtyp| character varying(1)|   | extended | 
 adrlinone | character varying(150)  |   | extended | 
 adrlintwo | character varying(150)  |   | extended | 
 cty   | character varying(100)  |   | extended | 
 stt   | character varying(100)  |   | extended | 
 ctr   | character varying(5)|   | extended | 
 zipcod| character varying(30)   |   | extended | 
 emladr| character varying(100)  |   | extended | 
 phnnum| character varying(50)   |   | extended | 

Re: [PERFORM] Query performance issue

2011-08-31 Thread Kevin Grittner
Jayadevan M  wrote:
 
>> And the schema of the tables involved, and any indexes on them.
 
> The details of the tables and indexes may take a bit of effort to
> explain.  Will do that.
 
In psql you can do \d to get a decent summary.
 
Without seeing the query and the table definitions, it's hard to give
advice; especially when a sort step increases the number of rows.
I'm guessing there is incorrect usage of some set-returning function.
 
-Kevin



Re: [PERFORM] Query performance issue

2011-08-31 Thread Jayadevan M
> 
> A really interesting part is the sort near the bottom -
> 
> ->  Sort  (cost=1895.95..1896.49 rows=215 width=61) (actual
> time=25.926..711784.723 rows=2673340321 loops=1)
> Sort Key: memmst.memshpsta
> Sort Method:  quicksort  Memory: 206kB
> ->  Nested Loop  (cost=0.01..1887.62 rows=215 width=61) (actual
> time=0.088..23.445 rows=1121 loops=1)
> 
> How can a sort get 1121 rows at the input and return 2673340321 rows at
> the output? Not sure where this comes from.
> 
> BTW what PostgreSQL version is this?
PostgreSQL 9.0.4 on x86_64-pc-solaris2.10

Regards,
Jayadevan











Re: [PERFORM] Query performance issue

2011-08-31 Thread Tomas Vondra
On 31 Srpen 2011, 13:19, Jayadevan M wrote:
> Hello,
>
>> >
>> > Please run EXPLAIN ANALYZE on the query and post that, it's hard to
> say
>> > what's wrong from just the query plan, without knowing where the time
> is
>> > actually spent.
>> Here is the explain analyze
>> http://explain.depesz.com/s/MY1
> Going through the url tells me that statistics may be off. I will try
> analyzing the tables. That should help?
> Regards,
> Jayadevan

That could help, but not necessarily.

A really interesting part is the sort near the bottom -

->  Sort  (cost=1895.95..1896.49 rows=215 width=61) (actual
time=25.926..711784.723 rows=2673340321 loops=1)
Sort Key: memmst.memshpsta
Sort Method:  quicksort  Memory: 206kB
->  Nested Loop  (cost=0.01..1887.62 rows=215 width=61) (actual
time=0.088..23.445 rows=1121 loops=1)

How can a sort get 1121 rows at the input and return 2673340321 rows at the
output? Not sure where this comes from.

BTW what PostgreSQL version is this?

Tomas




Re: [PERFORM] Query performance issue

2011-08-31 Thread Venkat Balaji
Missed looping in the community earlier...

On Wed, Aug 31, 2011 at 5:01 PM, Venkat Balaji wrote:

> Could you help us know the tables and columns on which Indexes are built ?
>
> The query is performing sorting based on the key upper(column), and that is
> where I believe the cost is high.
>
> Is the 'upper' function used in the WHERE clause?
>
> Thanks
> Venkat
>
>
> On Wed, Aug 31, 2011 at 4:49 PM, Jayadevan M  wrote:
>
>> Hello,
>>
>> > >
>> > > Please run EXPLAIN ANALYZE on the query and post that, it's hard to
>> say
>> > > what's wrong from just the query plan, without knowing where the time
>> is
>> > > actually spent.
>> > Here is the explain analyze
>> > http://explain.depesz.com/s/MY1
>>
>> Going through the url tells me that statistics may be off. I will try
>> analyzing the tables. That should help?
>> Regards,
>> Jayadevan
>>
>>
>>
>>
>>
>


Re: [PERFORM] Query performance issue

2011-08-31 Thread Jayadevan M
Hello,

> > 
> > Please run EXPLAIN ANALYZE on the query and post that, it's hard to 
say 
> > what's wrong from just the query plan, without knowing where the time 
is 
> > actually spent. 
> Here is the explain analyze 
> http://explain.depesz.com/s/MY1 
Going through the url tells me that statistics may be off. I will try 
analyzing the tables. That should help?
Regards,
Jayadevan






Re: [PERFORM] Query performance issue

2011-08-31 Thread Jayadevan M
Hello,
> 
> Please run EXPLAIN ANALYZE on the query and post that, it's hard to say 
> what's wrong from just the query plan, without knowing where the time is 

> actually spent. 
Here is the explain analyze
http://explain.depesz.com/s/MY1
Regards,
Jayadevan






Re: [PERFORM] Query performance issue

2011-08-31 Thread Jayadevan M
Hello,

> Please run EXPLAIN ANALYZE on the query and post that, it's hard to say 
> what's wrong from just the query plan, without knowing where the time is 

> actually spent. And the schema of the tables involved, and any indexes 
> on them. (see also http://wiki.postgresql.org/wiki/SlowQueryQuestions)
The details of the tables and indexes may take a bit of effort to explain. 
Will do that.
I remembered that a similar query took about 90 seconds to run a few days 
ago. Now that one is also taking a few minutes to run. In between, we made 
some changes to a few tables (the tables are about 9-10 GB each). This was 
to fix an issue in the conversion from CHARACTER VARYING to BOOLEAN on 
PostgreSQL (some columns in Oracle were of type VARCHAR, used to store 
BOOLEAN values; we changed them to BOOLEAN in PostgreSQL to resolve some 
issues at the JDBC level). The alters were all of a similar type - 

ALTER TABLE cusdynatr ALTER tstflg TYPE boolean USING CASE WHEN tstflg = 
'1' THEN true WHEN tstflg = '0' then FALSE END;

Do such alters result in fragmentation at storage level?
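
One way to check would be to compare the on-disk size with the dead-tuple
counters kept by the statistics collector, for example:

-- total size of the table plus its indexes and TOAST data
SELECT pg_size_pretty(pg_total_relation_size('cusdynatr'));

-- live vs. dead tuple counts for the same table
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'cusdynatr';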

Regards,
Jayadevan






Re: [PERFORM] Query performance issue

2011-08-31 Thread Sushant Sinha
Where is the query? Also, please paste the \d output to show the tables and
indexes.

-Sushant.

On Wed, 2011-08-31 at 14:30 +0530, Jayadevan M wrote:
> Hello all, 
> I have a query which takes about 20 minutes to execute and retrieves
> 2000-odd records. The explain for the query is pasted here 
> http://explain.depesz.com/s/52f 
> The same query, with similar data structures/indexes and data comes
> back in 50 seconds in Oracle. We just ported the product to PostgreSQL
> and are testing it. Any input on what to look for? 
> 
> Possible relevant parameters are 
> shared_buffers = 4GB 
> temp_buffers = 8MB 
> work_mem = 96MB   
> maintenance_work_mem = 1GB 
> effective_cache_size = 8GB 
> default_statistics_target = 50 
> 
> It is a machine with 16 GB RAM. 
> Regards, 
> Jayadevan
> 
> 
> 
> 
> 



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query performance issue

2011-08-31 Thread Heikki Linnakangas

On 31.08.2011 12:00, Jayadevan M wrote:

Hello all,
I have a query which takes about 20 minutes to execute and retrieves
2000-odd records. The explain for the query is pasted here
http://explain.depesz.com/s/52f
The same query, with similar data structures/indexes and data comes back
in 50 seconds in Oracle. We just ported the product to PostgreSQL and are
testing it. Any input on what to look for?

Possible relevant parameters are
shared_buffers = 4GB
temp_buffers = 8MB
work_mem = 96MB
maintenance_work_mem = 1GB
effective_cache_size = 8GB
default_statistics_target = 50

It is a machine with 16 GB RAM.


Please run EXPLAIN ANALYZE on the query and post that, it's hard to say 
what's wrong from just the query plan, without knowing where the time is 
actually spent. And the schema of the tables involved, and any indexes 
on them. (see also http://wiki.postgresql.org/wiki/SlowQueryQuestions)


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Query performance issue

2011-08-31 Thread Jayadevan M
Hello all,
I have a query which takes about 20 minutes to execute and retrieves 
2000-odd records. The explain for the query is pasted here
http://explain.depesz.com/s/52f
The same query, with similar data structures/indexes and data comes back 
in 50 seconds in Oracle. We just ported the product to PostgreSQL and are 
testing it. Any input on what to look for?

Possible relevant parameters are 
shared_buffers = 4GB 
temp_buffers = 8MB 
work_mem = 96MB 
maintenance_work_mem = 1GB 
effective_cache_size = 8GB 
default_statistics_target = 50 

It is a machine with 16 GB RAM.
Regards,
Jayadevan






Re: [PERFORM] Query Performance with Indexes on Integer type vs. Date type.

2011-04-27 Thread Dhimant Patel
Thanks for all the valuable insights. I decided to drop the idea of adding an
additional column and will just rely on the Date column for all ordering.

Tom - thanks for the clear answer on the issue I was concerned about.
Maciek, Kevin - thanks for the ideas and the hint on generate_series(); I will
have to go through the Postgres documentation a couple of times before I have
a better grasp of all the available tools, but this forum is very valuable.


-DP.


On Wed, Apr 27, 2011 at 12:46 PM, Phoenix Kiula wrote:

> On Thu, Apr 28, 2011 at 12:17 AM, Kevin Grittner
>  wrote:
> >
> > Dhimant Patel  wrote:
> >
> > > I am a newcomer to the Postgres world and now using it for some
> > > serious (at least for me)  projects. I have a need where I am
> > > running some analytical + aggregate functions on data where
> > > ordering is done on Date type column.
> > >
> > > From my initial read on documentation I believe internally a date
> > > type is represented by integer type of data. This makes me wonder
> > > would it make any good to create additional column of Integer type
> > > and update it as data gets added and use this integer column for
> > > all ordering purposes for my sqls - or should I not hesitate using
> > > Date type straight into my sql for ordering?
> >
> > I doubt that this will improve performance, particularly if you ever
> > want to see your dates formatted as dates.
> >
> > > Better yet, is there anyway I can verify impact of ordering on
> > > Date type vs. Integer type, apart from using \timing and explain
> > > plan?
> >
> > You might be better off just writing the code in the most natural
> > way, using the date type for dates, and then asking about any
> > queries which aren't performing as you hope they would.  Premature
> > optimization is often counter-productive.  If you really want to do
> > some benchmarking of relative comparison speeds, though, see the
> > generate_series function -- it can be good at generating test tables
> > for such things.
>
>
>
>
> There is a lot of really good advice here already. I'll just add one
> thought.
>
> If the dates in your tables are static based only on creation (as in
> only a CREATE_DATE, which will never be modified per row like a
> MODIFY_DATE for each record), then your thought might have made sense.
> But in that case you can already use the ID field if you have one?
>
> In most real world cases however the DATE field will likely be storing
> an update time as well. Which would make your thought about numbering
> with integers pointless.
>


Re: [PERFORM] Query Performance with Indexes on Integer type vs. Date type.

2011-04-27 Thread Phoenix Kiula
On Thu, Apr 28, 2011 at 12:17 AM, Kevin Grittner
 wrote:
>
> Dhimant Patel  wrote:
>
> > I am a newcomer to the Postgres world and now using it for some
> > serious (at least for me)  projects. I have a need where I am
> > running some analytical + aggregate functions on data where
> > ordering is done on Date type column.
> >
> > From my initial read on documentation I believe internally a date
> > type is represented by integer type of data. This makes me wonder
> > would it make any good to create additional column of Integer type
> > and update it as data gets added and use this integer column for
> > all ordering purposes for my sqls - or should I not hesitate using
> > Date type straight into my sql for ordering?
>
> I doubt that this will improve performance, particularly if you ever
> want to see your dates formatted as dates.
>
> > Better yet, is there anyway I can verify impact of ordering on
> > Date type vs. Integer type, apart from using \timing and explain
> > plan?
>
> You might be better off just writing the code in the most natural
> way, using the date type for dates, and then asking about any
> queries which aren't performing as you hope they would.  Premature
> optimization is often counter-productive.  If you really want to do
> some benchmarking of relative comparison speeds, though, see the
> generate_series function -- it can be good at generating test tables
> for such things.




There is a lot of really good advice here already. I'll just add one thought.

If the dates in your tables are static based only on creation (as in
only a CREATE_DATE, which will never be modified per row like a
MODIFY_DATE for each record), then your thought might have made sense.
But in that case you can already use the ID field if you have one?

In most real-world cases, however, the DATE field will likely be storing
an update time as well, which would make your thought about numbering
with integers pointless.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query Performance with Indexes on Integer type vs. Date type.

2011-04-27 Thread Kevin Grittner
Dhimant Patel  wrote:
 
> I am a newcomer to the Postgres world and now using it for some
> serious (at least for me)  projects. I have a need where I am
> running some analytical + aggregate functions on data where
> ordering is done on Date type column.
> 
> From my initial read on documentation I believe internally a date
> type is represented by integer type of data. This makes me wonder
> would it make any good to create additional column of Integer type
> and update it as data gets added and use this integer column for
> all ordering purposes for my sqls - or should I not hesitate using
> Date type straight into my sql for ordering?
 
I doubt that this will improve performance, particularly if you ever
want to see your dates formatted as dates.
 
> Better yet, is there anyway I can verify impact of ordering on
> Date type vs. Integer type, apart from using \timing and explain
> plan?
 
You might be better off just writing the code in the most natural
way, using the date type for dates, and then asking about any
queries which aren't performing as you hope they would.  Premature
optimization is often counter-productive.  If you really want to do
some benchmarking of relative comparison speeds, though, see the
generate_series function -- it can be good at generating test tables
for such things.
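
A minimal sketch of that kind of test (the table name and date range here
are just made up):

-- throwaway table with a date column and an equivalent integer column
CREATE TABLE ord_test AS
SELECT g::date                       AS d_date,
       g::date - date '2000-01-01'   AS d_int
FROM generate_series(timestamp '2000-01-01',
                     timestamp '2010-12-31',
                     interval '1 hour') AS g;

ANALYZE ord_test;

-- compare the two orderings
EXPLAIN ANALYZE SELECT * FROM ord_test ORDER BY d_date;
EXPLAIN ANALYZE SELECT * FROM ord_test ORDER BY d_int;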
 
-Kevin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query Performance with Indexes on Integer type vs. Date type.

2011-04-27 Thread Maciek Sakrejda
> This makes me wonder would it make any good to create additional  column of 
> Integer type and update it as data gets added and use this integer column for 
> all ordering purposes for my sqls - or should I not hesitate using Date type 
> straight into my sql for ordering?

Keep in mind what Michael A. Jackson (among others) had to say on
this: "The First Rule of Program Optimization: Don't do it. The Second
Rule of Program Optimization (for experts only!): Don't do it yet."
For one thing, adding an extra column to your data would mean more
data you need to cram in the cache as you query, so even if the *raw*
integer versus date ordering is faster, the "optimization" could still
be a net loss due to the fatter tuples. If you're willing to live with
*only* integer-based dates, that could help, but that seems
exceptionally painful and not worth considering unless you run into
trouble.

> Better yet, is there anyway I can verify impact of ordering on Date type vs. 
> Integer type, apart from using \timing and explain plan?

Remember to use explain analyze (and not just explain) when validating
these sorts of things. Explain is really just a guess. Also remember
to ensure that stats are up to date before you test this.
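
For example (the table and column names are made up):

ANALYZE orders;                            -- refresh planner statistics first
EXPLAIN ANALYZE
SELECT * FROM orders ORDER BY order_date;  -- estimated vs. actual rows, plus timing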

---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query Performance with Indexes on Integer type vs. Date type.

2011-04-27 Thread Tom Lane
Dhimant Patel  writes:
> From my initial read on documentation I believe internally a date type is
> represented by integer type of data. This makes me wonder would it make any
> good to create additional  column of Integer type and update it as data gets
> added and use this integer column for all ordering purposes for my sqls - or
> should I not hesitate using Date type straight into my sql for ordering?

Don't overcomplicate things.  Comparison of dates is just about as fast as
comparison of integers, anyway.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Query Performance with Indexes on Integer type vs. Date type.

2011-04-27 Thread Dhimant Patel
Hi All,

I am a newcomer to the Postgres world and now using it for some serious (at
least for me) projects. I have a need where I am running some analytical and
aggregate functions on data where ordering is done on a Date type column.

From my initial read of the documentation I believe a date type is internally
represented as an integer. This makes me wonder whether it would do any good
to create an additional column of Integer type, update it as data gets added,
and use this integer column for all ordering purposes in my SQL - or should I
not hesitate to use the Date type directly in my SQL for ordering?

Better yet, is there any way I can verify the impact of ordering on Date type
vs. Integer type, apart from using \timing and explain plans?


Thanks for sharing your insights.
-DP.


Re: [PERFORM] Query performance with disabled hashjoin and mergejoin

2011-03-09 Thread Robert Haas
On Tue, Mar 8, 2011 at 4:24 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> The reason I thought cross-column correlations might be relevant is
>> that the bitmap index scan on news_visible_from is quite accurate
>> (19976 estimated vs. 19932 actual) and the bitmap index scan on
>> news_visible_to is tolerably accurate (151 estimated vs. 41 actual)
>> but the estimate on the BitmapOr is somehow totally wrong (20127
>> estimated vs. 0 actual).  But on further reflection that doesn't make
>> much sense.  How can the BitmapOr produce fewer rows than the sum of
>> its constituent inputs?
>
> That's not an estimation bug, that's a measurement bug.  We don't try to
> count the actual number of rows present in the result of a BitmapOr or
> BitmapAnd node.  (It would be impractical in lossy cases anyway, not to
> mention expensive.)

Mmm, OK.  But I still think there's a problem with the selectivity
estimate in there somewhere, because

  ->  Bitmap Heap Scan on news
(cost=1282.94..5494.05 rows=1422 width=634) (actual time=5.532..5.560
rows=7 loops=1)

...which may be why the planner is going wrong for the OP.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query performance with disabled hashjoin and mergejoin

2011-03-08 Thread Merlin Moncure
On Tue, Mar 8, 2011 at 2:57 PM, Robert Haas  wrote:
> On Mon, Mar 7, 2011 at 3:40 PM, Merlin Moncure  wrote:
>> On Tue, Feb 22, 2011 at 9:07 PM, Robert Haas  wrote:
>>> On Fri, Feb 4, 2011 at 7:08 AM, Ivan Voras  wrote:
                                 ->  BitmapAnd  (cost=1282.94..1282.94
 rows=1430 width=0) (actual time=5.508..5.508 rows=0 loops=1)
                                       ->  Bitmap Index Scan on
 news_index_layout_id_state  (cost=0.00..150.14 rows=2587 width=0) (actual
 time=0.909..0.909 rows=3464 loops=1)
                                             Index Cond: ((layout_id = 8980)
 AND (state = 2))
                                       ->  BitmapOr (cost=1132.20..1132.20
 rows=20127 width=0) (actual time=4.136..4.136 rows=0 loops=1)
                                             ->  Bitmap Index Scan on
 news_visible_from  (cost=0.00..1122.09 rows=19976 width=0) (actual
 time=3.367..3.367 rows=19932 loops=1)
                                                   Index Cond: (visible_from
 IS NULL)
                                             ->  Bitmap Index Scan on
 news_visible_to  (cost=0.00..9.40 rows=151 width=0) (actual
 time=0.766..0.766 rows=43 loops=1)
                                                   Index Cond: (1296806570 <=
 visible_to)
>>>
>>> I think this part of the query is the problem.  Since the planner
>>> doesn't support cross-column statistics, it can't spot the correlation
>>> between these different search conditions, resulting in a badly broken
>>> selectivity estimate.
>>>
>>> Sometimes you can work around this by adding a single column, computed
>>> with a trigger, that contains enough information to test the whole
>>> WHERE-clause condition using a single indexable test against the
>>> column value.  Or sometimes you can get around it by partitioning the
>>> data into multiple tables, say with the visible_from IS NULL rows in a
>>> different table from the rest.
>>
>> Why should you need cross column statistics for this case?  You should
>> be able to multiply selectivity from left to right as long as you are
>> doing equality comparisons, yes?
>>
>> Right now the planner is treating
>> select * from foo where (a,b,c) between (1,1,1) and (9,9,9) the same
>> (using selectivity on a) as
>> select * from foo where (a,b,c) between (1,1,5) and (1,1,7)
>>
>> but they are not the same. Since in the second query terms a,b are
>> equal, shouldn't you be able to multiply the selectivity through?
>
> I'm not quite following that...
>
> The reason I thought cross-column correlations might be relevant is
> that the bitmap index scan on news_visible_from is quite accurate
> (19976 estimated vs. 19932 actual) and the bitmap index scan on
> news_visible_to is tolerably accurate (151 estimated vs. 41 actual)
> but the estimate on the BitmapOr is somehow totally wrong (20127
> estimated vs. 0 actual).  But on further reflection that doesn't make
> much sense.  How can the BitmapOr produce fewer rows than the sum of
> its constituent inputs?
>
> /me scratches head.

my fault -- the point I was making was, I think, valid but didn't apply
to the OP's question: I mistakenly thought the WHERE expression could be
converted to a row-wise comparison type operation, but that wasn't the case...

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query performance with disabled hashjoin and mergejoin

2011-03-08 Thread Tom Lane
Robert Haas  writes:
> The reason I thought cross-column correlations might be relevant is
> that the bitmap index scan on news_visible_from is quite accurate
> (19976 estimated vs. 19932 actual) and the bitmap index scan on
> news_visible_to is tolerably accurate (151 estimated vs. 41 actual)
> but the estimate on the BitmapOr is somehow totally wrong (20127
> estimated vs. 0 actual).  But on further reflection that doesn't make
> much sense.  How can the BitmapOr produce fewer rows than the sum of
> its constituent inputs?

That's not an estimation bug, that's a measurement bug.  We don't try to
count the actual number of rows present in the result of a BitmapOr or
BitmapAnd node.  (It would be impractical in lossy cases anyway, not to
mention expensive.)

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query performance with disabled hashjoin and mergejoin

2011-03-08 Thread Robert Haas
On Mon, Mar 7, 2011 at 3:40 PM, Merlin Moncure  wrote:
> On Tue, Feb 22, 2011 at 9:07 PM, Robert Haas  wrote:
>> On Fri, Feb 4, 2011 at 7:08 AM, Ivan Voras  wrote:
>>>                                 ->  BitmapAnd  (cost=1282.94..1282.94
>>> rows=1430 width=0) (actual time=5.508..5.508 rows=0 loops=1)
>>>                                       ->  Bitmap Index Scan on
>>> news_index_layout_id_state  (cost=0.00..150.14 rows=2587 width=0) (actual
>>> time=0.909..0.909 rows=3464 loops=1)
>>>                                             Index Cond: ((layout_id = 8980)
>>> AND (state = 2))
>>>                                       ->  BitmapOr (cost=1132.20..1132.20
>>> rows=20127 width=0) (actual time=4.136..4.136 rows=0 loops=1)
>>>                                             ->  Bitmap Index Scan on
>>> news_visible_from  (cost=0.00..1122.09 rows=19976 width=0) (actual
>>> time=3.367..3.367 rows=19932 loops=1)
>>>                                                   Index Cond: (visible_from
>>> IS NULL)
>>>                                             ->  Bitmap Index Scan on
>>> news_visible_to  (cost=0.00..9.40 rows=151 width=0) (actual
>>> time=0.766..0.766 rows=43 loops=1)
>>>                                                   Index Cond: (1296806570 <=
>>> visible_to)
>>
>> I think this part of the query is the problem.  Since the planner
>> doesn't support cross-column statistics, it can't spot the correlation
>> between these different search conditions, resulting in a badly broken
>> selectivity estimate.
>>
>> Sometimes you can work around this by adding a single column, computed
>> with a trigger, that contains enough information to test the whole
>> WHERE-clause condition using a single indexable test against the
>> column value.  Or sometimes you can get around it by partitioning the
>> data into multiple tables, say with the visible_from IS NULL rows in a
>> different table from the rest.
>
> Why should you need cross column statistics for this case?  You should
> be able to multiply selectivity from left to right as long as you are
> doing equality comparisons, yes?
>
> Right now the planner is treating
> select * from foo where (a,b,c) between (1,1,1) and (9,9,9) the same
> (using selectivity on a) as
> select * from foo where (a,b,c) between (1,1,5) and (1,1,7)
>
> but they are not the same. Since in the second query terms a,b are
> equal, shouldn't you be able to multiply the selectivity through?

I'm not quite following that...

The reason I thought cross-column correlations might be relevant is
that the bitmap index scan on news_visible_from is quite accurate
(19976 estimated vs. 19932 actual) and the bitmap index scan on
news_visible_to is tolerably accurate (151 estimated vs. 41 actual)
but the estimate on the BitmapOr is somehow totally wrong (20127
estimated vs. 0 actual).  But on further reflection that doesn't make
much sense.  How can the BitmapOr produce fewer rows than the sum of
its constituent inputs?

/me scratches head.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query performance with disabled hashjoin and mergejoin

2011-03-07 Thread Merlin Moncure
On Tue, Feb 22, 2011 at 9:07 PM, Robert Haas  wrote:
> On Fri, Feb 4, 2011 at 7:08 AM, Ivan Voras  wrote:
>>                                 ->  BitmapAnd  (cost=1282.94..1282.94
>> rows=1430 width=0) (actual time=5.508..5.508 rows=0 loops=1)
>>                                       ->  Bitmap Index Scan on
>> news_index_layout_id_state  (cost=0.00..150.14 rows=2587 width=0) (actual
>> time=0.909..0.909 rows=3464 loops=1)
>>                                             Index Cond: ((layout_id = 8980)
>> AND (state = 2))
>>                                       ->  BitmapOr (cost=1132.20..1132.20
>> rows=20127 width=0) (actual time=4.136..4.136 rows=0 loops=1)
>>                                             ->  Bitmap Index Scan on
>> news_visible_from  (cost=0.00..1122.09 rows=19976 width=0) (actual
>> time=3.367..3.367 rows=19932 loops=1)
>>                                                   Index Cond: (visible_from
>> IS NULL)
>>                                             ->  Bitmap Index Scan on
>> news_visible_to  (cost=0.00..9.40 rows=151 width=0) (actual
>> time=0.766..0.766 rows=43 loops=1)
>>                                                   Index Cond: (1296806570 <=
>> visible_to)
>
> I think this part of the query is the problem.  Since the planner
> doesn't support cross-column statistics, it can't spot the correlation
> between these different search conditions, resulting in a badly broken
> selectivity estimate.
>
> Sometimes you can work around this by adding a single column, computed
> with a trigger, that contains enough information to test the whole
> WHERE-clause condition using a single indexable test against the
> column value.  Or sometimes you can get around it by partitioning the
> data into multiple tables, say with the visible_from IS NULL rows in a
> different table from the rest.

Why should you need cross column statistics for this case?  You should
be able to multiply selectivity from left to right as long as you are
doing equality comparisons, yes?

Right now the planner is treating
select * from foo where (a,b,c) between (1,1,1) and (9,9,9) the same
(using selectivity on a) as
select * from foo where (a,b,c) between (1,1,5) and (1,1,7)

but they are not the same. Since in the second query terms a,b are
equal, shouldn't you be able to multiply the selectivity through?

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query performance with disabled hashjoin and mergejoin

2011-02-22 Thread Robert Haas
On Fri, Feb 4, 2011 at 7:08 AM, Ivan Voras  wrote:
>                                 ->  BitmapAnd  (cost=1282.94..1282.94
> rows=1430 width=0) (actual time=5.508..5.508 rows=0 loops=1)
>                                       ->  Bitmap Index Scan on
> news_index_layout_id_state  (cost=0.00..150.14 rows=2587 width=0) (actual
> time=0.909..0.909 rows=3464 loops=1)
>                                             Index Cond: ((layout_id = 8980)
> AND (state = 2))
>                                       ->  BitmapOr (cost=1132.20..1132.20
> rows=20127 width=0) (actual time=4.136..4.136 rows=0 loops=1)
>                                             ->  Bitmap Index Scan on
> news_visible_from  (cost=0.00..1122.09 rows=19976 width=0) (actual
> time=3.367..3.367 rows=19932 loops=1)
>                                                   Index Cond: (visible_from
> IS NULL)
>                                             ->  Bitmap Index Scan on
> news_visible_to  (cost=0.00..9.40 rows=151 width=0) (actual
> time=0.766..0.766 rows=43 loops=1)
>                                                   Index Cond: (1296806570 <=
> visible_to)

I think this part of the query is the problem.  Since the planner
doesn't support cross-column statistics, it can't spot the correlation
between these different search conditions, resulting in a badly broken
selectivity estimate.

Sometimes you can work around this by adding a single column, computed
with a trigger, that contains enough information to test the whole
WHERE-clause condition using a single indexable test against the
column value.  Or sometimes you can get around it by partitioning the
data into multiple tables, say with the visible_from IS NULL rows in a
different table from the rest.
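
A generic sketch of the trigger-maintained-column idea (the table and the
predicate here are made up, not your schema):

CREATE TABLE item (
    col_a   integer,
    col_b   integer,
    matches boolean    -- maintained by trigger: col_a IS NULL OR col_a = col_b
);

CREATE FUNCTION item_set_matches() RETURNS trigger AS $$
BEGIN
    NEW.matches := (NEW.col_a IS NULL OR NEW.col_a = NEW.col_b);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER item_set_matches_trg
    BEFORE INSERT OR UPDATE ON item
    FOR EACH ROW EXECUTE PROCEDURE item_set_matches();

-- the awkward OR condition then becomes a single indexable test
CREATE INDEX item_matches_idx ON item (matches);
-- ... WHERE matches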

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query performance with disabled hashjoin and mergejoin

2011-02-06 Thread Ivan Voras
Sorry for the misunderstanding: of course I did not mean the default "normal"
settings; shared buffers, work mem, wal segments and others have been tuned
according to the available hardware (e.g. 4 GB, 32 MB, 10 for these settings,
respectively). I meant "planner default settings" in the post.
-- 
Sent from my Android phone, please excuse my brevity.

Greg Smith  wrote:

Ivan Voras wrote: > The "vanilla" plan, with default settings is: Pause here 
for a second: why default settings? A default PostgreSQL configuration is 
suitable for systems with about 128MB of RAM. Since you say you have "good 
enough hardware", I'm assuming you have a bit more than that. The first things 
to try here are the list at 
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server ; your bad query 
here looks like it might benefit from a large increase to effective_cache_size, 
and possibly an increase to work_mem as well. Your "bad" plan here is doing a 
lot of sequential scans instead of indexed lookups, which makes me wonder if 
the change in join types you're forcing isn't fixing that part as a 
coincidence. Note that the estimated number of rows coming out of each form of 
plan is off by a factor of about 200X, so it's not that the other plan type is 
better estimating anything. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com 
Baltimore, MD PostgreSQL Training, Services, and 24x7 Support
www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": 
http://www.2ndQuadrant.com/books 



Re: [PERFORM] Query performance with disabled hashjoin and mergejoin

2011-02-04 Thread Ivan Voras

On 04/02/2011 15:44, Greg Smith wrote:

Ivan Voras wrote:

The "vanilla" plan, with default settings is:


Pause here for a second: why default settings? A default PostgreSQL
configuration is suitable for systems with about 128MB of RAM. Since you
say you have "good enough hardware", I'm assuming you have a bit more
than that. The first things to try here are the list at
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server ; your bad
query here looks like it might benefit from a large increase to
effective_cache_size, and possibly an increase to work_mem as well. Your
"bad" plan here is doing a lot of sequential scans instead of indexed
lookups, which makes me wonder if the change in join types you're
forcing isn't fixing that part as a coincidence.


My earlier message didn't get through so here's a repeat:

Sorry for the confusion, by "default settings" I meant "planner default 
settings" not generic shared buffers, wal logs, work memory etc. - which 
are adequately tuned.



Note that the estimated number of rows coming out of each form of plan
is off by a factor of about 200X, so it's not that the other plan type
is better estimating anything.


Any ideas how to fix the estimates? Or will I have to simulate hints by 
issuing "set enable_hashjoin=f; set enable_mergejoin=f;" for this query? :)




--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query performance with disabled hashjoin and mergejoin

2011-02-04 Thread Greg Smith

Ivan Voras wrote:

The "vanilla" plan, with default settings is:


Pause here for a second:  why default settings?  A default PostgreSQL 
configuration is suitable for systems with about 128MB of RAM.  Since 
you say you have "good enough hardware", I'm assuming you have a bit 
more than that.  The first things to try here are the list at 
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server ; your bad 
query here looks like it might benefit from a large increase to 
effective_cache_size, and possibly an increase to work_mem as well.  
Your "bad" plan here is doing a lot of sequential scans instead of 
indexed lookups, which makes me wonder if the change in join types 
you're forcing isn't fixing that part as a coincidence.


Note that the estimated number of rows coming out of each form of plan 
is off by a factor of about 200X, so it's not that the other plan type 
is better estimating anything.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Query performance with disabled hashjoin and mergejoin

2011-02-04 Thread Ivan Voras
I'm running all this on a 9.0 server with good enough hardware. The 
query is:


SELECT  news.id AS news_id
,   news.layout_id
,   news.news_relation_id
,   news.author_id
,   news.date_created
,   news.date_published
,   news.lastedit
,   news.lastedit_user_id
,   news.lastedit_date
,   news.approved_by
,   news.state
,   news.visible_from
,   news.visible_to
,   news.archived_by
,   news.archived_date
,   news.priority
,   news.collection_id
,   news.comment
,   news.keywords
,   news.icon
,   news.icon_url
,   news.icon_width
,   news.icon_height
,   news.icon_position
,   news.icon_onclick
,   news.icon_newwindow
,   news.no_lead
,   news.content_exists
, news.title, news.lead, news.content


,   author.public_name AS 
author_public_name
,   lastedit_user.public_name AS 
lastedit_user_public_name
,   approved_by_user.public_name AS 
approved_by_public_name
,   archived_by_user.public_name AS 
archived_by_public_name

FROM news
JOIN users AS author ON news.author_id 
= author.id
LEFT JOIN users AS lastedit_user ON 
news.lastedit_user_id = lastedit_user.id
LEFT JOIN users AS approved_by_user ON 
news.approved_by = approved_by_user.id
LEFT JOIN users AS archived_by_user ON 
news.archived_by = archived_by_user.id


WHERE (news.layout_id = 8980) AND (state = 
2) AND (date_published <= 1296806570 AND (visible_from IS NULL OR 
1296806570 BETWEEN visible_from AND visible_to))
ORDER BY priority DESC, date_published DESC
;

The "vanilla" plan, with default settings is:

 Sort  (cost=7325.84..7329.39 rows=1422 width=678) (actual 
time=100.846..100.852 rows=7 loops=1)

   Sort Key: news.priority, news.date_published
   Sort Method:  quicksort  Memory: 38kB
   ->  Hash Left Join  (cost=2908.02..7251.37 rows=1422 width=678) 
(actual time=100.695..100.799 rows=7 loops=1)

 Hash Cond: (news.archived_by = archived_by_user.id)
 ->  Hash Left Join  (cost=2501.75..6819.47 rows=1422 
width=667) (actual time=76.742..76.830 rows=7 loops=1)

   Hash Cond: (news.approved_by = approved_by_user.id)
   ->  Hash Left Join  (cost=2095.48..6377.69 rows=1422 
width=656) (actual time=53.248..53.318 rows=7 loops=1)

 Hash Cond: (news.lastedit_user_id = lastedit_user.id)
 ->  Hash Join  (cost=1689.21..5935.87 rows=1422 
width=645) (actual time=29.793..29.846 rows=7 loops=1)

   Hash Cond: (news.author_id = author.id)
   ->  Bitmap Heap Scan on news 
(cost=1282.94..5494.05 rows=1422 width=634) (actual time=5.532..5.560 
rows=7 loops=1)
 Recheck Cond: ((layout_id = 8980) AND 
(state = 2) AND ((visible_from IS NULL) OR (1296806570 <= visible_to)))
 Filter: ((date_published <= 
1296806570) AND ((visible_from IS NULL) OR ((1296806570 >= visible_from) 
AND (1296806570 <= visible_to
 ->  BitmapAnd  (cost=1282.94..1282.94 
rows=1430 width=0) (actual time=5.508..5.508 rows=0 loops=1)
   ->  Bitmap Index Scan on 
news_index_layout_id_state  (cost=0.00..150.14 rows=2587 width=0) 
(actual time=0.909..0.909 rows=3464 loops=1)
 Index Cond: ((layout_id = 
8980) AND (state = 2))
   ->  BitmapOr 
(cost=1132.20..1132.20 rows=20127 width=0) (actual time=4.136..4.136 
rows=0 loops=1)
 ->  Bitmap Index Scan on 
news_visible_from  (cost=0.00..1122.09 rows=19976 width=0) (actual 
time=3.367..3.367 rows=19932 loops=1)
  

Re: [PERFORM] Query Performance SQL Server vs. Postgresql

2010-11-23 Thread Merlin Moncure
On Mon, Nov 22, 2010 at 7:12 PM, Humair Mohammed  wrote:
> I did some further analysis and here are the results:
> work_mem;response_time
> 1MB;62 seconds
> 2MB;2 seconds
> 4MB;700 milliseconds
> 8MB;550 milliseconds
> In all cases shared_buffers were set to the default value of 32MB. As you
> can see the 1 to 2 MB jump on the work_mem does wonders. I probably don't
> need this to be any higher than 8 or 16 MB. Thanks to all for help!
> Humair

work_mem directly affects how the query is planned, because certain
types of plans (hash joins and large sorts) require memory. Raising or
lowering shared_buffers, OTOH, is very subtle and is not something you
tune to improve the execution of a single query...
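
e.g. you can watch the hash join change by flipping work_mem in your session
and re-running the explain (the query here is your comparison query):

SET work_mem = '1MB';
EXPLAIN (ANALYZE, BUFFERS) SELECT ... ;  -- expect "Batches: 64" plus temp reads/writes
SET work_mem = '64MB';
EXPLAIN (ANALYZE, BUFFERS) SELECT ... ;  -- the whole hash fits: "Batches: 1"
RESET work_mem;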

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query Performance SQL Server vs. Postgresql

2010-11-23 Thread Humair Mohammed

I did some further analysis and here are the results:
work_mem;response_time
1MB;62 seconds
2MB;2 seconds
4MB;700 milliseconds
8MB;550 milliseconds
In all cases shared_buffers were set to the default value of 32MB. As you can 
see the 1 to 2 MB jump on the work_mem does wonders. I probably don't need this 
to be any higher than 8 or 16 MB. Thanks to all for help!
Humair
> Date: Mon, 22 Nov 2010 12:00:15 +0100
> Subject: Re: [PERFORM] Query Performance SQL Server vs. Postgresql
> From: t...@fuzzy.cz
> To: huma...@hotmail.com
> CC: pgsql-performance@postgresql.org
> 
> >
> >
> > Correct, the optimizer did not take the settings with the pg_ctl reload
> > command. I did a pg_ctl restart and work_mem now displays the updated
> > value. I had to bump up all the way to 2047 MB to get the response below
> > (with work_mem at 1024 MB I see 7 seconds response time) and with 2047 MB
> > (which is the max value that can be set for work_mem - anything more than
> > that results in a FATAL error because of the limit) the results are below.
> 
> Hm, can you post explain plan for the case work_mem=1024MB. I guess the
> difference is due to caching. According to the explain analyze, there are
> just cache hits, no reads.
> 
> Anyway the hash join uses only about 40MB of memory, so 1024MB should be
> perfectly fine and the explain plan should be exactly the same as with
> work_mem=2047MB. And the row estimates seem quite precise, so I don't
> think there's some severe overestimation.
> 
> Tomas
> 
  

Re: [PERFORM] Query Performance SQL Server vs. Postgresql

2010-11-22 Thread tv
> I believe you can set work_mem to a different value just for the duration
> of
> a single query, so you needn't have work_mem set so high for every
> query
> on the system.  A single query may well use a multiple of work_mem, so you
> really probably don't want it that high all the time unless all of your
> queries are structured similarly.  Just set work_mem='2047MB'; query;
> reset
> all;

Yes, executing "set work_mem='64MB'" right before the query should be just
fine. Setting work_mem to 2GB is overkill most of the time (99.9%).

Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query Performance SQL Server vs. Postgresql

2010-11-22 Thread Samuel Gendler
On Sun, Nov 21, 2010 at 10:21 PM, Humair Mohammed wrote:

>
> Correct, the optimizer did not take the settings with the pg_ctl reload
> command. I did a pg_ctl restart and work_mem now displays the updated value.
> I had to bump up all the way to 2047 MB to get the response below (with
> work_mem at 1024 MB I see 7 seconds response time) and with 2047 MB (which
> is the max value that can be set for work_mem - anything more than that
> results in a FATAL error because of the limit) the results are below. The
> batches and memory usage are reflecting the right behavior with these
> settings. Thanks for everyone's input; the result is now matching what SQL
> Server was producing.
>
>
I believe you can set work_mem to a different value just for the duration of
a single query, so you needn't have work_mem set so high for every query
on the system.  A single query may well use a multiple of work_mem, so you
really probably don't want it that high all the time unless all of your
queries are structured similarly.  Just set work_mem='2047MB'; query; reset
all;

But you should wait until someone more knowledgeable than I am confirms what I
just wrote.


Re: [PERFORM] Query Performance SQL Server vs. Postgresql

2010-11-22 Thread tv
>
>
> Correct, the optimizer did not take the settings with the pg_ctl reload
> command. I did a pg_ctl restart and work_mem now displays the updated
> value. I had to bump up all the way to 2047 MB to get the response below
> (with work_mem at 1024 MB I see 7 seconds response time) and with 2047 MB
> (which is the max value that can be set for work_mem - anything more than
> that results in a FATAL error because of the limit) the results are below.

Hm, can you post the explain plan for the work_mem=1024MB case? I guess the
difference is due to caching. According to the explain analyze, there are
just cache hits, no reads.

Anyway the hash join uses only about 40MB of memory, so 1024MB should be
perfectly fine and the explain plan should be exactly the same as with
work_mem=2047MB. And the row estimates seem quite precise, so I don't
think there's some severe overestimation.

Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query Performance SQL Server vs. Postgresql

2010-11-22 Thread Humair Mohammed


Correct, the optimizer did not take the settings with the pg_ctl reload 
command. I did a pg_ctl restart and work_mem now displays the updated value. I 
had to bump up all the way to 2047 MB to get the response below (with work_mem 
at 1024 MB I see 7 seconds response time) and with 2047 MB (which is the max 
value that can be set for work_mem - anything more than that results in a FATAL 
error because of the limit) the results are below. The batches and memory usage 
are reflecting the right behavior with these settings. Thanks for everyone's 
input; the result is now matching what SQL Server was producing.
"Hash Join  (cost=11305.30..39118.43 rows=92869 width=17) (actual 
time=145.888..326.216 rows=3163 loops=1)""  Hash Cond: (((pb.id)::text = 
(pg.id)::text) AND ((pb.question)::text = (pg.question)::text))""  Join Filter: 
((COALESCE(pb.response, 'MISSING'::character varying))::text <> 
(COALESCE(pg.response, 'MISSING'::character varying))::text)""  Buffers: shared 
hit=6895""  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 
width=134) (actual time=0.011..11.903 rows=93496 loops=1)""Buffers: 
shared hit=1870""  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) 
(actual time=145.673..145.673 rows=251212 loops=1)""Buckets: 32768  
Batches: 1  Memory Usage: 39939kB""Buffers: shared hit=5025""-> 
 Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212 width=134) (actual 
time=0.004..26.242 rows=251212 loops=1)""  Buffers: shared 
hit=5025""Total runtime: 331.168 ms"
Humair

> CC: t...@fuzzy.cz; huma...@hotmail.com; pavel.steh...@gmail.com; 
> pgsql-performance@postgresql.org
> From: robertmh...@gmail.com
> Subject: Re: [PERFORM] Query Performance SQL Server vs. Postgresql
> Date: Sun, 21 Nov 2010 13:55:54 -0500
> To: t...@sss.pgh.pa.us
> 
> On Nov 21, 2010, at 12:16 PM, Tom Lane  wrote:
> > t...@fuzzy.cz writes:
> >>> Second, I modified the work_mem setting to 2GB (reloaded config) and I see
> >>> a response time of 38 seconds. Results below from EXPLAIN ANALYZE:
> > 
> >> How did you reload the config? Using 'kill -HUP pid'? That should work
> >> fine. Have you cheched 'work_mem' after the reload?
> > 
> >> Because the explain plans are exactly the same (structure, estimated
> >> costs). The really interesting bit is this and it did not change at all
> > 
> >>   Buckets: 1024 Batches: 64  Memory Usage: 650kB
> > 
> > If that didn't change, I'm prepared to bet that the OP didn't actually
> > manage to change the active value of work_mem.
> 
> Yep.  All this speculation about slow disks and/or COALESCE strikes me as 
> likely totally off-base. I think the original poster needs to run "show 
> work_mem" right before the EXPLAIN ANALYZE to make sure the new value they 
> set actually stuck. There's no reason for the planner to have used only 650kB 
> if work_mem is set to anything >=2MB.
> 
> ...Robert
  

Re: [PERFORM] Query Performance SQL Server vs. Postgresql

2010-11-22 Thread Humair Mohammed

That was a typo:
work_mem = 2GB
shared_buffers = 2GB
> From: pavel.steh...@gmail.com
> Date: Sun, 21 Nov 2010 12:38:43 +0100
> Subject: Re: [PERFORM] Query Performance SQL Server vs. Postgresql
> To: huma...@hotmail.com
> CC: pgsql-performance@postgresql.org
> 
> 2010/11/21 Humair Mohammed :
> >
> > 1) OS/Configuration
> > 64-bit Windows 7 Enterprise with 8G RAM and a Dual Core 2.67 Ghz Intel CPU
> > postgresql-x64-9.0 (PostgreSQL 9.0.1, compiled by Visual C++ build 1500,
> > 64-bit)
> > work_mem  2GB
> > shared_buffers = 2
> 
> shared_buffers = 2 ???
> 
> Regards
> 
> Pavel Stehule
> 
> 
> > 2) Dataset
> > name,pages,tuples,pg_size_pretty
> > "pivotbad";1870;93496;"15 MB"
> > "pivotgood";5025;251212;"39 MB"
> > 3) EXPLAIN (ANALYZE ON, BUFFERS ON)
> > "Hash Join  (cost=16212.30..52586.43 rows=92869 width=17) (actual
> > time=25814.222..32296.765 rows=3163 loops=1)"
> > "  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text =
> > (pg.question)::text))"
> > "  Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text
> > <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
> > "  Buffers: shared hit=384 read=6511, temp read=6444 written=6318"
> > "  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 width=134)
> > (actual time=0.069..37.143 rows=93496 loops=1)"
> > "Buffers: shared hit=192 read=1678"
> > "  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) (actual
> > time=24621.752..24621.752 rows=251212 loops=1)"
> > "Buckets: 1024  Batches: 64  Memory Usage: 650kB"
> > "Buffers: shared hit=192 read=4833, temp written=4524"
> > "->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212
> > width=134) (actual time=0.038..117.780 rows=251212 loops=1)"
> > "  Buffers: shared hit=192 read=4833"
> > "Total runtime: 32297.305 ms"
> > 4) INDEXES
> > I can certainly add an index but given the table sizes I am not sure if that
> > is a factor. This by no means is a large dataset less than 350,000 rows in
> > total and 3 columns. Also this was just a quick dump of data for comparison
> > purpose. When I saw the poor performance on the COALESCE, I pointed the data
> > load to SQL Server and ran the same query except with the TSQL specific
> > ISNULL function.
> >
  

Re: [PERFORM] Query Performance SQL Server vs. Postgresql

2010-11-21 Thread Pavel Stehule
Hello

2010/11/21 Humair Mohammed :
> That was a typo:
> work_mem = 2GB
> shared_buffers = 2GB

OK, then try to decrease shared_buffers. Maybe Win7 has a problem here -
large shared buffers work well only on UNIX-like systems. I think 500 MB
is enough.
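
For example, as a starting point in postgresql.conf (just a suggestion for
the 8 GB machine described above; it needs a restart to take effect):

shared_buffers = 512MB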

Regards

Pavel Stehule

>> From: pavel.steh...@gmail.com
>> Date: Sun, 21 Nov 2010 12:38:43 +0100
>> Subject: Re: [PERFORM] Query Performance SQL Server vs. Postgresql
>> To: huma...@hotmail.com
>> CC: pgsql-performance@postgresql.org
>>
>> 2010/11/21 Humair Mohammed :
>> >
>> > 1) OS/Configuration
>> > 64-bit Windows 7 Enterprise with 8G RAM and a Dual Core 2.67 Ghz Intel
>> > CPU
>> > postgresql-x64-9.0 (PostgreSQL 9.0.1, compiled by Visual C++ build 1500,
>> > 64-bit)
>> > work_mem  2GB
>> > shared_buffers = 2
>>
>> shared_buffers = 2 ???
>>
>> Regards
>>
>> Pavel Stehule
>>
>>
>> > 2) Dataset
>> > name,pages,tuples,pg_size_pretty
>> > "pivotbad";1870;93496;"15 MB"
>> > "pivotgood";5025;251212;"39 MB"
>> > 3) EXPLAIN (ANALYZE ON, BUFFERS ON)
>> > "Hash Join  (cost=16212.30..52586.43 rows=92869 width=17) (actual
>> > time=25814.222..32296.765 rows=3163 loops=1)"
>> > "  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text
>> > =
>> > (pg.question)::text))"
>> > "  Join Filter: ((COALESCE(pb.response, 'MISSING'::character
>> > varying))::text
>> > <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
>> > "  Buffers: shared hit=384 read=6511, temp read=6444 written=6318"
>> > "  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496
>> > width=134)
>> > (actual time=0.069..37.143 rows=93496 loops=1)"
>> > "        Buffers: shared hit=192 read=1678"
>> > "  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) (actual
>> > time=24621.752..24621.752 rows=251212 loops=1)"
>> > "        Buckets: 1024  Batches: 64  Memory Usage: 650kB"
>> > "        Buffers: shared hit=192 read=4833, temp written=4524"
>> > "        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212
>> > width=134) (actual time=0.038..117.780 rows=251212 loops=1)"
>> > "              Buffers: shared hit=192 read=4833"
>> > "Total runtime: 32297.305 ms"
>> > 4) INDEXES
>> > I can certainly add an index but given the table sizes I am not sure if
>> > that
>> > is a factor. This is by no means a large dataset - less than 350,000 rows
>> > in
>> > total and 3 columns. Also this was just a quick dump of data for
>> > comparison
>> > purpose. When I saw the poor performance on the COALESCE, I pointed the
>> > data
>> > load to SQL Server and ran the same query except with the TSQL specific
>> > ISNULL function.
>> >
>

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query Performance SQL Server vs. Postgresql

2010-11-21 Thread Robert Haas
On Nov 21, 2010, at 12:16 PM, Tom Lane  wrote:
> t...@fuzzy.cz writes:
>>> Second, I modified the work_mem setting to 2GB (reloaded config) and I see
>>> a response time of 38 seconds. Results below from EXPLAIN ANALYZE:
> 
>> How did you reload the config? Using 'kill -HUP pid'? That should work
>> fine. Have you cheched 'work_mem' after the reload?
> 
>> Because the explain plans are exactly the same (structure, estimated
>> costs). The really interesting bit is this and it did not change at all
> 
>>   Buckets: 1024 Batches: 64  Memory Usage: 650kB
> 
> If that didn't change, I'm prepared to bet that the OP didn't actually
> manage to change the active value of work_mem.

Yep.  All this speculation about slow disks and/or COALESCE strikes me as 
likely totally off-base. I think the original poster needs to run "show 
work_mem" right before the EXPLAIN ANALYZE to make sure the new value they set 
actually stuck. There's no reason for the planner to have used only 650kB if 
work_mem is set to anything >=2MB.
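
i.e. something like:

SHOW work_mem;               -- sanity-check the active value in this session
EXPLAIN ANALYZE SELECT ... ; -- then the problem query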

...Robert
-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

