Re: [PERFORM] wildcard text filter switched to boolean column, performance is way worse
Hello, From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Mike Broers Sent: Dienstag, 7. Juli 2015 18:28 To: Tom Lane Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] wildcard text filter switched to boolean column, performance is way worse After bumping up work_mem from 12MB to 25MB that last materialize is indeed hashing and this cut the query time by about 60%. Thanks, this was very helpful and gives me something else to look for when troubleshooting explains. On Tue, Jul 7, 2015 at 11:10 AM, Mike Broers mbro...@gmail.com wrote: Thanks, very informative! I'll experiment with work_mem settings and report back. On Tue, Jul 7, 2015 at 11:02 AM, Tom Lane t...@sss.pgh.pa.us wrote: Mike Broers mbro...@gmail.com writes: I had a query that was filtering with a wildcard search of a text field for %SUCCESS%. The query took about 5 seconds and was running often so I wanted to improve it. I suggested that the engineers include a new boolean column for successful status. They implemented the requested field, but the query that filters on that new column runs very long (i kill it after letting it run for about an hour). Can someone help me understand why that is the case and how to resolve it? It's hashing the subplan output in the first case and not the second: Seq Scan on lead (cost=130951.81..158059.21 rows=139957 width=369) (actual time=4699.619..4699.869 rows=1 loops=1) Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3 vs Seq Scan on lead (cost=85775.78..9005687281.12 rows=139957 width=369) Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3 ^ Presumably, the new more-accurate rows count causes the planner to realize that the hash table will exceed work_mem so it doesn't choose to hash ... but for your situation, you'd rather it did, because what you're getting instead is a Materialize node that spills to disk (again, because the data involved exceeds work_mem) and that's a killer for this query. You should be able to get back the old behavior if you raise work_mem enough. Another idea you might think about is changing the OR'd IN conditions to a single IN over a UNION ALL of the subselects. I'm not really sure if that would produce a better plan, but it's worth trying if it wouldn't require too much app-side contortion. Hello, you might try to use a CTE to first collect the IDs to exclude, and join them to your main table. This should result in an anti join pattern. Something like: WITH IDS as ( SELECT U1.lead_id AS lead_id FROM event U1 WHERE U1.event_type ='type_1' UNION ( SELECT U1.lead_id AS lead_id FROM event U1 WHERE U1.event_type = 'type_2' INTERSECT SELECT U1.lead_id AS lead_id FROM event U1 WHERE successful ) ) SELECT * FROM lead LEFT OUTER JOIN IDS ON (lead.id=IDS.lead_id) WHERE IDS.lead_id IS NULL; regards, Marc Mamin regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] wildcard text filter switched to boolean column, performance is way worse
Mike Broers mbro...@gmail.com writes: I had a query that was filtering with a wildcard search of a text field for %SUCCESS%. The query took about 5 seconds and was running often so I wanted to improve it. I suggested that the engineers include a new boolean column for successful status. They implemented the requested field, but the query that filters on that new column runs very long (i kill it after letting it run for about an hour). Can someone help me understand why that is the case and how to resolve it? It's hashing the subplan output in the first case and not the second: Seq Scan on lead (cost=130951.81..158059.21 rows=139957 width=369) (actual time=4699.619..4699.869 rows=1 loops=1) Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3 vs Seq Scan on lead (cost=85775.78..9005687281.12 rows=139957 width=369) Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3 ^ Presumably, the new more-accurate rows count causes the planner to realize that the hash table will exceed work_mem so it doesn't choose to hash ... but for your situation, you'd rather it did, because what you're getting instead is a Materialize node that spills to disk (again, because the data involved exceeds work_mem) and that's a killer for this query. You should be able to get back the old behavior if you raise work_mem enough. Another idea you might think about is changing the OR'd IN conditions to a single IN over a UNION ALL of the subselects. I'm not really sure if that would produce a better plan, but it's worth trying if it wouldn't require too much app-side contortion. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] wildcard text filter switched to boolean column, performance is way worse
I had a query that was filtering with a wildcard search of a text field for %SUCCESS%. The query took about 5 seconds and was running often so I wanted to improve it. I suggested that the engineers include a new boolean column for successful status. They implemented the requested field, but the query that filters on that new column runs very long (i kill it after letting it run for about an hour). Can someone help me understand why that is the case and how to resolve it? First query: SELECT * FROM lead WHERE ( NOT ( ( lead.id IN ( SELECT U1.lead_id AS lead_id FROM event U1 WHERE U1.event_type = 'type_1' ) OR ( lead.id IN ( SELECT U1.lead_id AS lead_id FROM event U1 WHERE U1.event_type = 'type_2' ) AND lead.id IN ( SELECT U1.lead_id AS lead_id FROM event U1 WHERE UPPER(U1.response::text) LIKE UPPER('%success%') ) ) ) ) AND NOT (lead.ReferenceNumber = '') ) ; explain/analyze result: Seq Scan on lead (cost=130951.81..158059.21 rows=139957 width=369) (actual time=4699.619..4699.869 rows=1 loops=1) Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3 Rows Removed by Filter: 375369 SubPlan 1 - Seq Scan on event u1 (cost=0.00..42408.62 rows=7748 width=4) (actual time=0.005..171.350 rows=7414 loops=1) Filter: ((event_type)::text = 'type_1'::text) Rows Removed by Filter: 1099436 SubPlan 2 - Seq Scan on event u1_1 (cost=0.00..42408.62 rows=375665 width=4) (actual time=0.006..219.092 rows=373298 loops=1) Filter: ((event_type)::text = 'type_2'::text) Rows Removed by Filter: 733552 SubPlan 3 - Seq Scan on event u1_2 (cost=0.00..45175.75 rows=111 width=4) (actual time=0.040..3389.550 rows=712952 loops=1) Filter: (upper(response) ~~ '%SUCCESS%'::text) Rows Removed by Filter: 393898 The main thing that sticks out to me for this plan is the low estimate for the rows it will return on the %SUCCESS% filter. Here is the second query with explain: SELECT * FROM lead WHERE ( NOT ( (lead.id IN ( SELECT U1.lead_id AS lead_id FROM event U1 WHERE U1.event_type ='type_1' ) OR (lead.id IN ( SELECT U1.lead_id AS lead_id FROM event U1 WHERE U1.event_type = 'type_2 ) AND lead.id IN ( SELECT U1.lead_id AS lead_id FROM event U1 WHERE successful ) ) ) ) AND NOT (lead.ReferenceNumber = '') ) ; explain result: Seq Scan on lead (cost=85775.78..9005687281.12 rows=139957 width=369) Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3 SubPlan 1 - Seq Scan on event u1 (cost=0.00..42408.62 rows=7748 width=4) Filter: ((event_type)::text = 'type_1'::text) SubPlan 2 - Seq Scan on event u1_1 (cost=0.00..42408.62 rows=375665 width=4) Filter: ((event_type)::text = 'type_2'::text) SubPlan 3 - Materialize (cost=0.00..46154.43 rows=731185 width=4) - Seq Scan on event u1_2 (cost=0.00..39641.50 rows=731185 width=4) Filter: successful Here is does a materialize and estimates rows properly, but as stated this query just hangs and pegs load. There are no locks and its in an active state the whole time. I am running these queries in a test environment on a recently exported full schema from production, with a reindex and a vacuum/analyze. This is postgres 9.3.6 on rhel6. When I run just the different subquery element: SELECT U1.lead_id AS lead_id FROM event U1 WHERE successful; it returns in about 250ms, with the text field %SUCCESS% it runs in about 4 seconds. This seemed like a low hanging fruit query improvement so I'm surprised its not working, it seems like we are just lucky that the planner is estimating that filter incorrectly in the original form. I'm sure the query just needs to be completely overhauled and am starting to pull it apart and work with the engineers to get something more efficient set up overall, but I am not sure how to answer the question as to why this original attempt at improving the query is not successful. Any guidance is greatly appreciated, thanks!
Re: [PERFORM] wildcard text filter switched to boolean column, performance is way worse
After bumping up work_mem from 12MB to 25MB that last materialize is indeed hashing and this cut the query time by about 60%. Thanks, this was very helpful and gives me something else to look for when troubleshooting explains. On Tue, Jul 7, 2015 at 11:10 AM, Mike Broers mbro...@gmail.com wrote: Thanks, very informative! I'll experiment with work_mem settings and report back. On Tue, Jul 7, 2015 at 11:02 AM, Tom Lane t...@sss.pgh.pa.us wrote: Mike Broers mbro...@gmail.com writes: I had a query that was filtering with a wildcard search of a text field for %SUCCESS%. The query took about 5 seconds and was running often so I wanted to improve it. I suggested that the engineers include a new boolean column for successful status. They implemented the requested field, but the query that filters on that new column runs very long (i kill it after letting it run for about an hour). Can someone help me understand why that is the case and how to resolve it? It's hashing the subplan output in the first case and not the second: Seq Scan on lead (cost=130951.81..158059.21 rows=139957 width=369) (actual time=4699.619..4699.869 rows=1 loops=1) Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3 vs Seq Scan on lead (cost=85775.78..9005687281.12 rows=139957 width=369) Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3 ^ Presumably, the new more-accurate rows count causes the planner to realize that the hash table will exceed work_mem so it doesn't choose to hash ... but for your situation, you'd rather it did, because what you're getting instead is a Materialize node that spills to disk (again, because the data involved exceeds work_mem) and that's a killer for this query. You should be able to get back the old behavior if you raise work_mem enough. Another idea you might think about is changing the OR'd IN conditions to a single IN over a UNION ALL of the subselects. I'm not really sure if that would produce a better plan, but it's worth trying if it wouldn't require too much app-side contortion. regards, tom lane
Re: [PERFORM] wildcard text filter switched to boolean column, performance is way worse
Thanks, very informative! I'll experiment with work_mem settings and report back. On Tue, Jul 7, 2015 at 11:02 AM, Tom Lane t...@sss.pgh.pa.us wrote: Mike Broers mbro...@gmail.com writes: I had a query that was filtering with a wildcard search of a text field for %SUCCESS%. The query took about 5 seconds and was running often so I wanted to improve it. I suggested that the engineers include a new boolean column for successful status. They implemented the requested field, but the query that filters on that new column runs very long (i kill it after letting it run for about an hour). Can someone help me understand why that is the case and how to resolve it? It's hashing the subplan output in the first case and not the second: Seq Scan on lead (cost=130951.81..158059.21 rows=139957 width=369) (actual time=4699.619..4699.869 rows=1 loops=1) Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3 vs Seq Scan on lead (cost=85775.78..9005687281.12 rows=139957 width=369) Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3 ^ Presumably, the new more-accurate rows count causes the planner to realize that the hash table will exceed work_mem so it doesn't choose to hash ... but for your situation, you'd rather it did, because what you're getting instead is a Materialize node that spills to disk (again, because the data involved exceeds work_mem) and that's a killer for this query. You should be able to get back the old behavior if you raise work_mem enough. Another idea you might think about is changing the OR'd IN conditions to a single IN over a UNION ALL of the subselects. I'm not really sure if that would produce a better plan, but it's worth trying if it wouldn't require too much app-side contortion. regards, tom lane