Re: [PERFORM] wildcard text filter switched to boolean column, performance is way worse

2015-07-08 Thread Marc Mamin
Hello,
 
 
 
 
 
 From: pgsql-performance-ow...@postgresql.org 
 [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Mike Broers
 Sent: Dienstag, 7. Juli 2015 18:28
 To: Tom Lane
 Cc: pgsql-performance@postgresql.org
 Subject: Re: [PERFORM] wildcard text filter switched to boolean column, 
 performance is way worse
 
 After bumping up work_mem from 12MB to 25MB that last materialize is indeed 
 hashing and this cut the query time by about 60%.  Thanks, this was very 
 helpful and gives me something else to look for when troubleshooting 
 explains.  
 
 
 
 On Tue, Jul 7, 2015 at 11:10 AM, Mike Broers mbro...@gmail.com wrote:
 Thanks, very informative! I'll experiment with work_mem settings and report 
 back.
 
 On Tue, Jul 7, 2015 at 11:02 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Mike Broers mbro...@gmail.com writes:
  I had a query that was filtering with a wildcard search of a text field for
  %SUCCESS%. The query took about 5 seconds and was running often so I wanted
  to improve it.  I suggested that the engineers include a new boolean column
  for successful status.  They implemented the requested field, but the query
  that filters on that new column runs very long (i kill it after letting it
  run for about an hour).  Can someone help me understand why that is the
  case and how to resolve it?
 
 It's hashing the subplan output in the first case and not the second:
 
  Seq Scan on lead  (cost=130951.81..158059.21 rows=139957 width=369) (actual
  time=4699.619..4699.869 rows=1 loops=1)
  Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text 
  ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3

 vs
 
Seq Scan on lead  (cost=85775.78..9005687281.12 rows=139957 width=369)
  Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text 
  ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3
^
 
 Presumably, the new more-accurate rows count causes the planner to realize
 that the hash table will exceed work_mem so it doesn't choose to hash ...
 but for your situation, you'd rather it did, because what you're getting
 instead is a Materialize node that spills to disk (again, because the data
 involved exceeds work_mem) and that's a killer for this query.  You should
 be able to get back the old behavior if you raise work_mem enough.
 
 Another idea you might think about is changing the OR'd IN conditions
 to a single IN over a UNION ALL of the subselects.  I'm not really sure if
 that would produce a better plan, but it's worth trying if it wouldn't
 require too much app-side contortion.


Hello,
you might try to use a CTE to first collect the IDs to exclude, and join them 
to your main table.
This should result in an anti join pattern.

Something like:

WITH IDS as (
SELECT U1.lead_id AS lead_id 
FROM event U1 
WHERE U1.event_type ='type_1'
UNION (
   SELECT U1.lead_id AS lead_id 
   FROM event U1 
   WHERE U1.event_type = 'type_2'
   INTERSECT
   SELECT U1.lead_id AS lead_id 
   FROM event U1 
   WHERE successful
)
)
SELECT * FROM lead LEFT OUTER JOIN IDS ON (lead.id=IDS.lead_id)
WHERE IDS.lead_id IS NULL;

regards,

Marc Mamin






 regards, tom lane
 
 


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] wildcard text filter switched to boolean column, performance is way worse

2015-07-07 Thread Tom Lane
Mike Broers mbro...@gmail.com writes:
 I had a query that was filtering with a wildcard search of a text field for
 %SUCCESS%. The query took about 5 seconds and was running often so I wanted
 to improve it.  I suggested that the engineers include a new boolean column
 for successful status.  They implemented the requested field, but the query
 that filters on that new column runs very long (i kill it after letting it
 run for about an hour).  Can someone help me understand why that is the
 case and how to resolve it?

It's hashing the subplan output in the first case and not the second:

 Seq Scan on lead  (cost=130951.81..158059.21 rows=139957 width=369) (actual
 time=4699.619..4699.869 rows=1 loops=1)
 Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text 
 ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3
   
vs

   Seq Scan on lead  (cost=85775.78..9005687281.12 rows=139957 width=369)
 Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text 
 ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3
   ^

Presumably, the new more-accurate rows count causes the planner to realize
that the hash table will exceed work_mem so it doesn't choose to hash ...
but for your situation, you'd rather it did, because what you're getting
instead is a Materialize node that spills to disk (again, because the data
involved exceeds work_mem) and that's a killer for this query.  You should
be able to get back the old behavior if you raise work_mem enough.

Another idea you might think about is changing the OR'd IN conditions
to a single IN over a UNION ALL of the subselects.  I'm not really sure if
that would produce a better plan, but it's worth trying if it wouldn't
require too much app-side contortion.

regards, tom lane


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] wildcard text filter switched to boolean column, performance is way worse

2015-07-07 Thread Mike Broers
I had a query that was filtering with a wildcard search of a text field for
%SUCCESS%. The query took about 5 seconds and was running often so I wanted
to improve it.  I suggested that the engineers include a new boolean column
for successful status.  They implemented the requested field, but the query
that filters on that new column runs very long (i kill it after letting it
run for about an hour).  Can someone help me understand why that is the
case and how to resolve it?

First query:
SELECT *
FROM lead
WHERE ( NOT ( ( lead.id IN
 ( SELECT U1.lead_id AS lead_id
  FROM event U1
  WHERE U1.event_type = 'type_1' )
   OR ( lead.id IN
 ( SELECT U1.lead_id AS lead_id
  FROM event U1
  WHERE U1.event_type = 'type_2' )
   AND lead.id IN
 ( SELECT U1.lead_id AS lead_id
  FROM event U1
  WHERE UPPER(U1.response::text) LIKE
UPPER('%success%') ) ) ) )
   AND NOT (lead.ReferenceNumber = '') ) ;

explain/analyze result:

Seq Scan on lead  (cost=130951.81..158059.21 rows=139957 width=369) (actual
time=4699.619..4699.869 rows=1 loops=1)
Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text 
''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3
Rows Removed by Filter: 375369

SubPlan 1

  -  Seq Scan on event u1  (cost=0.00..42408.62 rows=7748 width=4)
(actual time=0.005..171.350 rows=7414 loops=1)
Filter: ((event_type)::text = 'type_1'::text)

Rows Removed by Filter: 1099436

SubPlan 2

  -  Seq Scan on event u1_1  (cost=0.00..42408.62 rows=375665 width=4)
(actual time=0.006..219.092 rows=373298 loops=1)
Filter: ((event_type)::text = 'type_2'::text)

Rows Removed by Filter: 733552

SubPlan 3

  -  Seq Scan on event u1_2  (cost=0.00..45175.75 rows=111 width=4)
(actual time=0.040..3389.550 rows=712952 loops=1)
Filter: (upper(response) ~~ '%SUCCESS%'::text)

Rows Removed by Filter: 393898

The main thing that sticks out to me for this plan is the low estimate for
the rows it will return on the %SUCCESS% filter.

Here is the second query with explain:

SELECT *
FROM lead
WHERE
(
 NOT
 (
  (lead.id IN
   (
SELECT U1.lead_id AS lead_id
FROM event U1
WHERE U1.event_type ='type_1'
   )
   OR
   (lead.id IN
(
 SELECT U1.lead_id AS lead_id
 FROM event U1
 WHERE U1.event_type = 'type_2
 )
   AND lead.id IN
(
 SELECT U1.lead_id AS lead_id
 FROM event U1
 WHERE successful
)
   )
  )
 )
 AND NOT  (lead.ReferenceNumber = '')
) ;

explain result:
  Seq Scan on lead  (cost=85775.78..9005687281.12 rows=139957 width=369)

Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text 
''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3
SubPlan 1

  -  Seq Scan on event u1  (cost=0.00..42408.62 rows=7748 width=4)

Filter: ((event_type)::text = 'type_1'::text)

SubPlan 2

  -  Seq Scan on event u1_1  (cost=0.00..42408.62 rows=375665 width=4)

Filter: ((event_type)::text = 'type_2'::text)

SubPlan 3

  -  Materialize  (cost=0.00..46154.43 rows=731185 width=4)

-  Seq Scan on event u1_2  (cost=0.00..39641.50 rows=731185
width=4)
  Filter: successful

Here is does a materialize and estimates rows properly, but as stated this
query just hangs and pegs load.  There are no locks and its in an active
state the whole time.  I am running these queries in a test environment on
a recently exported full schema from production, with a reindex and a
vacuum/analyze.  This is postgres 9.3.6 on rhel6.

When I run just the different subquery element:
SELECT U1.lead_id AS lead_id
   FROM event U1

  WHERE successful;

it returns in about 250ms, with the text field %SUCCESS% it runs in about 4
seconds.  This seemed like a low hanging fruit query improvement so I'm
surprised its not working, it seems like we are just lucky that the planner
is estimating that filter incorrectly in the original form.

I'm sure the query just needs to be completely overhauled and am starting
to pull it apart and work with the engineers to get something more
efficient set up overall, but I am not sure how to answer the question as
to why this original attempt at improving the query is not successful.

Any guidance is greatly appreciated, thanks!


Re: [PERFORM] wildcard text filter switched to boolean column, performance is way worse

2015-07-07 Thread Mike Broers
After bumping up work_mem from 12MB to 25MB that last materialize is indeed
hashing and this cut the query time by about 60%.  Thanks, this was very
helpful and gives me something else to look for when troubleshooting
explains.



On Tue, Jul 7, 2015 at 11:10 AM, Mike Broers mbro...@gmail.com wrote:

 Thanks, very informative! I'll experiment with work_mem settings and
 report back.

 On Tue, Jul 7, 2015 at 11:02 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Mike Broers mbro...@gmail.com writes:
  I had a query that was filtering with a wildcard search of a text field
 for
  %SUCCESS%. The query took about 5 seconds and was running often so I
 wanted
  to improve it.  I suggested that the engineers include a new boolean
 column
  for successful status.  They implemented the requested field, but the
 query
  that filters on that new column runs very long (i kill it after letting
 it
  run for about an hour).  Can someone help me understand why that is the
  case and how to resolve it?

 It's hashing the subplan output in the first case and not the second:

  Seq Scan on lead  (cost=130951.81..158059.21 rows=139957 width=369)
 (actual
  time=4699.619..4699.869 rows=1 loops=1)
  Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text 
  ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3

 vs

Seq Scan on lead  (cost=85775.78..9005687281.12 rows=139957
 width=369)
  Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text 
  ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3
^

 Presumably, the new more-accurate rows count causes the planner to realize
 that the hash table will exceed work_mem so it doesn't choose to hash ...
 but for your situation, you'd rather it did, because what you're getting
 instead is a Materialize node that spills to disk (again, because the data
 involved exceeds work_mem) and that's a killer for this query.  You should
 be able to get back the old behavior if you raise work_mem enough.

 Another idea you might think about is changing the OR'd IN conditions
 to a single IN over a UNION ALL of the subselects.  I'm not really sure if
 that would produce a better plan, but it's worth trying if it wouldn't
 require too much app-side contortion.

 regards, tom lane





Re: [PERFORM] wildcard text filter switched to boolean column, performance is way worse

2015-07-07 Thread Mike Broers
Thanks, very informative! I'll experiment with work_mem settings and report
back.

On Tue, Jul 7, 2015 at 11:02 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Mike Broers mbro...@gmail.com writes:
  I had a query that was filtering with a wildcard search of a text field
 for
  %SUCCESS%. The query took about 5 seconds and was running often so I
 wanted
  to improve it.  I suggested that the engineers include a new boolean
 column
  for successful status.  They implemented the requested field, but the
 query
  that filters on that new column runs very long (i kill it after letting
 it
  run for about an hour).  Can someone help me understand why that is the
  case and how to resolve it?

 It's hashing the subplan output in the first case and not the second:

  Seq Scan on lead  (cost=130951.81..158059.21 rows=139957 width=369)
 (actual
  time=4699.619..4699.869 rows=1 loops=1)
  Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text 
  ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3

 vs

Seq Scan on lead  (cost=85775.78..9005687281.12 rows=139957 width=369)
  Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text 
  ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3
^

 Presumably, the new more-accurate rows count causes the planner to realize
 that the hash table will exceed work_mem so it doesn't choose to hash ...
 but for your situation, you'd rather it did, because what you're getting
 instead is a Materialize node that spills to disk (again, because the data
 involved exceeds work_mem) and that's a killer for this query.  You should
 be able to get back the old behavior if you raise work_mem enough.

 Another idea you might think about is changing the OR'd IN conditions
 to a single IN over a UNION ALL of the subselects.  I'm not really sure if
 that would produce a better plan, but it's worth trying if it wouldn't
 require too much app-side contortion.

 regards, tom lane