Re: [PERFORM] similarity and operator '%'

2016-05-30 Thread Jeff Janes
On Mon, May 30, 2016 at 10:53 AM, Volker Boehm  wrote:

> The reason for using the similarity function in place of the '%'-operator is
> that I want to use different similarity values in one query:
>
> select name, street, zip, city
> from addresses
> where name % $1
> and street % $2
> and (zip % $3 or city % $4)
> or similarity(name, $1) > 0.8

I think the best you can do through query writing is to use the
most-lenient setting in all places, and then refilter to get the less
lenient cutoff:

 select name, street, zip, city
 from addresses
 where name % $1
 and street % $2
 and (zip % $3 or city % $4)
 or (name % $1 and similarity(name, $1) > 0.8)
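
Concretely, the "most-lenient setting" here is pg_trgm's similarity
limit. A minimal sketch, assuming the stock pg_trgm helper functions:

 -- set the lenient cutoff once, so every % comparison (and thus the
 -- trigram indexes) matches at similarity > 0.3; the explicit
 -- similarity(name, $1) > 0.8 predicate then refilters the strict branch
 SELECT set_limit(0.3);   -- show_limit() reports the value in effect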

If it were really important to me to get maximum performance, what I
would do is alter/fork the pg_trgm extension so that it had another
operator, say %%%, with a hard-coded cutoff which paid no attention to
the set_limit().  I'm not really sure how the planner would deal with
that, though.
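
A SQL-level sketch of that idea, with hypothetical names - not the C
fork itself, and this form would not be index-backed, since only the
extension's own % operator is bound to the trigram GIN/GiST opclasses:

 -- wrapper with the cutoff hard-coded, ignoring set_limit()
 CREATE FUNCTION similarity_gt_08(text, text) RETURNS boolean
 LANGUAGE sql IMMUTABLE AS $$ SELECT similarity($1, $2) > 0.8 $$;

 CREATE OPERATOR %%% (
   LEFTARG = text,
   RIGHTARG = text,
   PROCEDURE = similarity_gt_08
 );

A real fork would also add %%% to the trigram operator classes so the
planner could match it to an index scan.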

Cheers,

Jeff




Re: [PERFORM] Re: Planner chooses slow index heap scan despite accurate row estimates

2016-05-30 Thread Jeff Janes
On Sat, May 28, 2016 at 5:38 PM, Jake Magner  wrote:
> Tom Lane-2 wrote:
>> Jake Magner <jakemagner90@> writes:
>>> I tried without doing an INSERT at all, just running the SELECT
>>> queries and the result is the same. Nested loop is chosen but is
>>> much slower.
>>
>> FWIW, I just noticed that the comparisons you're using are plain equality
>> of the arrays.  While a GIN array index supports that, it's not exactly
>> its strong suit: the sort of questions that index type supports well are
>> more like "which arrays contain value X?".  I wonder if it'd be worth
>> creating btree indexes on the array column.
>
> I added btree indexes and now the nested loop uses those and is a bit faster
> than the hash join. So the planner just misestimates the cost of doing the
> equality comparisons?

I wonder how it would do in 9.4?  Either in the comparisons actually
being faster, or in the planner doing a better job of realizing they
won't be fast.

> I'd prefer not to add more indexes, the hash join
> performance is fast enough if it would just choose that but I'm reluctant to
> turn off nested loops in case the table gets a lot bigger.

A large hash join just needs to divide itself up into batches.  It
should still be faster than the nested loop (as currently implemented),
until you run out of temp space.
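
An illustration of the batching behavior (hypothetical table names, not
from this thread) - a hash join that exceeds work_mem spills into
temp-file batches instead of failing, which EXPLAIN ANALYZE reports on
the Hash node:

 SET work_mem = '4MB';            -- a small limit forces multiple batches
 EXPLAIN (ANALYZE, BUFFERS)
 SELECT *
 FROM big_a a
 JOIN big_b b ON a.arr = b.arr;   -- the Hash node reports "Batches: N"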

But, you already have a solution in hand.  I agree you shouldn't add
more indexes without reason, but you do have a reason.
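
For the archives, a minimal sketch of the btree fix, with hypothetical
table/column names - a plain btree handles whole-array equality
directly, which is the comparison these queries actually perform:

 -- btree on the array column itself; supports = / < / > on whole arrays
 CREATE INDEX my_table_arr_col_btree ON my_table (arr_col);

 -- the kind of equality probe the planner was misestimating
 SELECT * FROM my_table WHERE arr_col = ARRAY[1, 2, 3];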

Cheers,

Jeff




Re: [PERFORM] similarity and operator '%'

2016-05-30 Thread David G. Johnston
On Mon, May 30, 2016 at 1:53 PM, Volker Boehm  wrote:

>
> The reason for using the similarity function in place of the '%'-operator
> is that I want to use different similarity values in one query:
>
> select name, street, zip, city
> from addresses
> where name % $1
> and street % $2
> and (zip % $3 or city % $4)
> or similarity(name, $1) > 0.8
>
> which means: take all addresses where name, street, zip and city have
> little similarity _plus_ all addresses where the name matches very well.
>
>
> The only way I found was to create a temporary table from the first
> query, change the similarity value with set_limit() and then select the
> second query UNION the temporary table.
>
> Is there a more elegant and straight forward way to achieve this result?
>

Not that I can envision.

You are forced into using an operator due to our index implementation.

You are thus forced into using a GUC to control the parameter that the
index scanning function uses to compute true/false.

A GUC can only take on a single value within a given query - well, not
quite true[1], but the exception doesn't seem like it will help here.

Thus you are consigned to using two queries.

* A functional index doesn't work since the second argument is query
specific.

[1] When defining a function you can attach a "SET" clause to it;
commonly used for search_path, but it should work with any GUC.  If you
could wrap the operator comparison into a custom function you could use
this capability.  It would also require a function that takes the
threshold as a value - the extension only provides variations that use
the GUC.

I don't think this will use the index even if it compiles (not tested):

CREATE FUNCTION similarity_80(col text, val text)
RETURNS boolean
SET pg_trgm.similarity_threshold = 0.80  -- GUC spelling in 9.6+; older releases only expose set_limit()
LANGUAGE sql
AS $$
SELECT col % val;
$$;

David J.


[PERFORM] similarity and operator '%'

2016-05-30 Thread Volker Boehm

Hello,
I'm trying to find persons in an address database where I have built 
trgm-indexes on name, street, zip and city.


When I search for all four parts of the address (name, street, zip and city)

select name, street, zip, city
from addresses
where name % $1
  and street % $2
  and (zip % $3 or city % $4)

everything works fine: It takes less than a second to get some (5 - 500) 
proposed addresses out of 500,000 addresses and the query plan shows


Bitmap Heap Scan on addresses  (cost=168.31..1993.38 rows=524 ...
  Recheck Cond: ...
  ->  Bitmap Index Scan on ...
Index Cond: ...

The same happens when I search only by name with

select name, street, zip, city
from addresses
where name % $1

But when I rewrite this query to

select name, street, zip, city
from addresses
where similarity(name, $1) > 0.3

which means exactly the same as the second example, the query plan
changes to


Seq Scan on addresses  (cost=0.00..149714.42 rows=174675 width=60)
  Filter: ...

and the query lasts about a minute.

The reason for using the similarity function in place of the 
'%'-operator is that I want to use different similarity values in one query:


select name, street, zip, city
from addresses
where name % $1
and street % $2
and (zip % $3 or city % $4)
or similarity(name, $1) > 0.8

which means: take all addresses where name, street, zip and city have
little similarity _plus_ all addresses where the name matches very well.



The only way I found was to create a temporary table from the first
query, change the similarity value with set_limit() and then select the
second query UNION the temporary table.
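
Spelled out, the workaround looks roughly like this (a sketch, reusing
the $1..$4 placeholders and pg_trgm's set_limit()):

 SELECT set_limit(0.3);
 CREATE TEMP TABLE loose AS
 SELECT name, street, zip, city
 FROM addresses
 WHERE name % $1 AND street % $2 AND (zip % $3 OR city % $4);

 SELECT set_limit(0.8);
 SELECT name, street, zip, city FROM addresses WHERE name % $1
 UNION
 SELECT name, street, zip, city FROM loose;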


Is there a more elegant and straight forward way to achieve this result?

regards  Volker

--
Volker Böhm            Tel.: +49 4141 981155
Voßkuhl 5              mailto:vol...@vboehm.de
21682 Stade            http://www.vboehm.de




Re: [PERFORM] Performance problems with 9.2.15

2016-05-30 Thread Johan Fredriksson
> > The rowcount estimates from 9.2 seem greatly different from the 8.4 plan.
> > Did you remember to ANALYZE all the tables after migrating?  Maybe there
> > were some table-specific statistics targets that you forgot to transfer
> > over?  In any case, the 9.2 plan looks like garbage-in-garbage-out to
> > me :-( ... without estimates at least a little closer to reality, the
> > planner is unlikely to do anything very sane.
> > 
> > (BTW, I wonder why you are moving only to 9.2 and not something more
> > recent.)
> 
> You put me on the right track with your conclusion that the estimates
> were off the chart. The quick-and-dirty fix "DELETE FROM pg_statistic;"
> solved this problem. This database now has to build up sane estimates
> from scratch.

Actually it took a VACUUM FULL and a DELETE FROM pg_statistic, followed
by ANALYZE on all tables, to get it right.
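
As a sketch, that full recovery sequence (DELETE FROM pg_statistic
needs superuser, and VACUUM FULL locks and rewrites every table, so run
it in a maintenance window):

 VACUUM FULL;
 DELETE FROM pg_statistic;   -- throw away the migrated, stale statistics
 ANALYZE;                    -- rebuild estimates for all tables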

Can someone please explain to me the difference between these two query
plans:

The bad one:
 Unique  (cost=6037.10..6037.18 rows=1 width=434) (actual time=255608.588..255646.828 rows=572 loops=1)
   ->  Sort  (cost=6037.10..6037.11 rows=1 width=434) (actual time=255608.583..255611.632 rows=33209 loops=1)
         Sort Method: quicksort  Memory: 13143kB
         ->  Nested Loop  (cost=42.51..6037.09 rows=1 width=434) (actual time=152.818..254886.674 rows=33209 loops=1)
               Join Filter: (cachedgroupmembers_4.groupid = acl_3.principalid)
               Rows Removed by Join Filter: 495425041
               ->  Bitmap Heap Scan on public.acl acl_3  (cost=30.07..144.35 rows=497 width=4) (actual time=0.284..8.184 rows=525 loops=1)
                     Recheck Cond: ((((acl_3.rightname)::text = 'OwnTicket'::text) AND ((acl_3.principaltype)::text = 'Group'::text)) OR (((acl_3.rightname)::text = 'SuperUser'::text) AND ((acl_3.principaltype)::text = 'Group'::text)))
                     Filter: (((acl_3.objecttype)::text = 'RT::Queue'::text) OR (((acl_3.objecttype)::text = 'RT::System'::text) AND (acl_3.objectid = 1)))
                     ->  BitmapOr  (cost=30.07..30.07 rows=531 width=0) (actual time=0.249..0.249 rows=0 loops=1)
                           ->  Bitmap Index Scan on acl1  (cost=0.00..25.46 rows=521 width=0) (actual time=0.233..0.233 rows=521 loops=1)
                                 Index Cond: (((acl_3.rightname)::text = 'OwnTicket'::text) AND ((acl_3.principaltype)::text = 'Group'::text))
                           ->  Bitmap Index Scan on acl1  (cost=0.00..4.36 rows=11 width=0) (actual time=0.016..0.016 rows=4 loops=1)
                                 Index Cond: (((acl_3.rightname)::text = 'SuperUser'::text) AND ((acl_3.principaltype)::text = 'Group'::text))
               ->  Materialize  (cost=12.44..5870.39 rows=3 width=438) (actual time=0.004..176.296 rows=943730 loops=525)
                     ->  Nested Loop  (cost=12.44..5870.37 rows=3 width=438) (actual time=0.351..1028.683 rows=943730 loops=1)
                           ->  Nested Loop  (cost=12.44..5601.49 rows=2 width=442) (actual time=0.326..15.591 rows=675 loops=1)
                                 ->  Nested Loop  (cost=12.44..5502.26 rows=27 width=8) (actual time=0.303..9.744 rows=675 loops=1)
                                       Output: principals_1.id, cachedgroupmembers_2.memberid
                                       ->  Bitmap Heap Scan on public.cachedgroupmembers cachedgroupmembers_2  (cost=12.44..1659.12 rows=446 width=4) (actual time=0.267..1.266 rows=676 loops=1)
                                             Recheck Cond: (cachedgroupmembers_2.groupid = 4)
                                             Filter: (cachedgroupmembers_2.disabled = 0)
                                             ->  Bitmap Index Scan on cachedgroupmembers2  (cost=0.00..12.33 rows=446 width=0) (actual time=0.171..0.171 rows=676 loops=1)
                                                   Index Cond: (cachedgroupmembers_2.groupid = 4)
                                       ->  Index Scan using principals_pkey on public.principals principals_1  (cost=0.00..8.61 rows=1 width=4) (actual time=0.011..0.011 rows=1 loops=676)
                                             Output: principals_1.id
                                             Index Cond: (principals_1.id = cachedgroupmembers_2.memberid)
                                             Filter: ((principals_1.id <> 1) AND (principals_1.disabled = 0) AND ((principals_1.principaltype)::text = 'User'::text))
                                             Rows Removed by Filter: 0
                                 ->  Index Scan using users_pkey on public.users main  (cost=0.00..3.67 rows=1 width=434) (actual time=0.007..0.008 rows=1 loops=675)
                                       Index Cond: (main.id = principals_1.id)
                           ->  Index Scan using cachedgroupmembers1 on public.cachedgroupmembers cachedgroupmembers_4  (cost=0.00..133.79 rows=65 width=8) (actual time=0.012..1.199 rows=1398 loops=675)

Re: [PERFORM] Performance problems with 9.2.15

2016-05-30 Thread Johan Fredriksson
> > I am just about to upgrade from PostgreSQL 8.4.20 to 9.2.15, but I've
> > run into some huge performance issues.
> 
> The rowcount estimates from 9.2 seem greatly different from the 8.4 plan.
> Did you remember to ANALYZE all the tables after migrating?  Maybe there
> were some table-specific statistics targets that you forgot to transfer
> over?  In any case, the 9.2 plan looks like garbage-in-garbage-out to
> me :-( ... without estimates at least a little closer to reality, the
> planner is unlikely to do anything very sane.
> 
> (BTW, I wonder why you are moving only to 9.2 and not something more
> recent.)

You put me on the right track with your conclusion that the estimates
were off the chart. The quick-and-dirty fix "DELETE FROM pg_statistic;"
solved this problem. This database now has to build up sane estimates
from scratch.


/ Eskil



