Re: [PERFORM] similarity and operator '%'
On Mon, May 30, 2016 at 10:53 AM, Volker Boehm wrote:
> The reason for using the similarity function in place of the '%'-operator is
> that I want to use different similarity values in one query:
>
>     select name, street, zip, city
>         from addresses
>         where name % $1
>             and street % $2
>             and (zip % $3 or city % $4)
>             or similarity(name, $1) > 0.8

I think the best you can do through query writing is to use the most-lenient
setting in all places, and then refilter to get the less lenient cutoff:

    select name, street, zip, city
        from addresses
        where name % $1
            and street % $2
            and (zip % $3 or city % $4)
            or (name % $1 and similarity(name, $1) > 0.8)

If it were really important to me to get maximum performance, what I would do
is alter/fork the pg_trgm extension so that it had another operator, say %%%,
with a hard-coded cutoff which paid no attention to set_limit().  I'm not
really sure how the planner would deal with that, though.

Cheers,

Jeff

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
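Concretely, Jeff's rewrite could be run like this. This is only a sketch: it assumes pg_trgm is installed with its trigram indexes in place, and uses 0.3 (the extension's default) as the lenient threshold.

```sql
-- Set the GUC-backed threshold to the most-lenient value needed anywhere
-- in the query; the '%' operator (and thus the trigram indexes) will use it.
SELECT set_limit(0.3);

-- The stricter 0.8 cutoff is applied as a refilter on top of an
-- index-friendly '%' condition.
SELECT name, street, zip, city
  FROM addresses
 WHERE name % $1
   AND street % $2
   AND (zip % $3 OR city % $4)
    OR (name % $1 AND similarity(name, $1) > 0.8);
```

The point of the extra `name % $1` in the last branch is that it is indexable, so the planner can satisfy the whole OR with bitmap index scans before the exact `similarity()` recheck runs.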
Re: [PERFORM] Re: Planner chooses slow index heap scan despite accurate row estimates
On Sat, May 28, 2016 at 5:38 PM, Jake Magner wrote:
> Tom Lane-2 wrote
>> Jake Magner <jakemagner90@> writes:
>>> I tried without doing an INSERT at all, just running the SELECT queries
>>> and the result is the same. Nested loop is chosen but is much slower.
>>
>> FWIW, I just noticed that the comparisons you're using are plain equality
>> of the arrays.  While a GIN array index supports that, it's not exactly
>> its strong suit: the sort of questions that index type supports well are
>> more like "which arrays contain value X?".  I wonder if it'd be worth
>> creating btree indexes on the array column.
>
> I added btree indexes and now the nested loop uses those and is a bit faster
> than the hash join. So the planner just misestimates the cost of doing the
> equality comparisons?

I wonder how it would do in 9.4?  Either in them actually being faster, or
the planner doing a better job of realizing they won't be fast.

> I'd prefer not to add more indexes, the hash join performance is fast
> enough if it would just choose that but I'm reluctant to turn off nested
> loops in case the table gets a lot bigger.

A large hash join just needs to divide it up into batches.  It should still
be faster than the nested loop (as currently implemented), until you run out
of temp space.

But, you already have a solution in hand.  I agree you shouldn't add more
indexes without reason, but you do have a reason.

Cheers,

Jeff
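A minimal sketch of the btree-on-array suggestion. The table and column names here are hypothetical, since the thread never shows the actual schema.

```sql
-- Hypothetical schema standing in for the one discussed above.
CREATE TABLE items (id int PRIMARY KEY, tags int[]);

-- GIN excels at containment questions ("which arrays contain value X?"):
CREATE INDEX items_tags_gin ON items USING gin (tags);
--   e.g.  SELECT * FROM items WHERE tags @> ARRAY[42];

-- A plain btree index on the array column directly supports whole-array
-- equality, which is what the join in the thread was doing:
CREATE INDEX items_tags_btree ON items USING btree (tags);
--   e.g.  SELECT * FROM items WHERE tags = ARRAY[1, 2, 3];
```

The btree index orders whole array values, so an equality probe is a single descent, whereas the GIN index must look up every element and intersect posting lists before a recheck.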
Re: [PERFORM] similarity and operator '%'
On Mon, May 30, 2016 at 1:53 PM, Volker Boehm wrote:
>
> The reason for using the similarity function in place of the '%'-operator
> is that I want to use different similarity values in one query:
>
>     select name, street, zip, city
>         from addresses
>         where name % $1
>             and street % $2
>             and (zip % $3 or city % $4)
>             or similarity(name, $1) > 0.8
>
> which means: take all addresses where name, street, zip and city have
> little similarity _plus_ all addresses where the name matches very well.
>
> The only way I found was to create a temporary table from the first
> query, change the similarity value with set_limit() and then select the
> second query UNION the temporary table.
>
> Is there a more elegant and straightforward way to achieve this result?

Not that I can envision.

You are forced into using an operator due to our index implementation.  You
are thus forced into using a GUC to control the parameter that the index
scanning function uses to compute true/false.  A GUC can only take on a
single value within a given query - well, not quite true [1], but the
exception doesn't seem like it will help here.  Thus you are consigned to
using two queries.

A functional index doesn't work either, since the second argument is query
specific.

[1] When defining a function you can attach a "SET" clause to it; commonly
used for search_path but should work with any GUC.  If you could wrap the
operator comparison into a custom function you could use this capability.
It would also require a function that takes the threshold as a value - the
extension only provides variations that use the GUC.  I don't think this
will use the index even if it compiles (not tested):

CREATE FUNCTION similarity_80(col text, val text)
RETURNS boolean
SET pg_trgm.similarity_threshold = 0.80
LANGUAGE sql
AS $$
SELECT col % val;
$$;

David J.
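If a wrapper function along David's lines did compile, the mixed-threshold query could be written in one statement as below. This is purely a sketch under his same untested assumptions, and, as he notes, the function call is unlikely to use the trigram index.

```sql
-- Hypothetical usage of the similarity_80() wrapper sketched above.
-- The '%' branches can still use the trigram indexes at the lenient
-- set_limit() value; the similarity_80() branch will likely be a recheck
-- over a sequential scan.
SELECT name, street, zip, city
  FROM addresses
 WHERE name % $1
   AND street % $2
   AND (zip % $3 OR city % $4)
    OR similarity_80(name, $1);
```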
[PERFORM] similarity and operator '%'
Hello,

I'm trying to find persons in an address database where I have built
trgm-indexes on name, street, zip and city.

When I search for all four parts of the address (name, street, zip and city)

    select name, street, zip, city
        from addresses
        where name % $1
            and street % $2
            and (zip % $3 or city % $4)

everything works fine: It takes less than a second to get some (5 - 500)
proposed addresses out of 500,000 addresses, and the query plan shows

    Bitmap Heap Scan on addresses  (cost=168.31..1993.38 rows=524 ...
      Recheck Cond: ...
      ->  Bitmap Index Scan on ...
            Index Cond: ...

The same happens when I search only by name with

    select name, street, zip, city
        from addresses
        where name % $1

But when I rewrite this query to

    select name, street, zip, city
        from addresses
        where similarity(name, $1) > 0.3

which means exactly the same as the second example, the query plan changes to

    Seq Scan on addresses  (cost=0.00..149714.42 rows=174675 width=60)
      Filter: ...

and the query lasts about a minute.

The reason for using the similarity function in place of the '%'-operator is
that I want to use different similarity values in one query:

    select name, street, zip, city
        from addresses
        where name % $1
            and street % $2
            and (zip % $3 or city % $4)
            or similarity(name, $1) > 0.8

which means: take all addresses where name, street, zip and city have
little similarity _plus_ all addresses where the name matches very well.

The only way I found was to create a temporary table from the first query,
change the similarity value with set_limit() and then select the second
query UNION the temporary table.

Is there a more elegant and straightforward way to achieve this result?

regards Volker

-- 
Volker Böhm          Tel.: +49 4141 981155
Voßkuhl 5            mailto:vol...@vboehm.de
21682 Stade          http://www.vboehm.de
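The temporary-table workaround described above could be sketched as follows; the temp table name is made up, and 0.3 is assumed as the lenient threshold.

```sql
-- Pass 1: lenient threshold, match on all four address parts.
SELECT set_limit(0.3);

CREATE TEMP TABLE loose_matches AS
  SELECT name, street, zip, city
    FROM addresses
   WHERE name % $1
     AND street % $2
     AND (zip % $3 OR city % $4);

-- Pass 2: strict threshold, match on name alone, then combine.
SELECT set_limit(0.8);

SELECT name, street, zip, city
  FROM addresses
 WHERE name % $1
UNION
SELECT name, street, zip, city
  FROM loose_matches;
```

Both passes keep the indexable `%` operator, so each one gets a bitmap index scan; the price is running two queries and materializing the intermediate result.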
Re: [PERFORM] Performance problems with 9.2.15
> > The rowcount estimates from 9.2 seem greatly different from the 8.4 plan.
> > Did you remember to ANALYZE all the tables after migrating?  Maybe there
> > were some table-specific statistics targets that you forgot to transfer
> > over?  In any case, the 9.2 plan looks like garbage-in-garbage-out to
> > me :-( ... without estimates at least a little closer to reality, the
> > planner is unlikely to do anything very sane.
> >
> > (BTW, I wonder why you are moving only to 9.2 and not something more
> > recent.)
>
> You put me on the right track with your conclusion that the estimates
> were off the chart.  The quick-and-dirty fix "DELETE FROM pg_statistic;"
> solved this problem.  This database now has to build up sane estimates
> from scratch.

Actually it took a VACUUM FULL; and a DELETE FROM pg_statistic; followed by
an ANALYZE on all tables to get it right.

Can someone please explain to me the difference between these two query
plans:

The bad one:

 Unique  (cost=6037.10..6037.18 rows=1 width=434) (actual time=255608.588..255646.828 rows=572 loops=1)
   ->  Sort  (cost=6037.10..6037.11 rows=1 width=434) (actual time=255608.583..255611.632 rows=33209 loops=1)
         Sort Method: quicksort  Memory: 13143kB
         ->  Nested Loop  (cost=42.51..6037.09 rows=1 width=434) (actual time=152.818..254886.674 rows=33209 loops=1)
               Join Filter: (cachedgroupmembers_4.groupid = acl_3.principalid)
               Rows Removed by Join Filter: 495425041
               ->  Bitmap Heap Scan on public.acl acl_3  (cost=30.07..144.35 rows=497 width=4) (actual time=0.284..8.184 rows=525 loops=1)
                     Recheck Cond: ((((acl_3.rightname)::text = 'OwnTicket'::text) AND ((acl_3.principaltype)::text = 'Group'::text)) OR (((acl_3.rightname)::text = 'SuperUser'::text) AND ((acl_3.principaltype)::text = 'Group'::text)))
                     Filter: (((acl_3.objecttype)::text = 'RT::Queue'::text) OR (((acl_3.objecttype)::text = 'RT::System'::text) AND (acl_3.objectid = 1)))
                     ->  BitmapOr  (cost=30.07..30.07 rows=531 width=0) (actual time=0.249..0.249 rows=0 loops=1)
                           ->  Bitmap Index Scan on acl1  (cost=0.00..25.46 rows=521 width=0) (actual time=0.233..0.233 rows=521 loops=1)
                                 Index Cond: (((acl_3.rightname)::text = 'OwnTicket'::text) AND ((acl_3.principaltype)::text = 'Group'::text))
                           ->  Bitmap Index Scan on acl1  (cost=0.00..4.36 rows=11 width=0) (actual time=0.016..0.016 rows=4 loops=1)
                                 Index Cond: (((acl_3.rightname)::text = 'SuperUser'::text) AND ((acl_3.principaltype)::text = 'Group'::text))
               ->  Materialize  (cost=12.44..5870.39 rows=3 width=438) (actual time=0.004..176.296 rows=943730 loops=525)
                     ->  Nested Loop  (cost=12.44..5870.37 rows=3 width=438) (actual time=0.351..1028.683 rows=943730 loops=1)
                           ->  Nested Loop  (cost=12.44..5601.49 rows=2 width=442) (actual time=0.326..15.591 rows=675 loops=1)
                                 ->  Nested Loop  (cost=12.44..5502.26 rows=27 width=8) (actual time=0.303..9.744 rows=675 loops=1)
                                       Output: principals_1.id, cachedgroupmembers_2.memberid
                                       ->  Bitmap Heap Scan on public.cachedgroupmembers cachedgroupmembers_2  (cost=12.44..1659.12 rows=446 width=4) (actual time=0.267..1.266 rows=676 loops=1)
                                             Recheck Cond: (cachedgroupmembers_2.groupid = 4)
                                             Filter: (cachedgroupmembers_2.disabled = 0)
                                             ->  Bitmap Index Scan on cachedgroupmembers2  (cost=0.00..12.33 rows=446 width=0) (actual time=0.171..0.171 rows=676 loops=1)
                                                   Index Cond: (cachedgroupmembers_2.groupid = 4)
                                       ->  Index Scan using principals_pkey on public.principals principals_1  (cost=0.00..8.61 rows=1 width=4) (actual time=0.011..0.011 rows=1 loops=676)
                                             Output: principals_1.id
                                             Index Cond: (principals_1.id = cachedgroupmembers_2.memberid)
                                             Filter: ((principals_1.id <> 1) AND (principals_1.disabled = 0) AND ((principals_1.principaltype)::text = 'User'::text))
                                             Rows Removed by Filter: 0
                                 ->  Index Scan using users_pkey on public.users main  (cost=0.00..3.67 rows=1 width=434) (actual time=0.007..0.008 rows=1 loops=675)
                                       Index Cond: (main.id = principals_1.id)
                           ->  Index Scan using cachedgroupmembers1 on public.cachedgroupmembers cachedgroupmembers_4  (cost=0.00..133.79 rows=65 width=8) (actual time=0.012..1.199 rows=1398 loops=675)
Re: [PERFORM] Performance problems with 9.2.15
> > I am just about to upgrade from PostgreSQL 8.4.20 to 9.2.15, but I've run
> > into some huge performance issues.
>
> The rowcount estimates from 9.2 seem greatly different from the 8.4 plan.
> Did you remember to ANALYZE all the tables after migrating?  Maybe there
> were some table-specific statistics targets that you forgot to transfer
> over?  In any case, the 9.2 plan looks like garbage-in-garbage-out to
> me :-( ... without estimates at least a little closer to reality, the
> planner is unlikely to do anything very sane.
>
> (BTW, I wonder why you are moving only to 9.2 and not something more
> recent.)

You put me on the right track with your conclusion that the estimates were
off the chart.  The quick-and-dirty fix "DELETE FROM pg_statistic;" solved
this problem.  This database now has to build up sane estimates from
scratch.

/ Eskil
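For reference, the usual way to apply Tom's advice after a migration is to regenerate the statistics rather than delete them. A sketch, with hypothetical table and column names for the per-column targets:

```sql
-- Rebuild planner statistics for the whole database after migrating:
VACUUM ANALYZE;

-- Re-apply any per-column statistics targets that existed on the old
-- cluster (names here are hypothetical), then re-analyze that table:
ALTER TABLE my_table ALTER COLUMN my_col SET STATISTICS 1000;
ANALYZE my_table;
```

A plain `ANALYZE` repopulates pg_statistic directly, so deleting its contents first is normally unnecessary and leaves the planner flying blind until the next analyze run.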