Re: Using regexp from table has unpredictable poor performance

2021-08-25 Thread Vitalii Tymchyshyn
Btw: if you still run out of cache later with more regexes, maybe it makes
sense to do prefiltering first by making a single gigantic regexp as
string_agg('('||name_matches||')', '|') and then only filter the ones that
match later. If PostgreSQL provides capturing groups you may even be able
to explode the result without a postfilter.
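The combine-and-dispatch idea can be sketched outside the database; here Python's re engine stands in for PostgreSQL's, and the rule patterns are made up. Each rule's pattern is wrapped in a capturing group and joined with '|', mirroring the string_agg expression above; the index of the group that matched tells you which rule fired. This assumes the rule patterns contain no capturing groups of their own, and it reports only one matching rule per name, so it suits the prefilter case rather than replacing the full join.

```python
import re

def combine(patterns):
    # Mirror string_agg('('||name_matches||')', '|'): wrap each rule's
    # pattern in a capturing group and join them with alternation.
    return re.compile("|".join("(%s)" % p for p in patterns))

def first_matching_rule(combined, name):
    # Return the 0-based index of the rule that matched, or None.
    # Works only if the rule patterns themselves contain no capturing
    # groups, so group numbers map one-to-one onto rules.
    m = combined.search(name)
    return None if m is None else m.lastindex - 1

# Hypothetical rules; a real prefilter would test the combined pattern
# first and run the per-rule regexes only on names that matched.
rules = ["^foo", "bar$", "^baz$"]
combined = combine(rules)
```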

On Wed, Aug 25, 2021 at 2:22 PM Jack Christensen wrote:

> The optimizer was a bit too clever. It used the same plan for the LEFT
> JOIN. But that put me on the right track. I tried a LATERAL join. But the
> optimizer saw through that too and used the same plan. So I tried a
> materialized CTE and that finally forced it to use a different plan. That
> made it run in ~70ms -- about 18x faster. Thanks!
>
> explain analyze
> with r as materialized (
>   select * from matching_rules
>   where id >= 0 and id < 60
> )
> select r.id, i.id
> from r
>   join items i on i.name ~ r.name_matches
> ;
>
>  QUERY PLAN
>
> ─
>  Nested Loop  (cost=2.78..714.20 rows=230 width=8) (actual
> time=0.071..69.545 rows=702 loops=1)
>Join Filter: (i.name ~ r.name_matches)
>Rows Removed by Join Filter: 45298
>CTE r
>  ->  Seq Scan on matching_rules  (cost=0.00..2.78 rows=46 width=26)
> (actual time=0.007..0.047 rows=46 loops=1)
>Filter: ((id >= 0) AND (id < 60))
>Rows Removed by Filter: 6
>->  CTE Scan on r  (cost=0.00..0.92 rows=46 width=36) (actual
> time=0.008..0.090 rows=46 loops=1)
>->  Materialize  (cost=0.00..23.00 rows=1000 width=27) (actual
> time=0.000..0.081 rows=1000 loops=46)
>  ->  Seq Scan on items i  (cost=0.00..18.00 rows=1000 width=27)
> (actual time=0.003..0.092 rows=1000 loops=1)
>  Planning Time: 0.206 ms
>  Execution Time: 69.633 ms
>
>
> On Wed, Aug 25, 2021 at 4:05 PM Justin Pryzby 
> wrote:
>
>> On Wed, Aug 25, 2021 at 11:47:43AM -0500, Jack Christensen wrote:
>> > I have items that need to be categorized by user defined matching rules.
>> > Trusted users can create rules that include regular expressions. I've
>> > reduced the problem to this example.
>>
>> > I use the following query to find matches:
>> >
>> > select r.id, i.id
>> > from items i
>> >   join matching_rules r on i.name ~ r.name_matches;
>> >
>> > When there are few rules the query runs quickly. But as the number of
>> rules
>> > increases the runtime often increases at a greater than linear rate.
>>
>> Maybe it's because the REs are cached by RE_compile_and_cache(), but if
>> you
>> loop over the REs in the inner loop, then the caching is ineffective.
>>
>> Maybe you can force it to join with REs on the outer loop by writing it
>> as:
>> | rules LEFT JOIN items WHERE rules.id IS NOT NULL,
>> ..to improve performance, or at least test that theory.
>>
>> --
>> Justin
>>
>


Re: Using regexp from table has unpredictable poor performance

2021-08-25 Thread Jack Christensen
The optimizer was a bit too clever. It used the same plan for the LEFT
JOIN. But that put me on the right track. I tried a LATERAL join. But the
optimizer saw through that too and used the same plan. So I tried a
materialized CTE and that finally forced it to use a different plan. That
made it run in ~70ms -- about 18x faster. Thanks!

explain analyze
with r as materialized (
  select * from matching_rules
  where id >= 0 and id < 60
)
select r.id, i.id
from r
  join items i on i.name ~ r.name_matches
;

 QUERY PLAN
─
 Nested Loop  (cost=2.78..714.20 rows=230 width=8) (actual
time=0.071..69.545 rows=702 loops=1)
   Join Filter: (i.name ~ r.name_matches)
   Rows Removed by Join Filter: 45298
   CTE r
 ->  Seq Scan on matching_rules  (cost=0.00..2.78 rows=46 width=26)
(actual time=0.007..0.047 rows=46 loops=1)
   Filter: ((id >= 0) AND (id < 60))
   Rows Removed by Filter: 6
   ->  CTE Scan on r  (cost=0.00..0.92 rows=46 width=36) (actual
time=0.008..0.090 rows=46 loops=1)
   ->  Materialize  (cost=0.00..23.00 rows=1000 width=27) (actual
time=0.000..0.081 rows=1000 loops=46)
 ->  Seq Scan on items i  (cost=0.00..18.00 rows=1000 width=27)
(actual time=0.003..0.092 rows=1000 loops=1)
 Planning Time: 0.206 ms
 Execution Time: 69.633 ms


On Wed, Aug 25, 2021 at 4:05 PM Justin Pryzby  wrote:

> On Wed, Aug 25, 2021 at 11:47:43AM -0500, Jack Christensen wrote:
> > I have items that need to be categorized by user defined matching rules.
> > Trusted users can create rules that include regular expressions. I've
> > reduced the problem to this example.
>
> > I use the following query to find matches:
> >
> > select r.id, i.id
> > from items i
> >   join matching_rules r on i.name ~ r.name_matches;
> >
> > When there are few rules the query runs quickly. But as the number of
> rules
> > increases the runtime often increases at a greater than linear rate.
>
> Maybe it's because the REs are cached by RE_compile_and_cache(), but if you
> loop over the REs in the inner loop, then the caching is ineffective.
>
> Maybe you can force it to join with REs on the outer loop by writing it as:
> | rules LEFT JOIN items WHERE rules.id IS NOT NULL,
> ..to improve performance, or at least test that theory.
>
> --
> Justin
>


Re: Using regexp from table has unpredictable poor performance

2021-08-25 Thread Justin Pryzby
On Wed, Aug 25, 2021 at 11:47:43AM -0500, Jack Christensen wrote:
> I have items that need to be categorized by user defined matching rules.
> Trusted users can create rules that include regular expressions. I've
> reduced the problem to this example.

> I use the following query to find matches:
> 
> select r.id, i.id
> from items i
>   join matching_rules r on i.name ~ r.name_matches;
> 
> When there are few rules the query runs quickly. But as the number of rules
> increases the runtime often increases at a greater than linear rate.

Maybe it's because the REs are cached by RE_compile_and_cache(), but if you
loop over the REs in the inner loop, then the caching is ineffective.

Maybe you can force it to join with REs on the outer loop by writing it as:
| rules LEFT JOIN items WHERE rules.id IS NOT NULL,
..to improve performance, or at least test that theory.
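The cache effect is easy to imitate outside the database. The sketch below (Python, for illustration) uses functools.lru_cache as a stand-in for RE_compile_and_cache's fixed-size cache of compiled REs -- 32 slots here is an assumption -- and counts compilations for both loop orders. With more patterns than slots, putting the REs on the inner loop evicts each one before it comes around again, so every probe recompiles:

```python
import re
from functools import lru_cache

compile_count = {"n": 0}

@lru_cache(maxsize=32)  # stand-in for RE_compile_and_cache's bounded cache
def compile_cached(pattern):
    compile_count["n"] += 1
    return re.compile(pattern)

patterns = ["^item%03d" % i for i in range(64)]   # more REs than cache slots
names = ["item%03d" % i for i in range(100)]

# REs on the inner loop: each pattern is evicted before it is needed
# again, so every lookup is a fresh compile.
for name in names:
    for p in patterns:
        compile_cached(p).search(name)
inner_compiles = compile_count["n"]

# REs on the outer loop: each pattern is compiled once and reused for
# every name before the cache moves on.
compile_cached.cache_clear()
compile_count["n"] = 0
for p in patterns:
    rx = compile_cached(p)
    for name in names:
        rx.search(name)
outer_compiles = compile_count["n"]
```

The same reasoning explains the thread's numbers: two 30-rule queries stay within the cache, while one 60-rule query thrashes it on every item.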

-- 
Justin




Using regexp from table has unpredictable poor performance

2021-08-25 Thread Jack Christensen
I have items that need to be categorized by user defined matching rules.
Trusted users can create rules that include regular expressions. I've
reduced the problem to this example.

   Table "public.items"
 Column │  Type   │ Collation │ Nullable │ Default
┼─┼───┼──┼─
 id │ integer │   │ not null │
 name   │ text│   │ not null │
Indexes:
"items_pkey" PRIMARY KEY, btree (id)

  Table "public.matching_rules"
Column│  Type   │ Collation │ Nullable │ Default
──┼─┼───┼──┼─
 id   │ integer │   │ not null │
 name_matches │ text│   │ not null │
Indexes:
"matching_rules_pkey" PRIMARY KEY, btree (id)

I use the following query to find matches:

select r.id, i.id
from items i
  join matching_rules r on i.name ~ r.name_matches;

When there are few rules the query runs quickly. But as the number of rules
increases the runtime often increases at a greater than linear rate.

For example, if I run two queries, one that tests rule IDs 0 - 30 and another
that tests 30 - 60 the total runtime is less than 100ms. But if I instead
test rule IDs 0 - 60 in a single query the runtime balloons to over 1300ms.

explain analyze
select r.id, i.id
from items i
  join matching_rules r on i.name ~ r.name_matches
where r.id >= 0 and r.id < 30
;
─
 Nested Loop  (cost=0.00..260.82 rows=80 width=8) (actual
time=0.820..28.334 rows=172 loops=1)
   Join Filter: (i.name ~ r.name_matches)
   Rows Removed by Join Filter: 16828
   ->  Seq Scan on items i  (cost=0.00..18.00 rows=1000 width=27) (actual
time=0.006..0.176 rows=1000 loops=1)
   ->  Materialize  (cost=0.00..2.86 rows=16 width=26) (actual
time=0.000..0.001 rows=17 loops=1000)
 ->  Seq Scan on matching_rules r  (cost=0.00..2.78 rows=16
width=26) (actual time=0.004..0.012 rows=17 loops=1)
   Filter: ((id >= 0) AND (id < 30))
   Rows Removed by Filter: 35
 Planning Time: 0.086 ms
 Execution Time: 28.364 ms


explain analyze
select r.id, i.id
from items i
  join matching_rules r on i.name ~ r.name_matches
where r.id >= 30 and r.id < 60
;
   QUERY PLAN
─
 Nested Loop  (cost=0.00..470.86 rows=150 width=8) (actual
time=1.418..65.508 rows=530 loops=1)
   Join Filter: (i.name ~ r.name_matches)
   Rows Removed by Join Filter: 28470
   ->  Seq Scan on items i  (cost=0.00..18.00 rows=1000 width=27) (actual
time=0.007..0.193 rows=1000 loops=1)
   ->  Materialize  (cost=0.00..2.93 rows=30 width=26) (actual
time=0.000..0.002 rows=29 loops=1000)
 ->  Seq Scan on matching_rules r  (cost=0.00..2.78 rows=30
width=26) (actual time=0.005..0.020 rows=29 loops=1)
   Filter: ((id >= 30) AND (id < 60))
   Rows Removed by Filter: 23
 Planning Time: 0.076 ms
 Execution Time: 65.573 ms


explain analyze
select r.id, i.id
from items i
  join matching_rules r on i.name ~ r.name_matches
where r.id >= 0 and r.id < 60
;
   QUERY PLAN
─
 Nested Loop  (cost=0.00..710.89 rows=230 width=8) (actual
time=3.731..1344.834 rows=702 loops=1)
   Join Filter: (i.name ~ r.name_matches)
   Rows Removed by Join Filter: 45298
   ->  Seq Scan on items i  (cost=0.00..18.00 rows=1000 width=27) (actual
time=0.006..0.442 rows=1000 loops=1)
   ->  Materialize  (cost=0.00..3.01 rows=46 width=26) (actual
time=0.000..0.004 rows=46 loops=1000)
 ->  Seq Scan on matching_rules r  (cost=0.00..2.78 rows=46
width=26) (actual time=0.004..0.019 rows=46 loops=1)
   Filter: ((id >= 0) AND (id < 60))
   Rows Removed by Filter: 6
 Planning Time: 0.084 ms
 Execution Time: 1344.967 ms

It's also not predictable when additional regexp rows will trigger the poor
performance. There's no specific number of rows or kind of regexp I can
discern that triggers the issue. The regexps themselves are pretty trivial
too: only normal text, start and end of string anchors, and alternation.

I've vacuumed, analyzed, and I am on PostgreSQL 13.4 on
x86_64-apple-darwin20.4.0, compiled by Apple clang version 12.0.5
(clang-1205.0.22.9), 64-bit.

Any ideas what's causing this?

Thanks.

Jack