Re: [PERFORM] Inlining of functions (doing LIKE on an array)

2016-11-11 Thread l...@laurent-hasson.com
Thanks for the pointer on the "select exists" syntax Tom. Much appreciated. I 
couldn't figure it out! And as for normalizing, yes, thought about it, but the 
one-to-many relationship would make other scenarios we have more complex and 
slower. So I am juggling with trade-offs.

So, here are my findings. I did 10 runs for each of the 4 options I have 
arrived at. The runs were pretty consistent, within a few 10th's of a second 
off each other, so little variability. Not 100% scientific, but good enough for 
my test. I picked here the last run I had with the plans for illustration.

Take-aways:
---
   - The "select exists" (#3) approach is roughly 40% faster than "select 
count(*) > 0" (#1).
   - The SQL Function version (#3) Vs the plpgSQL function version (#2) of the 
same query performs better (~30%)
   - The inlined version (#4) is twice as fast (roughly) as the SQL version 
(#3).

I wish there were a way to force inlining, or some other mechanism as the 
performance difference is large here. I'll be using the inlining approach when 
possible, but the SQL Function approach is simpler and will likely be more 
suitable for some developers.

Details:
-
1- select count(*) > 0 as SQL
===
CREATE OR REPLACE FUNCTION MyLike2(text[], text) RETURNS boolean
  AS  'select count(*) > 0 from unnest($1) a where a like $2'
LANGUAGE SQL STRICT IMMUTABLE

EXPLAIN ANALYZE
select count(*) 
from cms.claims
where MyLike2("code", '427%')
--"Aggregate  (cost=609418.77..609418.78 rows=1 width=0) (actual 
time=8464.372..8464.372 rows=1 loops=1)"
--"  ->  Seq Scan on claims  (cost=0.00..607761.44 rows=662931 width=0) (actual 
time=0.077..8457.963 rows=85632 loops=1)"
--"Filter: MyLike2("code", '427%'::text)"
--"Rows Removed by Filter: 1851321"
--"Planning time: 0.131 ms"
--"Execution time: 8464.407 ms"

2- select exists as plpgSQL
===
CREATE OR REPLACE FUNCTION MyLike3(text[], text) RETURNS boolean
  AS  'begin return exists (select * from unnest($1) a where a like $2); end'
LANGUAGE plpgSQL STRICT IMMUTABLE

EXPLAIN ANALYZE
select count(*) 
from cms.claims
where MyLike3("code", '427%')
--"Aggregate  (cost=609418.77..609418.78 rows=1 width=0) (actual 
time=7708.945..7708.945 rows=1 loops=1)"
--"  ->  Seq Scan on claims  (cost=0.00..607761.44 rows=662931 width=0) (actual 
time=0.040..7700.528 rows=85632 loops=1)"
--"Filter: MyLike3("code", '427%'::text)"
--"Rows Removed by Filter: 1851321"
--"Planning time: 0.076 ms"
--"Execution time: 7708.975 ms"

3- select exists as SQL
===
CREATE OR REPLACE FUNCTION MyLike(text[], text) RETURNS boolean
  AS  'select exists (select * from unnest($1) a where a like $2)'
LANGUAGE SQL STRICT IMMUTABLE

EXPLAIN ANALYZE
select count(*) 
from cms.claims
where MyLike("code", '427%')
--"Aggregate  (cost=609418.77..609418.78 rows=1 width=0) (actual 
time=5524.690..5524.690 rows=1 loops=1)"
--"  ->  Seq Scan on claims  (cost=0.00..607761.44 rows=662931 width=0) (actual 
time=0.064..5515.886 rows=85632 loops=1)"
--"Filter: tilda."like"("code", '427%'::text)"
--"Rows Removed by Filter: 1851321"
--"Planning time: 0.097 ms"
--"Execution time: 5524.718 ms"

4- select exists inlined
===
EXPLAIN ANALYZE
select count(*) 
from cms.claims
where exists (select * from unnest("SECONDARY_ICD9_DGNS_CD") a where a like 
'427%')
--"Aggregate  (cost=2604013.42..2604013.43 rows=1 width=0) (actual 
time=2842.259..2842.259 rows=1 loops=1)"
--"  ->  Seq Scan on claims  (cost=0.00..2601527.42 rows=994397 width=0) 
(actual time=0.017..2837.122 rows=85632 loops=1)"
--"Filter: (SubPlan 1)"
--"Rows Removed by Filter: 1851321"
--"SubPlan 1"
--"  ->  Function Scan on unnest a  (cost=0.00..1.25 rows=1 width=0) 
(actual time=0.001..0.001 rows=0 loops=1936953)"
--"Filter: (a ~~ '427%'::text)"
--"Rows Removed by Filter: 2"
--"Planning time: 0.155 ms"
--"Execution time: 2842.311 ms"


Thank you,
Laurent Hasson

-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Friday, November 11, 2016 11:46
To: l...@laurent-hasson.com
Cc: Marc Mamin ; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Inlining of functions (doing LIKE on an array)

"l...@laurent-hasson.com"  writes:
> I tried "exists", but won't work in the Function, i.e., CREATE OR 
> REPLACE FUNCTION ArrayLike(text[], text) RETURNS bigint
>   AS  'exists (select * from unnest($1) a where a like $2)'
> LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF

Syntax and semantics problems.  This would work:

regression=# CREATE OR REPLACE FUNCTION ArrayLike(text[], text) RETURNS bool 
regression-# as 'select exists (select * from unnest($1) a where a like $2)'
regression-# LANGUAGE SQL STRICT IMMUTABLE; CREATE FUNCTION regression=# create 
table tt (f1 

Re: [PERFORM] Inlining of functions (doing LIKE on an array)

2016-11-11 Thread Jeff Janes
On Thu, Nov 10, 2016 at 10:54 PM, l...@laurent-hasson.com <
l...@laurent-hasson.com> wrote:

> Hello,
>
>
>
> I am trying to implement an efficient “like” over a text[]. I see a lot of
> people have tried before me and I learnt a lot through the forums.
>

Have you looked at parray_gin?

https://github.com/theirix/parray_gin

(Also on PGXN, but I don't know how up-to-date it is there)

Or you could create an regular pg_trgm index on the expression:

array_to_string("ICD9_DGNS_CD",'')

If you can find a safe delimiter to use (one that can't be part of the
text[]).

The performance of these options will depend on both the nature of your
data and the nature of your queries.

Cheers,

Jeff


Re: [PERFORM] Inlining of functions (doing LIKE on an array)

2016-11-11 Thread Tom Lane
"l...@laurent-hasson.com"  writes:
> I tried "exists", but won't work in the Function, i.e.,
> CREATE OR REPLACE FUNCTION ArrayLike(text[], text) RETURNS bigint
>   AS  'exists (select * from unnest($1) a where a like $2)'
> LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF

Syntax and semantics problems.  This would work:

regression=# CREATE OR REPLACE FUNCTION ArrayLike(text[], text) RETURNS bool
regression-# as 'select exists (select * from unnest($1) a where a like $2)'
regression-# LANGUAGE SQL STRICT IMMUTABLE;
CREATE FUNCTION
regression=# create table tt (f1 text[]);
CREATE TABLE
regression=# explain select * from tt where ArrayLike(f1, 'foo');
  QUERY PLAN   
---
 Seq Scan on tt  (cost=0.00..363.60 rows=453 width=32)
   Filter: arraylike(f1, 'foo'::text)
(2 rows)

But we don't inline SQL functions containing sub-selects, so you're still
stuck with the rather high overhead of a SQL function.  A plpgsql function
might be a bit faster:

CREATE OR REPLACE FUNCTION ArrayLike(text[], text) RETURNS bool
as 'begin return exists (select * from unnest($1) a where a like $2); end'
LANGUAGE plpgSQL STRICT IMMUTABLE;

BTW, I'd be pretty suspicious of marking this function leakproof,
because the underlying LIKE operator isn't leakproof according to
pg_proc.


> It's as expected though. As for the GIN indices, I tried and it didn't make a 
> difference, which I guess is expected as well because of the Like operator. I 
> don't expect regular indices to work on regular columns for Like operations, 
> especially '%xxx' ones, so I didn't expect GIN indices to work either for 
> Array columns with Like. Am I wrong?

Plain GIN index, probably not.  A pg_trgm index could help with LIKE
searches, but I don't think we have a variant of that for array columns.

Have you considered renormalizing the data so that you don't have
arrays?

regards, tom lane


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Any advice tuning this query ?

2016-11-11 Thread Andreas Karlsson
I have a couple of suggestions which should lead to some minor 
improvements, but in general I am surprised by the huge size of the 
result set. Is your goal really to get a 43 million row result? When a 
query returns that many rows usually all possible query plans are more 
or less bad.


1) You can remove "3" from the group by clause to avoid having to sort 
that data when we already sort by d.date.


2) If (books, date) is the primary key of dates_per_books we can also 
safely remove "4" from the group by clause further reducing the length 
of the keys that we need to sort.


3) For a minor speed up change "coalesce(sum(case when i.invno is not 
null then 1 else 0 end),0)" to "count(i.invno)".


Andreas


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Inlining of functions (doing LIKE on an array)

2016-11-11 Thread l...@laurent-hasson.com
I tried "exists", but won't work in the Function, i.e.,

CREATE OR REPLACE FUNCTION ArrayLike(text[], text) RETURNS bigint
  AS  'exists (select * from unnest($1) a where a like $2)'
LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF

It's as expected though. As for the GIN indices, I tried and it didn't make a 
difference, which I guess is expected as well because of the Like operator. I 
don't expect regular indices to work on regular columns for Like operations, 
especially '%xxx' ones, so I didn't expect GIN indices to work either for Array 
columns with Like. Am I wrong?

Finally, I think the issue is actually not what I originally thought (i.e., 
index usage, as per above). But the inlining still is the culprit. Here is the 
plan for 

select count(*) from claims
where (select count(*)  from unnest("SECONDARY_ICD9_DGNS_CD") x_  where x_ like 
'427%' ) > 0

"Aggregate  (cost=2633016.66..2633016.67 rows=1 width=0) (actual 
time=3761.888..3761.889 rows=1 loops=1)"
"  ->  Seq Scan on claims  (cost=0.00..2631359.33 rows=662931 width=0) (actual 
time=0.097..3757.314 rows=85632 loops=1)"
"Filter: ((SubPlan 1) > 0)"
"Rows Removed by Filter: 1851321"
"SubPlan 1"
"  ->  Aggregate  (cost=1.25..1.26 rows=1 width=0) (actual 
time=0.001..0.001 rows=1 loops=1936953)"
"->  Function Scan on unnest a  (cost=0.00..1.25 rows=1 
width=0) (actual time=0.001..0.001 rows=0 loops=1936953)"
"  Filter: (a ~~ '427%'::text)"
"  Rows Removed by Filter: 2"
"Planning time: 0.461 ms"
"Execution time: 3762.272 ms"

And when using the function:

"Aggregate  (cost=614390.75..614390.76 rows=1 width=0) (actual 
time=8169.416..8169.417 rows=1 loops=1)"
"  ->  Seq Scan on claims  (cost=0.00..612733.43 rows=662931 width=0) (actual 
time=0.163..8162.679 rows=85632 loops=1)"
"Filter: (tilda."like"("SECONDARY_ICD9_DGNS_CD", '427%'::text) > 0)"
"Rows Removed by Filter: 1851321"
"Planning time: 0.166 ms"
"Execution time: 8169.676 ms"

There is something fundamental here it seems, but I am not so good at reading 
plans to understand the differences here.




Thank you,
Laurent Hasson

-Original Message-
From: Marc Mamin [mailto:m.ma...@intershop.de] 
Sent: Friday, November 11, 2016 07:44
To: l...@laurent-hasson.com; pgsql-performance@postgresql.org
Subject: RE: Inlining of functions (doing LIKE on an array)




> From: pgsql-performance-ow...@postgresql.org 
> [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of 
> l...@laurent-hasson.com
> Sent: Freitag, 11. November 2016 07:54
> To: pgsql-performance@postgresql.org
> Subject: [PERFORM] Inlining of functions (doing LIKE on an array)
> 
> Hello,
> 
> I am trying to implement an efficient "like" over a text[]. I see a lot of 
> people have tried before me and I learnt a lot through the forums. The 
> results of my search is that a query like the following is optimal:
> 
> select count(*) 
>   from claims
> where (select count(*) 
>   from unnest("ICD9_DGNS_CD") x_ 
>  where x_ like '427%'
>) > 0
> 

Hi,
are you using GIN indexes?

http://stackoverflow.com/questions/4058731/can-postgresql-index-array-columns 

moreover your query can still be optimized:
=>
select count(*)
  from claims
where exists (select *
  from unnest("ICD9_DGNS_CD") x_ 
 where x_ like '427%'
   ) 

regards,

Marc Mamin

> So I figured I'd create a Function to encapsulate the concept:
> 
> CREATE OR REPLACE FUNCTION ArrayLike(text[], text) RETURNS bigint AS 
> 'select count(*) from unnest($1) a where a like $2'
> LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF
> 
> This works functionally, but performs like crap: full table scan, and cannot 
> make use of any index it seems. Basically, it feels like PG can't inline that 
> function.
> 
> I have been trying all evening to find a way to rewrite it to trick the 
> compiler/planner into inlining. I tried the operator approach for example, 
> but performance is again not good.
> 
> create function rlike(text,text)
> returns bool as 'select $2 like $1' language sql strict immutable; 
> create operator  ``` (procedure = rlike, leftarg = text,
>   rightarg = text, commutator = ```); CREATE OR 
> REPLACE FUNCTION MyLike(text[], text) RETURNS boolean AS 'select $2 
> ``` ANY($1)'
> LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF
> 
> And by not good, I mean that on my table of 2M+ rows, the "native" query 
> takes 3s, while the function version takes 9s and the operator version takes 
> (via the function, or through the operator directly), takes 15s.
> 
> Any ideas or pointers?
> 
> 
> Thank you,
> Laurent Hasson


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Any advice tuning this query ?

2016-11-11 Thread Devrim Gündüz

Hi,

On Fri, 2016-11-11 at 16:19 +0100, Henrik Ekenberg wrote:
>  Sort Method: external merge  Disk: 16782928kB

This query is generating 16GB temp file on disk. Is this the amount of data you
want to sort?

Regards,
-- 
Devrim GÜNDÜZ
EnterpriseDB: http://www.enterprisedb.com
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
Twitter: @DevrimGunduz , @DevrimGunduzTR


signature.asc
Description: This is a digitally signed message part


[PERFORM] Any advice tuning this query ?

2016-11-11 Thread Henrik Ekenberg

Hi,

I have a select moving around a lot of data and takes times
Any advice tuning this query ?

EXPLAIN (ANALYZE ON, BUFFERS ON)
    select
    d.books,
    d.date publish_date,
    extract(dow from d.date) publish_dow,
    week_num_fixed,
    coalesce(sum(case when i.invno is not null then 1 else 0 end),0) as
daily_cnt,
    coalesce(sum(i.activation_amount_sek),0) as daily_amt_sek
    from dates_per_books d
    left join publishing_data i on (d.books=i.books and
d.date=i.publish_date)
    group by 1,2,3,4;

( explain : https://explain.depesz.com/s/aDOi )
    

   
QUERY
PLAN
  
 
--
 GroupAggregate  (cost=44606264.52..48172260.66 rows=4318263 width=68)
(actual time=839980.887..1029679.771 rows=43182733 loops=1)
   Group Key: d.books, d.date, (date_part('dow'::text,
(d.date)::timestamp without time zone)), d.week_num_fixed
   Buffers: shared hit=3, local hit=10153260 read=165591641, temp
read=2097960 written=2097960
   I/O Timings: read=399828.103
   ->  Sort  (cost=44606264.52..45104896.89 rows=199452945 width=48)
(actual time=839980.840..933883.311 rows=283894005 loops=1)
 Sort Key: d.books, d.date, (date_part('dow'::text,
(d.date)::timestamp without time zone)), d.week_num_fixed
 Sort Method: external merge  Disk: 16782928kB
 Buffers: shared hit=3, local hit=10153260 read=165591641,
temp read=2097960 written=2097960
 I/O Timings: read=399828.103
 ->  Merge Left Join  (cost=191.15..13428896.40
rows=199452945 width=48) (actual time=0.031..734937.112 rows=283894005
loops=1)
   Merge Cond: ((d.books = i.books) AND (d.date =
i.publish_date))
   Buffers: local hit=10153260 read=165591641
   I/O Timings: read=399828.103
   ->  Index Scan using books_date on
dates_per_books d  (cost=0.56..1177329.91 rows=43182628 width=20) (actual
time=0.005..33789.216 rows=43182733 loops=1)
 Buffers: local hit=10 read=475818
 I/O Timings: read=27761.376
   ->  Index Scan using activations_books_date
on publishing_data i  (cost=0.57..7797117.25 rows=249348384 width=32)
(actual time=0.004..579806.706 rows=249348443 loops=1)
 Buffers: local hit=10153250
read=165115823
 I/O Timings: read=372066.727
 Planning time: 2.864 ms
 Execution time: 1034284.193 ms
(21 rows)

(END)


Re: [PERFORM] Inlining of functions (doing LIKE on an array)

2016-11-11 Thread Marc Mamin



> From: pgsql-performance-ow...@postgresql.org 
> [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of 
> l...@laurent-hasson.com
> Sent: Freitag, 11. November 2016 07:54
> To: pgsql-performance@postgresql.org
> Subject: [PERFORM] Inlining of functions (doing LIKE on an array)
> 
> Hello,
> 
> I am trying to implement an efficient "like" over a text[]. I see a lot of 
> people have tried before me and I learnt a lot through the forums. The 
> results of my search is that a query like the following is optimal:
> 
> select count(*) 
>   from claims
> where (select count(*) 
>   from unnest("ICD9_DGNS_CD") x_ 
>  where x_ like '427%'
>) > 0
> 

Hi,
are you using GIN indexes?

http://stackoverflow.com/questions/4058731/can-postgresql-index-array-columns 

moreover your query can still be optimized:
=>
select count(*) 
  from claims
where exists (select *
  from unnest("ICD9_DGNS_CD") x_ 
 where x_ like '427%'
   ) 

regards,

Marc Mamin

> So I figured I'd create a Function to encapsulate the concept:
> 
> CREATE OR REPLACE FUNCTION ArrayLike(text[], text)
> RETURNS bigint
> AS 'select count(*) from unnest($1) a where a like $2'
> LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF
> 
> This works functionally, but performs like crap: full table scan, and cannot 
> make use of any index it seems. Basically, it feels like PG can't inline that 
> function.
> 
> I have been trying all evening to find a way to rewrite it to trick the 
> compiler/planner into inlining. I tried the operator approach for example, 
> but performance is again not good.
> 
> create function rlike(text,text) 
> returns bool as 'select $2 like $1' language sql strict immutable;
> create operator  ``` (procedure = rlike, leftarg = text, 
>   rightarg = text, commutator = ```);
> CREATE OR REPLACE FUNCTION MyLike(text[], text)
> RETURNS boolean
> AS 'select $2 ``` ANY($1)'
> LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF
> 
> And by not good, I mean that on my table of 2M+ rows, the "native" query 
> takes 3s, while the function version takes 9s and the operator version takes 
> (via the function, or through the operator directly), takes 15s.
> 
> Any ideas or pointers?
> 
> 
> Thank you,
> Laurent Hasson


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance