Re: [PERFORM] OLAP/reporting queries fall into nested loops over seq scans or other horrible planner choices

2017-11-01 Thread l...@laurent-hasson.com

> -Original Message-
> From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-
> ow...@postgresql.org] On Behalf Of Gunther
> Sent: Wednesday, November 01, 2017 20:29
> To: pgsql-performance@postgresql.org
> Subject: [PERFORM] OLAP/reporting queries fall into nested loops over seq
> scans or other horrible planner choices
> 
> Hi, this is Gunther. I have been with PgSQL for decades, on and off this list,
> and haven't been on for a long time, making my way just fine. But there is one
> thing that keeps bothering me both with Oracle and PgSQL, and that is the
> preference for Nested Loops.
> 
> Over the years the archives have accumulated questions about Nested Loops
> being chosen over Hash Joins. But the responses tend to be too specific to the
> poster's queries: they ask many questions, request the query plans, and often
> end in frustrating suggestions to change the data model, add an index, and
> so on.
> 
> One should not have to go into that personal detail.
> 
> There are some clear boundaries that a smart database should just never cross.
> 
> Especially with OLAP queries. A database that is fine for OLTP, with indexes
> in place, has no problem with index-based access to a few records joined with
> a dozen other indexed tables. If you fall into a Seq Scan scenario or an
> unwanted Hash Join there, you usually forgot to add an index or forgot to put
> the index columns into your join or other constraints. Those are novice
> questions and we should be beyond that.
> 
> But the issue is bulk searches, reports, and analytic query scenarios. In
> those queries Nested Loops are almost always a bad choice, even if there is an
> index. In over 20 years of working with RDBMSs this has been my unfailing
> heuristic. A report runs slow? Look at the plan. Is there a Nested Loop? Yes?
> Squash it! And the report instantly runs 10x faster.
> 
> So, all the more troublesome is if any database system (here PgSQL) ever
> falls into a Nested Loop trap, CPU spinning at 100% for several minutes, with
> a Nested Loop body of anything from a Seq Scan on up and a cardinality over 10
> or 100. Nested Loops over Nested Loops, or Nested Loops over other complex
> query plan fragments, should be a no-no, chosen only as an absolute last
> resort when the system cannot find enough memory; even then a disk-based
> merge sort should be better, i.e., Nested Loops should never be chosen.
> Period.
> 
> If you can set enable_nestloop off and the Hash Join is chosen and the
> performance goes from 1 hour of 100% CPU to 10 seconds completion time,
> then something is deadly wrong. And it doesn't matter to me if I should have
> re-written my query in some funny ways or tweaked my data model, these are
> all unacceptable options when you have a complex system with hybrid
> OLTP/OLAP uses. Don't tell me to de-normalize. I know I can materialize joins
> in tables which I can then use again in joins to save time. But that is not 
> the
> point here.
> 
> And I don't think tweaking optimizer statistics is the solution either.
> Because optimizer statistics quickly become worthless when your criteria get
> more complex.
> 
> The point is that Nested Loops should never be chosen except in index lookup
> situations or maybe under memory constraints.
> 
> How can I prevent it on a query-by-query scope? I cannot set enable_nestloop
> = off globally, because one query will be for a full report, while another one
> with indexed constraints might run in the same session, and I don't want to
> manage side effects and remember to toggle the enable_nestloop parameter on
> and off.
> 
> There must be a way to tell the optimizer to penalize nested loops to make
> them the last resort. In Oracle there are those infamous hints, but they don't
> always work either (or it is easy to make mistakes that you get no feedback
> about).
> 
> Is there any chance PgSQL can get something like a hint feature? Or is there a
> way to use postgresql.conf to penalize nested loops so that they would only 
> ever
> be chosen in the most straight-forward situations as with query parameters
> that are indexed? I know I need to have sufficient work_mem, but if you can 
> set
> enable_nestloop = off and you get the desired Hash Join, there is obviously
> sufficient work_mem, so that isn't the answer either.
> 
> Thanks for listening to my rant.
> 
> regards,
> -Gunther
> 
> 
> 
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance

 [Laurent Hasson] 
Hello Gunther,

Just adding to your voice. I recently experienced the same issue with a complex 
multi-table view, including pivots, and was surprised to see all the nested 
loops everywhere in spite of indices being available. I spent a lot of time 
optimizing the query and went from about 1 h to about 3 min, but penalizing 
nested loops in favor of other "joining" ...
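One way to scope the planner setting to a single query, addressing the question raised above (an editor's sketch, not a suggestion from the thread), is SET LOCAL, which confines the change to the current transaction:

```sql
-- SET LOCAL reverts automatically at COMMIT/ROLLBACK, so other queries
-- in the same session keep the default planner behavior.
BEGIN;
SET LOCAL enable_nestloop = off;   -- discourage nested loops for this report only
-- ... run the big reporting query here ...
COMMIT;
```

This avoids having to remember to turn the parameter back on, since the setting dies with the transaction.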

Re: [PERFORM] Unlogged tables

2017-08-09 Thread l...@laurent-hasson.com


Sent from my BlackBerry - the most secure mobile device
From: gneun...@comcast.net
Sent: August 9, 2017 14:52
To: l...@laurent-hasson.com
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Unlogged tables


Please don't top post.

On 8/9/2017 2:30 PM, l...@laurent-hasson.com wrote:
> On 8/9/2017 2:17 PM, gneun...@comcast.net wrote:

>> On Wed, 9 Aug 2017 09:14:48 -0700, Jeff Janes <jeff.ja...@gmail.com> wrote:

>> Why doesn't the Windows scheduled shutdown signal postgres to shutdown
>> cleanly and wait for it to do so?  That is what is supposed to happen.

> Windows *does* signal shutdown (and sleep and hibernate and wakeup).
> pg_ctl can catch these signals only when running as a service ... it
> will not catch any system signals when run as an application.

Ok, I am not sure. I run Postgres as a service, and when my Windows rebooted 
after a patch, UNLOGGED tables were cleaned... maybe the patch process in 
Windows messed something up, I don't know.

Hmm.  Do you have checkpoint intervals set very long?  Or do you have the 
Windows shutdown delay(s) set short?

Data in unlogged tables persists only AFTER a checkpoint ... if the tables had 
been written to and were "dirty", and the system went down before the shutdown 
checkpoint (or before the shutdown checkpoint completed), then the tables would 
be truncated at the next startup.


Service control in Windows is very different from Unix/Linux, and Windows is 
not completely POSIX compatible.  I develop software for Windows and Linux, but 
I only use Postgresql.  Postgresql was written originally for Unix and it is 
possible that the Windows version is not doing something quite right.

I took a quick glance at the source for pg_ctl: SERVICE_CONTROL_SHUTDOWN and 
SERVICE_CONTROL_STOP both just set a shared event to notify the writer 
processes to terminate. Offhand I don't see where pg_ctl, running as a 
service, waits for the writer processes to actually terminate (it does wait if 
run from the command line). It's possible that your system shut down too 
quickly and the WAL writer was killed instead of terminating cleanly.


Just FYI, re: Postgresql as a user application.

Windows doesn't send *signals* (ala Unix) at all ... it is message based.  The 
control messages are different for applications and services - e.g., 
WM_SHUTDOWN is sent to applications, SERVICE_CONTROL_SHUTDOWN is sent to 
services.  In order for an application to catch a message, it must create a 
window.

pg_ctl is a command line program which does not create any windows (in any 
mode). It was designed to be able to run as a service, but when run as a user 
application it can't receive any system messages. The user *must* manually 
stop a running database cluster before shutting down or sleeping.

George


Hello George... I know about not doing top posting, but I was emailing from my 
phone and just recently moved to Android; I think I am still not configured 
right.

Somewhat orthogonal, but is there any particular reason why top posts == bad, 
or is it just convention?

I will try a few scenarios and report back. I do not believe I have long 
checkpoint intervals, and I do not believe the Windows machine shuts down 
faster than 'normal'.

Finally, my true question was whether Postgres could support something like 
WORM (write once, read many) tables with the performance benefits of UNLOGGED, 
but without the inconvenience of auto-truncation.

Thanks.


Re: [PERFORM] Unlogged tables

2017-08-09 Thread l...@laurent-hasson.com
Ok, I am not sure. I run Postgres as a service, and when my Windows rebooted 
after a patch, UNLOGGED tables were cleaned... maybe the patch process in 
Windows messed something up, I don't know.

From: gneun...@comcast.net
Sent: August 9, 2017 13:17
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Unlogged tables


On Wed, 9 Aug 2017 09:14:48 -0700, Jeff Janes  wrote:

 >Why doesn't the Windows scheduled shutdown signal postgres to shutdown
 >cleanly and wait for it to do so?  That is what is supposed to happen.

Windows *does* signal shutdown (and sleep and hibernate and wakeup).
pg_ctl can catch these signals only when running as a service ... it
will not catch any system signals when run as an application.

George




[PERFORM] Unlogged tables

2017-08-08 Thread l...@laurent-hasson.com
Hello,


We have a fairly large static dataset that we load into Postgres. We made the 
tables UNLOGGED and saw a pretty significant performance improvement during 
loading. This was all fantastic until the server crashed and we were surprised 
to see during a follow-up demo that the data had disappeared... Of course, it's 
all our fault for not understanding the implications of UNLOGGED properly.


However, our scenario is truly a set of tables with hundreds of millions of 
rows that are effectively WORMs: we write them once only, and then only read 
from them afterwards. As such, they could not possibly be corrupted post-load 
(I think) during a server crash (short of physical disk defects...).


I'd like to have the performance improvement during an initial batch insert, 
and then make sure the table survives "unclean" shutdowns, which, as you might 
have it, include a regular Windows server shutdown during patching, for 
example. So unlogged tables in practice are pretty flimsy. I tried ALTER ... 
SET LOGGED, but that takes a VERY long time and pretty much negates the initial 
performance boost of loading into an unlogged table.


Is there a way to get my cake and eat it too?
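For reference, the pattern under discussion looks like the following sketch (table and column names are hypothetical, not from the thread); the ALTER at the end is the step reported above as very slow on large tables, because it has to write the whole table into the WAL:

```sql
-- Hypothetical example: bulk-load into an unlogged table, then make it durable.
CREATE UNLOGGED TABLE claims_load (
    claim_id bigint,
    code     text
);

COPY claims_load FROM '/data/claims.csv' (FORMAT csv);  -- fast: no WAL written

-- Rewrites the table contents into the WAL; on hundreds of millions of
-- rows this is the expensive step that negates the loading speedup.
ALTER TABLE claims_load SET LOGGED;
```

Until the SET LOGGED completes, a crash (or an unclean Windows shutdown) can truncate the table, which is the flimsiness described above.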


Thank you,

Laurent Hasson





Re: [PERFORM] Dataset is fetched from cache but still takes same time to fetch records as first run

2017-06-23 Thread l...@laurent-hasson.com
Ditto here... much slower, and it crashes too often. We run an evergreen shop 
where I work, but everyone has moved back to pgAdmin III.

Sent from my BlackBerry KEYone - the most secure mobile device
From: adambrusselb...@gmail.com
Sent: June 23, 2017 8:11 AM
To: t...@sss.pgh.pa.us
Cc: sumeet.k.shu...@gmail.com; dstibr...@gmail.com; 
pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Dataset is fetched from cache but still takes same time 
to fetch records as first run


On Fri, Jun 23, 2017 at 12:50 AM, Tom Lane wrote:
It's possible that pgAdmin4 has improved matters in this area.

Sadly, not in my experience. It's actually considerably worse than pgAdmin III 
when selecting a lot of rows, especially very wide ones (20+ columns).


[PERFORM] Sudden drastic change in performance

2017-06-15 Thread l...@laurent-hasson.com
Hello all,

I have a query with many joins, something like:

Select c1, c2, c3, sum(c5)
  From V1
  Join V2 on ...
  Left join V3 on ...
  Left join T4 on ...
  Join T5 on ...
  Join T6 on ...
  Left join T7 on ...
  Join T8 on ...
  Left join T9 on ...
Where ...
Group by c1, c2, c3

The join clauses are fairly innocuous and work directly on foreign key 
relationships, so there is no voodoo there. Same for the where clause. The 
views are similar and also join 3-4 tables each. Of all the tables involved, 3 
have millions of rows and all the others have thousands of rows. In particular, 
T9 is totally empty.

If I remove T9 from the query, it takes 9s to run. If I keep T9, the query 
takes over 30 min to run! If I switch the order of T8/T9, then the same happens 
with T8. So I don't think this has to do with the tables themselves. I have 
updated all the statistics and reindexed all involved tables.

Any idea as to what could be causing this issue? Do I have one too many joins, 
tripping up the planner? The query plans are very large in both cases, so I 
figured I'd abstract the cases a bit for this question, but I could provide 
those plans if someone thinks it'd be useful.

Thank you,
Laurent.
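One planner knob worth checking for a query of this shape (an editor's note, not an answer from the thread): with 9+ relations in the query, the default join_collapse_limit/from_collapse_limit of 8 is exceeded, so the planner stops exhaustively reordering joins and partly follows the textual join order, which would be consistent with the problem moving when T8 and T9 are swapped. A sketch:

```sql
-- Hypothetical: raise the limits so the planner may reorder all the
-- relations instead of freezing the join order past the first 8.
SET join_collapse_limit = 12;
SET from_collapse_limit = 12;

EXPLAIN ANALYZE
Select c1, c2, c3, sum(c5)
  From V1
  Join V2 on ...
  Left join T9 on ...
Where ...
Group by c1, c2, c3;
```

Raising the limits increases planning time combinatorially, so this is a per-session experiment rather than a global setting.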



Re: [PERFORM] More cores or higer frequency ?

2017-05-26 Thread l...@laurent-hasson.com
Are you already on SSDs? That will be the dominant factor, I think. Then 
memory. After that, more cores are good for parallelism (especially with 9.6, 
although that requires solid memory support). Faster cores will be better if 
you expect complex calculations in memory, i.e., some analytics perhaps, but 
for your fairly straightforward write-throughput scenario, I think SSDs and 
memory will be king.

LDH

-Original Message-
From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Jarek
Sent: Tuesday, May 23, 2017 14:30
To: pgsql-performance@postgresql.org
Subject: [PERFORM] More cores or higer frequency ?

Hello!

I have a heavily loaded PostgreSQL server which I want to upgrade so it will 
handle more traffic. Can I estimate what is better: more cores or higher 
frequency? I expect that pg_stat should give some tips, but I don't know where 
to start...

best regards
Jarek





Re: [PERFORM] Understanding PostgreSQL query execution time

2017-04-07 Thread l...@laurent-hasson.com
The first behavior is very likely just caching. The plan and the data pages 
read by the query are cached, so the second time they are reused directly.

If you ran a bunch of other queries in the middle and effectively evicted the 
cache, then came back to your query, you'd likely see the 'slow' behavior 
again.

As for AWS, I'm not sure, but it is likely about memory and config more than 
latency.
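The caching explanation can be verified directly (an editor's sketch, with 10 substituted for the $1 parameter in the query below): the BUFFERS option reports how much of each run came from shared memory versus disk.

```sql
-- On a cold first run, expect mostly 'read' (disk) buffer counts;
-- on immediate repeats, expect mostly 'hit' (shared-buffer cache).
EXPLAIN (ANALYZE, BUFFERS)
SELECT "global_configs".* FROM "global_configs"
ORDER BY "global_configs"."id" ASC LIMIT 10;
```

If the hit/read ratio flips between the first and second execution while the plan stays identical, the time difference is caching rather than planning.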


Sent from my BlackBerry 10 smartphone.
From: Haider Ali
Sent: Friday, April 7, 2017 09:58
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Understanding PostgreSQL query execution time


Hello

I want to understand the execution time of a query in PostgreSQL, and then I 
want to relate it to a problem I am seeing. According to my observation (I 
can't explain why this happens), whenever we query a table for the first time 
its execution time is high (sometimes very high) compared to queries made on 
the same table shortly after that first query. For example, the query given 
below

SELECT  "global_configs".* FROM "global_configs" ORDER BY "global_configs"."id" 
ASC LIMIT $1

executed multiple times instantaneous one after another have following 
execution time

1st time => 147.5ms
2nd time => 3.0ms
3rd time => 3.0ms
4th time => 3.0ms
5th time => 0.8ms

I want to understand why there is a huge time difference between the 1st 
execution and the rest.

Relation to other problem

Having seen the above behaviour of PostgreSQL, I am now using PostgreSQL 
managed by Amazon RDS. The observation there is that no matter how many times I 
execute the same query, its execution time remains the same (although execution 
time on RDS is comparatively high compared to a query running on a local 
instance of PostgreSQL, which I can understand is because of network latency).

Questions


  1.  Why does the first query on a table take more time than the queries that 
follow it?
  2.  Why doesn't the above behaviour show up on Amazon RDS?

Thank you for reading my post.

--
Haider Ali


Re: [PERFORM] How Can I check PostgreSQL backup is successfully or not ?

2017-03-02 Thread l...@laurent-hasson.com
It'd be so nice to have some checks to guarantee the backup is trustworthy. 
Restoring the db is IMHO not a very good option in general:
 - Large databases are a problem. My db is about 3TB; time plus disk space is 
a big blocker.
 - Also, what if the backup is incomplete? Just restoring the db successfully 
is not enough, right? You'd have to compare with prod to make sure nothing was 
missed... In a fast-moving outfit where the db today has tons of new, changed, 
and deleted stuff compared to yesterday, how would you even do that?

I am in a warehouse environment, so I have given up on guaranteeing backups; 
in case of trouble, I'll spend 20h rebuilding my db. So I have a way out, but 
I'd much prefer working with trustworthy backups.


Sent from my BlackBerry 10 smartphone.
From: Rick Otten
Sent: Thursday, March 2, 2017 08:19
To: Dinesh Chandra 12108
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] How Can I check PostgreSQL backup is successfully or not 
?


This reminds me - I have had a case where the exit code for pg_dump was 
successful, but the backup was still corrupted on disk. By all means check the 
exit code, but I strongly encourage a second validation, such as the index 
listing, to increase your confidence that the backup was successful.
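The "archive list" validation mentioned above can be sketched as follows (database and file names are the thread's dummy examples): pg_restore --list parses the entire table of contents of a custom-format archive, so it catches truncated or corrupted dumps that a clean pg_dump exit code alone can miss.

```shell
# Dump in custom format, then verify both the exit code and the archive TOC.
pg_dump -U dummy_user -Fc -f dummy_database.dump dummy_database \
  && pg_restore --list dummy_database.dump > /dev/null \
  && echo "backup written and archive TOC is readable" \
  || echo "backup FAILED validation" >&2
```

This is still weaker than a real restore, per the rest of the thread, but it is cheap enough to run after every scheduled backup.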

The best way to ensure good backups is to establish a regular practice of 
restoring a backup to another database.  The easiest such practice to justify 
and implement is to maintain a developer/development database, and to use your 
production database backups to rebuild it on a regular basis.  Other approaches 
could include regularly scheduled Disaster Recovery exercises, or simply 
spinning up throw away cloud instances for the purpose.

pg_dump uses the ordinary postgresql COPY command to extract data from the 
tables.  Beyond that, I'm not sure how it works.  Sorry I can't help you there.


On Thu, Mar 2, 2017 at 7:05 AM, Dinesh Chandra 12108 wrote:
Hi,

When I issue the below command
  > ./bin >pg_dump -U dummy_user  dummy_database; echo $?

I checked with the Linux top command on the same server, and it was showing 
COPY database.
What exactly is it doing?

Regards,
Dinesh Chandra

-Original Message-
From: vinny [mailto:vi...@xs4all.nl]
Sent: 27 February, 2017 7:31 PM
To: John Gorman
Cc: Rick Otten; Dinesh Chandra 12108; 
pgsql-performance@postgresql.org; 
pgsql-performance-ow...@postgresql.org
Subject: Re: [PERFORM] How Can I check PostgreSQL backup is successfully or not 
?

On 2017-02-27 14:29, John Gorman wrote:
> Even though it's not listed in any of the documentation or “pg_dump
> --help” you can check the return code of the process. A return code
> greater than 0 (zero) usually indicates a failure
>
> ./bin >pg_dump -U dummy_user  dummy_database; echo $?
>
> 1
>
> FROM: 
> pgsql-performance-ow...@postgresql.org
> [mailto:pgsql-performance-ow...@postgresql.org]
>  ON BEHALF OF Rick
> Otten
> SENT: Monday, February 27, 2017 3:36 AM
> TO: Dinesh Chandra 12108
> CC: pgsql-performance@postgresql.org
> SUBJECT: Re: [PERFORM] How Can I check PostgreSQL backup is
> successfully or not ?
>
> Although it doesn't really tell if the pg_dump was successful (you'll
> need to do a full restore to be sure), I generate an archive list.  If
> that fails, the backup clearly wasn't successful, and if it succeeds,
> odds are pretty good that it worked:
>
> On Mon, Feb 27, 2017 at 4:35 AM, Dinesh Chandra 12108 wrote:
>
> Hi,
>
> We are taking daily full backup of PostgreSQL database using PG_DUMP
> which is automatic scheduled through Cronjobs.
>
> How can I check whether yesterday's backup was successful or not?
>
> Is there any query or view by which I can check it?
>
> REGARDS,
>
> DINESH CHANDRA
>
> |DATABASE ADMINISTRATOR (ORACLE/POSTGRESQL)| CYIENT LTD. NOIDA.


It's important to note the distinction between

"the backup process did not fail"

and

"we now have a trustworthy backup"

And you can go full-paranoia and say that you can successfully create a 
perfectly working backup of the wrong database.

So what is it that you want to make sure of:
1. Did the process give an error?
2. Did the process create a usable backup?

What are the chances of #1 reporting success but still producing a bad backup?
And can #2 fail on a good database, and if so, can you detect that?






Re: [PERFORM] Inlining of functions (doing LIKE on an array)

2016-11-12 Thread l...@laurent-hasson.com
Yep, agreed. A simple lexical macro-like approach to test "if it works" could 
be a way to see whether inlining a piece of SQL would break the main query.

Laurent Hasson
Sent from my BlackBerry Passport

  Original Message
From: Tom Lane
Sent: Saturday, November 12, 2016 14:59
To: l...@laurent-hasson.com
Cc: Marc Mamin; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Inlining of functions (doing LIKE on an array)


"l...@laurent-hasson.com" <l...@laurent-hasson.com> writes:
> I wish there were a way to force inlining, or some other mechanism as the 
> performance difference is large here. I'll be using the inlining approach 
> when possible, but the SQL Function approach is simpler and will likely be 
> more suitable for some developers.

I'm not sure that there's any fundamental reason why we don't inline SQL
functions containing sub-selects.  It may just be not having wanted to put
any effort into the case way-back-when.  Inlining happens too late to
allow a resulting WHERE EXISTS to get mutated into a semijoin, but in this
example that couldn't happen anyway, so it's not much of an objection.

regards, tom lane




Re: [PERFORM] Inlining of functions (doing LIKE on an array)

2016-11-11 Thread l...@laurent-hasson.com
Thanks for the pointer on the "select exists" syntax, Tom. Much appreciated; I 
couldn't figure it out! As for normalizing: yes, I thought about it, but the 
one-to-many relationship would make other scenarios we have more complex and 
slower. So I am juggling trade-offs.

So, here are my findings. I did 10 runs for each of the 4 options I have 
arrived at. The runs were pretty consistent, within a few tenths of a second of 
each other, so little variability. Not 100% scientific, but good enough for my 
test. I picked the last run, with its plans, for illustration below.

Take-aways:
---
   - The "select exists" approach (#3) is roughly 40% faster than "select 
count(*) > 0" (#1).
   - The SQL function version (#3) performs better (~30%) than the plpgSQL 
function version (#2) of the same query.
   - The inlined version (#4) is roughly twice as fast as the SQL function 
version (#3).

I wish there were a way to force inlining, or some other mechanism as the 
performance difference is large here. I'll be using the inlining approach when 
possible, but the SQL Function approach is simpler and will likely be more 
suitable for some developers.

Details:
-
1- select count(*) > 0 as SQL
===
CREATE OR REPLACE FUNCTION MyLike2(text[], text) RETURNS boolean
  AS  'select count(*) > 0 from unnest($1) a where a like $2'
LANGUAGE SQL STRICT IMMUTABLE

EXPLAIN ANALYZE
select count(*) 
from cms.claims
where MyLike2("code", '427%')
--"Aggregate  (cost=609418.77..609418.78 rows=1 width=0) (actual 
time=8464.372..8464.372 rows=1 loops=1)"
--"  ->  Seq Scan on claims  (cost=0.00..607761.44 rows=662931 width=0) (actual 
time=0.077..8457.963 rows=85632 loops=1)"
--"Filter: MyLike2("code", '427%'::text)"
--"Rows Removed by Filter: 1851321"
--"Planning time: 0.131 ms"
--"Execution time: 8464.407 ms"

2- select exists as plpgSQL
===
CREATE OR REPLACE FUNCTION MyLike3(text[], text) RETURNS boolean
  AS  'begin return exists (select * from unnest($1) a where a like $2); end'
LANGUAGE plpgSQL STRICT IMMUTABLE

EXPLAIN ANALYZE
select count(*) 
from cms.claims
where MyLike3("code", '427%')
--"Aggregate  (cost=609418.77..609418.78 rows=1 width=0) (actual 
time=7708.945..7708.945 rows=1 loops=1)"
--"  ->  Seq Scan on claims  (cost=0.00..607761.44 rows=662931 width=0) (actual 
time=0.040..7700.528 rows=85632 loops=1)"
--"Filter: MyLike3("code", '427%'::text)"
--"Rows Removed by Filter: 1851321"
--"Planning time: 0.076 ms"
--"Execution time: 7708.975 ms"

3- select exists as SQL
===
CREATE OR REPLACE FUNCTION MyLike(text[], text) RETURNS boolean
  AS  'select exists (select * from unnest($1) a where a like $2)'
LANGUAGE SQL STRICT IMMUTABLE

EXPLAIN ANALYZE
select count(*) 
from cms.claims
where MyLike("code", '427%')
--"Aggregate  (cost=609418.77..609418.78 rows=1 width=0) (actual 
time=5524.690..5524.690 rows=1 loops=1)"
--"  ->  Seq Scan on claims  (cost=0.00..607761.44 rows=662931 width=0) (actual 
time=0.064..5515.886 rows=85632 loops=1)"
--"Filter: tilda."like"("code", '427%'::text)"
--"Rows Removed by Filter: 1851321"
--"Planning time: 0.097 ms"
--"Execution time: 5524.718 ms"

4- select exists inlined
===
EXPLAIN ANALYZE
select count(*) 
from cms.claims
where exists (select * from unnest("SECONDARY_ICD9_DGNS_CD") a where a like 
'427%')
--"Aggregate  (cost=2604013.42..2604013.43 rows=1 width=0) (actual 
time=2842.259..2842.259 rows=1 loops=1)"
--"  ->  Seq Scan on claims  (cost=0.00..2601527.42 rows=994397 width=0) 
(actual time=0.017..2837.122 rows=85632 loops=1)"
--"Filter: (SubPlan 1)"
--"Rows Removed by Filter: 1851321"
--"SubPlan 1"
--"  ->  Function Scan on unnest a  (cost=0.00..1.25 rows=1 width=0) 
(actual time=0.001..0.001 rows=0 loops=1936953)"
--"Filter: (a ~~ '427%'::text)"
--"Rows Removed by Filter: 2"
--"Planning time: 0.155 ms"
--"Execution time: 2842.311 ms"


Thank you,
Laurent Hasson

-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Friday, November 11, 2016 11:46
To: l...@laurent-hasson.com
Cc: Marc Mamin <m.ma...@intershop.de>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Inlining of functions (doing LIKE on an array)

"l...@laurent-hasson.com" <l...@laurent-hasson.com> writes:
> I tried "exists", but won't work in the Function, i.e., CREATE OR 
> REPLACE

Re: [PERFORM] Inlining of functions (doing LIKE on an array)

2016-11-11 Thread l...@laurent-hasson.com
I tried "exists", but it won't work in the Function, i.e.,

CREATE OR REPLACE FUNCTION ArrayLike(text[], text) RETURNS bigint
  AS  'exists (select * from unnest($1) a where a like $2)'
LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF

It's as expected though. As for the GIN indices, I tried them and it didn't 
make a difference, which I guess is expected as well because of the LIKE 
operator. I don't expect regular indices on regular columns to work for LIKE 
operations, especially '%xxx' ones, so I didn't expect GIN indices to work 
either for array columns with LIKE. Am I wrong?
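For completeness (an editor's note, not a claim from the thread): on a plain text column, a trigram GIN index from the pg_trgm extension can in fact accelerate LIKE, including leading-wildcard '%xxx' patterns; it does not apply directly to a text[] column like the one discussed here. A sketch with a hypothetical plain-text column:

```sql
-- Hypothetical 'code_text' text column; trigram GIN indexes support
-- LIKE/ILIKE even with a leading wildcard (unlike btree indexes).
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX claims_code_trgm ON claims USING gin (code_text gin_trgm_ops);

EXPLAIN SELECT count(*) FROM claims WHERE code_text LIKE '427%';
```

For the array case, one would first have to unnest into a normalized side table to get this kind of index support, which is the normalization trade-off mentioned elsewhere in the thread.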

Finally, I think the issue is actually not what I originally thought (i.e., 
index usage, as per above); the lack of inlining is still the culprit. Here is 
the plan for 

select count(*) from claims
where (select count(*)  from unnest("SECONDARY_ICD9_DGNS_CD") x_  where x_ like 
'427%' ) > 0

"Aggregate  (cost=2633016.66..2633016.67 rows=1 width=0) (actual 
time=3761.888..3761.889 rows=1 loops=1)"
"  ->  Seq Scan on claims  (cost=0.00..2631359.33 rows=662931 width=0) (actual 
time=0.097..3757.314 rows=85632 loops=1)"
"Filter: ((SubPlan 1) > 0)"
"Rows Removed by Filter: 1851321"
"SubPlan 1"
"  ->  Aggregate  (cost=1.25..1.26 rows=1 width=0) (actual 
time=0.001..0.001 rows=1 loops=1936953)"
"->  Function Scan on unnest a  (cost=0.00..1.25 rows=1 
width=0) (actual time=0.001..0.001 rows=0 loops=1936953)"
"  Filter: (a ~~ '427%'::text)"
"  Rows Removed by Filter: 2"
"Planning time: 0.461 ms"
"Execution time: 3762.272 ms"

And when using the function:

"Aggregate  (cost=614390.75..614390.76 rows=1 width=0) (actual 
time=8169.416..8169.417 rows=1 loops=1)"
"  ->  Seq Scan on claims  (cost=0.00..612733.43 rows=662931 width=0) (actual 
time=0.163..8162.679 rows=85632 loops=1)"
"Filter: (tilda."like"("SECONDARY_ICD9_DGNS_CD", '427%'::text) > 0)"
"Rows Removed by Filter: 1851321"
"Planning time: 0.166 ms"
"Execution time: 8169.676 ms"

There is something fundamental here it seems, but I am not so good at reading 
plans to understand the differences here.




Thank you,
Laurent Hasson

-Original Message-
From: Marc Mamin [mailto:m.ma...@intershop.de] 
Sent: Friday, November 11, 2016 07:44
To: l...@laurent-hasson.com; pgsql-performance@postgresql.org
Subject: RE: Inlining of functions (doing LIKE on an array)




> From: pgsql-performance-ow...@postgresql.org 
> [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of 
> l...@laurent-hasson.com
> Sent: Freitag, 11. November 2016 07:54
> To: pgsql-performance@postgresql.org
> Subject: [PERFORM] Inlining of functions (doing LIKE on an array)
> 
> Hello,
> 
> I am trying to implement an efficient "like" over a text[]. I see a lot of 
> people have tried before me and I learnt a lot through the forums. The 
> results of my search is that a query like the following is optimal:
> 
> select count(*) 
>   from claims
> where (select count(*) 
>   from unnest("ICD9_DGNS_CD") x_ 
>  where x_ like '427%'
>) > 0
> 

Hi,
are you using GIN indexes?

http://stackoverflow.com/questions/4058731/can-postgresql-index-array-columns 

moreover your query can still be optimized:
=>
select count(*)
  from claims
where exists (select *
  from unnest("ICD9_DGNS_CD") x_ 
 where x_ like '427%'
   ) 

regards,

Marc Mamin

> So I figured I'd create a Function to encapsulate the concept:
> 
> CREATE OR REPLACE FUNCTION ArrayLike(text[], text) RETURNS bigint AS 
> 'select count(*) from unnest($1) a where a like $2'
> LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF
> 
> This works functionally, but performs like crap: full table scan, and cannot 
> make use of any index it seems. Basically, it feels like PG can't inline that 
> function.
> 
> I have been trying all evening to find a way to rewrite it to trick the 
> compiler/planner into inlining. I tried the operator approach for example, 
> but performance is again not good.
> 
> create function rlike(text,text)
> returns bool as 'select $2 like $1' language sql strict immutable; 
> create operator  ``` (procedure = rlike, leftarg = text,
>   rightarg = text, commutator = ```); CREATE OR 
> REPLACE FUNCTION MyLike(text[], text) RETURNS boolean AS 'select $2 
> ``` ANY($1)'
> LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF
> 
> And by not good, I mean that on my table of 2M+ rows, the "native" query 
> takes 3s, while the function version takes 9s and the operator version (via 
> the function, or through the operator directly) takes 15s.
> 
> Any ideas or pointers?
> 
> 
> Thank you,
> Laurent Hasson




[PERFORM] Inlining of functions (doing LIKE on an array)

2016-11-10 Thread l...@laurent-hasson.com
Hello,

I am trying to implement an efficient "like" over a text[]. I see a lot of 
people have tried before me, and I learnt a lot through the forums. The result 
of my search is that a query like the following is optimal:

select count(*)
  from claims
where (select count(*)
  from unnest("ICD9_DGNS_CD") x_
 where x_ like '427%'
   ) > 0

So I figured I'd create a Function to encapsulate the concept:

CREATE OR REPLACE FUNCTION ArrayLike(text[], text)
RETURNS bigint
AS 'select count(*) from unnest($1) a where a like $2'
LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF

This works functionally, but performs like crap: full table scan, and cannot 
make use of any index it seems. Basically, it feels like PG can't inline that 
function.

I have been trying all evening to find a way to rewrite it to trick the 
compiler/planner into inlining. I tried the operator approach for example, but 
performance is again not good.

create function rlike(text,text)
returns bool as 'select $2 like $1' language sql strict immutable;
create operator  ``` (procedure = rlike, leftarg = text,
  rightarg = text, commutator = ```);
CREATE OR REPLACE FUNCTION MyLike(text[], text)
RETURNS boolean
AS 'select $2 ``` ANY($1)'
LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF

And by not good, I mean that on my table of 2M+ rows, the "native" query takes 
3s, while the function version takes 9s and the operator version (via the 
function, or through the operator directly) takes 15s.

Any ideas or pointers?


Thank you,
Laurent Hasson