Re: Removing unneeded self joins

2021-07-06 Thread Hywel Carver
On Mon, Jul 5, 2021 at 2:20 PM Andrey Lepikhov 
wrote:

> On 2/7/21 01:56, Hywel Carver wrote:
> > On Wed, Jun 30, 2021 at 12:21 PM Andrey Lepikhov
> > mailto:a.lepik...@postgrespro.ru>> wrote:
> > I think here we could ask a more general question: do we want to remove
> > an 'IS NOT NULL' clause from the clause list if the rest of the list
> > implicitly implies it?
> >
> >
> > My suggestion was not to remove it, but to avoid adding it in the first
> > place. When your optimisation has found a join on a group of columns
> > under a uniqueness constraint, you would do something like this (forgive
> > the pseudo-code)
> >
> > foreach(column, join_clause) {
> >    if(column.nullable) { // This condition is what I'm suggesting is added
> >      add_null_test(column, IS_NOT_NULL);
> >    }
> > }
> >
> > But it may be that that's not possible or practical at this point in the
> > code.
> I think such an option would require implementing new machinery to prove
> that an arbitrary column couldn't produce a NULL value.
>

Got it, and it makes sense to me that this would be out of scope for this
change.

I remember in the previous conversation about this, Tomas acknowledged that
while there are some silly queries that would benefit from this change,
there are also some well-written ones (e.g. properly denormalised table
structures, with decomposed views that need joining together in some
queries). So the optimisation needs to be essentially free to run, to
minimise its impact on other queries.

Looking through the email chain, a previous version of this patch added
~0.6% to planning time in the worst case tested - does that meet the
"essentially free" requirement?


Re: Removing unneeded self joins

2021-07-01 Thread Hywel Carver
On Wed, Jun 30, 2021 at 12:21 PM Andrey Lepikhov 
wrote:

> On 12/3/21 12:05, Hywel Carver wrote:
> > I've built and tested this, and it seems to function correctly to me.
> > One question I have is whether the added "IS NOT NULL" filters can be
> > omitted when they're unnecessary. Some of the resulting plans included an
> > "IS NOT NULL" filter on a non-nullable column. To be clear, this is still
> > an improvement (to me) without that.

> I think here we could ask a more general question: do we want to remove
> an 'IS NOT NULL' clause from the clause list if the rest of the list
> implicitly implies it?


My suggestion was not to remove it, but to avoid adding it in the first
place. When your optimisation has found a join on a group of columns under
a uniqueness constraint, you would do something like this (forgive the
pseudo-code)

foreach(column, join_clause) {
  if(column.nullable) { // This condition is what I'm suggesting is added
    add_null_test(column, IS_NOT_NULL);
  }
}

But it may be that that's not possible or practical at this point in the
code.
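
For what it's worth, the reason the nullability check matters is SQL
semantics rather than plan cosmetics: NULLs never compare equal, so when a
self join on a nullable unique column is collapsed into a single scan, the
rows where that column is NULL have to be filtered out to preserve the same
result. A minimal sketch (a made-up table, not from the patch's regression
tests):

CREATE TABLE t (id BIGINT PRIMARY KEY, u BIGINT UNIQUE);
INSERT INTO t VALUES (1, NULL), (2, NULL), (3, 30);

-- Join on the nullable unique column: NULLs never match, so only row 3
-- survives. Collapsing this to a single scan needs a "u IS NOT NULL" filter.
SELECT count(*) FROM t a JOIN t b ON a.u = b.u;   -- returns 1

-- Join on the primary key: id can never be NULL, so an added
-- "id IS NOT NULL" filter would be redundant.
SELECT count(*) FROM t a JOIN t b ON a.id = b.id; -- returns 3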


Re: Self-join optimisation

2021-03-12 Thread Hywel Carver
Hi, thanks for your replies. I've tested the patch and it works for me in
the cases where I'd use it.

> And explain why your application is doing queries like this, and why it
> can't be changed to not generate such queries.

Reading the thread, it looks like some of the requests for this feature are
coming from people using ORMs that generate bad queries. That's not been my
experience - I've always been able to find a way to construct the right
query through the ORM or just write correct SQL. When I've wanted this
feature, it has always been in relation to combining views.

For example, I was recently helping out a company that runs a booking
system for leisure activities, and their database has a view for how many
staff are available on a given day to supervise a given activity (e.g.
surfing), and a view for how much equipment is available on a given day
(e.g. how many surfboards). They also have a view for the equipment
requirements for a given activity (e.g. some boat trips require a minimum
of 2 boats and 4 oars). When they want to make bookings, they have to
combine data from these views and from the tables they're built on.

It would definitely be possible to write one view that had all of this data
in (and maintain the other views too, which are needed elsewhere in the
site). And it could be made wide enough to carry all of the columns from the
source tables. But it would, to me, feel like much better code to keep the
separate decomposed views and join them together for the booking query.
Right now, that query's performance suffers in a way that this feature
would fix. So the current choices are: accept worse performance with
decomposed views, write one very large and performant view but duplicate
some of the logic, or use their ORM to generate the SQL that they'd
normally put in a view.
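
To sketch the shape of it (a hypothetical, heavily simplified schema, not
their real one): both decomposed views project the same underlying table, so
joining them back together produces exactly the kind of self join this patch
removes.

CREATE TABLE activities (
  id BIGINT PRIMARY KEY,
  name TEXT NOT NULL,
  min_staff INT NOT NULL,
  min_boats INT NOT NULL
);

CREATE VIEW staffing_requirements AS
  (SELECT id, name, min_staff FROM activities);

CREATE VIEW equipment_requirements AS
  (SELECT id, min_boats FROM activities);

-- The booking query joins the decomposed views back together on the primary
-- key of the shared base table, i.e. a removable self join.
EXPLAIN SELECT s.name, s.min_staff, e.min_boats
FROM staffing_requirements s
INNER JOIN equipment_requirements e ON s.id = e.id;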

On Thu, Mar 11, 2021 at 10:50 PM Tomas Vondra 
wrote:

> On 3/11/21 3:39 PM, Hywel Carver wrote:
> > Great! It looks like it's been in commitfest status for a few years. Is
> > there anything someone like me (outside the pgsql-hackers community) can
> > do to help it get reviewed this time around?
> >
>
> Well, you could do a review, or at least test it with the queries your
> application is actually running. And explain why your application is
> doing queries like this, and why it can't be changed to not generate
> such queries.
>
> The first couple of messages from the patch thread [1] (particularly the
> messages from May 2018) are a good explanation of why patches like this are
> tricky to get through.
>
> The basic assumption is that such queries are a bit silly, and it'd
> probably be easier to modify the application not to generate them instead
> of undoing the harm in the database planner. The problem is this makes
> the planner more expensive for everyone, including people who carefully
> write "good" queries.
>
>
> And we don't want to do that, so we need to find a way to make this
> optimization very cheap (essentially "free" if not applicable), but
> that's difficult because there may be cases where the self-joins are
> intentional, and undoing them would make the query slower. And making a
> good decision requires enough information, but this decision needs to
> happen quite early in planning, when we have very little info.
>
> So while it seems like a simple optimization, it's actually quite tricky
> to get right.
>
> (Of course, there are cases where you may get such queries even if you
> try writing good SQL, say when joining views etc.)
>
> regards
>
> [1]
>
> https://www.postgresql.org/message-id/flat/64486b0b-0404-e39e-322d-080115490...@postgrespro.ru
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: Removing unneeded self joins

2021-03-12 Thread Hywel Carver
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

I've built and tested this, and it seems to function correctly to me. One 
question I have is whether the added "IS NOT NULL" filters can be omitted when 
they're unnecessary. Some of the resulting plans included an "IS NOT NULL" 
filter on a non-nullable column. To be clear, this is still an improvement (to 
me) without that.

Here's the simple test script I ran, on master ("before") and with the latest 
patch applied ("after").

CREATE TABLE users (id BIGINT PRIMARY KEY, nullable_int BIGINT UNIQUE, 
some_non_nullable_int BIGINT NOT NULL);
CREATE VIEW only_some_users AS (SELECT * FROM users WHERE id < 10);
CREATE VIEW some_other_users AS (SELECT * FROM users WHERE id > 3);

EXPLAIN SELECT *
FROM users u1
INNER JOIN users u2 
  ON u1.id = u2.id;
-- before does HASH JOIN
-- after does seq scan with "id IS NOT NULL" condition

EXPLAIN SELECT *
FROM only_some_users
INNER JOIN some_other_users
  ON only_some_users.id = some_other_users.id;
-- before does HASH JOIN
-- after does no JOIN, instead does a scan, with an extra "id IS NOT NULL"
-- condition (in addition to id < 10, id > 3)

EXPLAIN SELECT *
FROM users u1
INNER JOIN users u2 
  ON u1.nullable_int = u2.nullable_int;
-- before does HASH JOIN
-- after does scan with (nullable_int IS NOT NULL) filter

EXPLAIN SELECT *
FROM users u1
INNER JOIN users u2 
  ON u1.id = u2.nullable_int;
-- before does HASH JOIN
-- after correctly unchanged

EXPLAIN SELECT *
FROM users u1
INNER JOIN users u2 
  ON u1.id = u2.some_non_nullable_int
INNER JOIN users u3 
  ON u2.some_non_nullable_int = u3.id;
-- before does 2x HASH JOIN
-- now does 1x HASH JOIN, with a sequential scan over users filtered by
-- id IS NOT NULL

Re: Self-join optimisation

2021-03-11 Thread Hywel Carver
Great! It looks like it's been in commitfest status for a few years. Is
there anything someone like me (outside the pgsql-hackers community) can do
to help it get reviewed this time around?

On Thu, Mar 11, 2021 at 2:32 PM Matthias van de Meent <
boekewurm+postg...@gmail.com> wrote:

> On Thu, 11 Mar 2021 at 15:15, Hywel Carver  wrote:
> >
> > Hi,
> >
> > I asked this question in the Postgres Slack, and was recommended to ask
> here instead.
> >
> > A few times, I've been in a situation where I want to join a table to
> itself on its primary key. That typically happens because I have some kind
> of summary view, which I then want to join to the original table (using its
> primary key) to flesh out the summary data with other columns. That's
> executed as a join, which surprised me. But in this case, I could extend
> the view to have all of the columns of the original table to avoid the join.
> >
> > But there's another case that's harder to solve this way: combining
> views together. Here's a trivial example:
> >
> > CREATE TABLE users (id BIGINT PRIMARY KEY, name VARCHAR);
> > CREATE VIEW only_some_users AS (SELECT * FROM users WHERE id < 10);
> > CREATE VIEW some_other_users AS (SELECT * FROM users WHERE id > 3);
> >
> > EXPLAIN SELECT * FROM only_some_users
> > INNER JOIN some_other_users ON only_some_users.id = some_other_users.id;
> >
> > Hash Join  (cost=29.23..43.32 rows=90 width=144)
> >   Hash Cond: (users.id = users_1.id)
> >   ->  Bitmap Heap Scan on users  (cost=6.24..19.62 rows=270 width=72)
> >         Recheck Cond: (id < 10)
> >         ->  Bitmap Index Scan on users_pkey  (cost=0.00..6.18 rows=270 width=0)
> >               Index Cond: (id < 10)
> >   ->  Hash  (cost=19.62..19.62 rows=270 width=72)
> >         ->  Bitmap Heap Scan on users users_1  (cost=6.24..19.62 rows=270 width=72)
> >               Recheck Cond: (id > 3)
> >               ->  Bitmap Index Scan on users_pkey  (cost=0.00..6.18 rows=270 width=0)
> >                     Index Cond: (id > 3)
> >
> > Is there a reason why Postgres doesn't have an optimisation built in to
> optimise this JOIN? What I'm imagining is that a join between two aliases
> for the same table on its primary key could be optimised by treating them
> as the same table. I think the same would be true for self-joins on any
> non-null columns covered by a uniqueness constraint.
> >
> > If this is considered a desirable change, I'd be keen to work on it
> (with some guidance).
>
> There's currently a patch registered in the commitfest that could fix
> this for you, called "Remove self join on a unique column" [0].
>
>
> With regards,
>
> Matthias van de Meent
>
> [0] https://commitfest.postgresql.org/31/1712/, thread at
>
> https://www.postgresql.org/message-id/flat/64486b0b-0404-e39e-322d-080115490...@postgrespro.ru
>


Self-join optimisation

2021-03-11 Thread Hywel Carver
Hi,

I asked this question in the Postgres Slack, and was recommended to ask
here instead.

A few times, I've been in a situation where I want to join a table to
itself on its primary key. That typically happens because I have some kind
of summary view, which I then want to join to the original table (using its
primary key) to flesh out the summary data with other columns. That's
executed as a join, which surprised me. But in this case, I could extend
the view to have all of the columns of the original table to avoid the join.
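
To illustrate that first pattern (hypothetical names, not my real schema):

CREATE TABLE orders (id BIGINT PRIMARY KEY, customer_id BIGINT,
  total NUMERIC, notes TEXT);

-- A per-row "summary" view that derives extra columns from orders.
CREATE VIEW order_summaries AS
  (SELECT id, customer_id, total, total > 1000 AS is_large FROM orders);

-- Fleshing the summary out with the remaining columns is a self join on
-- orders.id, which currently gets executed as a real join.
SELECT s.*, o.notes
FROM order_summaries s
INNER JOIN orders o ON o.id = s.id;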

But there's another case that's harder to solve this way: combining views
together. Here's a trivial example:

CREATE TABLE users (id BIGINT PRIMARY KEY, name VARCHAR);
CREATE VIEW only_some_users AS (SELECT * FROM users WHERE id < 10);
CREATE VIEW some_other_users AS (SELECT * FROM users WHERE id > 3);

EXPLAIN SELECT * FROM only_some_users
INNER JOIN some_other_users ON only_some_users.id = some_other_users.id;

Hash Join  (cost=29.23..43.32 rows=90 width=144)
  Hash Cond: (users.id = users_1.id)
  ->  Bitmap Heap Scan on users  (cost=6.24..19.62 rows=270 width=72)
        Recheck Cond: (id < 10)
        ->  Bitmap Index Scan on users_pkey  (cost=0.00..6.18 rows=270 width=0)
              Index Cond: (id < 10)
  ->  Hash  (cost=19.62..19.62 rows=270 width=72)
        ->  Bitmap Heap Scan on users users_1  (cost=6.24..19.62 rows=270 width=72)
              Recheck Cond: (id > 3)
              ->  Bitmap Index Scan on users_pkey  (cost=0.00..6.18 rows=270 width=0)
                    Index Cond: (id > 3)

Is there a reason why Postgres doesn't have an optimisation built in to
optimise this JOIN? What I'm imagining is that a join between two aliases
for the same table on its primary key could be optimised by treating them
as the same table. I think the same would be true for self-joins on any
non-null columns covered by a uniqueness constraint.
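
In other words (this is my hand-written rewrite, not the output of any
existing optimisation), the planner could in principle treat the query above
as something like:

-- A single scan of users with both views' predicates applied to the same
-- relation. SELECT * over the join returns every column twice, hence the
-- repeated u.*.
SELECT u.*, u.*
FROM users u
WHERE u.id < 10 AND u.id > 3;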

If this is considered a desirable change, I'd be keen to work on it (with
some guidance).

Thanks,

Hywel