Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-19 Thread David Rowley
On Mon, 19 Oct 2020 at 13:06, Tom Lane  wrote:
>
> David Rowley  writes:
> > I guess we could resolve that concern by just changing the failing
> > assert to become: Assert(outer_skip_rows <= outer_rows ||
> > isinf(outer_rows));
>
> I can't really object to just weakening the Assert a tad.
> My thoughts would have run towards checking for the NaN though.

I ended up back-patching a change that does that.

Thanks for your input on this and for the report, Onder.

David




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-18 Thread David Rowley
On Mon, 19 Oct 2020 at 13:06, Tom Lane  wrote:
> (BTW, am I wrong to suppose that the same case fails the same
> way in our older branches?  Certainly that Assert has been there
> a long time.)

I only tested as far back as 9.5, but it does fail there.

David




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-18 Thread Tom Lane
David Rowley  writes:
> It would be good to hear Onder's case to see if he has a good argument
> for having a vested interest in pg13 not failing this way with asserts
> enabled.

Yeah, some context for this report would be a good thing.
(BTW, am I wrong to suppose that the same case fails the same
way in our older branches?  Certainly that Assert has been there
a long time.)

> I guess we could resolve that concern by just changing the failing
> assert to become: Assert(outer_skip_rows <= outer_rows ||
> isinf(outer_rows));

I can't really object to just weakening the Assert a tad.
My thoughts would have run towards checking for the NaN though.

regards, tom lane




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-18 Thread David Rowley
On Mon, 19 Oct 2020 at 12:25, Tom Lane  wrote:
>
> David Rowley  writes:
> > On Mon, 19 Oct 2020 at 12:10, Tom Lane  wrote:
> >> TBH, I see no need to do anything in the back branches.  This is not
> >> an issue for production usage.
>
> > I understand the Assert failure is pretty harmless, so non-assert
> > builds shouldn't suffer too greatly.  I just assumed that any large
> > stakeholders invested in upgrading to a newer version of PostgreSQL
> > may like to run various tests with their application against an
> > assert-enabled version of PostgreSQL, perhaps to gain some confidence
> > in the upgrade.  A failing assert is unlikely to inspire additional
> > confidence.
>
> If any existing outside regression tests hit such corner cases, then
> (a) we'd have heard about it, and (b) likely they'd fail in the older
> branch as well.  So I don't buy the argument that this will dissuade
> somebody from upgrading.

Hmm, well, it was reported to us. Perhaps swap the word "upgrading"
for "migrating".

It would be good to hear Onder's case to see if he has a good argument
for having a vested interest in pg13 not failing this way with asserts
enabled.

> I do, on the other hand, buy the idea that if anyone is indeed working
> in this realm, they might be annoyed by a behavior change in a stable
> branch.  So it cuts both ways.  On balance I don't think we should
> touch this in the back branches.

I guess we could resolve that concern by just changing the failing
assert to become: Assert(outer_skip_rows <= outer_rows ||
isinf(outer_rows));

It's pretty grotty but should address that concern.

David




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-18 Thread Tom Lane
David Rowley  writes:
> On Mon, 19 Oct 2020 at 12:10, Tom Lane  wrote:
>> TBH, I see no need to do anything in the back branches.  This is not
>> an issue for production usage.

> I understand the Assert failure is pretty harmless, so non-assert
> builds shouldn't suffer too greatly.  I just assumed that any large
> stakeholders invested in upgrading to a newer version of PostgreSQL
> may like to run various tests with their application against an
> assert-enabled version of PostgreSQL, perhaps to gain some confidence
> in the upgrade.  A failing assert is unlikely to inspire additional
> confidence.

If any existing outside regression tests hit such corner cases, then
(a) we'd have heard about it, and (b) likely they'd fail in the older
branch as well.  So I don't buy the argument that this will dissuade
somebody from upgrading.

I do, on the other hand, buy the idea that if anyone is indeed working
in this realm, they might be annoyed by a behavior change in a stable
branch.  So it cuts both ways.  On balance I don't think we should
touch this in the back branches.

regards, tom lane




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-18 Thread David Rowley
On Mon, 19 Oct 2020 at 12:10, Tom Lane  wrote:
>
> David Rowley  writes:
> > For the back branches, I think I'll go with something more minimal,
> > in the form of adding:
>
> TBH, I see no need to do anything in the back branches.  This is not
> an issue for production usage.

I understand the Assert failure is pretty harmless, so non-assert
builds shouldn't suffer too greatly.  I just assumed that any large
stakeholders invested in upgrading to a newer version of PostgreSQL
may like to run various tests with their application against an
assert-enabled version of PostgreSQL, perhaps to gain some confidence
in the upgrade.  A failing assert is unlikely to inspire additional
confidence.

I'm not set on backpatching; those are just my thoughts.

FWIW, the patch I'd thought of is attached.

David
From a309f1f0dc0329e16e328d6c3766c4014a64319b Mon Sep 17 00:00:00 2001
From: "dgrow...@gmail.com" 
Date: Mon, 19 Oct 2020 11:49:51 +1300
Subject: [PATCH] Fix Assert failure in join costing code

In the planner, it was possible, given an extreme enough case containing
a large number of joins, for the number of estimated rows to become
infinite.  This was a problem in initial_cost_mergejoin(), where the row
estimate could be multiplied by 0, resulting in NaN.  This could cause
the Asserts which verify that the skip rows are <= rows to fail, due to
the fact that NaN <= Inf is false.

To fix this, modify the join costing functions to look for isinf() row
estimates and explicitly set those to DBL_MAX.  It seems the various
places which perform calculations on these values only multiply by a
selectivity estimate, which should be between 0.0 and 1.0, so we should
never end up with infinity again after the multiplication takes place.

In the reported case, only initial_cost_mergejoin() suffered from the
failing Assert().  However, both final_cost_nestloop() and
final_cost_mergejoin() seem to make similar assumptions about the row
estimates not being infinite, so change those too.

This particular problem was fixed in master in a90c950fc, however, that
fix seemed a little too generic to go backpatching, so this is the minimal
fix for the same problem.

Reported-by: Onder Kalaci
Discussion: 
https://postgr.es/m/dm6pr21mb1211ff360183bca901b27f04d8...@dm6pr21mb1211.namprd21.prod.outlook.com
---
 src/backend/optimizer/path/costsize.c | 26 +++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index f39e6a9f80..f1eb9194ba 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -71,6 +71,7 @@
 
 #include "postgres.h"
 
+#include <float.h>
 #include <math.h>
 
 #include "access/amapi.h"
@@ -2737,11 +2738,18 @@ final_cost_nestloop(PlannerInfo *root, NestPath *path,
    QualCost    restrict_qual_cost;
    double      ntuples;
 
-   /* Protect some assumptions below that rowcounts aren't zero or NaN */
+   /*
+* Protect some assumptions below that rowcounts aren't zero, NaN or
+* infinite.
+*/
if (outer_path_rows <= 0 || isnan(outer_path_rows))
outer_path_rows = 1;
+   else if (isinf(outer_path_rows))
+   outer_path_rows = DBL_MAX;
if (inner_path_rows <= 0 || isnan(inner_path_rows))
inner_path_rows = 1;
+   else if (isinf(inner_path_rows))
+   inner_path_rows = DBL_MAX;
 
/* Mark the path with the correct row estimate */
if (path->path.param_info)
@@ -2952,11 +2960,18 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
                innerendsel;
    Path        sort_path;      /* dummy for result of cost_sort */
 
-   /* Protect some assumptions below that rowcounts aren't zero or NaN */
+   /*
+* Protect some assumptions below that rowcounts aren't zero, NaN or
+* infinite.
+*/
if (outer_path_rows <= 0 || isnan(outer_path_rows))
outer_path_rows = 1;
+   else if (isinf(outer_path_rows))
+   outer_path_rows = DBL_MAX;
if (inner_path_rows <= 0 || isnan(inner_path_rows))
inner_path_rows = 1;
+   else if (isinf(inner_path_rows))
+   inner_path_rows = DBL_MAX;
 
/*
 * A merge join will stop as soon as it exhausts either input stream
@@ -3185,9 +3200,14 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
                rescannedtuples;
    double      rescanratio;
 
-   /* Protect some assumptions below that rowcounts aren't zero or NaN */
+   /*
+* Protect some assumptions below that rowcounts aren't zero, NaN or
+* infinite.
+*/
if (inner_path_rows <= 0 || isnan(inner_path_rows))
inner_path_rows = 1;
+   else if (isinf(inner_path_rows))
+   inner_path_rows = DBL_MAX;
 
/* Mar
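
To see why clamping to DBL_MAX suffices for the Assert in question:
once the estimate is finite, multiplying it by a selectivity in
[0.0, 1.0] can no longer produce NaN.  A standalone sketch of that
arithmetic (an illustration, not part of the patch):

#include <float.h>
#include <math.h>
#include <stdio.h>

int
main(void)
{
    double  outer_path_rows = DBL_MAX;  /* clamped instead of infinity */
    double  outerstartsel = 0.0;        /* worst case: zero startup selectivity */
    double  outer_skip_rows = rint(outer_path_rows * outerstartsel);

    /* DBL_MAX * 0.0 is 0.0, not NaN, so the Assert's invariant holds */
    printf("%g <= %g ? %d\n", outer_skip_rows, outer_path_rows,
           outer_skip_rows <= outer_path_rows);     /* 0 <= 1.79769e+308 ? 1 */
    return 0;
}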

Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-18 Thread Tom Lane
David Rowley  writes:
> On Sat, 17 Oct 2020 at 06:00, Tom Lane  wrote:
>> I'm good with the v2 patch.

> Thanks a lot for having a look. I'll proceed with getting the v2 which
> I sent earlier into master.

> For the back branches, I think I'll go with something more minimal,
> in the form of adding:

TBH, I see no need to do anything in the back branches.  This is not
an issue for production usage.

regards, tom lane




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-18 Thread David Rowley
On Sat, 17 Oct 2020 at 06:00, Tom Lane  wrote:
> I'm confused now, because the v2 patch does remove those isnan calls?

I think that was a case of a last-minute change of mind and forgetting
to attach the updated patch.

> I rechecked the archives, and I agree that there's no data about
> exactly how we could have gotten a NaN here.  My guess though is
> infinity-times-zero in some earlier relation size estimate.  So
> hopefully the clamp to 1e100 will make that impossible, or if it
> doesn't then clamp_row_est() should still prevent a NaN from
> propagating to the next level up.
>
> I'm good with the v2 patch.

Thanks a lot for having a look. I'll proceed with getting the v2 which
I sent earlier into master.

For the back branches, I think I'll go with something more minimal,
in the form of adding:

    if (outer_path_rows <= 0 || isnan(outer_path_rows))
        outer_path_rows = 1;
+   else if (isinf(outer_path_rows))
+       outer_path_rows = DBL_MAX;

and the same for the inner_path_rows to each area in costsize.c which
has that code.

Wondering what your thoughts are on that.

David




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-16 Thread Tom Lane
I wrote:
> David Rowley  writes:
>> I've ended up leaving the NaN checks in the join costing functions.
>> There was no case mentioned in [1] that showed how we hit that
>> reported test case, so I'm not really confident enough to know I'm not
>> just reintroducing the same problem again by removing that.  The path
>> row estimate that had the NaN might not have been through
>> clamp_row_est(). Many don't.

> Hmm, I will try to find some time tomorrow to reconstruct that.

I'm confused now, because the v2 patch does remove those isnan calls?

I rechecked the archives, and I agree that there's no data about
exactly how we could have gotten a NaN here.  My guess though is
infinity-times-zero in some earlier relation size estimate.  So
hopefully the clamp to 1e100 will make that impossible, or if it
doesn't then clamp_row_est() should still prevent a NaN from
propagating to the next level up.

I'm good with the v2 patch.

regards, tom lane




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-13 Thread Tom Lane
David Rowley  writes:
> On Wed, 14 Oct 2020 at 04:16, Tom Lane  wrote:
>> So now I'm imagining something like
>> #define MAXIMUM_ROWCOUNT 1e100

> That seems more reasonable. We likely could push it a bit higher, but
> I'm not all that motivated to, since if an estimate that large were
> ever accurate, you could expect the heat death of the universe to
> arrive before your query results, in which case the user would likely
> struggle to find electrons to power their computer.

Right.  But I'm thinking about joins in which both inputs are clamped to
that maximum estimate.  If we allowed it to be as high as 1e200, then
multiplying the two input rowcounts together would itself overflow.
At 1e100, we can do that and also multiply in a ridiculous per-row cost,
and we're still well below the overflow threshold.  So this should go
pretty far towards preventing internal overflows in any one plan step's
cost & rows calculations.

(For comparison's sake, I believe the number of atoms in the observable
universe is thought to be somewhere on the order of 1e80.  So we are
pretty safe in thinking that no practically-useful rowcount estimate
will exceed 1e100; there is no need to make it higher.)
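
As a quick sanity check on that headroom, here is the arithmetic stated
as a standalone sketch (assuming IEEE-754 doubles, whose maximum is
about 1.8e308):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double  max_rows = 1e100;                   /* MAXIMUM_ROWCOUNT */
    double  join_rows = max_rows * max_rows;    /* two clamped inputs: 1e200 */
    double  cost = join_rows * 1e50;            /* times an absurd per-row cost */

    /* 1e250 is still far below the ~1.8e308 overflow threshold */
    printf("%d %d\n", isinf(join_rows), isinf(cost));   /* 0 0 */
    return 0;
}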

> I've ended up leaving the NaN checks in the join costing functions.
> There was no case mentioned in [1] that showed how we hit that
> reported test case, so I'm not really confident enough to know I'm not
> just reintroducing the same problem again by removing that.  The path
> row estimate that had the NaN might not have been through
> clamp_row_est(). Many don't.

Hmm, I will try to find some time tomorrow to reconstruct that.

regards, tom lane




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-13 Thread David Rowley
Thanks for having a look at this.

On Wed, 14 Oct 2020 at 04:16, Tom Lane  wrote:
> I'm on board with trying to get rid of NaN rowcount estimates more
> centrally.  I do not think it is a good idea to try to wire in a
> prohibition against zero rowcounts.  That is actually the correct
> thing in assorted scenarios --- one example recently under discussion
> was ModifyTable without RETURNING, and another is where we can prove
> that a restriction clause is constant-false.  At some point I think
> we are going to want to deal honestly with those cases instead of
> sweeping them under the rug.  So I'm disinclined to remove zero
> defenses that we'll just have to put back someday.

OK, that certainly limits the scope here.  It just means we can't get
rid of the <= 0 checks in join costing functions.  The problem case
that this was added for was a dummy Append. We still have valid cases
that won't convert the join rel to a dummy rel with a dummy Append on
one side.

> I think converting Inf to DBL_MAX, in hopes of avoiding creation of
> NaNs later, is fine.  (Note that applying rint() to that is quite
> useless --- in every floating-point system, values bigger than
> 2^number-of-mantissa-bits are certainly integral.)

Good point.

> I'm not sure why you propose to map NaN to one.  Wouldn't mapping it
> to Inf (and thence to DBL_MAX) make at least as much sense?  Probably
> more in fact.  We know that unwarranted one-row estimates are absolute
> death to our chances of picking a well-chosen plan.

That came about due to what the join costing functions were doing, i.e.:

/* Protect some assumptions below that rowcounts aren't zero or NaN */
if (inner_path_rows <= 0 || isnan(inner_path_rows))
   inner_path_rows = 1;

[1] didn't have an example case of how the NaNs were introduced, so I
was mostly just copying the logic that was added to fix that back in
72826fb3.

> > I toyed around with the attached patch, but I'm still not that excited
> > about the clamping of infinite values to DBL_MAX.  The test case I
> > showed above with generate_Series(1,379) still ends up with NaN cost
> > estimates due to costing a sort with DBL_MAX rows.  When I was writing
> > the patch, I had it in my head that the costs per row will always be
> > lower than 1.
>
> Yeah, that is a good point.  Maybe instead of clamping to DBL_MAX,
> we should clamp rowcounts to something that provides some headroom
> for multiplication by per-row costs.  A max rowcount of say 1e100
> should serve fine, while still being comfortably more than any
> non-insane estimate.
>
> So now I'm imagining something like
>
> #define MAXIMUM_ROWCOUNT 1e100

That seems more reasonable. We likely could push it a bit higher, but
I'm not all that motivated to, since if an estimate that large were
ever accurate, you could expect the heat death of the universe to
arrive before your query results, in which case the user would likely
struggle to find electrons to power their computer.

> clamp_row_est(double nrows)
> {
> /* Get rid of NaN, Inf, and impossibly large row counts */
> if (isnan(nrows) || nrows >= MAXIMUM_ROWCOUNT)
> nrows = MAXIMUM_ROWCOUNT;
> else
> ... existing logic ...

I've got something along those lines in the attached.

> Perhaps we should also have some sort of clamp for path cost
> estimates, at least to prevent them from being NaNs which
> is going to confuse add_path terribly.

Hmm. I'm not quite sure where to start with that one.  Many of the
path estimates will already go through clamp_row_est(). There are
various special requirements, e.g. Appends with no subpaths. So when to
apply it would depend on what path type it is. I'd say it would need
lots of careful analysis and a scattering of new calls in pathnode.c.

I've ended up leaving the NaN checks in the join costing functions.
There was no case mentioned in [1] that showed how we hit that
reported test case, so I'm not really confident enough to know I'm not
just reintroducing the same problem again by removing that.  The path
row estimate that had the NaN might not have been through
clamp_row_est(). Many don't.

David

[1] https://www.postgresql.org/message-id/7270.1302902842%40sss.pgh.pa.us
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index cd3716d494..733f7ea543 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -107,6 +107,13 @@
  */
 #define APPEND_CPU_COST_MULTIPLIER 0.5
 
+/*
+ * Maximum value for row estimates.  We cap row estimates to this to help
+ * ensure that costs based on these estimates remain within the range of what
+ * double can represent.  add_path() wouldn't act sanely given infinite or NaN
+ * cost values.
+ */
+#define MAXIMUM_ROWCOUNT 1e100
 
 double seq_page_cost = DEFAULT_SEQ_PAGE_COST;
 double random_page_cost = DEFAULT_RANDOM_PAGE_COST;
@@ -189,11 +196,14 @@ double
 clamp_row_est(double nrows)
 {
/*
-* Force estimate to be
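
The diff is truncated here, but combined with the MAXIMUM_ROWCOUNT
definition above, the new function presumably ends up along these lines
(a reconstruction from this discussion, not the literal patch text):

double
clamp_row_est(double nrows)
{
    /*
     * Avoid infinite and NaN row estimates.  Costs derived from such
     * values are going to be useless.  Also force the estimate to be at
     * least one row, to make explain output look better and to avoid
     * divide-by-zero when interpolating costs.  Make it an integer, too.
     */
    if (nrows > MAXIMUM_ROWCOUNT || isnan(nrows))
        nrows = MAXIMUM_ROWCOUNT;
    else if (nrows <= 1.0)
        nrows = 1.0;
    else
        nrows = rint(nrows);

    return nrows;
}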

Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-13 Thread Tom Lane
David Rowley  writes:
> Because there have been quite a few of these, and this report is yet
> another one, I wonder if it's time to try and stamp these out at the
> source rather than where the row counts are being used.

I'm on board with trying to get rid of NaN rowcount estimates more
centrally.  I do not think it is a good idea to try to wire in a
prohibition against zero rowcounts.  That is actually the correct
thing in assorted scenarios --- one example recently under discussion
was ModifyTable without RETURNING, and another is where we can prove
that a restriction clause is constant-false.  At some point I think
we are going to want to deal honestly with those cases instead of
sweeping them under the rug.  So I'm disinclined to remove zero
defenses that we'll just have to put back someday.

I think converting Inf to DBL_MAX, in hopes of avoiding creation of
NaNs later, is fine.  (Note that applying rint() to that is quite
useless --- in every floating-point system, values bigger than
2^number-of-mantissa-bits are certainly integral.)
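
To illustrate that point about integrality: doubles carry 53 mantissa
bits, so from 2^53 upward every representable value is already a whole
number and rint() can have no effect.  A quick standalone check
(assuming IEEE-754 doubles):

#include <float.h>
#include <math.h>
#include <stdio.h>

int
main(void)
{
    double  big = ldexp(1.0, 53);   /* 2^53 */

    printf("%d\n", rint(big) == big);           /* 1: rint is a no-op */
    printf("%d\n", big + 1.0 == big);           /* 1: 2^53 + 1 rounds back to 2^53 */
    printf("%d\n", rint(DBL_MAX) == DBL_MAX);   /* 1: same at the top of the range */
    return 0;
}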

I'm not sure why you propose to map NaN to one.  Wouldn't mapping it
to Inf (and thence to DBL_MAX) make at least as much sense?  Probably
more in fact.  We know that unwarranted one-row estimates are absolute
death to our chances of picking a well-chosen plan.

> I toyed around with the attached patch, but I'm still not that excited
> about the clamping of infinite values to DBL_MAX.  The test case I
> showed above with generate_Series(1,379) still ends up with NaN cost
> estimates due to costing a sort with DBL_MAX rows.  When I was writing
> the patch, I had it in my head that the costs per row will always be
> lower than 1.

Yeah, that is a good point.  Maybe instead of clamping to DBL_MAX,
we should clamp rowcounts to something that provides some headroom
for multiplication by per-row costs.  A max rowcount of say 1e100
should serve fine, while still being comfortably more than any
non-insane estimate.

So now I'm imagining something like

#define MAXIMUM_ROWCOUNT 1e100

clamp_row_est(double nrows)
{
/* Get rid of NaN, Inf, and impossibly large row counts */
if (isnan(nrows) || nrows >= MAXIMUM_ROWCOUNT)
nrows = MAXIMUM_ROWCOUNT;
else
... existing logic ...


Perhaps we should also have some sort of clamp for path cost
estimates, at least to prevent them from being NaNs which
is going to confuse add_path terribly.

regards, tom lane




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-13 Thread David Rowley
On Sat, 10 Oct 2020 at 02:19, Tom Lane  wrote:
> I'm fairly certain that every one of the existing NaN checks was put
> there on the basis of hard experience.  Possibly digging in the git
> history would offer more info about exactly where the NaNs came from.


I had a look at this and found there have been quite a number of fixes
which added either the isnan checks or the <= 0 checks.

Namely:

---
commit 72826fb362c4aada6d2431df0b706df448806c02
Author: Tom Lane 
Date:   Fri Apr 15 17:45:41 2011 -0400

Guard against incoming rowcount estimate of NaN in cost_mergejoin().

Although rowcount estimates really ought not be NaN, a bug elsewhere
could perhaps result in that, and that would cause Assert failure in
cost_mergejoin, which I believe to be the explanation for bug #5977 from
Anton Kuznetsov.  Seems like a good idea to expend a couple more cycles
to prevent that, even though the real bug is elsewhere.  Not back-patching,
though, because we don't encourage running production systems with
Asserts on.


The discussion for that is in
https://www.postgresql.org/message-id/flat/4602.1302705756%40sss.pgh.pa.us#69dd8c334aa714cfac4e0d9b04c5201c

commit 76281aa9647e6a5dfc646514554d0f519e3b8a58
Author: Tom Lane 
Date:   Sat Mar 26 12:03:12 2016 -0400

Avoid a couple of zero-divide scenarios in the planner.



commit fd791e7b5a1bf53131ad15e68e4d4f8ca795fcb4
Author: Tom Lane 
Date:   Mon Mar 24 21:53:04 2008 +

When a relation has been proven empty by constraint exclusion, propagate
that knowledge up through any joins it participates in.  We were doing
that already in some special cases but not in the general case.  Also,
defend against zero row estimates for the input relations in
cost_mergejoin --- this fix may have eliminated the only scenario in
which that can happen, but be safe.  Per report from Alex Solovey.


That was reported in
https://www.postgresql.org/message-id/flat/BLU136-DAV79FF310AC13FFC96FA2FDAEFD0%40phx.gbl#4cde17b2369fc7e0da83cc7d4aeeaa48

The problem was that an Append with no subpaths could have a 0 row estimate.
---

Because there have been quite a few of these, and this report is yet
another one, I wonder if it's time to try and stamp these out at the
source rather than where the row counts are being used.

I toyed around with the attached patch, but I'm still not that excited
about the clamping of infinite values to DBL_MAX.  The test case I
showed above with generate_Series(1,379) still ends up with NaN cost
estimates due to costing a sort with DBL_MAX rows.  When I was writing
the patch, I had it in my head that the costs per row will always be
lower than 1. Because of that, I thought that even if the row count is
dangerously close to DBL_MAX, the costs would never be higher than the
row count... Turns out, I was wrong about that, as sorting a number of
rows even close to DBL_MAX would be beyond astronomically expensive and
would cause the costs to go infinite.

The fd791e7b5 fix was for a subpath-less Append node having a 0-row
estimate and causing problems in the costing of merge join. In the
patch, I thought it would be better just to fix this by insisting that
Append will always have at least 1 row.  That means even a dummy path
would have 1 row, which will become a const-false Result in the plan.
I've had to add a special case to set the plan_rows back to 0 so that
EXPLAIN shows 0 rows as it did before.  That's not exactly pretty, but
I still feel there is merit in insisting we never have 0-row paths to
get away from these types of bugs once and for all.

The patch does fix the failing Assert. However, something along these
lines seems more suitable for master only. The back branches should
maybe just get the more localised isinf() check and clamp to DBL_MAX
that I mentioned earlier in this thread.

I've searched through the code to see if there are other possible
cases where paths may be generated with a 0-row count.  I imagine
anything that has a qual and performs a selectivity estimate will
already have a clamp_row_est() since we'd see fractional row counts if
it didn't.  That leaves me with Append / Merge Append and each join
type + aggregates.  Currently, it seems we will never generate a Merge
Append without any sub-paths. I wondered if I should just Assert
that's the case in create_merge_append_path(). I ended up just adding
a clamp_row_est call instead.  calc_joinrel_size_estimate() seems to
handle all join path row estimates. That uses clamp_row_est.
Aggregate paths can reduce the number of rows, but I think all the row
estimates from those will go through estimate_num_groups(), which
appears to never be able to return 0.

David
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index cd3716d494..edcae1f36a 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -45,10 +45,9 @@
  * (total_cost - startup_c

Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-09 Thread Tom Lane
David Rowley  writes:
> On Fri, 9 Oct 2020 at 15:06, Tom Lane  wrote:
>> I notice there are some other ad-hoc isnan() checks scattered
>> about costsize.c, too.  Maybe we should indeed consider fixing
>> clamp_row_est() to get rid of inf (and nan too, I suppose)
>> so that we'd not need those.  I don't recall the exact cases
>> that made us introduce those checks, but they were for cases
>> a lot more easily reachable than this one, I believe.

> Is there actually a case where nrows could be NaN?  If not, then it
> seems like a wasted check.  Wouldn't it take one of the input rels
> having an Inf row estimate (which won't happen after changing
> clamp_row_est()), or the selectivity estimate being NaN?

I'm fairly certain that every one of the existing NaN checks was put
there on the basis of hard experience.  Possibly digging in the git
history would offer more info about exactly where the NaNs came from.

regards, tom lane




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-08 Thread David Rowley
On Fri, 9 Oct 2020 at 15:06, Tom Lane  wrote:
> I notice there are some other ad-hoc isnan() checks scattered
> about costsize.c, too.  Maybe we should indeed consider fixing
> clamp_row_est() to get rid of inf (and nan too, I suppose)
> so that we'd not need those.  I don't recall the exact cases
> that made us introduce those checks, but they were for cases
> a lot more easily reachable than this one, I believe.

Is there actually a case where nrows could be NaN?  If not, then it
seems like a wasted check.  Wouldn't it take one of the input rels
having an Inf row estimate (which won't happen after changing
clamp_row_est()), or the selectivity estimate being NaN?

David




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-08 Thread Tom Lane
David Rowley  writes:
> Are you worried about the costs above the join that triggers that
> coming out as NaN with that fix? It appears that's the case.

[ pokes at that... ]  Yeah, it looks like nestloop cost estimation
also has some issues with inf-times-zero producing NaN; it's just
not asserting about it.

I notice there are some other ad-hoc isnan() checks scattered
about costsize.c, too.  Maybe we should indeed consider fixing
clamp_row_est() to get rid of inf (and nan too, I suppose)
so that we'd not need those.  I don't recall the exact cases
that made us introduce those checks, but they were for cases
a lot more easily reachable than this one, I believe.

regards, tom lane




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-08 Thread David Rowley
On Fri, 9 Oct 2020 at 12:59, Tom Lane  wrote:
> If we did want to do something here, I'd consider something like
>
> if (isnan(outer_skip_rows))
> outer_skip_rows = 0;
> if (isnan(inner_skip_rows))
> inner_skip_rows = 0;

Are you worried about the costs above the join that triggers that
coming out as NaN with that fix? It appears that's the case. Cost
comparisons of paths with NaN costs are not going to do anything sane.

I guess whether or not that matters depends on if we expect any real
queries to hit this, or if we just want to stop the Assert failure.

... 500 joins. I'm willing to listen to an explanation of the use case,
but in the absence of one, I'd be leaning towards "you're doing it
wrong".  If a good use case does turn out to exist, then perhaps your
proposed fix is okay.

David




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-08 Thread Tom Lane
David Rowley  writes:
> I admit it's annoying to add cycles to clamp_row_est() for such insane cases.

I poked at this a bit more closely, and noted that the actual problem is
that when we do this:

outer_skip_rows = rint(outer_path_rows * outerstartsel);

we have outer_path_rows = inf, outerstartsel = 0, and of course inf times
zero is NaN.  So we end up asserting "NaN <= Inf", not "Inf <= Inf"
(which wouldn't have caused a problem).
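
For concreteness, a minimal standalone sketch of that float behavior
(just the IEEE-754 arithmetic in question, not PostgreSQL code):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double  outer_path_rows = INFINITY;     /* overflowed row estimate */
    double  outerstartsel = 0.0;            /* zero startup selectivity */
    double  outer_skip_rows = rint(outer_path_rows * outerstartsel);

    printf("%g\n", outer_skip_rows);        /* nan: inf * 0 is NaN */

    /*
     * Every ordered comparison involving NaN is false, so the condition
     * of Assert(outer_skip_rows <= outer_rows) cannot hold.
     */
    printf("%d\n", outer_skip_rows <= INFINITY);    /* 0 */
    return 0;
}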

If we did want to do something here, I'd consider something like

if (isnan(outer_skip_rows))
outer_skip_rows = 0;
if (isnan(inner_skip_rows))
inner_skip_rows = 0;

(We shouldn't need that for outer_rows/inner_rows, since the endsel
values can't be 0.)  Messing with clamp_row_est would be a much more
indirect way of fixing it, as well as having more widespread effects.

In the end though, I'm still not terribly excited about this.

regards, tom lane




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-08 Thread David Rowley
On Fri, 9 Oct 2020 at 12:16, Tom Lane  wrote:
>
> > Perhaps the right fix is to modify clamp_row_est() with:
>
> I thought of that too, but as you say, if the rowcount has overflowed a
> double then we've got way worse problems.  It'd make more sense to try
> to keep the count to a saner value in the first place.

I wonder if there was something more logical we could do to maintain
sane estimates too, but someone could surely still cause it to blow up
by writing a long series of clause-less joins. We can't really get
away from the fact that we must estimate those as inner_rows *
outer_rows.

I admit it's annoying to add cycles to clamp_row_est() for such insane cases.

David




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-08 Thread Tom Lane
David Rowley  writes:
> The reason it fails is that outer_path_rows has become infinity due to
> calc_joinrel_size_estimate continually multiplying in the join
> selectivity of 0.05 (due to our 200 default num distinct from lack of
> any stats) which after a number of iterations causes the number to
> become very large.

0.005, but yeah.  We're estimating that each additional join inflates
the output size by about 6x (1270 * 0.005), and after a few hundred
of those, it'll overflow.
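
For what it's worth, that compounding is easy to reproduce outside the
planner.  A quick sketch using the numbers from this thread (the
1270-row no-stats estimate per input, default join selectivity 1/200):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double  rows = 1270.0;      /* no-stats estimate for users_table */
    int     njoins = 0;

    /* each additional join multiplies by 1270 * 0.005 = 6.35 */
    while (!isinf(rows))
    {
        rows *= 1270.0 * 0.005;
        njoins++;
    }
    printf("inf after %d joins\n", njoins);     /* ~381 here, in the same
                                                 * ballpark as the 379-join
                                                 * repro */
    return 0;
}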

> Perhaps the right fix is to modify clamp_row_est() with:

I thought of that too, but as you say, if the rowcount has overflowed a
double then we've got way worse problems.  It'd make more sense to try
to keep the count to a saner value in the first place.  

In the end, (a) this is an Assert, so not a problem for production
systems, and (b) it's going to take you longer than you want to
wait to join 500+ tables, anyhow, unless maybe they're empty.
I'm kind of disinclined to do anything in the way of a band-aid fix.

If somebody has an idea for a different way of estimating the join
size with no stats, we could talk about that.  I notice though that
the only way a plan of this sort isn't going to blow up at execution
is if the join multiplication factor is at most 1, i.e. the join
key is unique.  But guess what, we already know what to do in that
case.  Adding a unique or pkey constraint to users_table.user_id
causes the plan to collapse entirely (if they're left joins) or
at least still produce a small rowcount estimate (if plain joins).

regards, tom lane




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-08 Thread David Rowley
On Fri, 9 Oct 2020 at 08:16, Onder Kalaci  wrote:
> I hit an assertion failure. When asserts are disabled, it works fine
> even with more tables (>5000).
>
> Steps to reproduce:
> CREATE TABLE users_table (user_id int, time timestamp, value_1 int, value_2 
> int, value_3 float, value_4 bigint);
> 250 relations work fine, see the query (too long to copy & paste here): 
> https://gist.github.com/onderkalaci/2b40a18d989da389ee4fb631e1ad7c0e#file-steps_to_assert_pg-sql-L41

I had a quick look at this and I can recreate it using the following
(using psql):

select 'explain select count(*) from users_table ' || string_Agg('LEFT
JOIN users_table u'|| x::text || ' USING (user_id)',' ') from
generate_Series(1,379)x;
\gexec

That triggers the assert due to Assert(outer_skip_rows <= outer_rows)
failing in initial_cost_mergejoin().

The reason it fails is that outer_path_rows has become infinity due to
calc_joinrel_size_estimate continually multiplying in the join
selectivity of 0.05 (due to our 200 default num distinct from lack of
any stats) which after a number of iterations causes the number to
become very large.

Instead of running 379 joins from above, try with 378 and you get:

 Aggregate  (cost=NaN..NaN rows=1 width=8)
   ->  Nested Loop Left Join  (cost=33329.16..NaN rows=Infinity width=0)
 Join Filter: (users_table.user_id = u378.user_id)
 ->  Merge Left Join  (cost=33329.16.. width=4)
   Merge Cond: (users_table.user_id = u377.user_id)
   ->  Merge Left Join  (cost=33240.99.. width=4)

Changing the code in initial_cost_mergejoin() to add:

    if (outer_path_rows <= 0 || isnan(outer_path_rows))
        outer_path_rows = 1;
+   else if (isinf(outer_path_rows))
+       outer_path_rows = DBL_MAX;

does seem to fix the problem, but that's certainly not the right fix.

Perhaps the right fix is to modify clamp_row_est() with:

@@ -193,7 +194,9 @@ clamp_row_est(double nrows)
 * better and to avoid possible divide-by-zero when interpolating costs.
 * Make it an integer, too.
 */
-   if (nrows <= 1.0)
+   if (isinf(nrows))
+   nrows = rint(DBL_MAX);
+   else if (nrows <= 1.0)
nrows = 1.0;
else
nrows = rint(nrows);

but the row estimates are getting pretty insane well before then.
DBL_MAX is 226 orders of magnitude more than the estimated number of
atoms in the observable universe, so it seems pretty safe to assume
nobody will figure out a way to store that many tuples on a disk any
time soon.

Perhaps DBL_MAX is way too big a number to clamp at. I'm just not sure
what we should reduce it to so that it is reasonable.

David




Assertion failure with LEFT JOINs among >500 relations

2020-10-08 Thread Onder Kalaci
Hi,

I hit an assertion failure. When asserts are disabled, it works fine even
with more tables (>5000).

Steps to reproduce:

CREATE TABLE users_table (user_id int, time timestamp, value_1 int, value_2 
int, value_3 float, value_4 bigint);

250 relations work fine, see the query (too long to copy & paste here): 
https://gist.github.com/onderkalaci/2b40a18d989da389ee4fb631e1ad7c0e#file-steps_to_assert_pg-sql-L41

--  when # relations >500, we hit the assertion (too long to copy & paste here):
See the query: 
https://gist.github.com/onderkalaci/2b40a18d989da389ee4fb631e1ad7c0e#file-steps_to_assert_pg-sql-L45


And, the backtrace:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x7fff639fa2c2 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x7fff63ab5bf1 libsystem_pthread.dylib`pthread_kill + 284
frame #2: 0x7fff639646a6 libsystem_c.dylib`abort + 127
frame #3: 0x000102180a02 
postgres`ExceptionalCondition(conditionName=, 
errorType=, fileName=, lineNumber=) at 
assert.c:67:2
frame #4: 0x000101ece9b2 
postgres`initial_cost_mergejoin(root=0x7ff0, 
workspace=0x7ffeedf5b528, jointype=JOIN_INNER, mergeclauses=, 
outer_path=0x00012ebf12d0, inner_path=0x4093d800, 
outersortkeys=0x, innersortkeys=0x00012ebf68e8, 
extra=0x7ffeedf5b6f8) at costsize.c:3043:2
frame #5: 0x000101eda01b 
postgres`try_mergejoin_path(root=0x000104a12618, 
joinrel=0x00012ebeede0, outer_path=0x00012ebf12d0, 
inner_path=0x0001283d00e8, pathkeys=0x00012ebf67e0, 
mergeclauses=0x00012ebf6890, outersortkeys=0x, 
innersortkeys=0x00012ebf68e8, jointype=JOIN_LEFT, extra=0x7ffeedf5b6f8, 
is_partial=) at joinpath.c:615:2
frame #6: 0x000101ed9426 
postgres`sort_inner_and_outer(root=0x000104a12618, 
joinrel=0x00012ebeede0, outerrel=, innerrel=, 
jointype=JOIN_LEFT, extra=0x7ffeedf5b6f8) at joinpath.c:1038:3
frame #7: 0x000101ed8f7a 
postgres`add_paths_to_joinrel(root=0x000104a12618, 
joinrel=0x00012ebeede0, outerrel=0x00012ebe7b48, 
innerrel=0x000127f146e0, jointype=, sjinfo=, 
restrictlist=0x00012ebf42b0) at joinpath.c:269:3
frame #8: 0x000101edbdc6 
postgres`populate_joinrel_with_paths(root=0x000104a12618, 
rel1=0x00012ebe7b48, rel2=0x000127f146e0, joinrel=0x00012ebeede0, 
sjinfo=0x00012809edc8, restrictlist=0x00012ebf42b0) at joinrels.c:824:4
frame #9: 0x000101edb57a 
postgres`make_join_rel(root=0x000104a12618, rel1=0x00012ebe7b48, 
rel2=0x000127f146e0) at joinrels.c:760:2
frame #10: 0x000101edb1ec 
postgres`make_rels_by_clause_joins(root=0x000104a12618, 
old_rel=0x00012ebe7b48, other_rels_list=, 
other_rels=) at joinrels.c:312:11
frame #11: 0x000101edada3 
postgres`join_search_one_level(root=0x000104a12618, level=2) at 
joinrels.c:123:4
frame #12: 0x000101ec7feb 
postgres`standard_join_search(root=0x000104a12618, levels_needed=8, 
initial_rels=0x00012ebf4078) at allpaths.c:3097:3
frame #13: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280a5618) at allpaths.c:2993:14
frame #14: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280ab320) at allpaths.c:2993:14
frame #15: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280b1028) at allpaths.c:2993:14
frame #16: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280b6d30) at allpaths.c:2993:14
frame #17: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280bca38) at allpaths.c:2993:14
frame #18: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280c2740) at allpaths.c:2993:14
frame #19: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280c8448) at allpaths.c:2993:14
frame #20: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280ce150) at allpaths.c:2993:14
frame #21: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280d3e58) at allpaths.c:2993:14
frame #22: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280d9b60) at allpaths.c:2993:14
frame #23: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280df868) at allpaths.c:2993:14
frame #24: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280e5570) at allpaths.c:2993:14
frame #25: 0x000101ec6b38 
postgres`make_rel_from_joinlist(root=0x000104a12618, 
joinlist=0x0001280eb278) at allpaths.c:2993:14