subject:"Re\: Bug in row_number\(\) optimization"

Re: Bug in row_number() optimization

2022-12-04 Thread David Rowley

On Thu, 1 Dec 2022 at 21:18, Richard Guo  wrote:
>> +   if (!func_strict(opexpr->opfuncid))
>> +   return false;
>>
>> Should return true instead?
>
>
> Yeah, you're right.  This should be a thinko.

Yeah, oops. That's wrong.

I've adjusted that in the attached.

I'm keen to move along and push the fix for this bug.  If there are no
objections to the method in the attached and also adding the
restriction to limit the optimization to only working with strict
OpExprs, then I'm going to push this, likely about 24 hours from now.

David
diff --git a/src/backend/executor/nodeWindowAgg.c 
b/src/backend/executor/nodeWindowAgg.c
index 81ba024bba..110c594edd 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -2300,7 +2300,27 @@ ExecWindowAgg(PlanState *pstate)
continue;
}
else
+   {
winstate->status = 
WINDOWAGG_PASSTHROUGH;
+
+   /*
+* Otherwise, if we're not the 
top-window, we'd better
+* NULLify the aggregate 
results, else, by-ref result
+* types will point to freed 
memory.  We can get away
+* without storing the final 
result of the WindowFunc
+* because in the planner we 
insisted that the
+* runcondition is strict.  
Setting these to NULL will
+* ensure the top-level 
WindowAgg filter will always
+* filter out the rows produced 
in this WindowAgg when
+* not in WINDOWAGG_RUN state.
+*/
+   numfuncs = winstate->numfuncs;
+   for (i = 0; i < numfuncs; i++)
+   {
+   
econtext->ecxt_aggvalues[i] = (Datum) 0;
+   
econtext->ecxt_aggnulls[i] = true;
+   }
+   }
}
else
{
diff --git a/src/backend/optimizer/path/allpaths.c 
b/src/backend/optimizer/path/allpaths.c
index 03ee6dc832..595870ea30 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2399,6 +2399,24 @@ check_and_push_window_quals(Query *subquery, 
RangeTblEntry *rte, Index rti,
if (list_length(opexpr->args) != 2)
return true;
 
+   /*
+* Currently, we restrict this optimization to strict OpExprs.  The 
reason
+* for this is that during execution, once the runcondition becomes 
false,
+* we stop evaluating WindowFuncs in WindowAgg nodes and since we're no
+* longer updating them, we NULLify the aggregate and window aggregate
+* results to prevent by-ref result typed window aggregates from 
pointing
+* to free'd memory for byref WindowFuncs and storing stale values for
+* byval WindowFuncs (remember that window functions such as 
row_number()
+* return a byref type on non-FLOAT8PASSBYVAL builds).  Upper-level
+* WindowAgg nodes may make reference to these results in their filter
+* clause and we can ensure the filter remain false by making sure the
+* operator is strict and by performing the NULLification in the 
executor
+* when the runcondition first becomes false.
+*/
+   set_opfuncid(opexpr);
+   if (!func_strict(opexpr->opfuncid))
+   return true;
+
/*
 * Check for plain Vars that reference window functions in the subquery.
 * If we find any, we'll ask find_window_run_conditions() if 'opexpr' 
can

Re: Bug in row_number() optimization

2022-12-04 Thread David Rowley

On Mon, 28 Nov 2022 at 22:59, Sergey Shinderuk
 wrote:
>
> On 28.11.2022 03:23, David Rowley wrote:
> > On Sat, 26 Nov 2022 at 05:19, Tom Lane  wrote:
> >> It's pretty unlikely that this would work during an actual index scan.
> >> I'm fairly sure that btree (and other index AMs) have hard-wired
> >> assumptions that index operators are strict.
> >
> > If we're worried about that then we could just restrict this
> > optimization to only work with strict quals.
>
> Not sure this is necessary if btree operators must be strict anyway.

I'd rather see the func_strict() test in there. You've already
demonstrated you can get wrong results with a non-strict operator. I'm
not disputing that it sounds like a broken operator class or not. I
just want to ensure we don't leave any holes open for this
optimisation to return incorrect results.

David

Re: Bug in row_number() optimization

2022-12-04 Thread David Rowley

On Fri, 2 Dec 2022 at 00:21, Sergey Shinderuk
 wrote:
> Maybe I'm missing something, but the previous call to spool_tuples()
> might have read extra tuples (if the tuplestore spilled to disk), and
> after switching to WINDOWAGG_PASSTHROUGH_STRICT mode we nevertheless
> would loop through these extra tuples and call ExecProject if only to
> increment winstate->currentpos.

The tuples which are spooled in the WindowAgg node are the ones from
the WindowAgg's subnode.  Since these don't contain the results of the
WindowFunc, then I don't think there's any issue with what's stored in
any of the spooled tuples.

What matters is what we pass along to the node that's reading from the
WindowAgg. If we NULL out the memory where we store the WindowFunc
(and maybe an Aggref) results then the ExecProject in ExecWindowAgg()
will no longer fill the WindowAgg's output slot with the address of
free'd memory (or some stale byval value which has lingered for byval
return type WindowFuncs).

Since the patch I sent sets the context's ecxt_aggnulls to true, it
means that when we do the ExecProject(), the EEOP_WINDOW_FUNC in
ExecInterpExpr (or the JIT equivalent) will put an SQL NULL in the
*output* slot for the WindowAgg node. The same is true for
EEOP_AGGREFs as the WindowAgg node that we are running in
WINDOWAGG_PASSTHROUGH mode could also contain normal aggregate
functions, not just WindowFuncs.

David

Re: Bug in row_number() optimization

2022-12-01 Thread Sergey Shinderuk


On 01.12.2022 11:18, Richard Guo wrote:


On Mon, Nov 28, 2022 at 5:59 PM Sergey Shinderuk 
mailto:s.shinde...@postgrespro.ru>> wrote:


Not quite sure that we don't need to do anything for the
WINDOWAGG_PASSTHROUGH_STRICT case. Although, we won't return any more
tuples for the current partition, we still call ExecProject with
dangling pointers. Is it okay?

AFAIU once we go into WINDOWAGG_PASSTHROUGH_STRICT we will spool all the
remaining tuples in the current partition without storing them and then
move to the next partition if available and become WINDOWAGG_RUN again
or become WINDOWAGG_DONE if there are no further partitions.  It seems
we would not have chance to see the dangling pointers.


Maybe I'm missing something, but the previous call to spool_tuples() 
might have read extra tuples (if the tuplestore spilled to disk), and 
after switching to WINDOWAGG_PASSTHROUGH_STRICT mode we nevertheless 
would loop through these extra tuples and call ExecProject if only to 
increment winstate->currentpos.


--
Sergey Shinderukhttps://postgrespro.com/

Re: Bug in row_number() optimization

2022-12-01 Thread Richard Guo

On Mon, Nov 28, 2022 at 5:59 PM Sergey Shinderuk 
wrote:

> Not quite sure that we don't need to do anything for the
> WINDOWAGG_PASSTHROUGH_STRICT case. Although, we won't return any more
> tuples for the current partition, we still call ExecProject with
> dangling pointers. Is it okay?

AFAIU once we go into WINDOWAGG_PASSTHROUGH_STRICT we will spool all the
remaining tuples in the current partition without storing them and then
move to the next partition if available and become WINDOWAGG_RUN again
or become WINDOWAGG_DONE if there are no further partitions.  It seems
we would not have chance to see the dangling pointers.

> +   if (!func_strict(opexpr->opfuncid))
> +   return false;
>
> Should return true instead?

Yeah, you're right.  This should be a thinko.

Thanks
Richard

Re: Bug in row_number() optimization

2022-11-28 Thread Sergey Shinderuk


On 28.11.2022 03:23, David Rowley wrote:

On Sat, 26 Nov 2022 at 05:19, Tom Lane  wrote:


Sergey Shinderuk  writes:

What about user-defined operators? I created my own <= operator for int8
which returns true on null input, and put it in a btree operator class.
Admittedly, it's weird that (null <= 1) evaluates to true. But does it
violate  the contract of the btree operator class or something? Didn't
find a clear answer in the docs.


It's pretty unlikely that this would work during an actual index scan.
I'm fairly sure that btree (and other index AMs) have hard-wired
assumptions that index operators are strict.


If we're worried about that then we could just restrict this
optimization to only work with strict quals.


Not sure this is necessary if btree operators must be strict anyway.



The proposal to copy the datums into the query context does not seem
to me to be a good idea. If there are a large number of partitions
then it sounds like we'll leak lots of memory.  We could invent some
partition context that we reset after each partition, but that's
probably more complexity than it would be worth.


Ah, good point.



I've attached a draft patch to move the code to nullify the aggregate
results so that's only done once per partition and adjusted the
planner to limit this to strict quals.


Not quite sure that we don't need to do anything for the 
WINDOWAGG_PASSTHROUGH_STRICT case. Although, we won't return any more 
tuples for the current partition, we still call ExecProject with 
dangling pointers. Is it okay?



+   if (!func_strict(opexpr->opfuncid))
+   return false;

Should return true instead?

--
Sergey Shinderukhttps://postgrespro.com/

Re: Bug in row_number() optimization

2022-11-27 Thread David Rowley

On Sat, 26 Nov 2022 at 05:19, Tom Lane  wrote:
>
> Sergey Shinderuk  writes:
> > What about user-defined operators? I created my own <= operator for int8
> > which returns true on null input, and put it in a btree operator class.
> > Admittedly, it's weird that (null <= 1) evaluates to true. But does it
> > violate  the contract of the btree operator class or something? Didn't
> > find a clear answer in the docs.
>
> It's pretty unlikely that this would work during an actual index scan.
> I'm fairly sure that btree (and other index AMs) have hard-wired
> assumptions that index operators are strict.

If we're worried about that then we could just restrict this
optimization to only work with strict quals.

The proposal to copy the datums into the query context does not seem
to me to be a good idea. If there are a large number of partitions
then it sounds like we'll leak lots of memory.  We could invent some
partition context that we reset after each partition, but that's
probably more complexity than it would be worth.

I've attached a draft patch to move the code to nullify the aggregate
results so that's only done once per partition and adjusted the
planner to limit this to strict quals.

David
diff --git a/src/backend/executor/nodeWindowAgg.c 
b/src/backend/executor/nodeWindowAgg.c
index 81ba024bba..110c594edd 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -2300,7 +2300,27 @@ ExecWindowAgg(PlanState *pstate)
continue;
}
else
+   {
winstate->status = 
WINDOWAGG_PASSTHROUGH;
+
+   /*
+* Otherwise, if we're not the 
top-window, we'd better
+* NULLify the aggregate 
results, else, by-ref result
+* types will point to freed 
memory.  We can get away
+* without storing the final 
result of the WindowFunc
+* because in the planner we 
insisted that the
+* runcondition is strict.  
Setting these to NULL will
+* ensure the top-level 
WindowAgg filter will always
+* filter out the rows produced 
in this WindowAgg when
+* not in WINDOWAGG_RUN state.
+*/
+   numfuncs = winstate->numfuncs;
+   for (i = 0; i < numfuncs; i++)
+   {
+   
econtext->ecxt_aggvalues[i] = (Datum) 0;
+   
econtext->ecxt_aggnulls[i] = true;
+   }
+   }
}
else
{
diff --git a/src/backend/optimizer/path/allpaths.c 
b/src/backend/optimizer/path/allpaths.c
index 4ddaed31a4..a25956200e 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2399,6 +2399,22 @@ check_and_push_window_quals(Query *subquery, 
RangeTblEntry *rte, Index rti,
if (list_length(opexpr->args) != 2)
return true;
 
+   /*
+* Currently, we restrict this optimization to strict OpExprs.  The 
reason
+* for this is that during execution, once we stop evaluating 
WindowFuncs
+* on lower-level WindowAgg nodes, since we're no longer updating them,
+* we NULLify the aggregate and window aggregate results to prevent 
by-ref
+* result typed window aggregates from pointing to free'd memory (this 
can
+* happen on non-FLOAT8PASSBYVAL builds).  Upper-level WindowAgg nodes 
may
+* make reference to these results in their filter clause and we can
+* ensure the filter remain false by making sure the operator is strict
+* and by performing the NULLification when the run-condition is no 
longer
+* true.
+*/
+   set_opfuncid(opexpr);
+   if (!func_strict(opexpr->opfuncid))
+   return false;
+
/*
 * Check for plain Vars that reference window functions in the subquery.
 * If we find any, we'll ask find_window_run_conditions() if 'opexpr' 
can

Re: Bug in row_number() optimization

2022-11-25 Thread Tom Lane

Sergey Shinderuk  writes:
> What about user-defined operators? I created my own <= operator for int8 
> which returns true on null input, and put it in a btree operator class. 
> Admittedly, it's weird that (null <= 1) evaluates to true. But does it 
> violate  the contract of the btree operator class or something? Didn't 
> find a clear answer in the docs.

It's pretty unlikely that this would work during an actual index scan.
I'm fairly sure that btree (and other index AMs) have hard-wired
assumptions that index operators are strict.

regards, tom lane

Re: Bug in row_number() optimization

2022-11-25 Thread Sergey Shinderuk


On 25.11.2022 15:46, Richard Guo wrote:

Considering the 'Filter' is a strict function, marking it as
NULL would do.  I think this is why this patch works.


What about user-defined operators? I created my own <= operator for int8 
which returns true on null input, and put it in a btree operator class. 
With my operator I get:


  depname  | empno | salary | enroll_date | c1 | rn | c2 | c3
---+---++-++++
 personnel | 5 |   3500 | 2007-12-10  |  2 |  1 |  2 |  2
 sales | 3 |   4800 | 2007-08-01  |  3 |  1 |  3 |  3
 sales | 4 |   4800 | 2007-08-08  |  3 |||  3
(3 rows)

Admittedly, it's weird that (null <= 1) evaluates to true. But does it 
violate  the contract of the btree operator class or something? Didn't 
find a clear answer in the docs.


--
Sergey Shinderukhttps://postgrespro.com/

Re: Bug in row_number() optimization

2022-11-25 Thread Richard Guo

On Fri, Nov 25, 2022 at 11:26 AM David Rowley  wrote:

> There are two different pass-through modes that the WindowAgg can move
> into when it detects that the run condition is no longer true:
>
> 1) WINDOWAGG_PASSTHROUGH and
> 2) WINDOWAGG_PASSTHROUGH_STRICT
>
> #2 is used when the WindowAgg is the top-level one in this query
> level. Remember we'll need multiple WindowAgg nodes when there are
> multiple different windows to evaluate.  The reason that we need #1 is
> that if there are multiple WindowAggs, then the top-level one (or just
> any WindowAgg above it) might need all the rows from a lower-level
> WindowAgg.

Thanks for the explanation!  I think now I understand pass-through modes
much better.

> What I mean by "WindowAggs above us cannot reference the result of
> another WindowAgg" is that the evaluation of sum(qty) over (order by
> date) can't see the "rn" column. SQL does not allow it.

I think I get your point.  Yeah, the 'rn' column is not needed for the
evaluation of WindowAggs above.  But ISTM it might be needed to evaluate
the quals of WindowAggs above.  Such as in the plan below

explain (costs off) SELECT * FROM
   (SELECT
   count(salary) OVER (PARTITION BY depname || '') c1, -- w1
   row_number() OVER (PARTITION BY depname) rn -- w2
FROM empsalary
) e WHERE rn <= 1;
QUERY PLAN
---
 Subquery Scan on e
   ->  WindowAgg
 Filter: ((row_number() OVER (?)) <= 1)
 ->  Sort
   Sort Key: (((empsalary.depname)::text || ''::text))
   ->  WindowAgg
 Run Condition: (row_number() OVER (?) <= 1)
 ->  Sort
   Sort Key: empsalary.depname
   ->  Seq Scan on empsalary
(10 rows)

The 'rn' column is calculated in the lower-level WindowAgg, and it is
needed to evaluate the 'Filter' of the upper-level WindowAgg.  In
pass-through mode, the lower-level WindowAgg would not be evaluated any
more, so we need to mark the 'rn' column to something that can false the
'Filter'.  Considering the 'Filter' is a strict function, marking it as
NULL would do.  I think this is why this patch works.

> Just thinking of the patch a bit more, what I wrote ends up
> continually zeroing the values and marking the columns as NULL. Likely
> we can just do this once when we do: winstate->status =
> WINDOWAGG_PASSTHROUGH;

Yes, I also think we can do this only once when we go into pass-through
mode.

Thanks
Richard

Re: Bug in row_number() optimization

2022-11-24 Thread David Rowley

On Fri, 25 Nov 2022 at 16:00, Richard Guo  wrote:
> Verified the problem is fixed with this patch.  I'm not familiar with
> the WindowAgg execution codes.  As far as I understand, this patch works
> because we set ecxt_aggnulls to true, making it a NULL value.  And the
> top-level WindowAgg node's "Filter" is strict so that it can filter out
> all the tuples that don't match the intermediate WindowAgg node's run
> condition.  So I find the comments about "WindowAggs above us cannot
> reference the result of another WindowAgg" confusing.  But maybe I'm
> missing something.

There are two different pass-through modes that the WindowAgg can move
into when it detects that the run condition is no longer true:

1) WINDOWAGG_PASSTHROUGH and
2) WINDOWAGG_PASSTHROUGH_STRICT

#2 is used when the WindowAgg is the top-level one in this query
level. Remember we'll need multiple WindowAgg nodes when there are
multiple different windows to evaluate.  The reason that we need #1 is
that if there are multiple WindowAggs, then the top-level one (or just
any WindowAgg above it) might need all the rows from a lower-level
WindowAgg.  For example:

SELECT * FROM (SELECT row_number() over(order by id) rn, sum(qty) over
(order by date) qty from t) t where rn <= 10;

if the "order by id" window is evaluated first, we can't stop
outputting rows when rn <= 10 is no longer true as the "order by date"
window might need those.  In this case, once rn <= 10 is no longer
true, the WindowAgg for that window would go into
WINDOWAGG_PASSTHROUGH. This means we can stop window func evaluation
on any additional rows.  The final query will never see rn==11, so we
don't need to generate that.

The problem is that once the "order by id" window stops evaluating the
window funcs, if the window result is byref, then we leave junk in the
aggregate output columns.  Since we continue to output rows from that
WindowAgg for the top-level "order by date" window, we don't want to
form tuples with free'd memory.

Since nothing in the query will ever seen rn==11 and beyond, there's
no need to put anything in that part of the output tuple. We can just
make it an SQL NULL.

What I mean by "WindowAggs above us cannot reference the result of
another WindowAgg" is that the evaluation of sum(qty) over (order by
date) can't see the "rn" column. SQL does not allow it. If it did,
that would have to look something like:

SELECT * FROM (SELECT SUM(row_number() over (order by id)) over (order
by date) qty from t); -- not valid SQL

WINDOWAGG_PASSTHROUGH_STRICT not only does not evaluate window funcs,
it also does not even bother to store tuples in the tuple store. In
this case there's no higher-level WindowAgg that will need these
tuples, so we can just read our sub-node until we find the next
partition, or stop when there's no PARTITION BY clause.

Just thinking of the patch a bit more, what I wrote ends up
continually zeroing the values and marking the columns as NULL. Likely
we can just do this once when we do: winstate->status =
WINDOWAGG_PASSTHROUGH;  I'll test that out and make sure it works.

David

Re: Bug in row_number() optimization

2022-11-24 Thread Richard Guo

On Fri, Nov 25, 2022 at 7:34 AM David Rowley  wrote:

> Since upper-level WindowAggs cannot reference values calculated in
> some lower-level WindowAgg, why can't we just NULLify the pointers
> instead? See attached.

Verified the problem is fixed with this patch.  I'm not familiar with
the WindowAgg execution codes.  As far as I understand, this patch works
because we set ecxt_aggnulls to true, making it a NULL value.  And the
top-level WindowAgg node's "Filter" is strict so that it can filter out
all the tuples that don't match the intermediate WindowAgg node's run
condition.  So I find the comments about "WindowAggs above us cannot
reference the result of another WindowAgg" confusing.  But maybe I'm
missing something.

Thanks
Richard

Re: Bug in row_number() optimization

2022-11-24 Thread David Rowley

On Fri, 25 Nov 2022 at 00:52, Sergey Shinderuk
 wrote:
> Shouldn't we handle any pass-by-reference type the same? I suppose, a
> user-defined window function can return some other type, not int8.

Thanks for reporting this and to you and Richard for working on a fix.

I've just looked at it and it seems that valgrind is complaining
because a tuple formed by an upper-level WindowAgg contains a pointer
to free'd memory due to the byref type and eval_windowaggregates() not
having been executed to fill in ecxt_aggvalues and ecxt_aggnulls on
the lower-level WindowAgg.

Since upper-level WindowAggs cannot reference values calculated in
some lower-level WindowAgg, why can't we just NULLify the pointers
instead? See attached.

It is possible to have a monotonic window function that does not
return int8.  Technically something like MAX(text_col) OVER (PARTITION
BY somecol ORDER BY text_col) is monotonically increasing, it's just
that I didn't add a support function to tell the planner about that.
Someone could come along in the future and suggest we do that and show
us some convincing use case.  So whatever the fix, it cannot assume
the window function's return type is int8.

David
diff --git a/src/backend/executor/nodeWindowAgg.c 
b/src/backend/executor/nodeWindowAgg.c
index 81ba024bba..083a3b2386 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -2234,6 +2234,21 @@ ExecWindowAgg(PlanState *pstate)
if (winstate->numaggs > 0)
eval_windowaggregates(winstate);
}
+   else if (!winstate->top_window)
+   {
+   /*
+* Otherwise, if we're not the top-window, we'd better 
fill the
+* aggregate values in with something, else they might 
be pointing
+* to freed memory.  These don't need to be valid since 
WindowAggs
+* above us cannot reference the result of another 
WindowAgg.
+*/
+   numfuncs = winstate->numfuncs;
+   for (i = 0; i < numfuncs; i++)
+   {
+   econtext->ecxt_aggvalues[i] = (Datum) 0;
+   econtext->ecxt_aggnulls[i] = true;
+   }
+   }

/*
 * If we have created auxiliary read pointers for the frame or 
group

Re: Bug in row_number() optimization

2022-11-24 Thread Sergey Shinderuk


On 24.11.2022 06:16, Richard Guo wrote:

Regarding how to fix this problem, firstly I believe we need to evaluate
window functions in the per-tuple memory context, as the HEAD does.
When we decide we need to go into pass-through mode, I'm thinking that
we can just copy out the results of the last evaluation to the per-query
memory context, while still storing their pointers in ecxt_aggvalues.

Does this idea work?

Although I'm not familiar with the code, this makes sense to me.

You proposed:

+#ifdef USE_FLOAT8_BYVAL
+   evalWfuncContext = 
winstate->ss.ps.ps_ExprContext->ecxt_per_tuple_memory;

+#else
+   evalWfuncContext = 
winstate->ss.ps.ps_ExprContext->ecxt_per_query_memory;

+#endif

Shouldn't we handle any pass-by-reference type the same? I suppose, a 
user-defined window function can return some other type, not int8.


Best regards,

--
Sergey Shinderukhttps://postgrespro.com/

Re: Bug in row_number() optimization

2022-11-23 Thread Richard Guo

On Tue, Nov 22, 2022 at 3:44 PM Richard Guo  wrote:

> On Wed, Nov 16, 2022 at 7:38 AM Sergey Shinderuk <
> s.shinde...@postgrespro.ru> wrote:
>
>> The failing query is:
>> SELECT * FROM
>>(SELECT *,
>>count(salary) OVER (PARTITION BY depname || '') c1, -- w1
>>row_number() OVER (PARTITION BY depname) rn, -- w2
>>count(*) OVER (PARTITION BY depname) c2, -- w2
>>count(*) OVER (PARTITION BY '' || depname) c3 -- w3
>> FROM empsalary
>> ) e WHERE rn <= 1 AND c1 <= 3;
>> As far as I understand, ExecWindowAgg for the intermediate WindowAgg
>> node switches into pass-through mode, stops evaluating row_number(), and
>> returns the previous value instead. But if int8 is passed by reference,
>> the previous value stored in econtext->ecxt_aggvalues becomes a dangling
>> pointer when the per-output-tuple memory context is reset.
>
>
> Yeah, you're right.  In this example the window function row_number()
> goes into pass-through mode after the second evaluation because its
> run condition does not hold true any more.  The remaining run would just
> return the result from the second evaluation, which is stored in
> econtext->ecxt_aggvalues[wfuncno].
>
> If int8 is configured as pass-by-ref, the precomputed value from the
> second evaluation is actually located in a memory area from context
> ecxt_per_tuple_memory, with its pointer stored in ecxt_aggvalues.  As
> this memory context is reset once per tuple, we would be prone to wrong
> results.
>

Regarding how to fix this problem, firstly I believe we need to evaluate
window functions in the per-tuple memory context, as the HEAD does.
When we decide we need to go into pass-through mode, I'm thinking that
we can just copy out the results of the last evaluation to the per-query
memory context, while still storing their pointers in ecxt_aggvalues.

Does this idea work?

Thanks
Richard

Re: Bug in row_number() optimization

2022-11-21 Thread Richard Guo

On Wed, Nov 16, 2022 at 7:38 AM Sergey Shinderuk 
wrote:

> The failing query is:
> SELECT * FROM
>(SELECT *,
>count(salary) OVER (PARTITION BY depname || '') c1, -- w1
>row_number() OVER (PARTITION BY depname) rn, -- w2
>count(*) OVER (PARTITION BY depname) c2, -- w2
>count(*) OVER (PARTITION BY '' || depname) c3 -- w3
> FROM empsalary
> ) e WHERE rn <= 1 AND c1 <= 3;
> As far as I understand, ExecWindowAgg for the intermediate WindowAgg
> node switches into pass-through mode, stops evaluating row_number(), and
> returns the previous value instead. But if int8 is passed by reference,
> the previous value stored in econtext->ecxt_aggvalues becomes a dangling
> pointer when the per-output-tuple memory context is reset.

Yeah, you're right.  In this example the window function row_number()
goes into pass-through mode after the second evaluation because its
run condition does not hold true any more.  The remaining run would just
return the result from the second evaluation, which is stored in
econtext->ecxt_aggvalues[wfuncno].

If int8 is configured as pass-by-ref, the precomputed value from the
second evaluation is actually located in a memory area from context
ecxt_per_tuple_memory, with its pointer stored in ecxt_aggvalues.  As
this memory context is reset once per tuple, we would be prone to wrong
results.

I tried with memory context ecxt_per_query_memory when evaluating
window function in the case where int8 is configured as pass-by-ref and
I can see the problem vanishes.  I'm using the changes as below

--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1027,8 +1027,14 @@ eval_windowfunction(WindowAggState *winstate,
WindowStatePerFunc perfuncstate,
 {
LOCAL_FCINFO(fcinfo, FUNC_MAX_ARGS);
MemoryContext oldContext;
-
-   oldContext =
MemoryContextSwitchTo(winstate->ss.ps.ps_ExprContext->ecxt_per_tuple_memory);
+   MemoryContext evalWfuncContext;
+
+#ifdef USE_FLOAT8_BYVAL
+   evalWfuncContext =
winstate->ss.ps.ps_ExprContext->ecxt_per_tuple_memory;
+#else
+   evalWfuncContext =
winstate->ss.ps.ps_ExprContext->ecxt_per_query_memory;
+#endif
+   oldContext = MemoryContextSwitchTo(evalWfuncContext);

Thanks
Richard

Re: Bug in row_number() optimization

Re: Bug in row_number() optimization

Re: Bug in row_number() optimization

Re: Bug in row_number() optimization

Re: Bug in row_number() optimization

Re: Bug in row_number() optimization

Re: Bug in row_number() optimization

Re: Bug in row_number() optimization

Re: Bug in row_number() optimization

Re: Bug in row_number() optimization

Re: Bug in row_number() optimization

Re: Bug in row_number() optimization

Re: Bug in row_number() optimization

Re: Bug in row_number() optimization

Re: Bug in row_number() optimization

Re: Bug in row_number() optimization

16 matches

Site Navigation

Mail list logo

Footer information