remove unnecessary assignment to tmask in DecodeDateTime

2022-11-03 Thread Zhihong Yu
Hi,
I was looking at the code in DecodeDateTime() around line 1382:

   tmask = DTK_M(type);

When type is UNKNOWN_FIELD (31), the macro shifts 1 left by 31 bits, which
cannot be represented in type 'int' (i.e. undefined behaviour for a signed
32-bit integer).
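
For illustration, a minimal standalone sketch of the problem (assuming the
usual datetime.h definitions, DTK_M(t) = (0x01 << (t)) and UNKNOWN_FIELD = 31;
this is not the actual DecodeDateTime() code):

    #include <stdio.h>

    #define DTK_M(t)        (0x01 << (t))   /* 0x01 is a signed int */
    #define UNKNOWN_FIELD   31

    int
    main(void)
    {
        /* 1 << 31 does not fit in a 32-bit signed int: undefined behaviour */
        int         tmask = DTK_M(UNKNOWN_FIELD);
        /* an unsigned shift would be well-defined, if it were ever needed */
        unsigned int umask = 1U << UNKNOWN_FIELD;

        printf("%d %u\n", tmask, umask);
        return 0;
    }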

Looking further down in the same function, we can see that tmask is assigned
in every legitimate case, so the initial assignment is unnecessary.

If my understanding is correct, please take a look at the proposed patch.

Thanks


tmask-decode-date-time.patch
Description: Binary data


Re: restoring user id and SecContext before logging error in ri_PlanCheck

2022-11-02 Thread Zhihong Yu
Looking further down in ri_PerformCheck(), I see there may be a case where an
error from SPI_execute_snapshot() would skip restoring the user id and
SecContext.

Please take a look at patch v2, which tries to handle that case as well.

Thanks


v2-restore-uid-trigger.patch
Description: Binary data


restoring user id and SecContext before logging error in ri_PlanCheck

2022-11-02 Thread Zhihong Yu
Hi,
I was looking at the code in ri_PlanCheck
of src/backend/utils/adt/ri_triggers.c starting at line 2289.

When qplan is NULL, we raise an error, which skips the call to
SetUserIdAndSecContext() that restores the previous user id and SecContext.

I think the intention of the code is that the user id and SecContext be
restored regardless of the outcome of SPI_prepare().
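
A minimal sketch of the save/restore pattern I have in mind (not the actual
ri_triggers.c code; run_as_userid is a hypothetical placeholder for whatever
user the plan is prepared as):

    Oid         save_userid;
    int         save_sec_context;
    SPIPlanPtr  qplan;

    GetUserIdAndSecContext(&save_userid, &save_sec_context);
    SetUserIdAndSecContext(run_as_userid,
                           save_sec_context | SECURITY_LOCAL_USERID_CHANGE);

    qplan = SPI_prepare(querystr, nargs, argtypes);

    /* restore first, so the error path does not leave the switched context */
    SetUserIdAndSecContext(save_userid, save_sec_context);

    if (qplan == NULL)
        elog(ERROR, "SPI_prepare returned %s for %s",
             SPI_result_code_string(SPI_result), querystr);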

If my understanding is correct, please take a look at the patch.

Thanks


Re: heavily contended lwlocks with long wait queues scale badly

2022-10-31 Thread Zhihong Yu
On Mon, Oct 31, 2022 at 5:19 PM Andres Freund  wrote:

> Hi,
>
> On 2022-10-31 17:17:03 -0700, Zhihong Yu wrote:
> > On Mon, Oct 31, 2022 at 4:51 PM Andres Freund 
> wrote:
> >
> > > Hi,
> > >
> > > On 2022-10-31 16:21:06 +0530, Bharath Rupireddy wrote:
> > > > BTW, I've seen a sporadic crash (SEGV) with the patch in bg writer
> > > > with the same set up [1], I'm not sure if it's really because of the
> > > > patch. I'm unable to reproduce it now and unfortunately I didn't
> > > > capture further details when it occurred.
> > >
> > > That's likely because the prototype patch I submitted in this thread
> missed
> > > updating LWLockUpdateVar().
> > >
> > > Updated patch attached.
> > >
> > > Greetings,
> > >
> > > Andres Freund
> > >
> >
> > Hi,
> > Minor comment:
> >
> > +   uint8   lwWaiting;  /* see LWLockWaitState */
> >
> > Why not declare `lwWaiting` of type LWLockWaitState ?
>
> Unfortunately C99 (*) doesn't allow to specify the width of an enum
> field. With most compilers we'd end up using 4 bytes.
>
> Greetings,
>
> Andres Freund
>
> (*) C++ has allowed specifying this for quite a few years now and I think
> C23
> will support it too, but that doesn't help us at this point.
>

Hi,
Thanks for the response.

If possible, it would be better to put your explanation in the code comment
(so that other people know the reasoning).
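
For example, something along these lines (the wording is only a suggestion):

    uint8       lwWaiting;      /* see LWLockWaitState; kept as uint8 rather
                                 * than the enum because C99 cannot specify
                                 * the width of an enum field, so most
                                 * compilers would use 4 bytes for it */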


Re: heavily contended lwlocks with long wait queues scale badly

2022-10-31 Thread Zhihong Yu
On Mon, Oct 31, 2022 at 4:51 PM Andres Freund  wrote:

> Hi,
>
> On 2022-10-31 16:21:06 +0530, Bharath Rupireddy wrote:
> > BTW, I've seen a sporadic crash (SEGV) with the patch in bg writer
> > with the same set up [1], I'm not sure if it's really because of the
> > patch. I'm unable to reproduce it now and unfortunately I didn't
> > capture further details when it occurred.
>
> That's likely because the prototype patch I submitted in this thread missed
> updating LWLockUpdateVar().
>
> Updated patch attached.
>
> Greetings,
>
> Andres Freund
>

Hi,
Minor comment:

+   uint8   lwWaiting;  /* see LWLockWaitState */

Why not declare `lwWaiting` of type LWLockWaitState ?

Cheers


Reusing return value from planner_rt_fetch

2022-10-31 Thread Zhihong Yu
Hi,
I was reading examine_variable in src/backend/utils/adt/selfuncs.c

It seems we already have the rte available after the loop that starts on
line 5181, so there is no need to fetch it again.

Here is a patch which reuses the return value from `planner_rt_fetch`.

Please take a look.

Thanks


parent-rel-rte.patch
Description: Binary data


Re: fixing typo in comment for restriction_is_or_clause

2022-10-24 Thread Zhihong Yu
On Mon, Oct 24, 2022 at 7:58 PM Japin Li  wrote:

>
> On Tue, 25 Oct 2022 at 10:48, Richard Guo  wrote:
> > On Tue, Oct 25, 2022 at 10:05 AM John Naylor <
> john.nay...@enterprisedb.com>
> > wrote:
> >
> >>
> >> On Tue, Oct 25, 2022 at 12:19 AM Zhihong Yu  wrote:
> >> >
> >> > Hi,
> >> > When I was looking at src/backend/optimizer/util/restrictinfo.c, I
> found
> >> a typo in one of the comments.
> >>
> >> Using "t" as an abbreviation for "true" was probably intentional, so
> not a
> >> typo. There is no doubt what the behavior is.
> >>
> >> > I also took the chance to simplify the code a little bit.
> >>
> >> It's perfectly clear and simple now, even if it doesn't win at "code
> golf".
> >>
> >
> > Agree with your point.  Do you think we can further make the one-line
> > function a macro or an inline function in the .h file?  I think this
> > function is called quite frequently during planning, so maybe doing that
> > would bring a little bit of efficiency.
> >
>
> +1, same goes for restriction_is_securely_promotable.
>
Hi,
Thanks for the comments.

Please take a look at patch v2.
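
For reference, the inlined version being discussed could look roughly like
this in restrictinfo.h (just a sketch; it may not match v2 exactly):

    static inline bool
    restriction_is_or_clause(RestrictInfo *restrictinfo)
    {
        return restrictinfo->orclause != NULL;
    }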


v2-is-or.patch
Description: Binary data


fixing typo in comment for restriction_is_or_clause

2022-10-24 Thread Zhihong Yu
Hi,
When I was looking at src/backend/optimizer/util/restrictinfo.c, I found a
typo in one of the comments.

I also took the chance to simplify the code a little bit.

Please take a look at the patch.

Thanks


is-or.patch
Description: Binary data


Re: Missing update of all_hasnulls in BRIN opclasses

2022-10-22 Thread Zhihong Yu
Hi, Tomas:
For 0002-fixup-brin-has_nulls-20221022.patch :

+   first_row = (bval->bv_hasnulls && bval->bv_allnulls);
+
+   if (bval->bv_hasnulls && bval->bv_allnulls)

It seems the if condition can be changed to `if (first_row)`, which is more
readable.
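
i.e. something like (based on the quoted lines; the branch body itself is
unchanged):

    first_row = (bval->bv_hasnulls && bval->bv_allnulls);

    if (first_row)
    {
        /* existing branch body */
    }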

Cheers


Re: PATCH: Using BRIN indexes for sorted output

2022-10-16 Thread Zhihong Yu
On Sun, Oct 16, 2022 at 7:33 AM Tomas Vondra 
wrote:

>
>
> On 10/16/22 16:01, Zhihong Yu wrote:
> >
> >
> > On Sun, Oct 16, 2022 at 6:51 AM Tomas Vondra
> > mailto:tomas.von...@enterprisedb.com>>
> > wrote:
> >
> > On 10/15/22 14:33, Tomas Vondra wrote:
> > > Hi,
> > >
> > > ...
> > >
> > > There's a bunch of issues with this initial version of the patch,
> > > usually described in XXX comments in the relevant places.6)
> > >
> > > ...
> >
> > I forgot to mention one important issue in my list yesterday, and
> that's
> > memory consumption. The way the patch is coded now, the new BRIN
> support
> > function (brin_minmax_ranges) produces information about *all*
> ranges in
> > one go, which may be an issue. The worst case is 32TB table, with
> 1-page
> > BRIN ranges, which means ~4 billion ranges. The info is an array of
> ~32B
> > structs, so this would require ~128GB of RAM. With the default
> 128-page
> > ranges, it's still be ~1GB, which is quite a lot.
> >
> > We could have a discussion about what's the reasonable size of BRIN
> > ranges on such large tables (e.g. building a bitmap on 4 billion
> ranges
> > is going to be "not cheap" so this is likely pretty rare). But we
> should
> > not introduce new nodes that ignore work_mem, so we need a way to
> deal
> > with such cases somehow.
> >
> > The easiest solution likely is to check this while planning - we can
> > check the table size, calculate the number of BRIN ranges, and check
> > that the range info fits into work_mem, and just not create the path
> > when it gets too large. That's what we did for HashAgg, although that
> > decision was unreliable because estimating GROUP BY cardinality is
> hard.
> >
> > The wrinkle here is that counting just the range info (BrinRange
> struct)
> > does not include the values for by-reference types. We could use
> average
> > width - that's just an estimate, though.
> >
> > A more comprehensive solution seems to be to allow requesting chunks
> of
> > the BRIN ranges. So that we'd get "slices" of ranges and we'd process
> > those. So for example if you have 1000 ranges, and you can only
> handle
> > 100 at a time, we'd do 10 loops, each requesting 100 ranges.
> >
> > This has another problem - we do care about "overlaps", and we can't
> > really know if the overlapping ranges will be in the same "slice"
> > easily. The chunks would be sorted (for example) by maxval. But there
> > can be a range with much higher maxval (thus in some future slice),
> but
> > very low minval (thus intersecting with ranges in the current slice).
> >
> > Imagine ranges with these minval/maxval values, sorted by maxval:
> >
> > [101,200]
> > [201,300]
> > [301,400]
> > [150,500]
> >
> > and let's say we can only process 2-range slices. So we'll get the
> first
> > two, but both of them intersect with the very last range.
> >
> > We could always include all the intersecting ranges into the slice,
> but
> > what if there are too many very "wide" ranges?
> >
> > So I think this will need to switch to an iterative communication
> with
> > the BRIN index - instead of asking "give me info about all the
> ranges",
> > we'll need a way to
> >
> >   - request the next range (sorted by maxval)
> >   - request the intersecting ranges one by one (sorted by minval)
> >
> > Of course, the BRIN side will have some of the same challenges with
> > tracking the info without breaking the work_mem limit, but I suppose
> it
> > can store the info into a tuplestore/tuplesort, and use that instead
> of
> > plain in-memory array. Alternatively, it could just return those, and
> > BrinSort would use that. OTOH it seems cleaner to have some sort of
> API,
> > especially if we want to support e.g. minmax-multi opclasses, that
> have
> > a more complicated concept of "intersection".
> >
> >
> > regards
> >
> > --
> > Tomas Vondra
> > EnterpriseDB: http://www.enterprisedb.com <
> http://www.enterprisedb.com>
> > The Enterprise PostgreSQL Company
> >
> > Hi,
> > In your example involving [150,500], ca

Re: PATCH: Using BRIN indexes for sorted output

2022-10-16 Thread Zhihong Yu
On Sun, Oct 16, 2022 at 6:51 AM Tomas Vondra 
wrote:

> On 10/15/22 14:33, Tomas Vondra wrote:
> > Hi,
> >
> > ...
> >
> > There's a bunch of issues with this initial version of the patch,
> > usually described in XXX comments in the relevant places.6)
> >
> > ...
>
> I forgot to mention one important issue in my list yesterday, and that's
> memory consumption. The way the patch is coded now, the new BRIN support
> function (brin_minmax_ranges) produces information about *all* ranges in
> one go, which may be an issue. The worst case is 32TB table, with 1-page
> BRIN ranges, which means ~4 billion ranges. The info is an array of ~32B
> structs, so this would require ~128GB of RAM. With the default 128-page
> ranges, it's still be ~1GB, which is quite a lot.
>
> We could have a discussion about what's the reasonable size of BRIN
> ranges on such large tables (e.g. building a bitmap on 4 billion ranges
> is going to be "not cheap" so this is likely pretty rare). But we should
> not introduce new nodes that ignore work_mem, so we need a way to deal
> with such cases somehow.
>
> The easiest solution likely is to check this while planning - we can
> check the table size, calculate the number of BRIN ranges, and check
> that the range info fits into work_mem, and just not create the path
> when it gets too large. That's what we did for HashAgg, although that
> decision was unreliable because estimating GROUP BY cardinality is hard.
>
> The wrinkle here is that counting just the range info (BrinRange struct)
> does not include the values for by-reference types. We could use average
> width - that's just an estimate, though.
>
> A more comprehensive solution seems to be to allow requesting chunks of
> the BRIN ranges. So that we'd get "slices" of ranges and we'd process
> those. So for example if you have 1000 ranges, and you can only handle
> 100 at a time, we'd do 10 loops, each requesting 100 ranges.
>
> This has another problem - we do care about "overlaps", and we can't
> really know if the overlapping ranges will be in the same "slice"
> easily. The chunks would be sorted (for example) by maxval. But there
> can be a range with much higher maxval (thus in some future slice), but
> very low minval (thus intersecting with ranges in the current slice).
>
> Imagine ranges with these minval/maxval values, sorted by maxval:
>
> [101,200]
> [201,300]
> [301,400]
> [150,500]
>
> and let's say we can only process 2-range slices. So we'll get the first
> two, but both of them intersect with the very last range.
>
> We could always include all the intersecting ranges into the slice, but
> what if there are too many very "wide" ranges?
>
> So I think this will need to switch to an iterative communication with
> the BRIN index - instead of asking "give me info about all the ranges",
> we'll need a way to
>
>   - request the next range (sorted by maxval)
>   - request the intersecting ranges one by one (sorted by minval)
>
> Of course, the BRIN side will have some of the same challenges with
> tracking the info without breaking the work_mem limit, but I suppose it
> can store the info into a tuplestore/tuplesort, and use that instead of
> plain in-memory array. Alternatively, it could just return those, and
> BrinSort would use that. OTOH it seems cleaner to have some sort of API,
> especially if we want to support e.g. minmax-multi opclasses, that have
> a more complicated concept of "intersection".
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
Hi,
In your example involving [150,500], can this range be broken down into 4
ranges, ending in 200, 300, 400 and 500, respectively?
That way, there would be no intersection among the ranges.

bq. can store the info into a tuplestore/tuplesort

Wouldn't this involve disk accesses, which may reduce the effectiveness of
the BRIN sort?

Cheers


Re: PATCH: Using BRIN indexes for sorted output

2022-10-15 Thread Zhihong Yu
On Sat, Oct 15, 2022 at 8:23 AM Tomas Vondra 
wrote:

> On 10/15/22 15:46, Zhihong Yu wrote:
> >...
> > 8) Parallel version is not supported, but I think it shouldn't be
> > possible. Just make the leader build the range info, and then let the
> > workers to acquire/sort ranges and merge them by Gather Merge.
> > ...
> > Hi,
> > I am still going over the patch.
> >
> > Minor: for #8, I guess you meant `it should be possible` .
> >
>
> Yes, I meant to say it should be possible. Sorry for the confusion.
>
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
Hi,

For brin_minmax_ranges, looking at how gottuple is assigned and then read, it
seems the variable gottuple can be omitted - we can check tup directly.

+   /* Maybe mark the range as processed. */
+   range->processed |= mark_processed;

`Maybe` can be dropped.

For brinsort_load_tuples(), do we need to check for interrupts inside the
loop?
The same question applies to subsequent functions involving loops, such
as brinsort_load_unsummarized_ranges.
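
i.e. the usual pattern (just a sketch; have_more_ranges stands in for
whatever the existing loop condition is):

    while (have_more_ranges)
    {
        CHECK_FOR_INTERRUPTS();

        /* existing per-iteration work stays as-is */
    }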

Cheers


Re: PATCH: Using BRIN indexes for sorted output

2022-10-15 Thread Zhihong Yu
On Sat, Oct 15, 2022 at 5:34 AM Tomas Vondra 
wrote:

> Hi,
>
> There have been a couple discussions about using BRIN indexes for
> sorting - in fact this was mentioned even in the "Improving Indexing
> Performance" unconference session this year (don't remember by whom).
> But I haven't seen any patches, so here's one.
>
> The idea is that we can use information about ranges to split the table
> into smaller parts that can be sorted in smaller chunks. For example if
> you have a tiny 2MB table with two ranges, with values in [0,100] and
> [101,200] intervals, then it's clear we can sort the first range, output
> tuples, and then sort/output the second range.
>
> The attached patch builds "BRIN Sort" paths/plans, closely resembling
> index scans, only for BRIN indexes. And this special type of index scan
> does what was mentioned above - incrementally sorts the data. It's a bit
> more complicated because of overlapping ranges, ASC/DESC, NULL etc.
>
> This is disabled by default, using a GUC enable_brinsort (you may need
> to tweak other GUCs to disable parallel plans etc.).
>
> A trivial example, demonstrating the benefits:
>
>   create table t (a int) with (fillfactor = 10);
>   insert into t select i from generate_series(1,1000) s(i);
>
>
> First, a simple LIMIT query:
>
> explain (analyze, costs off) select * from t order by a limit 10;
>
>   QUERY PLAN
> 
>  Limit (actual time=1879.768..1879.770 rows=10 loops=1)
>->  Sort (actual time=1879.767..1879.768 rows=10 loops=1)
>  Sort Key: a
>  Sort Method: top-N heapsort  Memory: 25kB
>  ->  Seq Scan on t
>  (actual time=0.007..1353.110 rows=1000 loops=1)
>  Planning Time: 0.083 ms
>  Execution Time: 1879.786 ms
> (7 rows)
>
>   QUERY PLAN
> 
>  Limit (actual time=1.217..1.219 rows=10 loops=1)
>->  BRIN Sort using t_a_idx on t
>(actual time=1.216..1.217 rows=10 loops=1)
>  Sort Key: a
>  Planning Time: 0.084 ms
>  Execution Time: 1.234 ms
> (5 rows)
>
> That's a pretty nice improvement - of course, this is thanks to having a
> perfectly sequential, and the difference can be almost arbitrary by
> making the table smaller/larger. Similarly, if the table gets less
> sequential (making ranges to overlap), the BRIN plan will be more
> expensive. Feel free to experiment with other data sets.
>
> However, not only the LIMIT queries can improve - consider a sort of the
> whole table:
>
> test=# explain (analyze, costs off) select * from t order by a;
>
>QUERY PLAN
> -
>  Sort (actual time=2806.468..3487.213 rows=1000 loops=1)
>Sort Key: a
>Sort Method: external merge  Disk: 117528kB
>->  Seq Scan on t (actual time=0.018..1498.754 rows=1000 loops=1)
>  Planning Time: 0.110 ms
>  Execution Time: 3766.825 ms
> (6 rows)
>
> test=# explain (analyze, costs off) select * from t order by a;
> QUERY PLAN
>
>
> --
>  BRIN Sort using t_a_idx on t (actual time=1.210..2670.875 rows=1000
> loops=1)
>Sort Key: a
>  Planning Time: 0.073 ms
>  Execution Time: 2939.324 ms
> (4 rows)
>
> Right - not a huge difference, but still a nice 25% speedup, mostly due
> to not having to spill data to disk and sorting smaller amounts of data.
>
> There's a bunch of issues with this initial version of the patch,
> usually described in XXX comments in the relevant places.6)
>
> 1) The paths are created in build_index_paths() because that's what
> creates index scans (which the new path resembles). But that is expected
> to produce IndexPath, not BrinSortPath, so it's not quite correct.
> Should be somewhere "higher" I guess.
>
> 2) BRIN indexes don't have internal ordering, i.e. ASC/DESC and NULLS
> FIRST/LAST does not really matter for them. The patch just generates
> paths for all 4 combinations (or tries to). Maybe there's a better way.
>
> 3) I'm not quite sure the separation of responsibilities between
> opfamily and opclass is optimal. I added a new amproc, but maybe this
> should be split differently. At the moment only minmax indexes have
> this, but adding this to minmax-multi should be trivial.
>
> 4) The state changes in nodeBrinSort is a bit confusing. Works, but may
> need cleanup and refactoring. Ideas welcome.
>
> 5) The costing is essentially just plain cost_index. I have some ideas
> about BRIN costing in general, which I'll post in a separate thread (as
> it's not specific to this patch).
>
> 6) At the moment this only picks one of the index keys, specified in the
> ORDER BY clause. I think we can generalize this to multiple keys, but
> thinking about multi-key ranges was 

Re: Allow WindowFuncs prosupport function to use more optimal WindowClause options

2022-10-13 Thread Zhihong Yu
On Wed, Oct 12, 2022 at 5:35 PM David Rowley  wrote:

> On Wed, 12 Oct 2022 at 16:33, Vik Fearing  wrote:
> > Per spec, the ROW_NUMBER() window function is not even allowed to have a
> > frame specified.
> >
> >  b) The window framing clause of WDX shall not be present.
> >
> > Also, the specification for ROW_NUMBER() is:
> >
> >  f) ROW_NUMBER() OVER WNS is equivalent to the :
> >
> >  COUNT (*) OVER (WNS1 ROWS UNBOUNDED PRECEDING)
> >
> >
> > So I don't think we need to test for anything at all and can
> > indiscriminately add or replace the frame with ROWS UNBOUNDED PRECEDING.
>
> Thanks for digging that out.
>
> Just above that I see:
>
> RANK() OVER WNS is equivalent to:
> ( COUNT (*) OVER (WNS1 RANGE UNBOUNDED PRECEDING)
> - COUNT (*) OVER (WNS1 RANGE CURRENT ROW) + 1 )
>
> and
>
> DENSE_RANK() OVER WNS is equivalent to the :
> COUNT (DISTINCT ROW ( VE1, ..., VEN ) )
> OVER (WNS1 RANGE UNBOUNDED PRECEDING)
>
> So it looks like the same can be done for rank() and dense_rank() too.
> I've added support for those in the attached.
>
> This also got me thinking that maybe we should be a bit more generic
> with the support function node tag name. After looking at the
> nodeWindowAgg.c code for a while, I wondered if we might want to add
> some optimisations in the future that makes WindowAgg not bother
> storing tuples for row_number(), rank() and dense_rank().  That might
> save a bit of overhead from the tuple store.  I imagined that we'd
> want to allow the expansion of this support request so that the
> support function could let the planner know if any tuples will be
> accessed by the window function or not.  The
> SupportRequestWFuncOptimizeFrameOpts name didn't seem very fitting for
> that so I adjusted it to become SupportRequestOptimizeWindowClause
> instead.
>
> The updated patch is attached.
>
> David
>
Hi,

+   req->frameOptions = (FRAMEOPTION_ROWS |
+FRAMEOPTION_START_UNBOUNDED_PRECEDING |
+FRAMEOPTION_END_CURRENT_ROW);

This bit combination appears multiple times in the patch.
Maybe define the combination as a constant in supportnodes.h and reference
that constant in the code.
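
For example (the macro name here is only a placeholder):

    /* frame applied when optimizing row_number()/rank()/dense_rank() */
    #define FRAMEOPTION_ROWS_UNBOUNDED_TO_CURRENT \
        (FRAMEOPTION_ROWS | \
         FRAMEOPTION_START_UNBOUNDED_PRECEDING | \
         FRAMEOPTION_END_CURRENT_ROW)

and then the support function would just do:

    req->frameOptions = FRAMEOPTION_ROWS_UNBOUNDED_TO_CURRENT;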

Cheers


Re: remove redundant memset() call

2022-10-13 Thread Zhihong Yu
On Thu, Oct 13, 2022 at 12:10 PM Bruce Momjian  wrote:

> On Thu, Oct 13, 2022 at 10:55:08AM -0700, Zhihong Yu wrote:
> > Hi,
> > I was looking at combo_init in contrib/pgcrypto/px.c .
> >
> > There is a memset() call following palloc0() - the call is redundant.
> >
> > Please see the patch for the proposed change.
> >
> > Thanks
>
> > diff --git a/contrib/pgcrypto/px.c b/contrib/pgcrypto/px.c
> > index 3b098c6151..d35ccca777 100644
> > --- a/contrib/pgcrypto/px.c
> > +++ b/contrib/pgcrypto/px.c
> > @@ -203,7 +203,6 @@ combo_init(PX_Combo *cx, const uint8 *key, unsigned
> klen,
> >   if (klen > ks)
> >   klen = ks;
> >   keybuf = palloc0(ks);
> > - memset(keybuf, 0, ks);
> >   memcpy(keybuf, key, klen);
> >
> >   err = px_cipher_init(c, keybuf, klen, ivbuf);
>
> Uh, the memset() is ks length but the memcpy() is klen, and the above
> test allows ks to be larger than klen.
>
> --
>   Bruce Momjian  https://momjian.us
>   EDB  https://enterprisedb.com
>
>   Indecision is a decision.  Inaction is an action.  Mark Batterson
>
Hi,
The memory has already been zeroed out by palloc0(), so the memset() is
redundant.

The memcpy() length is not relevant to resetting the memory - the ks - klen
bytes beyond the copied key simply remain zero from palloc0().
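
To illustrate, with the patch the relevant lines would just be:

    keybuf = palloc0(ks);           /* ks zero-filled bytes */
    memcpy(keybuf, key, klen);      /* klen <= ks; the remaining bytes
                                     * stay zero */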

Cheers


remove redundant memset() call

2022-10-13 Thread Zhihong Yu
Hi,
I was looking at combo_init in contrib/pgcrypto/px.c .

There is a memset() call following palloc0() - the call is redundant.

Please see the patch for the proposed change.

Thanks


remove-redundant-memset-call.patch
Description: Binary data


Re: Bloom filter Pushdown Optimization for Merge Join

2022-10-13 Thread Zhihong Yu
On Thu, Oct 13, 2022 at 7:30 AM Zhihong Yu  wrote:

>
>
> On Wed, Oct 12, 2022 at 4:35 PM Zhihong Yu  wrote:
>
>>
>>
>> On Wed, Oct 12, 2022 at 3:36 PM Lyu Pan  wrote:
>>
>>> Hello Zhihong Yu & Tomas Vondra,
>>>
>>> Thank you so much for your review and feedback!
>>>
>>> We made some updates based on previous feedback and attached the new
>>> patch set. Due to time constraints, we didn't get to resolve all the
>>> comments, and we'll continue to improve this patch.
>>>
>>> > In this prototype, the cost model is based on an assumption that there
>>> is
>>> > a linear relationship between the performance gain from using a
>>> semijoin
>>> > filter and the estimated filtering rate:
>>> > % improvement to Merge Join cost = 0.83 * estimated filtering rate -
>>> 0.137.
>>> >
>>> > How were the coefficients (0.83 and 0.137) determined ?
>>> > I guess they were based on the results of running certain workload.
>>>
>>> Right, the coefficients (0.83 and 0.137) determined are based on some
>>> preliminary testings. The current costing model is pretty naive and
>>> we'll work on a more robust costing model in future work.
>>>
>>>
>>> > I agree, in principle, although I think the current logic / formula is
>>> a
>>> > bit too crude and fitted to the simple data used in the test. I think
>>> > this needs to be formulated as a regular costing issue, considering
>>> > stuff like cost of the hash functions, and so on.
>>> >
>>> > I think this needs to do two things:
>>> >
>>> > 1) estimate the cost of building the bloom filter - This shall depend
>>> on
>>> > the number of rows in the inner relation, number/cost of the hash
>>> > functions (which may be higher for some data types), etc.
>>> >
>>> > 2) estimate improvement for the probing branch - Essentially, we need
>>> to
>>> > estimate how much we save by filtering some of the rows, but this also
>>> > neeeds to include the cost of probing the bloom filter.
>>> >
>>> > This will probably require some improvements to the lib/bloomfilter, in
>>> > order to estimate the false positive rate - this may matter a lot for
>>> > large data sets and small work_mem values. The bloomfilter library
>>> > simply reduces the size of the bloom filter, which increases the false
>>> > positive rate. At some point it'll start reducing the benefit.
>>> >
>>>
>>> These suggestions make a lot of sense. The current costing model is
>>> definitely not good enough, and we plan to work on a more robust
>>> costing model as we continue to improve the patch.
>>>
>>>
>>> > OK. Could also build the bloom filter in shared memory?
>>> >
>>>
>>> We thought about this approach but didn't prefer this one because if
>>> all worker processes share the same bloom filter in shared memory, we
>>> need to frequently lock and unlock the bloom filter to avoid race
>>> conditions. So we decided to have each worker process create its own
>>> bloom filter.
>>>
>>>
>>> > IMHO we shouldn't make too many conclusions from these examples. Yes,
>>> it
>>> > shows merge join can be improved, but for cases where a hashjoin works
>>> > better so we wouldn't use merge join anyway.
>>> >
>>> > I think we should try constructing examples where either merge join
>>> wins
>>> > already (and gets further improved by the bloom filter), or would lose
>>> > to hash join and the bloom filter improves it enough to win.
>>> >
>>> > AFAICS that requires a join of two large tables - large enough that
>>> hash
>>> > join would need to be batched, or pre-sorted inputs (which eliminates
>>> > the explicit Sort, which is the main cost in most cases).
>>> >
>>> > The current patch only works with sequential scans, which eliminates
>>> the
>>> > second (pre-sorted) option. So let's try the first one - can we invent
>>> > an example with a join of two large tables where a merge join would
>>> win?
>>> >
>>> > Can we find such example in existing benchmarks like TPC-H/TPC-DS.
>>> >
>>>
>>> Agreed. The current examples are only intended to show us that using
>>> bloom

Re: Bloom filter Pushdown Optimization for Merge Join

2022-10-13 Thread Zhihong Yu
On Wed, Oct 12, 2022 at 4:35 PM Zhihong Yu  wrote:

>
>
> On Wed, Oct 12, 2022 at 3:36 PM Lyu Pan  wrote:
>
>> Hello Zhihong Yu & Tomas Vondra,
>>
>> Thank you so much for your review and feedback!
>>
>> We made some updates based on previous feedback and attached the new
>> patch set. Due to time constraints, we didn't get to resolve all the
>> comments, and we'll continue to improve this patch.
>>
>> > In this prototype, the cost model is based on an assumption that there
>> is
>> > a linear relationship between the performance gain from using a semijoin
>> > filter and the estimated filtering rate:
>> > % improvement to Merge Join cost = 0.83 * estimated filtering rate -
>> 0.137.
>> >
>> > How were the coefficients (0.83 and 0.137) determined ?
>> > I guess they were based on the results of running certain workload.
>>
>> Right, the coefficients (0.83 and 0.137) determined are based on some
>> preliminary testings. The current costing model is pretty naive and
>> we'll work on a more robust costing model in future work.
>>
>>
>> > I agree, in principle, although I think the current logic / formula is a
>> > bit too crude and fitted to the simple data used in the test. I think
>> > this needs to be formulated as a regular costing issue, considering
>> > stuff like cost of the hash functions, and so on.
>> >
>> > I think this needs to do two things:
>> >
>> > 1) estimate the cost of building the bloom filter - This shall depend on
>> > the number of rows in the inner relation, number/cost of the hash
>> > functions (which may be higher for some data types), etc.
>> >
>> > 2) estimate improvement for the probing branch - Essentially, we need to
>> > estimate how much we save by filtering some of the rows, but this also
>> > neeeds to include the cost of probing the bloom filter.
>> >
>> > This will probably require some improvements to the lib/bloomfilter, in
>> > order to estimate the false positive rate - this may matter a lot for
>> > large data sets and small work_mem values. The bloomfilter library
>> > simply reduces the size of the bloom filter, which increases the false
>> > positive rate. At some point it'll start reducing the benefit.
>> >
>>
>> These suggestions make a lot of sense. The current costing model is
>> definitely not good enough, and we plan to work on a more robust
>> costing model as we continue to improve the patch.
>>
>>
>> > OK. Could also build the bloom filter in shared memory?
>> >
>>
>> We thought about this approach but didn't prefer this one because if
>> all worker processes share the same bloom filter in shared memory, we
>> need to frequently lock and unlock the bloom filter to avoid race
>> conditions. So we decided to have each worker process create its own
>> bloom filter.
>>
>>
>> > IMHO we shouldn't make too many conclusions from these examples. Yes, it
>> > shows merge join can be improved, but for cases where a hashjoin works
>> > better so we wouldn't use merge join anyway.
>> >
>> > I think we should try constructing examples where either merge join wins
>> > already (and gets further improved by the bloom filter), or would lose
>> > to hash join and the bloom filter improves it enough to win.
>> >
>> > AFAICS that requires a join of two large tables - large enough that hash
>> > join would need to be batched, or pre-sorted inputs (which eliminates
>> > the explicit Sort, which is the main cost in most cases).
>> >
>> > The current patch only works with sequential scans, which eliminates the
>> > second (pre-sorted) option. So let's try the first one - can we invent
>> > an example with a join of two large tables where a merge join would win?
>> >
>> > Can we find such example in existing benchmarks like TPC-H/TPC-DS.
>> >
>>
>> Agreed. The current examples are only intended to show us that using
>> bloom filters in merge join could improve the merge join performance
>> in some cases. We are working on testing more examples that merge join
>> with bloom filter could out-perform hash join, which should be more
>> persuasive.
>>
>>
>> > The bloom filter is built by the first seqscan (on t0), and then used by
>> > the second seqscan (on t1). But this only works because we always run
>> > the t0 scan to completion (because we're feeding it into Sort) before we
>>

Re: Bloom filter Pushdown Optimization for Merge Join

2022-10-12 Thread Zhihong Yu
On Wed, Oct 12, 2022 at 3:36 PM Lyu Pan  wrote:

> Hello Zhihong Yu & Tomas Vondra,
>
> Thank you so much for your review and feedback!
>
> We made some updates based on previous feedback and attached the new
> patch set. Due to time constraints, we didn't get to resolve all the
> comments, and we'll continue to improve this patch.
>
> > In this prototype, the cost model is based on an assumption that there is
> > a linear relationship between the performance gain from using a semijoin
> > filter and the estimated filtering rate:
> > % improvement to Merge Join cost = 0.83 * estimated filtering rate -
> 0.137.
> >
> > How were the coefficients (0.83 and 0.137) determined ?
> > I guess they were based on the results of running certain workload.
>
> Right, the coefficients (0.83 and 0.137) determined are based on some
> preliminary testings. The current costing model is pretty naive and
> we'll work on a more robust costing model in future work.
>
>
> > I agree, in principle, although I think the current logic / formula is a
> > bit too crude and fitted to the simple data used in the test. I think
> > this needs to be formulated as a regular costing issue, considering
> > stuff like cost of the hash functions, and so on.
> >
> > I think this needs to do two things:
> >
> > 1) estimate the cost of building the bloom filter - This shall depend on
> > the number of rows in the inner relation, number/cost of the hash
> > functions (which may be higher for some data types), etc.
> >
> > 2) estimate improvement for the probing branch - Essentially, we need to
> > estimate how much we save by filtering some of the rows, but this also
> > neeeds to include the cost of probing the bloom filter.
> >
> > This will probably require some improvements to the lib/bloomfilter, in
> > order to estimate the false positive rate - this may matter a lot for
> > large data sets and small work_mem values. The bloomfilter library
> > simply reduces the size of the bloom filter, which increases the false
> > positive rate. At some point it'll start reducing the benefit.
> >
>
> These suggestions make a lot of sense. The current costing model is
> definitely not good enough, and we plan to work on a more robust
> costing model as we continue to improve the patch.
>
>
> > OK. Could also build the bloom filter in shared memory?
> >
>
> We thought about this approach but didn't prefer this one because if
> all worker processes share the same bloom filter in shared memory, we
> need to frequently lock and unlock the bloom filter to avoid race
> conditions. So we decided to have each worker process create its own
> bloom filter.
>
>
> > IMHO we shouldn't make too many conclusions from these examples. Yes, it
> > shows merge join can be improved, but for cases where a hashjoin works
> > better so we wouldn't use merge join anyway.
> >
> > I think we should try constructing examples where either merge join wins
> > already (and gets further improved by the bloom filter), or would lose
> > to hash join and the bloom filter improves it enough to win.
> >
> > AFAICS that requires a join of two large tables - large enough that hash
> > join would need to be batched, or pre-sorted inputs (which eliminates
> > the explicit Sort, which is the main cost in most cases).
> >
> > The current patch only works with sequential scans, which eliminates the
> > second (pre-sorted) option. So let's try the first one - can we invent
> > an example with a join of two large tables where a merge join would win?
> >
> > Can we find such example in existing benchmarks like TPC-H/TPC-DS.
> >
>
> Agreed. The current examples are only intended to show us that using
> bloom filters in merge join could improve the merge join performance
> in some cases. We are working on testing more examples that merge join
> with bloom filter could out-perform hash join, which should be more
> persuasive.
>
>
> > The bloom filter is built by the first seqscan (on t0), and then used by
> > the second seqscan (on t1). But this only works because we always run
> > the t0 scan to completion (because we're feeding it into Sort) before we
> > start scanning t1.
> >
> > But when the scan on t1 switches to an index scan, it's over - we'd be
> > building the filter without being able to probe it, and when we finish
> > building it we no longer need it. So this seems pretty futile.
> >
> > It might still improve plans like
> >
> >->  Merge Join
> >  Merge Cond: (t0.c1 = t1.c1)
> >  SemiJoi

Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2022-10-11 Thread Zhihong Yu
On Tue, Oct 11, 2022 at 9:58 AM Zhihong Yu  wrote:

>
>
> On Tue, Oct 11, 2022 at 9:22 AM Dmitry Koval 
> wrote:
>
>> Hi!
>>
>> Fixed couple warnings (for cfbot).
>>
>> --
>> With best regards,
>> Dmitry Koval
>>
>> Postgres Professional: http://postgrespro.com
>
> Hi,
> For v12-0001-PGPRO-ALTER-TABLE-MERGE-PARTITIONS-command.patch:
>
> +   if (equal(name, cmd->name))
> +   /* One new partition can have the same name as merged
> partition. */
> +   isSameName = true;
>
> I think there should be a check before assigning true to isSameName - if
> isSameName is true, that means there are two partitions with this same name.
>
> Cheers
>

Pardon - I see that transformPartitionCmdForMerge() compares the partition
names.
Maybe you can add a comment in ATExecMergePartitions referring to
transformPartitionCmdForMerge() so that people can more easily understand
the logic.


Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2022-10-11 Thread Zhihong Yu
On Tue, Oct 11, 2022 at 9:22 AM Dmitry Koval  wrote:

> Hi!
>
> Fixed couple warnings (for cfbot).
>
> --
> With best regards,
> Dmitry Koval
>
> Postgres Professional: http://postgrespro.com

Hi,
For v12-0001-PGPRO-ALTER-TABLE-MERGE-PARTITIONS-command.patch:

+   if (equal(name, cmd->name))
+   /* One new partition can have the same name as merged
partition. */
+   isSameName = true;

I think there should be a check before assigning true to isSameName - if
isSameName is already true, that would mean two partitions share the same
name.

Cheers


subtransaction performance

2022-10-07 Thread Zhihong Yu
Hi,
I stumbled over:

https://about.gitlab.com/blog/2021/09/29/why-we-spent-the-last-month-eliminating-postgresql-subtransactions/

I wonder if SAVEPOINT / subtransaction performance has been boosted since
the blog was written.

Cheers


Re: Add support for DEFAULT specification in COPY FROM

2022-10-07 Thread Zhihong Yu
On Fri, Oct 7, 2022 at 12:09 PM Israel Barth Rubio 
wrote:

> Hello Zhihong,
>
> > +   /* attribute is NOT to be copied from input */
> >
> > I think saying `is NOT copied from input` should suffice.
> >
> > +   /* fieldno is 0-index and attnum is 1-index */
> >
> > 0-index -> 0-indexed
>
> I have applied both suggestions, thanks! I'll submit a 4th version
> of the patch soon.
>
> > +   defaults = (bool *) palloc0(num_phys_attrs * sizeof(bool));
> > +   MemSet(defaults, false, num_phys_attrs * sizeof(bool));
> >
> > Is the MemSet() call necessary ?
>
> I would say it is, so it initializes the array with all flags set to false.
> Later, if it detects attributes that should evaluate their default
> expression,
> it would set the flag to true.
>
> Am I missing something?
>
> Regards,
> Israel.
>
Hi,
For the last question, please take a look at:

#define MemSetAligned(start, val, len) \

which is called by palloc0().
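
In other words, palloc0() already zero-fills the allocation, so (a sketch of
the quoted lines):

    defaults = (bool *) palloc0(num_phys_attrs * sizeof(bool));
    /* every flag is already false here; the explicit
     * MemSet(defaults, false, num_phys_attrs * sizeof(bool)) adds nothing */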


Re: [POC] Allow flattening of subquery with a link to upper query

2022-10-05 Thread Zhihong Yu
On Wed, Oct 5, 2022 at 4:38 AM Andrey Lepikhov 
wrote:

> On 5/10/2022 02:45, Zhihong Yu wrote:
> > Hi,
> > For contain_placeholders():
> >
> > +   if (IsA(node, Query))
> > +   return query_tree_walker((Query *) node, contain_placeholders,
> > context, 0);
> > +   else if (IsA(node, PlaceHolderVar))
> Fixed
> >
> > The `else` is not needed.
> >
> > For correlated_t struct, it would be better if the fields have comments.
> Ok, I've added some comments.
> >
> > +* (for grouping, as an example). So, revert its
> > status to
> > +* a full valued entry.
> >
> > full valued -> fully valued
> Fixed
>
> --
> regards,
> Andrey Lepikhov
> Postgres Professional
>
Hi,

+   List   *pulling_quals; /* List of expressions contained pulled
expressions */

contained -> containing

+   /* Does the var already exists in the target list? */

 exists -> exist

+   {"optimize_correlated_subqueries", PGC_USERSET, QUERY_TUNING_METHOD,

Is it possible that in the future there will be other optimizations for
correlated subqueries?
If so, either rename the GUC, or make the GUC a string that represents an
enum.
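
If it ever goes the enum route, a minimal sketch of the value table (the
option names here are purely hypothetical):

    static const struct config_enum_entry correlated_subquery_options[] = {
        {"off", 0, false},
        {"flatten", 1, false},
        {NULL, 0, false}
    };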

Cheers


Re: [POC] Allow flattening of subquery with a link to upper query

2022-10-04 Thread Zhihong Yu
Hi,
For contain_placeholders():

+   if (IsA(node, Query))
+   return query_tree_walker((Query *) node, contain_placeholders,
context, 0);
+   else if (IsA(node, PlaceHolderVar))

The `else` is not needed.
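
i.e. (a sketch based on the quoted lines; the second branch body is
unchanged):

    if (IsA(node, Query))
        return query_tree_walker((Query *) node, contain_placeholders,
                                 context, 0);

    if (IsA(node, PlaceHolderVar))
    {
        /* existing branch body */
    }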

For the correlated_t struct, it would be better if the fields had comments.

+* (for grouping, as an example). So, revert its status
to
+* a full valued entry.

full valued -> fully valued

Cheers


Re: [PATCH] Fix build with LLVM 15 or above

2022-10-03 Thread Zhihong Yu
On Mon, Oct 3, 2022 at 2:41 PM Andres Freund  wrote:

> Hi,
>
> On 2022-10-03 12:16:12 -0700, Andres Freund wrote:
> > I haven't yet run through the whole regression test with an assert
> enabled
> > llvm because an assert-enabled llvm is *SLOW*, but it got through the
> first
> > few parallel groups ok.  Using an optimized llvm 15, all tests pass with
> > PGOPTIONS=-cjit_above_cost=0.
>
> That did pass. But to be able to use clang >= 15 one more piece is
> needed. Updated patch attached.
>
> Greetings,
>
> Andres Freund
>
Hi,

+* When targetting an llvm version with opaque pointers enabled by

I think `targetting` should be spelled as targeting

Cheers


Re: Bloom filter Pushdown Optimization for Merge Join

2022-10-02 Thread Zhihong Yu
On Sun, Oct 2, 2022 at 6:40 AM Zhihong Yu  wrote:

>
>
> On Sat, Oct 1, 2022 at 12:45 AM Zhihong Yu  wrote:
>
>>
>>
>> On Fri, Sep 30, 2022 at 9:20 PM Zhihong Yu  wrote:
>>
>>>
>>>
>>> On Fri, Sep 30, 2022 at 8:40 PM Zhihong Yu  wrote:
>>>
>>>>
>>>>
>>>> On Fri, Sep 30, 2022 at 3:44 PM Zheng Li  wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> A bloom filter provides early filtering of rows that cannot be joined
>>>>> before they would reach the join operator, the optimization is also
>>>>> called a semi join filter (SJF) pushdown. Such a filter can be created
>>>>> when one child of the join operator must materialize its derived table
>>>>> before the other child is evaluated.
>>>>>
>>>>> For example, a bloom filter can be created using the the join keys for
>>>>> the build side/inner side of a hash join or the outer side of a merge
>>>>> join, the bloom filter can then be used to pre-filter rows on the
>>>>> other side of the join operator during the scan of the base relation.
>>>>> The thread about “Hash Joins vs. Bloom Filters / take 2” [1] is good
>>>>> discussion on using such optimization for hash join without going into
>>>>> the pushdown of the filter where its performance gain could be further
>>>>> increased.
>>>>>
>>>>> We worked on prototyping bloom filter pushdown for both hash join and
>>>>> merge join. Attached is a patch set for bloom filter pushdown for
>>>>> merge join. We also plan to send the patch for hash join once we have
>>>>> it rebased.
>>>>>
>>>>> Here is a summary of the patch set:
>>>>> 1. Bloom Filter Pushdown optimizes Merge Join by filtering rows early
>>>>> during the table scan instead of later on.
>>>>> -The bloom filter is pushed down along the execution tree to
>>>>> the target SeqScan nodes.
>>>>> -Experiments show that this optimization can speed up Merge
>>>>> Join by up to 36%.
>>>>>
>>>>> 2. The planner makes the decision to use the bloom filter based on the
>>>>> estimated filtering rate and the expected performance gain.
>>>>> -The planner accomplishes this by estimating four numbers per
>>>>> variable - the total number of rows of the relation, the number of
>>>>> distinct values for a given variable, and the minimum and maximum
>>>>> value of the variable (when applicable). Using these numbers, the
>>>>> planner estimates a filtering rate of a potential filter.
>>>>> -Because actually creating and implementing the filter adds
>>>>> more operations, there is a minimum threshold of filtering where the
>>>>> filter would actually be useful. Based on testing, we query to see if
>>>>> the estimated filtering rate is higher than 35%, and that informs our
>>>>> decision to use a filter or not.
>>>>>
>>>>> 3. If using a bloom filter, the planner also adjusts the expected cost
>>>>> of Merge Join based on expected performance gain.
>>>>>
>>>>> 4. Capability to build the bloom filter in parallel in case of
>>>>> parallel SeqScan. This is done efficiently by populating a local bloom
>>>>> filter for each parallel worker and then taking a bitwise OR over all
>>>>> the local bloom filters to form a shared bloom filter at the end of
>>>>> the parallel SeqScan.
>>>>>
>>>>> 5. The optimization is GUC controlled, with settings of
>>>>> enable_mergejoin_semijoin_filter and force_mergejoin_semijoin_filter.
>>>>>
>>>>> We found in experiments that there is a significant improvement
>>>>> when using the bloom filter during Merge Join. One experiment involved
>>>>> joining two large tables while varying the theoretical filtering rate
>>>>> (TFR) between the two tables, the TFR is defined as the percentage
>>>>> that the two datasets are disjoint. Both tables in the merge join were
>>>>> the same size. We tested changing the TFR to see the change in
>>>>> filtering optimization.
>>>>>
>>>>> For example, let’s imagine t0 has 10 million rows, which contain the
>>>>> numbers 1 through 10 m

Re: Bloom filter Pushdown Optimization for Merge Join

2022-10-02 Thread Zhihong Yu
On Sat, Oct 1, 2022 at 12:45 AM Zhihong Yu  wrote:

>
>
> On Fri, Sep 30, 2022 at 9:20 PM Zhihong Yu  wrote:
>
>>
>>
>> On Fri, Sep 30, 2022 at 8:40 PM Zhihong Yu  wrote:
>>
>>>
>>>
>>> On Fri, Sep 30, 2022 at 3:44 PM Zheng Li  wrote:
>>>
>>>> Hello,
>>>>
>>>> A bloom filter provides early filtering of rows that cannot be joined
>>>> before they would reach the join operator, the optimization is also
>>>> called a semi join filter (SJF) pushdown. Such a filter can be created
>>>> when one child of the join operator must materialize its derived table
>>>> before the other child is evaluated.
>>>>
>>>> For example, a bloom filter can be created using the the join keys for
>>>> the build side/inner side of a hash join or the outer side of a merge
>>>> join, the bloom filter can then be used to pre-filter rows on the
>>>> other side of the join operator during the scan of the base relation.
>>>> The thread about “Hash Joins vs. Bloom Filters / take 2” [1] is good
>>>> discussion on using such optimization for hash join without going into
>>>> the pushdown of the filter where its performance gain could be further
>>>> increased.
>>>>
>>>> We worked on prototyping bloom filter pushdown for both hash join and
>>>> merge join. Attached is a patch set for bloom filter pushdown for
>>>> merge join. We also plan to send the patch for hash join once we have
>>>> it rebased.
>>>>
>>>> Here is a summary of the patch set:
>>>> 1. Bloom Filter Pushdown optimizes Merge Join by filtering rows early
>>>> during the table scan instead of later on.
>>>> -The bloom filter is pushed down along the execution tree to
>>>> the target SeqScan nodes.
>>>> -Experiments show that this optimization can speed up Merge
>>>> Join by up to 36%.
>>>>
>>>> 2. The planner makes the decision to use the bloom filter based on the
>>>> estimated filtering rate and the expected performance gain.
>>>> -The planner accomplishes this by estimating four numbers per
>>>> variable - the total number of rows of the relation, the number of
>>>> distinct values for a given variable, and the minimum and maximum
>>>> value of the variable (when applicable). Using these numbers, the
>>>> planner estimates a filtering rate of a potential filter.
>>>> -Because actually creating and implementing the filter adds
>>>> more operations, there is a minimum threshold of filtering where the
>>>> filter would actually be useful. Based on testing, we query to see if
>>>> the estimated filtering rate is higher than 35%, and that informs our
>>>> decision to use a filter or not.
>>>>
>>>> 3. If using a bloom filter, the planner also adjusts the expected cost
>>>> of Merge Join based on expected performance gain.
>>>>
>>>> 4. Capability to build the bloom filter in parallel in case of
>>>> parallel SeqScan. This is done efficiently by populating a local bloom
>>>> filter for each parallel worker and then taking a bitwise OR over all
>>>> the local bloom filters to form a shared bloom filter at the end of
>>>> the parallel SeqScan.
>>>>
>>>> 5. The optimization is GUC controlled, with settings of
>>>> enable_mergejoin_semijoin_filter and force_mergejoin_semijoin_filter.
>>>>
>>>> We found in experiments that there is a significant improvement
>>>> when using the bloom filter during Merge Join. One experiment involved
>>>> joining two large tables while varying the theoretical filtering rate
>>>> (TFR) between the two tables, the TFR is defined as the percentage
>>>> that the two datasets are disjoint. Both tables in the merge join were
>>>> the same size. We tested changing the TFR to see the change in
>>>> filtering optimization.
>>>>
>>>> For example, let’s imagine t0 has 10 million rows, which contain the
>>>> numbers 1 through 10 million randomly shuffled. Also, t1 has the
>>>> numbers 4 million through 14 million randomly shuffled. Then the TFR
>>>> for a join of these two tables is 40%, since 40% of the tables are
>>>> disjoint from the other table (1 through 4 million for t0, 10 million
>>>> through 14 million for t4).
>>>>
>>&

Re: Bloom filter Pushdown Optimization for Merge Join

2022-10-01 Thread Zhihong Yu
On Fri, Sep 30, 2022 at 9:20 PM Zhihong Yu  wrote:

>
>
> On Fri, Sep 30, 2022 at 8:40 PM Zhihong Yu  wrote:
>
>>
>>
>> On Fri, Sep 30, 2022 at 3:44 PM Zheng Li  wrote:
>>
>>> Hello,
>>>
>>> A bloom filter provides early filtering of rows that cannot be joined
>>> before they would reach the join operator, the optimization is also
>>> called a semi join filter (SJF) pushdown. Such a filter can be created
>>> when one child of the join operator must materialize its derived table
>>> before the other child is evaluated.
>>>
>>> For example, a bloom filter can be created using the the join keys for
>>> the build side/inner side of a hash join or the outer side of a merge
>>> join, the bloom filter can then be used to pre-filter rows on the
>>> other side of the join operator during the scan of the base relation.
>>> The thread about “Hash Joins vs. Bloom Filters / take 2” [1] is good
>>> discussion on using such optimization for hash join without going into
>>> the pushdown of the filter where its performance gain could be further
>>> increased.
>>>
>>> We worked on prototyping bloom filter pushdown for both hash join and
>>> merge join. Attached is a patch set for bloom filter pushdown for
>>> merge join. We also plan to send the patch for hash join once we have
>>> it rebased.
>>>
>>> Here is a summary of the patch set:
>>> 1. Bloom Filter Pushdown optimizes Merge Join by filtering rows early
>>> during the table scan instead of later on.
>>> -The bloom filter is pushed down along the execution tree to
>>> the target SeqScan nodes.
>>> -Experiments show that this optimization can speed up Merge
>>> Join by up to 36%.
>>>
>>> 2. The planner makes the decision to use the bloom filter based on the
>>> estimated filtering rate and the expected performance gain.
>>> -The planner accomplishes this by estimating four numbers per
>>> variable - the total number of rows of the relation, the number of
>>> distinct values for a given variable, and the minimum and maximum
>>> value of the variable (when applicable). Using these numbers, the
>>> planner estimates a filtering rate of a potential filter.
>>> -Because actually creating and implementing the filter adds
>>> more operations, there is a minimum threshold of filtering where the
>>> filter would actually be useful. Based on testing, we query to see if
>>> the estimated filtering rate is higher than 35%, and that informs our
>>> decision to use a filter or not.
>>>
>>> 3. If using a bloom filter, the planner also adjusts the expected cost
>>> of Merge Join based on expected performance gain.
>>>
>>> 4. Capability to build the bloom filter in parallel in case of
>>> parallel SeqScan. This is done efficiently by populating a local bloom
>>> filter for each parallel worker and then taking a bitwise OR over all
>>> the local bloom filters to form a shared bloom filter at the end of
>>> the parallel SeqScan.
>>>
>>> 5. The optimization is GUC controlled, with settings of
>>> enable_mergejoin_semijoin_filter and force_mergejoin_semijoin_filter.
>>>
>>> We found in experiments that there is a significant improvement
>>> when using the bloom filter during Merge Join. One experiment involved
>>> joining two large tables while varying the theoretical filtering rate
>>> (TFR) between the two tables, the TFR is defined as the percentage
>>> that the two datasets are disjoint. Both tables in the merge join were
>>> the same size. We tested changing the TFR to see the change in
>>> filtering optimization.
>>>
>>> For example, let’s imagine t0 has 10 million rows, which contain the
>>> numbers 1 through 10 million randomly shuffled. Also, t1 has the
>>> numbers 4 million through 14 million randomly shuffled. Then the TFR
>>> for a join of these two tables is 40%, since 40% of the tables are
>>> disjoint from the other table (1 through 4 million for t0, 10 million
>>> through 14 million for t4).
>>>
>>> Here is the performance test result joining two tables:
>>> TFR: theoretical filtering rate
>>> EFR: estimated filtering rate
>>> AFR: actual filtering rate
>>> HJ: hash join
>>> MJ Default: default merge join
>>> MJ Filter: merge join with bloom filter optimization enabled
>>> MJ Filter Forced: merge join with

Re: Bloom filter Pushdown Optimization for Merge Join

2022-09-30 Thread Zhihong Yu
On Fri, Sep 30, 2022 at 8:40 PM Zhihong Yu  wrote:

>
>
> On Fri, Sep 30, 2022 at 3:44 PM Zheng Li  wrote:
>
>> Hello,
>>
>> A bloom filter provides early filtering of rows that cannot be joined
>> before they would reach the join operator, the optimization is also
>> called a semi join filter (SJF) pushdown. Such a filter can be created
>> when one child of the join operator must materialize its derived table
>> before the other child is evaluated.
>>
>> For example, a bloom filter can be created using the the join keys for
>> the build side/inner side of a hash join or the outer side of a merge
>> join, the bloom filter can then be used to pre-filter rows on the
>> other side of the join operator during the scan of the base relation.
>> The thread about “Hash Joins vs. Bloom Filters / take 2” [1] is good
>> discussion on using such optimization for hash join without going into
>> the pushdown of the filter where its performance gain could be further
>> increased.
>>
>> We worked on prototyping bloom filter pushdown for both hash join and
>> merge join. Attached is a patch set for bloom filter pushdown for
>> merge join. We also plan to send the patch for hash join once we have
>> it rebased.
>>
>> Here is a summary of the patch set:
>> 1. Bloom Filter Pushdown optimizes Merge Join by filtering rows early
>> during the table scan instead of later on.
>> -The bloom filter is pushed down along the execution tree to
>> the target SeqScan nodes.
>> -Experiments show that this optimization can speed up Merge
>> Join by up to 36%.
>>
>> 2. The planner makes the decision to use the bloom filter based on the
>> estimated filtering rate and the expected performance gain.
>> -The planner accomplishes this by estimating four numbers per
>> variable - the total number of rows of the relation, the number of
>> distinct values for a given variable, and the minimum and maximum
>> value of the variable (when applicable). Using these numbers, the
>> planner estimates a filtering rate of a potential filter.
>> -Because actually creating and implementing the filter adds
>> more operations, there is a minimum threshold of filtering where the
>> filter would actually be useful. Based on testing, we query to see if
>> the estimated filtering rate is higher than 35%, and that informs our
>> decision to use a filter or not.
>>
>> 3. If using a bloom filter, the planner also adjusts the expected cost
>> of Merge Join based on expected performance gain.
>>
>> 4. Capability to build the bloom filter in parallel in case of
>> parallel SeqScan. This is done efficiently by populating a local bloom
>> filter for each parallel worker and then taking a bitwise OR over all
>> the local bloom filters to form a shared bloom filter at the end of
>> the parallel SeqScan.
>>
>> 5. The optimization is GUC controlled, with settings of
>> enable_mergejoin_semijoin_filter and force_mergejoin_semijoin_filter.
>>
>> We found in experiments that there is a significant improvement
>> when using the bloom filter during Merge Join. One experiment involved
>> joining two large tables while varying the theoretical filtering rate
>> (TFR) between the two tables, the TFR is defined as the percentage
>> that the two datasets are disjoint. Both tables in the merge join were
>> the same size. We tested changing the TFR to see the change in
>> filtering optimization.
>>
>> For example, let’s imagine t0 has 10 million rows, which contain the
>> numbers 1 through 10 million randomly shuffled. Also, t1 has the
>> numbers 4 million through 14 million randomly shuffled. Then the TFR
>> for a join of these two tables is 40%, since 40% of the tables are
>> disjoint from the other table (1 through 4 million for t0, 10 million
>> through 14 million for t1).
>>
>> Here is the performance test result joining two tables:
>> TFR: theoretical filtering rate
>> EFR: estimated filtering rate
>> AFR: actual filtering rate
>> HJ: hash join
>> MJ Default: default merge join
>> MJ Filter: merge join with bloom filter optimization enabled
>> MJ Filter Forced: merge join with bloom filter optimization forced
>>
>> TFR   EFR   AFR   HJ   MJ Default   MJ Filter   MJ Filter Forced
>>
>> -
>> 10 33.46   7.416529   226382194923160
>> 20 37.27  14.85   6483   222902192821930
>> 30 41.32   22.25  6395   2

Re: Bloom filter Pushdown Optimization for Merge Join

2022-09-30 Thread Zhihong Yu
On Fri, Sep 30, 2022 at 3:44 PM Zheng Li  wrote:

> Hello,
>
> A bloom filter provides early filtering of rows that cannot be joined
> before they would reach the join operator, the optimization is also
> called a semi join filter (SJF) pushdown. Such a filter can be created
> when one child of the join operator must materialize its derived table
> before the other child is evaluated.
>
> For example, a bloom filter can be created using the join keys for
> the build side/inner side of a hash join or the outer side of a merge
> join, the bloom filter can then be used to pre-filter rows on the
> other side of the join operator during the scan of the base relation.
> The thread about “Hash Joins vs. Bloom Filters / take 2” [1] is good
> discussion on using such optimization for hash join without going into
> the pushdown of the filter where its performance gain could be further
> increased.
>
> We worked on prototyping bloom filter pushdown for both hash join and
> merge join. Attached is a patch set for bloom filter pushdown for
> merge join. We also plan to send the patch for hash join once we have
> it rebased.
>
> Here is a summary of the patch set:
> 1. Bloom Filter Pushdown optimizes Merge Join by filtering rows early
> during the table scan instead of later on.
> -The bloom filter is pushed down along the execution tree to
> the target SeqScan nodes.
> -Experiments show that this optimization can speed up Merge
> Join by up to 36%.
>
> 2. The planner makes the decision to use the bloom filter based on the
> estimated filtering rate and the expected performance gain.
> -The planner accomplishes this by estimating four numbers per
> variable - the total number of rows of the relation, the number of
> distinct values for a given variable, and the minimum and maximum
> value of the variable (when applicable). Using these numbers, the
> planner estimates a filtering rate of a potential filter.
> -Because actually creating and implementing the filter adds
> more operations, there is a minimum threshold of filtering where the
> filter would actually be useful. Based on testing, we query to see if
> the estimated filtering rate is higher than 35%, and that informs our
> decision to use a filter or not.
>
> 3. If using a bloom filter, the planner also adjusts the expected cost
> of Merge Join based on expected performance gain.
>
> 4. Capability to build the bloom filter in parallel in case of
> parallel SeqScan. This is done efficiently by populating a local bloom
> filter for each parallel worker and then taking a bitwise OR over all
> the local bloom filters to form a shared bloom filter at the end of
> the parallel SeqScan.
>
> 5. The optimization is GUC controlled, with settings of
> enable_mergejoin_semijoin_filter and force_mergejoin_semijoin_filter.
>
> We found in experiments that there is a significant improvement
> when using the bloom filter during Merge Join. One experiment involved
> joining two large tables while varying the theoretical filtering rate
> (TFR) between the two tables, the TFR is defined as the percentage
> that the two datasets are disjoint. Both tables in the merge join were
> the same size. We tested changing the TFR to see the change in
> filtering optimization.
>
> For example, let’s imagine t0 has 10 million rows, which contain the
> numbers 1 through 10 million randomly shuffled. Also, t1 has the
> numbers 4 million through 14 million randomly shuffled. Then the TFR
> for a join of these two tables is 40%, since 40% of the tables are
> disjoint from the other table (1 through 4 million for t0, 10 million
> through 14 million for t1).
>
> Here is the performance test result joining two tables:
> TFR: theoretical filtering rate
> EFR: estimated filtering rate
> AFR: actual filtering rate
> HJ: hash join
> MJ Default: default merge join
> MJ Filter: merge join with bloom filter optimization enabled
> MJ Filter Forced: merge join with bloom filter optimization forced
>
> TFR   EFR   AFR   HJ   MJ Default   MJ Filter   MJ Filter Forced
>
> -
> 10 33.46   7.416529   226382194923160
> 20 37.27  14.85   6483   222902192821930
> 30 41.32   22.25  6395   223742071820794
> 40 45.67   29.76272   219691944919410
> 50 50.41   37.16210   214121822218224
> 60 55.64   44.51  6052   211081706017018
> 70 61.59   51.98  5947   210201568215737
> 80 68.64   59.36  5761   208121441114437
> 90 77.83   66.86  5701   205851317113200
> Table. Execution Time (ms) vs Filtering Rate (%) for Joining Two
> Tables of 10M Rows.
>
> Attached you can find figures of the same performance test and a SQL script
> to reproduce the performance test.
>
> The first thing to 
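
To make item 4 of the summary above (parallel construction of the filter)
concrete, the per-worker filters can be combined with a bitwise OR because
every worker hashes its keys into a bitmap of the same size with the same
hash functions. A minimal sketch of that merge step, with type and function
names invented for illustration (they are not the patch's actual symbols):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical bloom filter layout: a plain bitmap of 64-bit words. */
    typedef struct bloom_sketch
    {
        size_t      nwords;         /* number of 64-bit words in the bitmap */
        uint64_t   *words;          /* the bitmap itself */
    } bloom_sketch;

    /* OR a worker-local filter into the shared filter.  Any key whose bits
     * are set in some local filter ends up set in the shared one, so the
     * shared filter never produces a false negative. */
    static void
    bloom_sketch_merge(bloom_sketch *shared, const bloom_sketch *local)
    {
        for (size_t i = 0; i < shared->nwords; i++)
            shared->words[i] |= local->words[i];
    }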

Re: [BUG] parenting a PK constraint to a self-FK one (Was: Self FK oddity when attaching a partition)

2022-09-30 Thread Zhihong Yu
On Fri, Sep 30, 2022 at 3:30 PM Jehan-Guillaume de Rorthais 
wrote:

> Hi,
>
> Please, find in attachment a small serie of patch:
>
>   0001 fix the constraint parenting bug. Not much to say. It's basically
> your
>   patch we discussed with some more comments and the check on contype
> equals to
>   either primary, unique or exclusion.
>
>   0002 fix the self-FK being cloned twice on partitions
>
>   0003 add a regression test validating both fix.
>
> I should confess than even with these fix, I'm still wondering about this
> code
> sanity as we could still end up with a PK on a partition being parented
> with a
> simple unique constraint from the table, on a field not even NOT NULL:
>
>   DROP TABLE IF EXISTS parted_self_fk, part_with_pk;
>
>   CREATE TABLE parted_self_fk (
> id bigint,
> id_abc bigint,
> FOREIGN KEY (id_abc) REFERENCES parted_self_fk(id),
> UNIQUE (id)
>   )
>   PARTITION BY RANGE (id);
>
>   CREATE TABLE part_with_pk (
> id bigint PRIMARY KEY,
> id_abc bigint,
> CHECK ((id >= 0 AND id < 10))
>   );
>
>   ALTER TABLE parted_self_fk ATTACH
> PARTITION part_with_pk FOR VALUES FROM (0) TO (10);
>
>   SELECT cr.relname, co.conname, co.contype, p.conname AS conparentrelname
>   FROM pg_catalog.pg_constraint co
>   JOIN pg_catalog.pg_class cr ON cr.oid = co.conrelid
>   LEFT JOIN pg_catalog.pg_constraint p ON p.oid = co.conparentid
>   WHERE cr.relname IN ('parted_self_fk', 'part_with_pk')
> AND co.contype IN ('u', 'p');
>
>   DROP TABLE parted_self_fk;
>
>   DROP TABLE
>   CREATE TABLE
>   CREATE TABLE
>   ALTER TABLE
>   relname |conname| contype |   conparentrelname
>
>
> +---+-+---
>parted_self_fk | parted_self_fk_id_key | u   |
>part_with_pk   | part_with_pk_pkey | p   | parted_self_fk_id_key
>   (2 rows)
>
> Nothing forbid the partition to have stricter constraints than the parent
> table, but it feels weird, so it might worth noting it here.
>
> I wonder if AttachPartitionEnsureConstraints() should exists and take care
> of
> comparing/cloning constraints before calling AttachPartitionEnsureIndexes()
> which would handle missing index without paying attention to related
> constraints?
>
> Regards,
>
> On Wed, 24 Aug 2022 12:49:13 +0200
> Alvaro Herrera  wrote:
>
> > On 2022-Aug-24, Jehan-Guillaume de Rorthais wrote:
> >
> > > I was naively wondering about such a patch, but was worrying about
> potential
> > > side effects on ATExecAttachPartitionIdx(), DetachPartitionFinalize()
> and
> > > DefineIndex() where I didn't had a single glance. Did you had a look?
> >
> > No.  But AFAIR all the code there is supposed to worry about unique
> > constraints and PK only, not FKs.  So if something changes, then most
> > likely it was wrong to begin with.
> >
> > > I did a quick ATTACH + DETACH test, and it seems DETACH partly fails
> with
> > > its housecleaning:
> >
> > Ugh.  More fixes required, then.
> >
> > > Looking for few minutes in ATExecDetachPartitionFinalize(), it seems
> it only
> > > support removing the parental link on FK, not to clean the FKs added
> during
> > > the ATTACH DDL anyway. That explains the FK child1->parent left
> behind. But
> > > in fact, this let me wonder if this part of the code ever considered
> > > implication of self-FK during the ATTACH and DETACH process?
> >
> > No, or at least I don't remember thinking about self-referencing FKs.
> > If there are no tests for it, then that's likely what happened.
> >
> > > Why in the first place TWO FK are created during the ATTACH DDL?
> >
> > That's probably a bug too.
> >
>
> Hi,

+* Self-Foreign keys are ignored as the index was preliminary
created

preliminary created -> primarily created

 Cheers
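
For reference, the contype check mentioned for patch 0001 amounts to only
treating a constraint as owning the index when it is a primary key, unique,
or exclusion constraint; a foreign key merely references the index. A hedged
sketch of that test, relying on the CONSTRAINT_* values from
catalog/pg_constraint.h (this is not the actual committed code):

    #include "postgres.h"
    #include "catalog/pg_constraint.h"

    /* Sketch: does this pg_constraint row own an index (PK / unique /
     * exclusion), as opposed to merely referencing one (foreign key)? */
    static bool
    constraint_owns_index(Form_pg_constraint constrForm)
    {
        return constrForm->contype == CONSTRAINT_PRIMARY ||
               constrForm->contype == CONSTRAINT_UNIQUE ||
               constrForm->contype == CONSTRAINT_EXCLUSION;
    }

Inside get_relation_idx_constraint_oid()'s scan loop, rows for which this
returns false would simply be skipped.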


Re: Add support for DEFAULT specification in COPY FROM

2022-09-26 Thread Zhihong Yu
On Mon, Sep 26, 2022 at 8:12 AM Israel Barth Rubio 
wrote:

> Hello Andrew,
>
> > . There needs to be a check that this is being used with COPY FROM, and
> > the restriction needs to be stated in the docs and tested for. c.f.
> > FORCE NULL.
> >
> > . There needs to be support for this in psql's tab_complete.c, and
> > appropriate tests added
> >
> > . There needs to be support for it in contrib/file_fdw/file_fdw.c, and a
> > test added
> >
> > . The tests should include psql's \copy as well as sql COPY
> >
> > . I'm not sure we need a separate regression test file for this.
> > Probably these tests can go at the end of src/test/regress/sql/copy2.sql.
>
> Thanks for your review! I have applied the suggested changes, and I'm
> submitting the new patch version.
>
> Kind regards,
> Israel.
>

Hi,

+   /* attribute is NOT to be copied from input */

I think saying `is NOT copied from input` should suffice.

+   defaults = (bool *) palloc0(num_phys_attrs * sizeof(bool));
+   MemSet(defaults, false, num_phys_attrs * sizeof(bool));

Is the MemSet() call necessary ?

+   /* fieldno is 0-index and attnum is 1-index */

0-index -> 0-indexed

Cheers
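
On the MemSet() question above: palloc0() already returns zero-filled
memory, and false is the all-zero-bytes representation of bool, so the extra
MemSet would be redundant. A minimal sketch of the allocation alone,
assuming the same variable names as the patch:

    /* palloc0 zero-fills the allocation, which is exactly "all false"
     * for a bool array, so no MemSet is needed afterwards. */
    defaults = (bool *) palloc0(num_phys_attrs * sizeof(bool));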


default sorting behavior for index

2022-09-20 Thread Zhihong Yu
Hi,
I was looking at this check in src/backend/parser/parse_utilcmd.c w.r.t.
constraint:

if (indclass->values[i] != defopclass ||
attform->attcollation != index_rel->rd_indcollation[i]
||
attoptions != (Datum) 0 ||
index_rel->rd_indoption[i] != 0)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 errmsg("index \"%s\" column number %d does not
have default sorting behavior", index_name, i + 1),
 errdetail("Cannot create a primary key or
unique constraint using such an index."),

It seems this first came in via `Indexes with INCLUDE columns and their
support in B-tree`

If the index has DESC sorting order, why can't it be used to back a
constraint?
A concrete example would help me understand this.

Thanks


Re: cataloguing NOT NULL constraints

2022-09-09 Thread Zhihong Yu
Hi,
w.r.t. the while loop in findNotNullConstraintAttnum():

+   if (multiple == NULL)
+   break;

I think `pfree(arr)` should be called before breaking.

+   if (constraint->cooked_expr != NULL)
+   return
tryExtractNotNullFromNode(stringToNode(constraint->cooked_expr), rel);
+   else
+   return tryExtractNotNullFromNode(constraint->raw_expr, rel);

nit: the `else` keyword is not needed.

+   if (isnull)
+   elog(ERROR, "null conbin for constraint %u", conForm->oid);

It would be better to expand `conbin` so that the user can better
understand the error.

Cheers

>


Re: Table AM modifications to accept column projection lists

2022-09-05 Thread Zhihong Yu
On Mon, Sep 5, 2022 at 9:51 AM Nikita Malakhov  wrote:

> Hi hackers!
>
> This is the original patch rebased onto v15 master with conflicts
> resolved. I'm currently
> studying it and latest comments in the original thread, and would try go
> the way that
> was mentioned in the thread (last message) -
> [1]
> https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf
> [2] https://github.com/zhihuiFan/postgres/tree/lazy_material_v2
> I agree it is not in the state for review, so I've decided not to change
> patch status,
> just revive the thread because we found that Pluggable Storage API is not
> somewhat
> not sufficient.
> Thanks for the recommendations, I'll check that out.
>
> Hi,
bq. is not somewhat not sufficient.

I am a bit confused by the double negation.
I guess you meant insufficient.

Cheers


Re: freeing LDAPMessage in CheckLDAPAuth

2022-09-05 Thread Zhihong Yu
On Sun, Sep 4, 2022 at 10:37 PM Michael Paquier  wrote:

> On Sun, Sep 04, 2022 at 06:52:37AM -0700, Zhihong Yu wrote:
> > Please take a look at patch v3.
>
> Fine as far as it goes.  I would have put the initialization of
> search_message closer to ldap_search_s() for consistency with libpq.
> That's what we do with ldap_search_st().
> --
> Michael
>
Hi,
Here is patch v4.


v4-ldap-msg-free.patch
Description: Binary data


Re: freeing LDAPMessage in CheckLDAPAuth

2022-09-04 Thread Zhihong Yu
On Sun, Sep 4, 2022 at 3:58 AM Zhihong Yu  wrote:

>
>
> On Sun, Sep 4, 2022 at 12:25 AM Michael Paquier 
> wrote:
>
>> On Sun, Sep 04, 2022 at 01:52:10AM -0400, Tom Lane wrote:
>> > I can't get too excited about this.  All of the error exit paths in
>> > backend authentication code will lead immediately to process exit, so
>> > the possibility of some memory being leaked really has no consequences
>> > worth worrying about.  If we *were* worried about it, sprinkling a few
>> > more ldap_msgfree() calls into the existing code would hardly make it
>> > more bulletproof.
>>
>> Even if this is not critical in the backend for this authentication
>> path, I'd like to think that it is still a good practice for future
>> code so as anything code-pasted around would get the call.  So I see
>> no reason to not put smth on HEAD at least.
>>
> Hi,
> Here is updated patch as you suggested in your previous email.
>
> Thanks
>
Hi,
Please take a look at patch v3.

Thanks


v3-ldap-msg-free.patch
Description: Binary data


Re: freeing LDAPMessage in CheckLDAPAuth

2022-09-04 Thread Zhihong Yu
On Sun, Sep 4, 2022 at 12:25 AM Michael Paquier  wrote:

> On Sun, Sep 04, 2022 at 01:52:10AM -0400, Tom Lane wrote:
> > I can't get too excited about this.  All of the error exit paths in
> > backend authentication code will lead immediately to process exit, so
> > the possibility of some memory being leaked really has no consequences
> > worth worrying about.  If we *were* worried about it, sprinkling a few
> > more ldap_msgfree() calls into the existing code would hardly make it
> > more bulletproof.
>
> Even if this is not critical in the backend for this authentication
> path, I'd like to think that it is still a good practice for future
> code so as anything code-pasted around would get the call.  So I see
> no reason to not put smth on HEAD at least.
>
Hi,
Here is updated patch as you suggested in your previous email.

Thanks


v2-ldap-msg-free.patch
Description: Binary data


freeing LDAPMessage in CheckLDAPAuth

2022-09-03 Thread Zhihong Yu
Hi,
In CheckLDAPAuth(), around line 2606:

if (r != LDAP_SUCCESS)
{
ereport(LOG,
(errmsg("could not search LDAP for filter \"%s\" on
server \"%s\": %s",

It seems that the call to ldap_msgfree() is missing in the above case.
According to
https://www.openldap.org/software//man.cgi?query=ldap_search_s=3=0=OpenLDAP+2.4-Release
:

   Note that the res parameter of ldap_search_ext_s() and ldap_search_s()
   should be freed with ldap_msgfree() regardless of the return value of
   these functions.

Please see the attached patch which frees the search_message in the above case.


Thanks


ldap-msg-free.patch
Description: Binary data
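
For context, the shape of the fix is simply to release the search result
before bailing out of the error branch quoted above. A hedged sketch of that
pattern, where filter and server stand in for the actual values used in
CheckLDAPAuth and other cleanup (such as unbinding the LDAP connection) is
omitted; this is not the exact backend code:

    if (r != LDAP_SUCCESS)
    {
        ereport(LOG,
                (errmsg("could not search LDAP for filter \"%s\" on server \"%s\": %s",
                        filter, server, ldap_err2string(r))));
        /* per the man page, the result must be freed even on failure */
        ldap_msgfree(search_message);
        return STATUS_ERROR;
    }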


Re: On login trigger: take three

2022-09-02 Thread Zhihong Yu
On Fri, Sep 2, 2022 at 8:37 AM Daniel Gustafsson  wrote:

> This had bitrotted a fair bit, attached is a rebase along with (mostly)
> documentation fixes.  0001 adds a generic GUC for ignoring event triggers
> and
> 0002 adds the login event with event trigger support, and hooks it up to
> the
> GUC such that broken triggers wont require single-user mode.  Moving the CF
> entry back to Needs Review.
>
> --
> Daniel Gustafsson   https://vmware.com/
>
> Hi,
For EventTriggerOnLogin():

+   LockSharedObject(DatabaseRelationId, MyDatabaseId, 0,
AccessExclusiveLock);
+
+   tuple = SearchSysCacheCopy1(DATABASEOID,
ObjectIdGetDatum(MyDatabaseId));
+
+   if (!HeapTupleIsValid(tuple))
+   elog(ERROR, "cache lookup failed for database %u",
MyDatabaseId);

Should the lock be taken after the check (HeapTupleIsValid) is done ?

Cheers


Re: cataloguing NOT NULL constraints

2022-08-31 Thread Zhihong Yu
On Wed, Aug 31, 2022 at 4:08 PM Zhihong Yu  wrote:

>
>
> On Wed, Aug 31, 2022 at 3:19 PM Alvaro Herrera 
> wrote:
>
>> So I was wrong in thinking that "this case was simple to implement" as I
>> replied upthread.  Doing that actually required me to rewrite large
>> parts of the patch.  I think it ended up being a good thing, because in
>> hindsight the approach I was using was somewhat bogus anyway, and the
>> current one should be better.  Please find it attached.
>>
>> There are still a few problems, sadly.  Most notably, I ran out of time
>> trying to fix a pg_upgrade issue with pg_dump in binary-upgrade mode.
>> I have to review that again, but I think it'll need a deeper rethink of
>> how we pg_upgrade inherited constraints.  So the pg_upgrade tests are
>> known to fail.  I'm not aware of any other tests failing, but I'm sure
>> the cfbot will prove me wrong.
>>
>> I reluctantly added a new ALTER TABLE subcommand type, AT_SetAttNotNull,
>> to allow setting pg_attribute.attnotnull without adding a CHECK
>> constraint (only used internally).  I would like to find a better way to
>> go about this, so I may remove it again, therefore it's not fully
>> implemented.
>>
>> There are *many* changed regress expect files and I didn't carefully vet
>> all of them.  Mostly it's the addition of CHECK constraints in the
>> footers of many \d listings and stuff like that.  At a quick glance they
>> appear valid, but I need to review them more carefully still.
>>
>> We've had pg_constraint.conparentid for a while now, but for some
>> constraints we continue to use conislocal/coninhcount.  I think we
>> should get rid of that and rely on conparentid completely.
>>
>> An easily fixed issue is that of constraint naming.
>> ChooseConstraintName has an argument for passing known constraint names,
>> but this patch doesn't use it and it must.
>>
>> One issue that I don't currently know how to fix, is the fact that we
>> need to know whether a column is a row type or not (because they need a
>> different null test).  At table creation time that's easy to know,
>> because we have the descriptor already built by the time we add the
>> constraints; but if you do ALTER TABLE .. ADD COLUMN .., ADD CONSTRAINT
>> then we don't.
>>
>> Some ancient code comments suggest that allowing a child table's NOT
>> NULL constraint acquired from parent shouldn't be independently
>> droppable.  This patch doesn't change that, but it's easy to do if we
>> decide to.  However, that'd be a compatibility break, so I'd rather not
>> do it in the same patch that introduces the feature.
>>
>> Overall, there's a lot more work required to get this to a good shape.
>> That said, I think it's the right direction.
>>
>> --
>> Álvaro Herrera   48°01'N 7°57'E  —
>> https://www.EnterpriseDB.com/
>> "La primera ley de las demostraciones en vivo es: no trate de usar el
>> sistema.
>> Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)
>>
>
> Hi,
> For findNotNullConstraintAttnum():
>
> +   if (multiple == NULL)
> +   break;
>
> Shouldn't `pfree(arr)` be called before breaking ?
>
> +static Constraint *makeNNCheckConstraint(Oid nspid, char *constraint_name,
>
> You used `NN` because there is method makeCheckNotNullConstraint, right ?
> I think it would be better to expand `NN` so that its meaning is easy to
> understand.
>
> Cheers
>
Hi,
For tryExtractNotNullFromNode , in the block for `if (rel == NULL)`:

+   return false;

I think you meant to return NULL, since false is a boolean value.

Cheers


Re: cataloguing NOT NULL constraints

2022-08-31 Thread Zhihong Yu
On Wed, Aug 31, 2022 at 3:19 PM Alvaro Herrera 
wrote:

> So I was wrong in thinking that "this case was simple to implement" as I
> replied upthread.  Doing that actually required me to rewrite large
> parts of the patch.  I think it ended up being a good thing, because in
> hindsight the approach I was using was somewhat bogus anyway, and the
> current one should be better.  Please find it attached.
>
> There are still a few problems, sadly.  Most notably, I ran out of time
> trying to fix a pg_upgrade issue with pg_dump in binary-upgrade mode.
> I have to review that again, but I think it'll need a deeper rethink of
> how we pg_upgrade inherited constraints.  So the pg_upgrade tests are
> known to fail.  I'm not aware of any other tests failing, but I'm sure
> the cfbot will prove me wrong.
>
> I reluctantly added a new ALTER TABLE subcommand type, AT_SetAttNotNull,
> to allow setting pg_attribute.attnotnull without adding a CHECK
> constraint (only used internally).  I would like to find a better way to
> go about this, so I may remove it again, therefore it's not fully
> implemented.
>
> There are *many* changed regress expect files and I didn't carefully vet
> all of them.  Mostly it's the addition of CHECK constraints in the
> footers of many \d listings and stuff like that.  At a quick glance they
> appear valid, but I need to review them more carefully still.
>
> We've had pg_constraint.conparentid for a while now, but for some
> constraints we continue to use conislocal/coninhcount.  I think we
> should get rid of that and rely on conparentid completely.
>
> An easily fixed issue is that of constraint naming.
> ChooseConstraintName has an argument for passing known constraint names,
> but this patch doesn't use it and it must.
>
> One issue that I don't currently know how to fix, is the fact that we
> need to know whether a column is a row type or not (because they need a
> different null test).  At table creation time that's easy to know,
> because we have the descriptor already built by the time we add the
> constraints; but if you do ALTER TABLE .. ADD COLUMN .., ADD CONSTRAINT
> then we don't.
>
> Some ancient code comments suggest that allowing a child table's NOT
> NULL constraint acquired from parent shouldn't be independently
> droppable.  This patch doesn't change that, but it's easy to do if we
> decide to.  However, that'd be a compatibility break, so I'd rather not
> do it in the same patch that introduces the feature.
>
> Overall, there's a lot more work required to get this to a good shape.
> That said, I think it's the right direction.
>
> --
> Álvaro Herrera   48°01'N 7°57'E  —
> https://www.EnterpriseDB.com/
> "La primera ley de las demostraciones en vivo es: no trate de usar el
> sistema.
> Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)
>

Hi,
For findNotNullConstraintAttnum():

+   if (multiple == NULL)
+   break;

Shouldn't `pfree(arr)` be called before breaking ?

+static Constraint *makeNNCheckConstraint(Oid nspid, char *constraint_name,

You used `NN` because there is a method makeCheckNotNullConstraint, right?
I think it would be better to expand `NN` so that its meaning is easy to
understand.

Cheers


Re: Removing unneeded self joins

2022-08-28 Thread Zhihong Yu
On Fri, Aug 26, 2022 at 3:02 PM Zhihong Yu  wrote:

> Hi,
> For v36-0001-Remove-self-joins.patch :
>
> bq removes inner join of plane table to itself
>
> plane table -> plain table
>
> For relation_has_unique_index_ext(), it seems when extra_clauses is NULL,
> there is no need to compute `exprs`.
>
> Cheers
>

For remove_self_joins_recurse():

+   if (bms_num_members(relids) > join_collapse_limit)
+   break;

The `break` above only exits the switch statement, not the enclosing foreach
loop. The check should also be done between the foreach and the switch;
otherwise it wouldn't achieve what you want.

Cheers
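
A tiny standalone illustration of the break-inside-switch behavior referred
to above:

    #include <stdio.h>

    int
    main(void)
    {
        for (int i = 0; i < 5; i++)
        {
            switch (i)
            {
                case 2:
                    break;          /* exits only the switch, not the loop */
                default:
                    break;
            }
            printf("still looping: i = %d\n", i);   /* printed for i = 0..4 */
        }
        return 0;
    }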


Re: Removing unneeded self joins

2022-08-26 Thread Zhihong Yu
Hi,
For v36-0001-Remove-self-joins.patch :

bq removes inner join of plane table to itself

plane table -> plain table

For relation_has_unique_index_ext(), it seems when extra_clauses is NULL,
there is no need to compute `exprs`.

Cheers

>


Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2022-08-24 Thread Zhihong Yu
Hi,
I was looking at 0004-COPY_IGNORE_ERRORS.patch

+ * Ignore constraints if IGNORE_ERRORS is enabled
+ */
+static void
+safeExecConstraints(CopyFromState cstate, ResultRelInfo *resultRelInfo,
TupleTableSlot *myslot, EState *estate)

I think the existing ExecConstraints() can be expanded by
checking cstate->opts.ignore_errors so that it can selectively
ignore Constraint Violations.

This way you don't need safeExecConstraints().

Cheers


Re: handling multiple matching constraints in DetachPartitionFinalize()

2022-08-23 Thread Zhihong Yu
On Tue, Aug 23, 2022 at 10:53 AM Alvaro Herrera 
wrote:

> On 2022-Aug-23, Zhihong Yu wrote:
>
> > This is what I came up with.
>
> I suggest you provide a set of SQL commands that provoke some wrong
> behavior with the original code, and show that they generate good
> behavior after the patch.  Otherwise, it's hard to evaluate the
> usefulness of this.
>
> --
> Álvaro HerreraBreisgau, Deutschland  —
> https://www.EnterpriseDB.com/
> "Puedes vivir sólo una vez, pero si lo haces bien, una vez es suficiente"
>

By toggling enable_seqscan on / off with the example from the `parenting a PK
constraint to a self-FK one` thread, it can be shown that a different
constraint OID would be detached, which is incorrect.
However, I am not sure whether toggling enable_seqscan mid-test is
legitimate.

Cheers


Re: handling multiple matching constraints in DetachPartitionFinalize()

2022-08-23 Thread Zhihong Yu
On Tue, Aug 23, 2022 at 10:10 AM Zhihong Yu  wrote:

> Hi,
> I was looking at the following code in DetachPartitionFinalize():
>
> /* If there's a constraint associated with the index, detach it
> too */
> constrOid =
> get_relation_idx_constraint_oid(RelationGetRelid(partRel),
> idxid);
>
> As mentioned in email thread `parenting a PK constraint to a self-FK one`,
> there may be multiple matching constraints, I think we should
> call ConstraintSetParentConstraint() for each of them.
>
> This means adding a helper method similar to
> get_relation_idx_constraint_oid() which finds constraint and calls
> ConstraintSetParentConstraint().
>
> I am preparing a patch.
> Please let me know if my proposal makes sense.
>
> Thanks
>

This is what I came up with.


detach-constraints.patch
Description: Binary data


handling multiple matching constraints in DetachPartitionFinalize()

2022-08-23 Thread Zhihong Yu
Hi,
I was looking at the following code in DetachPartitionFinalize():

/* If there's a constraint associated with the index, detach it too
*/
constrOid =
get_relation_idx_constraint_oid(RelationGetRelid(partRel),
idxid);

As mentioned in the email thread `parenting a PK constraint to a self-FK one`,
there may be multiple matching constraints; I think we should
call ConstraintSetParentConstraint() for each of them.

This means adding a helper method similar to
get_relation_idx_constraint_oid() which finds constraint and calls
ConstraintSetParentConstraint().

I am preparing a patch.
Please let me know if my proposal makes sense.

Thanks


Re: [BUG] parenting a PK constraint to a self-FK one (Was: Self FK oddity when attaching a partition)

2022-08-23 Thread Zhihong Yu
On Tue, Aug 23, 2022 at 9:47 AM Alvaro Herrera 
wrote:

> On 2022-Aug-23, Zhihong Yu wrote:
>
> > I was thinking of the following patch.
> > Basically, if there is only one matching constraint. we still return it.
> >
> > diff --git a/src/postgres/src/backend/catalog/pg_constraint.c
> > b/src/postgres/src/backend/catalog/pg_constraint.c
> > index f0726e9aa0..ddade138b4 100644
> > --- a/src/postgres/src/backend/catalog/pg_constraint.c
> > +++ b/src/postgres/src/backend/catalog/pg_constraint.c
> > @@ -1003,7 +1003,8 @@ get_relation_idx_constraint_oid(Oid relationId, Oid
> > indexId)
> >   constrForm = (Form_pg_constraint) GETSTRUCT(tuple);
> >   if (constrForm->conindid == indexId)
> >   {
> > - constraintId = HeapTupleGetOid(tuple);
> > + if (constraintId == InvalidOid || constrForm->confrelid == 0)
> > + constraintId = HeapTupleGetOid(tuple);
> >   break;
> >   }
> >   }
>
> We could do this, but what do we gain by doing so?  It seems to me that
> my proposed formulation achieves the same and is less fuzzy about what
> the returned constraint is.  Please try to write a code comment that
> explains what this does and see if it makes sense.
>
> For my proposal, it would be "return the OID of a primary key or unique
> constraint associated with the given index in the given relation, or OID
> if no such index is catalogued".  This definition is clearly useful for
> partitioned tables, on which the unique and primary key constraints are
> useful elements.  There's nothing that cares about foreign keys.
>
> --
> Álvaro Herrera PostgreSQL Developer  —
> https://www.EnterpriseDB.com/
> "La virtud es el justo medio entre dos defectos" (Aristóteles)
>

A bigger question I have, even with the additional filtering, is what if
there are multiple constraints ?
How do we decide which unique / primary key constraint to return ?

It looks like there are no known SQL statements leading to such a state, but
should we consider such a possibility?

Cheers


Re: [BUG] parenting a PK constraint to a self-FK one (Was: Self FK oddity when attaching a partition)

2022-08-23 Thread Zhihong Yu
On Tue, Aug 23, 2022 at 9:30 AM Alvaro Herrera 
wrote:

> On 2022-Aug-23, Jehan-Guillaume de Rorthais wrote:
>
> Hi,
>
> [...]
>
> > However, it seems get_relation_idx_constraint_oid(), introduced in
> eb7ed3f3063,
> > assume there could be only ONE constraint depending to an index. But in
> fact,
> > multiple constraints can rely on the same index, eg.: the PK and a self
> > referencing FK. In consequence, when looking for a constraint depending
> on an
> > index for the given relation, either the FK or a PK can appears first
> depending
> > on various conditions. It is then possible to trick it make a FK
> constraint a
> > parent of a PK...
>
> Hmm, wow, that sounds extremely stupid.  I think a sufficient fix might
> be to have get_relation_idx_constraint_oid ignore any constraints that
> are not unique or primary keys.  I tried your scenario with the attached
> and it seems to work correctly.  Can you confirm?  (I only ran the
> pg_regress tests, not anything else for now.)
>
> If this is OK, we should make this API quirkiness very explicit in the
> comments, so the patch needs to be a few lines larger in order to be
> committable.  Also, perhaps the check should be that contype equals
> either primary or unique, rather than it doesn't equal foreign.
>
> --
> Álvaro Herrera   48°01'N 7°57'E  —
> https://www.EnterpriseDB.com/


I was thinking of the following patch.
Basically, if there is only one matching constraint, we still return it.

diff --git a/src/postgres/src/backend/catalog/pg_constraint.c
b/src/postgres/src/backend/catalog/pg_constraint.c
index f0726e9aa0..ddade138b4 100644
--- a/src/postgres/src/backend/catalog/pg_constraint.c
+++ b/src/postgres/src/backend/catalog/pg_constraint.c
@@ -1003,7 +1003,8 @@ get_relation_idx_constraint_oid(Oid relationId, Oid
indexId)
  constrForm = (Form_pg_constraint) GETSTRUCT(tuple);
  if (constrForm->conindid == indexId)
  {
- constraintId = HeapTupleGetOid(tuple);
+ if (constraintId == InvalidOid || constrForm->confrelid == 0)
+ constraintId = HeapTupleGetOid(tuple);
  break;
  }
  }


Re: [BUG] parenting a PK constraint to a self-FK one (Was: Self FK oddity when attaching a partition)

2022-08-23 Thread Zhihong Yu
On Tue, Aug 23, 2022 at 8:07 AM Jehan-Guillaume de Rorthais 
wrote:

> Hi all,
>
> I've been able to work on this issue and isolate where in the code the
> oddity
> is laying.
>
> During ATExecAttachPartition(), AttachPartitionEnsureIndexes() look for
> existing
> required index on the partition to attach. It creates missing index, or
> sets the
> parent's index when a matching one exists on the partition. Good.
>
> When a matching index is found, if the parent index enforce a constraint,
> the
> function look for the similar constraint in the partition-to-be, and set
> the
> constraint parent as well:
>
> constraintOid =
> get_relation_idx_constraint_oid(RelationGetRelid(rel),
> idx);
>
> [...]
>
> /*
>  * If this index is being created in the parent because of a
>  * constraint, then the child needs to have a constraint also,
>  * so look for one.  If there is no such constraint, this
>  * index is no good, so keep looking.
>  */
> if (OidIsValid(constraintOid))
> {
> cldConstrOid = get_relation_idx_constraint_oid(
> RelationGetRelid(attachrel),
> cldIdxId);
> /* no dice */
> if (!OidIsValid(cldConstrOid))
> continue;
>  }
>  /* bingo. */
>  IndexSetParentIndex(attachrelIdxRels[i], idx);
>  if (OidIsValid(constraintOid))
> ConstraintSetParentConstraint(cldConstrOid, constraintOid,
>   RelationGetRelid(attachrel));
>
> However, it seems get_relation_idx_constraint_oid(), introduced in
> eb7ed3f3063,
> assume there could be only ONE constraint depending to an index. But in
> fact,
> multiple constraints can rely on the same index, eg.: the PK and a self
> referencing FK. In consequence, when looking for a constraint depending on
> an
> index for the given relation, either the FK or a PK can appears first
> depending
> on various conditions. It is then possible to trick it make a FK
> constraint a
> parent of a PK...
>
> In the following little scenario, when looking at the constraint linked to
> the PK unique index using the same index than
> get_relation_idx_constraint_oid
> use, this is the self-FK that is actually returned first by
> get_relation_idx_constraint_oid(), NOT the PK:
>
>   postgres=# DROP TABLE IF EXISTS parent, child1;
>
> CREATE TABLE parent (
> id bigint NOT NULL default 1,
> no_part smallint NOT NULL,
> id_abc bigint,
> FOREIGN KEY (id_abc, no_part) REFERENCES parent(id, no_part)
> ON UPDATE RESTRICT ON DELETE RESTRICT,
> PRIMARY KEY (id, no_part)
> )
> PARTITION BY LIST (no_part);
>
> CREATE TABLE child1 (
> id bigint NOT NULL default 1,
> no_part smallint NOT NULL,
> id_abc bigint,
> PRIMARY KEY (id, no_part),
> CONSTRAINT child1 CHECK ((no_part = 1))
> );
>
> -- force an indexscan as get_relation_idx_constraint_oid() use the
> unique
> -- index on (conrelid, contypid, conname) to scan pg_cosntraint
> set enable_seqscan TO off;
> set enable_bitmapscan TO off;
>
> SELECT conname
> FROM pg_constraint
> WHERE conrelid = 'parent'::regclass   <=== parent
>   AND conindid = 'parent_pkey'::regclass; <=== PK index
>
>   DROP TABLE
>   CREATE TABLE
>   CREATE TABLE
>   SET
>   SET
> conname
>   
>parent_id_abc_no_part_fkey < WOOPS!
>parent_pkey
>   (2 rows)
>
> In consequence, when attaching the partition, the PK of child1 is not
> marked as
> partition of the parent's PK, which is wrong. WORST, the PK of child1 is
> actually unexpectedly marked as a partition of the parent's **self-FK**:
>
>   postgres=# ALTER TABLE ONLY parent ATTACH PARTITION child1
>  FOR VALUES IN ('1');
>
> SELECT oid, conname, conparentid, conrelid, confrelid
> FROM pg_constraint
> WHERE conrelid in ('parent'::regclass, 'child1'::regclass)
> ORDER BY 1;
>
>   ALTER TABLE
> oid  |   conname   | conparentid | conrelid |
> confrelid
>
> ---+-+-+--+---
>16700 | parent_pkey |   0 |16695 | 0
>16701 | parent_id_abc_no_part_fkey  |   0 |16695 | 16695
>16706 | child1  |   0 |16702 | 0
>16708 | **child1_pkey** |   **16701** |16702 | 0
>16709 | parent_id_abc_no_part_fkey1 |   16701 |16695 | 16702
>16712 | parent_id_abc_no_part_fkey  |   16701 |16702 | 16695
>   (6 rows)
>
> The expected result should probably be something like:
>
> oid  |   conname   | conparentid | conrelid 

Re: including pid's for `There are XX other sessions using the database`

2022-08-21 Thread Zhihong Yu
On Sun, Aug 21, 2022 at 6:39 AM Julien Rouhaud  wrote:

> Hi,
>
> On Sat, Aug 20, 2022 at 02:52:29AM -0700, Zhihong Yu wrote:
> > On Fri, Aug 19, 2022 at 9:31 PM Euler Taveira  wrote:
> >
> > >
> > Thanks for responding.
> >
> > Since pg_stat_activity shows fewer number of connections compared to the
> > number revealed in the error message,
> > I am not sure the above query would terminate all connections for the
> > database to be dropped.
>
> How exactly are you checking pg_stat_activity?  If you query that view
> right
> after a failed attempt at dropping a database, there's no guarantee to
> find the
> exact same connections on the target database as client might connect or
> disconnect.
>
> If you prevent any further connection by e.g. tweaking the pg_hba.conf
> then you
> have a guarantee that the query will terminate all conflicting connections.
> Using the FORCE option is just a simpler way to do it, as dropdb() starts
> with
> preventing any new connection on the target database.
>
> Overall, I agree that adding the list of pid to the message error message
> doesn't seem useful.
>

Thanks for the comments, Euler and Julien.


timing information for switching database

2022-08-21 Thread Zhihong Yu
Hi,
In sqlsh, I issued `\timing on`.

I don't see timing information displayed for `\c database`.

Does someone know how I can obtain such information ?

Thanks


Re: including pid's for `There are XX other sessions using the database`

2022-08-20 Thread Zhihong Yu
On Fri, Aug 19, 2022 at 9:31 PM Euler Taveira  wrote:

> On Fri, Aug 19, 2022, at 2:10 PM, Zhihong Yu wrote:
>
> I want to poll the community on whether including proc->pid's in the error
> message would be useful for troubleshooting.
>
> Such message is only useful for a parameter into a pg_stat_activity query.
> You
> don't need the PID list if you already have the most important information:
> database name. I don't think revealing the current session PIDs from the
> database you want to drop will buy you anything. It could be a long list
> and it
> does not help you to solve the issue: why wasn't that database removed?
>
> Besides that, if you know that there is a possibility that a connection is
> open, you can always use the FORCE option. The old/other alternative is to
> use
> a query like
>
>SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname =
> 'foo';
>
> (possibly combined with a REVOKE CONNECT or pg_hba.conf modification)
> before
> executing DROP DATABASE.
>
>
> --
> Euler Taveira
> EDB   https://www.enterprisedb.com/
>
>
Thanks for responding.

Since pg_stat_activity shows a smaller number of connections than the
number revealed in the error message,
I am not sure the above query would terminate all connections for the
database to be dropped.


including pid's for `There are XX other sessions using the database`

2022-08-19 Thread Zhihong Yu
Hi,
Currently errdetail_busy_db() only shows the number of other sessions using
the database but doesn't give any detail about them.
For one of the customers, pg_stat_activity is showing a lower number of
connections than the number revealed in the error message.

Looking at CountOtherDBBackends(), it seems proc->pid is available
when nbackends is incremented.

I want to poll the community on whether including proc->pid's in the error
message would be useful for troubleshooting.

Thanks


Re: avoid negating LONG_MIN in cash_out()

2022-08-11 Thread Zhihong Yu
On Thu, Aug 11, 2022 at 6:28 PM Tom Lane  wrote:

> Zhihong Yu  writes:
> > How about patch v2 which uses the same check from cash_in() ?
>
> I'm not sure which part of this statement you're not getting:
> it is completely unacceptable for cash_out to fail on valid
> values of the type.  And this value is valid.  cash_in goes
> out of its way to take it, and you can also produce it via
> arithmetic operators.
>
> I understand that you're trying to get rid of an analyzer warning that
> negating INT64_MIN is (pedantically, not in practice) undefined behavior.
> But the way to fix that is to make the behavior conform to the C spec.
> Perhaps it would work to do
>
> Cashvalue = PG_GETARG_CASH(0);
> uint64  uvalue;
>
> if (value < 0)
> uvalue = -(uint64) value;
> else
> uvalue = value;
>
> and then use uvalue instead of "(uint64) value" in the loop.
> Of course, this begs the question of what negation means for
> an unsigned value.  I believe that this formulation is allowed
> by the C spec and produces the same results as what we do now,
> but I'm not convinced that it's clearer for the reader.
>
> Another possibility is
>
> if (value < 0)
> {
> if (value == INT64_MIN)
> uvalue = however you wanna spell -INT64_MIN;
> else
> uvalue = (uint64) -value;
> }
> else
> uvalue = value;
>
> but this really seems to be letting pedantry get the best of us.
>
> The short answer here is that the code works fine on every platform
> we support.  We know that because we have a regression test checking
> this exact case.  So it's not broken and I don't think there's a
> very good argument that it needs to be fixed.  Maybe the right thing
> is just to add a comment pointing out what happens for INT64_MIN.
>
> regards, tom lane
>
Hi,
Thanks for taking the time to contemplate various possibilities.

I thought of using uint64 as well - but as you have shown, the readability
isn't better.

I will keep this in the back of my mind.

Cheers
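
To make the unsigned-cast idea discussed above concrete, here is a small
standalone illustration of taking the magnitude of INT64_MIN without
undefined behavior; it demonstrates the C rule only and is not the cash.c
code:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        int64_t     value = INT64_MIN;
        uint64_t    uvalue;

        /* Unsigned arithmetic wraps modulo 2^64, so 0 - (uint64_t) value
         * yields the magnitude of value even for INT64_MIN, whereas
         * negating the signed value directly would be undefined. */
        if (value < 0)
            uvalue = 0 - (uint64_t) value;
        else
            uvalue = (uint64_t) value;

        printf("magnitude: %" PRIu64 "\n", uvalue);  /* 9223372036854775808 */
        return 0;
    }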


Re: avoid negating LONG_MIN in cash_out()

2022-08-11 Thread Zhihong Yu
On Thu, Aug 11, 2022 at 12:55 PM David Rowley  wrote:

> On Fri, 12 Aug 2022 at 05:58, Zhihong Yu  wrote:
> > Here is sample output with patch:
> >
> > # SELECT '-92233720368547758.085'::money;
> > ERROR:  value "-92233720368547758.085" is out of range for type money
> > LINE 1: SELECT '-92233720368547758.085'::money;
>
> I'm struggling to follow along here. There are already overflow checks
> for this in cash_in(), which is exactly where they should be.
>
> The above case already fails on master, there's even a regression test
> to make sure it does for this exact case, just look at money.out:356.
> So, if we're already stopping this from happening in cash_in(), why do
> you think it also needs to happen in cash_out()?
>
> I'm also not sure why you opted to use LONG_MIN for your check. The C
> type "Cash" is based on int64, that's not long.
>
> David
>

Hi, David:
I am very sorry for not having looked closer at the sample SQL statement
earlier.
Indeed, the previous statement didn't trigger cash_out().

I think this was because the sanitizer assertion may be reported separately
from the statement that triggers it.
I am still going over the test output, trying to pinpoint the statement.

Meanwhile, I want to thank you for pointing out the constant shouldn't be
used for the boundary check.

How about patch v2 which uses the same check from cash_in() ?
I will see which statement triggers the assertion.

Cheers


cash-out-of-range-v2.patch
Description: Binary data


Re: avoid negating LONG_MIN in cash_out()

2022-08-11 Thread Zhihong Yu
On Thu, Aug 11, 2022 at 10:55 AM Zhihong Yu  wrote:

>
>
> On Thu, Aug 11, 2022 at 10:40 AM Tom Lane  wrote:
>
>> Zhihong Yu  writes:
>> > In cash_out(), we have the following code:
>> > if (value < 0)
>> > {
>> > /* make the amount positive for digit-reconstruction loop */
>> > value = -value;
>>
>> > The negation cannot be represented in type long when the value is
>> LONG_MIN.
>>
>> Possibly not good, but it seems like the subsequent loop works anyway:
>>
>> regression=# select '-92233720368547758.08'::money;
>> money
>> -
>>  -$92,233,720,368,547,758.08
>> (1 row)
>>
>> Note that this exact test case appears in money.sql, so we know that
>> it works everywhere, not only my machine.
>>
>> > It seems we can error out when LONG_MIN is detected instead of
>> continuing
>> > with computation.
>>
>> How could you think that that's an acceptable solution?  Once the
>> value is stored, we'd better be able to print it.
>>
>> regards, tom lane
>>
>
> Thanks for taking a look.
> I raise this thread due to the following assertion :
>
> src/backend/utils/adt/cash.c:356:11: runtime error: negation of
> -9223372036854775808 cannot be represented in type 'Cash' (aka 'long');
> cast to an unsigned type to negate this value to itself
>
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
> ../../../../../../../src/postgres/src/backend/utils/adt/cash.c:356:11
>
>
> Though '-92233720368547758.085'::money displays correct error message in
> other builds, this statement wouldn't pass the build where
> UndefinedBehaviorSanitizer is active.
> I think we should fix this otherwise when there is new assertion triggered
> due to future changes around Cash (or other types covered by money.sql), we
> wouldn't see it.
>
> I am open to other ways of bypassing the above assertion.
>
> Cheers
>

Here is sample output with patch:

# SELECT '-92233720368547758.085'::money;
ERROR:  value "-92233720368547758.085" is out of range for type money
LINE 1: SELECT '-92233720368547758.085'::money;

FYI


Re: avoid negating LONG_MIN in cash_out()

2022-08-11 Thread Zhihong Yu
On Thu, Aug 11, 2022 at 10:40 AM Tom Lane  wrote:

> Zhihong Yu  writes:
> > In cash_out(), we have the following code:
> > if (value < 0)
> > {
> > /* make the amount positive for digit-reconstruction loop */
> > value = -value;
>
> > The negation cannot be represented in type long when the value is
> LONG_MIN.
>
> Possibly not good, but it seems like the subsequent loop works anyway:
>
> regression=# select '-92233720368547758.08'::money;
> money
> -
>  -$92,233,720,368,547,758.08
> (1 row)
>
> Note that this exact test case appears in money.sql, so we know that
> it works everywhere, not only my machine.
>
> > It seems we can error out when LONG_MIN is detected instead of continuing
> > with computation.
>
> How could you think that that's an acceptable solution?  Once the
> value is stored, we'd better be able to print it.
>
> regards, tom lane
>

Thanks for taking a look.
I raised this thread due to the following assertion:

src/backend/utils/adt/cash.c:356:11: runtime error: negation of
-9223372036854775808 cannot be represented in type 'Cash' (aka 'long');
cast to an unsigned type to negate this value to itself

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
../../../../../../../src/postgres/src/backend/utils/adt/cash.c:356:11


Though '-92233720368547758.085'::money displays the correct error message in
other builds, this statement wouldn't pass in a build where
UndefinedBehaviorSanitizer is active.
I think we should fix this; otherwise, if a new assertion is triggered by
future changes around Cash (or other types covered by money.sql), we
wouldn't see it.

I am open to other ways of bypassing the above assertion.

Cheers


avoid negating LONG_MIN in cash_out()

2022-08-11 Thread Zhihong Yu
Hi,
In cash_out(), we have the following code:

if (value < 0)
{
/* make the amount positive for digit-reconstruction loop */
value = -value;

The negation cannot be represented in type long when the value is LONG_MIN.
It seems we can error out when LONG_MIN is detected instead of continuing
with computation.

Please take a look at the patch and provide your feedback.

Thanks


cash-out-of-range.patch
Description: Binary data


Re: dropping datumSort field

2022-08-09 Thread Zhihong Yu
On Tue, Aug 9, 2022 at 9:04 PM David Rowley  wrote:

> On Wed, 10 Aug 2022 at 03:16, Zhihong Yu  wrote:
> > On Tue, Aug 9, 2022 at 8:01 AM Robert Haas 
> wrote:
> >>
> >> One problem with this patch is that, if I apply it, PostgreSQL does not
> compile:
> >>
> >> nodeSort.c:197:6: error: use of undeclared identifier 'tupDesc'
> >> if (tupDesc->natts == 1)
> >> ^
> >> 1 error generated.
> >>
> >> Leaving that aside, I don't really see any advantage in this sort of
> change.
>
> I'm with Robert on this one. If you look at the discussion for that
> commit, you'll find quite a bit of benchmarking was done around this
> [1].  The "if" test here is in quite a hot code path, so we want to
> ensure whatever is being referenced here causes the least amount of
> CPU cache misses.  Volcano executors which process a single row at a
> time are not exactly an ideal workload for a modern processor due to
> the bad L1i and L1d hit ratios. This becomes worse with more plan
> nodes as these caches are more likely to get flushed of useful cache
> lines when more nodes are visited. Various other fields in 'node' have
> just been accessed in the code leading up to the "if
> (node->datumSort)" check, so we're probably not going to encounter any
> CPU pipeline stalls waiting for cache lines to be loaded into L1d.  It
> seems you're proposing to change this and have offered no evidence of
> no performance regressions from doing so.  Going by the compilation
> error above, it seems unlikely that you've given performance any
> consideration at all.
>
> I mean this in the best way possible; for the future, I really
> recommend arriving with ideas that are well researched and tested.
> When you can, arrive with evidence to prove your change is good.  For
> this case, evidence would be benchmark results.  The problem is if you
> were to arrive with patches such as this too often then you'll start
> to struggle to get attention from anyone, let alone a committer. You
> don't want to build a reputation for bad quality work as it's likely
> most committers will steer clear of your work.  If you want a good
> recent example of a good proposal, have a look at Yuya Watari's
> write-up at [2] and [3]. There was a clear problem statement there and
> a patch that was clearly a proof of concept only, so the author was
> under no illusion that what he had was ideal.  Now, some other ideas
> were suggested on that thread to hopefully simplify the task and
> provide even better performance. Yuya went off and did that and
> arrived back armed with further benchmark results. I was quite
> impressed with this considering he's not posted to -hackers very
> often.  Now, if you were a committer, would you be looking at the
> patches from the person who sends in half-thought-through ideas, or
> patches from someone that has clearly put a great deal of effort into
> first clearly stating the problem and then proving that the problem is
> solved by the given patch?
>
> David
>
> [1]
> https://www.postgresql.org/message-id/CAApHDvrWV%3Dv0qKsC9_BHqhCn9TusrNvCaZDz77StCO--fmgbKA%40mail.gmail.com
> [2]
> https://www.postgresql.org/message-id/caj2pmkzncgoukse+_5lthd+kbxkvq6h2hqn8esxpxd+cxmg...@mail.gmail.com
> [3]
> https://www.postgresql.org/message-id/caj2pmkzkfvmphovyyuebpwb_nyyvk2+gadqgzxzvnjkvxgt...@mail.gmail.com


Hi, David:
Thanks for your detailed response.


Re: Fast COPY FROM based on batch insert

2022-08-09 Thread Zhihong Yu
On Tue, Aug 9, 2022 at 4:45 AM Etsuro Fujita 
wrote:

> On Tue, Jul 26, 2022 at 7:19 PM Etsuro Fujita 
> wrote:
> > Yeah, I think the patch is getting better, but I noticed some issues,
> > so I'm working on them.  I think I can post a new version in the next
> > few days.
>
> * When running AFTER ROW triggers in CopyMultiInsertBufferFlush(), the
> patch uses the slots passed to ExecForeignBatchInsert(), not the ones
> returned by the callback function, but I don't think that that is
> always correct, as the documentation about the callback function says:
>
>  The return value is an array of slots containing the data that was
>  actually inserted (this might differ from the data supplied, for
>  example as a result of trigger actions.)
>  The passed-in slots can be re-used for this
> purpose.
>
> postgres_fdw re-uses the passed-in slots, but other FDWs might not, so
> I modified the patch to reference the returned slots when running the
> AFTER ROW triggers.  I also modified the patch to initialize the
> tts_tableOid.  Attached is an updated patch, in which I made some
> minor adjustments to CopyMultiInsertBufferFlush() as well.
>
> * The patch produces incorrect error context information:
>
> create extension postgres_fdw;
> create server loopback foreign data wrapper postgres_fdw options
> (dbname 'postgres');
> create user mapping for current_user server loopback;
> create table t1 (f1 int, f2 text);
> create foreign table ft1 (f1 int, f2 text) server loopback options
> (table_name 't1');
> alter table t1 add constraint t1_f1positive check (f1 >= 0);
> alter foreign table ft1 add constraint ft1_f1positive check (f1 >= 0);
>
> — single insert
> copy ft1 from stdin;
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
> >> -1 foo
> >> 1 bar
> >> \.
> ERROR:  new row for relation "t1" violates check constraint "t1_f1positive"
> DETAIL:  Failing row contains (-1, foo).
> CONTEXT:  remote SQL command: INSERT INTO public.t1(f1, f2) VALUES ($1, $2)
> COPY ft1, line 1: "-1 foo"
>
> — batch insert
> alter server loopback options (add batch_size '2');
> copy ft1 from stdin;
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
> >> -1 foo
> >> 1 bar
> >> \.
> ERROR:  new row for relation "t1" violates check constraint "t1_f1positive"
> DETAIL:  Failing row contains (-1, foo).
> CONTEXT:  remote SQL command: INSERT INTO public.t1(f1, f2) VALUES
> ($1, $2), ($3, $4)
> COPY ft1, line 3
>
> In single-insert mode the error context information is correct, but in
> batch-insert mode it isn’t (i.e., the line number isn’t correct).
>
> The error occurs on the remote side, so I'm not sure if there is a
> simple fix.  What I came up with is to just suppress error context
> information other than the relation name, like the attached.  What do
> you think about that?
>
> (In CopyMultiInsertBufferFlush() your patch sets cstate->cur_lineno to
> buffer->linenos[i] even when running AFTER ROW triggers for the i-th
> row returned by ExecForeignBatchInsert(), but that wouldn’t always be
> correct, as the i-th returned row might not correspond to the i-th row
> originally stored in the buffer as the callback function returns only
> the rows that were actually inserted on the remote side.  I think the
> proposed fix would address this issue as well.)
>
> * The patch produces incorrect row count in cases where some/all of
> the rows passed to ExecForeignBatchInsert() weren’t inserted on the
> remote side:
>
> create function trig_null() returns trigger as $$ begin return NULL;
> end $$ language plpgsql;
> create trigger trig_null before insert on t1 for each row execute
> function trig_null();
>
> — single insert
> alter server loopback options (drop batch_size);
> copy ft1 from stdin;
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
> >> 0 foo
> >> 1 bar
> >> \.
> COPY 0
>
> — batch insert
> alter server loopback options (add batch_size '2');
> copy ft1 from stdin;
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
> >> 0foo
> >> 1 bar
> >> \.
> COPY 2
>
> The row count is correct in single-insert mode, but isn’t in batch-insert
> mode.
>
> The reason is that in batch-insert mode the row counter is updated
> immediately after adding the row to the buffer, not after doing
> ExecForeignBatchInsert(), which might ignore the row.  To fix, I
> modified the patch to delay updating the row counter (and the progress
> of the COPY command) until after doing the callback function.  For
> consistency, I also modified the patch to delay it even when batching
> into plain tables.  IMO I think that that would be more consistent
> with the single-insert mode, as in that mode we update them after
> writing the tuple out to the table or sending 

Re: dropping datumSort field

2022-08-09 Thread Zhihong Yu
On Tue, Aug 9, 2022 at 8:24 AM Robert Haas  wrote:

> On Tue, Aug 9, 2022 at 11:16 AM Zhihong Yu  wrote:
> > tupDesc is declared inside `if (!node->sort_Done)` block whereas the
> last reference to tupDesc is outside the if block.
>
> Yep.
>
> > I take your review comment and will go back to do more homework.
>
> The real point for me here is you haven't offered any reason to make
> this change. The structure member in question is basically free.
> Because of alignment padding it uses no more memory, and it makes the
> intent of the code clearer.
>
> Let's not change things just because we could.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com


I should have provided motivation for the patch.

I was aware of recent changes around ExprEvalStep. e.g.

Remove unused fields from ExprEvalStep

Though the datumSort field may be harmless for now, in the future, the
(auto) alignment padding may not work if more fields are added to the
struct.
I think we should always leave some room in structs for future expansion.

As for making the intent of the code clearer, the datumSort field is only
used in one method.
My previous patch removed some comment which should have been shifted into
this method.
In my opinion, localizing the check in a single method is easier to
understand than resorting to an additional struct field.

Cheers


Re: dropping datumSort field

2022-08-09 Thread Zhihong Yu
On Tue, Aug 9, 2022 at 8:01 AM Robert Haas  wrote:

> On Mon, Aug 8, 2022 at 5:51 PM Zhihong Yu  wrote:
> > Hi,
> > I was looking at ExecSort() w.r.t. datum sort.
> >
> > I wonder if the datumSort field can be dropped.
> > Here is a patch illustrating the potential simplification.
> >
> > Please take a look.
>
> One problem with this patch is that, if I apply it, PostgreSQL does not
> compile:
>
> nodeSort.c:197:6: error: use of undeclared identifier 'tupDesc'
> if (tupDesc->natts == 1)
> ^
> 1 error generated.
>
> Leaving that aside, I don't really see any advantage in this sort of
> change.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com


Thanks for trying out the patch.

TupleDesc   tupDesc;

tupDesc is declared inside `if (!node->sort_Done)` block whereas the last
reference to tupDesc is outside the if block.

I take your review comment and will go back to do more homework.

Cheers


dropping datumSort field

2022-08-08 Thread Zhihong Yu
Hi,
I was looking at ExecSort() w.r.t. datum sort.

I wonder if the datumSort field can be dropped.
Here is a patch illustrating the potential simplification.

Please take a look.

Thanks


drop-datum-sort-field.patch
Description: Binary data


Re: Parallel Aggregates for string_agg and array_agg

2022-08-02 Thread Zhihong Yu
On Tue, Aug 2, 2022 at 4:46 PM David Rowley  wrote:

> On Wed, 20 Jun 2018 at 19:08, David Rowley 
> wrote:
> > I'll submit it again when there more consensus that we want this.
>
> Waking up this old thread again. If you don't have a copy, the entire
> thread is in [1].
>
> The remaining item that seemed to cause this patch to be rejected was
> raised in [2]. The summary of that was that it might not be a good
> idea to allow parallel aggregation of string_agg() and array_agg() as
> there might be some people who rely on the current ordering they get
> without having an ORDER BY clause in the aggregate function call.  Tom
> mentioned in [3] that users might not want to add an ORDER BY to their
> aggregate function because the performance of it is terrible.  That
> was true up until 1349d2790 [4], where I changed how ORDER BY /
> DISTINCT aggregation worked to allow the planner to provide pre-sorted
> input rather than always having nodeAgg.c do the sorting.  I think
> this removes quite a lot of the argument against the patch, but not
> all of it.  So here goes testing the water on seeing if any opinions
> have changed over the past few years?
>
> A rebased patch is attached.
>
> David
>
> [1]
> https://www.postgresql.org/message-id/flat/CAKJS1f98yPkRMsE0JnDh72%3DAQEUuE3atiCJtPVCtjhFwzCRJHQ%40mail.gmail.com#8bbce15b9279d2da2da99071f732a99d
> [2] https://www.postgresql.org/message-id/6538.1522096...@sss.pgh.pa.us
> [3] https://www.postgresql.org/message-id/18594.1522099...@sss.pgh.pa.us
> [4]
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1349d2790bf48a4de072931c722f39337e72055e

Hi,
For array_agg_combine():

+   if (state1->alen < reqsize)
+   {
+   /* Use a power of 2 size rather than allocating just reqsize */
+   state1->alen = pg_nextpower2_32(reqsize);
...
+   state1->nelems = reqsize;

I wonder why pg_nextpower2_32(reqsize) is used in the if block. It seems
reqsize should suffice.
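
For reference, the usual rationale for that kind of rounding, sketched
standalone (the helper names below are invented; this is not the actual
array_agg_combine() code): growing to the next power of two amortizes repeated
enlargements across many combine calls at the cost of some slack, so the
question is whether that trade-off is wanted here.

#include <stdint.h>
#include <stdlib.h>

/* Stand-in for pg_nextpower2_32: smallest power of 2 >= n (assumes n >= 1). */
static uint32_t
next_pow2_u32(uint32_t n)
{
    n--;
    n |= n >> 1;
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    return n + 1;
}

/* Grow *values so that it can hold at least reqsize elements. */
static void
grow_to(uint32_t *alen, int64_t **values, uint32_t reqsize)
{
    if (*alen < reqsize)
    {
        /* Round up to a power of 2 rather than exactly reqsize (e.g. 1000 ->
         * 1024), so O(log n) reallocations cover any number of growth steps. */
        *alen = next_pow2_u32(reqsize);
        *values = realloc(*values, (size_t) *alen * sizeof(int64_t));
    }
}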

Cheers


Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2022-08-02 Thread Zhihong Yu
On Tue, Aug 2, 2022 at 11:02 AM Zhihong Yu  wrote:

>
>> Hi, David:
>
> I was looking at the final patch and noticed that setno field
> in agg_presorted_distinctcheck struct is never used.
>
> Looks like it was copied from neighboring struct.
>
> Can you take a look at the patch ?
>
> Thanks
>
>>
>
> Looks like setoff field is not used either.

Cheers


drop-setno-setoff-from-agg_presorted_distinctcheck.patch
Description: Binary data


Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2022-08-02 Thread Zhihong Yu
>
>
Hi, David:

I was looking at the final patch and noticed that setno field
in agg_presorted_distinctcheck struct is never used.

Looks like it was copied from neighboring struct.

Can you take a look at the patch ?

Thanks

>


drop-setno-from-agg_presorted_distinctcheck.patch
Description: Binary data


Re: Making Vars outer-join aware

2022-08-01 Thread Zhihong Yu
On Mon, Aug 1, 2022 at 12:51 PM Tom Lane  wrote:

> Here's a rebase up to HEAD, mostly to placate the cfbot.
> I accounted for d8e34fa7a (s/all_baserels/all_query_rels/
> in those places) and made one tiny bug-fix change.
> Nothing substantive as yet.
>
> regards, tom lane
>
Hi,
For v3-0003-label-Var-nullability-in-parser.patch :

+   if (rtindex > 0 && rtindex <= list_length(pstate->p_nullingrels))
+   relids = (Bitmapset *) list_nth(pstate->p_nullingrels, rtindex - 1);
+   else
+   relids = NULL;
+
+   /*
+* Merge with any already-declared nulling rels.  (Typically there won't
+* be any, but let's get it right if there are.)
+*/
+   if (relids != NULL)

It seems the last if block can be merged into the previous if block. That
way `relids = NULL` can be omitted.
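
In other words, something like the following shape (copied loosely from the
quoted patch, with the merge step elided) is what I have in mind; treat it as a
sketch, not tested code:

    if (rtindex > 0 && rtindex <= list_length(pstate->p_nullingrels))
    {
        Bitmapset  *relids = (Bitmapset *) list_nth(pstate->p_nullingrels,
                                                    rtindex - 1);

        if (relids != NULL)
        {
            /* merge with any already-declared nulling rels here */
        }
    }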

Cheers


Re: Skip partition tuple routing with constant partition key

2022-07-26 Thread Zhihong Yu
On Tue, Jul 26, 2022 at 3:28 PM David Rowley  wrote:

> Thank for looking at this.
>
> On Sat, 23 Jul 2022 at 01:23, Amit Langote 
> wrote:
> > +   /*
> > +* The Datum has changed.  Zero the number of times
> we've
> > +* found last_found_datum_index in a row.
> > +*/
> > +   partdesc->last_found_count = 0;
> >
> > +   /* Zero the "winning streak" on the cache hit count
> */
> > +   partdesc->last_found_count = 0;
> >
> > Might it be better for the two comments to say the same thing?  Also,
> > I wonder which one do you intend as the resetting of last_found_count:
> > setting it to 0 or 1?  I can see that the stanza at the end of the
> > function sets to 1 to start a new cycle.
>
> I think I've addressed all of your comments.  The above one in
> particular caused me to make some larger changes.
>
> The reason I was zeroing the last_found_count in LIST partitioned
> tables when the Datum was not equal to the previous found Datum was
> due to the fact that the code at the end of the function was only
> checking the partition indexes matched rather than the bound_offset vs
> last_found_datum_index.  The reason I wanted to zero this was that if
> you had a partition FOR VALUES IN(1,2), and you received rows with
> values alternating between 1 and 2 then we'd match to the same
> partition each time, however the equality test with the current
> 'values' and the Datum at last_found_datum_index would have been false
> each time.  If we didn't zero the last_found_count we'd have kept
> using the cache path even though the Datum and last Datum wouldn't
> have been equal each time. That would have resulted in always doing
> the cache check and failing, then doing the binary search anyway.
>
> I've now changed the code so that instead of checking the last found
> partition is the same as the last one, I'm now checking if
> bound_offset is the same as last_found_datum_index.  This will be
> false in the "values alternating between 1 and 2" case from above.
> This caused me to have to change how the caching works for LIST
> partitions with a NULL partition which is receiving NULL values.  I've
> coded things now to just skip the cache for that case. Finding the
> correct LIST partition for a NULL value is cheap and no need to cache
> that.  I've also moved all the code which updates the cache fields to
> the bottom of get_partition_for_tuple(). I'm only expecting to do that
> when bound_offset is set by the lookup code in the switch statement.
> Any paths, e.g. HASH partitioning lookup and LIST or RANGE with NULL
> values shouldn't reach the code which updates the partition fields.
> I've added an Assert(bound_offset >= 0) to ensure that stays true.
>
> There's probably a bit more to optimise here too, but not much. I
> don't think the partdesc->last_found_part_index = -1; is needed when
> we're in the code block that does return boundinfo->default_index;
> However, that only might very slightly speedup the case when we're
> inserting continuously into the DEFAULT partition. That code path is
> also used when we fail to find any matching partition. That's not one
> we need to worry about making go faster.
>
> I also ran the benchmarks again and saw that most of the use of
> likely() and unlikely() no longer did what I found them to do earlier.
> So the weirdness we saw there most likely was just down to random code
> layout changes. In this patch, I just dropped the use of either of
> those two macros.
>
> David
>
Hi,

+   return boundinfo->indexes[last_datum_offset + 1];
+
+   else if (cmpval < 0 && last_datum_offset + 1 <
boundinfo->ndatums)

nit: the `else` keyword is not needed.
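
As an aside, here is a small self-contained sketch of the "winning streak"
scheme described above, for anyone following along. The struct, the field
names and the threshold of 16 are assumptions for illustration only, and real
partition bounds would be compared via the partitioning support function
rather than ==.

#define CACHED_FIND_THRESHOLD 16

typedef struct RoutingCache
{
    int     last_found_index;   /* bound offset found last time, or -1 */
    int     last_found_key;     /* key value that matched last time */
    int     last_found_count;   /* consecutive hits on that key */
} RoutingCache;

static int
find_bound_cached(RoutingCache *cache, const int *bounds, int nbounds, int key,
                  int (*bsearch_fn)(const int *bounds, int nbounds, int key))
{
    int     offset;

    /* Only trust the cache after enough consecutive hits on the same key. */
    if (cache->last_found_index >= 0 &&
        cache->last_found_count >= CACHED_FIND_THRESHOLD &&
        key == cache->last_found_key)
        return cache->last_found_index;

    /* Normal path: binary search, then maintain the streak counters. */
    offset = bsearch_fn(bounds, nbounds, key);
    if (offset == cache->last_found_index && key == cache->last_found_key)
        cache->last_found_count++;              /* extend the streak */
    else
    {
        cache->last_found_index = offset;
        cache->last_found_key = key;
        cache->last_found_count = 1;            /* start a new streak */
    }
    return offset;
}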

Cheers


Question about ExplainOneQuery_hook

2022-07-26 Thread Zhihong Yu
Hi,
I was looking at ExplainOneQuery() where ExplainOneQuery_hook is called.

Currently the call to the hook is in the if block and the normal processing is
in the else block.

What if the hook doesn't want to duplicate all of the code that prints the
execution plan?

Please advise.

Thanks


Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2022-07-25 Thread Zhihong Yu
On Mon, Jul 25, 2022 at 4:39 PM David Rowley  wrote:

> On Fri, 22 Jul 2022 at 21:33, Richard Guo  wrote:
> > I can see this problem with
> > the query below:
> >
> > select max(b order by b), max(a order by a) from t group by a;
> >
> > When processing the first aggregate, we compose the 'currpathkeys' as
> > {a, b} and mark this aggregate in 'aggindexes'. When it comes to the
> > second aggregate, we compose its pathkeys as {a} and decide that it is
> > not stronger than 'currpathkeys'. So the second aggregate is not
> > recorded in 'aggindexes'. As a result, we fail to mark aggpresorted for
> > the second aggregate.
>
> Yeah, you're right. I have a missing check to see if currpathkeys are
> better than the pathkeys for the current aggregate. In your example
> case we'd have still processed the 2nd aggregate the old way instead
> of realising we could take the new pre-sorted path for faster
> processing.
>
> I've adjusted that in the attached to make it properly include the
> case where currpathkeys are better.
>
> Thanks for the review.
>
> David
>
Hi,

sort order the the planner chooses is simply : there is duplicate `the`

+   /* mark this aggregate is covered by 'currpathkeys'
*/

is covered by -> as covered by

Cheers


Re: redacting password in SQL statement in server log

2022-07-24 Thread Zhihong Yu
On Sat, Jul 23, 2022 at 5:27 PM Tom Lane  wrote:

> Zhihong Yu  writes:
> > Currently, in situation such as duplicate role creation, the server log
> > would show something such as the following:
>
> > 2022-07-22 13:48:18.251 UTC [330] STATEMENT:  CREATE ROLE test WITH LOGIN
> > PASSWORD 'foobar';
>
> > The password itself should be redacted before logging the statement.
>
> This has been proposed multiple times, and rejected multiple times,
> primarily because it offers only false security: you'll never cover
> all the cases.  (The proposed patch manages to create a bunch of
> false positives to go along with its false negatives, too.)
>
> The only safe answer is to be sure to keep the server log contents
> secure.  Please see prior discussions in the archives.
>
> regards, tom lane
>
Hi,
I am thinking of adding an `if not exists` clause to the `CREATE ROLE` statement:

CREATE ROLE trustworthy if not exists;

In my previous example, if the user can issue the above command, there
would be no SQL statement logged.

Do you think it is worth adding an `if not exists` clause ?

Thanks


Re: redacting password in SQL statement in server log

2022-07-23 Thread Zhihong Yu
On Sat, Jul 23, 2022 at 5:27 PM Tom Lane  wrote:

> Zhihong Yu  writes:
> > Currently, in situation such as duplicate role creation, the server log
> > would show something such as the following:
>
> > 2022-07-22 13:48:18.251 UTC [330] STATEMENT:  CREATE ROLE test WITH LOGIN
> > PASSWORD 'foobar';
>
> > The password itself should be redacted before logging the statement.
>
> This has been proposed multiple times, and rejected multiple times,
> primarily because it offers only false security: you'll never cover
> all the cases.  (The proposed patch manages to create a bunch of
> false positives to go along with its false negatives, too.)
>
> The only safe answer is to be sure to keep the server log contents
> secure.  Please see prior discussions in the archives.
>
> regards, tom lane
>

Pardon my laziness.

I will pay more attention.


redacting password in SQL statement in server log

2022-07-23 Thread Zhihong Yu
Hi,
Currently, in situation such as duplicate role creation, the server log
would show something such as the following:

2022-07-22 13:48:18.251 UTC [330] STATEMENT:  CREATE ROLE test WITH LOGIN
PASSWORD 'foobar';

The password itself should be redacted before logging the statement.

Here is sample output with the patch applied:

2022-07-23 23:28:20.359 UTC [16850] ERROR:  role "test" already exists
2022-07-23 23:28:20.359 UTC [16850] STATEMENT:  CREATE ROLE test WITH LOGIN
PASSWORD

Please take a look at the short patch.
I know variables should be declared at the start of the func - I can do
that once the approach is confirmed.

Cheers


redact-password-in-log.patch
Description: Binary data


potential memory leak in pg_regcomp()

2022-07-22 Thread Zhihong Yu
Hi,
I was looking at pg_regcomp():

re->re_guts = VS(MALLOC(sizeof(struct guts)));

I did some searching to find where re_guts is freed but haven't
found it.
Can someone enlighten me?

Thanks


Re: Freeing sortgroupatts in use_physical_tlist

2022-07-15 Thread Zhihong Yu
On Fri, Jul 15, 2022 at 8:33 PM Tom Lane  wrote:

> Zhihong Yu  writes:
> > I was looking at the code in use_physical_tlist().
> > In the code block checking CP_LABEL_TLIST, I noticed that
> > the Bitmapset sortgroupatts is not freed before returning from the
> method.
> > Looking at create_foreignscan_plan() (in the same file):
> > bms_free(attrs_used);
> > It seems the intermediate Bitmapset is freed before returning.
>
> TBH, I'd say that it's probably the former code not the latter
> that is good practice.  Retail pfree's in code that's not in a
> loop very possibly expend more cycles than they are worth, because
> the space will get cleaned up anyway when the active memory
> context is reset, and pfree is not as cheap as one might wish.
> It might be possible to make a case for one method over the other
> with some study of the particular situation, but you can't say
> a priori which way is better.
>
> On the whole, I would not bother changing either of these bits
> of code without some clear evidence that it matters.  It likely
> doesn't.  It's even more likely that it doesn't matter enough
> to be worth investigating.
>
> regards, tom lane
>
Hi, Tom:
Thanks for responding over the weekend.

I will try to remember what you said.

Cheers


Freeing sortgroupatts in use_physical_tlist

2022-07-15 Thread Zhihong Yu
Hi,
I was looking at the code in use_physical_tlist().

In the code block checking CP_LABEL_TLIST, I noticed that
the Bitmapset sortgroupatts is not freed before returning from the method.

Looking at create_foreignscan_plan() (in the same file):

bms_free(attrs_used);

It seems the intermediate Bitmapset is freed before returning.

I would appreciate review comments for the proposed patch.

Thanks


free-sort-group-atts.patch
Description: Binary data


Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2022-07-13 Thread Zhihong Yu
On Wed, Jul 13, 2022 at 1:05 PM Dmitry Koval  wrote:

> Thanks you!
> I've fixed all things mentioned.
>
> --
> With best regards,
> Dmitry Koval
>
> Postgres Professional: http://postgrespro.com

Hi,
Toward the end of ATExecSplitPartition():

+   /* Unlock new partition. */
+   table_close(newPartRel, NoLock);

 Why is NoLock passed (instead of AccessExclusiveLock) ?

Cheers


Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2022-07-13 Thread Zhihong Yu
On Wed, Jul 13, 2022 at 11:28 AM Dmitry Koval 
wrote:

> Hi!
>
> Patch stop applying due to changes in upstream.
> Here is a rebased version.
>
> --
> With best regards,
> Dmitry Koval
>
> Postgres Professional: http://postgrespro.com

Hi,

+attachPartTable(List **wqueue, Relation rel, Relation partition,
PartitionBoundSpec *bound)

I checked naming of existing methods, such as AttachPartitionEnsureIndexes.
I think it would be better if the above method is
named attachPartitionTable.

+   if (!defaultPartCtx && OidIsValid(defaultPartOid))
+   {
+   pc = createSplitPartitionContext(table_open(defaultPartOid,
AccessExclusiveLock));

Since the value of pc would be passed to defaultPartCtx, there is no need
to assign to pc above. You can assign directly to defaultPartCtx.

+   /* Drop splitted partition. */

splitted -> split

+   /* Rename new partition if it is need. */

need -> needed.

Cheers


Re: should check interrupts in BuildRelationExtStatistics ?

2022-07-12 Thread Zhihong Yu
On Tue, Jul 12, 2022 at 1:31 PM Tom Lane  wrote:

> I wrote:
> > Thomas Munro  writes:
> >> On Wed, Jul 6, 2022 at 11:37 AM Tom Lane  wrote:
> >>> qsort_interruptible
>
> >> +1
>
> > So here's a patch that does it that way.
>
> Hearing no comments, pushed.
>
> regards, tom lane
>
>
Hi,
Looking at the files under src/backend/utils/sort/, looks like license
header is missing from qsort_interruptible.c

Please consider the patch which adds license header to qsort_interruptible.c

Cheers


qsort-interruptible-copyright.patch
Description: Binary data


Re: Making Vars outer-join aware

2022-07-10 Thread Zhihong Yu
On Sun, Jul 10, 2022 at 12:39 PM Tom Lane  wrote:

> Here's v2 of this patch series.  It's functionally identical to v1,
> but I've rebased it over the recent auto-node-support-generation
> changes, and also extracted a few separable bits in hopes of making
> the main planner patch smaller.  (It's still pretty durn large,
> unfortunately.)  Unlike the original submission, each step will
> compile on its own, though the intermediate states mostly don't
> pass all regression tests.
>
> regards, tom lane
>
Hi,
For v2-0004-cope-with-nullability-in-planner.patch.
In remove_unneeded_nulling_relids():

+   if (removable_relids == NULL)

Why is bms_is_empty() not used in the above check ?
Earlier there is `if (bms_is_empty(old_nulling_relids))`

+typedef struct reduce_outer_joins_partial_state

Since there are already reduce_outer_joins_pass1_state
and reduce_outer_joins_pass2_state, a comment
above reduce_outer_joins_partial_state would help other people follow its
purpose.

+   if (j->rtindex)
+   {
+   if (j->jointype == JOIN_INNER)
+   {
+   if (include_inner_joins)
+   result = bms_add_member(result, j->rtindex);
+   }
+   else
+   {
+   if (include_outer_joins)

Since there are other join types beside JOIN_INNER, should there be an
assertion in the else block ? e.g. jointype wouldn't be JOIN_UNIQUE_INNER.

Cheers


Re: Making the subquery alias optional in the FROM clause

2022-07-09 Thread Zhihong Yu
On Sat, Jul 9, 2022 at 5:18 AM Dean Rasheed 
wrote:

> On Sat, 9 Jul 2022 at 12:24, Zhihong Yu  wrote:
> >
> >  It seems the code would be more readable if you keep the assignment in
> else block below:
> >
> > +   else if (rte->rtekind == RTE_SUBQUERY ||
> > +rte->rtekind == RTE_VALUES)
> > continue;
> > -   rtename = rte->join_using_alias->aliasname;
> > }
> > -   else
> > -   rtename = rte->eref->aliasname;
> >
> > because rtename would be assigned in the `rte->rtekind == RTE_JOIN` case.
> >
>
> But then it would need 2 else blocks, one inside the rte->alias ==
> NULL block, for when rtekind is not RTE_JOIN, RTE_SUBQUERY or
> RTE_VALUES, and another after the block, for when rte->alias != NULL.
> I find it more readable this way.
>
> Regards,
> Dean
>

Hi, Dean:
Thanks for the explanation.

I should have looked closer :-)


Re: Making the subquery alias optional in the FROM clause

2022-07-09 Thread Zhihong Yu
On Sat, Jul 9, 2022 at 3:28 AM Dean Rasheed 
wrote:

> On Wed, 6 Jul 2022 at 15:09, Dean Rasheed 
> wrote:
> >
> > I'll post an update in a little while, but first, I found a bug, which
> > revealed a pre-existing bug in transformLockingClause(). I'll start a
> > new thread for that, since it'd be good to get that resolved
> > independently of this patch.
> >
>
> Attached is an update with the following changes:
>
> * Docs updated as suggested.
> * transformLockingClause() updated to skip subquery and values rtes
> without aliases.
> * eref->aliasname changed to "unnamed_subquery" for subqueries without
> aliases.
>
> Regards,
> Dean
>
Hi,
rtename is assigned at the beginning of the loop:

+   char   *rtename = rte->eref->aliasname;

 It seems the code would be more readable if you keep the assignment in
else block below:

+   else if (rte->rtekind == RTE_SUBQUERY ||
+rte->rtekind == RTE_VALUES)
continue;
-   rtename = rte->join_using_alias->aliasname;
}
-   else
-   rtename = rte->eref->aliasname;

because rtename would be assigned in the `rte->rtekind == RTE_JOIN` case.

Cheers


Re: Aggregate leads to superfluous projection from the scan

2022-07-08 Thread Zhihong Yu
On Fri, Jul 8, 2022 at 12:48 PM Zhihong Yu  wrote:

>
>
> On Fri, Jul 8, 2022 at 12:30 PM Tom Lane  wrote:
>
>> Ibrar Ahmed  writes:
>> > I give a quick look and I think in case whenever data is extracted from
>> the
>> > heap it shows all the columns. Therefore when columns are extracted from
>> > the index only it shows the indexed column only.
>>
>> This is operating as designed, and I don't think that the proposed
>> patch is an improvement.  The point of use_physical_tlist() is that
>> returning all the columns is cheaper because it avoids a projection
>> step.  That's true for any case where we have to fetch the heap
>> tuple, so IndexScan is included though IndexOnlyScan is not.
>>
>> Now, that's something that was true a decade or more ago.
>> There's been considerable discussion recently about cases where
>> it's not true anymore, for example with columnar storage or FDWs,
>> and so we ought to invent a way to prevent createplan.c from
>> doing it when it would be counterproductive.  But just summarily
>> turning it off is not an improvement.
>>
>> regards, tom lane
>>
> Hi,
> In createplan.c, there is `change_plan_targetlist`
>
> Plan *
> change_plan_targetlist(Plan *subplan, List *tlist, bool
> tlist_parallel_safe)
>
> But it doesn't have `Path` as parameter.
> So I am not sure whether the check of non-returnable columns should be
> done in change_plan_targetlist().
>
> bq. for example with columnar storage or FDWs,
>
> Yeah. The above is the case where I want to optimize.
>
> Cheers
>
Hi, Tom:
I was looking at the following comment in createplan.c :

 * For table scans, rather than using the relation targetlist (which is
 * only those Vars actually needed by the query), we prefer to generate
a
 * tlist containing all Vars in order.  This will allow the executor to
 * optimize away projection of the table tuples, if possible.

Maybe you can give me some background on the above decision.

Thanks


Re: Aggregate leads to superfluous projection from the scan

2022-07-08 Thread Zhihong Yu
On Fri, Jul 8, 2022 at 12:30 PM Tom Lane  wrote:

> Ibrar Ahmed  writes:
> > I give a quick look and I think in case whenever data is extracted from
> the
> > heap it shows all the columns. Therefore when columns are extracted from
> > the index only it shows the indexed column only.
>
> This is operating as designed, and I don't think that the proposed
> patch is an improvement.  The point of use_physical_tlist() is that
> returning all the columns is cheaper because it avoids a projection
> step.  That's true for any case where we have to fetch the heap
> tuple, so IndexScan is included though IndexOnlyScan is not.
>
> Now, that's something that was true a decade or more ago.
> There's been considerable discussion recently about cases where
> it's not true anymore, for example with columnar storage or FDWs,
> and so we ought to invent a way to prevent createplan.c from
> doing it when it would be counterproductive.  But just summarily
> turning it off is not an improvement.
>
> regards, tom lane
>
Hi,
In createplan.c, there is `change_plan_targetlist`

Plan *
change_plan_targetlist(Plan *subplan, List *tlist, bool tlist_parallel_safe)

But it doesn't have `Path` as parameter.
So I am not sure whether the check of non-returnable columns should be done
in change_plan_targetlist().

bq. for example with columnar storage or FDWs,

Yeah. The above is the case where I want to optimize.

Cheers


Re: Aggregate leads to superfluous projection from the scan

2022-07-08 Thread Zhihong Yu
On Fri, Jul 8, 2022 at 9:40 AM Zhihong Yu  wrote:

> Hi,
> Here is the query which involves aggregate on a single column:
>
>
> https://dbfiddle.uk/?rdbms=postgres_13=44bfd8f6b6b5aad34d00d449c04c5a96
>
> As you can see from `Output:`, there are many columns added which are not
> needed by the query executor.
>
> I wonder if someone has noticed this in the past.
> If so, what was the discussion around this topic ?
>
> Thanks
>
Hi,
With the patch, I was able to get the following output:

 explain (analyze, verbose) /*+ IndexScan(t) */select count(fire_year) from
fires t where objectid <= 200;
  QUERY PLAN
--
 Aggregate  (cost=119.00..119.01 rows=1 width=8) (actual time=9.453..9.453
rows=1 loops=1)
   Output: count(fire_year)
   ->  Index Scan using fires_pkey on public.fires t  (cost=0.00..116.50
rows=1000 width=4) (actual time=9.432..9.432 rows=0 loops=1)
 Output: fire_year
 Index Cond: (t.objectid <= 200)
 Planning Time: 52.598 ms
 Execution Time: 13.082 ms

Please pay attention to the column list after `Output:`

Tom:
Can you take a look and let me know what I may have missed ?

Thanks


index-scan-with-non-returnable.patch
Description: Binary data


Aggregate leads to superfluous projection from the scan

2022-07-08 Thread Zhihong Yu
Hi,
Here is the query which involves aggregate on a single column:

https://dbfiddle.uk/?rdbms=postgres_13=44bfd8f6b6b5aad34d00d449c04c5a96

As you can see from `Output:`, there are many columns added which are not
needed by the query executor.

I wonder if someone has noticed this in the past.
If so, what was the discussion around this topic ?

Thanks


index for inet column

2022-07-07 Thread Zhihong Yu
Hi,
I was able to create a GIN index on an inet column in PG.

GIN is good with points/elements in sets. Is GIN a good index type for an inet
column?
It seems a GiST index would be better.

Comments are welcome.


Re: Bugs in copyfuncs/equalfuncs support for JSON node types

2022-07-04 Thread Zhihong Yu
On Mon, Jul 4, 2022 at 6:23 PM Tom Lane  wrote:

> In reviewing Peter's patch to auto-generate the backend/nodes
> support files, I compared what the patch's script produces to
> what is in the code now.  I found several discrepancies in the
> recently-added parse node types for JSON functions, and as far
> as I can see every one of those discrepancies is an error in
> the existing code.  Some of them are relatively harmless
> (e.g. COPY_LOCATION_FIELD isn't really different from
> COPY_SCALAR_FIELD), but some of them definitely are live bugs.
> I propose the attached patch.
>
> regards, tom lane
>
Hi,
Patch looks good to me.

Thanks


Re: Removing unneeded self joins

2022-07-04 Thread Zhihong Yu
On Mon, Jul 4, 2022 at 6:52 AM Ronan Dunklau  wrote:

> Le jeudi 30 juin 2022, 16:11:51 CEST Andrey Lepikhov a écrit :
> > On 19/5/2022 16:47, Ronan Dunklau wrote:
> > > I'll take a look at that one.
> >
> > New version of the patch, rebased on current master:
> > 1. pgindent over the patch have passed.
> > 2. number of changed files is reduced.
> > 3. Some documentation and comments is added.
>
> Hello Andrey,
>
> Thanks for the updates.
>
> The general approach seems sensible to me, so I'm going to focus on some
> details.
>
> In a very recent thread [1], Tom Lane is proposing to add infrastructure
> to make Var aware of their nullability by outer joins. I wonder if that
> would help with avoiding the need for adding is not null clauses when the
> column is known not null ?
> If we have a precedent for adding a BitmapSet to the Var itself, maybe the
> whole discussion regarding keeping track of nullability can be extended to
> the original column nullability ?
>
> Also, I saw it was mentioned earlier in the thread but how difficult would
> it be to process the transformed quals through the EquivalenceClass
> machinery and the qual simplification ?
> For example, if the target audience of this patch is ORM, or inlined
> views, it wouldn't surprise me to see queries of this kind in the wild,
> which could be avoided altogether:
>
> postgres=# explain (costs off) select * from sj s1 join sj s2 on s1.a =
> s2.a where s1.b = 2 and s2.b =3;
>  QUERY PLAN
> -
>  Seq Scan on sj s2
>Filter: ((a IS NOT NULL) AND (b = 3) AND (b = 2))
> (2 lignes)
>
>
> +   for (counter = 0; counter < list_length(*sources);)
> +   {
> +   ListCell   *cell = list_nth_cell(*sources, counter);
> +   RestrictInfo *rinfo = castNode(RestrictInfo, lfirst(cell));
> +   int counter1;
> +
> 
> + ec->ec_members = list_delete_cell(ec->ec_members, cell);
>
>
> Why don't you use foreach() and foreach_delete_current macros for
> iterating and removing items in the lists, both in update_ec_members and
> update_ec_sources ?
>
>
> +   if ((bms_is_member(k, info->syn_lefthand) ^
> +bms_is_member(r,
> info->syn_lefthand)) ||
> +   (bms_is_member(k,
> info->syn_righthand) ^
> +bms_is_member(r,
> info->syn_righthand)))
>
> I think this is more compact and easier to follow than the previous
> version, but I'm not sure how common it is in postgres source code to use
> that kind of construction ?
>
> Some review about the comments:
>
>
> I see you keep using the terms "the keeping relation" and "the removing
> relation" in reference to the relation that is kept and the one that is
> removed.
> Aside from the grammar (the kept relation or the removed relation), maybe
> it would make it clearer to call them something else. In other parts of the
> code, you used "the remaining relation / the removed relation" which makes
> sense.
>
>  /*
>   * Remove the target relid from the planner's data structures, having
> - * determined that there is no need to include it in the query.
> + * determined that there is no need to include it in the query. Or replace
> + * with another relid.
> + * To reusability, this routine can work in two modes: delete relid from
> a plan
> + * or replace it. It is used in replace mode in a self-join removing
> process.
>
> This could be rephrased: ", optionally replacing it with another relid.
> The latter is used by the self-join removing process."
>
>
> [1]
> https://www.postgresql.org/message-id/flat/830269.1656693747%40sss.pgh.pa.us
>
> --
> Ronan Dunklau
>
>
Hi,
bq. this is more compact and easier to follow than the previous version

A code comment can be added above the expression (involving XOR) to explain
the purpose of the expression.
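
Also, on the foreach_delete_current() suggestion above: roughly this pg_list.h
idiom (available since PostgreSQL 13) is what is being proposed. should_remove()
is only a placeholder for whatever test the loop applies, so this is a sketch
rather than the patch's actual code:

    ListCell   *lc;

    foreach(lc, ec->ec_members)
    {
        EquivalenceMember *em = (EquivalenceMember *) lfirst(lc);

        if (should_remove(em))
            ec->ec_members = foreach_delete_current(ec->ec_members, lc);
    }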

Cheers


Re: Patch proposal: New hooks in the connection path

2022-07-04 Thread Zhihong Yu
On Mon, Jul 4, 2022 at 5:54 AM Drouvot, Bertrand 
wrote:

> Hi,
>
> On 7/2/22 1:00 AM, Nathan Bossart wrote:
> > Could we model this after fmgr_hook?  The first argument in that hook
> > indicates where it is being called from.  This doesn't alleviate the need
> > for several calls to the hook in the authentication logic, but extension
> > authors would only need to define one hook.
>
> I like the idea and indeed fmgr.h looks a good place to model it.
>
> Attached a new patch version doing so.
>
> Thanks
>
> --
>
> Bertrand Drouvot
> Amazon Web Services: https://aws.amazon.com

Hi,
+   FCET_SPT,   /* startup packet timeout */
+   FCET_BSP,   /* bad startup packet */

Looking at existing enum types, such as FmgrHookEventType, the part after the
underscore is a word.
I think it would be good to follow existing practice and make the enums
more readable.
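
For example, something along these lines (the spelled-out names here are purely
illustrative, not taken from the patch) would match the existing style better:

    typedef enum FireClientExitEventType    /* hypothetical type name */
    {
        FCET_STARTUP_PACKET_TIMEOUT,        /* instead of FCET_SPT */
        FCET_BAD_STARTUP_PACKET             /* instead of FCET_BSP */
    } FireClientExitEventType;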

Cheers


Re: pgsql: dshash: Add sequential scan support.

2022-07-04 Thread Zhihong Yu
On Sun, Jul 3, 2022 at 7:56 PM Thomas Munro  wrote:

> [Re-directing to -hackers]
>
> On Fri, Mar 11, 2022 at 2:27 PM Andres Freund  wrote:
> > On 2022-03-10 20:09:56 -0500, Tom Lane wrote:
> > > Andres Freund  writes:
> > > > dshash: Add sequential scan support.
> > > > Add ability to scan all entries sequentially to dshash. The interface is
> > > > similar but a bit different both from that of dynahash and simple dshash
> > > > search functions. The most significant differences is that dshash's interfac
> > > > always needs a call to dshash_seq_term when scan ends.
> > >
> > > Umm ... what about error recovery?  Or have you just cemented the
> > > proposition that long-lived dshashes are unsafe?
> >
> > I don't think this commit made it worse. dshash_seq_term() releases an
> lwlock
> > (which will be released in case of an error) and unsets
> > hash_table->find_[exclusively_]locked. The latter weren't introduced by
> this
> > patch, and are also set by dshash_find().
> >
> > I agree that ->find_[exclusively_]locked are problematic from an error
> > recovery perspective.
>
> Right, as seen in the build farm at [1].  Also reproducible with something
> like:
>
> @@ -269,6 +269,14 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size
> request_size,
> return false;
> }
>
> +   /* XXX random fault injection */
> +   if (op == DSM_OP_ATTACH && random() < RAND_MAX / 8)
> +   {
> +   close(fd);
> +   elog(ERROR, "chaos");
> +   return false;
> +   }
> +
>
> I must have thought that it was easy and practical to write no-throw
> straight-line code and be sure to reach dshash_release_lock(), but I
> concede that it was a bad idea: even dsa_get_address() can throw*, and
> you're often likely to need to call that while accessing dshash
> elements.  For example, in lookup_rowtype_tupdesc_internal(), there is
> a sequence dshash_find(), ..., dsa_get_address(), ...,
> dshash_release_lock(), and I must have considered the range of code
> between find and release to be no-throw, but now I know that it is
> not.
>
> > It's per-backend state at least and just used for assertions. We could
> remove
> > it. Or stop checking it in places where it could be set wrongly:
> dshash_find()
> > and dshash_detach() couldn't check anymore, but the rest of the
> assertions
> > would still be valid afaics?
>
> Yeah, it's all for assertions... let's just remove it.  Those
> assertions were useful to me at some stage in development but won't
> hold as well as I thought, at least without widespread PG_FINALLY(),
> which wouldn't be nice.
>
> *dsa_get_address() might need to adjust the memory map with system
> calls, which might fail.  If you think of DSA as not only an allocator
> but also a poor man's user level virtual memory scheme to tide us over
> until we get threads, then this is a pretty low level kind of
> should-not-happen failure that is analogous on some level to SIGBUS or
> SIGSEGV or something like that, and we should PANIC.  Then we could
> claim that dsa_get_address() is no-throw.  At least, that was one
> argument I had with myself while investigating that strange Solaris
> shm_open() failure, but ... I lost the argument.  It's quite an
> extreme position to take just to support these assertions, which are
> of pretty limited value.
>
> [1]
> https://www.postgresql.org/message-id/20220701232009.jcwxpl45bptaxv5n%40alap3.anarazel.de

Hi,
In the description,

`new shared memory stats system in 15`

It would be clearer to add `release` before `15`.

Cheers


Re: [PATCH] Optimize json_lex_string by batching character copying

2022-06-24 Thread Zhihong Yu
Hi,
Looking at the patch,

+   if (copyable_characters_length)
+   {
+   /* flush copyable characters */
+   appendBinaryStringInfo(
+  lex->strval,
+  s - copyable_characters_length,
+  copyable_characters_length);
+
+   }
break;

I wonder why copyable_characters_length is not reset after flushing.
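
To make the question concrete, here is a stripped-down sketch of the pattern I
would expect (dst_append stands in for appendBinaryStringInfo; this is an
illustration, not the patch itself): the pending counter goes back to zero after
every flush, otherwise the same bytes could be appended again by a later flush.

#include <stddef.h>

static void
copy_with_batching(void (*dst_append)(const char *data, size_t len),
                   const char *s, const char *end)
{
    size_t  pending = 0;        /* analogous to copyable_characters_length */

    for (; s < end; s++)
    {
        if (*s == '\\')                     /* escape: stop batching */
        {
            if (pending > 0)
            {
                dst_append(s - pending, pending);   /* flush the batch */
                pending = 0;                        /* ...and reset it */
            }
            /* handle the escape sequence here */
        }
        else
            pending++;          /* plain byte: just count it for now */
    }

    if (pending > 0)
        dst_append(s - pending, pending);           /* final flush */
}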

Cheers


Re: Defer selection of asynchronous subplans until the executor initialization stage

2022-06-02 Thread Zhihong Yu
On Thu, Jun 2, 2022 at 5:08 AM Etsuro Fujita 
wrote:

> On Wed, Apr 6, 2022 at 3:58 PM Etsuro Fujita 
> wrote:
> > I have committed the patch after modifying it as such.
>
> The patch calls trivial_subqueryscan() during create_append_plan() to
> determine the triviality of a SubqueryScan that is a child of an
> Append node.  Unlike when calling it from
> set_subqueryscan_references(), this is done before some
> post-processing such as set_plan_references() on the subquery.  The
> reason why this is safe wouldn't be that obvious, so I added to
> trivial_subqueryscan() comments explaining this.  Attached is a patch
> for that.
>
> Best regards,
> Etsuro Fujita
>
Hi,
Suggestion on formatting the comment:

+ * node (or that for any plan node in the subplan tree), 2)
+ * set_plan_references() modifies the tlist for every plan node in the

It would be more readable if `2)` is put at the beginning of the second
line above.

+ * preserves the length and order of the tlist, and 3)
set_plan_references()
+ * might delete the topmost plan node like an Append or MergeAppend from
the

Similarly you can move `3) set_plan_references()` to the beginning of the
next line.

Cheers


showing effective max_connections

2022-06-01 Thread Zhihong Yu
Hi,
For non-superusers, the effective maximum number of connections can be lower
than what max_connections specifies.

Should we display the effective value when a non-superuser issues `SHOW
max_connections` ?

Thanks


Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2022-06-01 Thread Zhihong Yu
On Wed, Jun 1, 2022 at 11:58 AM Dmitry Koval  wrote:

> Hi,
>
> 1)
> > For attachPartTable, the parameter wqueue is missing from comment.
> > The parameters of CloneRowTriggersToPartition are called parent and
> partition.
> > I think it is better to name the parameters to attachPartTable in a
> similar manner.
> >
> > For struct SplitPartContext, SplitPartitionContext would be better name.
> >
> > +   /* Store partition contect into list. */
> > contect -> context
>
> Thanks, changed.
>
> 2)
> > For transformPartitionCmdForMerge(), nested loop is used to detect
> duplicate names.
> > If the number of partitions in partcmd->partlist, we should utilize map
> to speed up the check.
>
> I'm not sure we should utilize a map in this case, because the chance that the
> number of merging partitions exceeds dozens is low.
> Is there a function example that uses a map for such a small number of
> elements?
>
> 3)
> > For check_parent_values_in_new_partitions():
> >
> > +   if (!find_value_in_new_partitions(&key->partsupfunc[0],
> > + key->partcollation, parts, nparts, datum, false))
> > +   found = false;
> >
> > It seems we can break out of the loop when found is false.
>
> We have implicit "break" in "for" construction:
>
> +   for (i = 0; i < boundinfo->ndatums && found; i++)
>
> I'll change it to explicit "break;" to avoid confusion.
>
>
> Attached patch with the changes described above.
> --
> With best regards,
> Dmitry Koval
>
> Postgres Professional: http://postgrespro.com

Hi,
Thanks for your response.

w.r.t. #2, I think using a nested loop is fine for now.
If, when this feature is merged, some user comes up with a long merge list,
we can revisit this topic.

Cheers


Re: silence compiler warning in brin.c

2022-06-01 Thread Zhihong Yu
On Wed, Jun 1, 2022 at 10:06 AM Tom Lane  wrote:

> Zhihong Yu  writes:
> > On Wed, Jun 1, 2022 at 9:35 AM Nathan Bossart 
> > wrote:
> >> I'm seeing a compiler warning in brin.c with an older version of gcc.
> >> Specifically, it seems worried that a variable might not be initialized.
> >> AFAICT there is no real risk, so I've attached a small patch to silence
> the
> >> warning.
>
> Yeah, I noticed the other day that a couple of older buildfarm members
> (curculio, gaur) are grousing about this too.  We don't really have a
> hard-n-fast rule about how old a compiler needs to be before we stop
> worrying about its notions about uninitialized variables, but these are
> kind of old.  Still, since this is the only such warning from these
> animals, I'm inclined to silence it.
>
> > It seems the variable can be initialized to the value of GUCNestLevel
> since
> > later in the func:
> > /* Roll back any GUC changes executed by index functions */
> > AtEOXact_GUC(false, save_nestlevel);
>
> That seems pretty inappropriate.  If, thanks to some future thinko,
> control were able to reach the AtEOXact_GUC call despite not having
> called NewGUCNestLevel, we'd want that to fail.  It looks like
> AtEOXact_GUC asserts nestLevel > 0, so that either 0 or -1 would
> do as an "invalid" value; I'd lean a bit to using 0.
>
> regards, tom lane
>
Hi,

if (heapRel == NULL || heapoid != IndexGetRelation(indexoid, false))
ereport(ERROR,

I wonder why the above check is not placed in the else block:

else
heapRel = NULL;

because heapRel is not modified between the else and the above check.
If the check is placed in the else block, we can potentially save the call
to index_open().

Cheers


Re: silence compiler warning in brin.c

2022-06-01 Thread Zhihong Yu
On Wed, Jun 1, 2022 at 9:35 AM Nathan Bossart 
wrote:

> Hi hackers,
>
> I'm seeing a compiler warning in brin.c with an older version of gcc.
> Specifically, it seems worried that a variable might not be initialized.
> AFAICT there is no real risk, so I've attached a small patch to silence the
> warning.
>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com

Hi,
It seems the variable can be initialized to the value of GUCNestLevel since
later in the func:

/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);

Cheers


  1   2   3   4   5   6   >