Re: Avoid a possible out-of-bounds access (src/backend/optimizer/util/relnode.c)

2023-09-23 Thread Etsuro Fujita
Hi,

On Sat, Sep 23, 2023 at 9:59 PM Ranier Vilela  wrote:
> Per Coverity.
> CID 1518088 (#2 of 2): Improper use of negative value (NEGATIVE_RETURNS)
>
> The function bms_singleton_member can return a negative number.
>
> /*
> * Get a child rel for rel2 with the relids.  See above comments.
> */
> if (rel2_is_simple)
> {
> int varno = bms_singleton_member(child_relids2);
>
> child_rel2 = find_base_rel(root, varno);
> }
>
> It turns out that in the get_matching_part_pairs function (joinrels.c), the 
> return of bms_singleton_member is passed to the find_base_rel function, which 
> cannot receive a negative value.
>
> find_base_rel is protected by an Assertion, which effectively means the 
> error is caught in tests and in assert-enabled (debug) builds.
>
> But this does not change the fact that bms_singleton_member can return a 
> negative value, which may occur on some production servers.
>
> Fix by changing the Assertion into a real test, to protect the 
> simple_rel_array array.

Thanks for the report and patch!  I will review the patch.

Best regards,
Etsuro Fujita




Re: [HACKERS] Should logtape.c blocks be of type long?

2023-09-23 Thread Michael Paquier
On Thu, Sep 21, 2023 at 09:53:02PM -0700, Peter Geoghegan wrote:
> No new thoughts. I'm still all in favor of this. Thanks for picking it up.

Okay, thanks.  I guess that nobody would complain if I were to apply
that.

> At some point we should completely ban the use of "long".

Indeed, unless Windows decides that making long 8 bytes wide is wiser,
but I doubt that's ever going to happen, on backward-compatibility
grounds.
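
Portability refresher, since this is what makes "long" hazardous: 64-bit
Windows is LLP64, so "long" stays 4 bytes there, while LP64 platforms
(Linux, macOS) make it 8.  A minimal standalone illustration (my own
sketch, not from any patch; int64_t is just one portable alternative):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int
main(void)
{
	/* 8 on LP64 platforms (Linux, macOS); 4 on 64-bit Windows (LLP64) */
	printf("sizeof(long) = %zu\n", sizeof(long));

	/* a fixed-width type keeps the same range on every platform */
	int64_t		blockno = INT64_C(1) << 40;

	printf("blockno = %" PRId64 "\n", blockno);
	return 0;
}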
--
Michael




Re: nbtree's ScalarArrayOp array mark/restore code appears to be buggy

2023-09-23 Thread Peter Geoghegan
On Sat, Sep 23, 2023 at 11:47 AM Peter Geoghegan  wrote:
> The fix for this should be fairly straightforward. We must teach
> _bt_restore_array_keys() to distinguish "past the end of the array"
> from "after the start of the array", so that doesn't spuriously skip a
> required call to _bt_preprocess_keys() . I already see that the
> problem goes away once _bt_restore_array_keys() is made to call
> _bt_preprocess_keys() unconditionally, so I'm already fairly confident
> that this will work.

Attached draft patch shows how this could work.

_bt_restore_array_keys() has comments that seem to suppose that
calling _bt_preprocess_keys is fairly expensive, and something that's
well worth avoiding. But...is it, really? I wonder if we'd be better
off just biting the bullet and always calling _bt_preprocess_keys
here. Could it really be such a big cost, compared to pinning and
locking the page in the btrestrpos() path?

The current draft SAOP patch calls _bt_preprocess_keys() with a buffer
lock held. This is very likely unsafe, so I'll need to come up with a
principled approach (e.g. a stripped down, specialized version of
_bt_preprocess_keys that's safe to call while holding a lock seems doable).
I've been able to put that off for a few months now because it just
doesn't impact my microbenchmarks to any notable degree (and not just
in those cases where we can use the "numberOfKeys == 1" fast path at
the start of _bt_preprocess_keys). This at least suggests that the cost of
always calling _bt_preprocess_keys in _bt_restore_array_keys might be
acceptable.

--
Peter Geoghegan
From 781fd293aa7556f5ab029d2d5c2dfc829a4e8efd Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sat, 23 Sep 2023 15:00:59 -0700
Subject: [PATCH v1] Fix btmarkpos/btrestrpos array keys edge case.

Oversight in bug fix commit 70bc5833, which taught nbtree to handle
array keys during mark/restore processing, but missed this subtlety.

Author: Peter Geoghegan 
Discussion: https://postgr.es/m/CAH2-WzkgP3DDRJxw6DgjCxo-cu-DKrvjEv_ArkP2ctBJatDCYg@mail.gmail.com
Backpatch: 11- (all supported branches).
---
 src/include/access/nbtree.h  |  2 ++
 src/backend/access/nbtree/nbtree.c   |  1 +
 src/backend/access/nbtree/nbtutils.c | 18 ++
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index f5c66964c..b38279e7b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1043,6 +1043,8 @@ typedef struct BTScanOpaqueData
 
 	/* workspace for SK_SEARCHARRAY support */
 	ScanKey		arrayKeyData;	/* modified copy of scan->keyData */
+	bool		arraysStarted;	/* Started array keys, but have yet to
+ * "reach past the end" of all arrays? */
 	int			numArrayKeys;	/* number of equality-type array keys (-1 if
  * there are any unsatisfiable array keys) */
 	int			arrayKeyCount;	/* count indicating number of array scan keys
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 62bc9917f..6c5b5c69c 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -364,6 +364,7 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 		so->keyData = NULL;
 
 	so->arrayKeyData = NULL;	/* assume no array keys for now */
+	so->arraysStarted = false;
 	so->numArrayKeys = 0;
 	so->arrayKeys = NULL;
 	so->arrayContext = NULL;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 7da499c4d..17e597f11 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -539,6 +539,8 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
 			curArrayKey->cur_elem = 0;
 		skey->sk_argument = curArrayKey->elem_values[curArrayKey->cur_elem];
 	}
+
+	so->arraysStarted = true;
 }
 
 /*
@@ -598,6 +600,14 @@ _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
 	if (scan->parallel_scan != NULL)
 		_bt_parallel_advance_array_keys(scan);
 
+	/*
+	 * When no new array keys were found, the scan is "past the end" of the
+	 * array keys.  _bt_start_array_keys can still "restart" the array keys if
+	 * a rescan is required.
+	 */
+	if (!found)
+		so->arraysStarted = false;
+
 	return found;
 }
 
@@ -648,11 +658,11 @@ _bt_restore_array_keys(IndexScanDesc scan)
 	}
 
 	/*
-	 * If we changed any keys, we must redo _bt_preprocess_keys.  That might
-	 * sound like overkill, but in cases with multiple keys per index column
-	 * it seems necessary to do the full set of pushups.
+	 * If we changed any keys, or if we're not sure that so->keyData[]
+	 * contains the array keys indicated as current within so->arrayKeys[],
+	 * then we must redo _bt_preprocess_keys
 	 */
-	if (changed)
+	if (changed || !so->arraysStarted)
 	{
 		_bt_preprocess_keys(scan);
 		/* The mark should have been set on a consistent set of keys... */
-- 
2.40.1



Re: Eager page freeze criteria clarification

2023-09-23 Thread Melanie Plageman
On Mon, Aug 28, 2023 at 4:30 PM Melanie Plageman
 wrote:
> On Mon, Aug 28, 2023 at 12:26 PM Robert Haas  wrote:
> > In row D, your algorithms are all bad, really bad. I don't quite
> > understand how it can be that bad, actually.
>
> So, I realize now that this test was poorly designed. I meant it to be a
> worst case scenario, but I think one critical part was wrong. In this
> example one client is going at full speed inserting a row and then
> updating it. Then another rate-limited client is deleting old data
> periodically to keep the table at a constant size. I meant to bulk load
> the table with enough data that the delete job would have data to delete
> from the start. With the default autovacuum settings, over the course of
> 45 minutes, I usually saw around 40 autovacuums of the table. Due to the
> rate limiting, the first autovacuum of the table ends up freezing many
> pages that are deleted soon after. Thus the total number of page freezes
> is very high.
>
> I will redo benchmarking of workload D and start the table with the
> number of rows which the DELETE job seeks to maintain. My back of the
> envelope math says that this will mean ratios closer to a dozen (and not
> 5000).
...
> I'll rerun workload D in a more reasonable way and be back with results.

Hi,

I have edited and rerun all of the benchmarks for a subset of the
algorithms. I ran workloads A-I, except for G (G is not an interesting
workload), on Postgres master as well as Postgres with freeze heuristic
algorithms 4 and 5.

As a refresher, algorithms 4 and 5 are as follows:

Freeze tuples on a page opportunistically if the page would be totally
frozen and:

 4. Buffer is already dirty and no FPI is required OR page LSN is older
 than 33% of the LSNs since the last vacuum of the table.

 5. Buffer is already dirty and no FPI is required AND page LSN is older
 than 33% of the LSNs since the last vacuum of the table.

On master, the heuristic is to freeze a page opportunistically if it
would be totally frozen and if pruning emitted an FPI.
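
To make the boolean structure of these criteria easier to compare, here
is a schematic C sketch; every name in it is illustrative only, not an
actual PostgreSQL symbol:

#include <stdbool.h>

/*
 * Hypothetical sketch of the three freeze criteria compared in this
 * benchmark.  All variants first require that the page would end up
 * totally frozen.
 */
static bool
should_freeze_page(bool page_would_be_all_frozen,
				   bool pruning_emitted_fpi,	/* master's criterion */
				   bool buffer_already_dirty,
				   bool freeze_requires_fpi,
				   bool page_lsn_older_than_33pct,	/* of LSNs since last vacuum */
				   int algorithm)
{
	bool	cheap = buffer_already_dirty && !freeze_requires_fpi;
	bool	cold = page_lsn_older_than_33pct;

	if (!page_would_be_all_frozen)
		return false;

	switch (algorithm)
	{
		case 0:					/* master */
			return pruning_emitted_fpi;
		case 4:					/* cheap OR the page looks cold */
			return cheap || cold;
		case 5:					/* cheap AND the page looks cold */
			return cheap && cold;
	}
	return false;
}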

I made a few configuration changes for all benchmarks -- most notably
autovacuum is on for all workloads and autovacuum_naptime is decreased
to 10 seconds. The runtime of all time-based workloads was changed to 40
minutes. All workloads were run for a fixed duration except for the COPY
workload (workload F).

It is also worth noting that data loaded into the standard pgbench
tables (with pgbench -i) was loaded with COPY and not COPY FREEZE. Also,
pgbench -i and the pgbench runs were passed the --no-vacuum option.

The updated workloads are as follows:

A. Gaussian TPC-B like + select-only:
   pgbench scale 450 (DB < SB) with indexes on updated columns
   WL 1: 16 clients doing TPC-B like pgbench but with Gaussian access
   distribution (with parameter 6) for updated tables
   WL 2: 2 clients, rate-limited to 1000 TPS, doing select-only pgbench

B. TPC-B like
   pgbench scale 450
   16 clients running the built-in pgbench TPC-B like script

C. Shifting hot set:
   16 clients inserting a single row then updating an indexed column in
   that row

D. Shifting hot set, delete old data
   WL 1: 16 clients inserting one row then updating an indexed column in
   that row
   WL 2: 1 client, rate limited to 10 TPS, deleting data more than 1
   minute old

E. Shifting hot set, delete new data, access new data
   WL 1: 16 clients cycling through 2 single row inserts, an update of
   an indexed column in a single recently inserted row, and a delete of
   a single recently inserted row
   WL 2: 1 client, rate-limited to 0.2 TPS, selecting data from the last
   5 minutes

F. Many COPYs
   1 client, copying a total of 90 GB of data in 70 individual COPYs

G. N/A

H. Append only table
   16 clients, inserting a single row at a time

I. Work queue
   WL 1: 16 clients inserting a single row, updating a non-indexed
   column in that row twice, then deleting that row
   WL 2: 1 client, rate-limited to 0.02 TPS, COPYing data into another
   table

Below are some of the data I collected, including how much WAL was
generated, how much IO various backends did, TPS, and P99 latency.
These were collected at the end of the pgbench run.

Most numbers were rounded to the nearest whole number to make the chart
easier to read. So, for example, a 0 in WAL GB does not mean that no WAL
was emitted. P99 latencies that would otherwise round to 0 were given a
single decimal place instead.

All IO time is in milliseconds.

"P99 lat" was calculated using the "time" field from pgbench's
"per-transaction logging" option [1]. It is the "transaction's elapsed time"
or duration. I converted it to milliseconds.
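
For anyone who wants to reproduce that computation, here is a minimal
sketch (mine, not Melanie's actual tooling) that reads pgbench
per-transaction log lines from stdin (in the default format, the third
field is the transaction's elapsed time in microseconds) and prints the
P99 in milliseconds:

#include <stdio.h>
#include <stdlib.h>

static int
cmp_double(const void *a, const void *b)
{
	double		x = *(const double *) a;
	double		y = *(const double *) b;

	return (x > y) - (x < y);
}

int
main(void)
{
	double	   *lat = NULL;
	size_t		n = 0,
				cap = 0;
	long		client,
				txno;
	double		time_us;

	/* read "client_id transaction_no time ..." lines; checks elided */
	while (scanf("%ld %ld %lf%*[^\n]", &client, &txno, &time_us) == 3)
	{
		if (n == cap)
			lat = realloc(lat, (cap = cap ? cap * 2 : 1024) * sizeof(*lat));
		lat[n++] = time_us;
	}
	if (n == 0)
		return 1;

	qsort(lat, n, sizeof(*lat), cmp_double);

	/* nearest-rank-ish P99; convert microseconds to milliseconds */
	printf("P99 latency: %.1f ms\n", lat[(size_t) (0.99 * (n - 1))] / 1000.0);

	free(lat);
	return 0;
}

Something like "cat pgbench_log.* | ./p99" would drive it.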

"pages frozen" here refers to the number of pages marked frozen in the
visibility map at the end of the run. "page freezes" refers to the
number of times vacuum froze tuples in lazy_scan_prune(). This does not
include cases in which the page was found to be all frozen but no tuples
were explicitly frozen.

"AVs" is the number of autovacuums for the table or tables specifie

Re: bug fix and documentation improvement about vacuumdb

2023-09-23 Thread Nathan Bossart
On Fri, Sep 22, 2023 at 02:58:20PM +0200, Daniel Gustafsson wrote:
> I had a look at this and tweaked the testcase a bit to make the diff smaller,
> as well as removed the (in some cases) superfluous space in the generated SQL
> query mentioned upthread.  The attached two patches are what I propose we
> commit to fix this, with a backpatch to v16 where it was introduced.

LGTM

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: nbtree's ScalarArrayOp array mark/restore code appears to be buggy

2023-09-23 Thread Peter Geoghegan
On Fri, Sep 22, 2023 at 8:17 PM Peter Geoghegan  wrote:
> My suspicion is that bugfix commit 70bc5833 missed some subtlety
> around what we need to do to make sure that the array keys stay "in
> sync" with the scan. I'll have time to debug the problem some more
> tomorrow.

I've figured out what's going on here.

If I make my test case "group by" both of the indexed columns from the
composite index (either index/table will do, since it's an equijoin),
a more detailed picture emerges that hints at the underlying problem:

┌───┬─┬─┐
│ count │ small_a │ small_b │
├───┼─┼─┤
│ 8,192 │   1 │   2 │
│ 8,192 │   1 │   3 │
│ 8,192 │   1 │   5 │
│ 8,192 │   1 │  10 │
│ 8,192 │   1 │  12 │
│ 8,192 │   1 │  17 │
│ 2,872 │   1 │  19 │
└───┴─┴─┘
(7 rows)

The count for the final row is wrong. It should be 8,192, just like
the earlier counts for lower (small_a, small_b) groups. Notably, the
issue is limited to the grouping that has the highest sort order. That
strongly hints that the problem has something to do with "array
wraparound".

The query qual contains "WHERE small_a IN (1, 3)", so we'll "wrap
around" from cur_elem index 1 (value 3) to cur_elem index 0 (value 1),
without encountering any rows where small_a is 3 (because there aren't
any in the index). That in itself isn't the problem. The problem is
that _bt_restore_array_keys() doesn't consider wraparound. It sees
that "cur_elem == mark_elem" for all array scan keys, and figues that
it doesn't need to call _bt_preprocess_keys(). This is incorrect,
since the current set of search-type scan keys (the set most recently
output, during the last _bt_preprocess_keys() call) still has the
value "3".

The fix for this should be fairly straightforward. We must teach
_bt_restore_array_keys() to distinguish "past the end of the array"
from "after the start of the array", so that doesn't spuriously skip a
required call to _bt_preprocess_keys() . I already see that the
problem goes away once _bt_restore_array_keys() is made to call
_bt_preprocess_keys() unconditionally, so I'm already fairly confident
that this will work.

-- 
Peter Geoghegan




Re: Should rolpassword be toastable?

2023-09-23 Thread Alexander Lakhin

23.09.2023 17:39, Tom Lane wrote:

I'm also now more than just slightly skeptical about whether
pg_database should have a toast table.  Has anybody tried,
say, storing a daticurules field wide enough to end up
out-of-line?


I tried, but failed, because before MyDatabaseId is assigned, pg_database is
accessed in InitPostgres() only via the function GetDatabaseTupleByOid(),
which doesn't unpack the database tuple.
Another access to a system catalog with unassigned MyDatabaseId might occur
in the has_privs_of_role() call, but pg_auth_members contains no toastable
attributes.
So for now only pg_authid is worthy of condemnation, AFAICS.

Best regards,
Alexander




Re: Should rolpassword be toastable?

2023-09-23 Thread Tom Lane
Alexander Lakhin  writes:
> When playing with oversized tuples, I've found that it's possible to set
> such an oversized password for a user that it cannot be validated.
> For example:
> ...
> psql -U "test_user" -c "SELECT 1"
> psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed:
> FATAL:  cannot read pg_class without having selected a database

My inclination is to fix this by removing pg_authid's toast table.
I was not in favor of "let's attach a toast table to every catalog"
to begin with, and I think this failure graphically illustrates
why that was not as great an idea as some people thought.
I also don't think it's worth trying to make it really work.

I'm also now more than just slightly skeptical about whether
pg_database should have a toast table.  Has anybody tried,
say, storing a daticurules field wide enough to end up
out-of-line?

regards, tom lane




Should rolpassword be toastable?

2023-09-23 Thread Alexander Lakhin

Hello hackers,

When playing with oversized tuples, I've found that it's possible to set
such an oversized password for a user that it cannot be validated.
For example:
SELECT format('CREATE ROLE test_user LOGIN PASSWORD ''SCRAM-SHA-256$' || repeat('0', 200) || 
'4096:NuDacwYSUxeOeFUEf3ivTQ==$Wgvq3OCYrJI6eUfvKlAzn4p/j3mzgWzXbVnWeFK1qhY=:r1qSP0j2QojCjLpFUjI0i6ckInvxJDKoyWnN3zF8WCM='';')

\gexec
-- the password is "pass"
(One can achieve the same result with a large salt size, for example, 2048.)

psql -U "test_user" -c "SELECT 1"
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed:
FATAL:  cannot read pg_class without having selected a database


I've tried to set attstorage = 'p' for the rolpassword attribute forcefully
by dirty hacking genbki.pl, and as a result I get an error on CREATE ROLE:
ERROR:  row is too big: size 2000256, maximum size 8160

Best regards,
Alexander




Avoid a possible out-of-bounds access (src/backend/optimizer/util/relnode.c)

2023-09-23 Thread Ranier Vilela
Hi,

Per Coverity.
CID 1518088 (#2 of 2): Improper use of negative value (NEGATIVE_RETURNS)

The function bms_singleton_member can return a negative number.

/*
* Get a child rel for rel2 with the relids.  See above comments.
*/
if (rel2_is_simple)
{
int varno = bms_singleton_member(child_relids2);

child_rel2 = find_base_rel(root, varno);
}

It turns out that in the get_matching_part_pairs function (joinrels.c), the
return of bms_singleton_member is passed to the find_base_rel function,
which cannot receive a negative value.

find_base_rel is protected by an Assertion, which effectively means the
error is caught in tests and in assert-enabled (debug) builds.

But this does not change the fact that bms_singleton_member can return a
negative value, which may occur on some production servers.

Fix by changing the Assertion into a real test, to protect the
simple_rel_array array.
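
For reference, here is a sketch of how that change to find_base_rel()
(in relnode.c) might look; this is my reading of the proposal, not
necessarily the exact patch text:

/*
 * Sketch: turn find_base_rel's Assert into a runtime check, so that a
 * negative relid from bms_singleton_member() can never index
 * simple_rel_array out of bounds in production (non-assert) builds.
 */
RelOptInfo *
find_base_rel(PlannerInfo *root, int relid)
{
	RelOptInfo *rel;

	/* Was: Assert(relid > 0 && relid < root->simple_rel_array_size); */
	if (relid < 1 || relid >= root->simple_rel_array_size)
		elog(ERROR, "no relation entry for relid %d", relid);

	rel = root->simple_rel_array[relid];
	if (rel)
		return rel;

	elog(ERROR, "no relation entry for relid %d", relid);

	return NULL;				/* keep compiler quiet */
}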

best regards,
Ranier Vilela


0001-Avoid-possible-out-of-bounds-access.patch
Description: Binary data