Re: Avoid a possible out-of-bounds access (src/backend/optimizer/util/relnode.c)
Hi,

On Sat, Sep 23, 2023 at 9:59 PM Ranier Vilela wrote:
> Per Coverity.
> CID 1518088 (#2 of 2): Improper use of negative value (NEGATIVE_RETURNS)
>
> The function bms_singleton_member can return a negative number.
>
>     /*
>      * Get a child rel for rel2 with the relids. See above comments.
>      */
>     if (rel2_is_simple)
>     {
>         int varno = bms_singleton_member(child_relids2);
>
>         child_rel2 = find_base_rel(root, varno);
>     }
>
> It turns out that in the get_matching_part_pairs function (joinrels.c),
> the return of bms_singleton_member is passed to the find_base_rel
> function, which cannot receive a negative value.
>
> find_base_rel is protected by an Assertion, which effectively indicates
> that the error does not occur in tests and in DEBUG mode.
>
> But this does not change the fact that bms_singleton_member can return a
> negative value, which may occur on some production servers.
>
> Fix by changing the Assertion into a real test, to protect the
> simple_rel_array array.

Thanks for the report and patch!  I will review the patch.

Best regards,
Etsuro Fujita
Re: [HACKERS] Should logtape.c blocks be of type long?
On Thu, Sep 21, 2023 at 09:53:02PM -0700, Peter Geoghegan wrote:
> No new thoughts. I'm still all in favor of this. Thanks for picking it up.

Okay, thanks.  I guess that nobody would complain if I were to apply that.

> At some point we should completely ban the use of "long".

Indeed -- unless Windows decides that making long 8 bytes is wiser, but I
doubt that's ever going to happen, on backward-compatibility grounds.
--
Michael
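[Editor's note: the data-model point above can be illustrated with a small
standalone C sketch. This is not PostgreSQL code; the helper names are
invented, and in-tree code would use the PostgreSQL int64 typedef rather
than int64_t.]

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Why "long" is a poor type for block numbers: its width depends on the
 * platform's data model.  LP64 platforms (Linux, macOS) make long 8 bytes,
 * but LLP64 (64-bit Windows) keeps it at 4 bytes, while an explicit
 * 64-bit type is 8 bytes everywhere.
 */
static inline size_t
long_width(void)
{
	return sizeof(long);		/* 4 on LLP64 Windows, 8 on LP64 Unix */
}

static inline size_t
int64_width(void)
{
	return sizeof(int64_t);		/* always 8, by definition */
}
```

On an LP64 system both widths are 8 and the bug is invisible; only on
LLP64 Windows does a "long" block count silently truncate.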
Re: nbtree's ScalarArrayOp array mark/restore code appears to be buggy
On Sat, Sep 23, 2023 at 11:47 AM Peter Geoghegan wrote:
> The fix for this should be fairly straightforward. We must teach
> _bt_restore_array_keys() to distinguish "past the end of the array"
> from "after the start of the array", so that it doesn't spuriously
> skip a required call to _bt_preprocess_keys(). I already see that the
> problem goes away once _bt_restore_array_keys() is made to call
> _bt_preprocess_keys() unconditionally, so I'm already fairly confident
> that this will work.

Attached draft patch shows how this could work.

_bt_restore_array_keys() has comments that seem to suppose that calling
_bt_preprocess_keys() is fairly expensive, and something that's well
worth avoiding. But... is it, really? I wonder if we'd be better off just
biting the bullet and always calling _bt_preprocess_keys() here. Could it
really be such a big cost, compared to pinning and locking the page in
the btrestrpos() path?

The current draft SAOP patch calls _bt_preprocess_keys() with a buffer
lock held. This is very likely unsafe, so I'll need to come up with a
principled approach (e.g. a stripped down, specialized version of
_bt_preprocess_keys that's safe to call while holding a lock seems
doable). I've been able to put that off for a few months now because it
just doesn't impact my microbenchmarks to any notable degree (and not
just in those cases where we can use the "numberOfKeys == 1" fast path at
the start of _bt_preprocess_keys). This at least suggests that the cost
of always calling _bt_preprocess_keys in _bt_restore_array_keys might be
acceptable.
--
Peter Geoghegan

From 781fd293aa7556f5ab029d2d5c2dfc829a4e8efd Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Sat, 23 Sep 2023 15:00:59 -0700
Subject: [PATCH v1] Fix btmarkpos/btrestrpos array keys edge case.

Oversight in bug fix commit 70bc5833, which taught nbtree to handle
array keys during mark/restore processing, but missed this subtlety.

Author: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-WzkgP3DDRJxw6DgjCxo-cu-DKrvjEv_ArkP2ctBJatDCYg@mail.gmail.com
Backpatch: 11- (all supported branches).
---
 src/include/access/nbtree.h          |  2 ++
 src/backend/access/nbtree/nbtree.c   |  1 +
 src/backend/access/nbtree/nbtutils.c | 18 ++++++++++++------
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index f5c66964c..b38279e7b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1043,6 +1043,8 @@ typedef struct BTScanOpaqueData
 
 	/* workspace for SK_SEARCHARRAY support */
 	ScanKey		arrayKeyData;	/* modified copy of scan->keyData */
+	bool		arraysStarted;	/* Started array keys, but have yet to
+								 * "reach past the end" of all arrays? */
 	int			numArrayKeys;	/* number of equality-type array keys (-1 if
 								 * there are any unsatisfiable array keys) */
 	int			arrayKeyCount;	/* count indicating number of array scan keys
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 62bc9917f..6c5b5c69c 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -364,6 +364,7 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 	so->keyData = NULL;
 
 	so->arrayKeyData = NULL;	/* assume no array keys for now */
+	so->arraysStarted = false;
 	so->numArrayKeys = 0;
 	so->arrayKeys = NULL;
 	so->arrayContext = NULL;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 7da499c4d..17e597f11 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -539,6 +539,8 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
 		curArrayKey->cur_elem = 0;
 		skey->sk_argument = curArrayKey->elem_values[curArrayKey->cur_elem];
 	}
+
+	so->arraysStarted = true;
 }
 
 /*
@@ -598,6 +600,14 @@ _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
 	if (scan->parallel_scan != NULL)
 		_bt_parallel_advance_array_keys(scan);
 
+	/*
+	 * When no new array keys were found, the scan is "past the end" of the
+	 * array keys.  _bt_start_array_keys can still "restart" the array keys
+	 * if a rescan is required.
+	 */
+	if (!found)
+		so->arraysStarted = false;
+
 	return found;
 }
 
@@ -648,11 +658,11 @@ _bt_restore_array_keys(IndexScanDesc scan)
 	}
 
 	/*
-	 * If we changed any keys, we must redo _bt_preprocess_keys. That might
-	 * sound like overkill, but in cases with multiple keys per index column
-	 * it seems necessary to do the full set of pushups.
+	 * If we changed any keys, or if we're not sure that so->keyData[]
+	 * contains the array keys indicated as current within so->arrayKeys[],
+	 * then we must redo _bt_preprocess_keys
 	 */
-	if (changed)
+	if (changed || !so->arraysStarted)
 	{
 		_bt_preprocess_keys(scan);
 		/* The mark should have been set on a consistent set of keys... */
-- 
2.40.1
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 4:30 PM Melanie Plageman wrote:
> On Mon, Aug 28, 2023 at 12:26 PM Robert Haas wrote:
> > In row D, your algorithms are all bad, really bad. I don't quite
> > understand how it can be that bad, actually.
>
> So, I realize now that this test was poorly designed. I meant it to be a
> worst case scenario, but I think one critical part was wrong. In this
> example one client is going at full speed inserting a row and then
> updating it. Then another rate-limited client is deleting old data
> periodically to keep the table at a constant size. I meant to bulk load
> the table with enough data that the delete job would have data to delete
> from the start. With the default autovacuum settings, over the course of
> 45 minutes, I usually saw around 40 autovacuums of the table. Due to the
> rate limiting, the first autovacuum of the table ends up freezing many
> pages that are deleted soon after. Thus the total number of page freezes
> is very high.
>
> I will redo benchmarking of workload D and start the table with the
> number of rows which the DELETE job seeks to maintain. My back of the
> envelope math says that this will mean ratios closer to a dozen (and not
> 5000).
...
> I'll rerun workload D in a more reasonable way and be back with results.

Hi,

I have edited and rerun all of the benchmarks for a subset of the
algorithms. I ran workloads A-I, except for G (G is not an interesting
workload), on Postgres master as well as Postgres with freeze heuristic
algorithms 4 and 5.

As a refresher, algorithms 4 and 5 are as follows. Freeze tuples on a
page opportunistically if the page would be totally frozen and:

4. Buffer is already dirty and no FPI is required OR page LSN is older
   than 33% of the LSNs since the last vacuum of the table.

5. Buffer is already dirty and no FPI is required AND page LSN is older
   than 33% of the LSNs since the last vacuum of the table.
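[Editor's note: a rough standalone C sketch of the two decision rules being
compared. The function names, parameters, and "threshold" plumbing are
invented for illustration; the actual patches wire this into vacuum's
lazy_scan_prune path, and "threshold" stands for the LSN 33% of the way
back through the LSN range since the table's last vacuum.]

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for PostgreSQL's XLogRecPtr */

/* Algorithm 4: freeze if the write is cheap OR the page looks cold */
static bool
freeze_alg4(bool all_frozen, bool dirty, bool needs_fpi,
			XLogRecPtr page_lsn, XLogRecPtr threshold)
{
	if (!all_frozen)
		return false;			/* only freeze fully freezable pages */
	return (dirty && !needs_fpi) || page_lsn < threshold;
}

/* Algorithm 5: freeze only when the write is cheap AND the page looks cold */
static bool
freeze_alg5(bool all_frozen, bool dirty, bool needs_fpi,
			XLogRecPtr page_lsn, XLogRecPtr threshold)
{
	if (!all_frozen)
		return false;
	return (dirty && !needs_fpi) && page_lsn < threshold;
}
```

The OR variant (4) freezes more aggressively: a dirty, FPI-free page is
frozen even if it was modified recently, which is exactly the behavior the
workload-D discussion above is probing.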
On master, the heuristic is to freeze a page opportunistically if it
would be totally frozen and if pruning emitted an FPI.

I made a few configuration changes for all benchmarks -- most notably
autovacuum is on for all workloads and autovacuum_naptime is decreased to
10 seconds. The runtime of all time-based workloads was changed to 40
minutes. All workloads were run for a fixed duration except for the COPY
workload (workload F). It is also worth noting that data loaded into the
standard pgbench tables (with pgbench -i) was loaded with COPY and not
COPY FREEZE. Also, pgbench -i and the pgbench runs were passed the
--no-vacuum option.

The updated workloads are as follows:

A. Gaussian TPC-B like + select-only: pgbench scale 450 (DB < SB) with
   indexes on updated columns
   WL 1: 16 clients doing TPC-B like pgbench but with Gaussian access
         distribution (with parameter 6) for updated tables
   WL 2: 2 clients, rate-limited to 1000 TPS, doing select-only pgbench

B. TPC-B like: pgbench scale 450
   16 clients doing the built-in pgbench TPC-B like script

C. Shifting hot set
   16 clients inserting a single row then updating an indexed column in
   that row

D. Shifting hot set, delete old data
   WL 1: 16 clients inserting one row then updating an indexed column in
         that row
   WL 2: 1 client, rate-limited to 10 TPS, deleting data more than 1
         minute old

E. Shifting hot set, delete new data, access new data
   WL 1: 16 clients cycling through 2 single row inserts, an update of an
         indexed column in a single recently inserted row, and a delete
         of a single recently inserted row
   WL 2: 1 client, rate-limited to 0.2 TPS, selecting data from the last
         5 minutes

F. Many COPYs
   1 client, copying a total of 90 GB of data in 70 individual COPYs

G. N/A

H. Append only table
   16 clients, inserting a single row at a time

I. Work queue
   WL 1: 16 clients inserting a single row, updating a non-indexed column
         in that row twice, then deleting that row
   WL 2: 1 client, rate-limited to 0.02 TPS, COPYing data into another
         table

Below are some of the data I collected, including how much WAL was
generated, how much IO various backends did, TPS, and P99 latency. These
were collected at the end of the pgbench run. Most numbers were rounded
to the nearest whole number to make the chart easier to see. So, for
example, a 0 in WAL GB does not mean that no WAL was emitted. P99 latency
below 0 was rounded to a single decimal place. All IO time is in
milliseconds.

"P99 lat" was calculated using the "time" field from pgbench's
"per-transaction logging" option [1]. It is the "transaction's elapsed
time" or duration. I converted it to milliseconds.

"pages frozen" here refers to the number of pages marked frozen in the
visibility map at the end of the run. "page freezes" refers to the number
of times vacuum froze tuples in lazy_scan_prune(). This does not include
cases in which the page was found to be all frozen but no tuples were
explicitly frozen. "AVs" is the number of autovacuums for the table or
tables specified.
Re: bug fix and documentation improvement about vacuumdb
On Fri, Sep 22, 2023 at 02:58:20PM +0200, Daniel Gustafsson wrote:
> I had a look at this and tweaked the testcase a bit to make the diff
> smaller, as well as removed the (in some cases) superfluous space in the
> generated SQL query mentioned upthread. The attached two patches are what
> I propose we commit to fix this, with a backpatch to v16 where it was
> introduced.

LGTM
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: nbtree's ScalarArrayOp array mark/restore code appears to be buggy
On Fri, Sep 22, 2023 at 8:17 PM Peter Geoghegan wrote:
> My suspicion is that bugfix commit 70bc5833 missed some subtlety
> around what we need to do to make sure that the array keys stay "in
> sync" with the scan. I'll have time to debug the problem some more
> tomorrow.

I've figured out what's going on here.

If I make my test case "group by" both of the indexed columns from the
composite index (either index/table will do, since it's an equijoin), a
more detailed picture emerges that hints at the underlying problem:

┌───────┬─────────┬─────────┐
│ count │ small_a │ small_b │
├───────┼─────────┼─────────┤
│ 8,192 │       1 │       2 │
│ 8,192 │       1 │       3 │
│ 8,192 │       1 │       5 │
│ 8,192 │       1 │      10 │
│ 8,192 │       1 │      12 │
│ 8,192 │       1 │      17 │
│ 2,872 │       1 │      19 │
└───────┴─────────┴─────────┘
(7 rows)

The count for the final row is wrong. It should be 8,192, just like the
earlier counts for lower (small_a, small_b) groups. Notably, the issue is
limited to the grouping that has the highest sort order. That strongly
hints that the problem has something to do with "array wraparound". The
query qual contains "WHERE small_a IN (1, 3)", so we'll "wrap around"
from cur_elem index 1 (value 3) to cur_elem index 0 (value 1), without
encountering any rows where small_a is 3 (because there aren't any in the
index).

That in itself isn't the problem. The problem is that
_bt_restore_array_keys() doesn't consider wraparound. It sees that
"cur_elem == mark_elem" for all array scan keys, and figures that it
doesn't need to call _bt_preprocess_keys(). This is incorrect, since the
current set of search-type scan keys (the set most recently output,
during the last _bt_preprocess_keys() call) still have the value "3".

The fix for this should be fairly straightforward. We must teach
_bt_restore_array_keys() to distinguish "past the end of the array" from
"after the start of the array", so that it doesn't spuriously skip a
required call to _bt_preprocess_keys(). I already see that the problem
goes away once _bt_restore_array_keys() is made to call
_bt_preprocess_keys() unconditionally, so I'm already fairly confident
that this will work.
--
Peter Geoghegan
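[Editor's note: the failure mode can be modeled with a small standalone C
sketch. The struct and function names below are simplified stand-ins, not
actual nbtree code.]

```c
#include <stdbool.h>

/*
 * Toy model of the bug: advancing a ScalarArrayOp array key can wrap the
 * current element back to the start, so cur_elem can equal mark_elem even
 * though the scan keys derived from the array are now stale.
 */
typedef struct ToyArrayKey
{
	int			num_elems;
	int			cur_elem;
	int			mark_elem;
} ToyArrayKey;

/* advance the array cursor; returns false when it wraps past the end */
static bool
toy_advance(ToyArrayKey *key)
{
	if (++key->cur_elem >= key->num_elems)
	{
		key->cur_elem = 0;		/* wraparound: back to the first element */
		return false;
	}
	return true;
}

/* the pre-fix restore logic: identical positions => skip preprocessing */
static bool
naive_restore_would_skip_preprocess(const ToyArrayKey *key)
{
	return key->cur_elem == key->mark_elem;
}
```

With a two-element array marked at element 0 and currently at element 1,
one advance wraps back to element 0: positions match the mark again, so
the naive check skips the rebuild even though a wraparound intervened.
Tracking a flag cleared on wraparound (like the patch's so->arraysStarted)
disambiguates the two cases.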
Re: Should rolpassword be toastable?
23.09.2023 17:39, Tom Lane wrote:
> I'm also now more than just slightly skeptical about whether pg_database
> should have a toast table. Has anybody tried, say, storing a daticurules
> field wide enough to end up out-of-line?

I tried, but failed, because pg_database is accessed in InitPostgres()
before assigning MyDatabaseId only via the function
GetDatabaseTupleByOid(), which doesn't unpack the database tuple. Another
access to a system catalog with unassigned MyDatabaseId might occur in
the has_privs_of_role() call, but pg_auth_members contains no toastable
attributes. So for now only pg_authid is worthy of condemnation, AFAICS.

Best regards,
Alexander
Re: Should rolpassword be toastable?
Alexander Lakhin writes:
> When playing with oversized tuples, I've found that it's possible to set
> such an oversized password for a user that it cannot be validated.
> For example:
> ...
> psql -U "test_user" -c "SELECT 1"
> psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed:
> FATAL:  cannot read pg_class without having selected a database

My inclination is to fix this by removing pg_authid's toast table. I was
not in favor of "let's attach a toast table to every catalog" to begin
with, and I think this failure graphically illustrates why that was not
as great an idea as some people thought. I also don't think it's worth
trying to make it really work.

I'm also now more than just slightly skeptical about whether pg_database
should have a toast table. Has anybody tried, say, storing a daticurules
field wide enough to end up out-of-line?

			regards, tom lane
Should rolpassword be toastable?
Hello hackers,

When playing with oversized tuples, I've found that it's possible to set
such an oversized password for a user that it cannot be validated. For
example:

SELECT format('CREATE ROLE test_user LOGIN PASSWORD ''SCRAM-SHA-256$' ||
  repeat('0', 200) ||
  '4096:NuDacwYSUxeOeFUEf3ivTQ==$Wgvq3OCYrJI6eUfvKlAzn4p/j3mzgWzXbVnWeFK1qhY=:r1qSP0j2QojCjLpFUjI0i6ckInvxJDKoyWnN3zF8WCM='';') \gexec
-- the password is "pass"

(One can achieve the same result with a large salt size, for example,
2048.)

psql -U "test_user" -c "SELECT 1"
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed:
FATAL:  cannot read pg_class without having selected a database

I've tried to set attstorage = 'p' for the rolpassword attribute
forcefully by dirty hacking genbki.pl, and as a result I get an error on
CREATE ROLE:
ERROR:  row is too big: size 2000256, maximum size 8160

Best regards,
Alexander
Avoid a possible out-of-bounds access (src/backend/optimizer/util/relnode.c)
Hi, Per Coverity. CID 1518088 (#2 of 2): Improper use of negative value (NEGATIVE_RETURNS) The function bms_singleton_member can returns a negative number. /* * Get a child rel for rel2 with the relids. See above comments. */ if (rel2_is_simple) { int varno = bms_singleton_member(child_relids2); child_rel2 = find_base_rel(root, varno); } It turns out that in the get_matching_part_pairs function (joinrels.c), the return of bms_singleton_member is passed to the find_base_rel function, which cannot receive a negative value. find_base_rel is protected by an Assertion, which effectively indicates that the error does not occur in tests and in DEBUG mode. But this does not change the fact that bms_singleton_member can return a negative value, which may occur on some production servers. Fix by changing the Assertion into a real test, to protect the simple_rel_array array. best regards, Ranier Vilela 0001-Avoid-possible-out-of-bounds-access.patch Description: Binary data