Re: making relfilenodes 56 bits

2023-01-05 Thread Dilip Kumar
On Wed, Jan 4, 2023 at 5:45 PM vignesh C  wrote:
>
> On Fri, 21 Oct 2022 at 11:31, Michael Paquier  wrote:
> >
> > On Thu, Sep 29, 2022 at 09:23:38PM -0400, Tom Lane wrote:
> > > Hmmm ... I'd tend to do SELECT COUNT(*) FROM.  But can't we provide
> > > any actual checks on the sanity of the output?  I realize that the
> > > output's far from static, but still ...
> >
> > Honestly, checking all the fields is not that exciting, but the
> > maximum I can think of that would be portable enough is something like
> > the attached.  No arithmetic operators for xid limits things a bit,
> > but at least that's something.
> >
> > Thoughts?
>
> The patch does not apply on top of HEAD as in [1]; please post a rebased
> patch:
>

Because of the extra WAL overhead, we are not continuing with this
patch; I will withdraw it.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2023-01-04 Thread vignesh C
On Fri, 21 Oct 2022 at 11:31, Michael Paquier  wrote:
>
> On Thu, Sep 29, 2022 at 09:23:38PM -0400, Tom Lane wrote:
> > Hmmm ... I'd tend to do SELECT COUNT(*) FROM.  But can't we provide
> > any actual checks on the sanity of the output?  I realize that the
> > output's far from static, but still ...
>
> Honestly, checking all the fields is not that exciting, but the
> maximum I can think of that would be portable enough is something like
> the attached.  No arithmetic operators for xid limits things a bit,
> but at least that's something.
>
> Thoughts?

The patch does not apply on top of HEAD as in [1]; please post a rebased patch:

=== Applying patches on top of PostgreSQL commit ID
33ab0a2a527e3af5beee3a98fc07201e555d6e45 ===
=== applying patch ./controldata-regression-2.patch
patching file src/test/regress/expected/misc_functions.out
Hunk #1 succeeded at 642 with fuzz 2 (offset 48 lines).
patching file src/test/regress/sql/misc_functions.sql
Hunk #1 FAILED at 223.
1 out of 1 hunk FAILED -- saving rejects to file
src/test/regress/sql/misc_functions.sql.rej

[1] - http://cfbot.cputube.org/patch_41_3711.log

Regards,
Vignesh




Re: making relfilenodes 56 bits

2022-10-21 Thread Michael Paquier
On Thu, Sep 29, 2022 at 09:23:38PM -0400, Tom Lane wrote:
> Hmmm ... I'd tend to do SELECT COUNT(*) FROM.  But can't we provide
> any actual checks on the sanity of the output?  I realize that the
> output's far from static, but still ...

Honestly, checking all the fields is not that exciting, but the
maximum I can think of that would be portable enough is something like
the attached.  No arithmetic operators for xid limits things a bit,
but at least that's something.

Thoughts?
--
Michael
diff --git a/src/test/regress/expected/misc_functions.out b/src/test/regress/expected/misc_functions.out
index 9f106c2a10..38987e2afc 100644
--- a/src/test/regress/expected/misc_functions.out
+++ b/src/test/regress/expected/misc_functions.out
@@ -594,3 +594,89 @@ SELECT * FROM tenk1 a JOIN my_gen_series(1,10) g ON a.unique1 = g;
  Index Cond: (unique1 = g.g)
 (4 rows)
 
+--
+-- Test functions for control data
+--
+\x
+SELECT checkpoint_lsn > '0/0'::pg_lsn AS checkpoint_lsn,
+redo_lsn > '0/0'::pg_lsn AS redo_lsn,
+redo_wal_file IS NOT NULL AS redo_wal_file,
+timeline_id > 0 AS timeline_id,
+prev_timeline_id > 0 AS prev_timeline_id,
+next_xid IS NOT NULL AS next_xid,
+next_oid > 0 AS next_oid,
+next_multixact_id != '0'::xid AS next_multixact_id,
+next_multi_offset IS NOT NULL AS next_multi_offset,
+oldest_xid != '0'::xid AS oldest_xid,
+oldest_xid_dbid > 0 AS oldest_xid_dbid,
+oldest_active_xid != '0'::xid AS oldest_active_xid,
+oldest_multi_xid != '0'::xid AS oldest_multi_xid,
+oldest_multi_dbid > 0 AS oldest_multi_dbid,
+oldest_commit_ts_xid IS NOT NULL AS oldest_commit_ts_xid,
+newest_commit_ts_xid IS NOT NULL AS newest_commit_ts_xid
+  FROM pg_control_checkpoint();
+-[ RECORD 1 ]+--
+checkpoint_lsn   | t
+redo_lsn | t
+redo_wal_file| t
+timeline_id  | t
+prev_timeline_id | t
+next_xid | t
+next_oid | t
+next_multixact_id| t
+next_multi_offset| t
+oldest_xid   | t
+oldest_xid_dbid  | t
+oldest_active_xid| t
+oldest_multi_xid | t
+oldest_multi_dbid| t
+oldest_commit_ts_xid | t
+newest_commit_ts_xid | t
+
+SELECT max_data_alignment > 0 AS max_data_alignment,
+database_block_size > 0 AS database_block_size,
+blocks_per_segment > 0 AS blocks_per_segment,
+wal_block_size > 0 AS wal_block_size,
+max_identifier_length > 0 AS max_identifier_length,
+max_index_columns > 0 AS max_index_columns,
+max_toast_chunk_size > 0 AS max_toast_chunk_size,
+large_object_chunk_size > 0 AS large_object_chunk_size,
+float8_pass_by_value IS NOT NULL AS float8_pass_by_value,
+data_page_checksum_version >= 0 AS data_page_checksum_version
+  FROM pg_control_init();
+-[ RECORD 1 ]--+--
+max_data_alignment | t
+database_block_size| t
+blocks_per_segment | t
+wal_block_size | t
+max_identifier_length  | t
+max_index_columns  | t
+max_toast_chunk_size   | t
+large_object_chunk_size| t
+float8_pass_by_value   | t
+data_page_checksum_version | t
+
+SELECT min_recovery_end_lsn >= '0/0'::pg_lsn AS min_recovery_end_lsn,
+min_recovery_end_timeline >= 0 AS min_recovery_end_timeline,
+backup_start_lsn >= '0/0'::pg_lsn AS backup_start_lsn,
+backup_end_lsn >= '0/0'::pg_lsn AS backup_end_lsn,
+end_of_backup_record_required IS NOT NULL AS end_of_backup_record_required
+  FROM pg_control_recovery();
+-[ RECORD 1 ]-+--
+min_recovery_end_lsn  | t
+min_recovery_end_timeline | t
+backup_start_lsn  | t
+backup_end_lsn| t
+end_of_backup_record_required | t
+
+SELECT pg_control_version > 0 AS pg_control_version,
+catalog_version_no > 0 AS catalog_version_no,
+system_identifier >= 0 AS system_identifier,
+pg_control_last_modified <= now() AS pg_control_last_modified
+  FROM pg_control_system();
+-[ RECORD 1 ]+--
+pg_control_version   | t
+catalog_version_no   | t
+system_identifier| t
+pg_control_last_modified | t
+
diff --git a/src/test/regress/sql/misc_functions.sql b/src/test/regress/sql/misc_functions.sql
index 639e9b352c..986e07c3a5 100644
--- a/src/test/regress/sql/misc_functions.sql
+++ b/src/test/regress/sql/misc_functions.sql
@@ -223,3 +223,47 @@ SELECT * FROM tenk1 a JOIN my_gen_series(1,1000) g ON a.unique1 = g;
 
 EXPLAIN (COSTS OFF)
 SELECT * FROM tenk1 a JOIN my_gen_series(1,10) g ON a.unique1 = g;
+
+--
+-- Test functions for control data
+--
+\x
+SELECT checkpoint_lsn > '0/0'::pg_lsn AS checkpoint_lsn,
+redo_lsn > '0/0'::pg_lsn AS redo_lsn,
+redo_wal_file IS NOT NULL AS redo_wal_file,
+timeline_id > 0 AS timeline_id,
+prev_timeline_id > 0 AS prev_timeline_id,
+next_xid IS NOT NULL AS next_xid,
+next_oid > 0 AS next_oid,
+next_multixact_id != '0'::xid AS next_multixact_id,
+next_multi_offset IS NOT NULL AS next_multi_offset,
+oldest_xid != 

Re: problems with making relfilenodes 56-bits

2022-10-20 Thread Dilip Kumar
On Thu, Oct 20, 2022 at 12:51 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-10-17 17:14:21 -0400, Robert Haas wrote:
> > I have to admit that I worried about the same thing that Matthias
> > raises, more or less. But I don't know whether I'm right to be
> > worried. A variable-length representation of any kind is essentially a
> > gamble that values requiring fewer bytes will be more common than
> > values requiring more bytes, and by enough to justify the overhead
> > that the method has. And, you want it to be more common for each
> > individual user, not just overall. For example, more people are going
> > to have small relations than large ones, but nobody wants performance
> > to drop off a cliff when the relation passes a certain size threshold.
> > Now, it wouldn't drop off a cliff here, but what about someone with a
> > really big, append-only relation? Won't they just end up writing more
> > to WAL than with the present system?
>
> Perhaps. But I suspect it'd be a very small increase because they'd be using
> bulk-insert paths in all likelihood anyway, if they managed to get to a very
> large relation. And even in that case, if we e.g. were to make the record size
> variable length, they'd still pretty much never reach that and it'd be an
> overall win.

I think the number of cases where we will reduce the WAL size will be
far greater than the cases where it will slightly increase.  Also, the
number of bytes we save in the winning cases is far larger than the
number of bytes we add in the losing ones.  So IMHO it seems like an
overall win, at least from the WAL size reduction point of view.  Do we
have any numbers on how much overhead the encoding/decoding adds?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: problems with making relfilenodes 56-bits

2022-10-19 Thread Andres Freund
Hi,

On 2022-10-17 17:14:21 -0400, Robert Haas wrote:
> I have to admit that I worried about the same thing that Matthias
> raises, more or less. But I don't know whether I'm right to be
> worried. A variable-length representation of any kind is essentially a
> gamble that values requiring fewer bytes will be more common than
> values requiring more bytes, and by enough to justify the overhead
> that the method has. And, you want it to be more common for each
> individual user, not just overall. For example, more people are going
> to have small relations than large ones, but nobody wants performance
> to drop off a cliff when the relation passes a certain size threshold.
> Now, it wouldn't drop off a cliff here, but what about someone with a
> really big, append-only relation? Won't they just end up writing more
> to WAL than with the present system?

Perhaps. But I suspect it'd be a very small increase because they'd be using
bulk-insert paths in all likelihood anyway, if they managed to get to a very
large relation. And even in that case, if we e.g. were to make the record size
variable length, they'd still pretty much never reach that and it'd be an
overall win.

The number of people with relations that large - leaving partitioning aside,
which'd still benefit as each relation is smaller - strikes me as a very small
percentage. And as you say, it's not like there's a cliff where everything
starts to be horrible.


> Maybe not. They might still have some writes to relations other than
> the very large, append-only relation, and then they could still win.
> Also, if we assume that the overhead of the variable-length
> representation is never more than 1 byte beyond what is needed to
> represent the underlying quantity in the minimal number of bytes, they
> are only going to lose if their relation is already more than half the
> maximum theoretical size, and if that is the case, they are in danger
> of hitting the size limit anyway. You can argue that there's still a
> risk here, but it doesn't seem like that bad of a risk.

Another thing here is that I suspect we ought to raise the relation size limit
beyond 4-byte block numbers * blocksize at some point - and then we'll have to
use variable encodings... Admittedly the amount of work needed to get there is
substantial.

Somewhat relatedly, I think we, very slowly, should move towards wider OIDs as
well. Not having to deal with oid wraparound will be a significant win
(particularly for toast), but to keep the overhead reasonable, we're going to
need variable encodings.


> But the same thing is not so obvious for, let's say, database OIDs.
> What if you just have one or a few databases, but due to the previous
> history of the cluster, their OIDs just happen to be big? Then you're
> just behind where you would have been without the patch. Granted, if
> this happens to you, you will be in the minority, because most users
> are likely to have small database OIDs, but the fact that other people
> are writing less WAL on average isn't going to make you happy about
> writing more WAL on average. And even for a user for which that
> doesn't happen, it's not at all unlikely that the gains they see will
> be less than what we see on a freshly-initdb'd database.

I agree that going for variable width encodings on the basis of the database
oid field alone would be an unconvincing proposition. But variably encoding
database oids when we already variably encode other fields seems like a decent
bet. If you e.g. think of the 56-bit relfilenode field itself - obviously what
I was thinking about in the first place - it's going to be a win much more
often.

To really lose you'd not just have to have a large database oid, but also a
large tablespace and relation oid and a huge block number...


> So I don't really know what the answer is here. I don't think this
> technique sucks, but I don't think it's necessarily a categorical win
> for every case, either. And it even seems hard to reason about which
> cases are likely to be wins and which cases are likely to be losses.

True. I'm far less concerned than you or Matthias about increasing the size in
rare cases as long as it wins in the majority of cases. But that doesn't mean
every case is easy to consider.


> > > 0002 - Rework XLogRecord
> > > This makes many fields in the xlog header optional, reducing the size
> > > of many xlog records by several bytes. This implements the design I
> > > shared in my earlier message [1].
> > >
> > > 0003 - Rework XLogRecordBlockHeader.
> > > This patch could be applied on current head, and saves some bytes in
> > > per-block data. It potentially saves some bytes per registered
> > > block/buffer in the WAL record (max 2 bytes for the first block, after
> > > that up to 3). See the patch's commit message in the patch for
> > > detailed information.
> >
> > The amount of complexity these two introduce seems quite substantial to
> > me. Both from a maintenance and a runtime perspective. I think we'd be 

Re: problems with making relfilenodes 56-bits

2022-10-17 Thread Robert Haas
On Wed, Oct 12, 2022 at 5:13 PM Andres Freund  wrote:
> > I think a significant part of this improvement comes from the premise
> > of starting with a fresh database. tablespace OID will indeed most
> > likely be low, but database OID may very well be linearly distributed
> > if concurrent workloads in the cluster include updating (potentially
> > unlogged) TOASTed columns and the databases are not created in one
> > "big bang" but over the lifetime of the cluster. In that case DBOID
> > will consume 5B for a significant fraction of databases (anything with
> > OID >=2^28).
> >
> > My point being: I don't think that we should have different WAL
> > performance in databases which is dependent on which OID was assigned
> > to that database.
>
> To me this is raising the bar to an absurd level. Some minor space usage
> increase after oid wraparound and for very large block numbers isn't a huge
> issue - if you're in that situation you already have a huge amount of wal.

I have to admit that I worried about the same thing that Matthias
raises, more or less. But I don't know whether I'm right to be
worried. A variable-length representation of any kind is essentially a
gamble that values requiring fewer bytes will be more common than
values requiring more bytes, and by enough to justify the overhead
that the method has. And, you want it to be more common for each
individual user, not just overall. For example, more people are going
to have small relations than large ones, but nobody wants performance
to drop off a cliff when the relation passes a certain size threshold.
Now, it wouldn't drop off a cliff here, but what about someone with a
really big, append-only relation? Won't they just end up writing more
to WAL than with the present system?

Maybe not. They might still have some writes to relations other than
the very large, append-only relation, and then they could still win.
Also, if we assume that the overhead of the variable-length
representation is never more than 1 byte beyond what is needed to
represent the underlying quantity in the minimal number of bytes, they
are only going to lose if their relation is already more than half the
maximum theoretical size, and if that is the case, they are in danger
of hitting the size limit anyway. You can argue that there's still a
risk here, but it doesn't seem like that bad of a risk.

But the same thing is not so obvious for, let's say, database OIDs.
What if you just have one or a few databases, but due to the previous
history of the cluster, their OIDs just happen to be big? Then you're
just behind where you would have been without the patch. Granted, if
this happens to you, you will be in the minority, because most users
are likely to have small database OIDs, but the fact that other people
are writing less WAL on average isn't going to make you happy about
writing more WAL on average. And even for a user for which that
doesn't happen, it's not at all unlikely that the gains they see will
be less than what we see on a freshly-initdb'd database.

So I don't really know what the answer is here. I don't think this
technique sucks, but I don't think it's necessarily a categorical win
for every case, either. And it even seems hard to reason about which
cases are likely to be wins and which cases are likely to be losses.

> > 0002 - Rework XLogRecord
> > This makes many fields in the xlog header optional, reducing the size
> > of many xlog records by several bytes. This implements the design I
> > shared in my earlier message [1].
> >
> > 0003 - Rework XLogRecordBlockHeader.
> > This patch could be applied on current head, and saves some bytes in
> > per-block data. It potentially saves some bytes per registered
> > block/buffer in the WAL record (max 2 bytes for the first block, after
> > that up to 3). See the patch's commit message in the patch for
> > detailed information.
>
> The amount of complexity these two introduce seems quite substantial to
> me. Both from a maintenance and a runtime perspective. I think we'd be better
> off using building blocks like variable lengths encoded values than open
> coding it in many places.

I agree that this looks pretty ornate as written, but I think there
might be some good ideas in here, too. It is also easy to reason about
this kind of thing at least in terms of space consumption. It's a bit
harder to know how things will play out in terms of CPU cycles and
code complexity.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: problems with making relfilenodes 56-bits

2022-10-13 Thread Matthias van de Meent
On Wed, 12 Oct 2022 at 23:13, Andres Freund  wrote:
>
> Hi,
>
> On 2022-10-12 22:05:30 +0200, Matthias van de Meent wrote:
> > On Wed, 5 Oct 2022 at 01:50, Andres Freund  wrote:
> > > I lightly dusted off my old varint implementation from [1] and converted the
> > > RelFileLocator and BlockNumber from fixed width integers to varint ones. 
> > > This
> > > isn't meant as a serious patch, but an experiment to see if this is a path
> > > worth pursuing.
> > >
> > > A run of installcheck in a cluster with autovacuum=off, 
> > > full_page_writes=off
> > > (for increased reproducibility) shows a decent saving:
> > >
> > > master: 241106544 - 230 MB
> > > varint: 227858640 - 217 MB
> >
> I think a significant part of this improvement comes from the premise
> > of starting with a fresh database. tablespace OID will indeed most
> > likely be low, but database OID may very well be linearly distributed
> > if concurrent workloads in the cluster include updating (potentially
> > unlogged) TOASTed columns and the databases are not created in one
> > "big bang" but over the lifetime of the cluster. In that case DBOID
> > will consume 5B for a significant fraction of databases (anything with
> > OID >=2^28).
> >
> > My point being: I don't think that we should have different WAL
> > performance in databases which is dependent on which OID was assigned
> > to that database.
>
> To me this is raising the bar to an absurd level. Some minor space usage
> increase after oid wraparound and for very large block numbers isn't a huge
> issue - if you're in that situation you already have a huge amount of wal.

I didn't want to block all varlen encoding, I just want to make clear
that I don't think it's great for performance testing and consistency
across installations if WAL size (and thus part of your performance)
is dependent on which actual database/relation/tablespace combination
you're running your workload in.

With the 56-bit relfilenode, the size of a block reference would
realistically differ between 7 bytes and 23 bytes:

- tblspc=0=1B
  db=16386=3B
  rel=797=2B (787 = 4 * default # of data relations in a fresh DB in PG14, + 1)
  block=0=1B

vs

- tsp>=2^28 = 5B
  dat>=2^28 =5B
  rel>=2^49 =8B
  block>=2^28 =5B

That's a difference of 16 bytes, of which only the block number can
realistically be directly influenced by the user ("just don't have
relations larger than X blocks").

If applied to Dilip's pgbench transaction data, that would imply a
minimum per transaction wal usage of 509 bytes, and a maximum per
transaction wal usage of 609 bytes. That is nearly a 20% difference in
WAL size based only on the location of your data, and I'm just not
comfortable with that. Users have little or zero control over the
internal IDs we assign to these fields, while it would affect
performance fairly significantly.

(difference % between min/max wal size is unchanged (within 0.1%)
after accounting for record alignment)
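
A quick way to reproduce the 7-byte vs. 23-byte block-reference widths above is
to compute per-field byte counts under the unary-length varint scheme from
Andres's experimental patch elsewhere in the thread, where an n-byte encoding
carries 7*n data bits and a 9-byte form carries a full 64-bit value. The sketch
below is only an illustration under that assumption; varint_width() is a
made-up helper, not anything from the posted patches.

#include <stdint.h>
#include <stdio.h>

/* Byte width of a value under the assumed unary-length varint scheme:
 * ceil(bits / 7), capped at 9 because the 9-byte form carries all 64 bits. */
static int
varint_width(uint64_t v)
{
    int bits = 64 - __builtin_clzll(v | 1);   /* treat 0 as needing 1 bit */
    int width = (bits + 6) / 7;

    return width > 9 ? 9 : width;
}

int
main(void)
{
    /* "fresh cluster" case above: tblspc=0, db=16386, rel=797, block=0 */
    int small = varint_width(0) + varint_width(16386) +
                varint_width(797) + varint_width(0);

    /* worst case above: fields at or beyond the quoted thresholds */
    int large = varint_width(UINT64_C(1) << 28) + varint_width(UINT64_C(1) << 28) +
                varint_width(UINT64_C(1) << 49) + varint_width(UINT64_C(1) << 28);

    printf("%d vs %d bytes\n", small, large);   /* prints "7 vs 23 bytes" */
    return 0;
}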

> > 0002 - Rework XLogRecord
> > This makes many fields in the xlog header optional, reducing the size
> > of many xlog records by several bytes. This implements the design I
> > shared in my earlier message [1].
> >
> > 0003 - Rework XLogRecordBlockHeader.
> > This patch could be applied on current head, and saves some bytes in
> > per-block data. It potentially saves some bytes per registered
> > block/buffer in the WAL record (max 2 bytes for the first block, after
> > that up to 3). See the patch's commit message in the patch for
> > detailed information.
>
> The amount of complexity these two introduce seems quite substantial to
> me. Both from a maintenance and a runtime perspective. I think we'd be better
> off using building blocks like variable lengths encoded values than open
> coding it in many places.

I guess that's true for length fields, but I don't think dynamic
header field presence (the 0002 rewrite, and the omission of
data_length in 0003) is that bad. We already have dynamic data
inclusion through block ids 25x; I'm not sure why we couldn't do that
more compactly with bitfields as indicators instead (hence the dynamic
header size).

As for complexity, I think my current patchset is mostly complex due
to a lack of tooling. Note that decoding makes common use of
COPY_HEADER_FIELD, which we don't really have an equivalent for in
XLogRecordAssemble. I think the code for 0002 would improve
significantly in readability if such construct would be available.

To reduce complexity in 0003, I could drop the 'repeat id'
optimization, as that reduces the complexity significantly, at the
cost of not saving that 1 byte per registered block after the first.

Kind regards,

Matthias van de Meent




Re: problems with making relfilenodes 56-bits

2022-10-12 Thread Andres Freund
Hi,

On 2022-10-12 22:05:30 +0200, Matthias van de Meent wrote:
> On Wed, 5 Oct 2022 at 01:50, Andres Freund  wrote:
> > On 2022-10-03 10:01:25 -0700, Andres Freund wrote:
> > > On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
> > > > On Fri, Sep 30, 2022 at 8:20 PM Andres Freund  
> > > > wrote:
> > > > I thought about trying to buy back some space elsewhere, and I think
> > > > that would be a reasonable approach to getting this committed if we
> > > > could find a way to do it. However, I don't see a terribly obvious way
> > > > of making it happen.
> > >
> > > I think there's plenty potential...
> >
> > I lightly dusted off my old varint implementation from [1] and converted the
> > RelFileLocator and BlockNumber from fixed width integers to varint ones. 
> > This
> > isn't meant as a serious patch, but an experiment to see if this is a path
> > worth pursuing.
> >
> > A run of installcheck in a cluster with autovacuum=off, full_page_writes=off
> > (for increased reproducibility) shows a decent saving:
> >
> > master: 241106544 - 230 MB
> > varint: 227858640 - 217 MB
> 
> I think a significant part of this improvement comes from the premise
> of starting with a fresh database. tablespace OID will indeed most
> likely be low, but database OID may very well be linearly distributed
> if concurrent workloads in the cluster include updating (potentially
> unlogged) TOASTed columns and the databases are not created in one
> "big bang" but over the lifetime of the cluster. In that case DBOID
> will consume 5B for a significant fraction of databases (anything with
> OID >=2^28).
> 
> My point being: I don't think that we should have different WAL
> performance in databases which is dependent on which OID was assigned
> to that database.

To me this is raising the bar to an absurd level. Some minor space usage
increase after oid wraparound and for very large block numbers isn't a huge
issue - if you're in that situation you already have a huge amount of wal.


> 0002 - Rework XLogRecord
> This makes many fields in the xlog header optional, reducing the size
> of many xlog records by several bytes. This implements the design I
> shared in my earlier message [1].
> 
> 0003 - Rework XLogRecordBlockHeader.
> This patch could be applied on current head, and saves some bytes in
> per-block data. It potentially saves some bytes per registered
> block/buffer in the WAL record (max 2 bytes for the first block, after
> that up to 3). See the patch's commit message in the patch for
> detailed information.

The amount of complexity these two introduce seems quite substantial to
me. Both from a maintenance and a runtime perspective. I think we'd be better
off using building blocks like variable lengths encoded values than open
coding it in many places.

Greetings,

Andres Freund




Re: problems with making relfilenodes 56-bits

2022-10-11 Thread Dilip Kumar
On Mon, Oct 10, 2022 at 5:40 PM Robert Haas  wrote:
>
> On Mon, Oct 10, 2022 at 5:16 AM Dilip Kumar  wrote:
> > I have also executed my original test after applying these patches on
> > top of the 56 bit relfilenode patch.  So earlier we saw the WAL size
> > increased by 11% (66199.09375 kB to 73906.984375 kB) and after this
> > patch now the WAL generated is 58179.2265625.  That means in this
> > particular example this patch is reducing the WAL size by 12% even
> > with the 56 bit relfilenode patch.
>
> That's a very promising result, but the question in my mind is how
> much work would be required to bring this patch to a committable
> state?

Right, the results are promising.  I have done some more testing with
make installcheck WAL size (fpw=off) and I have seen a similar gain
with this patch.

1. Head: 272 MB
2. 56 bit RelfileLocator: 285 MB
3. 56 bit RelfileLocator + this patch: 261 MB

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: problems with making relfilenodes 56-bits

2022-10-10 Thread Andres Freund
Hi,

On 2022-10-10 08:10:22 -0400, Robert Haas wrote:
> On Mon, Oct 10, 2022 at 5:16 AM Dilip Kumar  wrote:
> > I have also executed my original test after applying these patches on
> > top of the 56 bit relfilenode patch.  So earlier we saw the WAL size
> > increased by 11% (66199.09375 kB to 73906.984375 kB) and after this
> > patch now the WAL generated is 58179.2265625.  That means in this
> > particular example this patch is reducing the WAL size by 12% even
> > with the 56 bit relfilenode patch.
> 
> That's a very promising result, but the question in my mind is how
> much work would be required to bring this patch to a committable
> state?

The biggest part clearly is to review the variable width integer patch. It's
not a large amount of code, but probably more complicated than average.

One complication there is that currently the patch assumes:
 * Note that this function, for efficiency, reads 8 bytes, even if the
 * variable integer is less than 8 bytes long. The buffer has to be
 * allocated sufficiently large to account for that fact. The maximum
 * amount of memory read is 9 bytes.

We could make a less efficient version without that assumption, but I think it
might be easier to just guarantee it in the xlog*.c case.
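
For illustration, a bounds-checked, byte-at-a-time decoder for that format
could look like the sketch below. It is an assumption-laden rewrite, not the
patch's pg_varint_decode_uint64(), which (as the quoted comment says) always
loads 8 bytes for speed; the point is just that a slower variant without the
buffer-size requirement is straightforward.

#include <stddef.h>
#include <stdint.h>

/*
 * Sketch: decode the unary-length varint format byte by byte.  The number of
 * leading zero bits in the first byte gives the number of additional bytes;
 * a first byte of zero means the 9-byte form carrying a full 64-bit value.
 * Returns the number of bytes consumed, or 0 if the buffer is too short.
 */
static size_t
sketch_decode_uint64(const uint8_t *buf, size_t buflen, uint64_t *out)
{
    size_t      len;
    uint64_t    val = 0;

    if (buflen < 1)
        return 0;

    if (buf[0] == 0)
        len = 9;
    else
        len = 1 + (__builtin_clz((unsigned int) buf[0]) - 24);  /* clz within the byte */

    if (buflen < len)
        return 0;

    for (size_t i = (len == 9) ? 1 : 0; i < len; i++)
        val = (val << 8) | buf[i];

    if (len < 9)
        val &= (UINT64_C(1) << (7 * len)) - 1;  /* strip the separator bit */

    *out = val;
    return len;
}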


Using it in xloginsert.c is pretty darn simple, code-wise. xlogreader is a bit
harder, although not for intrinsic reasons - the logic underlying
COPY_HEADER_FIELD seems unnecessarily complicated to me. The minimal solution
would likely be to just wrap the varint reads in another weird macro.


Leaving the code issues themselves aside, one important thing would be to
evaluate what the performance impacts of the varint encoding/decoding are as
part of "full" server. I suspect it'll vanish in the noise, but we'd need to
validate that.

Greetings,

Andres Freund




Re: problems with making relfilenodes 56-bits

2022-10-10 Thread Robert Haas
On Mon, Oct 10, 2022 at 5:16 AM Dilip Kumar  wrote:
> I have also executed my original test after applying these patches on
> top of the 56 bit relfilenode patch.  So earlier we saw the WAL size
> increased by 11% (66199.09375 kB to 73906.984375 kB) and after this
> patch now the WAL generated is 58179.2265625.  That means in this
> particular example this patch is reducing the WAL size by 12% even
> with the 56 bit relfilenode patch.

That's a very promising result, but the question in my mind is how
much work would be required to bring this patch to a committable
state?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: problems with making relfilenodes 56-bits

2022-10-10 Thread Dilip Kumar
On Wed, Oct 5, 2022 at 5:19 AM Andres Freund  wrote:

>
> I lightly dusted off my old varint implementation from [1] and converted the
> RelFileLocator and BlockNumber from fixed width integers to varint ones. This
> isn't meant as a serious patch, but an experiment to see if this is a path
> worth pursuing.
>
> A run of installcheck in a cluster with autovacuum=off, full_page_writes=off
> (for increased reproducibility) shows a decent saving:
>
> master: 241106544 - 230 MB
> varint: 227858640 - 217 MB
>
> The average record size goes from 102.7 to 95.7 bytes excluding the remaining
> FPIs, 118.1 to 111.0 including FPIs.
>

I have also executed my original test after applying these patches on
top of the 56 bit relfilenode patch.  So earlier we saw the WAL size
increased by 11% (66199.09375 kB to 73906.984375 kB) and after this
patch now the WAL generated is 58179.2265625.  That means in this
particular example this patch is reducing the WAL size by 12% even
with the 56 bit relfilenode patch.

[1] 
https://www.postgresql.org/message-id/CAFiTN-uut%2B04AdwvBY_oK_jLvMkwXUpDJj5mXg--nek%2BucApPQ%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: problems with making relfilenodes 56-bits

2022-10-04 Thread Andres Freund
Hi,

On 2022-10-03 10:01:25 -0700, Andres Freund wrote:
> On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
> > On Fri, Sep 30, 2022 at 8:20 PM Andres Freund  wrote:
> > I thought about trying to buy back some space elsewhere, and I think
> > that would be a reasonable approach to getting this committed if we
> > could find a way to do it. However, I don't see a terribly obvious way
> > of making it happen.
>
> I think there's plenty potential...

I lightly dusted off my old varint implementation from [1] and converted the
RelFileLocator and BlockNumber from fixed width integers to varint ones. This
isn't meant as a serious patch, but an experiment to see if this is a path
worth pursuing.

A run of installcheck in a cluster with autovacuum=off, full_page_writes=off
(for increased reproducibility) shows a decent saving:

master: 241106544 - 230 MB
varint: 227858640 - 217 MB

The average record size goes from 102.7 to 95.7 bytes excluding the remaining
FPIs, 118.1 to 111.0 including FPIs.

There's plenty other spots that could be converted (e.g. the record length
which rarely needs four bytes), this is just meant as a demonstration.


I used pg_waldump --stats for that range of WAL to measure the CPU overhead. A
profile does show pg_varint_decode_uint64(), but partially that seems to be
offset by the reduced amount of bytes to CRC. Maybe a ~2% overhead remains.

That would be tolerable, I think, because waldump --stats pretty much doesn't
do anything with the WAL.

But I suspect there's plenty of optimization potential in the varint
code. Right now it e.g. stores data as big endian, and the bswap instructions
do show up. And a lot of my bit-maskery could be optimized too.

Greetings,

Andres Freund
From 68248e7aa220586cbf075d3cb6eb56a6461ae81f Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 4 Oct 2022 16:04:16 -0700
Subject: [PATCH v1 1/2] Add a varint implementation

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20191210015054.5otdfuftxrqb5...@alap3.anarazel.de
Backpatch:
---
 src/include/common/varint.h |  92 +++
 src/common/Makefile |   1 +
 src/common/meson.build  |   1 +
 src/common/varint.c | 228 
 4 files changed, 322 insertions(+)
 create mode 100644 src/include/common/varint.h
 create mode 100644 src/common/varint.c

diff --git a/src/include/common/varint.h b/src/include/common/varint.h
new file mode 100644
index 00000000000..43d9ade7808
--- /dev/null
+++ b/src/include/common/varint.h
@@ -0,0 +1,92 @@
+#ifndef PG_VARINT_H
+#define PG_VARINT_H
+
+/*
+ * Variable length integer encoding.
+ *
+ * Represent the length of the varint, in bytes - 1, as the number of
+ * "leading" 0 bits in the first byte. I.e. the length is stored in
+ * unary.  If less than 9 bytes of data are needed, the leading 0 bits
+ * are followed by a 1 bit, followed by the data. When encoding 64bit
+ * integers, for 8 bytes of input data, there is however no high 1 bit
+ * in the second output byte, as that would be unnecessary (and
+ * increase the size to 10 bytes of output bytes). If, hypothetically,
+ * larger numbers were supported, that'd not be the case, obviously.
+ *
+ * The reason to encode the number of bytes as zero bits, rather than
+ * ones, is that more architectures have instructions for efficiently
+ * finding the first bit set, than finding the first bit unset.
+ *
+ * In contrast to encoding the length as e.g. the high-bit in each following
+ * bytes, this allows for faster decoding (and encoding, although the benefit
+ * is smaller), because no length related branches are needed after processing
+ * the first byte and no 7 bit/byte -> 8 bit/byte conversion is needed. The
+ * mask / shift overhead is similar.
+ *
+ *
+ * I.e. the encoding of an unsigned 64bit integer is as follows
+ * [0 {#bytes - 1}][1 as separator][leading 0 bits][data]
+ *
+ * E.g.
+ * - 7 (input 0b0000'0111) is encoded as 0b1000'0111
+ * - 145 (input 0b1001'0001) is 0b0100'0000''1001'0001 (as the separator does
+ *   not fit into the one byte of the fixed width encoding)
+ * - 4141 (input 0b0001'0000''0010'1101) is 0b0101'0000''0010'1101 (the
+ *   leading 0, followed by a 1 indicating that two bytes are needed)
+ *
+ *
+ * Naively encoding negative two's complement integers this way would
+ * lead to such numbers always being stored as the maximum varint
+ * length, due to the leading sign bits.  Instead the sign bit is
+ * encoded in the least significant bit. To avoid different limits
+ * than native 64bit integers - in particular around the lowest
+ * possible value - the data portion is bit flipped instead of
+ * negated.
+ *
+ * The reason for the sign bit to be trailing is that that makes it
+ * trivial to reuse code of the unsigned case, whereas a leading bit
+ * would make that harder.
+ *
+ * XXX: One problem with storing negative numbers this way is that it
+ * makes varints not compare lexicographically. Is it worth the code
+ * duplication to allow 
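
To make the rules in the comment above concrete, here is a standalone sketch of
the unsigned encoding it describes. This is an illustration only, not code from
the patch; the patch's own encoder is presumably laid out and optimized quite
differently.

#include <stdint.h>

/*
 * Sketch of the unsigned encoding described above: the output, read as a
 * big-endian number, is the value with a single 1 bit prepended as separator;
 * the zero bits above that separator in the first byte give the length in
 * unary.  Returns the number of bytes written (1..9); buf must have room
 * for 9 bytes.
 */
static int
sketch_encode_uint64(uint64_t value, uint8_t *buf)
{
    int bits = 64 - __builtin_clzll(value | 1);
    int len = (bits + 6) / 7;

    if (len >= 9)
    {
        /* 9-byte form: a zero first byte, then the 8 raw data bytes; as the
         * comment above notes, this form has no separator bit. */
        buf[0] = 0;
        for (int i = 0; i < 8; i++)
            buf[1 + i] = (uint8_t) (value >> (8 * (7 - i)));
        return 9;
    }

    value |= UINT64_C(1) << (7 * len);              /* prepend separator bit */
    for (int i = 0; i < len; i++)
        buf[i] = (uint8_t) (value >> (8 * (len - 1 - i)));  /* big-endian */
    return len;
}

/* e.g. 7 -> 0x87, 145 -> 0x40 0x91, 4141 -> 0x50 0x2d, matching the worked
 * examples in the comment above */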

Re: problems with making relfilenodes 56-bits

2022-10-04 Thread Robert Haas
On Tue, Oct 4, 2022 at 2:30 PM Andres Freund  wrote:
> Consider the following sequence:
>
> 1) we write WAL like this:
>
> [record A][tear boundary][record B, prev A_lsn][tear boundary][record C, prev 
> B_lsn]
>
> 2) crash, the sectors with A and C made it to disk, the one with B didn't
>
> 3) We replay A, discover B is invalid (possibly zeroes or old contents),
>insert a new record B' with the same length. Now it looks like this:
>
> [record A][tear boundary][record B', prev A_lsn][tear boundary][record C, 
> prev B_lsn]
>
> 4) crash, the sector with B' makes it to disk
>
> 5) we replay A, B', C, because C has an xl_prev that's compatible with B'
>location and a valid CRC.
>
> Oops.
>
> I think this can happen both within a single page and across page boundaries.
>
> I hope I am missing something here?

If you are, I don't know what it is off-hand. That seems like a
plausible scenario to me. It does require the OS to write things out
of order, and I don't know how likely that is in practice, but the
answer probably isn't zero.

> I think there might be reasonable ways to increase the guarantees based on the
> 2 byte xl_prev approach "alone". We don't have to store the offset from the
> page header as a plain offset. What about storing something like:
>   page_offset ^ (page_lsn >> wal_segsz_shift)
>
> I think something like that'd result in prev_not_really_lsn typically not
> simply matching after recycling. Of course it only provides so much
> protection, given 16bits...

Maybe. That does seem somewhat better, but I feel like it's hard to
reason about whether it's safe in absolute terms or just resistant to
the precise scenario Matthias postulated while remaining vulnerable to
slightly modified versions.
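
For concreteness, the masking quoted above could be sketched as below. The
function and field names are hypothetical; XLogRecPtr and wal_segsz_shift are
used as in PostgreSQL, but everything else is an assumption rather than
anything from a posted patch.

#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* as in PostgreSQL */

/*
 * Sketch of the idea quoted above: instead of storing the raw 2-byte offset
 * of the previous record from the page header, mix in bits of the page LSN
 * that change when a segment is recycled, so stale content on a recycled
 * page is less likely to look valid.
 */
static uint16_t
encode_prev(uint16_t page_offset, XLogRecPtr page_lsn, int wal_segsz_shift)
{
    return page_offset ^ (uint16_t) (page_lsn >> wal_segsz_shift);
}

static uint16_t
decode_prev(uint16_t stored, XLogRecPtr page_lsn, int wal_segsz_shift)
{
    /* XOR is its own inverse, so the same mask recovers the offset */
    return stored ^ (uint16_t) (page_lsn >> wal_segsz_shift);
}

As the quoted text notes, with only 16 bits this is probabilistic protection at
best, not a guarantee.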

How about this: remove xl_prev. widen xl_crc to 64 bits. include the
CRC of the previous WAL record in the xl_crc calculation. That doesn't
cut quite as many bytes out of the record size as your proposal, but
it feels like it should strongly resist pretty much every attack of
this general type, with only the minor disadvantage that the more
expensive CRC calculation will destroy all hope of getting anything
committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: problems with making relfilenodes 56-bits

2022-10-04 Thread Andres Freund
Hi,

On 2022-10-04 13:36:33 -0400, Robert Haas wrote:
> On Tue, Oct 4, 2022 at 11:34 AM Andres Freund  wrote:
> > > Example: Page { [ record A ] | tear boundary | [ record B ] } gets
> > > recycled and receives a new record C at the place of A with the same
> > > length.
> > >
> > > With your proposal, record B would still be a valid record when it
> > > follows C; as the page-local serial number/offset reference to the
> > > previous record would still match after the torn write.
> > > With the current situation and a full LSN in xl_prev, the mismatching
> > > value in the xl_prev pointer allows us to detect this torn page write
> > > and halt replay, before redoing an old (incorrect) record.
> >
> > In this concrete scenario the 8 byte xl_prev doesn't provide *any* 
> > protection?
> > As you specified it, C has the same length as A, so B's xl_prev will be the
> > same whether it's a page local offset or the full 8 bytes.
> >
> > The relevant protection against issues like this isn't xl_prev, it's the
> > CRC. We could improve the CRC by using the "full width" LSN for xl_prev 
> > rather
> > than the offset.
>
> I'm really confused. xl_prev *is* a full-width LSN currently, as I
> understand it. So in the scenario that Matthias poses, let's say the
> segment was previously 000000010000000400000025 and now it's
> 000000010000000400000049. So if a given chunk of the page is leftover
> from when the page was 000000010000000400000025, it will have xl_prev
> values like 4/25xx. If it's been rewritten since the segment was
> recycled, it will have xl_prev values like 4/49xx. So, we can tell
> whether record B has been overwritten with a new record since the
> segment was recycled. But if we stored only 2 bytes in each xl_prev
> field, that would no longer be possible.

Oh, I think I misunderstood the scenario. I was thinking of cases where we
write out a bunch of pages, crash, only some of the pages made it to disk, we
then write new ones of the same length, and now find a record after the "real"
end of the WAL to be valid.  Not sure how I mentally swallowed the "recycled".

For the recycling scenario to be a problem we'll also need to crash, with
parts of the page ending up with the new contents and parts of the page ending
up with the old "pre recycling" content, correct? Because without a crash
we'll have zeroed out the remainder of the page (well, leaving walreceiver out
of the picture, grr).

However, this can easily happen without any record boundaries on the partially
recycled page, so we rely on the CRCs to protect against this.


Here I originally wrote a more in-depth explanation of the scenario I was
thinking about, where we already rely on CRCs to protect us. But, ooph, I think
they don't do so reliably with today's design. But maybe I'm missing more things
today. Consider the following sequence:

1) we write WAL like this:

[record A][tear boundary][record B, prev A_lsn][tear boundary][record C, prev 
B_lsn]

2) crash, the sectors with A and C made it to disk, the one with B didn't

3) We replay A, discover B is invalid (possibly zeroes or old contents),
   insert a new record B' with the same length. Now it looks like this:

[record A][tear boundary][record B', prev A_lsn][tear boundary][record C, prev 
B_lsn]

4) crash, the sector with B' makes it to disk

5) we replay A, B', C, because C has an xl_prev that's compatible with B'
   location and a valid CRC.

Oops.

I think this can happen both within a single page and across page boundaries.

I hope I am missing something here?


> A way to up the chances of detecting this case would be to store only
> 2 or 4 bytes of xl_prev on disk, but arrange to include the full
> xl_prev value in the xl_crc calculation.

Right, that's what I was suggesting as well.


> Then your chances of a collision are about 2^-32, or maybe more if you posit
> that CRC is a weak and crappy algorithm, but even then it's strictly better
> than just hoping that there isn't a tear point at a record boundary where
> the same length record precedes the tear in both the old and new WAL
> segments. However, on the flip side, even if you assume that CRC is a
> fantastic algorithm with beautiful and state-of-the-art bit mixing, the
> chances of it failing to notice the problem are still >0, whereas the
> current algorithm that compares the full xl_prev value is a sure
> thing. Because xl_prev values are never repeated, it's certain that when a
> segment is recycled, any values that were legal for the old one aren't legal
> in the new one.

Given that we already rely on the CRCs to detect corruption within a single
record spanning tear boundaries, this doesn't cause me a lot of heartburn. But
I suspect we might need to do something about the scenario I outlined above,
which likely would also increase the protection against this issue.

I think there might be reasonable ways to increase the guarantees based on the
2 byte xl_prev approach "alone". We don't have to store the offset 

Re: problems with making relfilenodes 56-bits

2022-10-04 Thread Robert Haas
On Tue, Oct 4, 2022 at 11:34 AM Andres Freund  wrote:
> > Example: Page { [ record A ] | tear boundary | [ record B ] } gets
> > recycled and receives a new record C at the place of A with the same
> > length.
> >
> > With your proposal, record B would still be a valid record when it
> > follows C; as the page-local serial number/offset reference to the
> > previous record would still match after the torn write.
> > With the current situation and a full LSN in xl_prev, the mismatching
> > value in the xl_prev pointer allows us to detect this torn page write
> > and halt replay, before redoing an old (incorrect) record.
>
> In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection?
> As you specified it, C has the same length as A, so B's xl_prev will be the
> same whether it's a page local offset or the full 8 bytes.
>
> The relevant protection against issues like this isn't xl_prev, it's the
> CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather
> than the offset.

I'm really confused. xl_prev *is* a full-width LSN currently, as I
understand it. So in the scenario that Matthias poses, let's say the
segment was previously 000000010000000400000025 and now it's
000000010000000400000049. So if a given chunk of the page is leftover
from when the page was 000000010000000400000025, it will have xl_prev
values like 4/25xx. If it's been rewritten since the segment was
recycled, it will have xl_prev values like 4/49xx. So, we can tell
whether record B has been overwritten with a new record since the
segment was recycled. But if we stored only 2 bytes in each xl_prev
field, that would no longer be possible.

So I'm lost. It seems like Matthias has correctly identified a real
hazard, and not some weird corner case but actually something that
will happen regularly. All you need is for the old segment that got
recycled to have a record starting at the same place where the page
tore, and for the previous record to have been the same length as the
one on the new page. Given that there's only <~1024 places on a page
where a record can start, and given that in many workloads the lengths
of WAL records will be fairly uniform, this doesn't seem unlikely at
all.

A way to up the chances of detecting this case would be to store only
2 or 4 bytes of xl_prev on disk, but arrange to include the full
xl_prev value in the xl_crc calculation. Then your chances of a
collision are about 2^-32, or maybe more if you posit that CRC is a
weak and crappy algorithm, but even then it's strictly better than
just hoping that there isn't a tear point at a record boundary where
the same length record precedes the tear in both the old and new WAL
segments. However, on the flip side, even if you assume that CRC is a
fantastic algorithm with beautiful and state-of-the-art bit mixing,
the chances of it failing to notice the problem are still >0, whereas
the current algorithm that compares the full xl_prev value is a sure
thing. Because xl_prev values are never repeated, it's certain that
when a segment is recycled, any values that were legal for the old one
aren't legal in the new one.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: problems with making relfilenodes 56-bits

2022-10-04 Thread Andres Freund
Hi,

On 2022-10-04 15:05:47 +0200, Matthias van de Meent wrote:
> On Mon, 3 Oct 2022 at 23:26, Andres Freund  wrote:
> > On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote:
> > > On Mon, 3 Oct 2022, 19:01 Andres Freund,  wrote:
> > > > Random idea: xl_prev is large. Store a full xl_prev in the page header, 
> > > > but
> > > > only store a 2 byte offset from the page header xl_prev within each 
> > > > record.
> > >
> > > With that small xl_prev we may not detect partial page writes in
> > > recycled segments; or other issues in the underlying file system. With
> > > small record sizes, the chance of returning incorrect data would be
> > > significant for small records (it would be approximately the chance of
> > > getting a record boundary on the underlying page boundary * chance of
> > > getting the same MAXALIGN-adjusted size record before the persistence
> > > boundary). That issue is part of the reason why my proposed change
> > > upthread still contains the full xl_prev.
> >
> > What exactly is the theory for this significant increase? I don't think
> > xl_prev provides a meaningful protection against torn pages in the first
> > place?
> 
> XLog pages don't have checksums, so they do not provide torn page
> protection capabilities on their own.
> A singular xlog record is protected against torn page writes through
> the checksum that covers the whole record - if only part of the record
> was written, we can detect that through the mismatching checksum.
> However, if records end at the tear boundary, we must know for certain
> that any record that starts after the tear is the record that was
> written after the one before the tear. Page-local references/offsets
> would not work, because the record decoding doesn't know which xlog
> page the record should be located on; it could be both the version of
> the page before it was recycled, or the one after.
> Currently, we can detect this because the value of xl_prev will point
> to a record far in the past (i.e. not the expected value), but with a
> page-local version of xl_prev we would be less likely to detect torn
> pages (and thus be unable to handle this without risk of corruption)
> due to the significant chance of the truncated xl_prev value being the
> same in both the old and new record.

Think this is addressable, see below.


> Example: Page { [ record A ] | tear boundary | [ record B ] } gets
> recycled and receives a new record C at the place of A with the same
> length.
> 
> With your proposal, record B would still be a valid record when it
> follows C; as the page-local serial number/offset reference to the
> previous record would still match after the torn write.
> With the current situation and a full LSN in xl_prev, the mismatching
> value in the xl_prev pointer allows us to detect this torn page write
> and halt replay, before redoing an old (incorrect) record.

In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection?
As you specified it, C has the same length as A, so B's xl_prev will be the
same whether it's a page local offset or the full 8 bytes.


The relevant protection against issues like this isn't xl_prev, it's the
CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather
than the offset.


> PS. there are ideas floating around (I heard about this one from
> Heikki) where we could concatenate WAL records into one combined
> record that has only one shared xl_prev+crc; which would save these 12
> bytes per record. However, that needs a lot of careful consideration
> to make sure that the persistence guarantee of operations doesn't get
> lost somewhere in the traffic.

One version of that is to move the CRCs to the page header, make the pages
smaller (512 bytes / 4K, depending on the hardware), and to pad out partial
pages when flushing them out. Rewriting pages is bad for hardware and prevents
having multiple WAL IOs in flight at the same time.

Greetings,

Andres Freund




Re: problems with making relfilenodes 56-bits

2022-10-04 Thread Matthias van de Meent
On Mon, 3 Oct 2022 at 23:26, Andres Freund  wrote:
>
> Hi,
>
> On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote:
> > On Mon, 3 Oct 2022, 19:01 Andres Freund,  wrote:
> > > Random idea: xl_prev is large. Store a full xl_prev in the page header, 
> > > but
> > > only store a 2 byte offset from the page header xl_prev within each 
> > > record.
> >
> > With that small xl_prev we may not detect partial page writes in
> > recycled segments; or other issues in the underlying file system. With
> > small record sizes, the chance of returning incorrect data would be
> > significant for small records (it would be approximately the chance of
> > getting a record boundary on the underlying page boundary * chance of
> > getting the same MAXALIGN-adjusted size record before the persistence
> > boundary). That issue is part of the reason why my proposed change
> > upthread still contains the full xl_prev.
>
> What exactly is the theory for this significant increase? I don't think
> xl_prev provides a meaningful protection against torn pages in the first
> place?

XLog pages don't have checksums, so they do not provide torn page
protection capabilities on their own.
A singular xlog record is protected against torn page writes through
the checksum that covers the whole record - if only part of the record
was written, we can detect that through the mismatching checksum.
However, if records end at the tear boundary, we must know for certain
that any record that starts after the tear is the record that was
written after the one before the tear. Page-local references/offsets
would not work, because the record decoding doesn't know which xlog
page the record should be located on; it could be both the version of
the page before it was recycled, or the one after.
Currently, we can detect this because the value of xl_prev will point
to a record far in the past (i.e. not the expected value), but with a
page-local version of xl_prev we would be less likely to detect torn
pages (and thus be unable to handle this without risk of corruption)
due to the significant chance of the truncated xl_prev value being the
same in both the old and new record.

Example: Page { [ record A ] | tear boundary | [ record B ] } gets
recycled and receives a new record C at the place of A with the same
length.

With your proposal, record B would still be a valid record when it
follows C; as the page-local serial number/offset reference to the
previous record would still match after the torn write.
With the current situation and a full LSN in xl_prev, the mismatching
value in the xl_prev pointer allows us to detect this torn page write
and halt replay, before redoing an old (incorrect) record.

Kind regards,

Matthias van de Meent

PS. there are ideas floating around (I heard about this one from
Heikki) where we could concatenate WAL records into one combined
record that has only one shared xl_prev+crc; which would save these 12
bytes per record. However, that needs a lot of careful consideration
to make sure that the persistence guarantee of operations doesn't get
lost somewhere in the traffic.




Re: problems with making relfilenodes 56-bits

2022-10-03 Thread Andres Freund
Hi,

On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote:
> On Mon, 3 Oct 2022, 19:01 Andres Freund,  wrote:
> > Random idea: xl_prev is large. Store a full xl_prev in the page header, but
> > only store a 2 byte offset from the page header xl_prev within each record.
> 
> With that small xl_prev we may not detect partial page writes in
> recycled segments; or other issues in the underlying file system. With
> small record sizes, the chance of returning incorrect data would be
> significant for small records (it would be approximately the chance of
> getting a record boundary on the underlying page boundary * chance of
> getting the same MAXALIGN-adjusted size record before the persistence
> boundary). That issue is part of the reason why my proposed change
> upthread still contains the full xl_prev.

What exactly is the theory for this significant increase? I don't think
xl_prev provides a meaningful protection against torn pages in the first
place?

Greetings,

Andres Freund




Re: problems with making relfilenodes 56-bits

2022-10-03 Thread Matthias van de Meent
On Mon, 3 Oct 2022, 19:01 Andres Freund,  wrote:
>
> Hi,
>
> On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
> > On Fri, Sep 30, 2022 at 8:20 PM Andres Freund  wrote:
> > > I think it'd be interesting to look at per-record-type stats between two
> > > equivalent workload, to see where practical workloads suffer the most
> > > (possibly with fpw=off, to make things more repeatable).
> >
> > I would expect, and Dilip's results seem to confirm, the effect to be
> > pretty uniform: basically, nearly every record gets bigger by 4 bytes.
> > That's because most records contain at least one block reference, and
> > if they contain multiple block references, likely all but one will be
> > marked BKPBLOCK_SAME_REL, so we pay the cost just once.
>
> But it doesn't really matter that much if an already large record gets a bit
> bigger. Whereas it does matter if it's a small record. Focussing on optimizing
> the record types where the increase is large seems like a potential way
> forward to me, even if we can't find something generic.
>
>
> > I thought about trying to buy back some space elsewhere, and I think
> > that would be a reasonable approach to getting this committed if we
> > could find a way to do it. However, I don't see a terribly obvious way
> > of making it happen.
>
> I think there's plenty potential...
>
>
> > Trying to do it by optimizing specific WAL record
> > types seems like a real pain in the neck, because there's tons of
> > different WAL records that all have the same problem.
>
> I am not so sure about that. Improving a bunch of the most frequent small
> records might buy you back enough on just about every workload to be OK.
>
> I put the top record sizes for an installcheck run with full_page_writes off
> at the bottom. Certainly our regression tests aren't generally
> representative. But I think it still decently highlights how just improving a
> few records could buy you back more than enough.
>
>
> > Trying to do it in a generic way makes more sense, and the fact that we have
> > 2 padding bytes available in XLogRecord seems like a place to start looking,
> > but the way forward from there is not clear to me.
>
> Random idea: xl_prev is large. Store a full xl_prev in the page header, but
> only store a 2 byte offset from the page header xl_prev within each record.

With that small xl_prev we may not detect partial page writes in
recycled segments; or other issues in the underlying file system. With
small record sizes, the chance of returning incorrect data would be
significant for small records (it would be approximately the chance of
getting a record boundary on the underlying page boundary * chance of
getting the same MAXALIGN-adjusted size record before the persistence
boundary). That issue is part of the reason why my proposed change
upthread still contains the full xl_prev.

A different idea is removing most block_ids from the record, and
optionally reducing per-block length fields to 1B. Used block ids are
effectively always sequential, and we only allow 33+4 valid values, so
we can use 2 bits to distinguish between 'blocks belonging to this ID
field have at most 255B of data registered' and 'blocks up to this ID
follow sequentially without their own block ID'. That would save 2N-1 total
bytes for N blocks. It is scraping the barrel, but I think it is quite
possible.

Lastly, we could add XLR_BLOCK_ID_DATA_MED for values >255 containing
up to UINT16_MAX lengths. That would save 2 bytes for records that
only just pass the 255B barrier, where 2B is still a fairly
significant part of the record size.
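
To make that concrete, here is a rough sketch of what the decoding side
could look like.  XLR_BLOCK_ID_DATA_MED, its value and the helper name are
made up for illustration; the existing format only defines the SHORT
(1-byte length) and LONG (4-byte length) variants, with ids 252..255
already taken.

#include "postgres.h"
#include "access/xlogrecord.h"

/*
 * Hypothetical decoding of a medium-sized main-data chunk: one block-id byte
 * followed by a 2-byte length, saving 2 bytes versus XLR_BLOCK_ID_DATA_LONG
 * for payloads of 256..65535 bytes.
 */
#define XLR_BLOCK_ID_DATA_MED	251

static void
decode_main_data_length(uint8 block_id, char **ptr, uint32 *main_data_len)
{
	if (block_id == XLR_BLOCK_ID_DATA_SHORT)
	{
		uint8		len8;

		memcpy(&len8, *ptr, sizeof(uint8));
		*ptr += sizeof(uint8);
		*main_data_len = len8;
	}
	else if (block_id == XLR_BLOCK_ID_DATA_MED)
	{
		uint16		len16;

		memcpy(&len16, *ptr, sizeof(uint16));
		*ptr += sizeof(uint16);
		*main_data_len = len16;
	}
	else						/* XLR_BLOCK_ID_DATA_LONG */
	{
		memcpy(main_data_len, *ptr, sizeof(uint32));
		*ptr += sizeof(uint32);
	}
}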

Kind regards,

Matthias van de Meent




Re: problems with making relfilenodes 56-bits

2022-10-03 Thread Andres Freund
Hi,

On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
> On Fri, Sep 30, 2022 at 8:20 PM Andres Freund  wrote:
> > I think it'd be interesting to look at per-record-type stats between two
> > equivalent workload, to see where practical workloads suffer the most
> > (possibly with fpw=off, to make things more repeatable).
>
> I would expect, and Dilip's results seem to confirm, the effect to be
> pretty uniform: basically, nearly every record gets bigger by 4 bytes.
> That's because most records contain at least one block reference, and
> if they contain multiple block references, likely all but one will be
> marked BKPBLOCK_SAME_REL, so we pay the cost just once.

But it doesn't really matter that much if an already large record gets a bit
bigger. Whereas it does matter if it's a small record. Focussing on optimizing
the record types where the increase is large seems like a potential way
forward to me, even if we can't find something generic.


> I thought about trying to buy back some space elsewhere, and I think
> that would be a reasonable approach to getting this committed if we
> could find a way to do it. However, I don't see a terribly obvious way
> of making it happen.

I think there's plenty potential...


> Trying to do it by optimizing specific WAL record
> types seems like a real pain in the neck, because there's tons of
> different WAL records that all have the same problem.

I am not so sure about that. Improving a bunch of the most frequent small
records might buy you back enough on just about every workload to be OK.

I put the top record sizes for an installcheck run with full_page_writes off
at the bottom. Certainly our regression tests aren't generally
representative. But I think it still decently highlights how just improving a
few records could buy you back more than enough.


> Trying to do it in a generic way makes more sense, and the fact that we have
> 2 padding bytes available in XLogRecord seems like a place to start looking,
> but the way forward from there is not clear to me.

Random idea: xl_prev is large. Store a full xl_prev in the page header, but
only store a 2 byte offset from the page header xl_prev within each record.

Greetings,

Andres Freund

by total size:

Type                        N      (%)    Record size      (%)    FPI size      (%)    Combined size      (%)
----                        -      ---    -----------      ---    --------      ---    -------------      ---
Heap/INSERT           1041666 ( 50.48)      106565255 ( 50.54)           0 (  0.00)        106565255 ( 43.92)
Btree/INSERT_LEAF      352196 ( 17.07)       24067672 ( 11.41)           0 (  0.00)         24067672 (  9.92)
Heap/DELETE            250852 ( 12.16)       13546008 (  6.42)           0 (  0.00)         13546008 (  5.58)
Hash/INSERT            108499 (  5.26)        7811928 (  3.70)           0 (  0.00)          7811928 (  3.22)
Transaction/COMMIT      16053 (  0.78)        6402657 (  3.04)           0 (  0.00)          6402657 (  2.64)
Gist/PAGE_UPDATE        57225 (  2.77)        5217100 (  2.47)           0 (  0.00)          5217100 (  2.15)
Gin/UPDATE_META_PAGE    23943 (  1.16)        4539970 (  2.15)           0 (  0.00)          4539970 (  1.87)
Gin/INSERT              27004 (  1.31)        3623998 (  1.72)           0 (  0.00)          3623998 (  1.49)
Gist/PAGE_SPLIT           448 (  0.02)        3391244 (  1.61)           0 (  0.00)          3391244 (  1.40)
SPGist/ADD_LEAF         38968 (  1.89)        3341696 (  1.58)           0 (  0.00)          3341696 (  1.38)
...
XLOG/FPI                 7228 (  0.35)         378924 (  0.18)    29788166 ( 93.67)         30167090 ( 12.43)
...
Gin/SPLIT                 141 (  0.01)          13011 (  0.01)     1187588 (  3.73)          1200599 (  0.49)
...
--------             --------  -------     ----------  -------    --------  -------       -----------  ------
Total                 2063609              210848282 [86.89%]     31802766 [13.11%]        242651048 [100%]

(Included XLOG/FPI and Gin/SPLIT to explain why there are FPIs despite running
with fpw=off)

sorted by number of records:
Heap/INSERT           1041666 ( 50.48)      106565255 ( 50.54)           0 (  0.00)        106565255 ( 43.92)
Btree/INSERT_LEAF      352196 ( 17.07)       24067672 ( 11.41)

Re: problems with making relfilenodes 56-bits

2022-10-03 Thread Robert Haas
On Fri, Sep 30, 2022 at 8:20 PM Andres Freund  wrote:
> I think it'd be interesting to look at per-record-type stats between two
> equivalent workload, to see where practical workloads suffer the most
> (possibly with fpw=off, to make things more repeatable).

I would expect, and Dilip's results seem to confirm, the effect to be
pretty uniform: basically, nearly every record gets bigger by 4 bytes.
That's because most records contain at least one block reference, and
if they contain multiple block references, likely all but one will be
marked BKPBLOCK_SAME_REL, so we pay the cost just once.

Because of alignment padding, the practical effect is probably that
about half of the records get bigger by 8 bytes and the other half
don't get bigger at all. But I see no reason to believe that things
are any better or worse than that. Most interesting record types are
going to contain some kind of variable-length payload, so the chances
that a 4 byte size increase pushes you across a MAXALIGN boundary seem
to be no better or worse than fifty-fifty.
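
That fifty-fifty intuition is easy to check with a couple of lines of
throwaway C (assuming the usual 8-byte MAXALIGN; this is just illustrative
arithmetic, not code from the patch):

#include <stdio.h>

#define MAXALIGN(x) (((x) + 7) & ~(unsigned long) 7)

int
main(void)
{
	int			grows = 0;
	unsigned long len;

	/* for each record length modulo 8, does adding 4 bytes cross a boundary? */
	for (len = 64; len < 72; len++)
		if (MAXALIGN(len + 4) > MAXALIGN(len))
			grows++;

	printf("%d of 8 residues pay a full 8-byte step\n", grows);	/* prints 4 */
	return 0;
}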

> I think it'd be an OK tradeoff to optimize WAL usage for a few of the worst to
> pay off for 56bit relfilenodes. The class of problems foreclosed is large
> enough to "waste" "improvement potential" on this.

I thought about trying to buy back some space elsewhere, and I think
that would be a reasonable approach to getting this committed if we
could find a way to do it. However, I don't see a terribly obvious way
of making it happen. Trying to do it by optimizing specific WAL record
types seems like a real pain in the neck, because there's tons of
different WAL records that all have the same problem. Trying to do it
in a generic way makes more sense, and the fact that we have 2 padding
bytes available in XLogRecord seems like a place to start looking, but
the way forward from there is not clear to me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: problems with making relfilenodes 56-bits

2022-10-01 Thread Dilip Kumar
On Sat, Oct 1, 2022 at 5:50 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-09-30 15:36:11 +0530, Dilip Kumar wrote:
> > I have done some testing around this area to see the impact on WAL
> > size especially when WAL sizes are smaller, with a very simple test
> > with insert/update/delete I can see around an 11% increase in WAL size
> > [1] then I did some more test with pgbench with smaller scale
> > factor(1) there I do not see a significant increase in the WAL size
> > although it increases WAL size around 1-2%. [2].
>
> I think it'd be interesting to look at per-record-type stats between two
> equivalent workload, to see where practical workloads suffer the most
> (possibly with fpw=off, to make things more repeatable).

While testing pgbench, I dumped the WAL sizes using pg_waldump.  So in
the pgbench case, most of the record sizes increased by 4 bytes as they
include a single block reference, and the same is true for the other
test case I sent.  Here is the WAL dump of what the sizes look like
for a single pgbench transaction [1].  Maybe for seeing these changes
with different workloads we could run some of the files from the
regression tests and compare the individual WAL sizes.

Head:
rmgr: Heaplen (rec/tot): 54/54, tx:867, lsn:
0/02DD1280, prev 0/02DD1250, desc: LOCK off 44: xid 867: flags 0x01
LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/5/16424 blk 226
rmgr: Heaplen (rec/tot):171/   171, tx:867, lsn:
0/02DD12B8, prev 0/02DD1280, desc: UPDATE off 44 xmax 867 flags 0x11 ;
new off 30 xmax 0, blkref #0: rel 1663/5/16424 blk 1639, blkref #1:
rel 1663/5/16424 blk 226
rmgr: Btree   len (rec/tot): 64/64, tx:867, lsn:
0/02DD1368, prev 0/02DD12B8, desc: INSERT_LEAF off 290, blkref #0: rel
1663/5/16432 blk 39
rmgr: Heaplen (rec/tot): 78/78, tx:867, lsn:
0/02DD13A8, prev 0/02DD1368, desc: HOT_UPDATE off 15 xmax 867 flags
0x10 ; new off 19 xmax 0, blkref #0: rel 1663/5/16427 blk 0
rmgr: Heaplen (rec/tot): 74/74, tx:867, lsn:
0/02DD13F8, prev 0/02DD13A8, desc: HOT_UPDATE off 9 xmax 867 flags
0x10 ; new off 10 xmax 0, blkref #0: rel 1663/5/16425 blk 0
rmgr: Heaplen (rec/tot): 79/79, tx:867, lsn:
0/02DD1448, prev 0/02DD13F8, desc: INSERT off 9 flags 0x08, blkref #0:
rel 1663/5/16434 blk 0
rmgr: Transaction len (rec/tot): 46/46, tx:867, lsn:
0/02DD1498, prev 0/02DD1448, desc: COMMIT 2022-10-01 11:24:03.464437
IST


Patch:
rmgr: Heaplen (rec/tot): 58/58, tx:818, lsn:
0/0218BEB0, prev 0/0218BE80, desc: LOCK off 34: xid 818: flags 0x01
LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/5/14 blk 522
rmgr: Heaplen (rec/tot):175/   175, tx:818, lsn:
0/0218BEF0, prev 0/0218BEB0, desc: UPDATE off 34 xmax 818 flags 0x11 ;
new off 8 xmax 0, blkref #0: rel 1663/5/14 blk 1645, blkref #1:
rel 1663/5/14 blk 522
rmgr: Btree   len (rec/tot): 68/68, tx:818, lsn:
0/0218BFA0, prev 0/0218BEF0, desc: INSERT_LEAF off 36, blkref #0: rel
1663/5/100010 blk 89
rmgr: Heaplen (rec/tot): 82/82, tx:818, lsn:
0/0218BFE8, prev 0/0218BFA0, desc: HOT_UPDATE off 66 xmax 818 flags
0x10 ; new off 90 xmax 0, blkref #0: rel 1663/5/17 blk 0
rmgr: Heaplen (rec/tot): 78/78, tx:818, lsn:
0/0218C058, prev 0/0218BFE8, desc: HOT_UPDATE off 80 xmax 818 flags
0x10 ; new off 81 xmax 0, blkref #0: rel 1663/5/15 blk 0
rmgr: Heaplen (rec/tot): 83/83, tx:818, lsn:
0/0218C0A8, prev 0/0218C058, desc: INSERT off 80 flags 0x08, blkref
#0: rel 1663/5/100011 blk 0
rmgr: Transaction len (rec/tot): 46/46, tx:818, lsn:
0/0218C100, prev 0/0218C0A8, desc: COMMIT 2022-10-01 11:11:03.564063
IST


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: problems with making relfilenodes 56-bits

2022-09-30 Thread Peter Geoghegan
On Fri, Sep 30, 2022 at 5:20 PM Andres Freund  wrote:
> I think it'd be an OK tradeoff to optimize WAL usage for a few of the worst to
> pay off for 56bit relfilenodes. The class of problems foreclosed is large
> enough to "waste" "improvement potential" on this.

I agree overall.

A closely related but distinct question occurs to me: if we're going
to be "wasting" space on alignment padding in certain cases one way or
another, can we at least recognize those cases and take advantage at
the level of individual WAL record formats? In other words: So far
we've been discussing the importance of not going over a critical
threshold for certain WAL records. But it might also be valuable to
consider recognizing that that's inevitable, and that we might as well
make the most of it by including one or two other things.

This seems most likely to matter when managing the problem of negative
compression with per-WAL-record compression schemes for things like
arrays of page offset numbers [1]. If (say) a given compression scheme
"wastes" space for arrays of only 1-3 items, but we already know that
the relevant space will all be lost to alignment needed by code one
level down in any case, does it really count as waste? We're likely
always going to have some kind of negative compression, but you do get
to influence where and when the negative compression happens.

Not sure how relevant this will turn out to be, but seems worth
considering. More generally, thinking about how things work across
multiple layers of abstraction seems like it could be valuable in
other ways.

[1] 
https://postgr.es/m/CAH2-WzmLCn2Hx9tQLdmdb+9CkHKLyWD2bsz=pmrebc4daxj...@mail.gmail.com
--
Peter Geoghegan




Re: problems with making relfilenodes 56-bits

2022-09-30 Thread Andres Freund
Hi,

On 2022-09-30 15:36:11 +0530, Dilip Kumar wrote:
> I have done some testing around this area to see the impact on WAL
> size especially when WAL sizes are smaller, with a very simple test
> with insert/update/delete I can see around an 11% increase in WAL size
> [1] then I did some more test with pgbench with smaller scale
> factor(1) there I do not see a significant increase in the WAL size
> although it increases WAL size around 1-2%. [2].

I think it'd be interesting to look at per-record-type stats between two
equivalent workload, to see where practical workloads suffer the most
(possibly with fpw=off, to make things more repeatable).

I think it'd be an OK tradeoff to optimize WAL usage for a few of the worst to
pay off for 56bit relfilenodes. The class of problems foreclosed is large
enough to "waste" "improvement potential" on this.

Greetings,

Andres Freund




Re: problems with making relfilenodes 56-bits

2022-09-30 Thread Dilip Kumar
On Thu, Sep 29, 2022 at 2:36 AM Robert Haas  wrote:

> 2. WAL Size. Block references in the WAL are by RelFileLocator, so if
> you make RelFileLocators bigger, WAL gets bigger. We'd have to test
> the exact impact of this, but it seems a bit scary

I have done some testing around this area to see the impact on WAL
size especially when WAL sizes are smaller, with a very simple test
with insert/update/delete I can see around an 11% increase in WAL size
[1] then I did some more test with pgbench with smaller scale
factor(1) there I do not see a significant increase in the WAL size
although it increases WAL size around 1-2%. [2].

[1]
checkpoint;
do $$
declare
 lsn1 pg_lsn;
 lsn2 pg_lsn;
 diff float;
begin
   select pg_current_wal_lsn() into lsn1;
   CREATE TABLE test(a int);
   for counter in 1..1000 loop
 INSERT INTO test values(1);
 UPDATE test set a=a+1;
 DELETE FROM test where a=1;
   end loop;
   DROP TABLE test;
   select pg_current_wal_lsn() into lsn2;
   select pg_wal_lsn_diff(lsn2, lsn1) into diff;
   raise notice '%', diff/1024;
end; $$;

wal generated head: 66199.09375 kB
wal generated patch: 73906.984375 kB
wal-size increase: 11%

[2]
./pgbench -i postgres
./pgbench -c1 -j1 -t 3 -M prepared postgres
wal generated head: 30780 kB
wal generated patch: 31284 kB
wal-size increase: ~1-2%

I have done further analysis to understand why, on the pgbench workload, the
WAL size increases by 1-2%.  With pg_waldump I could see that the WAL size
per transaction increased from 566 bytes (on head) to 590 bytes (with the
patch), which is around 4%; but when we look at the total WAL size
difference after 30k transactions it is just 1-2%, and I think that is
because there are other records, like FPIs, which are not impacted.

Conclusion:  So as suspected, with very small WAL records and a very
targeted test case we can see a significant 11% increase in WAL size,
but with a pgbench-like workload the increase in WAL size is much
smaller.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-09-29 Thread Tom Lane
Michael Paquier  writes:
> While passing by, I have noticed this thread.  We don't really care
> about the contents returned by these functions, and one simple trick
> to check their execution is SELECT FROM.  Like in the attached, for
> example.

Hmmm ... I'd tend to do SELECT COUNT(*) FROM.  But can't we provide
any actual checks on the sanity of the output?  I realize that the
output's far from static, but still ...

regards, tom lane




Re: making relfilenodes 56 bits

2022-09-29 Thread Michael Paquier
On Thu, Sep 29, 2022 at 02:39:44PM -0400, Tom Lane wrote:
> The assertions in TupleDescInitEntry would have caught that,
> if only utils/misc/pg_controldata.c had more than zero test coverage.
> Seems like somebody ought to do something about that.

While passing by, I have noticed this thread.  We don't really care
about the contents returned by these functions, and one simple trick
to check their execution is SELECT FROM.  Like in the attached, for
example.
--
Michael
diff --git a/src/test/regress/expected/misc_functions.out b/src/test/regress/expected/misc_functions.out
index 9f106c2a10..93cba8e76d 100644
--- a/src/test/regress/expected/misc_functions.out
+++ b/src/test/regress/expected/misc_functions.out
@@ -594,3 +594,22 @@ SELECT * FROM tenk1 a JOIN my_gen_series(1,10) g ON a.unique1 = g;
  Index Cond: (unique1 = g.g)
 (4 rows)
 
+--
+-- Test functions for control data
+--
+SELECT FROM pg_control_checkpoint();
+--
+(1 row)
+
+SELECT FROM pg_control_init();
+--
+(1 row)
+
+SELECT FROM pg_control_recovery();
+--
+(1 row)
+
+SELECT FROM pg_control_system();
+--
+(1 row)
+
diff --git a/src/test/regress/sql/misc_functions.sql b/src/test/regress/sql/misc_functions.sql
index 639e9b352c..207d5a5292 100644
--- a/src/test/regress/sql/misc_functions.sql
+++ b/src/test/regress/sql/misc_functions.sql
@@ -223,3 +223,11 @@ SELECT * FROM tenk1 a JOIN my_gen_series(1,1000) g ON a.unique1 = g;
 
 EXPLAIN (COSTS OFF)
 SELECT * FROM tenk1 a JOIN my_gen_series(1,10) g ON a.unique1 = g;
+
+--
+-- Test functions for control data
+--
+SELECT FROM pg_control_checkpoint();
+SELECT FROM pg_control_init();
+SELECT FROM pg_control_recovery();
+SELECT FROM pg_control_system();




Re: problems with making relfilenodes 56-bits

2022-09-29 Thread Robert Haas
On Thu, Sep 29, 2022 at 12:24 PM Matthias van de Meent
 wrote:
> Currently, our minimal WAL record is exactly 24 bytes: length (4B),
> TransactionId (4B), previous record pointer (8B), flags (1B), redo
> manager (1B), 2 bytes of padding and lastly the 4-byte CRC. Of these
> fields, TransactionID could reasonably be omitted for certain WAL
> records (as example: index insertions don't really need the XID).
> Additionally, the length field could be made to be variable length,
> and any padding is just plain bad (adding 4 bytes to all
> insert/update/delete/lock records was frowned upon).

Right. I was shocked when I realized that we had two bytes of padding
in there, considering that numerous rmgrs are stealing bits from the
1-byte field that identifies the record type. My question was: why
aren't we exposing those 2 bytes for rmgr-type-specific use? Or for
something like xl_xact_commit, we could get rid of xl_xact_info if we
had those 2 bytes to work with.

Right now, I see that a bare commit record is 34 bytes which rounds
out to 40. With the trick above, we could shave off 4 bytes bringing
the size to 30 which would round to 32. That's a pretty significant
savings, although it'd be a lot better if we could get some kind of
savings for DML records which could be much higher frequency.

> I'm working on a prototype patch for a more bare-bones WAL record
> header of which the only required fields would be prevptr (8B), CRC
> (4B), rmgr (1B) and flags (1B) for a minimal size of 14 bytes. I don't
> yet know the performance of this, but the considering that there will
> be a lot more conditionals in header decoding it might be slower for
> any one backend, but faster overall (less overall IOps)
>
> The flags field would be indications for additional information: [flag
> name (bits): explanation (additional xlog header data in bytes)]
> - len_size(0..1): xlog record size is at most xlrec_header_only (0B),
> uint8_max(1B), uint16_max(2B), uint32_max(4B)
> - has_xid (2): contains transaction ID of logging transaction (4B, or
> probably 8B when we introduce 64-bit xids)
> - has_cid (3): contains the command ID of the logging statement (4B)
> (rationale for logging CID in [0], now in record header because XID is
> included there as well, and both are required for consistent
> snapshots.
> - has_rminfo (4): has non-zero redo-manager flags field (1B)
> (rationale for separate field [1], non-zero allows 1B space
> optimization for one of each RMGR's operations)
> - special_rel (5): pre-existing definition
> - check_consistency (6): pre-existing definition
> - unset (7): no meaning defined yet. Could be used for full record
> compression, or other purposes.

Interesting. One fly in the ointment here is that WAL records start on
8-byte boundaries (probably MAXALIGN boundaries, but I didn't check
the details). And after the 24-byte header, there's a 2-byte header
(or 5-byte header) introducing the payload data (see
XLR_BLOCK_ID_DATA_SHORT/LONG). So if the size of the actual payload
data is a multiple of 8, and is short enough that we use the short
data header, we waste 6 bytes. If the data length is a multiple of 4,
we waste 2 bytes. And those are probably really common cases. So the
big improvements probably come from saving 2 bytes or 6 bytes or 10
bytes, and saving say 3 or 5 is probably not much better than 2. Or at
least that's what I'm guessing.
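
Working that guess through with concrete numbers (24-byte XLogRecord header
plus the 2-byte short data header, 8-byte alignment; again just
illustrative arithmetic):

#include <assert.h>

#define MAXALIGN(x) (((x) + 7) & ~(unsigned long) 7)

int
main(void)
{
	unsigned long header = 24 + 2;	/* XLogRecord + short data header */

	/* payload a multiple of 8: six bytes of padding are wasted */
	assert(MAXALIGN(header + 16) - (header + 16) == 6);

	/* payload a multiple of 4 but not 8: only two bytes are wasted */
	assert(MAXALIGN(header + 20) - (header + 20) == 2);

	return 0;
}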

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-09-29 Thread Tom Lane
Robert Haas  writes:
> On Thu, Sep 29, 2022 at 10:50 AM Maxim Orlov  wrote:
>> In other words, we have 19 attributes. But tupdesc here is constructed for 
>> 18 elements:
>> tupdesc = CreateTemplateTupleDesc(18);

> I think that's a mistake. Thanks for the report.

The assertions in TupleDescInitEntry would have caught that,
if only utils/misc/pg_controldata.c had more than zero test coverage.
Seems like somebody ought to do something about that.

regards, tom lane




Re: problems with making relfilenodes 56-bits

2022-09-29 Thread Matthias van de Meent
On Thu, 29 Sep 2022, 00:06 Robert Haas,  wrote:
>
> 2. WAL Size. Block references in the WAL are by RelFileLocator, so if
> you make RelFileLocators bigger, WAL gets bigger. We'd have to test
> the exact impact of this, but it seems a bit scary: if you have a WAL
> stream with few FPIs doing DML on a narrow table, probably most
> records will contain 1 block reference (and occasionally more, but I
> guess most will use BKPBLOCK_SAME_REL) and adding 4 bytes to that
> block reference feels like it might add up to something significant. I
> don't really see any way around this, either: if you make relfilenode
> values wider, they take up more space. Perhaps there's a way to claw
> that back elsewhere, or we could do something really crazy like switch
> to variable-width representations of integer quantities in WAL
> records, but there doesn't seem to be any simple way forward other
> than, you know, deciding that we're willing to pay the cost of the
> additional WAL volume.

Re: WAL volume and record size optimization

I've been working off and on with WAL for some time now due to [0] and
the interest of Neon in the area, and I think we can reduce the size
of the base record by a significant margin:

Currently, our minimal WAL record is exactly 24 bytes: length (4B),
TransactionId (4B), previous record pointer (8B), flags (1B), redo
manager (1B), 2 bytes of padding and lastly the 4-byte CRC. Of these
fields, TransactionID could reasonably be omitted for certain WAL
records (as example: index insertions don't really need the XID).
Additionally, the length field could be made to be variable length,
and any padding is just plain bad (adding 4 bytes to all
insert/update/delete/lock records was frowned upon).

I'm working on a prototype patch for a more bare-bones WAL record
header of which the only required fields would be prevptr (8B), CRC
(4B), rmgr (1B) and flags (1B) for a minimal size of 14 bytes. I don't
yet know the performance of this, but considering that there will
be a lot more conditionals in header decoding it might be slower for
any one backend, but faster overall (fewer IOPS in total).

The flags field would be indications for additional information: [flag
name (bits): explanation (additional xlog header data in bytes)]
- len_size(0..1): xlog record size is at most xlrec_header_only (0B),
uint8_max(1B), uint16_max(2B), uint32_max(4B)
- has_xid (2): contains transaction ID of logging transaction (4B, or
probably 8B when we introduce 64-bit xids)
- has_cid (3): contains the command ID of the logging statement (4B)
(rationale for logging CID in [0], now in record header because XID is
included there as well, and both are required for consistent
snapshots).
- has_rminfo (4): has non-zero redo-manager flags field (1B)
(rationale for separate field [1], non-zero allows 1B space
optimization for one of each RMGR's operations)
- special_rel (5): pre-existing definition
- check_consistency (6): pre-existing definition
- unset (7): no meaning defined yet. Could be used for full record
compression, or other purposes.

A normal record header (XLOG record with at least some registered
data) would be only 15 to 17 bytes (0-1B rminfo + 1-2B in xl_len), and
one with XID only up to 21 bytes. So, when compared to the current
XLogRecord format, we would in general recover 2 or 3 bytes from the
xl_tot_len field, 1 or 2 bytes from the alignment hole, and
potentially the 4 bytes of the xid when that data is considered
useless during recovery, or physical or logical replication.
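
As a rough sketch of that layout (the struct and macro names, the exact bit
assignments and the value encoding are mine, following the list above; the
real on-disk packing would obviously need more care than a plain struct):

#include "postgres.h"
#include "access/rmgr.h"
#include "access/xlogdefs.h"
#include "port/pg_crc32c.h"

/*
 * Hypothetical "always present" part of the proposed header: 14 bytes of
 * fields (the compiler will pad this declaration to 16 bytes, so the on-disk
 * format would have to be packed explicitly).
 */
typedef struct XLogRecordCompact
{
	XLogRecPtr	xl_prev;		/* pointer to previous record */
	pg_crc32c	xl_crc;			/* CRC covering the whole record */
	RmgrId		xl_rmid;		/* resource manager for this record */
	uint8		xl_flags;		/* which optional fields follow */
} XLogRecordCompact;

/* xl_flags bits, following the list above (values are illustrative) */
#define XLRC_LEN_SIZE_MASK		0x03	/* 0-, 1-, 2- or 4-byte length follows */
#define XLRC_HAS_XID			0x04	/* XID of the logging transaction follows */
#define XLRC_HAS_CID			0x08	/* CID of the logging statement follows */
#define XLRC_HAS_RMINFO			0x10	/* non-zero 1-byte rmgr flags follow */
#define XLRC_SPECIAL_REL		0x20	/* pre-existing special_rel meaning */
#define XLRC_CHECK_CONSISTENCY	0x40	/* pre-existing consistency-check flag */
										/* 0x80 is still unassigned */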

Kind regards,

Matthias van de Meent

[0] 
https://postgr.es/m/CAEze2WhmU8WciEgaVPZm71vxFBOpp8ncDc%3DSdEHHsW6HS%2Bk9zw%40mail.gmail.com
[1] https://postgr.es/m/20220715173731.6t3km5cww3f5ztfq%40awork3.anarazel.de




Re: making relfilenodes 56 bits

2022-09-29 Thread Robert Haas
On Thu, Sep 29, 2022 at 10:50 AM Maxim Orlov  wrote:
> In other words, we have 19 attributes. But tupdesc here is constructed for 18 
> elements:
> tupdesc = CreateTemplateTupleDesc(18);
>
> Is that normal or not? Again, I'm not in this thread and if that is 
> completely ok, I'm sorry about the noise.

I think that's a mistake. Thanks for the report.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-09-29 Thread Maxim Orlov
Hi!

I'm not in the context of this thread, but I've noticed something strange while
attempting to rebase my patch set from the 64XID thread.
As far as I'm aware, this patch set is adding "relfilenumber". So, in
pg_control_checkpoint, we have the following changes:

diff --git a/src/backend/utils/misc/pg_controldata.c
b/src/backend/utils/misc/pg_controldata.c
index 781f8b8758..d441cd97e2 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -79,8 +79,8 @@ pg_control_system(PG_FUNCTION_ARGS)
 Datum
 pg_control_checkpoint(PG_FUNCTION_ARGS)
 {
-   Datum   values[18];
-   boolnulls[18];
+   Datum   values[19];
+   boolnulls[19];
TupleDesc   tupdesc;
HeapTuple   htup;
ControlFileData *ControlFile;
@@ -129,6 +129,8 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
   XIDOID, -1, 0);
TupleDescInitEntry(tupdesc, (AttrNumber) 18, "checkpoint_time",
   TIMESTAMPTZOID, -1, 0);
+   TupleDescInitEntry(tupdesc, (AttrNumber) 19, "next_relfilenumber",
+  INT8OID, -1, 0);
tupdesc = BlessTupleDesc(tupdesc);

/* Read the control file. */

In other words, we have 19 attributes. But tupdesc here is constructed for
18 elements:
tupdesc = CreateTemplateTupleDesc(18);

Is that normal or not? Again, I'm not in this thread and if that is
completely ok, I'm sorry about the noise.

-- 
Best regards,
Maxim Orlov.


Re: problems with making relfilenodes 56-bits

2022-09-28 Thread Peter Geoghegan
On Wed, Sep 28, 2022 at 4:08 PM Tom Lane  wrote:
> As far as that goes, I'm entirely prepared to accept a conclusion
> that the benefits of widening relfilenodes justify whatever space
> or speed penalties may exist there.  However, we cannot honestly
> make that conclusion if we haven't measured said penalties.
> The same goes for the other issues you raise here.

I generally agree, but the devil is in the details.

I tend to agree with Robert that many individual WAL record types just
don't appear frequently enough to matter (it also helps that even the
per-record space overhead with wider 56-bit relfilenodes isn't so
bad). Just offhand I'd say that ginxlogSplit, ginxlogDeletePage,
ginxlogUpdateMeta, gistxlogPageReuse and xl_btree_reuse_page are
likely to be in this category (though it would be nice to see some
numbers for those).

I'm much less sure about the other record types. Any WAL records with
a variable number of relfilenode entries seem like they might be more
of a problem. But I'm not ready to accept that that cannot be
ameliorated in some way. Just for example, it wouldn't be impossible to
do some kind of varbyte encoding for some record types. How many times
will the cluster actually need billions of relfilenodes? It has to
work, but maybe it can be suboptimal from a space overhead
perspective.
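
For example, a minimal unsigned varbyte encoder along the usual
7-bits-per-byte lines (purely a sketch of the general technique, not
something from a posted patch) would keep small relfilenumber values at 1-4
bytes and only pay the full width once the counter really grows:

#include "postgres.h"

/*
 * Encode a 64-bit value using 7 data bits per byte, with the high bit set on
 * every byte except the last.  Values below 2^28 take at most 4 bytes, a
 * full 56-bit relfilenumber takes at most 8, and an arbitrary uint64 at most
 * 10.  Returns the number of bytes written to buf.
 */
static int
encode_varbyte_uint64(uint64 value, unsigned char *buf)
{
	int			n = 0;

	do
	{
		unsigned char b = value & 0x7F;

		value >>= 7;
		if (value != 0)
			b |= 0x80;			/* continuation bit: more bytes follow */
		buf[n++] = b;
	} while (value != 0);

	return n;
}

The decoder just reverses the loop, so the cost is a few shifts and branches
per block reference.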

I'm not saying that we need to do anything fancy just yet. I'm
just saying that there definitely *are* options. Maybe it's not really
necessary to come up with something like a varbyte encoding, and maybe
the complexity it imposes just won't be worth it -- I really have no
opinion on that just yet.

--
Peter Geoghegan




Re: problems with making relfilenodes 56-bits

2022-09-28 Thread Tom Lane
Robert Haas  writes:
> 3. Sinval Message Size. Sinval messages are 16 bytes right now.
> They'll have to grow to 20 bytes if we do this. There's even less room
> for bit-squeezing here than there is for the WAL stuff. I'm skeptical
> that this really matters, but Tom seems concerned.

As far as that goes, I'm entirely prepared to accept a conclusion
that the benefits of widening relfilenodes justify whatever space
or speed penalties may exist there.  However, we cannot honestly
make that conclusion if we haven't measured said penalties.
The same goes for the other issues you raise here.

regards, tom lane




Re: making relfilenodes 56 bits

2022-09-28 Thread Thomas Munro
On Wed, Sep 28, 2022 at 9:40 PM Dilip Kumar  wrote:
> Thanks, Thomas, these all look fine to me.  So far we have committed
> the patch to make relfilenode 56 bits wide.  The Tombstone file
> removal patch is still pending to be committed, so when I will rebase
> that patch I will accommodate all these comments in that patch.

I noticed that your new unlinking algorithm goes like this:

stat("x")
stat("x.1")
stat("x.2")
stat("x.3") -> ENOENT /* now we know how many segments there are */
truncate("x.2")
unlink("x.2")
truncate("x.1")
unlink("x.1")
truncate("x")
unlink("x")

Could you say what problem this solves, and, guessing that it's just
that you want the 0 file to be "in the way" until the other files are
gone (at least while we're running; who knows what'll be left if you
power-cycle), could you do it like this instead?

truncate("x")
truncate("x.1")
truncate("x.2")
truncate("x.3") -> ENOENT /* now we know how many segments there are */
unlink("x.2")
unlink("x.1")
unlink("x")




problems with making relfilenodes 56-bits

2022-09-28 Thread Robert Haas
OK, so the recent commit and revert of the 56-bit relfilenode patch
revealed a few issues that IMHO need design-level input. Let me try to
surface those here, starting a new thread to separate this discussion
from the clutter:

1. Commit Record Alignment. ParseCommitRecord() and ParseAbortRecord()
are dependent on every subsidiary structure that can be added to a
commit or abort record requiring exactly 4-byte alignment. IMHO, this
seems awfully fragile, even leaving the 56-bit patch aside. Prepare
records seem to have a much saner scheme: they've also got a bunch of
different things that can be stuck onto the main record, but they
maxalign each top-level thing that they stick in there. So
ParsePrepareRecord() doesn't have to make any icky alignment
assumptions the way ParseCommitRecord() and ParseAbortRecord() do.
Unfortuantely, that scheme doesn't work as well for commit records,
because the very first top-level thing only needs 2 bytes. We're
currently using 4, and it would obviously be nicer to cut that down to
2 than to have it go up to 8. We could try to rejigger things around
somehow to avoid needing that 2-byte quantity in there as a separate
toplevel item, but I'm not quite sure how to do that, or we could just
copy everything to ensure alignment, but that seems kind of expensive.

If we don't decide to do either of those things, we should at least
better document, and preferably enforce via asserts, the requirement
that these structs be exactly 4-byte aligned, so that nobody else
makes the same mistake in the future.
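
For instance, something along these lines (purely illustrative;
xl_xact_subxacts and xl_xact_invals are two of the real subsidiary structs,
but the full set and the exact conditions would need to match what
ParseCommitRecord() actually walks):

#include "postgres.h"
#include "access/xact.h"

/*
 * ParseCommitRecord()/ParseAbortRecord() advance through the record assuming
 * every subsidiary chunk keeps 4-byte alignment; compile-time checks like
 * these would make a future struct change fail loudly instead of silently
 * breaking decoding.
 */
StaticAssertDecl(sizeof(xl_xact_subxacts) % sizeof(uint32) == 0,
				 "xl_xact_subxacts must preserve 4-byte alignment");
StaticAssertDecl(sizeof(xl_xact_invals) % sizeof(uint32) == 0,
				 "xl_xact_invals must preserve 4-byte alignment");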

2. WAL Size. Block references in the WAL are by RelFileLocator, so if
you make RelFileLocators bigger, WAL gets bigger. We'd have to test
the exact impact of this, but it seems a bit scary: if you have a WAL
stream with few FPIs doing DML on a narrow table, probably most
records will contain 1 block reference (and occasionally more, but I
guess most will use BKPBLOCK_SAME_REL) and adding 4 bytes to that
block reference feels like it might add up to something significant. I
don't really see any way around this, either: if you make relfilenode
values wider, they take up more space. Perhaps there's a way to claw
that back elsewhere, or we could do something really crazy like switch
to variable-width representations of integer quantities in WAL
records, but there doesn't seem to be any simple way forward other
than, you know, deciding that we're willing to pay the cost of the
additional WAL volume.

3. Sinval Message Size. Sinval messages are 16 bytes right now.
They'll have to grow to 20 bytes if we do this. There's even less room
for bit-squeezing here than there is for the WAL stuff. I'm skeptical
that this really matters, but Tom seems concerned.

4. Other Uses of RelFileLocator. There are a bunch of structs I
haven't looked into yet that also embed RelFileLocator, which may have
their own issues with alignment, padding, and/or size: ginxlogSplit,
ginxlogDeletePage, ginxlogUpdateMeta, gistxlogPageReuse,
xl_heap_new_cid, xl_btree_reuse_page, LogicalRewriteMappingData,
xl_smgr_truncate, xl_seq_rec, ReorderBufferChange, FileTag. I think a
bunch of these are things that get written into WAL, but at least some
of them seem like they probably don't get written into WAL enough to
matter. Needs more investigation, though.

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-09-28 Thread Dilip Kumar
On Wed, Sep 28, 2022 at 9:23 AM Thomas Munro  wrote:
>
> Hi Dilip,
>
> I am very happy to see these commits.  Here's some belated review for
> the tombstone-removal patch.
>
> > v7-0004-Don-t-delay-removing-Tombstone-file-until-next.patch
>
> More things you can remove:
>
>  * sync_unlinkfiletag in struct SyncOps
>  * the macro UNLINKS_PER_ABSORB
>  * global variable pendingUnlinks
>
> This comment after the question mark is obsolete:
>
> * XXX should we CHECK_FOR_INTERRUPTS in this loop?
> Escaping with an
> * error in the case of SYNC_UNLINK_REQUEST would leave the
> * no-longer-used file still present on disk, which
> would be bad, so
> * I'm inclined to assume that the checkpointer will
> always empty the
> * queue soon.
>
> (I think if the answer to the question is now yes, then we should
> replace the stupid sleep with a condition variable sleep, but there's
> another thread about that somewhere).
>
> In a couple of places in dbcommands.c you could now make this change:
>
> /*
> -* Force a checkpoint before starting the copy. This will
> force all dirty
> -* buffers, including those of unlogged tables, out to disk, to ensure
> -* source database is up-to-date on disk for the copy.
> -* FlushDatabaseBuffers() would suffice for that, but we also want to
> -* process any pending unlink requests. Otherwise, if a checkpoint
> -* happened while we're copying files, a file might be deleted just 
> when
> -* we're about to copy it, causing the lstat() call in copydir() to 
> fail
> -* with ENOENT.
> +* Force all dirty buffers, including those of unlogged tables, out to
> +* disk, to ensure source database is up-to-date on disk for the copy.
>  */
> -   RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
> - CHECKPOINT_WAIT |
> CHECKPOINT_FLUSH_ALL);
> +   FlushDatabaseBuffers(src_dboid);
>
> More obsolete comments you could change:
>
>  * If we were copying database at block levels then drop pages for the
>  * destination database that are in the shared buffer cache.  And tell
> -->  * checkpointer to forget any pending fsync and unlink
> requests for files
>
> --> * Tell checkpointer to forget any pending fsync and unlink requests 
> for
> * files in the database; else the fsyncs will fail at next
> checkpoint, or
> * worse, it will delete file
>
> In tablespace.c I think you could now make this change:
>
> if (!destroy_tablespace_directories(tablespaceoid, false))
> {
> -   /*
> -* Not all files deleted?  However, there can be
> lingering empty files
> -* in the directories, left behind by for example DROP
> TABLE, that
> -* have been scheduled for deletion at next checkpoint
> (see comments
> -* in mdunlink() for details).  We could just delete
> them immediately,
> -* but we can't tell them apart from important data
> files that we
> -* mustn't delete.  So instead, we force a checkpoint
> which will clean
> -* out any lingering files, and try again.
> -*/
> -   RequestCheckpoint(CHECKPOINT_IMMEDIATE |
> CHECKPOINT_FORCE | CHECKPOINT_WAIT);
> -
> +#ifdef WIN32
> /*
>  * On Windows, an unlinked file persists in the
> directory listing
>  * until no process retains an open handle for the
> file.  The DDL
> @@ -523,6 +513,7 @@ DropTableSpace(DropTableSpaceStmt *stmt)
>
> /* And now try again. */
> if (!destroy_tablespace_directories(tablespaceoid, false))
> +#endif
> {
> /* Still not empty, the files must be important then 
> */
> ereport(ERROR,

Thanks, Thomas, these all look fine to me.  So far we have committed
the patch to make relfilenode 56 bits wide.  The tombstone-file
removal patch is still pending, so when I rebase that patch I will
incorporate all of these comments into it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-09-27 Thread Thomas Munro
Hi Dilip,

I am very happy to see these commits.  Here's some belated review for
the tombstone-removal patch.

> v7-0004-Don-t-delay-removing-Tombstone-file-until-next.patch

More things you can remove:

 * sync_unlinkfiletag in struct SyncOps
 * the macro UNLINKS_PER_ABSORB
 * global variable pendingUnlinks

This comment after the question mark is obsolete:

* XXX should we CHECK_FOR_INTERRUPTS in this loop?
Escaping with an
* error in the case of SYNC_UNLINK_REQUEST would leave the
* no-longer-used file still present on disk, which
would be bad, so
* I'm inclined to assume that the checkpointer will
always empty the
* queue soon.

(I think if the answer to the question is now yes, then we should
replace the stupid sleep with a condition variable sleep, but there's
another thread about that somewhere).

In a couple of places in dbcommands.c you could now make this change:

/*
-* Force a checkpoint before starting the copy. This will
force all dirty
-* buffers, including those of unlogged tables, out to disk, to ensure
-* source database is up-to-date on disk for the copy.
-* FlushDatabaseBuffers() would suffice for that, but we also want to
-* process any pending unlink requests. Otherwise, if a checkpoint
-* happened while we're copying files, a file might be deleted just when
-* we're about to copy it, causing the lstat() call in copydir() to fail
-* with ENOENT.
+* Force all dirty buffers, including those of unlogged tables, out to
+* disk, to ensure source database is up-to-date on disk for the copy.
 */
-   RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
- CHECKPOINT_WAIT |
CHECKPOINT_FLUSH_ALL);
+   FlushDatabaseBuffers(src_dboid);

More obsolete comments you could change:

 * If we were copying database at block levels then drop pages for the
 * destination database that are in the shared buffer cache.  And tell
-->  * checkpointer to forget any pending fsync and unlink
requests for files

--> * Tell checkpointer to forget any pending fsync and unlink requests for
* files in the database; else the fsyncs will fail at next
checkpoint, or
* worse, it will delete file

In tablespace.c I think you could now make this change:

if (!destroy_tablespace_directories(tablespaceoid, false))
{
-   /*
-* Not all files deleted?  However, there can be
lingering empty files
-* in the directories, left behind by for example DROP
TABLE, that
-* have been scheduled for deletion at next checkpoint
(see comments
-* in mdunlink() for details).  We could just delete
them immediately,
-* but we can't tell them apart from important data
files that we
-* mustn't delete.  So instead, we force a checkpoint
which will clean
-* out any lingering files, and try again.
-*/
-   RequestCheckpoint(CHECKPOINT_IMMEDIATE |
CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
+#ifdef WIN32
/*
 * On Windows, an unlinked file persists in the
directory listing
 * until no process retains an open handle for the
file.  The DDL
@@ -523,6 +513,7 @@ DropTableSpace(DropTableSpaceStmt *stmt)

/* And now try again. */
if (!destroy_tablespace_directories(tablespaceoid, false))
+#endif
{
/* Still not empty, the files must be important then */
ereport(ERROR,




Re: making relfilenodes 56 bits

2022-09-27 Thread Robert Haas
On Tue, Sep 27, 2022 at 2:33 AM Dilip Kumar  wrote:
> Looks fine to me.

OK, committed. I also committed the 0002 patch with some wordsmithing,
and I removed a < 0 test on an unsigned value because my compiler
complained about it. 0001 turned out to make headerscheck sad, so I
just pushed a fix for that, too.

I'm not too sure about 0003. I think if we need an is_shared flag
maybe we might as well just pass the tablespace OID. The is_shared
flag seems to just make things a bit complicated for the callers for
no real benefit.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-09-26 Thread Robert Haas
On Wed, Sep 21, 2022 at 6:09 AM Dilip Kumar  wrote:
> Yeah you are right we can make it uint64.  With respect to this, we
> can not directly use uint64 because that is declared in c.h and that
> can not be used in
> postgres_ext.h IIUC.

Ugh.

> Can we move the existing definitions from
> c.h file to some common file (common for client and server)?

Yeah, I think that would be a good idea. Here's a quick patch that
moves them to common/relpath.h, which seems like a possibly-reasonable
choice, though perhaps you or someone else will have a better idea.

> Based on the discussion [1], it seems we can not use
> INT64_FORMAT/UINT64_FORMAT while using ereport.  But all other places
> I am using INT64_FORMAT/UINT64_FORMAT.  Does this make sense?
>
> [1] 
> https://www.postgresql.org/message-id/20220730113922.qd7qmenwcmzyacje%40alvherre.pgsql

Oh, hmm. So you're saying if the string is not translated then use
(U)INT64_FORMAT but if it is translated then cast? I guess that makes
sense. It feels a bit strange to have the style dependent on the
context like that, but maybe it's fine. I'll reread with that idea in
mind.

> If we're going to bank on that, we could adapt this more
> > heavily, e.g. RelidByRelfilenumber() could lose the reltablespace
> > parameter.
>
> Yeah we might, although we need a bool to identify whether it is
> shared relation or not.

Why?

-- 
Robert Haas
EDB: http://www.enterprisedb.com


move-relfilenumber-decls-v1.patch
Description: Binary data


Re: making relfilenodes 56 bits

2022-09-22 Thread Dilip Kumar
On Tue, Sep 20, 2022 at 7:46 PM Amul Sul  wrote:
>

Thanks for the review

> Here are a few minor suggestions I came across while reading this
> patch, might be useful:
>
> +#ifdef USE_ASSERT_CHECKING
> +
> +   {
>
> Unnecessary space after USE_ASSERT_CHECKING.

Changed

>
> +   return InvalidRelFileNumber;/* placate compiler */
>
> I don't think we needed this after the error on the latest branches.
> --

Changed

> +   LWLockAcquire(RelFileNumberGenLock, LW_SHARED);
> +   if (shutdown)
> +   checkPoint.nextRelFileNumber = ShmemVariableCache->nextRelFileNumber;
> +   else
> +   checkPoint.nextRelFileNumber = 
> ShmemVariableCache->loggedRelFileNumber;
> +
> +   LWLockRelease(RelFileNumberGenLock);
>
> This is done for the good reason, I think, it should have a comment
> describing different checkPoint.nextRelFileNumber assignment
> need and crash recovery perspective.
> --

Done

> +#define SizeOfRelFileLocatorBackend \
> +   (offsetof(RelFileLocatorBackend, backend) + sizeof(BackendId))
>
> Can append empty parenthesis "()" to the macro name, to look like a
> function call at use or change the macro name to uppercase?
> --

Yeah, we could, but SizeOfXXX macros are the general practice I see used
everywhere in the Postgres code, so I have left it as it is.

>  +   if (val < 0 || val > MAX_RELFILENUMBER)
> ..
>  if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
>
> How about adding a macro for this condition as RelFileNumberIsValid()?
> We can replace all the checks referring to MAX_RELFILENUMBER with this.

Actually, RelFileNumberIsValid is used just to check whether the value is
the InvalidRelFileNumber value, i.e. 0.  Maybe for this we could introduce
a RelFileNumberInValidRange() macro, but I am not sure whether it would be
cleaner than what we have now, so I have left it as it is for now.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-09-21 Thread Dilip Kumar
On Tue, Sep 20, 2022 at 10:44 PM Robert Haas  wrote:

Thanks for the review; please see my responses inline for some of the
comments, the rest are all accepted.

> On Fri, Sep 9, 2022 at 6:02 AM Dilip Kumar  wrote:
> > [ new patch ]
>
> +typedef pg_int64 RelFileNumber;
>
> This seems really random to me. First, why isn't this an unsigned
> type? OID is unsigned and I don't see a reason to change to a signed
> type. But even if we were going to change to a signed type, why
> pg_int64? That is declared like this:
>
> /* Define a signed 64-bit integer type for use in client API declarations. */
> typedef PG_INT64_TYPE pg_int64;
>
> Surely this is not a client API declaration
>
> Note that if we change this a lot of references to INT64_FORMAT will
> need to become UINT64_FORMAT.
>
> I think we should use int64 at the SQL level, because we don't have an
> unsigned 64-bit SQL type, and a signed 64-bit type can hold 56 bits.
> So it would still be Int64GetDatum((int64) rd_rel->relfilenode) or
> similar. But internally I think using unsigned is cleaner.

Yeah, you are right, we can make it uint64.  With respect to this, we
cannot directly use uint64 because that is declared in c.h and that
cannot be used in postgres_ext.h IIUC.  So, as another option, maybe we
can typedef RelFileNumber similarly to what c.h does for uint64, i.e.

#ifdef HAVE_LONG_INT_64
typedef unsigned long int uint64;
#elif defined(HAVE_LONG_LONG_INT_64)
typedef unsigned long long int uint64;
#endif

And maybe the same for UINT64CONST?

I don't like duplicating this logic, but is there any better
alternative for doing this?  Can we move the existing definitions from
c.h to some common file (common for client and server)?

>
> +if (relnumber >= ShmemVariableCache->loggedRelFileNumber)
>
> It probably doesn't make any difference, but to me it seems better to
> test flushedRelFileNumber rather than logRelFileNumber here. What do
> you think?

Actually, based on this condition we decide whether to log more, so it
makes more sense to check against loggedRelFileNumber; but OTOH,
technically, without flushing the log we are not supposed to use the
relfilenumber, so it also makes sense to test flushedRelFileNumber.  But
since both are the same here, I am fine with flushedRelFileNumber.

>  /*
>   * We set up the lockRelId in case anything tries to lock the dummy
> - * relation.  Note that this is fairly bogus since relNumber may be
> - * different from the relation's OID.  It shouldn't really matter though.
> - * In recovery, we are running by ourselves and can't have any lock
> - * conflicts.  While syncing, we already hold AccessExclusiveLock.
> + * relation.  Note we are setting relId to just FirstNormalObjectId which
> + * is completely bogus.  It shouldn't really matter though. In recovery,
> + * we are running by ourselves and can't have any lock conflicts.  While
> + * syncing, we already hold AccessExclusiveLock.
>   */
>  rel->rd_lockInfo.lockRelId.dbId = rlocator.dbOid;
> -rel->rd_lockInfo.lockRelId.relId = rlocator.relNumber;
> +rel->rd_lockInfo.lockRelId.relId = FirstNormalObjectId;
>
> Boy, this makes me uncomfortable. The existing logic is pretty bogus,
> and we're replacing it with some other bogus thing. Do we know whether
> anything actually does try to use this for locking?
>
> One notable difference between the existing logic and your change is
> that, with the existing logic, we use a bogus value that will differ
> from one relation to the next, whereas with this change, it will
> always be the same value. Perhaps el->rd_lockInfo.lockRelId.relId =
> (Oid) rlocator.relNumber would be a more natural adaptation?
>
> +#define CHECK_RELFILENUMBER_RANGE(relfilenumber)\
> +do {\
> +if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
> +ereport(ERROR,  \
> +errcode(ERRCODE_INVALID_PARAMETER_VALUE),   \
> +errmsg("relfilenumber %lld is out of range",\
> +(long long) (relfilenumber))); \
> +} while (0)
>
> Here, you take the approach of casting the relfilenumber to long long
> and then using %lld. But elsewhere, you use
> INT64_FORMAT/UINT64_FORMAT. If we're going to use this technique, we
> ought to use it everywhere.

Based on the discussion [1], it seems we can not use
INT64_FORMAT/UINT64_FORMAT while using ereport.  But all other places
I am using INT64_FORMAT/UINT64_FORMAT.  Does this make sense?

[1] 
https://www.postgresql.org/message-id/20220730113922.qd7qmenwcmzyacje%40alvherre.pgsql
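
So, roughly, a translatable message gets the cast-plus-%lld treatment while
a non-translated elog() keeps the format macro.  An illustrative sketch (the
function name is made up, and MAX_RELFILENUMBER comes from the patch):

#include "postgres.h"

/*
 * Illustrative only.  The ereport() message is translatable, so it avoids
 * the platform-dependent format macro and casts to long long instead; the
 * elog() message is not translated, so the macro is fine there.
 */
static void
check_relfilenumber_range(uint64 relfilenumber)
{
	if (relfilenumber > MAX_RELFILENUMBER)
		ereport(ERROR,
				errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				errmsg("relfilenumber %lld is out of range",
					   (long long) relfilenumber));

	elog(DEBUG1, "relfilenumber " UINT64_FORMAT " is within range",
		 relfilenumber);
}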

>  typedef struct
>  {
> -Oid reltablespace;
> -RelFileNumber relfilenumber;
> -} RelfilenumberMapKey;
> -
> -typedef struct
> -{
> -RelfilenumberMapKey key;/* lookup key - must be first */
> +RelFileNumber relfilenumber;/* lookup key - must be first */
>  Oid relid; 

Re: making relfilenodes 56 bits

2022-09-20 Thread Robert Haas
On Fri, Sep 9, 2022 at 6:02 AM Dilip Kumar  wrote:
> [ new patch ]

+typedef pg_int64 RelFileNumber;

This seems really random to me. First, why isn't this an unsigned
type? OID is unsigned and I don't see a reason to change to a signed
type. But even if we were going to change to a signed type, why
pg_int64? That is declared like this:

/* Define a signed 64-bit integer type for use in client API declarations. */
typedef PG_INT64_TYPE pg_int64;

Surely this is not a client API declaration

Note that if we change this a lot of references to INT64_FORMAT will
need to become UINT64_FORMAT.

I think we should use int64 at the SQL level, because we don't have an
unsigned 64-bit SQL type, and a signed 64-bit type can hold 56 bits.
So it would still be Int64GetDatum((int64) rd_rel->relfilenode) or
similar. But internally I think using unsigned is cleaner.

+ * RelFileNumber is unique within a cluster.

Not really, because of CREATE DATABASE. Probably just drop this line.
Or else expand it: we never assign the same RelFileNumber twice within
the lifetime of the same cluster, but there can be multiple relations
with the same RelFileNumber e.g. because CREATE DATABASE duplicates
the RelFileNumber values from the template database. But maybe we
don't need this here, as it's already explained in relfilelocator.h.

+ret = (int8) (tag->relForkDetails[0] >> BUFTAG_RELNUM_HIGH_BITS);

Why not declare ret as ForkNumber instead of casting twice?

+uint64  relnum;
+
+Assert(relnumber <= MAX_RELFILENUMBER);
+Assert(forknum <= MAX_FORKNUM);
+
+relnum = relnumber;

Perhaps it'd be better to write uint64 relnum = relnumber instead of
initializing on a separate line.

+#define RELNUMBERCHARS  20  /* max chars printed by %llu */

Maybe instead of %llu we should say UINT64_FORMAT (or INT64_FORMAT if
there's some reason to stick with a signed type).

+elog(ERROR, "relfilenumber is out of bound");

It would have to be "out of bounds", with an "s". But maybe "is too
large" would be better.

+nextRelFileNumber = ShmemVariableCache->nextRelFileNumber;
+loggedRelFileNumber = ShmemVariableCache->loggedRelFileNumber;
+flushedRelFileNumber = ShmemVariableCache->flushedRelFileNumber;

Maybe it would be a good idea to assert that next <= flushed and
flushed <= logged?

+#ifdef USE_ASSERT_CHECKING
+
+{
+RelFileLocatorBackend rlocator;
+char   *rpath;

Let's add a comment here, like "Because the RelFileNumber counter only
ever increases and never wraps around, it should be impossible for the
newly-allocated RelFileNumber to already be in use. But, if Asserts
are enabled, double check that there's no main-fork relation file with
the new RelFileNumber already on disk."

+elog(ERROR, "cannot forward RelFileNumber during recovery");

forward -> set (or advance)

+if (relnumber >= ShmemVariableCache->loggedRelFileNumber)

It probably doesn't make any difference, but to me it seems better to
test flushedRelFileNumber rather than logRelFileNumber here. What do
you think?

 /*
  * We set up the lockRelId in case anything tries to lock the dummy
- * relation.  Note that this is fairly bogus since relNumber may be
- * different from the relation's OID.  It shouldn't really matter though.
- * In recovery, we are running by ourselves and can't have any lock
- * conflicts.  While syncing, we already hold AccessExclusiveLock.
+ * relation.  Note we are setting relId to just FirstNormalObjectId which
+ * is completely bogus.  It shouldn't really matter though. In recovery,
+ * we are running by ourselves and can't have any lock conflicts.  While
+ * syncing, we already hold AccessExclusiveLock.
  */
 rel->rd_lockInfo.lockRelId.dbId = rlocator.dbOid;
-rel->rd_lockInfo.lockRelId.relId = rlocator.relNumber;
+rel->rd_lockInfo.lockRelId.relId = FirstNormalObjectId;

Boy, this makes me uncomfortable. The existing logic is pretty bogus,
and we're replacing it with some other bogus thing. Do we know whether
anything actually does try to use this for locking?

One notable difference between the existing logic and your change is
that, with the existing logic, we use a bogus value that will differ
from one relation to the next, whereas with this change, it will
always be the same value. Perhaps el->rd_lockInfo.lockRelId.relId =
(Oid) rlocator.relNumber would be a more natural adaptation?

+#define CHECK_RELFILENUMBER_RANGE(relfilenumber)\
+do {\
+if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+ereport(ERROR,  \
+errcode(ERRCODE_INVALID_PARAMETER_VALUE),   \
+errmsg("relfilenumber %lld is out of range",\
+(long long) (relfilenumber))); \
+} while (0)

Here, you take the approach of casting the relfilenumber to 

Re: making relfilenodes 56 bits

2022-09-20 Thread Amul Sul
On Fri, Sep 9, 2022 at 3:32 PM Dilip Kumar  wrote:
>
> On Thu, Sep 8, 2022 at 4:10 PM Dilip Kumar  wrote:
>
> > On a separate note, while reviewing the latest patch I see there is some 
> > risk of using the unflushed relfilenumber in GetNewRelFileNumber() 
> > function.  Basically, in the current code, the flushing logic is tightly 
> > coupled with the logging new relfilenumber logic and that might not work 
> > with all the values of the VAR_RELNUMBER_NEW_XLOG_THRESHOLD.  So the idea 
> > is we need to keep the flushing logic separate from the logging, I am 
> > working on the idea and I will post the patch soon.
>
> I have fixed the issue, so now we will track nextRelFileNumber,
> loggedRelFileNumber and flushedRelFileNumber.  So whenever
> nextRelFileNumber is just VAR_RELNUMBER_NEW_XLOG_THRESHOLD behind the
> loggedRelFileNumber we will log VAR_RELNUMBER_PER_XLOG more
> relfilenumbers.  And whenever nextRelFileNumber reaches the
> flushedRelFileNumber then we will do XlogFlush for WAL upto the last
> loggedRelFileNumber.  Ideally flushedRelFileNumber should always be
> VAR_RELNUMBER_PER_XLOG number behind the loggedRelFileNumber so we can
> avoid tracking the flushedRelFileNumber.  But I feel keeping track of
> the flushedRelFileNumber looks cleaner and easier to understand.  For
> more details refer to the code in GetNewRelFileNumber().
>

Here are a few minor suggestions I came across while reading this
patch, might be useful:

+#ifdef USE_ASSERT_CHECKING
+
+   {

Unnecessary space after USE_ASSERT_CHECKING.
--

+   return InvalidRelFileNumber;/* placate compiler */

I don't think we need this after the error on the latest branches, since
ereport(ERROR) is now known to the compiler not to return.
--

+   LWLockAcquire(RelFileNumberGenLock, LW_SHARED);
+   if (shutdown)
+   checkPoint.nextRelFileNumber = ShmemVariableCache->nextRelFileNumber;
+   else
+   checkPoint.nextRelFileNumber = ShmemVariableCache->loggedRelFileNumber;
+
+   LWLockRelease(RelFileNumberGenLock);

This is done for a good reason, I think, but it should have a comment
describing why the two checkPoint.nextRelFileNumber assignments differ
and what that means from a crash-recovery perspective.
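For example, the comment could be shaped roughly like this (the wording is
only a suggestion, not something taken from the patch):

/*
 * For a shutdown checkpoint no further relfilenumbers can be handed out,
 * so the counter itself (nextRelFileNumber) is authoritative.  For an
 * online checkpoint record loggedRelFileNumber instead: numbers up to
 * that value may already have been handed out under cover of WAL written
 * before the checkpoint, so starting recovery from anything smaller could
 * allow a relfilenumber to be reused.
 */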
--

+#define SizeOfRelFileLocatorBackend \
+   (offsetof(RelFileLocatorBackend, backend) + sizeof(BackendId))

Could we append empty parentheses "()" to the macro name so that it looks
like a function call at the call site, or else change the macro name to
uppercase?
--

 +   if (val < 0 || val > MAX_RELFILENUMBER)
..
 if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \

How about adding a macro for this condition as RelFileNumberIsValid()?
We can replace all the checks referring to MAX_RELFILENUMBER with this.
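As a rough sketch, such a macro could look like the following; the name and
exact form are only an illustration of the suggestion, derived from the
existing range check:

#define RelFileNumberIsValid(relfilenumber) \
    ((relfilenumber) >= 0 && (relfilenumber) <= MAX_RELFILENUMBER)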

Regards,
Amul




Re: making relfilenodes 56 bits

2022-09-09 Thread Dilip Kumar
On Thu, Sep 8, 2022 at 4:10 PM Dilip Kumar  wrote:

> On a separate note, while reviewing the latest patch I see there is some risk 
> of using the unflushed relfilenumber in GetNewRelFileNumber() function.  
> Basically, in the current code, the flushing logic is tightly coupled with 
> the logging new relfilenumber logic and that might not work with all the 
> values of the VAR_RELNUMBER_NEW_XLOG_THRESHOLD.  So the idea is we need to 
> keep the flushing logic separate from the logging, I am working on the idea 
> and I will post the patch soon.

I have fixed the issue, so now we will track nextRelFileNumber,
loggedRelFileNumber and flushedRelFileNumber.  So whenever
nextRelFileNumber is just VAR_RELNUMBER_NEW_XLOG_THRESHOLD behind the
loggedRelFileNumber we will log VAR_RELNUMBER_PER_XLOG more
relfilenumbers.  And whenever nextRelFileNumber reaches the
flushedRelFileNumber then we will do XlogFlush for WAL up to the last
loggedRelFileNumber.  Ideally flushedRelFileNumber should always be
VAR_RELNUMBER_PER_XLOG number behind the loggedRelFileNumber so we can
avoid tracking the flushedRelFileNumber.  But I feel keeping track of
the flushedRelFileNumber looks cleaner and easier to understand.  For
more details refer to the code in GetNewRelFileNumber().
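To make that a bit more concrete, the shape of the logic is roughly as
below.  This is a simplified sketch, not the patch code: the three counters,
RelFileNumberGenLock, and the VAR_RELNUMBER_* constants are from the patch,
while LogNextRelFileNumber() and the saved LSN variable are stand-ins for
however the WAL record and its location are actually tracked, and error
handling is omitted.

static XLogRecPtr lastRelFileNumberRecPtr;  /* LSN of the last "log more numbers" record */

RelFileNumber
GetNewRelFileNumber(void)
{
    RelFileNumber result;

    LWLockAcquire(RelFileNumberGenLock, LW_EXCLUSIVE);

    /* Keep a margin of WAL-logged numbers ahead of the allocation counter. */
    if (ShmemVariableCache->loggedRelFileNumber -
        ShmemVariableCache->nextRelFileNumber <= VAR_RELNUMBER_NEW_XLOG_THRESHOLD)
    {
        RelFileNumber newLogged = ShmemVariableCache->loggedRelFileNumber +
            VAR_RELNUMBER_PER_XLOG;

        lastRelFileNumberRecPtr = LogNextRelFileNumber(newLogged);
        ShmemVariableCache->loggedRelFileNumber = newLogged;
    }

    /* Never hand out a number whose covering WAL record may not be on disk. */
    if (ShmemVariableCache->nextRelFileNumber >= ShmemVariableCache->flushedRelFileNumber)
    {
        XLogFlush(lastRelFileNumberRecPtr);
        ShmemVariableCache->flushedRelFileNumber = ShmemVariableCache->loggedRelFileNumber;
    }

    result = ShmemVariableCache->nextRelFileNumber++;

    LWLockRelease(RelFileNumberGenLock);

    return result;
}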

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From d04432467d34ebdc95934ab85409b1fad037c6cd Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Fri, 26 Aug 2022 10:20:18 +0530
Subject: [PATCH v17] Widen relfilenumber from 32 bits to 56 bits
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently relfilenumber is 32 bits wide and that has a risk of wraparound so
the relfilenumber can be reused.  And to guard against the relfilenumber reuse
there is some complicated hack which leaves a 0-length tombstone file around
until the next checkpoint.  And when we allocate a new relfilenumber
we also need to loop to check the on disk conflict.

As part of this patch we are making the relfilenumber 56 bits wide and there will be
no provision for wraparound.  So after this change we will be able to get rid of the
0-length tombstone file and the loop for checking the on-disk conflict of the
relfilenumbers.

The reason behind making it 56 bits wide instead of directly making 64 bits wide is
that if we make it 64 bits wide then the size of the BufferTag will be increased which
will increase the memory usage and that may also impact the performance.  So in order
to avoid that, inside the buffer tag, we will use 8 bits for the fork number and 56 bits
for the relfilenumber.
---
 contrib/pg_buffercache/Makefile|   4 +-
 .../pg_buffercache/pg_buffercache--1.3--1.4.sql|  30 
 contrib/pg_buffercache/pg_buffercache.control  |   2 +-
 contrib/pg_buffercache/pg_buffercache_pages.c  |  39 +++-
 contrib/pg_prewarm/autoprewarm.c   |   4 +-
 contrib/pg_walinspect/expected/pg_walinspect.out   |   4 +-
 contrib/pg_walinspect/sql/pg_walinspect.sql|   4 +-
 contrib/test_decoding/expected/rewrite.out |   2 +-
 contrib/test_decoding/sql/rewrite.sql  |   2 +-
 doc/src/sgml/catalogs.sgml |   2 +-
 doc/src/sgml/pgbuffercache.sgml|   2 +-
 doc/src/sgml/storage.sgml  |   5 +-
 src/backend/access/gin/ginxlog.c   |   2 +-
 src/backend/access/rmgrdesc/gistdesc.c |   2 +-
 src/backend/access/rmgrdesc/heapdesc.c |   2 +-
 src/backend/access/rmgrdesc/nbtdesc.c  |   2 +-
 src/backend/access/rmgrdesc/seqdesc.c  |   2 +-
 src/backend/access/rmgrdesc/xlogdesc.c |  21 ++-
 src/backend/access/transam/README  |   5 +-
 src/backend/access/transam/varsup.c| 199 -
 src/backend/access/transam/xlog.c  |  50 ++
 src/backend/access/transam/xlogprefetcher.c|  14 +-
 src/backend/access/transam/xlogrecovery.c  |   6 +-
 src/backend/access/transam/xlogutils.c |  12 +-
 src/backend/backup/basebackup.c|   2 +-
 src/backend/catalog/catalog.c  |  95 --
 src/backend/catalog/heap.c |  25 +--
 src/backend/catalog/index.c|  11 +-
 src/backend/catalog/storage.c  |   8 +
 src/backend/commands/tablecmds.c   |  12 +-
 src/backend/commands/tablespace.c  |   2 +-
 src/backend/nodes/gen_node_support.pl  |   4 +-
 src/backend/replication/logical/decode.c   |   1 +
 src/backend/replication/logical/reorderbuffer.c|   2 +-
 src/backend/storage/file/reinit.c  |  28 +--
 src/backend/storage/freespace/fsmpage.c|   2 +-
 src/backend/storage/lmgr/lwlocknames.txt   |   1 +
 src/backend/storage/smgr/md.c  |   7 +
 

Re: making relfilenodes 56 bits

2022-09-08 Thread Dilip Kumar
On Tue, Sep 6, 2022 at 11:07 PM Robert Haas  wrote:

> On Tue, Sep 6, 2022 at 4:40 AM Dilip Kumar  wrote:
> > I have explored this area more and also tried to come up with a
> > working prototype, so while working on this I realized that we would
> > have almost to execute all the code which is getting generated as part
> > of the dumpDatabase() and dumpACL() which is basically,
> >
> > 1. UPDATE pg_catalog.pg_database SET datistemplate = false
> > 2. DROP DATABASE
> > 3. CREATE DATABASE with all the database properties like ENCODING,
> > LOCALE_PROVIDER, LOCALE, LC_COLLATE, LC_CTYPE, ICU_LOCALE,
> > COLLATION_VERSION, TABLESPACE
> > 4. COMMENT ON DATABASE
> > 5. Logic inside dumpACL()
> >
> > I feel duplicating logic like this is really error-prone, but I do not
> > find any clear way to reuse the code as dumpDatabase() has a high
> > dependency on the Archive handle and generating the dump file.
>
> Yeah, I don't think this is the way to go at all. The duplicated logic
> is likely to get broken, and is also likely to annoy the next person
> who has to maintain it.
>

Right


> I suggest that for now we fall back on making the initial
> RelFileNumber for a system table equal to pg_class.oid. I don't really
> love that system and I think maybe we should change it at some point
> in the future, but all the alternatives seem too complicated to cram
> them into the current patch.
>

That makes sense.

On a separate note, while reviewing the latest patch I see there is some
risk of using an unflushed relfilenumber in the GetNewRelFileNumber()
function.  Basically, in the current code, the flushing logic is tightly
coupled with the logic that logs new relfilenumbers, and that might not
work for all values of VAR_RELNUMBER_NEW_XLOG_THRESHOLD.  So the idea is
to keep the flushing logic separate from the logging; I am working on it
and will post the patch soon.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: making relfilenodes 56 bits

2022-09-06 Thread Robert Haas
On Tue, Sep 6, 2022 at 4:40 AM Dilip Kumar  wrote:
> I have explored this area more and also tried to come up with a
> working prototype, so while working on this I realized that we would
> have almost to execute all the code which is getting generated as part
> of the dumpDatabase() and dumpACL() which is basically,
>
> 1. UPDATE pg_catalog.pg_database SET datistemplate = false
> 2. DROP DATABASE
> 3. CREATE DATABASE with all the database properties like ENCODING,
> LOCALE_PROVIDER, LOCALE, LC_COLLATE, LC_CTYPE, ICU_LOCALE,
> COLLATION_VERSION, TABLESPACE
> 4. COMMENT ON DATABASE
> 5. Logic inside dumpACL()
>
> I feel duplicating logic like this is really error-prone, but I do not
> find any clear way to reuse the code as dumpDatabase() has a high
> dependency on the Archive handle and generating the dump file.

Yeah, I don't think this is the way to go at all. The duplicated logic
is likely to get broken, and is also likely to annoy the next person
who has to maintain it.

I suggest that for now we fall back on making the initial
RelFileNumber for a system table equal to pg_class.oid. I don't really
love that system and I think maybe we should change it at some point
in the future, but all the alternatives seem too complicated to cram
them into the current patch.
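For illustration, that fallback amounts to something like the following at
relation-creation time.  The helper and its boolean parameter are
hypothetical, just to show the shape of the rule; the patch itself wires
this up differently.

static RelFileNumber
InitialRelFileNumber(Oid relid, bool initial_catalog_creation)
{
    /*
     * At initial catalog creation (initdb) a system table's relfilenumber
     * is simply its pg_class OID, which is always below
     * FirstNormalObjectId.  Everything created later (user tables and
     * table rewrites) draws from the shared 56-bit counter, so the two
     * ranges cannot collide during pg_upgrade.
     */
    if (initial_catalog_creation && relid < FirstNormalObjectId)
        return (RelFileNumber) relid;

    return GetNewRelFileNumber();
}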

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-09-06 Thread Dilip Kumar
On Sun, Sep 4, 2022 at 9:27 AM Dilip Kumar  wrote:
>
> On Sat, Sep 3, 2022 at 5:11 PM Amit Kapila  wrote:

> > Isn't this happening because we are passing "--clean
> > --create"/"--create" options to pg_restore in create_new_objects()? If
> > so, then I think one idea to decouple would be to not use those
> > options. Perform drop/create separately via commands (for create, we
> > need to generate the command as we are generating while generating the
> > dump in custom format), then rewrite the conflicting tables, and
> > finally restore the dump.
>
> Hmm, you are right.  So I think something like this is possible to do,
> I will explore this more. Thanks for the idea.

I have explored this area more and also tried to come up with a
working prototype, so while working on this I realized that we would
have almost to execute all the code which is getting generated as part
of the dumpDatabase() and dumpACL() which is basically,

1. UPDATE pg_catalog.pg_database SET datistemplate = false
2. DROP DATABASE
3. CREATE DATABASE with all the database properties like ENCODING,
LOCALE_PROVIDER, LOCALE, LC_COLLATE, LC_CTYPE, ICU_LOCALE,
COLLATION_VERSION, TABLESPACE
4. COMMENT ON DATABASE
5. Logic inside dumpACL()

I feel duplicating logic like this is really error-prone, but I do not
find any clear way to reuse the code as dumpDatabase() has a high
dependency on the Archive handle and generating the dump file.

So currently I have implemented most of this logic except for a few
e.g. dumpACL(), comments on the database, etc.  So before we go too
far in this direction I wanted to know the opinions of others.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-09-03 Thread Dilip Kumar
On Sat, Sep 3, 2022 at 5:11 PM Amit Kapila  wrote:

> > I have found one more issue with this approach of rewriting the
> > conflicting table.  Earlier I thought we could do the conflict
> > checking and rewriting inside create_new_objects() right before the
> > restore command.  But after implementing (while testing) this I
> > realized that we DROP and CREATE the database while restoring the dump
> > that means it will again generate the conflicting system tables.  So
> > theoretically the rewriting should go in between the CREATE DATABASE
> > and restoring the object but as of now both create database and
> > restoring other objects are part of a single dump file.  I haven't yet
> > analyzed how feasible it is to generate the dump in two parts, first
> > part just to create the database and in second part restore the rest
> > of the object.
> >
>
> Isn't this happening because we are passing "--clean
> --create"/"--create" options to pg_restore in create_new_objects()? If
> so, then I think one idea to decouple would be to not use those
> options. Perform drop/create separately via commands (for create, we
> need to generate the command as we are generating while generating the
> dump in custom format), then rewrite the conflicting tables, and
> finally restore the dump.

Hmm, you are right.  So I think something like this is possible to do,
I will explore this more. Thanks for the idea.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-09-03 Thread Amit Kapila
On Tue, Aug 30, 2022 at 6:15 PM Dilip Kumar  wrote:
>
> On Fri, Aug 26, 2022 at 9:33 PM Robert Haas  wrote:
> >
> > On Fri, Aug 26, 2022 at 7:01 AM Dilip Kumar  wrote:
> > > While working on this solution I noticed one issue. Basically, the
> > > problem is that during binary upgrade when we try to rewrite a heap we
> > > would expect that “binary_upgrade_next_heap_pg_class_oid” and
> > > “binary_upgrade_next_heap_pg_class_relfilenumber” are already set for
> > > creating a new heap. But we are not preserving anything so we don't
> > > have those values. One option to this problem is that we can first
> > > start the postmaster in non-binary upgrade mode perform all conflict
> > > checking and rewrite and stop the postmaster.  Then start postmaster
> > > again and perform the restore as we are doing now.  Although we will
> > > have to start/stop the postmaster one extra time we have a solution.
> >
> > Yeah, that seems OK. Or we could add a new function, like
> > binary_upgrade_allow_relation_oid_and_relfilenode_assignment(bool).
> > Not sure which way is better.
>
> I have found one more issue with this approach of rewriting the
> conflicting table.  Earlier I thought we could do the conflict
> checking and rewriting inside create_new_objects() right before the
> restore command.  But after implementing (while testing) this I
> realized that we DROP and CREATE the database while restoring the dump
> that means it will again generate the conflicting system tables.  So
> theoretically the rewriting should go in between the CREATE DATABASE
> and restoring the object but as of now both create database and
> restoring other objects are part of a single dump file.  I haven't yet
> analyzed how feasible it is to generate the dump in two parts, first
> part just to create the database and in second part restore the rest
> of the object.
>

Isn't this happening because we are passing "--clean
--create"/"--create" options to pg_restore in create_new_objects()? If
so, then I think one idea to decouple would be to not use those
options. Perform drop/create separately via commands (for create, we
need to generate the command as we are generating while generating the
dump in custom format), then rewrite the conflicting tables, and
finally restore the dump.
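Roughly, create_new_objects() would then loop over the databases along the
following lines.  The three helper functions are placeholders for steps
that do not exist today; only the ordering is the point of the sketch.

static void
create_new_objects_decoupled(void)
{
    int         dbnum;

    for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
    {
        DbInfo     *olddb = &old_cluster.dbarr.dbs[dbnum];

        /* 1. Drop and re-create the database with plain commands, using
         * the same properties pg_dump would have emitted for it. */
        recreate_database(olddb);

        /* 2. Rewrite any system table in the new database whose
         * relfilenode collides with a relation about to be restored. */
        rewrite_conflicting_system_tables(olddb);

        /* 3. Restore the per-database dump, now without --clean/--create. */
        restore_database_schema(olddb);
    }
}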

-- 
With Regards,
Amit Kapila.




Re: making relfilenodes 56 bits

2022-09-03 Thread Dilip Kumar
On Sat, Sep 3, 2022 at 1:50 PM Dilip Kumar  wrote:
>
> On Tue, Aug 30, 2022 at 9:23 PM Robert Haas  wrote:
>
> > Well, that's very awkward. It doesn't seem like it would be very
> > difficult to teach pg_upgrade to call pg_restore without --clean and
> > just do the drop database itself, but that doesn't really help,
> > because pg_restore will in any event be creating the new database.
> > That doesn't seem like something we can practically refactor out,
> > because only pg_dump knows what properties to use when creating the
> > new database. What we could do is have the dump include a command like
> > SELECT pg_binary_upgrade_move_things_out_of_the_way(some_arguments_here),
> > but that doesn't really help very much, because passing the whole list
> > of relfilenode values from the old database seems pretty certain to be
> > a bad idea. The whole idea here was that we'd be able to build a hash
> > table on the new database's system table OIDs, and it seems like
> > that's not going to work.
>
> Right.
>
> > We could try to salvage some portion of the idea by making
> > pg_binary_upgrade_move_things_out_of_the_way() take a more restricted
> > set of arguments, like the smallest and largest relfilenode values
> > from the old database, and then we'd just need to move things that
> > overlap. But that feels pretty hit-or-miss to me as to whether it
> > actually avoids any work, and
> > pg_binary_upgrade_move_things_out_of_the_way() might also be annoying
> > to write. So perhaps we have to go back to the drawing board here.
>
> So as of now, we have two open options 1) the current approach and
> what patch is following to use Oid as relfilenode for the system
> tables when initially created.  2) call
> pg_binary_upgrade_move_things_out_of_the_way() which force rewrite all
> the system tables.
>
> Another idea that I am not very sure how feasible is. Can we change
> the dump such that in binary upgrade mode it will not use template0 as
> a template database (in creating database command) but instead some
> new database as a template e.g. template-XYZ?   And later for conflict
> checking, we will create this template-XYZ database on the new cluster
> and then we will perform all the conflict check (from all the
> databases of the old cluster) and rewrite operations on this database.
> And later all the databases will be created using template-XYZ as the
> template and all the rewriting stuff we have done is still intact.
> The problems I could think of are 1) only for a binary upgrade we will
> have to change the pg_dump.  2) we will have to use another database
> name as the reserved database name but what if that name is already in
> use in the previous cluster?

While we are still thinking on this issue, I have rebased the patch on
the latest head and fixed a couple of minor issues.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From a645d39c6109269bb74ed8f0627e61ecf9b0535a Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Fri, 26 Aug 2022 10:20:18 +0530
Subject: [PATCH v16] Widen relfilenumber from 32 bits to 56 bits
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently relfilenumber is 32 bits wide and that has a risk of wraparound so
the relfilenumber can be reused.  And to guard against the relfilenumber reuse
there is some complicated hack which leaves a 0-length tombstone file around
until the next checkpoint.  And when we allocate a new relfilenumber
we also need to loop to check the on disk conflict.

As part of this patch we are making the relfilenumber 56 bits wide and there will be
no provision for wraparound.  So after this change we will be able to get rid of the
0-length tombstone file and the loop for checking the on-disk conflict of the
relfilenumbers.

The reason behind making it 56 bits wide instead of directly making 64 bits wide is
that if we make it 64 bits wide then the size of the BufferTag will be increased which
will increase the memory usage and that may also impact the performance.  So in order
to avoid that, inside the buffer tag, we will use 8 bits for the fork number and 56 bits
for the relfilenumber.
---
 contrib/pg_buffercache/Makefile|   4 +-
 .../pg_buffercache/pg_buffercache--1.3--1.4.sql|  30 
 contrib/pg_buffercache/pg_buffercache.control  |   2 +-
 contrib/pg_buffercache/pg_buffercache_pages.c  |  39 -
 contrib/pg_prewarm/autoprewarm.c   |   4 +-
 contrib/pg_walinspect/expected/pg_walinspect.out   |   4 +-
 contrib/pg_walinspect/sql/pg_walinspect.sql|   4 +-
 contrib/test_decoding/expected/rewrite.out |   2 +-
 contrib/test_decoding/sql/rewrite.sql  |   2 +-
 doc/src/sgml/catalogs.sgml |   2 +-
 doc/src/sgml/pgbuffercache.sgml|   2 +-
 doc/src/sgml/storage.sgml  |   5 +-
 src/backend/access/gin/ginxlog.c   |   2 +-
 

Re: making relfilenodes 56 bits

2022-09-03 Thread Dilip Kumar
On Tue, Aug 30, 2022 at 9:23 PM Robert Haas  wrote:

> Well, that's very awkward. It doesn't seem like it would be very
> difficult to teach pg_upgrade to call pg_restore without --clean and
> just do the drop database itself, but that doesn't really help,
> because pg_restore will in any event be creating the new database.
> That doesn't seem like something we can practically refactor out,
> because only pg_dump knows what properties to use when creating the
> new database. What we could do is have the dump include a command like
> SELECT pg_binary_upgrade_move_things_out_of_the_way(some_arguments_here),
> but that doesn't really help very much, because passing the whole list
> of relfilenode values from the old database seems pretty certain to be
> a bad idea. The whole idea here was that we'd be able to build a hash
> table on the new database's system table OIDs, and it seems like
> that's not going to work.

Right.

> We could try to salvage some portion of the idea by making
> pg_binary_upgrade_move_things_out_of_the_way() take a more restricted
> set of arguments, like the smallest and largest relfilenode values
> from the old database, and then we'd just need to move things that
> overlap. But that feels pretty hit-or-miss to me as to whether it
> actually avoids any work, and
> pg_binary_upgrade_move_things_out_of_the_way() might also be annoying
> to write. So perhaps we have to go back to the drawing board here.

So as of now, we have two open options 1) the current approach and
what patch is following to use Oid as relfilenode for the system
tables when initially created.  2) call
pg_binary_upgrade_move_things_out_of_the_way() which force rewrite all
the system tables.

Another idea, though I am not sure how feasible it is: can we change
the dump such that in binary upgrade mode it will not use template0 as
the template database (in the CREATE DATABASE command) but instead some
new database as the template, e.g. template-XYZ?  And later, for conflict
checking, we will create this template-XYZ database on the new cluster
and then we will perform all the conflict check (from all the
databases of the old cluster) and rewrite operations on this database.
And later all the databases will be created using template-XYZ as the
template and all the rewriting stuff we have done is still intact.
The problems I could think of are 1) only for a binary upgrade we will
have to change the pg_dump.  2) we will have to use another database
name as the reserved database name but what if that name is already in
use in the previous cluster?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-30 Thread Robert Haas
On Tue, Aug 30, 2022 at 8:45 AM Dilip Kumar  wrote:
> I have found one more issue with this approach of rewriting the
> conflicting table.  Earlier I thought we could do the conflict
> checking and rewriting inside create_new_objects() right before the
> restore command.  But after implementing (while testing) this I
> realized that we DROP and CREATE the database while restoring the dump
> that means it will again generate the conflicting system tables.  So
> theoretically the rewriting should go in between the CREATE DATABASE
> and restoring the object but as of now both create database and
> restoring other objects are part of a single dump file.  I haven't yet
> analyzed how feasible it is to generate the dump in two parts, first
> part just to create the database and in second part restore the rest
> of the object.
>
> Thoughts?

Well, that's very awkward. It doesn't seem like it would be very
difficult to teach pg_upgrade to call pg_restore without --clean and
just do the drop database itself, but that doesn't really help,
because pg_restore will in any event be creating the new database.
That doesn't seem like something we can practically refactor out,
because only pg_dump knows what properties to use when creating the
new database. What we could do is have the dump include a command like
SELECT pg_binary_upgrade_move_things_out_of_the_way(some_arguments_here),
but that doesn't really help very much, because passing the whole list
of relfilenode values from the old database seems pretty certain to be
a bad idea. The whole idea here was that we'd be able to build a hash
table on the new database's system table OIDs, and it seems like
that's not going to work.

We could try to salvage some portion of the idea by making
pg_binary_upgrade_move_things_out_of_the_way() take a more restricted
set of arguments, like the smallest and largest relfilenode values
from the old database, and then we'd just need to move things that
overlap. But that feels pretty hit-or-miss to me as to whether it
actually avoids any work, and
pg_binary_upgrade_move_things_out_of_the_way() might also be annoying
to write. So perhaps we have to go back to the drawing board here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-30 Thread Dilip Kumar
On Fri, Aug 26, 2022 at 9:33 PM Robert Haas  wrote:
>
> On Fri, Aug 26, 2022 at 7:01 AM Dilip Kumar  wrote:
> > While working on this solution I noticed one issue. Basically, the
> > problem is that during binary upgrade when we try to rewrite a heap we
> > would expect that “binary_upgrade_next_heap_pg_class_oid” and
> > “binary_upgrade_next_heap_pg_class_relfilenumber” are already set for
> > creating a new heap. But we are not preserving anything so we don't
> > have those values. One option to this problem is that we can first
> > start the postmaster in non-binary upgrade mode perform all conflict
> > checking and rewrite and stop the postmaster.  Then start postmaster
> > again and perform the restore as we are doing now.  Although we will
> > have to start/stop the postmaster one extra time we have a solution.
>
> Yeah, that seems OK. Or we could add a new function, like
> binary_upgrade_allow_relation_oid_and_relfilenode_assignment(bool).
> Not sure which way is better.

I have found one more issue with this approach of rewriting the
conflicting table.  Earlier I thought we could do the conflict
checking and rewriting inside create_new_objects() right before the
restore command.  But after implementing (while testing) this I
realized that we DROP and CREATE the database while restoring the dump
that means it will again generate the conflicting system tables.  So
theoretically the rewriting should go in between the CREATE DATABASE
and restoring the object but as of now both create database and
restoring other objects are part of a single dump file.  I haven't yet
analyzed how feasible it is to generate the dump in two parts, first
part just to create the database and in second part restore the rest
of the object.

Thoughts?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-26 Thread Robert Haas
On Fri, Aug 26, 2022 at 7:01 AM Dilip Kumar  wrote:
> While working on this solution I noticed one issue. Basically, the
> problem is that during binary upgrade when we try to rewrite a heap we
> would expect that “binary_upgrade_next_heap_pg_class_oid” and
> “binary_upgrade_next_heap_pg_class_relfilenumber” are already set for
> creating a new heap. But we are not preserving anything so we don't
> have those values. One option to this problem is that we can first
> start the postmaster in non-binary upgrade mode perform all conflict
> checking and rewrite and stop the postmaster.  Then start postmaster
> again and perform the restore as we are doing now.  Although we will
> have to start/stop the postmaster one extra time we have a solution.

Yeah, that seems OK. Or we could add a new function, like
binary_upgrade_allow_relation_oid_and_relfilenode_assignment(bool).
Not sure which way is better.

> But while thinking about this I started to think that since now we are
> completely decoupling the concept of Oid and relfilenumber then
> logically during REWRITE we should only be allocating new
> relfilenumber but we don’t really need to allocate the Oid at all.
> Yeah, we can do that if inside make_new_heap() if we pass the
> OIDOldHeap to heap_create_with_catalog(), then it will just create new
> storage(relfilenumber) but not a new Oid. But the problem is that the
> ATRewriteTable() and finish_heap_swap() functions are completely based
> on the relation cache.  So now if we only create a new relfilenumber
> but not a new Oid then we will have to change this infrastructure to
> copy at smgr level.

I think it would be a good idea to continue preserving the OIDs. If
nothing else, it makes debugging way easier, but also, there might be
user-defined regclass columns or something. Note the comments in
check_for_reg_data_type_usage().

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-26 Thread Dilip Kumar
On Thu, Aug 25, 2022 at 5:26 PM Dilip Kumar  wrote:

> I agree on this that this system is easy to explain that we just
> rewrite anything that conflicts so looks more future-proof.  Okay, I
> will try this solution and post the patch.

While working on this solution I noticed one issue. Basically, the
problem is that during binary upgrade when we try to rewrite a heap we
would expect that “binary_upgrade_next_heap_pg_class_oid” and
“binary_upgrade_next_heap_pg_class_relfilenumber” are already set for
creating a new heap. But we are not preserving anything so we don't
have those values. One option to this problem is that we can first
start the postmaster in non-binary upgrade mode perform all conflict
checking and rewrite and stop the postmaster.  Then start postmaster
again and perform the restore as we are doing now.  Although we will
have to start/stop the postmaster one extra time we have a solution.

But while thinking about this, it occurred to me that since we are now
completely decoupling the concepts of Oid and relfilenumber, logically
during a REWRITE we should only be allocating a new relfilenumber; we
don't really need to allocate a new Oid at all.  We could do that: if
inside make_new_heap() we pass OIDOldHeap to heap_create_with_catalog(),
it will just create new storage (a new relfilenumber) but not a new Oid.
But the problem is that the ATRewriteTable() and finish_heap_swap()
functions are completely based on the relation cache, so if we only
create a new relfilenumber and not a new Oid, we would have to change
this infrastructure to copy at the smgr level.

Thoughts?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-25 Thread Dilip Kumar
On Tue, Aug 23, 2022 at 8:00 PM Robert Haas  wrote:
>
> On Tue, Aug 23, 2022 at 2:06 AM Dilip Kumar  wrote:
> > OTOH, if we keep the two separate ranges for the user and system table
> > then we don't need all this complex logic of conflict checking.
>
> True. That's the downside. The question is whether it's worth adding
> some complexity to avoid needing separate ranges.

Other than complexity, we will have to probe the relfilenumber of every
user table from the old cluster against the hash table built on the new
cluster's relfilenumbers; isn't that extra overhead if there are a lot of
user tables?  But I think we are already restoring all those tables in
the new cluster, so compared to that it will be very small.

> Honestly, if we don't care about having separate ranges, we can do
> something even simpler and just make the starting relfilenumber for
> system tables same as the OID. Then we don't have to do anything at
> all, outside of not changing the OID assigned to pg_largeobject in a
> future release. Then as long as pg_upgrade is targeting a new cluster
> with completely fresh databases that have not had any system table
> rewrites so far, there can't be any conflict.

I think having the OID-based system and having two ranges are not
exactly the same.  If we use OID-based relfilenumber allocation for
system tables (initially) and then later allocate from the
nextRelFileNumber counter, it seems like a mix of the old system (where
OID and relfilenumber really are tightly connected) and the new system,
where nextRelFileNumber is a completely independent counter.  OTOH,
having two ranges means we are not logically depending on the OID at
all; we just allocate from a central counter, but after catalog
initialization we leave some gap and start from a new range.  So I
don't think this system is hard to explain.

> And perhaps that is the best solution after all, but while it is
> simple in terms of code, I feel it's a bit complicated for human
> beings. It's very simple to understand the scheme that Amit proposed:
> if there's anything in the new cluster that would conflict, we move it
> out of the way. We don't have to assume the new cluster hasn't had any
> table rewrites. We don't have to nail down starting relfilenumber
> assignments for system tables. We don't have to worry about
> relfilenumber or OID assignments changing between releases.
> pg_largeobject is not a special case. There are no special ranges of
> OIDs or relfilenumbers required. It just straight up works -- all the
> time, no matter what, end of story.

I agree on this that this system is easy to explain that we just
rewrite anything that conflicts so looks more future-proof.  Okay, I
will try this solution and post the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-24 Thread Robert Haas
On Mon, Aug 1, 2022 at 7:57 AM Dilip Kumar  wrote:
> I have fixed other comments, and also fixed comments from Alvaro to
> use %lld instead of INT64_FORMAT inside the ereport and wherever he
> suggested.

Notwithstanding the ongoing discussion about the exact approach for
the main patch, it seemed OK to push the preparatory patch you posted
here, so I have now done that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-24 Thread Amit Kapila
On Tue, Aug 23, 2022 at 3:28 PM Dilip Kumar  wrote:
>
> On Tue, Aug 23, 2022 at 3:16 PM Amit Kapila  wrote:
>
> > One more thing we may want to think about is what if there are tables
> > created by extension? For example, I think BDR creates some tables
> > like node_group, conflict_history, etc. Now, I think if such an
> > extension is created by default, both old and new clusters will have
> > those tables. Isn't there a chance of relfilenumber conflict in such
> > cases?
>
> Shouldn't they behave as a normal user table? because before upgrade
> anyway new cluster can not have any table other than system tables and
> those tables created by an extension should also be restored as other
> user table does.
>

Right.

-- 
With Regards,
Amit Kapila.




Re: making relfilenodes 56 bits

2022-08-24 Thread Amit Kapila
On Tue, Aug 23, 2022 at 8:00 PM Robert Haas  wrote:
>
> On Tue, Aug 23, 2022 at 2:06 AM Dilip Kumar  wrote:
> > OTOH, if we keep the two separate ranges for the user and system table
> > then we don't need all this complex logic of conflict checking.
>
> True. That's the downside. The question is whether it's worth adding
> some complexity to avoid needing separate ranges.
>
> Honestly, if we don't care about having separate ranges, we can do
> something even simpler and just make the starting relfilenumber for
> system tables same as the OID. Then we don't have to do anything at
> all, outside of not changing the OID assigned to pg_largeobject in a
> future release. Then as long as pg_upgrade is targeting a new cluster
> with completely fresh databases that have not had any system table
> rewrites so far, there can't be any conflict.
>
> And perhaps that is the best solution after all, but while it is
> simple in terms of code, I feel it's a bit complicated for human
> beings. It's very simple to understand the scheme that Amit proposed:
> if there's anything in the new cluster that would conflict, we move it
> out of the way. We don't have to assume the new cluster hasn't had any
> table rewrites. We don't have to nail down starting relfilenumber
> assignments for system tables. We don't have to worry about
> relfilenumber or OID assignments changing between releases.
> pg_largeobject is not a special case. There are no special ranges of
> OIDs or relfilenumbers required. It just straight up works -- all the
> time, no matter what, end of story.
>

This sounds simple to understand. It seems we always create new system
tables in the new cluster before the upgrade, so I think it is safe to
assume there won't be any table rewrite in it. OTOH, if the
relfilenumber allocation scheme is robust enough to deal with table
rewrites, then we probably don't need to worry about this assumption
changing in the future.

-- 
With Regards,
Amit Kapila.




Re: making relfilenodes 56 bits

2022-08-23 Thread Robert Haas
On Tue, Aug 23, 2022 at 2:06 AM Dilip Kumar  wrote:
> OTOH, if we keep the two separate ranges for the user and system table
> then we don't need all this complex logic of conflict checking.

True. That's the downside. The question is whether it's worth adding
some complexity to avoid needing separate ranges.

Honestly, if we don't care about having separate ranges, we can do
something even simpler and just make the starting relfilenumber for
system tables same as the OID. Then we don't have to do anything at
all, outside of not changing the OID assigned to pg_largeobject in a
future release. Then as long as pg_upgrade is targeting a new cluster
with completely fresh databases that have not had any system table
rewrites so far, there can't be any conflict.

And perhaps that is the best solution after all, but while it is
simple in terms of code, I feel it's a bit complicated for human
beings. It's very simple to understand the scheme that Amit proposed:
if there's anything in the new cluster that would conflict, we move it
out of the way. We don't have to assume the new cluster hasn't had any
table rewrites. We don't have to nail down starting relfilenumber
assignments for system tables. We don't have to worry about
relfilenumber or OID assignments changing between releases.
pg_largeobject is not a special case. There are no special ranges of
OIDs or relfilenumbers required. It just straight up works -- all the
time, no matter what, end of story.

The other schemes we're talking about here all require a bunch of
assumptions about stuff like what I just mentioned. We can certainly
do it that way, and maybe it's even for the best. But I feel like it's
a little bit fragile. Maybe some future change gets blocked because it
would break one of the assumptions that the system relies on, or maybe
someone doesn't even realize there's an issue and changes something
that introduces a bug into this system. Or on the other hand maybe
not. But I think there's at least some value in considering whether
adding a little more code might actually make things simpler to reason
about, and whether that might be a good enough reason to do it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-23 Thread Dilip Kumar
On Tue, Aug 23, 2022 at 3:16 PM Amit Kapila  wrote:
>
> > OTOH, if we keep the two separate ranges for the user and system table
> > then we don't need all this complex logic of conflict checking.  From
> > the old cluster, we can just remember the relfilenumber of the
> > pg_largeobject, and in the new cluster before trying to restore we can
> > just query the new cluster pg_class and find out whether it is used by
> > any system table and if so then we can just rewrite that system table.
> >
>
> Before re-write of that system table, I think we need to set
> NextRelFileNumber to a number greater than the max relfilenumber from
> the old cluster, otherwise, it can later conflict with some user
> table.

Yes we will need to do that.

> > And I think using two ranges might not be that complicated because as
> > soon as we are done with the initdb we can just set NextRelFileNumber
> > to the first user range relfilenumber so I think this could be the
> > simplest solution.  And I think what Amit is suggesting is something
> > on this line?
> >
>
> Yeah, I had thought of checking only pg_largeobject. I think the
> advantage of having separate ranges is that we have a somewhat simpler
> logic in the upgrade but OTOH the other scheme has the advantage of
> having a single allocation scheme. Do we see any other pros/cons of
> one over the other?

I feel having a separate range is not much different from having a
single allocation scheme; after cluster initialization, we will just
have to set NextRelFileNumber to something called
FirstNormalRelFileNumber, which looks fine to me.
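A minimal sketch of that, assuming a FirstNormalRelFileNumber constant
analogous to FirstNormalObjectId; the constant's value and the helper's
name here are arbitrary, purely to show the shape of the idea:

#define FirstNormalRelFileNumber   ((RelFileNumber) 100000)

void
SetNextRelFileNumberAfterBootstrap(void)
{
    LWLockAcquire(RelFileNumberGenLock, LW_EXCLUSIVE);

    /* Leave the low range to the catalogs created at initdb time. */
    if (ShmemVariableCache->nextRelFileNumber < FirstNormalRelFileNumber)
        ShmemVariableCache->nextRelFileNumber = FirstNormalRelFileNumber;

    LWLockRelease(RelFileNumberGenLock);
}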

> One more thing we may want to think about is what if there are tables
> created by extension? For example, I think BDR creates some tables
> like node_group, conflict_history, etc. Now, I think if such an
> extension is created by default, both old and new clusters will have
> those tables. Isn't there a chance of relfilenumber conflict in such
> cases?

Shouldn't they behave as normal user tables?  Before the upgrade the new
cluster cannot have any tables other than system tables anyway, and the
tables created by an extension should be restored in the same way as any
other user table.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-23 Thread Amit Kapila
On Tue, Aug 23, 2022 at 11:36 AM Dilip Kumar  wrote:
>
> On Tue, Aug 23, 2022 at 8:33 AM Dilip Kumar  wrote:
> >
> > On Tue, Aug 23, 2022 at 1:46 AM Robert Haas  wrote:
> > >
> > > On Mon, Aug 22, 2022 at 3:55 AM Amit Kapila  
> > > wrote:
> > > > To solve that problem, how about rewriting the system table in the new
> > > > cluster which has a conflicting relfilenode? I think we can probably
> > > > do this conflict checking before processing the tables from the old
> > > > cluster.
> > >
> > > Thanks for chiming in.
> > >
> > > Right now, there are two parts to the relfilenumber preservation
> > > system, and this scheme doesn't quite fit into either of them. First,
> > > the dump includes commands to set pg_class.relfilenode in the new
> > > cluster to the same value that it had in the old cluster. The dump
> > > can't include any SQL commands that depend on what's happening in the
> > > new cluster because pg_dump(all) only connects to a single cluster,
> > > which in this case is the old cluster. Second, pg_upgrade itself
> > > copies the files from the old cluster to the new cluster. This doesn't
> > > involve a database connection at all. So there's no part of the
> > > current relfilenode preservation mechanism that can look at the old
> > > AND the new database and decide on some SQL to execute against the new
> > > database.
> > >
> > > I thought for a while that we could use the information that's already
> > > gathered by get_rel_infos() to do what you're suggesting here, but it
> > > doesn't quite work, because that function excludes system tables, and
> > > we can't afford to do that here. We'd either need to modify that query
> > > to include system tables - at least for the new cluster - or run a
> > > separate one to gather information about system tables in the new
> > > cluster. Then, we could put all the pg_class.relfilenode values we
> > > found in the new cluster into a hash table, loop over the list of rels
> > > this function found in the old cluster, and for each one, probe into
> > > the hash table. If we find a match, that's a system table that needs
> > > to be moved out of the way before calling create_new_objects(), or
> > > maybe inside that function but before it runs pg_restore.
> > >
> > > That doesn't seem too crazy, I think. It's a little bit of new
> > > mechanism, but it doesn't sound horrific. It's got the advantage of
> > > being significantly cheaper than my proposal of moving everything out
> > > of the way unconditionally, and at the same time it retains one of the
> > > key advantages of that proposal - IMV, anyway - which is that we don't
> > > need separate relfilenode ranges for user and system objects any more.
> > > So I guess on balance I kind of like it, but maybe I'm missing
> > > something.
> >
> > Okay, so this seems exactly the same as your previous proposal but
> > instead of unconditionally rewriting all the system tables we will
> rewrite only those that conflict with the user table or pg_largeobject from
> > the previous cluster.  Although it might have additional
> > implementation complexity on the pg upgrade side, it seems cheaper
> > than rewriting everything.
>
> OTOH, if we keep the two separate ranges for the user and system table
> then we don't need all this complex logic of conflict checking.  From
> the old cluster, we can just remember the relfilenumber of the
> pg_largeobject, and in the new cluster before trying to restore we can
> just query the new cluster pg_class and find out whether it is used by
> any system table and if so then we can just rewrite that system table.
>

Before re-write of that system table, I think we need to set
NextRelFileNumber to a number greater than the max relfilenumber from
the old cluster, otherwise, it can later conflict with some user
table.

> And I think using two ranges might not be that complicated because as
> soon as we are done with the initdb we can just set NextRelFileNumber
> to the first user range relfilenumber so I think this could be the
> simplest solution.  And I think what Amit is suggesting is something
> on this line?
>

Yeah, I had thought of checking only pg_largeobject. I think the
advantage of having separate ranges is that we have a somewhat simpler
logic in the upgrade but OTOH the other scheme has the advantage of
having a single allocation scheme. Do we see any other pros/cons of
one over the other?

One more thing we may want to think about is what if there are tables
created by extension? For example, I think BDR creates some tables
like node_group, conflict_history, etc. Now, I think if such an
extension is created by default, both old and new clusters will have
those tables. Isn't there a chance of relfilenumber conflict in such
cases?

-- 
With Regards,
Amit Kapila.




Re: making relfilenodes 56 bits

2022-08-23 Thread Dilip Kumar
On Tue, Aug 23, 2022 at 8:33 AM Dilip Kumar  wrote:
>
> On Tue, Aug 23, 2022 at 1:46 AM Robert Haas  wrote:
> >
> > On Mon, Aug 22, 2022 at 3:55 AM Amit Kapila  wrote:
> > > To solve that problem, how about rewriting the system table in the new
> > > cluster which has a conflicting relfilenode? I think we can probably
> > > do this conflict checking before processing the tables from the old
> > > cluster.
> >
> > Thanks for chiming in.
> >
> > Right now, there are two parts to the relfilenumber preservation
> > system, and this scheme doesn't quite fit into either of them. First,
> > the dump includes commands to set pg_class.relfilenode in the new
> > cluster to the same value that it had in the old cluster. The dump
> > can't include any SQL commands that depend on what's happening in the
> > new cluster because pg_dump(all) only connects to a single cluster,
> > which in this case is the old cluster. Second, pg_upgrade itself
> > copies the files from the old cluster to the new cluster. This doesn't
> > involve a database connection at all. So there's no part of the
> > current relfilenode preservation mechanism that can look at the old
> > AND the new database and decide on some SQL to execute against the new
> > database.
> >
> > I thought for a while that we could use the information that's already
> > gathered by get_rel_infos() to do what you're suggesting here, but it
> > doesn't quite work, because that function excludes system tables, and
> > we can't afford to do that here. We'd either need to modify that query
> > to include system tables - at least for the new cluster - or run a
> > separate one to gather information about system tables in the new
> > cluster. Then, we could put all the pg_class.relfilenode values we
> > found in the new cluster into a hash table, loop over the list of rels
> > this function found in the old cluster, and for each one, probe into
> > the hash table. If we find a match, that's a system table that needs
> > to be moved out of the way before calling create_new_objects(), or
> > maybe inside that function but before it runs pg_restore.
> >
> > That doesn't seem too crazy, I think. It's a little bit of new
> > mechanism, but it doesn't sound horrific. It's got the advantage of
> > being significantly cheaper than my proposal of moving everything out
> > of the way unconditionally, and at the same time it retains one of the
> > key advantages of that proposal - IMV, anyway - which is that we don't
> > need separate relfilenode ranges for user and system objects any more.
> > So I guess on balance I kind of like it, but maybe I'm missing
> > something.
>
> Okay, so this seems exactly the same as your previous proposal but
> instead of unconditionally rewriting all the system tables we will
> rewrite only those that conflict with the user table or pg_largeobject from
> the previous cluster.  Although it might have additional
> implementation complexity on the pg upgrade side, it seems cheaper
> than rewriting everything.

OTOH, if we keep the two separate ranges for the user and system table
then we don't need all this complex logic of conflict checking.  From
the old cluster, we can just remember the relfilenumber of the
pg_largeobject, and in the new cluster before trying to restore we can
just query the new cluster pg_class and find out whether it is used by
any system table and if so then we can just rewrite that system table.
And I think using two ranges might not be that complicated because as
soon as we are done with the initdb we can just set NextRelFileNumber
to the first user range relfilenumber so I think this could be the
simplest solution.  And I think what Amit is suggesting is something
on this line?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-22 Thread Dilip Kumar
On Tue, Aug 23, 2022 at 1:46 AM Robert Haas  wrote:
>
> On Mon, Aug 22, 2022 at 3:55 AM Amit Kapila  wrote:
> > To solve that problem, how about rewriting the system table in the new
> > cluster which has a conflicting relfilenode? I think we can probably
> > do this conflict checking before processing the tables from the old
> > cluster.
>
> Thanks for chiming in.
>
> Right now, there are two parts to the relfilenumber preservation
> system, and this scheme doesn't quite fit into either of them. First,
> the dump includes commands to set pg_class.relfilenode in the new
> cluster to the same value that it had in the old cluster. The dump
> can't include any SQL commands that depend on what's happening in the
> new cluster because pg_dump(all) only connects to a single cluster,
> which in this case is the old cluster. Second, pg_upgrade itself
> copies the files from the old cluster to the new cluster. This doesn't
> involve a database connection at all. So there's no part of the
> current relfilenode preservation mechanism that can look at the old
> AND the new database and decide on some SQL to execute against the new
> database.
>
> I thought for a while that we could use the information that's already
> gathered by get_rel_infos() to do what you're suggesting here, but it
> doesn't quite work, because that function excludes system tables, and
> we can't afford to do that here. We'd either need to modify that query
> to include system tables - at least for the new cluster - or run a
> separate one to gather information about system tables in the new
> cluster. Then, we could put all the pg_class.relfilenode values we
> found in the new cluster into a hash table, loop over the list of rels
> this function found in the old cluster, and for each one, probe into
> the hash table. If we find a match, that's a system table that needs
> to be moved out of the way before calling create_new_objects(), or
> maybe inside that function but before it runs pg_restore.
>
> That doesn't seem too crazy, I think. It's a little bit of new
> mechanism, but it doesn't sound horrific. It's got the advantage of
> being significantly cheaper than my proposal of moving everything out
> of the way unconditionally, and at the same time it retains one of the
> key advantages of that proposal - IMV, anyway - which is that we don't
> need separate relfilenode ranges for user and system objects any more.
> So I guess on balance I kind of like it, but maybe I'm missing
> something.

Okay, so this seems exactly the same as your previous proposal, but
instead of unconditionally rewriting all the system tables we will
rewrite only those that conflict with the user table or pg_largeobject from
the previous cluster.  Although it might have additional
implementation complexity on the pg_upgrade side, it seems cheaper
than rewriting everything.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-22 Thread Robert Haas
On Mon, Aug 22, 2022 at 3:55 AM Amit Kapila  wrote:
> To solve that problem, how about rewriting the system table in the new
> cluster which has a conflicting relfilenode? I think we can probably
> do this conflict checking before processing the tables from the old
> cluster.

Thanks for chiming in.

Right now, there are two parts to the relfilenumber preservation
system, and this scheme doesn't quite fit into either of them. First,
the dump includes commands to set pg_class.relfilenode in the new
cluster to the same value that it had in the old cluster. The dump
can't include any SQL commands that depend on what's happening in the
new cluster because pg_dump(all) only connects to a single cluster,
which in this case is the old cluster. Second, pg_upgrade itself
copies the files from the old cluster to the new cluster. This doesn't
involve a database connection at all. So there's no part of the
current relfilenode preservation mechanism that can look at the old
AND the new database and decide on some SQL to execute against the new
database.

I thought for a while that we could use the information that's already
gathered by get_rel_infos() to do what you're suggesting here, but it
doesn't quite work, because that function excludes system tables, and
we can't afford to do that here. We'd either need to modify that query
to include system tables - at least for the new cluster - or run a
separate one to gather information about system tables in the new
cluster. Then, we could put all the pg_class.relfilenode values we
found in the new cluster into a hash table, loop over the list of rels
this function found in the old cluster, and for each one, probe into
the hash table. If we find a match, that's a system table that needs
to be moved out of the way before calling create_new_objects(), or
maybe inside that function but before it runs pg_restore.

That doesn't seem too crazy, I think. It's a little bit of new
mechanism, but it doesn't sound horrific. It's got the advantage of
being significantly cheaper than my proposal of moving everything out
of the way unconditionally, and at the same time it retains one of the
key advantages of that proposal - IMV, anyway - which is that we don't
need separate relfilenode ranges for user and system objects any more.
So I guess on balance I kind of like it, but maybe I'm missing
something.
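Concretely, the probing step could look something like the sketch below.
RelInfo and RelInfoArr only loosely mirror pg_upgrade's structures, and the
relfilenumber_hash_* helpers and rewrite_system_table() are hypothetical
stand-ins for whatever hash table and rewrite mechanism would actually be
used; the point is just the build-then-probe shape.

static void
move_conflicting_new_cluster_rels(RelInfoArr *old_rels, RelInfoArr *new_rels)
{
    relfilenumber_hash *h = relfilenumber_hash_create();
    int         i;

    /* Remember every relfilenode currently in use in the new cluster,
     * including the system tables. */
    for (i = 0; i < new_rels->nrels; i++)
        relfilenumber_hash_insert(h, new_rels->rels[i].relfilenumber,
                                  &new_rels->rels[i]);

    /* Probe with each relation coming over from the old cluster. */
    for (i = 0; i < old_rels->nrels; i++)
    {
        RelInfo    *conflict =
            relfilenumber_hash_lookup(h, old_rels->rels[i].relfilenumber);

        /* Any hit is a new-cluster system table that must be moved to a
         * fresh relfilenode before pg_restore runs. */
        if (conflict != NULL)
            rewrite_system_table(conflict);
    }

    relfilenumber_hash_destroy(h);
}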

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-22 Thread Amit Kapila
On Mon, Aug 22, 2022 at 1:25 PM Amit Kapila  wrote:
>
> On Sat, Jul 30, 2022 at 1:59 AM Robert Haas  wrote:
> >
> > On Wed, Jul 20, 2022 at 7:27 AM Dilip Kumar  wrote:
> > > There was also an issue where the user table from the old cluster's
> > > relfilenode could conflict with the system table of the new cluster.
> > > As a solution currently for system table object (while creating
> > > storage first time) we are keeping the low range of relfilenumber,
> > > basically we are using the same relfilenumber as OID so that during
> > > upgrade the normal user table from the old cluster will not conflict
> > > with the system tables in the new cluster.  But with this solution
> > > Robert told me (in off list chat) a problem that in future if we want
> > > to make relfilenumber completely unique within a cluster by
> > > implementing the CREATEDB differently then we can not do that as we
> > > have created fixed relfilenodes for the system tables.
> > >
> > > I am not sure what exactly we can do to avoid that because even if we
> > > do something  to avoid that in the new cluster the old cluster might
> > > be already using the non-unique relfilenode so after upgrading the new
> > > cluster will also get those non-unique relfilenode.
> >
> > I think this aspect of the patch could use some more discussion.
> >
> > To recap, the problem is that pg_upgrade mustn't discover that a
> > relfilenode that is being migrated from the old cluster is being used
> > for some other table in the new cluster. Since the new cluster should
> > only contain system tables that we assume have never been rewritten,
> > they'll all have relfilenodes equal to their OIDs, and thus less than
> > 16384. On the other hand all the user tables from the old cluster will
> > have relfilenodes greater than 16384, so we're fine. pg_largeobject,
> > which also gets migrated, is a special case. Since we don't change OID
> > assignments from version to version, it should have either the same
> > relfilenode value in the old and new clusters, if never rewritten, or
> > else the value in the old cluster will be greater than 16384, in which
> > case no conflict is possible.
> >
> > But if we just assign all relfilenode values from a central counter,
> > then we have got trouble. If the new version has more system catalog
> > tables than the old version, then some value that got used for a user
> > table in the old version might get used for a system table in the new
> > version, which is a problem. One idea for fixing this is to have two
> > RelFileNumber ranges: a system range (small values) and a user range.
> > System tables get values in the system range initially, and in the
> > user range when first rewritten. User tables always get values in the
> > user range. Everything works fine in this scenario except maybe for
> > pg_largeobject: what if it gets one value from the system range in the
> > old cluster, and a different value from the system range in the new
> > cluster, but some other system table in the new cluster gets the value
> > that pg_largeobject had in the old cluster? Then we've got trouble.
> >
>
> To solve that problem, how about rewriting the system table in the new
> cluster which has a conflicting relfilenode? I think we can probably
> do this conflict checking before processing the tables from the old
> cluster.
>

I think while rewriting a system table during the upgrade, we need to
ensure that it gets its relfilenumber from the system range; otherwise,
if we allocate it from the user range, there will be a chance of
conflict with the user tables from the old cluster. Another way could
be to set the next-relfilenumber counter for the new cluster to the
value from the old cluster, as mentioned by Robert in his previous
email [1].

[1] - 
https://www.postgresql.org/message-id/CA%2BTgmoYsNiF8JGZ%2BKp7Zgcct67Qk%2B%2BYAp%2B1ybOQ0qomUayn%2B7A%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: making relfilenodes 56 bits

2022-08-22 Thread Amit Kapila
On Sat, Jul 30, 2022 at 1:59 AM Robert Haas  wrote:
>
> On Wed, Jul 20, 2022 at 7:27 AM Dilip Kumar  wrote:
> > There was also an issue where the user table from the old cluster's
> > relfilenode could conflict with the system table of the new cluster.
> > As a solution currently for system table object (while creating
> > storage first time) we are keeping the low range of relfilenumber,
> > basically we are using the same relfilenumber as OID so that during
> > upgrade the normal user table from the old cluster will not conflict
> > with the system tables in the new cluster.  But with this solution
> > Robert told me (in off list chat) a problem that in future if we want
> > to make relfilenumber completely unique within a cluster by
> > implementing the CREATEDB differently then we can not do that as we
> > have created fixed relfilenodes for the system tables.
> >
> > I am not sure what exactly we can do to avoid that because even if we
> > do something  to avoid that in the new cluster the old cluster might
> > be already using the non-unique relfilenode so after upgrading the new
> > cluster will also get those non-unique relfilenode.
>
> I think this aspect of the patch could use some more discussion.
>
> To recap, the problem is that pg_upgrade mustn't discover that a
> relfilenode that is being migrated from the old cluster is being used
> for some other table in the new cluster. Since the new cluster should
> only contain system tables that we assume have never been rewritten,
> they'll all have relfilenodes equal to their OIDs, and thus less than
> 16384. On the other hand all the user tables from the old cluster will
> have relfilenodes greater than 16384, so we're fine. pg_largeobject,
> which also gets migrated, is a special case. Since we don't change OID
> assignments from version to version, it should have either the same
> relfilenode value in the old and new clusters, if never rewritten, or
> else the value in the old cluster will be greater than 16384, in which
> case no conflict is possible.
>
> But if we just assign all relfilenode values from a central counter,
> then we have got trouble. If the new version has more system catalog
> tables than the old version, then some value that got used for a user
> table in the old version might get used for a system table in the new
> version, which is a problem. One idea for fixing this is to have two
> RelFileNumber ranges: a system range (small values) and a user range.
> System tables get values in the system range initially, and in the
> user range when first rewritten. User tables always get values in the
> user range. Everything works fine in this scenario except maybe for
> pg_largeobject: what if it gets one value from the system range in the
> old cluster, and a different value from the system range in the new
> cluster, but some other system table in the new cluster gets the value
> that pg_largeobject had in the old cluster? Then we've got trouble.
>

To solve that problem, how about rewriting the system table in the new
cluster which has a conflicting relfilenode? I think we can probably
do this conflict checking before processing the tables from the old
cluster.

-- 
With Regards,
Amit Kapila.




Re: making relfilenodes 56 bits

2022-08-11 Thread Dilip Kumar
On Thu, Aug 11, 2022 at 10:58 AM Dilip Kumar  wrote:
>
> On Tue, Aug 9, 2022 at 8:51 PM Robert Haas  wrote:
> >
> > On Fri, Aug 5, 2022 at 3:25 AM Dilip Kumar  wrote:
> > > I think even if we start the range from the 4 billion we can not avoid
> > > keeping two separate ranges for system and user tables otherwise the
> > > next upgrade where old and new clusters both have 56 bits
> > > relfilenumber will get conflicting files.  And, for the same reason we
> > > still have to call SetNextRelFileNumber() during upgrade.
> >
> > Well, my proposal to move everything from the new cluster up to higher
> > numbers would address this without requiring two ranges.
> >
> > > So the idea is, we will be having 2 ranges for relfilenumbers, system
> > > range will start from 4 billion and user range maybe something around
> > > 4.1 (I think we can keep it very small though, just reserve 50k
> > > relfilenumber for system for future expansion and start user range
> > > from there).
> >
> > A disadvantage of this is that it basically means all the file names
> > in new clusters are going to be 10 characters long. That's not a big
> > disadvantage, but it's not wonderful. File names that are only 5-7
> > characters long are common today, and easier to remember.
>
> That's correct.
>
> > > So now system tables have no issues and also the user tables from the
> > > old cluster have no issues.  But pg_largeobject might get conflict
> > > when both old and new cluster are using 56 bits relfilenumber, because
> > > it is possible that in the new cluster some other system table gets
> > > that relfilenumber which is used by pg_largeobject in the old cluster.
> > >
> > > This could be resolved if we allocate pg_largeobject's relfilenumber
> > > from the user range, that means this relfilenumber will always be the
> > > first value from the user range.  So now if the old and new cluster
> > > both are using 56bits relfilenumber then pg_largeobject in both
> > > cluster would have got the same relfilenumber and if the old cluster
> > > is using the current 32 bits relfilenode system then the whole range
> > > of the new cluster is completely different than that of the old
> > > cluster.
> >
> > I think this can work, but it does rely to some extent on the fact
> > that there are no other tables which need to be treated like
> > pg_largeobject. If there were others, they'd need fixed starting
> > RelFileNumber assignments, or some other trick, like renumbering them
> > twice in the cluster, first to a known-unused value and then back to
> > the proper value. You'd have trouble if in the other cluster
> > pg_largeobject was 4bn+1 and pg_largeobject2 was 4bn+2 and in the new
> > cluster the reverse, without some hackery.
>
> Agree, if it has more catalog like pg_largeobject then it would
> require some hacking.
>
> > I do feel like your idea here has some advantages - my proposal
> > requires rewriting all the catalogs in the new cluster before we do
> > anything else, and that's going to take some time even though they
> > should be small. But I also feel like it has some disadvantages: it
> > seems to rely on complicated reasoning and special cases more than I'd
> > like.
>
> One other advantage with your approach is that we are starting the
> "nextrelfilenumber" after the old cluster's entire relfilenumber range.
> So we only need to set the "nextrelfilenumber" once at the beginning;
> after that, while upgrading each object, we don't need to set it again
> because it is already higher than the complete old cluster range.  In
> the other 2 approaches we will have to set the nextrelfilenumber every
> time we preserve the relfilenumber during upgrade.

I was also wondering whether we will get the max "relfilenumber" from
the old cluster at the cluster level or per database.  If we want it
per database we can run a simple query on pg_class, but then we also
need to see how to handle mapped relations if they have been rewritten.
I don't think we can get the max relfilenumber from the old cluster at
the cluster level.  Maybe in newer versions we can expose a function
from the server that just returns the NextRelFileNumber, which would be
the max relfilenumber, but I'm not sure how to do that in the old
version.
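
Such a function (purely hypothetical; the lock and counter names are
the ones introduced by this patch set) might look roughly like:

/* Hypothetical SQL-callable wrapper; not part of any released server. */
Datum
pg_get_next_relfilenumber(PG_FUNCTION_ARGS)
{
    RelFileNumber result;

    LWLockAcquire(RelFileNumberGenLock, LW_SHARED);
    result = ShmemVariableCache->nextRelFileNumber;
    LWLockRelease(RelFileNumberGenLock);

    PG_RETURN_INT64((int64) result);
}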

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-10 Thread Dilip Kumar
On Tue, Aug 9, 2022 at 8:51 PM Robert Haas  wrote:
>
> On Fri, Aug 5, 2022 at 3:25 AM Dilip Kumar  wrote:
> > I think even if we start the range from the 4 billion we can not avoid
> > keeping two separate ranges for system and user tables otherwise the
> > next upgrade where old and new clusters both have 56 bits
> > relfilenumber will get conflicting files.  And, for the same reason we
> > still have to call SetNextRelFileNumber() during upgrade.
>
> Well, my proposal to move everything from the new cluster up to higher
> numbers would address this without requiring two ranges.
>
> > So the idea is, we will be having 2 ranges for relfilenumbers, system
> > range will start from 4 billion and user range maybe something around
> > 4.1 (I think we can keep it very small though, just reserve 50k
> > relfilenumber for system for future expansion and start user range
> > from there).
>
> A disadvantage of this is that it basically means all the file names
> in new clusters are going to be 10 characters long. That's not a big
> disadvantage, but it's not wonderful. File names that are only 5-7
> characters long are common today, and easier to remember.

That's correct.

> > So now system tables have no issues and also the user tables from the
> > old cluster have no issues.  But pg_largeobject might get conflict
> > when both old and new cluster are using 56 bits relfilenumber, because
> > it is possible that in the new cluster some other system table gets
> > that relfilenumber which is used by pg_largeobject in the old cluster.
> >
> > This could be resolved if we allocate pg_largeobject's relfilenumber
> > from the user range, that means this relfilenumber will always be the
> > first value from the user range.  So now if the old and new cluster
> > both are using 56bits relfilenumber then pg_largeobject in both
> > cluster would have got the same relfilenumber and if the old cluster
> > is using the current 32 bits relfilenode system then the whole range
> > of the new cluster is completely different than that of the old
> > cluster.
>
> I think this can work, but it does rely to some extent on the fact
> that there are no other tables which need to be treated like
> pg_largeobject. If there were others, they'd need fixed starting
> RelFileNumber assignments, or some other trick, like renumbering them
> twice in the cluster, first to a known-unused value and then back to
> the proper value. You'd have trouble if in the other cluster
> pg_largeobject was 4bn+1 and pg_largeobject2 was 4bn+2 and in the new
> cluster the reverse, without some hackery.

Agree, if it has more catalog like pg_largeobject then it would
require some hacking.

> I do feel like your idea here has some advantages - my proposal
> requires rewriting all the catalogs in the new cluster before we do
> anything else, and that's going to take some time even though they
> should be small. But I also feel like it has some disadvantages: it
> seems to rely on complicated reasoning and special cases more than I'd
> like.

One other advantage with your approach is that we are starting the
"nextrelfilenumber" after the old cluster's entire relfilenumber range.
So we only need to set the "nextrelfilenumber" once at the beginning;
after that, while upgrading each object, we don't need to set it again
because it is already higher than the complete old cluster range.  In
the other 2 approaches we will have to set the nextrelfilenumber every
time we preserve the relfilenumber during upgrade.

Other than these two approaches we have another approach (what the
patch set is already doing) where we keep the system relfilenumber
range the same as the OID.  I know you have already pointed out that
this might have some hidden bugs, but one advantage of this approach is
that it is simple compared to the above two approaches, in the sense
that it doesn't need to maintain two ranges and it also doesn't need to
rewrite all system tables in the new cluster.  So I think it would be
good if we can get others' opinions on all these 3 approaches.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-09 Thread Robert Haas
On Fri, Aug 5, 2022 at 3:25 AM Dilip Kumar  wrote:
> I think even if we start the range from the 4 billion we can not avoid
> keeping two separate ranges for system and user tables otherwise the
> next upgrade where old and new clusters both have 56 bits
> relfilenumber will get conflicting files.  And, for the same reason we
> still have to call SetNextRelFileNumber() during upgrade.

Well, my proposal to move everything from the new cluster up to higher
numbers would address this without requiring two ranges.

> So the idea is, we will be having 2 ranges for relfilenumbers, system
> range will start from 4 billion and user range maybe something around
> 4.1 (I think we can keep it very small though, just reserve 50k
> relfilenumber for system for future expansion and start user range
> from there).

A disadvantage of this is that it basically means all the file names
in new clusters are going to be 10 characters long. That's not a big
disadvantage, but it's not wonderful. File names that are only 5-7
characters long are common today, and easier to remember.

> So now system tables have no issues and also the user tables from the
> old cluster have no issues.  But pg_largeobject might get conflict
> when both old and new cluster are using 56 bits relfilenumber, because
> it is possible that in the new cluster some other system table gets
> that relfilenumber which is used by pg_largeobject in the old cluster.
>
> This could be resolved if we allocate pg_largeobject's relfilenumber
> from the user range, that means this relfilenumber will always be the
> first value from the user range.  So now if the old and new cluster
> both are using 56bits relfilenumber then pg_largeobject in both
> cluster would have got the same relfilenumber and if the old cluster
> is using the current 32 bits relfilenode system then the whole range
> of the new cluster is completely different than that of the old
> cluster.

I think this can work, but it does rely to some extent on the fact
that there are no other tables which need to be treated like
pg_largeobject. If there were others, they'd need fixed starting
RelFileNumber assignments, or some other trick, like renumbering them
twice in the cluster, first to a known-unused value and then back to
the proper value. You'd have trouble if in the other cluster
pg_largeobject was 4bn+1 and pg_largeobject2 was 4bn+2 and in the new
cluster the reverse, without some hackery.

I do feel like your idea here has some advantages - my proposal
requires rewriting all the catalogs in the new cluster before we do
anything else, and that's going to take some time even though they
should be small. But I also feel like it has some disadvantages: it
seems to rely on complicated reasoning and special cases more than I'd
like.

What do other people think?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-05 Thread Dilip Kumar
On Thu, Aug 4, 2022 at 5:01 PM Dilip Kumar  wrote:
>
> On Sat, Jul 30, 2022 at 1:59 AM Robert Haas  wrote:
>
> > One solution to all this is to do as Dilip proposes here: for system
> > relations, keep assigning the OID as the initial relfilenumber.
> > Actually, we really only need to do this for pg_largeobject; all the
> > other relfilenumber values could be assigned from a counter, as long
> > as they're assigned from a range distinct from what we use for user
> > relations.
> >
> > But I don't really like that, because I feel like the whole thing
> > where we start out with relfilenumber=oid is a recipe for hidden bugs.
> > I believe we'd be better off if we decouple those concepts more
> > thoroughly. So here's another idea: what if we set the
> > next-relfilenumber counter for the new cluster to the value from the
> > old cluster, and then rewrote all the (thus-far-empty) system tables?
>
> You mean in a new cluster we start the next-relfilenumber counter from
> the highest relfilenode/OID value in the old cluster, right?  Yeah, if
> we start next-relfilenumber after the range of the old cluster then we
> can also avoid the logic of SetNextRelFileNumber() during upgrade.
>
> My very initial idea around this was to start the next-relfilenumber
> directly from 4 billion in the new cluster so there cannot be any
> conflict and we don't even need to identify the highest value of used
> relfilenode in the old cluster.  In fact we don't need to rewrite the
> system tables before upgrading, I think.  So what do we lose with this?
> Just 4 billion relfilenodes?  Does that really matter given the range
> we get with the 56-bit relfilenumber?

I think even if we start the range from the 4 billion we can not avoid
keeping two separate ranges for system and user tables otherwise the
next upgrade where old and new clusters both have 56 bits
relfilenumber will get conflicting files.  And, for the same reason we
still have to call SetNextRelFileNumber() during upgrade.

So the idea is, we will be having 2 ranges for relfilenumbers, system
range will start from 4 billion and user range maybe something around
4.1 (I think we can keep it very small though, just reserve 50k
relfilenumber for system for future expansion and start user range
from there).

So now system tables have no issues and also the user tables from the
old cluster have no issues.  But pg_largeobject might get conflict
when both old and new cluster are using 56 bits relfilenumber, because
it is possible that in the new cluster some other system table gets
that relfilenumber which is used by pg_largeobject in the old cluster.

This could be resolved if we allocate pg_largeobject's relfilenumber
from the user range, that means this relfilenumber will always be the
first value from the user range.  So now if the old and new cluster
both are using 56bits relfilenumber then pg_largeobject in both
cluster would have got the same relfilenumber and if the old cluster
is using the current 32 bits relfilenode system then the whole range
of the new cluster is completely different than that of the old
cluster.
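
Concretely, the split could look something like this (the boundaries and
names are only illustrative, not taken from the posted patches):

/* Illustrative boundaries only */
#define FirstSystemRelFileNumber    UINT64CONST(4000000000)
#define FirstUserRelFileNumber      (FirstSystemRelFileNumber + 50000)

/* pg_largeobject is pinned to the first value of the user range */
#define LargeObjectRelFileNumber    FirstUserRelFileNumber

New storage for system catalogs would then be allocated from the range
[FirstSystemRelFileNumber, FirstUserRelFileNumber), and everything else,
including later rewrites of system tables, from the user range.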

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-04 Thread Dilip Kumar
On Sat, Jul 30, 2022 at 1:59 AM Robert Haas  wrote:

> One solution to all this is to do as Dilip proposes here: for system
> relations, keep assigning the OID as the initial relfilenumber.
> Actually, we really only need to do this for pg_largeobject; all the
> other relfilenumber values could be assigned from a counter, as long
> as they're assigned from a range distinct from what we use for user
> relations.
>
> But I don't really like that, because I feel like the whole thing
> where we start out with relfilenumber=oid is a recipe for hidden bugs.
> I believe we'd be better off if we decouple those concepts more
> thoroughly. So here's another idea: what if we set the
> next-relfilenumber counter for the new cluster to the value from the
> old cluster, and then rewrote all the (thus-far-empty) system tables?

You mean in a new cluster we start the next-relfilenumber counter from
the highest relfilenode/OID value in the old cluster, right?  Yeah, if
we start next-relfilenumber after the range of the old cluster then we
can also avoid the logic of SetNextRelFileNumber() during upgrade.

My very initial idea around this was to start the next-relfilenumber
directly from 4 billion in the new cluster so there cannot be any
conflict and we don't even need to identify the highest value of used
relfilenode in the old cluster.  In fact we don't need to rewrite the
system tables before upgrading, I think.  So what do we lose with this?
Just 4 billion relfilenodes?  Does that really matter given the range
we get with the 56-bit relfilenumber?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-31 Thread Dilip Kumar
On Fri, Jul 29, 2022 at 10:55 PM Robert Haas  wrote:
>
> On Thu, Jul 28, 2022 at 10:29 AM Robert Haas  wrote:
> > > I have done some cleanup in 0002 as well, basically, earlier we were
> > > storing the result of the BufTagGetRelFileLocator() in a separate
> > > variable which is not required everywhere.  So wherever possible I
> > > have avoided using the intermediate variable.
> >
> > I'll have a look at this next.
>
> I was taught that when programming in C one should avoid returning a
> struct type, as BufTagGetRelFileLocator does. I would have expected it
> to return void and take an argument of type RelFileLocator * into
> which it writes the results. On the other hand, I was also taught that
> one should avoid passing a struct type as an argument, and smgropen()
> has been doing that since Tom Lane committed
> 87bd95638552b8fc1f5f787ce5b862bb6fc2eb80 all the way back in 2004. So
> maybe this isn't that relevant any more on modern compilers? Or maybe
> for small structs it doesn't matter much? I dunno.
>
> Other than that, I think your 0002 looks fine.

Generally, I try to avoid it, but I see that the current code also does
this [1] when the structure is small and returning it directly makes
the calling code easier.  The reasons I wanted to do it this way are:
a) if we pass it as an out argument then I have to use an extra
variable, which makes some code more complicated; it's not a big issue,
and in fact I had it that way in a previous version but simplified it
in one of the recent versions.  b) If I allocate memory and return a
pointer then I also need to store that address and free it later.

[1]
static inline ForEachState
for_each_from_setup(const List *lst, int N)
{
ForEachState r = {lst, N};

Assert(N >= 0);
return r;
}

static inline FullTransactionId
FullTransactionIdFromEpochAndXid(uint32 epoch, TransactionId xid)
{
FullTransactionId result;

result.value = ((uint64) epoch) << 32 | xid;

return result;
}
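
For comparison, the out-parameter style Robert describes would look
roughly like this (hypothetical name; the patch's helper returns the
struct instead):

static inline void
BufTagCopyRelFileLocator(const BufferTag *tag, RelFileLocator *rlocator)
{
    rlocator->spcOid = tag->spcOid;
    rlocator->dbOid = tag->dbOid;
    rlocator->relNumber = tag->relNumber;
}

The caller then needs a local RelFileLocator variable to pass in, which
is the extra variable mentioned in (a) above.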


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-31 Thread Dilip Kumar
On Sat, Jul 30, 2022 at 1:35 AM Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > On 2022-Jul-29, Robert Haas wrote:
> >> Yeah, if we think it's OK to pass around structs, then that seems like
> >> the right solution. Otherwise functions that take RelFileLocator
> >> should be changed to take const RelFileLocator * and we should adjust
> >> elsewhere accordingly.
>
> > We do that in other places.  See get_object_address() for another
> > example.  Now, I don't see *why* they do it.
>
> If it's a big struct then avoiding copying it is good; but RelFileLocator
> isn't that big.
>
> While researching that statement I did happen to notice that no one has
> bothered to update the comment immediately above struct RelFileLocator,
> and it is something that absolutely does require attention if there
> are plans to make RelFileNumber something other than 32 bits.

I think we need to update this comment in the patch where we make
RelFileNumber 64 bits wide.  But as such I do not see a problem with
using RelFileLocator directly as a key, because if we make
RelFileNumber 64 bits then the structure will already be 8-byte
aligned, so there should not be any padding.  However, if we use some
other structure as a key which contains RelFileLocator, i.e.
RelFileLocatorBackend, then there will be a problem.  So to handle that
issue, while computing the key size (wherever we have
RelFileLocatorBackend as a key) I have excluded the padding bytes by
introducing this new macro [1].

[1]
#define SizeOfRelFileLocatorBackend \
(offsetof(RelFileLocatorBackend, backend) + sizeof(BackendId))
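
For reference, the layouts in question look roughly like this (sizes
assume a 64-bit RelFileNumber on a typical 64-bit platform):

typedef struct RelFileLocator
{
    Oid             spcOid;     /* 4 bytes */
    Oid             dbOid;      /* 4 bytes */
    RelFileNumber   relNumber;  /* 8 bytes: 16 bytes total, no padding */
} RelFileLocator;

typedef struct RelFileLocatorBackend
{
    RelFileLocator  locator;    /* 16 bytes */
    BackendId       backend;    /* 4 bytes + 4 trailing padding bytes */
} RelFileLocatorBackend;

Hashing sizeof(RelFileLocatorBackend) would include those uninitialized
trailing bytes, which is exactly what the macro above avoids.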

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-30 Thread Alvaro Herrera
On 2022-Jul-30, Dilip Kumar wrote:

> On Thu, Jul 28, 2022 at 9:29 PM Alvaro Herrera  
> wrote:

> > Please don't do this; rather use %llu and cast to (long long).
> > Otherwise the string becomes mangled for translation.
> 
> Okay, actually I did not clearly understand the logic of when to use
> %llu and when to use (U)INT64_FORMAT.  They are both used for 64-bit
> integers.  So do you think it is fine to replace all INT64_FORMAT in
> my patch with %llu?

The point here is that there are two users of the source code: one is
the compiler, and the other is gettext, which extracts the string for
the translation catalog.  The compiler is OK with UINT64_FORMAT, of
course (because the preprocessor deals with it).  But gettext is quite
stupid and doesn't understand that UINT64_FORMAT expands to some
specifier, so it truncates the string at the double quote sign just
before; in other words, it just doesn't work.  So whenever you have a
string that ends up in a translation catalog, you must not use
UINT64_FORMAT or any other preprocessor macro; it has to be a straight
specifier in the format string.

We have found that the most convenient notation is to use %llu in the
string and cast the argument to (unsigned long long), so our convention
is to use that.

For strings that do not end up in a translation catalog, there's no
reason to use %llu-and-cast; UINT64_FORMAT is okay.
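
For example, a message from this patch would be spelled like this
(illustrative):

ereport(ERROR,
        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
         errmsg("relfilenumber %llu is out of range",
                (unsigned long long) relfilenumber)));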

> > > @@ -2378,7 +2378,7 @@ verifyBackupPageConsistency(XLogReaderState *record)
> > >   if (memcmp(replay_image_masked, primary_image_masked, 
> > > BLCKSZ) != 0)
> > >   {
> > >   elog(FATAL,
> > > -  "inconsistent page found, rel %u/%u/%u, 
> > > forknum %u, blkno %u",
> > > +  "inconsistent page found, rel %u/%u/" 
> > > INT64_FORMAT ", forknum %u, blkno %u",
> > >rlocator.spcOid, rlocator.dbOid, 
> > > rlocator.relNumber,
> > >forknum, blkno);
> >
> > Should this one be an ereport, and thus you do need to change it to that
> > and handle it like that?
> 
> Okay, so you mean irrespective of this patch should this be converted
> to ereport?

Yes, I think this should be an ereport with errcode(ERRCODE_DATA_CORRUPTED).

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: making relfilenodes 56 bits

2022-07-29 Thread Dilip Kumar
On Thu, Jul 28, 2022 at 9:29 PM Alvaro Herrera  wrote:
>
> Not a full review, just a quick skim of 0003.

Thanks for the review

> > + if (!shutdown)
> > + {
> > + if (ShmemVariableCache->loggedRelFileNumber < 
> > checkPoint.nextRelFileNumber)
> > + elog(ERROR, "nextRelFileNumber can not go backward 
> > from " INT64_FORMAT "to" INT64_FORMAT,
> > +  checkPoint.nextRelFileNumber, 
> > ShmemVariableCache->loggedRelFileNumber);
> > +
> > + checkPoint.nextRelFileNumber = 
> > ShmemVariableCache->loggedRelFileNumber;
> > + }
>
> Please don't do this; rather use %llu and cast to (long long).
> Otherwise the string becomes mangled for translation.  I think there are
> many uses of this sort of pattern in strings, but not all of them are
> translatable so maybe we don't care -- for example contrib doesn't have
> translations.  And the rmgrdesc routines don't translate either, so we
> probably don't care about it there; and nothing that uses elog either.
> But this one in particular I think should be an ereport, not an elog.
> There are several other ereports in various places of the patch also.

Okay, actually I did not clearly understand the logic of when to use
%llu and when to use (U)INT64_FORMAT.  They are both used for 64-bit
integers.  So do you think it is fine to replace all INT64_FORMAT in
my patch with %llu?

> > @@ -2378,7 +2378,7 @@ verifyBackupPageConsistency(XLogReaderState *record)
> >   if (memcmp(replay_image_masked, primary_image_masked, BLCKSZ) 
> > != 0)
> >   {
> >   elog(FATAL,
> > -  "inconsistent page found, rel %u/%u/%u, 
> > forknum %u, blkno %u",
> > +  "inconsistent page found, rel %u/%u/" 
> > INT64_FORMAT ", forknum %u, blkno %u",
> >rlocator.spcOid, rlocator.dbOid, 
> > rlocator.relNumber,
> >forknum, blkno);
>
> Should this one be an ereport, and thus you do need to change it to that
> and handle it like that?

Okay, so you mean irrespective of this patch should this be converted
to ereport?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-29 Thread Thomas Munro
On Sat, Jul 30, 2022 at 9:11 AM Thomas Munro  wrote:
> on all modern Unix-y systems,

(I meant to write AMD64 there)




Re: making relfilenodes 56 bits

2022-07-29 Thread Thomas Munro
On Sat, Jul 30, 2022 at 8:08 AM Tom Lane  wrote:
> Robert Haas  writes:
> > I was taught that when programming in C one should avoid returning a
> > struct type, as BufTagGetRelFileLocator does.
>
> FWIW, I think that was invalid pre-ANSI-C, and maybe even in C89.
> C99 and later requires it.  But it is pass-by-value and you have
> to think twice about whether you want the struct to be copied.

C89 had that.

As for what it actually does in a non-inlined function: on all modern
Unix-y systems, 128 bit first arguments and return values are
transferred in register pairs[1].  So if you define a struct that
holds uint32_t, uint32_t, uint64_t and compile a function that takes
one and returns it, you see the struct being transferred directly from
input registers to output registers:

   0x <+0>:mov%rdi,%rax
   0x0003 <+3>:mov%rsi,%rdx
   0x0006 <+6>:ret

Similar on ARM64.  There it's an empty function, so it must be using
the same register in and out[2].

The MSVC calling convention is different and doesn't seem to be able
to pass it through registers, so it schleps it out to memory at a
return address[3].  But that's pretty similar to the proposed
alternative anyway, so surely no worse.  *shrug*  And of course those
"constructor"-like functions are inlined anyway.

[1] https://en.wikipedia.org/wiki/X86_calling_conventions#System_V_AMD64_ABI
[2] https://gcc.godbolt.org/z/qfPzhW7YM
[3] https://gcc.godbolt.org/z/WqvYz6xjs
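
For reference, the function behind those listings is essentially this
(a self-contained sketch whose layout mirrors RelFileLocator with a
64-bit relNumber):

#include <stdint.h>

typedef struct Demo
{
    uint32_t    a;
    uint32_t    b;
    uint64_t    c;
} Demo;

Demo
pass_through(Demo d)
{
    return d;               /* compiles to the two mov instructions above */
}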




Re: making relfilenodes 56 bits

2022-07-29 Thread Robert Haas
On Wed, Jul 20, 2022 at 7:27 AM Dilip Kumar  wrote:
> There was also an issue where the user table from the old cluster's
> relfilenode could conflict with the system table of the new cluster.
> As a solution currently for system table object (while creating
> storage first time) we are keeping the low range of relfilenumber,
> basically we are using the same relfilenumber as OID so that during
> upgrade the normal user table from the old cluster will not conflict
> with the system tables in the new cluster.  But with this solution
> Robert told me (in off list chat) a problem that in future if we want
> to make relfilenumber completely unique within a cluster by
> implementing the CREATEDB differently then we can not do that as we
> have created fixed relfilenodes for the system tables.
>
> I am not sure what exactly we can do to avoid that because even if we
> do something  to avoid that in the new cluster the old cluster might
> be already using the non-unique relfilenode so after upgrading the new
> cluster will also get those non-unique relfilenode.

I think this aspect of the patch could use some more discussion.

To recap, the problem is that pg_upgrade mustn't discover that a
relfilenode that is being migrated from the old cluster is being used
for some other table in the new cluster. Since the new cluster should
only contain system tables that we assume have never been rewritten,
they'll all have relfilenodes equal to their OIDs, and thus less than
16384. On the other hand all the user tables from the old cluster will
have relfilenodes greater than 16384, so we're fine. pg_largeobject,
which also gets migrated, is a special case. Since we don't change OID
assignments from version to version, it should have either the same
relfilenode value in the old and new clusters, if never rewritten, or
else the value in the old cluster will be greater than 16384, in which
case no conflict is possible.

But if we just assign all relfilenode values from a central counter,
then we have got trouble. If the new version has more system catalog
tables than the old version, then some value that got used for a user
table in the old version might get used for a system table in the new
version, which is a problem. One idea for fixing this is to have two
RelFileNumber ranges: a system range (small values) and a user range.
System tables get values in the system range initially, and in the
user range when first rewritten. User tables always get values in the
user range. Everything works fine in this scenario except maybe for
pg_largeobject: what if it gets one value from the system range in the
old cluster, and a different value from the system range in the new
cluster, but some other system table in the new cluster gets the value
that pg_largeobject had in the old cluster? Then we've got trouble. It
doesn't help if we assign pg_largeobject a starting relfilenode from
the user range, either: now a relfilenode that needs to end up
containing some user table from the old cluster might find itself
blocked by pg_largeobject in the new cluster.

One solution to all this is to do as Dilip proposes here: for system
relations, keep assigning the OID as the initial relfilenumber.
Actually, we really only need to do this for pg_largeobject; all the
other relfilenumber values could be assigned from a counter, as long
as they're assigned from a range distinct from what we use for user
relations.

But I don't really like that, because I feel like the whole thing
where we start out with relfilenumber=oid is a recipe for hidden bugs.
I believe we'd be better off if we decouple those concepts more
thoroughly. So here's another idea: what if we set the
next-relfilenumber counter for the new cluster to the value from the
old cluster, and then rewrote all the (thus-far-empty) system tables?
Then every system relation in the new cluster has a relfilenode value
greater than any in use in the old cluster, so we can afterwards
migrate over every relfilenode from the old cluster with no risk of
conflicting with anything. Then all the special cases go away. We
don't need system and user ranges for relfilenodes, and
pg_largeobject's not a special case, either. We can assign relfilenode
values to system relations in exactly the same we do for user
relations: assign a value from the global counter and forget about it.
If this cluster happens to be the "new cluster" for a pg_upgrade
attempt, the procedure described at the beginning of this paragraph
will move everything that might conflict out of the way.
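
As a standalone illustration (not code from the patch, glossing over
shared catalogs and the per-database loop, and with error checking
omitted), the rewrite step could be driven through libpq roughly like
this, assuming VACUUM FULL is enough to give each still-empty system
catalog a fresh relfilenode:

#include <stdio.h>
#include "libpq-fe.h"

static void
rewrite_system_catalogs(const char *conninfo)
{
    PGconn     *conn = PQconnectdb(conninfo);
    PGresult   *rels;

    /* every system catalog has an OID below FirstNormalObjectId (16384) */
    rels = PQexec(conn,
                  "SELECT oid::regclass::text FROM pg_class "
                  "WHERE oid < 16384 AND relkind = 'r'");

    for (int i = 0; i < PQntuples(rels); i++)
    {
        char        sql[1024];

        /* rewriting a table also rewrites its indexes and toast table */
        snprintf(sql, sizeof(sql), "VACUUM FULL %s", PQgetvalue(rels, i, 0));
        PQclear(PQexec(conn, sql));
    }

    PQclear(rels);
    PQfinish(conn);
}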

One thing to perhaps not like about this is that it's a little more
expensive: clustering every system table in every database on a new
cluster isn't completely free. Perhaps it's not expensive enough to be
a big problem, though.

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-29 Thread Tom Lane
Robert Haas  writes:
> I was taught that when programming in C one should avoid returning a
> struct type, as BufTagGetRelFileLocator does.

FWIW, I think that was invalid pre-ANSI-C, and maybe even in C89.
C99 and later requires it.  But it is pass-by-value and you have
to think twice about whether you want the struct to be copied.

regards, tom lane




Re: making relfilenodes 56 bits

2022-07-29 Thread Tom Lane
Alvaro Herrera  writes:
> On 2022-Jul-29, Robert Haas wrote:
>> Yeah, if we think it's OK to pass around structs, then that seems like
>> the right solution. Otherwise functions that take RelFileLocator
>> should be changed to take const RelFileLocator * and we should adjust
>> elsewhere accordingly.

> We do that in other places.  See get_object_address() for another
> example.  Now, I don't see *why* they do it.

If it's a big struct then avoiding copying it is good; but RelFileLocator
isn't that big.

While researching that statement I did happen to notice that no one has
bothered to update the comment immediately above struct RelFileLocator,
and it is something that absolutely does require attention if there
are plans to make RelFileNumber something other than 32 bits.

 * Note: various places use RelFileLocator in hashtable keys.  Therefore,
 * there *must not* be any unused padding bytes in this struct.  That
 * should be safe as long as all the fields are of type Oid.
 */
typedef struct RelFileLocator
{
OidspcOid;/* tablespace */
OiddbOid; /* database */
RelFileNumber  relNumber; /* relation */
} RelFileLocator;

regards, tom lane




Re: making relfilenodes 56 bits

2022-07-29 Thread Robert Haas
On Fri, Jul 29, 2022 at 3:18 PM Alvaro Herrera  wrote:
> On 2022-Jul-29, Robert Haas wrote:
> > Yeah, if we think it's OK to pass around structs, then that seems like
> > the right solution. Otherwise functions that take RelFileLocator
> > should be changed to take const RelFileLocator * and we should adjust
> > elsewhere accordingly.
>
> We do that in other places.  See get_object_address() for another
> example.  Now, I don't see *why* they do it.  I suppose there's
> notational convenience; for get_object_address() I think it'd be uglier
> with another out argument (it already has *relp).  For smgropen() it's
> not clear at all that there is any.
>
> For the new function, there's at least a couple of places that the
> calling convention makes simpler, so I don't see why you wouldn't use it
> that way.

All right, perhaps it's fine as Dilip has it, then.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-29 Thread Alvaro Herrera
On 2022-Jul-29, Robert Haas wrote:

> Yeah, if we think it's OK to pass around structs, then that seems like
> the right solution. Otherwise functions that take RelFileLocator
> should be changed to take const RelFileLocator * and we should adjust
> elsewhere accordingly.

We do that in other places.  See get_object_address() for another
example.  Now, I don't see *why* they do it.  I suppose there's
notational convenience; for get_object_address() I think it'd be uglier
with another out argument (it already has *relp).  For smgropen() it's
not clear at all that there is any.

For the new function, there's at least a couple of places that the
calling convention makes simpler, so I don't see why you wouldn't use it
that way.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Use it up, wear it out, make it do, or do without"




Re: making relfilenodes 56 bits

2022-07-29 Thread Robert Haas
On Fri, Jul 29, 2022 at 2:12 PM Alvaro Herrera  wrote:
> On 2022-Jul-29, Robert Haas wrote:
> > I was taught that when programming in C one should avoid returning a
> > struct type, as BufTagGetRelFileLocator does.
>
> Doing it like that helps RelFileLocatorSkippingWAL, which takes a bare
> RelFileLocator as argument.  With this coding you can call one function
> with the other function as its argument.
>
> However, with the current definition of relpathbackend() and siblings,
> it looks quite disastrous -- BufTagGetRelFileLocator is being called
> three times.  You could argue that a solution would be to turn those
> macros into static inline functions.

Yeah, if we think it's OK to pass around structs, then that seems like
the right solution. Otherwise functions that take RelFileLocator
should be changed to take const RelFileLocator * and we should adjust
elsewhere accordingly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-29 Thread Alvaro Herrera
On 2022-Jul-29, Robert Haas wrote:

> I was taught that when programming in C one should avoid returning a
> struct type, as BufTagGetRelFileLocator does.

Doing it like that helps RelFileLocatorSkippingWAL, which takes a bare
RelFileLocator as argument.  With this coding you can call one function
with the other function as its argument.

However, with the current definition of relpathbackend() and siblings,
it looks quite disastrous -- BufTagGetRelFileLocator is being called
three times.  You could argue that a solution would be to turn those
macros into static inline functions.
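
For instance (a sketch only; GetRelationPath's exact signature may
differ slightly), the inline version would evaluate its argument just
once:

static inline char *
relpathbackend(RelFileLocator rlocator, BackendId backend, ForkNumber forknum)
{
    return GetRelationPath(rlocator.dbOid, rlocator.spcOid,
                           rlocator.relNumber, backend, forknum);
}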

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"I'm impressed how quickly you are fixing this obscure issue. I came from 
MS SQL and it would be hard for me to put into words how much of a better job
you all are doing on [PostgreSQL]."
 Steve Midgley, http://archives.postgresql.org/pgsql-sql/2008-08/msg0.php




Re: making relfilenodes 56 bits

2022-07-29 Thread Robert Haas
On Thu, Jul 28, 2022 at 10:29 AM Robert Haas  wrote:
> > I have done some cleanup in 0002 as well, basically, earlier we were
> > storing the result of the BufTagGetRelFileLocator() in a separate
> > variable which is not required everywhere.  So wherever possible I
> > have avoided using the intermediate variable.
>
> I'll have a look at this next.

I was taught that when programming in C one should avoid returning a
struct type, as BufTagGetRelFileLocator does. I would have expected it
to return void and take an argument of type RelFileLocator * into
which it writes the results. On the other hand, I was also taught that
one should avoid passing a struct type as an argument, and smgropen()
has been doing that since Tom Lane committed
87bd95638552b8fc1f5f787ce5b862bb6fc2eb80 all the way back in 2004. So
maybe this isn't that relevant any more on modern compilers? Or maybe
for small structs it doesn't matter much? I dunno.

Other than that, I think your 0002 looks fine.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-29 Thread vignesh C
On Wed, Jul 27, 2022 at 6:02 PM Dilip Kumar  wrote:
>
> On Wed, Jul 27, 2022 at 3:27 PM vignesh C  wrote:
> >
>
> > Thanks for the updated patch, Few comments:
> > 1) The format specifier should be changed from %u to INT64_FORMAT
> > autoprewarm.c -> apw_load_buffers
> > ...
> > if (fscanf(file, "%u,%u,%u,%u,%u\n", &blkinfo[i].database,
> >&blkinfo[i].tablespace, &blkinfo[i].filenumber,
> >&forknum, &blkinfo[i].blocknum) != 5)
> > ...
> >
> > 2) The format specifier should be changed from %u to INT64_FORMAT
> > autoprewarm.c -> apw_dump_now
> > ...
> > ret = fprintf(file, "%u,%u,%u,%u,%u\n",
> >   block_info_array[i].database,
> >   block_info_array[i].tablespace,
> >   block_info_array[i].filenumber,
> >   (uint32) block_info_array[i].forknum,
> >   block_info_array[i].blocknum);
> > ...
> >
> > 3) should the comment "entry point for old extension version" be on
> > top of pg_buffercache_pages, as the current version will use
> > pg_buffercache_pages_v1_4
> > +
> > +Datum
> > +pg_buffercache_pages(PG_FUNCTION_ARGS)
> > +{
> > +   return pg_buffercache_pages_internal(fcinfo, OIDOID);
> > +}
> > +
> > +/* entry point for old extension version */
> > +Datum
> > +pg_buffercache_pages_v1_4(PG_FUNCTION_ARGS)
> > +{
> > +   return pg_buffercache_pages_internal(fcinfo, INT8OID);
> > +}
> >
> > 4) we could use the new style or ereport by removing the brackets
> > around errcode:
> > +   if (fctx->record[i].relfilenumber > OID_MAX)
> > +   ereport(ERROR,
> > +
> > (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +
> > errmsg("relfilenode" INT64_FORMAT " is too large to be represented as
> > an OID",
> > +
> >  fctx->record[i].relfilenumber),
> > +
> > errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache
> > UPDATE")));
> >
> > like:
> > ereport(ERROR,
> >
> > errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> >
> > errmsg("relfilenode" INT64_FORMAT " is too large to be represented as
> > an OID",
> >
> > fctx->record[i].relfilenumber),
> >
> > errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache
> > UPDATE"));
> >
> > 5) Similarly in the below code too:
> > +   /* check whether the relfilenumber is within a valid range */
> > +   if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
> > +   ereport(ERROR,
> > +   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +errmsg("relfilenumber " INT64_FORMAT
> > " is out of range",
> > +   (relfilenumber))));
> >
> >
> > 6) Similarly in the below code too:
> > +#define CHECK_RELFILENUMBER_RANGE(relfilenumber)
> >  \
> > +do {
> >  \
> > +   if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
> > +   ereport(ERROR,
> >  \
> > +
> > (errcode(ERRCODE_INVALID_PARAMETER_VALUE),  \
> > +errmsg("relfilenumber " INT64_FORMAT
> > " is out of range",   \
> > +   (relfilenumber)))); \
> > +} while (0)
> > +
> >
> >
> > 7) This error code looks similar to CHECK_RELFILENUMBER_RANGE, can
> > this macro be used here too:
> > pg_filenode_relation(PG_FUNCTION_ARGS)
> >  {
> > Oid reltablespace = PG_GETARG_OID(0);
> > -   RelFileNumber relfilenumber = PG_GETARG_OID(1);
> > +   RelFileNumber relfilenumber = PG_GETARG_INT64(1);
> > Oid heaprel;
> >
> > +   /* check whether the relfilenumber is within a valid range */
> > +   if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
> > +   ereport(ERROR,
> > +   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +errmsg("relfilenumber " INT64_FORMAT
> > " is out of range",
> > +   (relfilenumber))));
> >
> >
> > 8) I felt this include is not required:
> > diff --git a/src/backend/access/transam/varsup.c
> > b/src/backend/access/transam/varsup.c
> > index 849a7ce..a2f0d35 100644
> > --- a/src/backend/access/transam/varsup.c
> > +++ b/src/backend/access/transam/varsup.c
> > @@ -13,12 +13,16 @@
> >
> >  #include "postgres.h"
> >
> > +#include 
> > +
> >  #include "access/clog.h"
> >  #include "access/commit_ts.h"
> >
> > 9) should we change elog to ereport to use the New-style error reporting API
> > +   /* safety check, we should never get this far in a HS standby */
> > +   if (RecoveryInProgress())
> > +   elog(ERROR, "cannot assign RelFileNumber during recovery");
> > +
> > +   if (IsBinaryUpgrade)
> > +   elog(ERROR, "cannot assign RelFileNumber during binary
> > upgrade");
> >
> > 10) Here nextRelFileNumber is protected by RelFileNumberGenLock, 

Re: making relfilenodes 56 bits

2022-07-29 Thread Ashutosh Sharma
On Fri, Jul 29, 2022 at 6:26 PM Ashutosh Sharma 
wrote:

> On Thu, Jul 28, 2022 at 5:02 PM Dilip Kumar  wrote:
>
> +/* --
> + * RelFileNumber zero is InvalidRelFileNumber.
> + *
> + * For the system tables (OID < FirstNormalObjectId) the initial storage
>
> Above comment says that RelFileNumber zero is invalid which is technically
> correct because we don't have any relation file in disk with zero number.
> But the point is that if someone reads below definition of
> CHECK_RELFILENUMBER_RANGE he/she might get confused because as per this
> definition relfilenumber zero is valid.
>

Please ignore the above comment shared in my previous email.  It is a
little over-thinking on my part that generated this comment in my mind.
Sorry for that. Here are the other comments I have:

+/* First we have to remove them from the extension */
+ALTER EXTENSION pg_buffercache DROP VIEW pg_buffercache;
+ALTER EXTENSION pg_buffercache DROP FUNCTION pg_buffercache_pages();
+
+/* Then we can drop them */
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
+
+/* Now redefine */
+CREATE OR REPLACE FUNCTION pg_buffercache_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages_v1_4'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE OR REPLACE VIEW pg_buffercache AS
+   SELECT P.* FROM pg_buffercache_pages() AS P
+   (bufferid integer, relfilenode int8, reltablespace oid, reldatabase oid,
+relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+pinning_backends int4);

As we are dropping the function and view I think it would be good if we
*don't* use the "OR REPLACE" keyword when re-defining them.

==

+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("relfilenode" INT64_FORMAT " is too
large to be represented as an OID",
+   fctx->record[i].relfilenumber),
+errhint("Upgrade the extension using ALTER
EXTENSION pg_buffercache UPDATE")));

I think it would be good to recommend users to upgrade to the latest
version instead of just saying upgrade the pg_buffercache using ALTER
EXTENSION 

==

--- a/contrib/pg_walinspect/sql/pg_walinspect.sql
+++ b/contrib/pg_walinspect/sql/pg_walinspect.sql
@@ -39,10 +39,10 @@ SELECT COUNT(*) >= 0 AS ok FROM
pg_get_wal_stats_till_end_of_wal(:'wal_lsn1');
 -- Test for filtering out WAL records of a particular table
 -- ===

-SELECT oid AS sample_tbl_oid FROM pg_class WHERE relname = 'sample_tbl'
\gset
+SELECT relfilenode AS sample_tbl_relfilenode FROM pg_class WHERE relname =
'sample_tbl' \gset

Is this change required? The original query is just trying to fetch table
oid not relfilenode and AFAIK we haven't changed anything in table oid.

==

+#define CHECK_RELFILENUMBER_RANGE(relfilenumber)   \
+do {   \
+   if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+   ereport(ERROR,  \
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),  \
+errmsg("relfilenumber " INT64_FORMAT " is out of range",
\
+   (relfilenumber)))); \
+} while (0)
+

I think we can shift this macro to some header file and reuse it at several
places.

==


+* Generate a new relfilenumber.  We cannot reuse the old relfilenumber
+* because of the possibility that that relation will be moved back to
the

that that relation -> that relation

--
With Regards,
Ashutosh Sharma.


Re: making relfilenodes 56 bits

2022-07-29 Thread Ashutosh Sharma
On Thu, Jul 28, 2022 at 5:02 PM Dilip Kumar  wrote:

+/* --
+ * RelFileNumber zero is InvalidRelFileNumber.
+ *
+ * For the system tables (OID < FirstNormalObjectId) the initial storage

Above comment says that RelFileNumber zero is invalid which is technically
correct because we don't have any relation file in disk with zero number.
But the point is that if someone reads below definition of
CHECK_RELFILENUMBER_RANGE he/she might get confused because as per this
definition relfilenumber zero is valid.

+#define CHECK_RELFILENUMBER_RANGE(relfilenumber)   \
+do {   \
+   if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+   ereport(ERROR,  \
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),  \
+errmsg("relfilenumber " INT64_FORMAT " is out of range",
\
+   (relfilenumber)))); \
+} while (0)
+

+   RelFileNumber relfilenumber = PG_GETARG_INT64(0);
+   CHECK_RELFILENUMBER_RANGE(relfilenumber);

It seems like the relfilenumber in the above definition represents the
relfilenode value in pg_class, which can hold a zero value, which
actually means it's a mapped relation. I think it would be good to
provide some clarity here.

--
With Regards,
Ashutosh Sharma.


Re: making relfilenodes 56 bits

2022-07-29 Thread Alvaro Herrera
On 2022-Jul-28, Robert Haas wrote:

> On Thu, Jul 28, 2022 at 11:59 AM Alvaro Herrera  
> wrote:
> > I do wonder why do we keep relfilenodes limited to decimal digits.  Why
> > not use hex digits?  Then we know the limit is 14 chars, as in
> > 0x00FFFFFFFFFFFFFF in the MAX_RELFILENUMBER definition.
> 
> Hmm, but surely we want the error messages to be printed using the
> same format that we use for the actual filenames.

Of course.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Most hackers will be perfectly comfortable conceptualizing users as entropy
 sources, so let's move on."   (Nathaniel Smith)




Re: making relfilenodes 56 bits

2022-07-28 Thread Joshua Drake
On Thu, Jul 28, 2022 at 9:52 AM Robert Haas  wrote:

> On Thu, Jul 28, 2022 at 11:59 AM Alvaro Herrera 
> wrote:
> > I do wonder why do we keep relfilenodes limited to decimal digits.  Why
> > not use hex digits?  Then we know the limit is 14 chars, as in
> > 0x00FFFFFFFFFFFFFF in the MAX_RELFILENUMBER definition.
>
> Hmm, but surely we want the error messages to be printed using the
> same format that we use for the actual filenames. We could make the
> filenames use hex characters too, but I'm not wild about changing
> user-visible details like that.
>

From a DBA perspective this would be a regression in usability.

JD

-- 

   - Founder - https://commandprompt.com/ - 24x7x365 Postgres since 1997
   - Founder and Co-Chair - https://postgresconf.org/
   - Founder - https://postgresql.us - United States PostgreSQL
   - Public speaker, published author, postgresql expert, and people
   believer.
   - Host - More than a refresh
   : A podcast about
   data and the people who wrangle it.


Re: making relfilenodes 56 bits

2022-07-28 Thread Robert Haas
On Thu, Jul 28, 2022 at 11:59 AM Alvaro Herrera  wrote:
> I do wonder why do we keep relfilenodes limited to decimal digits.  Why
> not use hex digits?  Then we know the limit is 14 chars, as in
> 0x00FFFFFFFFFFFFFF in the MAX_RELFILENUMBER definition.

Hmm, but surely we want the error messages to be printed using the
same format that we use for the actual filenames. We could make the
filenames use hex characters too, but I'm not wild about changing
user-visible details like that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-28 Thread Alvaro Herrera
Not a full review, just a quick skim of 0003.

On 2022-Jul-28, Dilip Kumar wrote:

> + if (!shutdown)
> + {
> + if (ShmemVariableCache->loggedRelFileNumber < 
> checkPoint.nextRelFileNumber)
> + elog(ERROR, "nextRelFileNumber can not go backward from 
> " INT64_FORMAT "to" INT64_FORMAT,
> +  checkPoint.nextRelFileNumber, 
> ShmemVariableCache->loggedRelFileNumber);
> +
> + checkPoint.nextRelFileNumber = 
> ShmemVariableCache->loggedRelFileNumber;
> + }

Please don't do this; rather use %llu and cast to (long long).
Otherwise the string becomes mangled for translation.  I think there are
many uses of this sort of pattern in strings, but not all of them are
translatable so maybe we don't care -- for example contrib doesn't have
translations.  And the rmgrdesc routines don't translate either, so we
probably don't care about it there; and nothing that uses elog either.
But this one in particular I think should be an ereport, not an elog.
There are several other ereports in various places of the patch also.

> @@ -2378,7 +2378,7 @@ verifyBackupPageConsistency(XLogReaderState *record)
>   if (memcmp(replay_image_masked, primary_image_masked, BLCKSZ) 
> != 0)
>   {
>   elog(FATAL,
> -  "inconsistent page found, rel %u/%u/%u, 
> forknum %u, blkno %u",
> +  "inconsistent page found, rel %u/%u/" 
> INT64_FORMAT ", forknum %u, blkno %u",
>rlocator.spcOid, rlocator.dbOid, 
> rlocator.relNumber,
>forknum, blkno);

Should this one be an ereport, and thus you do need to change it to that
and handle it like that?


> + if (xlrec->rlocator.relNumber > 
> ShmemVariableCache->nextRelFileNumber)
> + elog(ERROR, "unexpected relnumber " INT64_FORMAT "that 
> is bigger than nextRelFileNumber " INT64_FORMAT,
> +  xlrec->rlocator.relNumber, 
> ShmemVariableCache->nextRelFileNumber);

You missed one whitespace here after the INT64_FORMAT.

> diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
> index c390ec5..f727078 100644
> --- a/src/bin/pg_controldata/pg_controldata.c
> +++ b/src/bin/pg_controldata/pg_controldata.c
> @@ -250,6 +250,8 @@ main(int argc, char *argv[])
> 	printf(_("Latest checkpoint's NextXID:  %u:%u\n"),
> 		   EpochFromFullTransactionId(ControlFile->checkPointCopy.nextXid),
> 		   XidFromFullTransactionId(ControlFile->checkPointCopy.nextXid));
> +	printf(_("Latest checkpoint's NextRelFileNumber:  " INT64_FORMAT "\n"),
> +		   ControlFile->checkPointCopy.nextRelFileNumber);

This one must definitely be translatable.

>  /* Characters to allow for an RelFileNumber in a relation path */
> -#define RELNUMBERCHARS   OIDCHARS/* same as OIDCHARS */
> +#define RELNUMBERCHARS   20  /* max chars printed by %lu */

Maybe say %llu here instead.


I do wonder why we keep relfilenodes limited to decimal digits.  Why
not use hex digits?  Then we know the limit is 14 chars, as in
0x00FFFFFFFFFFFFFF in the MAX_RELFILENUMBER definition.
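
For what it's worth, a quick standalone check of the widths involved,
assuming MAX_RELFILENUMBER is 2^56 - 1 as in the patch:

#include <stdio.h>
#include <stdint.h>

#define MAX_RELFILENUMBER ((uint64_t) 0x00FFFFFFFFFFFFFF)	/* 2^56 - 1 */

int
main(void)
{
	/* decimal: 17 digits for 2^56 - 1, up to 20 for a full uint64 */
	printf("%llu\n", (unsigned long long) MAX_RELFILENUMBER);
	/* hex: a 56-bit value never needs more than 14 digits */
	printf("%014llX\n", (unsigned long long) MAX_RELFILENUMBER);
	return 0;
}

This prints 72057594037927935 and FFFFFFFFFFFFFF respectively.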

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end." (2nd Commandment for C programmers)




Re: making relfilenodes 56 bits

2022-07-28 Thread Robert Haas
On Thu, Jul 28, 2022 at 7:32 AM Dilip Kumar  wrote:
> Thanks, I have rebased the other patches; actually, there is a new 0001
> patch now.  It seems that during the renaming of relnode-related Oid to
> RelFileNumber, some of the references were missed.  In the last patch
> set I kept that as part of the main patch 0003, but I think it's better
> to keep it separate, so I took those changes out and created 0001.  If
> you think this can be committed only as part of 0003, then that's also
> fine with me.

I committed this in part. I took out the introduction of
RELNUMBERCHARS as I think that should probably be a separate commit,
but added in a comment change that you seem to have overlooked.

> I have done some cleanup in 0002 as well, basically, earlier we were
> storing the result of the BufTagGetRelFileLocator() in a separate
> variable which is not required everywhere.  So wherever possible I
> have avoided using the intermediate variable.

I'll have a look at this next.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-27 Thread Robert Haas
On Wed, Jul 27, 2022 at 12:37 PM Dilip Kumar  wrote:
> Just realised that this should have been BufferTagsEqual instead of 
> BufferTagEqual
>
> I will modify this and send an updated patch tomorrow.

I changed it and committed.

What was formerly 0002 will need minor rebasing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-27 Thread Dilip Kumar
On Wed, 27 Jul 2022 at 9:49 PM, Dilip Kumar  wrote:

> On Wed, Jul 27, 2022 at 12:07 AM Robert Haas 
> wrote:
> >
> > On Tue, Jul 26, 2022 at 2:07 AM Dilip Kumar 
> wrote:
> > > I thought about it while doing so, but I am not sure whether it is
> > > a good idea or not, because before my change these were all macros
> > > with two naming conventions, so I just changed them to inline
> > > functions; why change the name?
> >
> > Well, the reason to change the name would be for consistency. It feels
> > weird to have some NAMES_LIKETHIS() and other NamesLikeThis().
> >
> > Now, an argument against that is that it will make back-patching more
> > annoying, if any code using these functions/macros is touched. But
> > since the calling sequence is changing anyway (you now have to pass a
> > pointer rather than the object itself) that argument doesn't really
> > carry any weight. So I would favor ClearBufferTag(), InitBufferTag(),
> > etc.
>
> Okay, so I have renamed these 2 functions and BUFFERTAGS_EQUAL as well
> to BufferTagEqual().


Just realised that this should have been BufferTagsEqual instead of
BufferTagEqual

I will modify this and send an updated patch tomorrow.
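
For reference, a minimal sketch of what the renamed pointer-taking helper
might look like; the typedefs and field layout here are illustrative
stand-ins, and the real definition lives in src/include/storage/buf_internals.h:

#include <stdbool.h>

typedef unsigned int Oid;
typedef unsigned int BlockNumber;
typedef int ForkNumber;
typedef unsigned long long RelFileNumber;	/* 56-bit value in a 64-bit field */

typedef struct BufferTag
{
	Oid			spcOid;
	Oid			dbOid;
	RelFileNumber relNumber;
	ForkNumber	forkNum;
	BlockNumber blockNum;
} BufferTag;

static inline bool
BufferTagsEqual(const BufferTag *tag1, const BufferTag *tag2)
{
	return tag1->spcOid == tag2->spcOid &&
		tag1->dbOid == tag2->dbOid &&
		tag1->relNumber == tag2->relNumber &&
		tag1->forkNum == tag2->forkNum &&
		tag1->blockNum == tag2->blockNum;
}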

—
Dilip

> --
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: making relfilenodes 56 bits

2022-07-27 Thread vignesh C
On Tue, Jul 26, 2022 at 1:32 PM Dilip Kumar  wrote:
>
> On Thu, Jul 21, 2022 at 9:53 AM Thomas Munro  wrote:
> >
> > On Wed, Jul 20, 2022 at 11:27 PM Dilip Kumar  wrote:
> > > [v10 patch set]
> >
> > Hi Dilip, I'm experimenting with these patches and will hopefully have
> > more to say soon, but I just wanted to point out that this builds with
> > warnings and failed on 3/4 of the CI OSes on cfbot's last run.  Maybe
> > there is the good kind of uninitialised data on Linux, and the bad
> > kind of uninitialised data on those other pesky systems?
>
> Here is a patch to fix the issue.  Basically, while asserting that the
> file exists, it was not setting the relfilenumber in the relfilelocator
> before generating the path, so it was just checking for the existence of
> a random path and therefore asserting randomly.

Thanks for the updated patch.  A few comments:
1) The format specifier should be changed from %u to INT64_FORMAT
autoprewarm.c -> apw_load_buffers
...
if (fscanf(file, "%u,%u,%u,%u,%u\n", &blkinfo[i].database,
		   &blkinfo[i].tablespace, &blkinfo[i].filenumber,
		   &forknum, &blkinfo[i].blocknum) != 5)
...

2) The format specifier should be changed from %u to INT64_FORMAT
autoprewarm.c -> apw_dump_now
...
ret = fprintf(file, "%u,%u,%u,%u,%u\n",
  block_info_array[i].database,
  block_info_array[i].tablespace,
  block_info_array[i].filenumber,
  (uint32) block_info_array[i].forknum,
  block_info_array[i].blocknum);
...
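
To make comments 1 and 2 concrete, here is a standalone sketch (not the
actual autoprewarm code; the <inttypes.h> conversions are stand-ins for
whatever 64-bit aware format the backend ends up using) of writing and
reading back a 64-bit filenumber:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int
main(void)
{
	FILE	   *file = tmpfile();
	unsigned int database = 5, tablespace = 1663, forknum = 0, blocknum = 42;
	uint64_t	filenumber = 100000;

	if (file == NULL)
		return 1;

	/* apw_dump_now-style write, with a 64-bit conversion for filenumber */
	fprintf(file, "%u,%u,%" PRIu64 ",%u,%u\n",
			database, tablespace, filenumber, forknum, blocknum);
	rewind(file);

	/* apw_load_buffers-style read */
	if (fscanf(file, "%u,%u,%" SCNu64 ",%u,%u\n",
			   &database, &tablespace, &filenumber, &forknum, &blocknum) != 5)
		return 1;

	printf("filenumber = %" PRIu64 "\n", filenumber);
	return 0;
}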

3) should the comment "entry point for old extension version" be on
top of pg_buffercache_pages, as the current version will use
pg_buffercache_pages_v1_4
+
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+   return pg_buffercache_pages_internal(fcinfo, OIDOID);
+}
+
+/* entry point for old extension version */
+Datum
+pg_buffercache_pages_v1_4(PG_FUNCTION_ARGS)
+{
+   return pg_buffercache_pages_internal(fcinfo, INT8OID);
+}

4) we could use the new style of ereport by removing the brackets
around errcode:
+		if (fctx->record[i].relfilenumber > OID_MAX)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("relfilenode" INT64_FORMAT " is too large to be represented as an OID",
+							fctx->record[i].relfilenumber),
+					 errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE")));

like:
			ereport(ERROR,
					errcode(ERRCODE_INVALID_PARAMETER_VALUE),
					errmsg("relfilenode" INT64_FORMAT " is too large to be represented as an OID",
						   fctx->record[i].relfilenumber),
					errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE"));

5) Similarly in the below code too:
+	/* check whether the relfilenumber is within a valid range */
+	if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("relfilenumber " INT64_FORMAT " is out of range",
+						(relfilenumber))));


6) Similarly in the below code too:
+#define CHECK_RELFILENUMBER_RANGE(relfilenumber) \
+do { \
+	if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+		ereport(ERROR, \
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE), \
+				 errmsg("relfilenumber " INT64_FORMAT " is out of range", \
+						(relfilenumber)))); \
+} while (0)
+


7) This error code looks similar to CHECK_RELFILENUMBER_RANGE, can
this macro be used here too:
pg_filenode_relation(PG_FUNCTION_ARGS)
 {
Oid reltablespace = PG_GETARG_OID(0);
-   RelFileNumber relfilenumber = PG_GETARG_OID(1);
+   RelFileNumber relfilenumber = PG_GETARG_INT64(1);
Oid heaprel;

+   /* check whether the relfilenumber is within a valid range */
+   if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("relfilenumber " INT64_FORMAT " is out of range",
+						(relfilenumber))));
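
As a self-contained toy (outside the backend, with ereport replaced by a
plain stderr report, and the helper name range_error made up for the
example) showing the shared range-check pattern that comments 5-7 are
getting at:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef int64_t RelFileNumber;

#define MAX_RELFILENUMBER ((RelFileNumber) 0x00FFFFFFFFFFFFFF)

/* toy stand-in for ereport(ERROR, ...) */
#define range_error(v) \
	do { \
		fprintf(stderr, "relfilenumber %lld is out of range\n", (long long) (v)); \
		exit(1); \
	} while (0)

#define CHECK_RELFILENUMBER_RANGE(relfilenumber) \
	do { \
		if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
			range_error(relfilenumber); \
	} while (0)

int
main(void)
{
	RelFileNumber ok = 16384;

	CHECK_RELFILENUMBER_RANGE(ok);						/* passes silently */
	CHECK_RELFILENUMBER_RANGE(MAX_RELFILENUMBER + 1);	/* reports and exits */
	return 0;
}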


8) I felt this include is not required:
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 849a7ce..a2f0d35 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -13,12 +13,16 @@

 #include "postgres.h"

+#include 
+
 #include "access/clog.h"
 #include "access/commit_ts.h"

9) should we change elog to ereport to use the new-style error reporting API?
+   /* safety check, we should never get this far in a HS standby */
+   if (RecoveryInProgress())
+   elog(ERROR, "cannot assign RelFileNumber 

Re: making relfilenodes 56 bits

2022-07-27 Thread Dilip Kumar
On Wed, Jul 27, 2022 at 1:24 PM Ashutosh Sharma  wrote:
>
> Some more comments:

Note: Please don't top post.

> ==
>
> Shouldn't we retry for a new relfilenumber if
> "ShmemVariableCache->nextRelFileNumber > MAX_RELFILENUMBER"?  There can be
> cases where some tables are dropped by the user and the relfilenumbers of
> those tables could be reused, for which we would need to find a
> relfilenumber that can be reused.  For example, consider the case below:
>
> postgres=# create table t1(a int);
> CREATE TABLE
>
> postgres=# create table t2(a int);
> CREATE TABLE
>
> postgres=# create table t3(a int);
> ERROR:  relfilenumber is out of bound
>
> postgres=# drop table t1, t2;
> DROP TABLE
>
> postgres=# checkpoint;
> CHECKPOINT
>
> postgres=# vacuum;
> VACUUM
>
> Now if I try to recreate table t3, it should succeed, shouldn't it? But it
> doesn't, because we simply error out based on the nextRelFileNumber saved
> in shared memory.
>
> postgres=# create table t1(a int);
> ERROR:  relfilenumber is out of bound
>
> I think, above should have worked.

No, it should not; the whole point of this design is never to reuse a
relfilenumber within a cluster's lifetime.  You might want to read
this mail[1], which argues that by the time we use 2^56 relfilenumbers the
cluster will have reached the end of its lifetime for other reasons anyway.

[1] 
https://www.postgresql.org/message-id/CA%2BhUKG%2BZrDms7gSjckme8YV2tzxgZ0KVfGcsjaFoKyzQX_f_Mw%40mail.gmail.com
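
A quick back-of-envelope version of that argument (the allocation rate
below is an arbitrary, deliberately generous assumption):

#include <stdio.h>

int
main(void)
{
	const double max_relfilenumber = 72057594037927936.0;	/* 2^56 */
	const double allocs_per_second = 10000.0;				/* assumed */
	const double seconds_per_year = 365.25 * 24 * 3600;

	/* prints roughly 228000 years before the counter could run out */
	printf("%.0f years\n",
		   max_relfilenumber / allocs_per_second / seconds_per_year);
	return 0;
}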

> ==
>
> 
> 
> Note that while a table's filenode often matches its OID, this is
> not necessarily the case; some operations, like
> TRUNCATE, REINDEX, 
> CLUSTER and some forms
> of ALTER TABLE, can change the filenode while preserving 
> the OID.
>
> I think this note in storage.sgml needs some improvement. It says the table's
> relfilenode often matches its OID, but it doesn't: that happens only for
> system tables, and maybe never for user tables.

Yes, this should be changed.

> postgres=# create table t2(a int);
> ERROR:  relfilenumber is out of bound
>
> Since this is a user-visible error, I think it would be good to mention 
> relfilenode instead of relfilenumber. Elsewhere (including the user manual) 
> we refer to this as a relfilenode.

No, this is expected to be an internal error, because during the
cluster's lifetime we should ideally never reach this number.  We are
adding this check so that we do not reach it due to some other
computational/programming mistake.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



