Should we represent temp files as unsigned long int instead of signed long int type?

2023-10-25 Thread Ashutosh Sharma
Hi All,

At present, we represent temp files as a signed long int number, and
depending on the system architecture (32-bit or 64-bit) the range of
signed long int varies; for example, on a 32-bit system it ranges
from -2,147,483,648 to 2,147,483,647. AFAIU, this will not allow a
session to create more than 2 billion temporary files. That is not a
small number at all, but still, what if we make it an unsigned long
int, which would allow a session to create 4 billion temporary files
if needed? I might be sounding a little stupid here, because 2 billion
temporary files amounts to about 2,000 petabytes (2 billion * 1 GB),
assuming each temp file is 1 GB in size, which is a huge amount of
data storage. However, since the variable we use to name temporary
files is a static long int (static long tempFileCounter = 0;), there
is a possibility that this number will get exhausted if the same
session tries to create too many temp files via multiple queries.

Just adding a few lines of code related to this from fd.c:

/*
 * Number of temporary files opened during the current session;
 * this is used in generation of tempfile names.
 */
static long tempFileCounter = 0;

/*
 * Generate a tempfile name that should be unique within the current
 * database instance.
 */
snprintf(tempfilepath, sizeof(tempfilepath), "%s/%s%d.%ld",
         tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, tempFileCounter++);

--
With Regards,
Ashutosh Sharma.




Re: Can a role have indirect ADMIN OPTION on another role?

2023-09-07 Thread Ashutosh Sharma
On Thu, Sep 7, 2023 at 3:43 AM David G. Johnston wrote:
>
> On Wed, Sep 6, 2023 at 1:55 PM Ashutosh Sharma  wrote:
>>
>> But what if roleB doesn't want to give roleA access to
>> the certain objects it owns.
>
>
> Not doable - roleA can always pretend they are roleB one way or another since 
> roleA made roleB.
>

Okay, it makes sense. Thanks!

--
With Regards,
Ashutosh Sharma.




Re: Can a role have indirect ADMIN OPTION on another role?

2023-09-07 Thread Ashutosh Sharma
On Thu, Sep 7, 2023 at 12:20 AM Robert Haas  wrote:
>
> On Wed, Sep 6, 2023 at 1:33 PM Ashutosh Sharma  wrote:
> > Actually I have one more question. With this new design, assuming that
> > createrole_self_grant is set to 'set, inherit' in postgresql.conf and
> > if roleA creates roleB. So, in this case, roleA will inherit
> > permissions of roleB which means roleA will have access to objects
> > owned by roleB. But what if roleB doesn't want to give roleA access to
> > the certain objects it owns. As an example let's say that roleB
> > creates a table 't' and by default (with this setting) roleA will have
> > access to this table, but for some reason roleB does not want roleA to
> > have access to it. So what's the option for roleB? I've tried running
> > "revoke select on table t from roleA" but that doesn't seem to be
> > working. the only option that works is roleA himself set inherit
> > option on roleB to false - "grant roleB to roleA with inherit false;"
>
> It doesn't matter what roleB wants. roleA is strictly more powerful
> than roleB and can do whatever they want to roleB or roleB's objects
> regardless of how roleB feels about it.
>
> In the same way, the superuser is strictly more powerful than either
> roleA or roleB and can override any security control that either one
> of them put in place.
>
> Neither roleB nor roleA has any right to hide their data from the
> superuser, and roleB has no right to hide data from roleA. It's a
> hierarchy. If you're on top, you're in charge, and that's it.
>
> Here again, it can't really meaningfully work in any other way.
> Suppose you were to add a feature to allow roleB to hide data from
> roleA. Given that roleA has the ability to change roleB's password,
> how could that possibly work? When you give one user the ability to
> administer another user, that includes the right to change that user's
> password, change whether they can log in, drop the role, give the
> privileges of that role to themselves or other users, and a whole
> bunch of other super-powerful stuff. You can't really give someone
> that level of power over another account and, at the same time, expect
> the account being administered to be able to keep the more powerful
> account from doing stuff. It just can't possibly work. If you want
> roleB to be able to resist roleA, you have to give roleA less power.
>

I agree with you. Thank you once again.

--
With Regards,
Ashutosh Sharma.




Re: Can a role have indirect ADMIN OPTION on another role?

2023-09-06 Thread Ashutosh Sharma
On Wed, Sep 6, 2023 at 9:03 PM Robert Haas  wrote:
>
> On Wed, Sep 6, 2023 at 11:14 AM Ashutosh Sharma  wrote:
> > In PG-16, I see that we have made a lot of changes in the area roles
> > and privileges. I have a question related to this and here is my
> > question:
> >
> > Let's say there is a roleA who creates roleB and then roleB creates
> > another role, say roleC. By design, A can administer B and B can
> > administer C. But, can A administer C although it has not created C?
>
> Ultimately, yes, because A can get access to all of B's privileges,
> which include administering C. However, A might or might not have B's
> privileges by default, depending on the value of createrole_self_grant
> in effect at the time when B was created. So, depending on the
> situation, A might (or might not) need to do something like GRANT
> roleB to roleA or SET ROLE roleB in order to be able to actually
> execute the administration commands in question.
>
> IMHO, it really couldn't reasonably work in any other way. Consider
> that A's right to administer B includes the right to change B's
> password. If the superuser wants users A and B that can't interfere
> with each other, the superuser should create both of those accounts
> themselves instead of letting one create the other.
>

Thank you for the clarification. This is very helpful.

Actually, I have one more question. With this new design, assume that
createrole_self_grant is set to 'set, inherit' in postgresql.conf and
roleA creates roleB. In this case, roleA will inherit the permissions
of roleB, which means roleA will have access to objects owned by
roleB. But what if roleB doesn't want to give roleA access to certain
objects it owns? As an example, let's say roleB creates a table 't'.
By default (with this setting) roleA will have access to this table,
but for some reason roleB does not want roleA to have access to it.
So what's the option for roleB? I've tried running "revoke select on
table t from roleA", but that doesn't seem to be working. The only
option that works is for roleA itself to set the inherit option on
roleB to false: "grant roleB to roleA with inherit false;"
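
For reference, a minimal sketch of the scenario described above (role and table names are from the example; it assumes a superuser first creates roleA with CREATEROLE, and that createrole_self_grant = 'set, inherit' is in effect):

```sql
-- as roleA: creating roleB auto-grants roleB to roleA with SET and INHERIT
CREATE ROLE roleB LOGIN;

-- as roleB
CREATE TABLE t (id int);

-- as roleB: this has no effect, because roleA's access comes from
-- inherited membership in roleB, not from an explicit table-level grant
REVOKE SELECT ON TABLE t FROM roleA;

-- as roleA: the only thing that works is turning off inheritance
GRANT roleB TO roleA WITH INHERIT FALSE;
```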

--
With Regards,
Ashutosh Sharma.




Can a role have indirect ADMIN OPTION on another role?

2023-09-06 Thread Ashutosh Sharma
Hi,

In PG 16, I see that we have made a lot of changes in the area of
roles and privileges. I have a question related to this:

Let's say there is a roleA who creates roleB, and then roleB creates
another role, say roleC. By design, A can administer B and B can
administer C. But can A administer C even though it has not created C?
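
A minimal sketch of the setup being asked about (assuming each role is given CREATEROLE so it can create the next one):

```sql
-- as superuser
CREATE ROLE roleA CREATEROLE LOGIN;

-- as roleA
CREATE ROLE roleB CREATEROLE LOGIN;

-- as roleB
CREATE ROLE roleC LOGIN;

-- question: can roleA now run e.g. ALTER ROLE roleC ... ?
```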

--
With Regards,
Ashutosh Sharma.




Re: Return value of pg_promote()

2023-08-28 Thread Ashutosh Sharma
Hi Michael,

On Thu, Aug 17, 2023 at 6:07 AM Michael Paquier  wrote:
>
> On Wed, Aug 16, 2023 at 05:02:09PM +0900, Michael Paquier wrote:
> >  if (kill(PostmasterPid, SIGUSR1) != 0)
> >  {
> > -ereport(WARNING,
> > -(errmsg("failed to send signal to postmaster: %m")));
> >  (void) unlink(PROMOTE_SIGNAL_FILE);
> > -PG_RETURN_BOOL(false);
> > +ereport(ERROR,
> > +(errmsg("failed to send signal to postmaster: %m")));
> >  }
> >
> > Shouldn't you assign an error code to this one rather than the
> > default one for internal errors, like ERRCODE_SYSTEM_ERROR?
> >
> >  /* return immediately if waiting was not requested */
> > @@ -744,7 +743,9 @@ pg_promote(PG_FUNCTION_ARGS)
> >   * necessity for manual cleanup of all postmaster children.
> >   */
> >  if (rc & WL_POSTMASTER_DEATH)
> > -PG_RETURN_BOOL(false);
> > +ereport(FATAL,
> > +(errcode(ERRCODE_ADMIN_SHUTDOWN),
> > + errmsg("terminating connection due to unexpected 
> > postmaster exit")));
> >
> > I would add an errcontext here, to let somebody know that the
> > connection died while waiting for the promotion to be processed, say
> > "while waiting on promotion".
>
> I have just noticed that we do not have a CF entry for this proposal,
> so I have added one with Laurenz as author:
> https://commitfest.postgresql.org/44/4504/
>
> For now the patch is waiting on author.  Could you address my
> last review?

Thanks for reviewing the patch and adding a CF entry for it. PFA a
patch that addresses your review comments.

And... Sorry for the delayed response. I totally missed it.

--
With Regards,
Ashutosh Sharma.


v2-error-out-in-case-pg_promote-unable-to-send-SIGUSR1-to-postmaster.patch
Description: Binary data


Re: Return value of pg_promote()

2023-06-08 Thread Ashutosh Sharma
On Wed, Jun 7, 2023 at 9:55 PM Fujii Masao  wrote:
>
>
>
> On 2023/06/07 2:00, Laurenz Albe wrote:
> > On Tue, 2023-06-06 at 16:35 +0530, Ashutosh Sharma wrote:
> >> At present, pg_promote() returns true to the caller on successful
> >> promotion of standby, however it returns false in multiple scenarios
> >> which includes:
> >>
> >> 1) The SIGUSR1 signal could not be sent to the postmaster process.
> >> 2) The postmaster died during standby promotion.
> >> 3) Standby couldn't be promoted within the specified wait time.
> >>
> >> For an application calling this function, if pg_promote returns false,
> >> it is hard to interpret the reason behind it. So I think we should
> >> *only* allow pg_promote to return false when the server could not be
> >> promoted in the given wait time and in other scenarios it should just
> >> throw an error (FATAL, ERROR ... depending on the type of failure that
> >> occurred). Please let me know your thoughts on this change. thanks.!
> >
> > As the original author, I'd say that that sounds reasonable, particularly
> > in case #1.  If the postmaster dies, we are going to die too, so it
> > probably doesn't matter much.  But I think an error is certainly also
> > correct in that case.
>
> +1
>

Thanks for sharing your thoughts, Laurenz and Fujii-san. I've prepared
a patch that makes pg_promote error out if it couldn't send SIGUSR1 to
the postmaster or if the postmaster died in the middle of standby
promotion. PFA. Please note that now (with this patch) pg_promote only
returns false if the standby could not be promoted within the given
wait time. In case of any kind of failure, it just reports an error
based on the type of failure that occurred.

--
With Regards,
Ashutosh Sharma.


error-out-in-case-pg_promote-unable-to-send-SIGUSR1-to-postmaster.patch
Description: Binary data


Return value of pg_promote()

2023-06-06 Thread Ashutosh Sharma
Hi All,

At present, pg_promote() returns true to the caller on successful
promotion of the standby; however, it returns false in multiple
scenarios, which include:

1) The SIGUSR1 signal could not be sent to the postmaster process.
2) The postmaster died during standby promotion.
3) Standby couldn't be promoted within the specified wait time.

For an application calling this function, if pg_promote returns false,
it is hard to interpret the reason behind it. So I think we should
*only* allow pg_promote to return false when the server could not be
promoted within the given wait time; in other scenarios it should just
throw an error (FATAL, ERROR, ... depending on the type of failure
that occurred). Please let me know your thoughts on this change. Thanks!
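
For context, the function in question is typically called like this (the wait_seconds value is arbitrary):

```sql
-- returns true on successful promotion; today it also returns false
-- for each of the three failure scenarios listed above
SELECT pg_promote(wait => true, wait_seconds => 60);
```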

--
With Regards,
Ashutosh Sharma.




Re: Minimal logical decoding on standbys

2023-02-15 Thread Ashutosh Sharma
On Wed, Feb 15, 2023 at 11:48 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-02-15 18:02:11 +0530, Ashutosh Sharma wrote:
> > Thanks Andres. I have one more query (both for you and Bertrand). I
> > don't know if this has already been answered somewhere in this mail
> > thread, if yes, please let me know the mail that answers this query.
> >
> > Will there be a problem if we mandate the use of physical replication
> > slots and hot_standby_feedback to support minimum LD on standby. I
> > know people can do a physical replication setup without a replication
> > slot or even with hot_standby_feedback turned off, but are we going to
> > have any issue if we ask them to use a physical replication slot and
> > turn on hot_standby_feedback for LD on standby. This will reduce the
> > code changes required to do conflict handling for logical slots on
> > standby which is being done by v50-0001 and v50-0002* patches
> > currently.
>
> I don't think it would. E.g. while restoring from archives we can't rely on
> knowing that the slot still exists on the primary.
>
> We can't just do corrupt things, including potentially crashing, when the
> configuration is wrong. We can't ensure that the configuration is accurate all
> the time. So we need to detect this case. Hence needing to detect conflicts.
>

OK. Got it, thanks.

>
> > IMHO even in normal scenarios i.e. when we are not doing LD on
> > standby, we should mandate the use of a physical replication slot.
>
> I don't think that's going to fly. There plenty scenarios where you e.g. don't
> want to use a slot, e.g. when you want to limit space use on the primary.
>

I think this can be controlled via the max_slot_wal_keep_size GUC, if
I understand it correctly.

--
With Regards,
Ashutosh Sharma.




Re: Minimal logical decoding on standbys

2023-02-15 Thread Ashutosh Sharma
On Thu, Jan 12, 2023 at 11:46 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-01-12 20:08:55 +0530, Ashutosh Sharma wrote:
> > I previously participated in the discussion on "Synchronizing the
> > logical replication slots from Primary to Standby" and one of the
> > purposes of that project was to synchronize logical slots from primary
> > to standby so that if failover occurs, it will not affect the logical
> > subscribers of the old primary much. Can someone help me understand
> > how we are going to solve this problem with this patch? Are we going
> > to encourage users to do LR from standby instead of primary to get rid
> > of such problems during failover?
>
> It only provides a building block towards that. The "Synchronizing the logical
> replication slots from Primary to Standby" project IMO needs all of the
> infrastructure in this patch. With the patch, a logical rep solution can
> e.g. maintain one slot on the primary and one on the standby, and occasionally
> forward the slot on the standby to the position of the slot on the primary. In
> case of a failover it can just start consuming changes from the former
> standby, all the necessary changes are guaranteed to be present.
>
>
> > Also, one small observation:
> >
> > I just played around with the latest (v38) patch a bit and found that
> > when a new logical subscriber of standby is created, it actually
> > creates two logical replication slots for it on the standby server.
> > May I know the reason for creating an extra replication slot other
> > than the one created by create subscription command? See below:
>
> That's unrelated to this patch. There's no changes to the "higher level"
> logical replication code dealing with pubs and subs, it's all on the "logical
> decoding" level.
>
> I think this because logical rep wants to be able to concurrently perform
> ongoing replication, and synchronize tables added to the replication set. The
> pg_16399_sync_16392_7187728548042694423 slot should vanish after the initial
> synchronization.
>

Thanks Andres. I have one more query (both for you and Bertrand). I
don't know if this has already been answered somewhere in this mail
thread; if yes, please point me to the mail that answers it.

Will there be a problem if we mandate the use of physical replication
slots and hot_standby_feedback to support minimal LD on standby? I
know people can do a physical replication setup without a replication
slot or even with hot_standby_feedback turned off, but are we going to
have any issue if we ask them to use a physical replication slot and
turn on hot_standby_feedback for LD on standby? This would reduce the
code changes required to do conflict handling for logical slots on
standby, which is currently being done by the v50-0001 and v50-0002*
patches.
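
For reference, a sketch of the kind of setup being proposed as mandatory (the slot name is an example):

```sql
-- on the primary: create a physical slot for the standby
SELECT pg_create_physical_replication_slot('standby1_slot');

-- on the standby, in postgresql.conf:
--   primary_slot_name    = 'standby1_slot'
--   hot_standby_feedback = on
```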

IMHO even in normal scenarios i.e. when we are not doing LD on
standby, we should mandate the use of a physical replication slot.

--
With Regards,
Ashutosh Sharma.




Re: Minimal logical decoding on standbys

2023-01-12 Thread Ashutosh Sharma
Hi,

On Thu, Jan 12, 2023 at 5:29 PM Drouvot, Bertrand wrote:
>
> Hi,
>
> On 1/11/23 7:04 PM, Drouvot, Bertrand wrote:
> > Hi,
> >
> > Please find V38 attached, I'll look at the other comments you've done in 
> > [1] on 0004 and 0006.
> >
> >

Sorry for joining late; I totally missed it. AFAICU, with this patch
users can now do LR from a standby; previously they could only do it
from the primary server.

To start with, I have one small question:

I previously participated in the discussion on "Synchronizing the
logical replication slots from Primary to Standby" and one of the
purposes of that project was to synchronize logical slots from primary
to standby so that if failover occurs, it will not affect the logical
subscribers of the old primary much. Can someone help me understand
how we are going to solve this problem with this patch? Are we going
to encourage users to do LR from standby instead of primary to get rid
of such problems during failover?

Also, one small observation:

I just played around with the latest (v38) patch a bit and found that
when a new logical subscriber of the standby is created, it actually
creates two logical replication slots for it on the standby server.
May I know the reason for creating an extra replication slot other
than the one created by the CREATE SUBSCRIPTION command? See below:

Subscriber:
=
create subscription t1_sub connection 'host=127.0.0.1 port=38500
dbname=postgres user=ashu' publication t1_pub;

Standby:
===
postgres=# select * from pg_replication_slots;
                slot_name                |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
-----------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
 pg_16399_sync_16392_7187728548042694423 | pgoutput | logical   |      5 | postgres | f         | t      |     112595 |      |          760 | 0/3082528   |                     | reserved   |               | f
 t1_sub                                  | pgoutput | logical   |      5 | postgres | f         | t      |     111940 |      |          760 | 0/30824F0   | 0/3082528           | reserved   |               | f
(2 rows)

May I know the reason for creating pg_16399_sync_16392_7187728548042694423?

--
With Regards,
Ashutosh Sharma.




Re: confirmed_flush_lsn shows LSN of the data that has not yet been received by the logical subscriber.

2022-11-04 Thread Ashutosh Sharma
On Fri, Nov 4, 2022 at 3:29 PM Amit Kapila  wrote:
>
> On Thu, Nov 3, 2022 at 4:49 PM Amit Kapila  wrote:
> >
> > On Mon, Sep 19, 2022 at 8:09 PM Ashutosh Sharma  
> > wrote:
> > >
> > > Thanks for the clarification. Attached is the patch with the changes.
> > > Please have a look.
> > >
> >
> > LGTM. I'll push this tomorrow unless there are any other
> > comments/suggestions for this.
> >
>
> Pushed!
>

Thanks, Amit!

--
With Regards,
Ashutosh Sharma.




Re: Add more docs for pg_surgery?

2022-09-26 Thread Ashutosh Sharma
On Mon, Sep 26, 2022 at 9:29 PM Zhang Mingli  wrote:
>
> Hi, hackers
>
> heap_force_kill/heap_force_freeze doesn’t consider other transactions that 
> are using the same tuples even with tuple-locks.
> The functions may break transaction semantic, ex:
>
> session1
> ```
> create table htab(id int);
> insert into htab values (100), (200), (300), (400), (500);
> ```
>
> session2
> ```
> begin isolation level repeatable read;
> select * from htab for share;
>  id
> -
>  100
>  200
>  300
>  400
>  500
> (5 rows)
> ```
>
> session1
> ```
> select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
>  heap_force_kill
> -
>
> (1 row)
> ```
>
> session2
> ```
> select * from htab for share;
>  id
> -
>  200
>  300
>  400
>  500
> (4 rows)
> ```
>
> session2 should get the same results as it's repeatable read isolation level.
>
> By reading the doc:
> ```
> The pg_surgery module provides various functions to perform surgery on a 
> damaged relation. These functions are unsafe by design and using them may 
> corrupt (or further corrupt) your database. For example, these functions can 
> easily be used to make a table inconsistent with its own indexes, to cause 
> UNIQUE or FOREIGN KEY constraint violations, or even to make tuples visible 
> which, when read, will cause a database server crash. They should be used 
> with great caution and only as a last resort.
>
> ```
> I know they are powerful tools, but also a little surprise with the above 
> example.
>
> Should we add more docs to tell the users that the tool will change the 
> tuples anyway even there are tuple-locks on them?
>

As the name suggests, and as documented, heap_force_kill will "force
kill" the tuple regardless of whether it is visible to another
transaction. Further, it looks like you are experimenting on an
undamaged relation, which is not recommended, as documented. Had the
relation been damaged, you probably would not have been able to
access it at all.

--
With Regards,
Ashutosh Sharma.




Re: binary version of pg_current_wal_insert_lsn and pg_walfile_name functions

2022-09-23 Thread Ashutosh Sharma
On Fri, Sep 23, 2022 at 12:24 PM Ashutosh Sharma  wrote:
>
> PFA v2 patch.
>
> Changes in the v2 patch:
>
> - Reuse the existing get_controlfile function in
> src/common/controldata_utils.c instead of adding a new one.
>
> - Set env variable PGDATA with the data directory specified by the user.
>

Forgot to attach the patch with above changes. Here it is.

--
With Regards,
Ashutosh Sharma.


v2-0001-Enhance-pg_waldump-to-show-latest-LSN-and-the-corrre.patch
Description: Binary data


Re: binary version of pg_current_wal_insert_lsn and pg_walfile_name functions

2022-09-23 Thread Ashutosh Sharma
PFA v2 patch.

Changes in the v2 patch:

- Reuse the existing get_controlfile function in
src/common/controldata_utils.c instead of adding a new one.

- Set env variable PGDATA with the data directory specified by the user.

--
With Regards,
Ashutosh Sharma.




Re: binary version of pg_current_wal_insert_lsn and pg_walfile_name functions

2022-09-22 Thread Ashutosh Sharma
On Fri, Sep 23, 2022 at 6:05 AM Bharath Rupireddy wrote:
>
> On Thu, Sep 22, 2022 at 10:25 PM Ashutosh Sharma  
> wrote:
> >
> > PFA that enhances pg_waldump to show the latest LSN and the
> > corresponding WAL file when the -l or --lastLSN option is passed an
> > argument to pg_waldump. Below is an example:
>
> Thanks for the patch. I have some quick thoughts about it.
>
> > When the user passes the '-l' command line option along with the data
> > directory path to pg_waldump, it reads the control file from the data
> > directory.
>
> I don't think we need a new option for data directory -D. pg_waldump's
> option 'p' can be used, please see the comments around
> identify_target_directory().
>

-p is the path to the WAL directory. It doesn't necessarily have to be
the data directory; the user can specify the data directory path here
as well, from which the path to the WAL directory can be derived, but
as I said, that doesn't mean -p will always represent the data
directory.

> > From the control file, it gets information like redo
> > pointer and current timeline id.
>
> Is there any reason for not using get_control_file() from
> src/common/controldata_utils.c, but defining the exact same function
> in pg_waldump.c?
>

Will give it a thought later. If possible, I will try to reuse it.

> > The redo pointer is considered to be
> > the start pointer from where the pg_waldump starts reading wal data
> > until end-of-wal to find the last LSN. For details please check the
> > attached patch.
>
> Making it dependent on the controlfile limits the usability of this
> feature. Imagine, using this feature on an archive location or
> pg_receivewal target directory where there are WAL files but no
> controlfile. I think we can choose the appropriate combinations of
> existing pg_waldump options, for instance, let users enter the start
> WAL segment via startseg and/or start LSN via --start and the new
> option for end WAL segment and end LSN.
>

I have written this patch assuming that the end user is not aware of
any LSN or any other WAL data and wants to know the last LSN. So all
they can do is take the help of the control data to find the redo LSN
and use that as a reference point (start pointer) to find the last
LSN. And whatever the WAL directory is (be it an archive location,
WAL collected via pg_receivewal, or the pg_wal directory), we will
consider the redo pointer as the start pointer. Now, it's possible
that the WAL corresponding to the start pointer is not available at
all in a WAL directory such as an archive location or a pg_receivewal
directory, in which case this cannot be used, but that is very
unlikely.

> > Please note that for compressed and .partial wal files this doesn't work.
>
> Looking forward to the above capability because it expands the
> usability of this feature.
>

This is a different task altogether. We will probably need to work on
it separately.

--
With Regards,
Ashutosh Sharma.




Re: binary version of pg_current_wal_insert_lsn and pg_walfile_name functions

2022-09-22 Thread Ashutosh Sharma
On Thu, Sep 22, 2022 at 7:41 AM Bharath Rupireddy wrote:
>
> On Wed, Sep 21, 2022 at 9:53 PM Ashutosh Sharma  wrote:
> >
> > Yeah, we can either add this functionality to pg_waldump or maybe add
> > a new binary itself that would return this information.
>
> IMV, a separate tool isn't the way, since pg_waldump already reads WAL
> files and decodes WAL records, what's proposed here can be an
> additional functionality of pg_waldump.
>
> It will be great if an initial patch is posted here.
>

PFA a patch that enhances pg_waldump to show the latest LSN and the
corresponding WAL file when the -l or --lastLSN option is passed to
pg_waldump. Below is an example:

ashu@92893de650ed:~/pgsql$ pg_waldump -l -D ./data-dir
Latest LSN: 0/148A45F8
Latest WAL filename: 000000010000000000000014

How has it been coded?

When the user passes the '-l' command line option along with the data
directory path to pg_waldump, it reads the control file from the data
directory. From the control file, it gets information like redo
pointer and current timeline id. The redo pointer is considered to be
the start pointer from where the pg_waldump starts reading wal data
until end-of-wal to find the last LSN. For details please check the
attached patch.

Please note that for compressed and .partial wal files this doesn't work.

--
With Regards,
Ashutosh Sharma.


0001-Enhance-pg_waldump-to-show-latest-LSN-and-the-WAL-fi.patch
Description: Binary data


Re: binary version of pg_current_wal_insert_lsn and pg_walfile_name functions

2022-09-21 Thread Ashutosh Sharma
On Tue, Sep 20, 2022 at 5:13 PM Bharath Rupireddy wrote:
>
> On Mon, Sep 19, 2022 at 8:19 PM Ashutosh Sharma  wrote:
> >
> > Hi All,
> >
> > Currently, we have pg_current_wal_insert_lsn and pg_walfile_name sql
> > functions which gives us information about the next wal insert
> > location and the WAL file that the next wal insert location belongs
> > to. Can we have a binary version of these sql functions?
>
> +1 for the idea in general.
>
> As said, pg_waldump seems to be the right candidate. I think we want
> the lsn of the last WAL record and its info and the WAL file name
> given an input data directory or just the pg_wal directory or any
> directory where WAL files are located. For instance, one can use this
> on an archive location containing archived WAL files or on a node
> where pg_receivewal is receiving WAL files. Am I missing any other
> use-cases?
>

Yeah, we can either add this functionality to pg_waldump or maybe add
a new binary itself that would return this information.

--
With Regards,
Ashutosh Sharma.




Re: confirmed_flush_lsn shows LSN of the data that has not yet been received by the logical subscriber.

2022-09-21 Thread Ashutosh Sharma
On Wed, Sep 21, 2022 at 7:21 PM Ashutosh Bapat wrote:
>
> On Mon, Sep 19, 2022 at 8:09 PM Ashutosh Sharma  wrote:
> >
> > On Mon, Sep 19, 2022 at 5:24 PM Ashutosh Bapat
> >  wrote:
> > >
> > > On Mon, Sep 19, 2022 at 1:43 PM Ashutosh Sharma  
> > > wrote:
> > > >
> > > > On Fri, Sep 9, 2022 at 5:36 PM Ashutosh Bapat
> > > >  wrote:
> > > > >
> > > > > On Thu, Sep 8, 2022 at 8:32 PM Ashutosh Sharma 
> > > > >  wrote:
> > > > > >
> > > > > > On Thu, Sep 8, 2022 at 6:23 PM Ashutosh Bapat
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Thu, Sep 8, 2022 at 4:14 PM Ashutosh Sharma 
> > > > > > >  wrote:
> > > > > Can you please point to the documentation.
> > > > >
> > > >
> > > > AFAIU there is just one documentation. Here is the link for it:
> > > >
> > > > https://www.postgresql.org/docs/current/view-pg-replication-slots.html
> > >
> > > Thanks. Description of confirmed_flush_lsn is "The address (LSN) up to
> > > which the logical slot's consumer has confirmed receiving data. Data
> > > older than this is not available anymore. NULL for physical slots."
> > > The second sentence is misleading. AFAIU, it really should be "Data
> > > corresponding to the transactions committed before this LSN is not
> > > available anymore". WAL before restart_lsn is likely to be removed but
> > > WAL with LSN higher than restart_lsn is preserved. This correction
> > > makes more sense because of the third sentence.
> > >
> >
> > Thanks for the clarification. Attached is the patch with the changes.
> > Please have a look.
> >
> Looks good to me. Do you want to track this through commitfest app?
>

Yeah, I've added an entry for it in the commitfest app and marked it
as ready for committer. Thanks for the suggestion.

--
With Regards,
Ashutosh Sharma.




binary version of pg_current_wal_insert_lsn and pg_walfile_name functions

2022-09-19 Thread Ashutosh Sharma
Hi All,

Currently, we have the pg_current_wal_insert_lsn and pg_walfile_name
SQL functions, which give us information about the next WAL insert
location and the WAL file that the next WAL insert location belongs
to. Can we have a binary version of these SQL functions? It would be
like any other binary we have, e.g. pg_waldump, to which we can pass
the location of the pg_wal directory. This binary would scan through
the directory to return the next WAL insert location and the WAL file
that location belongs to.
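
For reference, the SQL versions being discussed are used like this on a running server:

```sql
-- next WAL insert location, and the WAL file it falls in
SELECT pg_current_wal_insert_lsn();
SELECT pg_walfile_name(pg_current_wal_insert_lsn());
```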

The binary version of these SQL functions could be used when the
server is offline. This can help us know the overall WAL data that
needs to be replayed when the server is in recovery. In the control
file we do have the redo pointer; knowing the end pointer would
definitely be helpful.

If you are OK with it, then I will prepare a patch and share it.
Please let me know your thoughts/comments. Thank you!

--
With Regards,
Ashutosh Sharma.




Re: confirmed_flush_lsn shows LSN of the data that has not yet been received by the logical subscriber.

2022-09-19 Thread Ashutosh Sharma
On Mon, Sep 19, 2022 at 5:24 PM Ashutosh Bapat wrote:
>
> On Mon, Sep 19, 2022 at 1:43 PM Ashutosh Sharma  wrote:
> >
> > On Fri, Sep 9, 2022 at 5:36 PM Ashutosh Bapat
> >  wrote:
> > >
> > > On Thu, Sep 8, 2022 at 8:32 PM Ashutosh Sharma  
> > > wrote:
> > > >
> > > > On Thu, Sep 8, 2022 at 6:23 PM Ashutosh Bapat
> > > >  wrote:
> > > > >
> > > > > On Thu, Sep 8, 2022 at 4:14 PM Ashutosh Sharma 
> > > > >  wrote:
> > > Can you please point to the documentation.
> > >
> >
> > AFAIU there is just one documentation. Here is the link for it:
> >
> > https://www.postgresql.org/docs/current/view-pg-replication-slots.html
>
> Thanks. Description of confirmed_flush_lsn is "The address (LSN) up to
> which the logical slot's consumer has confirmed receiving data. Data
> older than this is not available anymore. NULL for physical slots."
> The second sentence is misleading. AFAIU, it really should be "Data
> corresponding to the transactions committed before this LSN is not
> available anymore". WAL before restart_lsn is likely to be removed but
> WAL with LSN higher than restart_lsn is preserved. This correction
> makes more sense because of the third sentence.
>

Thanks for the clarification. Attached is the patch with the changes.
Please have a look.

--
With Regards,
Ashutosh Sharma.
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 3f573a4..0d7285d 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -2392,8 +2392,9 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   
   
The address (LSN) up to which the logical
-   slot's consumer has confirmed receiving data. Data older than this is
-   not available anymore. NULL for physical slots.
+   slot's consumer has confirmed receiving data. Data corresponding to the
+   transactions committed before this LSN is not
+   available anymore. NULL for physical slots.
   
  
 


Re: confirmed_flush_lsn shows LSN of the data that has not yet been received by the logical subscriber.

2022-09-19 Thread Ashutosh Sharma
On Fri, Sep 9, 2022 at 5:36 PM Ashutosh Bapat
 wrote:
>
> On Thu, Sep 8, 2022 at 8:32 PM Ashutosh Sharma  wrote:
> >
> > On Thu, Sep 8, 2022 at 6:23 PM Ashutosh Bapat
> >  wrote:
> > >
> > > On Thu, Sep 8, 2022 at 4:14 PM Ashutosh Sharma  
> > > wrote:
> Can you please point to the documentation.
>

AFAIU there is just one documentation. Here is the link for it:

https://www.postgresql.org/docs/current/view-pg-replication-slots.html

> It's true that it needs to be clarified. But what you are saying may
> not be entirely true in case of streamed transaction. In that case we
> might send logically decoded changes of an ongoing transaction as
> well. They may even get applied but not necessarily committed. It's a
> bit complicated. :)
>

This can happen in case of big transactions. That's the reason I
mentioned that the transaction has a small set of data which is not
yet committed but the confirmed_flush_lsn says it has already reached
the logical subscriber.

And.. lastly sorry for the delayed response. I was not well and
couldn't access email for quite some days. The poor dengue had almost
killed me :(

--
With Regards,
Ashutosh Sharma.




Re: confirmed_flush_lsn shows LSN of the data that has not yet been received by the logical subscriber.

2022-09-08 Thread Ashutosh Sharma
On Thu, Sep 8, 2022 at 6:23 PM Ashutosh Bapat
 wrote:
>
> On Thu, Sep 8, 2022 at 4:14 PM Ashutosh Sharma  wrote:
> >
> > Hi All,
> >
> > The logically decoded data are sent to the logical subscriber at the time 
> > of transaction commit, assuming that the data is small. However, before the 
> > transaction commit is performed, the LSN representing the data that is yet 
> > to be received by the logical subscriber appears in the confirmed_flush_lsn 
> > column of pg_replication_slots catalog. Isn't the information seen in the 
> > confirmed_flush_lsn column while the transaction is in progress incorrect ? 
> > esp considering the description given in the pg doc for this column.
> >
> > Actually, while the transaction is running, the publisher keeps on sending 
> > keepalive messages containing LSN of the last decoded data saved in reorder 
> > buffer and the subscriber responds with the same LSN as the last received 
> > LSN which is then updated as confirmed_flush_lsn by the publisher. I think 
> > the LSN that we are sending with the keepalive message should be the one 
> > representing the transaction begin message, not the LSN of the last decoded 
> > data which is yet to be sent. Please let me know if I am missing something 
> > here.
>
> The transactions with commit lsn < confirmed_flush_lsn are confirmed
> to be received (and applied) by the subscriber. Setting the LSN
> corresponding to a WAL record within a transaction in progress as
> confirmed_flush should be ok. Since the transactions are interleaved
> in the WAL stream, it's quite possible that LSNs of some WAL records of
> an inflight transaction are lesser than the commit LSN of some other
> transaction. So setting the commit LSN of another transaction is
> effectively the same as setting it to the LSN of any previous WAL
> record, irrespective of the transaction that it belongs to.

Thank you Ashutosh for the explanation. I still feel that the
documentation on confirmed_flush_lsn needs some improvement. It
actually claims that all the data before the confirmed_flush_lsn has
been received by the logical subscriber, but that's not the case. It
actually means that all the data belonging to the transactions with
commit lsn < confirmed_flush_lsn has been received and applied by the
subscriber. So setting confirmed_flush_lsn to the lsn of wal records
generated by running transaction might make people think that the wal
records belonging to previous data of the same running transaction has
already been received and applied by the subscriber node, but that's
not true.

--
With Regards,
Ashutosh Sharma.




confirmed_flush_lsn shows LSN of the data that has not yet been received by the logical subscriber.

2022-09-08 Thread Ashutosh Sharma
Hi All,

The logically decoded data are sent to the logical subscriber at the time
of transaction commit, assuming that the data is small. However, before the
transaction commit is performed, the LSN representing the data that is yet
to be received by the logical subscriber appears in the confirmed_flush_lsn
column of pg_replication_slots catalog. Isn't the information seen in the
confirmed_flush_lsn column while the transaction is in progress incorrect ?
esp considering the description given in the pg doc for this column.

Actually, while the transaction is running, the publisher keeps on sending
keepalive messages containing LSN of the last decoded data saved in reorder
buffer and the subscriber responds with the same LSN as the last received
LSN which is then updated as confirmed_flush_lsn by the publisher. I think
the LSN that we are sending with the keepalive message should be the one
representing the transaction begin message, not the LSN of the last decoded
data which is yet to be sent. Please let me know if I am missing something
here.

--
With Regards,
Ashutosh Sharma.


Re: Correct comment in RemoveNonParentXlogFiles()

2022-08-04 Thread Ashutosh Sharma
On Thu, Aug 4, 2022 at 11:30 AM Kyotaro Horiguchi 
wrote:

> At Wed, 3 Aug 2022 18:16:33 +0530, Ashutosh Sharma 
> wrote in
> > Following comment in RemoveNonParentXlogFiles() says that we are trying
> to
> > remove any WAL file whose segment number is >= the segment number of the
> > first WAL file on the new timeline. However, looking at the code, I can
> say
> > that we are trying to remove the WAL files from the previous timeline
> whose
> > segment number is just greater than (not equal to) the segment number of
> > the first WAL file in the new timeline. I think we should improve this
> > comment, thoughts?
> >
> > /*
> >  * Remove files that are on a timeline older than the new one
> we're
> >  * switching to, but with a segment number >= the first segment
> on
> > the
> >  * new timeline.
> >  */
> > if (strncmp(xlde->d_name, switchseg, 8) < 0 &&
> > strcmp(xlde->d_name + 8, switchseg + 8) > 0)
>
> I'm not sure I'm fully getting your point.  The current comment is
> correctly saying that it removes the segments "on a timeline older
> than the new one". I agree about segment comparison.
>

Yeah my complaint is about the comment on segment comparison for removal.


>
> So, if I changed that comment, I would finish with the following change.
>
> -  * switching to, but with a segment number >= the first segment
> on
> +  * switching to, but with a segment number greater than the
> first segment on
>

which looks correct to me and will inline it with the code.


>
> That disagreement started at the time the code was introduced by
> b2a5545bd6.  Leaving the last segment in the old timeline is correct
> since it is renamed to .partial later.  If timeline switch happened
> just at segment boundary, that segment would not be created.
>

Yeah, that's why we keep the last segment (partially written) from the old
timeline, which means we're not deleting it here. So the comment should not
be saying that we are also removing the last wal segment from the old
timeline which is equal to the first segment from the new timeline.

--
With Regards,
Ashutosh Sharma.


Correct comment in RemoveNonParentXlogFiles()

2022-08-03 Thread Ashutosh Sharma
Hi All,

Following comment in RemoveNonParentXlogFiles() says that we are trying to
remove any WAL file whose segment number is >= the segment number of the
first WAL file on the new timeline. However, looking at the code, I can say
that we are trying to remove the WAL files from the previous timeline whose
segment number is just greater than (not equal to) the segment number of
the first WAL file in the new timeline. I think we should improve this
comment, thoughts?


/*
 * Remove files that are on a timeline older than the new one we're
 * switching to, but with a segment number >= the first segment on
the
 * new timeline.
 */
if (strncmp(xlde->d_name, switchseg, 8) < 0 &&
strcmp(xlde->d_name + 8, switchseg + 8) > 0)

--
With Regards,
Ashutosh Sharma.


Re: making relfilenodes 56 bits

2022-07-29 Thread Ashutosh Sharma
On Fri, Jul 29, 2022 at 6:26 PM Ashutosh Sharma 
wrote:

> On Thu, Jul 28, 2022 at 5:02 PM Dilip Kumar  wrote:
>
> +/* --
> + * RelFileNumber zero is InvalidRelFileNumber.
> + *
> + * For the system tables (OID < FirstNormalObjectId) the initial storage
>
> Above comment says that RelFileNumber zero is invalid which is technically
> correct because we don't have any relation file in disk with zero number.
> But the point is that if someone reads below definition of
> CHECK_RELFILENUMBER_RANGE he/she might get confused because as per this
> definition relfilenumber zero is valid.
>

Please ignore the above comment shared in my previous email. It was a
bit of over-thinking on my part. Sorry for that. Here are the other
comments I have:

+/* First we have to remove them from the extension */
+ALTER EXTENSION pg_buffercache DROP VIEW pg_buffercache;
+ALTER EXTENSION pg_buffercache DROP FUNCTION pg_buffercache_pages();
+
+/* Then we can drop them */
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
+
+/* Now redefine */
+CREATE OR REPLACE FUNCTION pg_buffercache_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages_v1_4'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE OR REPLACE VIEW pg_buffercache AS
+   SELECT P.* FROM pg_buffercache_pages() AS P
+   (bufferid integer, relfilenode int8, reltablespace oid, reldatabase oid,
+relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+pinning_backends int4);

As we are dropping the function and view I think it would be good if we
*don't* use the "OR REPLACE" keyword when re-defining them.

==

+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("relfilenode " INT64_FORMAT " is too large to be represented as an OID",
+   fctx->record[i].relfilenumber),
+errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE")));

I think it would be good to recommend users to upgrade to the latest
version instead of just saying upgrade the pg_buffercache using ALTER
EXTENSION 

==

--- a/contrib/pg_walinspect/sql/pg_walinspect.sql
+++ b/contrib/pg_walinspect/sql/pg_walinspect.sql
@@ -39,10 +39,10 @@ SELECT COUNT(*) >= 0 AS ok FROM
pg_get_wal_stats_till_end_of_wal(:'wal_lsn1');
 -- Test for filtering out WAL records of a particular table
 -- ===

-SELECT oid AS sample_tbl_oid FROM pg_class WHERE relname = 'sample_tbl'
\gset
+SELECT relfilenode AS sample_tbl_relfilenode FROM pg_class WHERE relname =
'sample_tbl' \gset

Is this change required? The original query is just trying to fetch table
oid not relfilenode and AFAIK we haven't changed anything in table oid.

==

+#define CHECK_RELFILENUMBER_RANGE(relfilenumber)   \
+do {   \
+   if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+   ereport(ERROR,  \
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),  \
+errmsg("relfilenumber " INT64_FORMAT " is out of range", \
+   (relfilenumber))); \
+} while (0)
+

I think we can shift this macro to some header file and reuse it at several
places.

==


+* Generate a new relfilenumber.  We cannot reuse the old relfilenumber
+* because of the possibility that that relation will be moved back to
the

that that relation -> that relation

--
With Regards,
Ashutosh Sharma.


Re: making relfilenodes 56 bits

2022-07-29 Thread Ashutosh Sharma
On Thu, Jul 28, 2022 at 5:02 PM Dilip Kumar  wrote:

+/* --
+ * RelFileNumber zero is InvalidRelFileNumber.
+ *
+ * For the system tables (OID < FirstNormalObjectId) the initial storage

Above comment says that RelFileNumber zero is invalid which is technically
correct because we don't have any relation file in disk with zero number.
But the point is that if someone reads below definition of
CHECK_RELFILENUMBER_RANGE he/she might get confused because as per this
definition relfilenumber zero is valid.

+#define CHECK_RELFILENUMBER_RANGE(relfilenumber)   \
+do {   \
+   if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+   ereport(ERROR,  \
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),  \
+errmsg("relfilenumber " INT64_FORMAT " is out of range", \
+   (relfilenumber))); \
+} while (0)
+

+   RelFileNumber relfilenumber = PG_GETARG_INT64(0);
+   CHECK_RELFILENUMBER_RANGE(relfilenumber);

It seems like the relfilenumber in the above definition represents the
relfilenode value in pg_class, which can hold a zero value, which
actually means it's a mapped relation. I think it would be good to
provide some clarity here.

--
With Regards,
Ashutosh Sharma.


Re: making relfilenodes 56 bits

2022-07-27 Thread Ashutosh Sharma
Some more comments:

==

Shouldn't we retry for a new relfilenumber if
"ShmemVariableCache->nextRelFileNumber > MAX_RELFILENUMBER"? There can
be cases where some of the tables are dropped by the user and the
relfilenumbers of those tables can be reused, for which we would need
to find a relfilenumber that can be reused. For e.g. consider the
below example:

postgres=# create table t1(a int);
CREATE TABLE

postgres=# create table t2(a int);
CREATE TABLE

postgres=# create table t3(a int);
ERROR:  relfilenumber is out of bound

postgres=# drop table t1, t2;
DROP TABLE

postgres=# checkpoint;
CHECKPOINT

postgres=# vacuum;
VACUUM

Now if I try to recreate table t3, it should succeed, shouldn't it? But it
doesn't because we simply error out by seeing the nextRelFileNumber saved
in the shared memory.

postgres=# create table t1(a int);
ERROR:  relfilenumber is out of bound

I think, above should have worked.

==



Note that while a table's filenode often matches its OID, this is
not necessarily the case; some operations, like
TRUNCATE, REINDEX,
CLUSTER and some forms
of ALTER TABLE, can change the filenode while preserving
the OID.

I think this note in storage.sgml needs some improvement. It says the
table's relfilenode mostly matches its OID, but it doesn't. This will
happen only in the case of system tables and maybe never in the case of
user tables.
==

postgres=# create table t2(a int);
ERROR:  relfilenumber is out of bound

Since this is a user-visible error, I think it would be good to mention
relfilenode instead of relfilenumber. Elsewhere (including the user manual)
we refer to this as a relfilenode.

--
With Regards,
Ashutosh Sharma.

On Tue, Jul 26, 2022 at 10:36 PM Ashutosh Sharma 
wrote:

> Thanks Dilip. Here are a few comments that I could find upon quickly
> reviewing the v11 patch:
>
>  /*
> + * Similar to the XLogPutNextOid but instead of writing NEXTOID log
> record it
> + * writes a NEXT_RELFILENUMBER log record.  If '*prevrecptr' is a valid
> + * XLogRecPtrthen flush the wal upto this record pointer otherwise flush
> upto
>
> XLogRecPtrthen -> XLogRecPtr then
>
> ==
>
> +   switch (relpersistence)
> +   {
> +   case RELPERSISTENCE_TEMP:
> +   backend = BackendIdForTempRelations();
> +   break;
> +   case RELPERSISTENCE_UNLOGGED:
> +   case RELPERSISTENCE_PERMANENT:
> +   backend = InvalidBackendId;
> +   break;
> +   default:
> +   elog(ERROR, "invalid relpersistence: %c", relpersistence);
> +   return InvalidRelFileNumber;/* placate compiler */
> +   }
>
>
> I think the above check should be added at the beginning of the function
> for the reason that if we come to the default switch case we won't be
> acquiring the lwlock and do other stuff to get a new relfilenumber.
>
> ==
>
> -   newrelfilenumber = GetNewRelFileNumber(newTableSpace, NULL,
> +* Generate a new relfilenumber.  We cannot reuse the old relfilenumber
> +* because of the possibility that that relation will be moved back to
> the
>
> that that relation -> that relation.
>
> ==
>
> + * option_parse_relfilenumber
> + *
> + * Parse relfilenumber value for an option.  If the parsing is successful,
> + * returns; if parsing fails, returns false.
> + */
>
> If parsing is successful, returns true;
>
> --
> With Regards,
> Ashutosh Sharma.
>
> On Tue, Jul 26, 2022 at 7:33 PM Dilip Kumar  wrote:
>
>> On Tue, Jul 26, 2022 at 6:06 PM Ashutosh Sharma 
>> wrote:
>>
>> Hi,
>> Note: please avoid top posting.
>>
>> > /*
>> >  * If relfilenumber is unspecified by the caller then create
>> storage
>> > -* with oid same as relid.
>> > +* with relfilenumber same as relid if it is a system table
>> otherwise
>> > +* allocate a new relfilenumber.  For more details read
>> comments atop
>> > +* FirstNormalRelFileNumber declaration.
>> >  */
>> > if (!RelFileNumberIsValid(relfilenumber))
>> > -   relfilenumber = relid;
>> > +   {
>> > +   relfilenumber = relid < FirstNormalObjectId ?
>> > +   relid : GetNewRelFileNumber();
>> >
>> > Above code says that in the case of system table we want relfilenode to
>> be the same as object id. This technically means that the relfilenode or
>> oid for the system tables would not be exceeding 16383. However in the
>> below lines of code added in the patch, it says there is some chance for
>> the storage path of the user tables from the old cluster conflicting with
>> the storage path of the sy

Re: making relfilenodes 56 bits

2022-07-26 Thread Ashutosh Sharma
Thanks Dilip. Here are a few comments that I could find upon quickly
reviewing the v11 patch:

 /*
+ * Similar to the XLogPutNextOid but instead of writing NEXTOID log record
it
+ * writes a NEXT_RELFILENUMBER log record.  If '*prevrecptr' is a valid
+ * XLogRecPtrthen flush the wal upto this record pointer otherwise flush
upto

XLogRecPtrthen -> XLogRecPtr then

==

+   switch (relpersistence)
+   {
+   case RELPERSISTENCE_TEMP:
+   backend = BackendIdForTempRelations();
+   break;
+   case RELPERSISTENCE_UNLOGGED:
+   case RELPERSISTENCE_PERMANENT:
+   backend = InvalidBackendId;
+   break;
+   default:
+   elog(ERROR, "invalid relpersistence: %c", relpersistence);
+   return InvalidRelFileNumber;/* placate compiler */
+   }


I think the above check should be added at the beginning of the function
for the reason that if we come to the default switch case we won't be
acquiring the lwlock and do other stuff to get a new relfilenumber.

==

-   newrelfilenumber = GetNewRelFileNumber(newTableSpace, NULL,
+* Generate a new relfilenumber.  We cannot reuse the old relfilenumber
+* because of the possibility that that relation will be moved back to
the

that that relation -> that relation.

==

+ * option_parse_relfilenumber
+ *
+ * Parse relfilenumber value for an option.  If the parsing is successful,
+ * returns; if parsing fails, returns false.
+ */

If parsing is successful, returns true;

--
With Regards,
Ashutosh Sharma.

On Tue, Jul 26, 2022 at 7:33 PM Dilip Kumar  wrote:

> On Tue, Jul 26, 2022 at 6:06 PM Ashutosh Sharma 
> wrote:
>
> Hi,
> Note: please avoid top posting.
>
> > /*
> >  * If relfilenumber is unspecified by the caller then create
> storage
> > -* with oid same as relid.
> > +* with relfilenumber same as relid if it is a system table
> otherwise
> > +* allocate a new relfilenumber.  For more details read comments
> atop
> > +* FirstNormalRelFileNumber declaration.
> >  */
> > if (!RelFileNumberIsValid(relfilenumber))
> > -   relfilenumber = relid;
> > +   {
> > +   relfilenumber = relid < FirstNormalObjectId ?
> > +   relid : GetNewRelFileNumber();
> >
> > Above code says that in the case of system table we want relfilenode to
> be the same as object id. This technically means that the relfilenode or
> oid for the system tables would not be exceeding 16383. However in the
> below lines of code added in the patch, it says there is some chance for
> the storage path of the user tables from the old cluster conflicting with
> the storage path of the system tables in the new cluster. Assuming that the
> OIDs for the user tables on the old cluster would start with 16384 (the
> first object ID), I see no reason why there would be a conflict.
>
>
> Basically, the above comment says that the initial system table
> storage will be created with the same relfilenumber as Oid so you are
> right that will not exceed 16383.  And below code is explaining the
> reason that in order to avoid the conflict with the user table from
> the older cluster we do it this way.  Otherwise, in the new design, we
> have no intention to keep the relfilenode same as Oid.  But during an
> upgrade from the older cluster which is not following this new design
> might have user table relfilenode which can conflict with the system
> table in the new cluster so we have to ensure that with the new design
> also when creating the initial cluster we keep the system table
> relfilenode in low range and directly using Oid is the best idea for
> this purpose instead of defining the completely new range and
> maintaining a separate counter for that.
>
> > +/* --
> > + * RelFileNumber zero is InvalidRelFileNumber.
> > + *
> > + * For the system tables (OID < FirstNormalObjectId) the initial storage
> > + * will be created with the relfilenumber same as their oid.  And,
> later for
> > + * any storage the relfilenumber allocated by GetNewRelFileNumber()
> will start
> > + * at 10.  Thus, when upgrading from an older cluster, the relation
> storage
> > + * path for the user table from the old cluster will not conflict with
> the
> > + * relation storage path for the system table from the new cluster.
> Anyway,
> > + * the new cluster must not have any user tables while upgrading, so we
> needn't
> > + * worry about them.
> > + * --
> > + */
> > +#define FirstNormalRelFileNumber   ((RelFileNumber) 10)
> >
> > ==
> >
> > When WAL logging the next object id we have the chose

Re: making relfilenodes 56 bits

2022-07-26 Thread Ashutosh Sharma
/*
 * If relfilenumber is unspecified by the caller then create storage
-* with oid same as relid.
+* with relfilenumber same as relid if it is a system table
otherwise
+* allocate a new relfilenumber.  For more details read comments
atop
+* FirstNormalRelFileNumber declaration.
 */
if (!RelFileNumberIsValid(relfilenumber))
-   relfilenumber = relid;
+   {
+   relfilenumber = relid < FirstNormalObjectId ?
+   relid : GetNewRelFileNumber();

Above code says that in the case of system table we want relfilenode to be
the same as object id. This technically means that the relfilenode or oid
for the system tables would not be exceeding 16383. However in the below
lines of code added in the patch, it says there is some chance for the
storage path of the user tables from the old cluster conflicting with the
storage path of the system tables in the new cluster. Assuming that the
OIDs for the user tables on the old cluster would start with 16384 (the
first object ID), I see no reason why there would be a conflict.

+/* --
+ * RelFileNumber zero is InvalidRelFileNumber.
+ *
+ * For the system tables (OID < FirstNormalObjectId) the initial storage
+ * will be created with the relfilenumber same as their oid.  And, later
for
+ * any storage the relfilenumber allocated by GetNewRelFileNumber() will
start
+ * at 10.  Thus, when upgrading from an older cluster, the relation
storage
+ * path for the user table from the old cluster will not conflict with the
+ * relation storage path for the system table from the new cluster.
Anyway,
+ * the new cluster must not have any user tables while upgrading, so we
needn't
+ * worry about them.
+ * --
+ */
+#define FirstNormalRelFileNumber   ((RelFileNumber) 10)

==

When WAL logging the next object id, we have chosen the xlog threshold
value of 8192, whereas for relfilenumber it is 512. Any reason for
choosing this low arbitrary value in the case of relfilenumber?

--
With Regards,
Ashutosh Sharma.

On Tue, Jul 26, 2022 at 1:32 PM Dilip Kumar  wrote:

> On Thu, Jul 21, 2022 at 9:53 AM Thomas Munro 
> wrote:
> >
> > On Wed, Jul 20, 2022 at 11:27 PM Dilip Kumar 
> wrote:
> > > [v10 patch set]
> >
> > Hi Dilip, I'm experimenting with these patches and will hopefully have
> > more to say soon, but I just wanted to point out that this builds with
> > warnings and failed on 3/4 of the CI OSes on cfbot's last run.  Maybe
> > there is the good kind of uninitialised data on Linux, and the bad
> > kind of uninitialised data on those other pesky systems?
>
> Here is the patch to fix the issue, basically, while asserting for the
> file existence it was not setting the relfilenumber in the
> relfilelocator before generating the path so it was just checking for
> the existence of the random path so it was asserting randomly.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


Re: making relfilenodes 56 bits

2022-07-25 Thread Ashutosh Sharma
Hi,

As oid and relfilenumber are linked with each other, I still see that if
the oid value reaches the threshold limit, we are unable to create a table
with storage. For example, I set FirstNormalObjectId to 4294967294 (one
value less than the range limit of 2^32 - 1 = 4294967295). Now when I try
to create a table, the CREATE TABLE command gets stuck because it is
unable to find an OID for the composite type, although it can find a new
relfilenumber.

postgres=# create table t1(a int);
CREATE TABLE

postgres=# select oid, reltype, relfilenode from pg_class where relname =
't1';
oid |  reltype   | relfilenode
++-
 4294967295 | 4294967294 |  10
(1 row)

postgres=# create table t2(a int);
^CCancel request sent
ERROR:  canceling statement due to user request

creation of t2 table gets stuck as it is unable to find a new oid.
Basically, the point that I am trying to make here is that even though
we will be able to find a new relfilenumber by increasing the
relfilenumber size, commands like the above will still not execute if
the oid value (of 32 bits) has reached the threshold limit.

--
With Regards,
Ashutosh Sharma.



On Wed, Jul 20, 2022 at 4:57 PM Dilip Kumar  wrote:

> On Mon, Jul 18, 2022 at 4:51 PM Dilip Kumar  wrote:
> >
> > I was doing some more testing by setting the FirstNormalRelFileNumber
> > to a high value(more than 32 bits) I have noticed a couple of problems
> > there e.g. relpath is still using OIDCHARS macro which says max
> > relfilenumber file name can be only 10 character long which is no
> > longer true.  So there we need to change this value to 20 and also
> > need to carefully rename the macros and other variable names used for
> > this purpose.
> >
> > Similarly there was some issue in macro in buf_internal.h while
> > fetching the relfilenumber.  So I will relook into all those issues
> > and repost the patch soon.
>
> I have fixed these existing issues and there was also some issue in
> pg_dump.c which was creating problems in upgrading to the same version
> while using a higher range of the relfilenumber.
>
> There was also an issue where the user table from the old cluster's
> relfilenode could conflict with the system table of the new cluster.
> As a solution currently for system table object (while creating
> storage first time) we are keeping the low range of relfilenumber,
> basically we are using the same relfilenumber as OID so that during
> upgrade the normal user table from the old cluster will not conflict
> with the system tables in the new cluster.  But with this solution
> Robert told me (in off list chat) a problem that in future if we want
> to make relfilenumber completely unique within a cluster by
> implementing the CREATEDB differently then we can not do that as we
> have created fixed relfilenodes for the system tables.
>
> I am not sure what exactly we can do to avoid that because even if we
> do something  to avoid that in the new cluster the old cluster might
> be already using the non-unique relfilenode so after upgrading the new
> cluster will also get those non-unique relfilenode.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


Re: How about renaming XLogFileRead() to XLogFileOpenForRead() and XLogFileOpen() to XLogFileOpenForWrite()?

2022-05-10 Thread Ashutosh Sharma
On Tue, May 10, 2022 at 6:46 PM Bharath Rupireddy
 wrote:
>
> On Tue, May 10, 2022 at 6:16 PM Ashutosh Sharma  wrote:
> >
> > Hi All,
> >
> > Currently, in postgres we have two different functions that are
> > specially used to open the WAL files for reading and writing purposes.
> > The first one is XLogFileOpen() that is just used to open the WAL file
> > so that we can write WAL data in it. And then we have another function
> > named XLogFileRead that does the same thing but is used when reading
> > the WAL files during recovery time. How about renaming the function
> > XLogFileRead to XLogFileOpenForRead and the other one can be renamed
> > to XLogFileOpenForWrite. I think it will make the function name more
> > clear and increase the readability. At least XlogFileRead doesn't look
> > good to me, from the function name it actually appears like we are
> > trying to read a WAL file here but actually we are opening it so that
> > it can be read by some other routine.
> >
> > Also I see that we are passing emode to the XLogFileRead function
> > which is not being used anywhere in the function, so can we remove it?
>
> Renaming XLogFileOpen to XLogFileOpenForWrite while it uses O_RDWR,
> not O_WRONLY, is sort of conflicting. Also, I'm concerned that
> XLogFileOpen is an extern function, the external modules using it
> might break.

Why would the external modules open WAL files to perform write
operations? AFAIU from the code, this function is specifically written
to open WAL files during WAL write operations. So I don't see any
problem in renaming it to XLogFileOpenForWrite(). In fact, as I said,
it does increase the readability. And likewise, XLogFileRead can be
renamed to XLogFileOpenForRead, which seems you are okay with.

--
With Regards,
Ashutosh Sharma.




How about renaming XLogFileRead() to XLogFileOpenForRead() and XLogFileOpen() to XLogFileOpenForWrite()?

2022-05-10 Thread Ashutosh Sharma
Hi All,

Currently, in postgres we have two different functions that are
specially used to open the WAL files for reading and writing purposes.
The first one is XLogFileOpen() that is just used to open the WAL file
so that we can write WAL data in it. And then we have another function
named XLogFileRead that does the same thing but is used when reading
the WAL files during recovery time. How about renaming the function
XLogFileRead to XLogFileOpenForRead and the other one can be renamed
to XLogFileOpenForWrite. I think it will make the function name more
clear and increase the readability. At least XLogFileRead doesn't look
good to me: from the function name it appears as though we are
reading a WAL file here, but in fact we are opening it so that
it can be read by some other routine.

Also I see that we are passing emode to the XLogFileRead function
which is not being used anywhere in the function, so can we remove it?

--
With Regards,
Ashutosh Sharma.




Re: orafce: some of the build time generated files are not present in .gitignore and also checked into the repository

2022-04-26 Thread Ashutosh Sharma
Sure, I'll do that.

Thanks,
Ashutosh

On Tue, Apr 26, 2022 at 5:51 PM Daniel Gustafsson  wrote:
>
> > On 26 Apr 2022, at 14:19, Ashutosh Sharma  wrote:
>
> > The below files in orafce contrib module are generated at build time.
> > However, these are checked into the repository. Shouldn't these files
> > be removed from the repository and added to the .gitignore file so
> > that they get ignored in the future commits.
>
> You should probably take this to the orafce project instead, possibly
> with a PR against their repo?
>
> --
> Daniel Gustafsson   https://vmware.com/
>




Re: double inclusion of '-p' flex flag

2022-04-26 Thread Ashutosh Sharma
On Tue, Apr 26, 2022 at 5:55 PM John Naylor
 wrote:
>
> On Tue, Apr 26, 2022 at 7:16 PM Ashutosh Sharma  wrote:
> >
> > Hi,
> >
> > I see that we have included '-p' flex flag twice in the commands used
> > to generate the scanner files. See below:
> >
> > src/backend/parser/Makefile:60:scan.c: FLEXFLAGS = -CF -p -p
> > src/backend/utils/adt/Makefile:122:jsonpath_scan.c: FLEXFLAGS = -CF -p -p
> > src/bin/psql/Makefile:61:   psqlscanslash.c: FLEXFLAGS = -Cfe -p -p
> > src/fe_utils/Makefile:43:psqlscan.c: FLEXFLAGS = -Cfe -p -p
> > src/backend/utils/adt/Makefile:122:jsonpath_scan.c: FLEXFLAGS = -CF -p -p
> > src/bin/psql/Makefile:61:psqlscanslash.c: FLEXFLAGS = -Cfe -p -p
> >
> > Do we need this or can the extra -p flag be removed?
>
> From the Flex manual:
>
> "generates a performance report to stderr. The report consists of
> comments regarding features of the flex input file which will cause a
> serious loss of performance in the resulting scanner. If you give the
> flag twice, you will also get comments regarding features that lead to
> minor performance losses."
>

Ahh. I see. This information is missing in the man page. thanks.!

--
With Regards,
Ashutosh Sharma.




orafce: some of the build time generated files are not present in .gitignore and also checked into the repository

2022-04-26 Thread Ashutosh Sharma
Hi,

The below files in orafce contrib module are generated at build time.
However, these are checked into the repository. Shouldn't these files
be removed from the repository and added to the .gitignore file so
that they get ignored in the future commits.

sqlparse.c
sqlscan.c
sqlparse.h

--
With Regards,
Ashutosh Sharma.




double inclusion of '-p' flex flag

2022-04-26 Thread Ashutosh Sharma
Hi,

I see that we have included '-p' flex flag twice in the commands used
to generate the scanner files. See below:

src/backend/parser/Makefile:60:scan.c: FLEXFLAGS = -CF -p -p
src/backend/utils/adt/Makefile:122:jsonpath_scan.c: FLEXFLAGS = -CF -p -p
src/bin/psql/Makefile:61:   psqlscanslash.c: FLEXFLAGS = -Cfe -p -p
src/fe_utils/Makefile:43:psqlscan.c: FLEXFLAGS = -Cfe -p -p
src/backend/utils/adt/Makefile:122:jsonpath_scan.c: FLEXFLAGS = -CF -p -p
src/bin/psql/Makefile:61:psqlscanslash.c: FLEXFLAGS = -Cfe -p -p

Do we need this or can the extra -p flag be removed?

--
With Regards,
Ashutosh Sharma.




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-16 Thread Ashutosh Sharma
I can see that the pg_get_wal_records_info function shows the details
of a WAL record that extends beyond the user-specified
stop/end lsn pointer. See below:

ashu@postgres=# select * from pg_get_wal_records_info('0/0128',
'0/0129');
-[ RECORD 1 ]----+--------------------------------------------------------------
start_lsn        | 0/128
end_lsn          | 0/19F
prev_lsn         | 0/0
xid              | 0
resource_manager | XLOG
record_length    | 114
fpi_length       | 0
description      | CHECKPOINT_SHUTDOWN redo 0/128; tli 1; prev tli 1; fpw true; xid 0:3; oid 1; multi 1; offset 0; oldest xid 3 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
block_ref        |
data_length      | 88
data             | \x280101000100010003001027010003000100010001007255a5c43162ff7f

In this case, the end lsn pointer specified by the user is
'0/0129'. There is only one WAL record which starts before this
specified end lsn pointer whose start pointer is at 0128, but that
WAL record ends at 0/19F which is way beyond the specified end
lsn. So, how come we are able to display the complete WAL record info?
AFAIU, end lsn is the lsn pointer where you need to stop reading the
WAL data. If that is true, then there exists no valid WAL record
between the start and end lsn in this particular case.

--
With Regards,
Ashutosh Sharma.

On Wed, Mar 16, 2022 at 7:56 PM Stephen Frost  wrote:
>
> Greetings,
>
> * Bharath Rupireddy (bharath.rupireddyforpostg...@gmail.com) wrote:
> > On Tue, Mar 15, 2022 at 7:21 AM Bharath Rupireddy
> >  wrote:
> > >
> > > On Mon, Mar 14, 2022 at 8:25 PM Stephen Frost  wrote:
> > > >
> > > > > As this patch is currently written, pg_monitor has access these
> > > > > functions, though I don't think that's the right privilege level at
> > > > > least for pg_get_raw_wal_record().
> > > >
> > > > Yeah, I agree that pg_monitor isn't the right thing for such a function
> > > > to be checking.
> > >
> > > On Thu, Mar 10, 2022 at 1:52 PM Jeff Davis  wrote:
> > > >
> > > > * pg_get_raw_wal_record() seems too powerful for pg_monitor. Maybe that
> > > > function should require pg_read_server_files? Or at least
> > > > pg_read_all_data?
> > >
> > > The v9 patch set posted at [1] grants execution on
> > > pg_get_raw_wal_record() to the pg_monitor role.
> > >
> > > pg_read_all_data may not be the right choice, but pg_read_server_files
> > > is as these functions do read the WAL files on the server. If okay,
> > > I'm happy to grant execution on pg_get_raw_wal_record() to the
> > > pg_read_server_files role.
> > >
> > > Thoughts?
> > >
> > > [1] 
> > > https://www.postgresql.org/message-id/CALj2ACVRH-z8mZLyFkpLvY4qRhxQCqU_BLkFTtwt%2BTPZNhfEVg%40mail.gmail.com
> >
> > Attaching v10 patch set which allows pg_get_raw_wal_record to be
> > executed by either superuser or users with pg_read_server_files role,
> > no other change from v9 patch set.
>
> In a quick look, that seems reasonable to me.  If folks want to give out
> access to this function individually they're also able to do so, which
> is good.  Doesn't seem worthwhile to introduce a new predefined role for
> this one function.
>
> Thanks,
>
> Stephen




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-15 Thread Ashutosh Sharma
On Tue, Mar 15, 2022 at 10:17 PM Robert Haas  wrote:
>
> On Tue, Mar 15, 2022 at 12:30 PM Ashutosh Sharma  
> wrote:
> >
> > Few comments on the latest patch:
> >
> > -   /* We need to construct the pathname for this database */
> > -   dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
> > +   if (xlrec->dbid != InvalidOid)
> > +   dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
> > +   else
> > +   dbpath = pstrdup("global");
> >
> > Do we really need this change? Is GetDatabasePath() alone not capable
> > of handling it?
>
> Well, I mean, that function has a special case for
> GLOBALTABLESPACE_OID, but GLOBALTABLESPACE_OID is 1664, and InvalidOid
> is 0.
>

Wouldn't this be true only in the case of a shared map file (when dbOid
is InvalidOid and tblspcOid is GLOBALTABLESPACE_OID), or am I missing
something?

--
With Regards,
Ashutosh Sharma.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-15 Thread Ashutosh Sharma
Few comments on the latest patch:

-   /* We need to construct the pathname for this database */
-   dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+   if (xlrec->dbid != InvalidOid)
+   dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+   else
+   dbpath = pstrdup("global");

Do we really need this change? Is GetDatabasePath() alone not capable
of handling it?

==

+static CreateDBRelInfo *ScanSourceDatabasePgClassTuple(HeapTupleData *tuple,
+                                                       Oid tbid, Oid dbid,
+                                                       char *srcpath);
+static List *ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid,
+                                           Oid dbid, char *srcpath,
+                                           List *rnodelist, Snapshot snapshot);
+static List *ScanSourceDatabasePgClass(Oid srctbid, Oid srcdbid, char *srcpath);

I think we can shorten these function names to probably
ScanSourceDBPgClassRel(), ScanSourceDBPgClassTuple() and likewise?

--
With Regards,
Ashutosh Sharma.

On Tue, Mar 15, 2022 at 3:24 PM Dilip Kumar  wrote:
>
> On Mon, Mar 14, 2022 at 10:04 PM Robert Haas  wrote:
>
> > I think it would make sense to have two different WAL records e.g.
> > XLOG_DBASE_CREATE_WAL_LOG and XLOG_DBASE_CREATE_FILE_COPY. Then it's
> > easy to see how this could be generalized to other strategies in the
> > future.
>
> Done that way.  In dbase_desc(), for XLOG_DBASE_CREATE_FILE_COPY I
> have kept the older description i.e. "copy dir" and for
> XLOG_DBASE_CREATE_WAL_LOG it is "create dir", because logically the
> first one is actually copying and the second one is just creating the
> directory.  Do you think we should be using "copy dir file_copy" and
> "copy dir wal_log" in the description as well?
>
> > On Mon, Mar 14, 2022 at 12:04 PM Robert Haas  wrote:
> > > I was looking at 0001 and 0002 again and realized that I swapped the
> > > names load_relmap_file() and read_relmap_file() from what I should
> > > have done. Here's a revised version. With this, read_relmap_file() and
> > > write_relmap_file() become functions that just read and write the file
> > > without touching any global variables, and load_relmap_file() is the
> > > function that reads data from the file and puts it into a global
> > > variable, which seems more sensible than the way I had it before.
>
> Okay, I have included this patch and rebased other patches on top of that.
>
> > Regarding 0003 and 0005, I'm not a fan of 'bool isunlogged'. I think
> > 'bool permanent' would be better (note BM_PERMANENT). This would
> > involve reversing true and false.
>
> Okay changed.
>
> > Regarding 0005:
> >
> > +   CREATEDB_WAL_LOG = 0,
> > +   CREATEDB_FILE_COPY = 1
> >
> > I still think you don't need = 0 and = 1 here.
>
> Done
>
> > I'll probably go through and do a pass over the comments once you post
> > the next version of this. There seems to be work needed in a bunch of
> > places, but it probably makes more sense for me to go through and
> > adjust the things that seem to need it rather than listing a bunch of
> > changes for you to make.
>
> Sure, thanks.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-11 Thread Ashutosh Sharma
On Fri, Mar 11, 2022 at 8:21 PM Robert Haas  wrote:
>
> On Fri, Mar 11, 2022 at 12:15 AM Ashutosh Sharma  
> wrote:
> > Looks better, but why do you want to pass elevel to the
> > load_relmap_file()? Are we calling this function from somewhere other
> > than read_relmap_file()? If not, do we have any plans to call this
> > function  directly bypassing read_relmap_file for any upcoming patch?
>
> If it fails during CREATE DATABASE, it should be ERROR, not FATAL. In
> that case, we only need to stop trying to create a database; we don't
> need to terminate the session. On the other hand if we can't read our
> own database's relmap files, that's an unrecoverable error, because we
> will not be able to run any queries at all, so FATAL is appropriate.
>

OK. I can see it being used in the v13 patch. In the previous patches
it was hard-coded with FATAL. Also, we simply error out when doing
file copy as I can see in the copy_file function. So yes FATAL is not
the right option to use when creating a database. Thanks.

--
With Regards,
Ashutosh Sharma.




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-11 Thread Ashutosh Sharma
Some comments on pg_walinspect-docc.patch this time:

+   
+
+ pg_get_wal_record_info(in_lsn pg_lsn, lsn OUT pg_lsn,
prev_lsn OUT pg_lsn, xid OUT xid, resource_manager OUT text, length
OUT int4, total_length OUT int4, description OUT text, block_ref OUT
text, data OUT bytea, data_len OUT int4)
+

You may shorten this by mentioning just the function input parameters
and specifying "returns record", as shown below, so there is no need to
list all the OUT params.

pg_get_wal_record_info(in_lsn pg_lsn) returns record.

Please check the documentation for other functions for reference.

==

+
+ pg_get_wal_records_info(start_lsn pg_lsn, end_lsn
pg_lsn DEFAULT NULL, lsn OUT pg_lsn, prev_lsn OUT pg_lsn, xid OUT xid,
resource_manager OUT text, length OUT int4, total_length OUT int4,
description OUT text, block_ref OUT text, data OUT bytea, data_len OUT
int4)
+

Same comment applies here as well. In the return type you can just
mention - "returns setof record" like shown below:

pg_get_wal_records_info(start_lsn pg_lsn, end_lsn pg_lsn) returns setof record.

You may also check for such optimizations at other places. I might
have missed some.

==

+
+postgres=# select prev_lsn, xid, resource_manager, length,
total_length, block_ref from pg_get_wal_records_info('0/158A7F0',
'0/1591400');
+ prev_lsn  | xid | resource_manager | length | total_length | block_ref
+-----------+-----+------------------+--------+--------------+-------------------------------------------------------------------------
+ 0/158A7B8 | 735 | Heap             |     54 |         7838 | blkref #0: rel 1663/5/2619 blk 18 (FPW); hole: offset: 88, length: 408
+ 0/158A7F0 | 735 | Btree            |     53 |         8133 | blkref #0: rel 1663/5/2696 blk 1 (FPW); hole: offset: 1632, length: 112
+ 0/158C6A8 | 735 | Heap             |     53 |          873 | blkref #0: rel 1663/5/1259 blk 0 (FPW); hole: offset: 212, length: 7372

Instead of specifying column names in the targetlist I think it's
better to use "*" so that it will display all the output columns. Also
you may shorten the gap between start and end lsn to reduce the output
size.

==

Any reason for not specifying the author name in the .sgml file? Do you
want me to add my name to the author? :)

  
   Ashutosh Sharma ashu.coe...@gmail.com
  
 

--
With Regards,
Ashutosh Sharma.

On Thu, Mar 10, 2022 at 10:15 PM Bharath Rupireddy
 wrote:
>
> On Thu, Mar 10, 2022 at 1:52 PM Jeff Davis  wrote:
> >
> > On Wed, 2022-03-02 at 22:37 +0530, Bharath Rupireddy wrote:
> > >
> > > Attaching v6 patch set with above review comments addressed. Please
> > > review it further.
>
> Thanks Jeff for reviewing it. I've posted the latest v7 patch-set
> upthread [1] which is having more simple-yet-useful-and-effective
> functions.
>
> > * Don't issue WARNINGs or other messages for ordinary situations, like
> > when pg_get_wal_records_info() hits the end of WAL.
>
> v7 patch-set [1] has no warnings, but the functions will error out if
> future LSN is specified.
>
> > * It feels like the APIs that allow waiting for the end of WAL are
> > slightly off. Can't you just do pg_get_wal_records_info(start_lsn,
> > least(pg_current_wal_flush_lsn(), end_lsn)) if you want the non-waiting
> > behavior? Try to make the API more orthogonal, where a few basic
> > functions can be combined to give you everything you need, rather than
> > specifying extra parameters and issuing WARNINGs. I
>
> v7 patch-set [1] onwards waiting mode has been removed for all of the
> functions, again to keep things simple-yet-useful-and-effective.
> However, we can always add new pg_walinspect functions that wait for
> future WAL in the next versions once basic stuff gets committed and if
> many users ask for it.
>
> > * In the docs, include some example output. I don't see any output in
> > the tests, which makes sense because it's mostly non-deterministic, but
> > it would be helpful to see sample output of at least
> > pg_get_wal_records_info().
>
> +1. Added for pg_get_wal_records_info and pg_get_wal_stats.
>
> > * Is pg_get_wal_stats() even necessary, or can you get the same
> > information with a query over pg_get_wal_records_info()? For instance,
> > if you want to group by transaction ID rather than rmgr, then
> > pg_get_wal_stats() is useless.
>
> Yes, you are right pg_get_wal_stats provides WAL stats per resource
> manager which is similar to pg_waldump with --start, --end and --stats
> option. It provides more information than pg_get_wal_records_info and
> is a good way of getting stats than adding more columns to
> pg_get_wal_records_info, calculating percentage in sql and having
> group by clause. IMO, pg_get_wal_stats is more readable and useful.
>
>

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-11 Thread Ashutosh Sharma
You may also need to add documentation to app-createdb.sgml. Currently
you have just added to create_database.sgml. Also, I had a quick look
at the new changes done in v13-0005-WAL-logged-CREATE-DATABASE.patch
and they seemed fine to me although I haven't put much emphasis on the
comments and other cosmetic stuff.

--
With Regards,
Ashutosh Sharma.

On Fri, Mar 11, 2022 at 3:51 PM Dilip Kumar  wrote:
>
> On Fri, Mar 11, 2022 at 11:52 AM Dilip Kumar  wrote:
> >
> > On Thu, Mar 10, 2022 at 10:18 PM Robert Haas  wrote:
> > >
> > > On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar  wrote:
> > > > I have completely changed the logic for this refactoring.  Basically,
> > > > write_relmap_file(), is already having parameters to control whether
> > > > to write wal, send inval and we are already passing the dbpath.
> > > > Instead of making a new function I just pass one additional parameter
> > > > to this function itself about whether we are creating a new map or not
> > > > and I think with that changes are very less and this looks cleaner to
> > > > me.  Similarly for load_relmap_file() also I just had to pass the
> > > > dbpath and memory for destination map.  Please have a look and let me
> > > > know your thoughts.
> > >
> > > It's not terrible, but how about something like the attached instead?
> > > I think this has the effect of reducing the number of cases that the
> > > low-level code needs to know about from 2 to 1, instead of making it
> > > go up from 2 to 3.
> >
> > Yeah this looks cleaner, I will rebase the remaining patch.
>
> Here is the updated version of the patch set.
>
> Changes, 1) it take Robert's patch as first refactoring patch 2)
> Rebase other new relmapper apis on top of that in 0002 3) Some code
> refactoring in main patch 0005 and also one problem fix, earlier in
> wal log method I have removed ForceSyncCommit(), but IMHO that is
> equally valid whether we use file_copy or wal_log because in both
> cases we are creating the disk files.  4) Support strategy in createdb
> tool and add test case as part of 0006.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-11 Thread Ashutosh Sharma
On Fri, Mar 11, 2022 at 8:22 AM Kyotaro Horiguchi
 wrote:
>
> - The difference between pg_get_wal_record_info and _records_ other than
> - the number of argument is the former accepts incorrect LSNs.
>
> The discussion is somewhat confused after some twists and turns..  It
> should be something like the following.
>
> pg_get_wal_record_info and pg_get_wal_records_info are almost the same
> since the latter can show a single record.  However, it is a bit
> annoying to do that: besides not accepting the same LSN for start and
> end, it doesn't show a record when there's no record in the specified
> LSN range.  But I don't think the behavior is useless.
>

So, do you want the pg_get_wal_record_info function to be removed as
we can use pg_get_wal_records_info() to achieve what it does?

--
With Regards,
Ashutosh Sharma.




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-11 Thread Ashutosh Sharma
On Fri, Mar 11, 2022 at 8:22 AM Kyotaro Horiguchi
 wrote:
>
> Sorry, some minor non-syntactical corrections.
>
> At Fri, 11 Mar 2022 11:38:22 +0900 (JST), Kyotaro Horiguchi 
>  wrote in
> > I played with this a bit, and would like to share some thoughts on it.
> >
> > It seems to me too rigorous that pg_get_wal_records_info/stats()
> > reject future LSNs as end-LSN and I think WARNING or INFO and stop at
> > the real end-of-WAL is more kind to users.  I think the same with the
> > restriction that start and end LSN are required to be different.
> >
> > The definition of end-lsn is fuzzy here.  If I fed a future LSN to the
> > functions, they tell me the beginning of the current insertion point
> > in error message.  On the other hand they don't accept the same
> > value as end-LSN.  I think it is right that they tell the current
> > insertion point and they should take the end-LSN as the LSN to stop
> > reading.
> >
> > I think pg_get_wal_stats() is worth to have but I think it should be
> > implemented in SQL.  Currently pg_get_wal_records_info() doesn't tell
> > about FPI since pg_waldump doesn't but it is internally collected (of
> > course!) and easily revealed. If we do that, the
> > pg_get_wal_records_stats() would be reduced to the following SQL
> > statement
> >
> > SELECT resource_manager resmgr,
> >        count(*) AS N,
> >        (count(*) * 100 / sum(count(*)) OVER tot)::numeric(5,2) AS "%N",
> >        sum(total_length) AS "combined size",
> >        (sum(total_length) * 100 / sum(sum(total_length)) OVER tot)::numeric(5,2) AS "%combined size",
> >        sum(fpi_len) AS fpilen,
> >        (sum(fpi_len) * 100 / sum(sum(fpi_len)) OVER tot)::numeric(5,2) AS "%fpilen"
> >   FROM pg_get_wal_records_info('0/100', '0/175DD7f')
> >  GROUP BY resource_manager
> > WINDOW tot AS ()
> >  ORDER BY "combined size" DESC;
> >
> > The only difference with pg_waldump is the statement above doesn't
> > show lines for the resource managers that aren't contained in the
> > result of pg_get_wal_records_info(). But I don't think that matters.
> >
> >
> > Sometimes the field description has very long (28kb long) content. It
> > makes the result output almost unreadable and I had a bit hard time
> > struggling with the output full of '-'s.  I would like have a default
> > limit on the length of such fields that can be long but I'm not sure
> > we want that.
> >
> >
> - The difference between pg_get_wal_record_info and _records_ other than
> - the number of argument is the former accepts incorrect LSNs.
>
> The discussion is somewhat confused after some twists and turns..  It
> should be something like the following.
>
> pg_get_wal_record_info and pg_get_wal_records_info are almost same
> since the latter can show a single record.  However, it is a bit
> annoying to do that: besides not accepting the same LSN for start and
> end, it doesn't show a record when there's no record in the specified
> LSN range.  But I don't think the behavior is useless.
>
> The following works,
>  pg_get_wal_record_info('0/100');

This does work but it doesn't show any WARNING message for the start
pointer adjustment. I think it should.

>  pg_get_wal_records_info('0/100');
>

I think this is fine. It should be working because the user hasn't
specified the end pointer so we assume the default end pointer is
end-of-WAL.

> but this doesn't
>  pg_get_wal_records_info('0/100', '0/100');
> > ERROR:  WAL start LSN must be less than end LSN
>

I think this behaviour is fine. We cannot have the same start and end
lsn pointers.

> And the following shows no records.
>  pg_get_wal_records_info('0/100', '0/101');
>  pg_get_wal_records_info('0/100', '0/128');
>

I think we should error out here, saying something like "could not find
any valid WAL record between the given start and end LSN", since no
complete WAL record exists in that range.

--
With Regards,
Ashutosh Sharma.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-10 Thread Ashutosh Sharma
On Thu, Mar 10, 2022 at 10:18 PM Robert Haas  wrote:
>
> On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar  wrote:
> > I have completely changed the logic for this refactoring.  Basically,
> > write_relmap_file(), is already having parameters to control whether
> > to write wal, send inval and we are already passing the dbpath.
> > Instead of making a new function I just pass one additional parameter
> > to this function itself about whether we are creating a new map or not
> > and I think with that changes are very less and this looks cleaner to
> > me.  Similarly for load_relmap_file() also I just had to pass the
> > dbpath and memory for destination map.  Please have a look and let me
> > know your thoughts.
>
> It's not terrible, but how about something like the attached instead?
> I think this has the effect of reducing the number of cases that the
> low-level code needs to know about from 2 to 1, instead of making it
> go up from 2 to 3.
>

Looks better, but why do you want to pass elevel to the
load_relmap_file()? Are we calling this function from somewhere other
than read_relmap_file()? If not, do we have any plans to call this
function  directly bypassing read_relmap_file for any upcoming patch?

--
With Regards,
Ashutosh Sharma.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-10 Thread Ashutosh Sharma
Thanks Dilip for working on the review comments. I'll take a look at
the new version of patch and let you know my comments, if any.

--
With Regards,
Ashutosh Sharma.

On Thu, Mar 10, 2022 at 8:38 PM Dilip Kumar  wrote:
>
> On Thu, Mar 10, 2022 at 7:22 PM Ashutosh Sharma  wrote:
> >
> > Here are some review comments on the latest patch
> > (v11-0004-WAL-logged-CREATE-DATABASE.patch). I actually did the review
> > of the v10 patch but that applies to this latest version as well.
> >
> > +   /* Now errors are fatal ... */
> > +   START_CRIT_SECTION();
> >
> > Did you mean PANIC instead of FATAL?
>
> I think here "fatal" didn't really mean the error level FATAL; it
> means critical, and I have seen it used in other places also.  But I
> really don't think this comment needs to be removed to avoid any
> confusion.
>
> > ==
> >
> > +
> > (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +errmsg("invalid create
> > strategy %s", strategy),
> > +errhint("Valid strategies are
> > \"wal_log\", and \"file_copy\".")));
> > +   }
> >
> >
> > Should this be - "invalid createdb strategy" instead of "invalid
> > create strategy"?
>
> Changed
>
> > ==
> >
> > +   /*
> > +* In case of ALTER DATABASE SET TABLESPACE we don't need 
> > to do
> > +* anything for the object which are not in the source
> > db's default
> > +* tablespace.  The source and destination dboid will be 
> > same in
> > +* case of ALTER DATABASE SET TABLESPACE.
> > +*/
> > +   else if (src_dboid == dst_dboid)
> > +   continue;
> > +   else
> > +   dstrnode.spcNode = srcrnode.spcNode;
> >
> >
> > Is this change still required? Do we support the WAL_COPY strategy for
> > ALTER DATABASE?
>
> Yeah now it is unreachable code so removed.
>
> > ==
> >
> > +   /* Open the source and the destination relation at
> > smgr level. */
> > +   src_smgr = smgropen(srcrnode, InvalidBackendId);
> > +   dst_smgr = smgropen(dstrnode, InvalidBackendId);
> > +
> > +   /* Copy relation storage from source to the destination. */
> > +   CreateAndCopyRelationData(src_smgr, dst_smgr,
> > relinfo->relpersistence);
> >
> > Do we need to do smgropen for destination relfilenode here? Aren't we
> > already doing that inside RelationCreateStorage?
>
> Yeah I have changed the complete logic and removed the smgr_open for
> both source and destination and moved inside
> CreateAndCopyRelationData, please check the updated code.
>
> > ==
> >
> > +   use_wal = XLogIsNeeded() &&
> > +   (relpersistence == RELPERSISTENCE_PERMANENT ||
> > copying_initfork);
> > +
> > +   /* Get number of blocks in the source relation. */
> > +   nblocks = smgrnblocks(src, forkNum);
> >
> > What if number of blocks in a source relation is ZERO? Should we check
> > for that and return immediately. We have already done smgrcreate.
>
> Yeah make sense to optimize, with that we will not have to get the
> buffer strategy so done.
>
> > ==
> >
> > +   /* We don't need to copy the shared objects to the target. */
> > +   if (classForm->reltablespace == GLOBALTABLESPACE_OID)
> > +   return NULL;
> > +
> > +   /*
> > +* If the object doesn't have the storage then nothing to be
> > +* done for that object so just ignore it.
> > +*/
> > +   if (!RELKIND_HAS_STORAGE(classForm->relkind))
> > +   return NULL;
> >
> > We can probably club together above two if-checks.
>
> Done
>
> > ==
> >
> > +  
> > +   strategy
> > +   
> > +
> > + This is used for copying the database directory.  Currently, we 
> > have
> > + two strategies the WAL_LOG and the
> > + FILE_COPY.  If WAL_LOG 
> > strategy
> > + is used then the database will be copied block by block and it 
> > will
> > + also WAL log each copied block.  Otherwise, if FILE_COPY
> >
> > I think we need to mention the default strategy in the documentation page.
>
> Done
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-03-10 Thread Ashutosh Sharma
Here are some review comments on the latest patch
(v11-0004-WAL-logged-CREATE-DATABASE.patch). I actually did the review
of the v10 patch but that applies to this latest version as well.

+   /* Now errors are fatal ... */
+   START_CRIT_SECTION();

Did you mean PANIC instead of FATAL?

==

+
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("invalid create
strategy %s", strategy),
+errhint("Valid strategies are
\"wal_log\", and \"file_copy\".")));
+   }


Should this be - "invalid createdb strategy" instead of "invalid
create strategy"?

==

+   /*
+* In case of ALTER DATABASE SET TABLESPACE we don't need to do
+* anything for the object which are not in the source
db's default
+* tablespace.  The source and destination dboid will be same in
+* case of ALTER DATABASE SET TABLESPACE.
+*/
+   else if (src_dboid == dst_dboid)
+   continue;
+   else
+   dstrnode.spcNode = srcrnode.spcNode;


Is this change still required? Do we support the WAL_COPY strategy for
ALTER DATABASE?

==

+   /* Open the source and the destination relation at
smgr level. */
+   src_smgr = smgropen(srcrnode, InvalidBackendId);
+   dst_smgr = smgropen(dstrnode, InvalidBackendId);
+
+   /* Copy relation storage from source to the destination. */
+   CreateAndCopyRelationData(src_smgr, dst_smgr,
relinfo->relpersistence);

Do we need to do smgropen for destination relfilenode here? Aren't we
already doing that inside RelationCreateStorage?

==

+   use_wal = XLogIsNeeded() &&
+   (relpersistence == RELPERSISTENCE_PERMANENT ||
copying_initfork);
+
+   /* Get number of blocks in the source relation. */
+   nblocks = smgrnblocks(src, forkNum);

What if number of blocks in a source relation is ZERO? Should we check
for that and return immediately. We have already done smgrcreate.

==

+   /* We don't need to copy the shared objects to the target. */
+   if (classForm->reltablespace == GLOBALTABLESPACE_OID)
+   return NULL;
+
+   /*
+* If the object doesn't have the storage then nothing to be
+* done for that object so just ignore it.
+*/
+   if (!RELKIND_HAS_STORAGE(classForm->relkind))
+   return NULL;

We can probably club together above two if-checks.

==

+  
+   strategy
+   
+
+ This is used for copying the database directory.  Currently, we have
+ two strategies the WAL_LOG and the
+ FILE_COPY.  If WAL_LOG strategy
+ is used then the database will be copied block by block and it will
+ also WAL log each copied block.  Otherwise, if FILE_COPY

I think we need to mention the default strategy in the documentation page.

--
With Regards,
Ashutosh Sharma.

On Thu, Mar 10, 2022 at 4:32 PM Dilip Kumar  wrote:
>
> On Wed, Mar 9, 2022 at 6:44 PM Robert Haas  wrote:
> >
> > > Right, infact now also if you see the logic, the
> > > write_relmap_file_internal() is taking care of the actual update and
> > > the write_relmap_file() is doing update + setting the global
> > > variables.  So yeah we can rename as you suggested in 0001 and here
> > > also we can change the logic as you suggested that the recovery and
> > > createdb will only call the first function which is just doing the
> > > update.
> >
> > But I think we want the path construction to be managed by the
> > function rather than the caller, too.
>
> I have completely changed the logic for this refactoring.  Basically,
> write_relmap_file(), is already having parameters to control whether
> to write wal, send inval and we are already passing the dbpath.
> Instead of making a new function I just pass one additional parameter
> to this function itself about whether we are creating a new map or not
> and I think with that changes are very less and this looks cleaner to
> me.  Similarly for load_relmap_file() also I just had to pass the
> dbpath and memory for destination map.  Please have a look and let me
> know your thoughts.
>
> > Sure, I guess. It's just not obvious why the argument would actually
> > need to be a function that copies storage, or why there's more than
> > one way to copy storage. I'd rather keep all the code paths unified,
> > if we can, and set behavior via flags or something, maybe. I'm not
> > sure whether that's realistic, though.
>
> I try considering that, I think it doesn't look good to make it flag
> based,  One of the main problem I noticed is that now 

Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-03-09 Thread Ashutosh Sharma
On Tue, Mar 8, 2022 at 8:31 PM Nitin Jadhav
 wrote:
>
> > > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > > -[ RECORD 1 ]-+-
> > > pid   | 22043
> > > type  | checkpoint
> > > kind  | immediate force wait requested time
> > >
> > > I think the output in the kind column can be displayed as {immediate,
> > > force, wait, requested, time}. By the way these are all checkpoint
> > > flags so it is better to display it as checkpoint flags instead of
> > > checkpoint kind as mentioned in one of my previous comments.
> >
> > I will update in the next patch.
>
> The current format matches with the server log message for the
> checkpoint start in LogCheckpointStart(). Just to be consistent, I
> have not changed the code.
>

See below how flags are shown in other SQL functions, for example:

ashu@postgres=# select * from heap_tuple_infomask_flags(2304, 1);
raw_flags| combined_flags
-+
 {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
(1 row)

This looks more readable and is easier for end users to understand.
Further, comparing the way log messages are displayed with the way SQL
functions display their output doesn't look like the right comparison
to me. Obviously both should show matching data, but the way it is
shown doesn't need to be the same; in fact it is not in most cases.
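For what it's worth, producing that set-style output from the
checkpoint flags is cheap. A standalone sketch (the flag values mirror
the real ones, but the function itself is hypothetical):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CHECKPOINT_IMMEDIATE 0x0004
#define CHECKPOINT_FORCE     0x0008
#define CHECKPOINT_WAIT      0x0020

/* Render the set flags as "{immediate,force}" rather than
 * "immediate force ", matching heap_tuple_infomask_flags() style. */
static void
format_ckpt_flags(int flags, char *buf, size_t buflen)
{
    const char *names[] = {"immediate", "force", "wait"};
    const int   bits[]  = {CHECKPOINT_IMMEDIATE, CHECKPOINT_FORCE,
                           CHECKPOINT_WAIT};
    size_t      off = 0;

    off += snprintf(buf + off, buflen - off, "{");
    for (int i = 0; i < 3; i++)
        if (flags & bits[i])
            off += snprintf(buf + off, buflen - off, "%s%s",
                            buf[off - 1] == '{' ? "" : ",", names[i]);
    snprintf(buf + off, buflen - off, "}");
}
```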

> I have taken care of the rest of the comments in v5 patch for which
> there was clarity.
>

Thank you very much. Will take a look at it later.

--
With Regards,
Ashutosh Sharma.




Re: Synchronizing slots from primary to standby

2022-03-09 Thread Ashutosh Sharma
Hi,

I have spent a little time trying to understand the concern raised by
Andres, and while doing so I could think of a couple of issues which I
would like to share here, although I'm not quite sure how well these
line up with the problems seen by Andres.

1) Firstly, what if a failover occurs when the confirmed flush LSN has
been updated on the primary but is yet to be updated on the standby? I
believe this may very well happen, especially considering that the
standby sends SQL queries to the primary to synchronize the replication
slots at regular intervals. If the primary dies just after updating the
confirmed flush LSN of its logical subscribers, the standby may never
receive that update from the primary, which means we'll probably end up
with a broken logical replication slot on the new primary.

2) Secondly, if the standby goes down, the logical subscribers will
stop receiving new changes from the primary, as per the design of this
patch. Likewise, if the standby lags behind the primary for whatever
reason, that will directly impact the logical subscribers as well.

--
With Regards,
Ashutosh Sharma.

On Sat, Feb 19, 2022 at 3:53 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-02-11 15:28:19 +0100, Peter Eisentraut wrote:
> > On 05.02.22 20:59, Andres Freund wrote:
> > > On 2022-01-03 14:46:52 +0100, Peter Eisentraut wrote:
> > > >  From ec00dc6ab8bafefc00e9b1c78ac9348b643b8a87 Mon Sep 17 00:00:00 2001
> > > > From: Peter Eisentraut
> > > > Date: Mon, 3 Jan 2022 14:43:36 +0100
> > > > Subject: [PATCH v3] Synchronize logical replication slots from primary 
> > > > to
> > > >   standby
> > > I've just skimmed the patch and the related threads. As far as I can tell 
> > > this
> > > cannot be safely used without the conflict handling in [1], is that 
> > > correct?
> >
> > This or similar questions have been asked a few times about this or similar
> > patches, but they always come with some doubt.
>
> I'm certain it's a problem - the only reason I couched it was that there could
> have been something clever in the patch preventing problems that I missed
> because I just skimmed it.
>
>
> > If we think so, it would be
> > useful perhaps if we could come up with test cases that would demonstrate
> > why that other patch/feature is necessary.  (I'm not questioning it
> > personally, I'm just throwing out ideas here.)
>
> The patch as-is just breaks one of the fundamental guarantees necessary for
> logical decoding, that no rows versions can be removed that are still required
> for logical decoding (signalled via catalog_xmin). So there needs to be an
> explicit mechanism upholding that guarantee, but there is not right now from
> what I can see.
>
> One piece of the referenced patchset is that it adds information about removed
> catalog rows to a few WAL records, and then verifies during replay that no
> record can be replayed that removes resources that are still needed. If such a
> conflict exists it's dealt with as a recovery conflict.
>
> That itself doesn't provide prevention against removal of required, but it
> provides detection. The prevention against removal can then be done using a
> physical replication slot with hot standby feedback or some other mechanism
> (e.g. slot syncing mechanism could maintain a "placeholder" slot on the
> primary for all sync targets or something like that).
>
> Even if that infrastructure existed / was merged, the slot sync stuff would
> still need some very careful logic to protect against problems due to
> concurrent WAL replay and "synchronized slot" creation. But that's doable.
>
> Greetings,
>
> Andres Freund
>
>




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-04 Thread Ashutosh Sharma
Thanks Bharath for working on all my review comments. I took a quick
look at the new version of the patch (v7-pg_walinspect.patch) and this
version looks a lot better. I'll do some detailed review later (maybe
next week or so) and let you know my further comments, if any.

--
With Regards,
Ashutosh Sharma.

On Fri, Mar 4, 2022 at 3:54 PM Bharath Rupireddy
 wrote:
>
> On Thu, Mar 3, 2022 at 10:05 PM Robert Haas  wrote:
> >
> > On Fri, Feb 25, 2022 at 6:03 AM Bharath Rupireddy
> >  wrote:
> > > Added a new function that returns the first and last valid WAL record
> > > LSN of a given WAL file.
> >
> > Sounds like fuzzy thinking to me. WAL records can cross file
> > boundaries, and forgetting about that leads to all sorts of problems.
> > Just give people one function that decodes a range of LSNs and call it
> > good. Why do you need anything else? If people want to get the first
> > record that begins in a segment or the first record any portion of
> > which is in a particular segment or the last record that begins in a
> > segment or the last record that ends in a segment or any other such
> > thing, they can use a WHERE clause for that... and if you think they
> > can't, then that should be good cause to rethink the return value of
> > the one-and-only SRF that I think you need here.
>
> Thanks Robert.
>
> Thanks to others for your review comments.
>
> Here's the v7 patch set. These patches are based on the motive that
> "keep it simple and short yet effective and useful". With that in
> mind, I have not implemented the wait mode for any of the functions
> (as it doesn't look an effective use-case and requires adding a new
> page_read callback, instead throw error if future LSN is specified),
> also these functions will give WAL information on the current server's
> timeline. Having said that, I'm open to adding new functions in future
> once this initial version gets in, if there's a requirement and users
> ask for the new functions.
>
> Please review the v7 patch set and provide your thoughts.
>
> Regards,
> Bharath Rupireddy.




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-03-04 Thread Ashutosh Sharma
Please don't mix comments from multiple reviewers into one reply.
It's hard to tell which comments are mine, Julien's, or someone
else's. Could you please respond to the email from each of us
separately, with inline responses? That will make it easier to
understand your thoughts on our review comments.

--
With Regards,
Ashutosh Sharma.

On Fri, Mar 4, 2022 at 4:59 PM Nitin Jadhav
 wrote:
>
> Thanks for reviewing.
>
> > 6) How about shutdown and end-of-recovery checkpoint? Are you planning
> > to have an ereport_startup_progress mechanism as 0002?
>
> I thought of including it earlier then I felt lets first make the
> current patch stable. Once all the fields are properly decided and the
> patch gets in then we can easily extend the functionality to shutdown
> and end-of-recovery cases. I have also observed that the timer
> functionality wont work properly in case of shutdown as we are doing
> an immediate checkpoint. So this needs a lot of discussion and I would
> like to handle this on a separate thread.
> ---
>
> > 7) I think you don't need to call checkpoint_progress_start and
> > pgstat_progress_update_param, any other progress reporting function
> > for shutdown and end-of-recovery checkpoint right?
>
> I had included the guards earlier and then removed later based on the
> discussion upthread.
> ---
>
> > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > -[ RECORD 1 ]-+-
> > pid   | 22043
> > type  | checkpoint
> > kind  | immediate force wait requested time
> > start_lsn | 0/14C60F8
> > start_time| 2022-03-03 18:59:56.018662+05:30
> > phase | performing two phase checkpoint
> >
> >
> > This is the output I see when the checkpointer process has come out of
> > the two phase checkpoint and is currently writing checkpoint xlog
> > records and doing other stuff like updating control files etc. Is this
> > okay?
>
> The idea behind choosing the phases is based on the functionality
> which takes longer time to execute. Since after two phase checkpoint
> till post checkpoint cleanup won't take much time to execute, I have
> not added any additional phase for that. But I also agree that this
> gives wrong information to the user. How about mentioning the phase
> information at the end of each phase like "Initializing",
> "Initialization done", ..., "two phase checkpoint done", "post
> checkpoint cleanup done", .., "finalizing". Except for the first phase
> ("initializing") and last phase ("finalizing"), all the other phases
> describe the end of a certain operation. I feel this gives correct
> information even though the phase name/description does not represent
> the entire code block between two phases. For example if the current
> phase is ''two phase checkpoint done". Then the user can infer that
> the checkpointer has done till two phase checkpoint and it is doing
> other stuff that are after that. Thoughts?
>
> > The output of log_checkpoint shows the number of buffers written is 3
> > whereas the output of pg_stat_progress_checkpoint shows it as 0. See
> > below:
> >
> > 2022-03-03 20:04:45.643 IST [22043] LOG:  checkpoint complete: wrote 3
> > buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
> > write=24.652 s, sync=104.256 s, total=3889.625 s; sync files=2,
> > longest=0.011 s, average=0.008 s; distance=0 kB, estimate=0 kB
> >
> > --
> >
> > [local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
> > -[ RECORD 1 ]-+-
> > pid   | 22043
> > type  | checkpoint
> > kind  | immediate force wait requested time
> > start_lsn | 0/14C60F8
> > start_time| 2022-03-03 18:59:56.018662+05:30
> > phase | finalizing
> > buffers_total | 0
> > buffers_processed | 0
> > buffers_written   | 0
> >
> > Any idea why this mismatch?
>
> Good catch. In BufferSync() we have 'num_to_scan' (buffers_total)
> which indicates the total number of buffers to be processed. Based on
> that, the 'buffers_processed' and 'buffers_written' counter gets
> incremented. I meant these values may reach upto 'buffers_total'. The
> current pg_stat_progress_view support above information. There is
> another place when 'ckpt_bufs_written' gets incremented (In
> SlruInternalWritePage()). This increment is above the 'buffers_total'
> value and it is included in the server log message (checkpoint end)
> and not included in the

Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-03 Thread Ashutosh Sharma
I think we should also see if we can allow end users to pass timeline
information to the pg_walinspect functions. This would help end users
get information about WAL records from a previous timeline, which can
be useful in the case of restored servers.

--
With Regards,
Ashutosh Sharma.

On Thu, Mar 3, 2022 at 8:20 AM Kyotaro Horiguchi
 wrote:
>
> Hi.
>
> +#ifdef FRONTEND
> +/*
> + * Functions that are currently not needed in the backend, but are better
> + * implemented inside xlogreader.c because of the internal facilities 
> available
> + * here.
> + */
> +
>  #endif /* FRONTEND */
>
> Why didn't you remove the emptied #ifdef section?
>
> +int
> +read_local_xlog_page_2(XLogReaderState *state, XLogRecPtr targetPagePtr,
> +  int reqLen, XLogRecPtr 
> targetRecPtr, char *cur_page)
>
> The difference with the original function is this function has one
> additional if-block amid. I think we can insert the code directly in
> the original function.
>
> +   /*
> +* We are trying to read future WAL. Let's not wait 
> for WAL to be
> +* available if indicated.
> +*/
> +   if (state->private_data != NULL)
>
> However, in the first place it seems to me there's not need for the
> function to take care of no_wait affairs.
>
> If, for expample, pg_get_wal_record_info() with no_wait = true, it is
> enough that the function identifies the bleeding edge of WAL then loop
> until the LSN.  So I think no need for the new function, nor for any
> modification on the origical function.
>
> The changes will reduce the footprint of the patch largely, I think.
>
> At Wed, 2 Mar 2022 22:37:43 +0530, Bharath Rupireddy 
>  wrote in
> > On Wed, Mar 2, 2022 at 8:12 PM Ashutosh Sharma  
> > wrote:
> > >
> > > Some review comments on v5 patch (v5-0001-pg_walinspect.patch)
> >
> > Thanks for reviewing.
> >
> > > +--
> > > +-- pg_get_wal_records_info()
> > > +--
> > > +CREATE FUNCTION pg_get_wal_records_info(IN start_lsn pg_lsn,
> > > +IN end_lsn pg_lsn,
> > > +IN wait_for_wal boolean DEFAULT false,
> > > +OUT lsn pg_lsn,
> > >
> > > What does the wait_for_wal flag mean here when one has already
> > > specified the start and end lsn? AFAIU, If a user has specified a
> > > start and stop LSN, it means that the user knows the extent to which
> > > he/she wants to display the WAL records in which case we need to stop
> > > once the end lsn has reached . So what is the meaning of wait_for_wal
> > > flag? Does it look sensible to have the wait_for_wal flag here? To me
> > > it doesn't.
> >
> > Users can always specify a future end_lsn and set wait_for_wal to
> > true, then the pg_get_wal_records_info/pg_get_wal_stats functions can
> > wait for the WAL. IMO, this is useful. If you remember you were okay
> > with wait/nowait versions for some of the functions upthread [1]. I'm
> > not going to retain this behaviour for both
> > pg_get_wal_records_info/pg_get_wal_stats as it is similar to
> > pg_waldump's --follow option.
>
> I agree to this for now. However, I prefer that NULL or invalid
> end_lsn is equivalent to pg_current_wal_lsn().
>
> > > ==
> > >
> > > +--
> > > +-- pg_get_wal_records_info_till_end_of_wal()
> > > +--
> > > +CREATE FUNCTION pg_get_wal_records_info_till_end_of_wal(IN start_lsn 
> > > pg_lsn,
> > > +OUT lsn pg_lsn,
> > > +OUT prev_lsn pg_lsn,
> > > +OUT xid xid,
> > >
> > > Why is this function required? Is pg_get_wal_records_info() alone not
> > > enough? I think it is. See if we can make end_lsn optional in
> > > pg_get_wal_records_info() and lets just have it alone. I think it can
> > > do the job of pg_get_wal_records_info_till_end_of_wal function.
>
> I rather agree to Ashutosh. This feature can be covered by
> pg_get_wal_records(start_lsn, NULL, false).
>
> > > ==
> > >
> > > +--
> > > +-- pg_get_wal_stats_till_end_of_wal()
> > > +--
> > > +CREATE FUNCTION pg_get_wal_stats_till_end_of_wal(IN start_lsn pg_lsn,
> > > +OUT resource_manager text,
> > > +OUT count int8,
> > >
> > > Above comment applies to this function as well. Isn't pg_get_wal_stats() 
> > > enough?
> >
> > I'm doing the following input validations for these functions to not

Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-03-03 Thread Ashutosh Sharma
Here are some of my review comments on the latest patch:

+ 
+  
+   type text
+  
+  
+   Type of checkpoint. See .
+  
+ 
+
+ 
+  
+   kind text
+  
+  
+   Kind of checkpoint. See .
+  
+ 

This looks a bit confusing: two columns, one named "checkpoint type"
and another "checkpoint kind". You can probably rename checkpoint-kind
to checkpoint-flags and leave checkpoint-type as it is.

==

+  
pg_stat_progress_checkpointpg_stat_progress_checkpoint
+  One row only, showing the progress of the checkpoint.

Let's make this message consistent with the already existing message
for pg_stat_wal_receiver. See description for pg_stat_wal_receiver
view in "Dynamic Statistics Views" table.

==

[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-+-
pid   | 22043
type  | checkpoint
kind  | immediate force wait requested time

I think the output in the kind column can be displayed as {immediate,
force, wait, requested, time}. By the way, these are all checkpoint
flags, so it is better to display them as checkpoint flags instead of
checkpoint kind, as mentioned in one of my previous comments.

==

[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-+-
pid   | 22043
type  | checkpoint
kind  | immediate force wait requested time
start_lsn | 0/14C60F8
start_time| 2022-03-03 18:59:56.018662+05:30
phase | performing two phase checkpoint


This is the output I see when the checkpointer process has come out of
the two phase checkpoint and is currently writing checkpoint xlog
records and doing other stuff like updating control files etc. Is this
okay?

==

The output of log_checkpoint shows the number of buffers written is 3
whereas the output of pg_stat_progress_checkpoint shows it as 0. See
below:

2022-03-03 20:04:45.643 IST [22043] LOG:  checkpoint complete: wrote 3
buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
write=24.652 s, sync=104.256 s, total=3889.625 s; sync files=2,
longest=0.011 s, average=0.008 s; distance=0 kB, estimate=0 kB

--

[local]:5432 ashu@postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-+-
pid   | 22043
type  | checkpoint
kind  | immediate force wait requested time
start_lsn | 0/14C60F8
start_time| 2022-03-03 18:59:56.018662+05:30
phase | finalizing
buffers_total | 0
buffers_processed | 0
buffers_written   | 0

Any idea why this mismatch?

==

I think we can add a couple more pieces of information to this view -
start_time for the buffer write operation and start_time for the
buffer sync operation. These are two very time-consuming tasks in a
checkpoint, and people would find it useful to know how much time the
checkpoint is spending in its I/O phases. Thoughts?

--
With Regards,
Ashutosh Sharma.

On Wed, Mar 2, 2022 at 4:45 PM Nitin Jadhav
 wrote:
>
> Thanks for reviewing.
>
> > > > I suggested upthread to store the starting timeline instead.  This way 
> > > > you can
> > > > deduce whether it's a restartpoint or a checkpoint, but you can also 
> > > > deduce
> > > > other information, like what was the starting WAL.
> > >
> > > I don't understand why we need the timeline here to just determine
> > > whether it's a restartpoint or checkpoint.
> >
> > I'm not saying it's necessary, I'm saying that for the same space usage we 
> > can
> > store something a bit more useful.  If no one cares about having the 
> > starting
> > timeline available for no extra cost then sure, let's just store the kind
> > directly.
>
> Fixed.
>
> > 2) Can't we just have these checks inside CASE-WHEN-THEN-ELSE blocks
> > directly instead of new function pg_stat_get_progress_checkpoint_kind?
> > + snprintf(ckpt_kind, MAXPGPATH, "%s%s%s%s%s%s%s%s%s",
> > + (flags == 0) ? "unknown" : "",
> > + (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
> > + (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
> > + (flags & CHECKPOINT_IMMEDIATE) ? "immediate " : "",
> > + (flags & CHECKPOINT_FORCE) ? "force " : "",
> > + (flags & CHECKPOINT_WAIT) ? "wait " : "",
> > + (flags & CHECKPOINT_CAUSE_XLOG) ? "wal " : "",
> > + (flags & CHECKPOINT_CAUSE_TIME) ? "time " : "",
> > + (flags & CHECKPOINT_FLUSH_ALL) ? &

Re: Make mesage at end-of-recovery less scary.

2022-03-03 Thread Ashutosh Sharma
On Wed, Mar 2, 2022 at 7:47 AM Kyotaro Horiguchi
 wrote:
>
> At Sat, 19 Feb 2022 09:31:33 +0530, Ashutosh Sharma  
> wrote in
> > The changes looks good. thanks.!
>
> Thanks!
>
> Some recent core change changed WAL insertion speed during the TAP
> test and revealed one forgotton case of EndOfWAL.  When a record
> header flows into the next page, XLogReadRecord does separate check
> from ValidXLogRecordHeader by itself.
>

The new changes made in the patch look good. Thanks to the recent
changes speeding up WAL insertion, which helped us catch this bug.

One small comment:

record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-   total_len = record->xl_tot_len;

Do you think we need to move the comment written for the above code,
which says:

/*
 * Read the record length.
     *
...
...

--
With Regards,
Ashutosh Sharma.




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-02 Thread Ashutosh Sharma
On Wed, Mar 2, 2022 at 10:37 PM Bharath Rupireddy
 wrote:
>
> On Wed, Mar 2, 2022 at 8:12 PM Ashutosh Sharma  wrote:
> >
> > Some review comments on v5 patch (v5-0001-pg_walinspect.patch)
>
> Thanks for reviewing.
>
> > +--
> > +-- pg_get_wal_records_info()
> > +--
> > +CREATE FUNCTION pg_get_wal_records_info(IN start_lsn pg_lsn,
> > +IN end_lsn pg_lsn,
> > +IN wait_for_wal boolean DEFAULT false,
> > +OUT lsn pg_lsn,
> >
> > What does the wait_for_wal flag mean here when one has already
> > specified the start and end lsn? AFAIU, If a user has specified a
> > start and stop LSN, it means that the user knows the extent to which
> > he/she wants to display the WAL records in which case we need to stop
> > once the end lsn has reached . So what is the meaning of wait_for_wal
> > flag? Does it look sensible to have the wait_for_wal flag here? To me
> > it doesn't.
>
> Users can always specify a future end_lsn and set wait_for_wal to
> true, then the pg_get_wal_records_info/pg_get_wal_stats functions can
> wait for the WAL. IMO, this is useful. If you remember you were okay
> with wait/nowait versions for some of the functions upthread [1]. I'm
> not going to retain this behaviour for both
> pg_get_wal_records_info/pg_get_wal_stats as it is similar to
> pg_waldump's --follow option.
>

It is not at all similar to pg_waldump's behaviour. Please check the
behaviour of pg_waldump properly: does it wait for any WAL records
when a user has specified a stop pointer? It doesn't, and it
shouldn't. I mean, does it even make sense to wait for WAL when a stop
pointer is specified? It's quite understandable that if a user has
asked pg_walinspect to stop at a certain point, it must. Also, what if
there are already WAL records after the stop pointer - in that case,
what would the wait flag even mean?

Further, have you checked the wait_for_wal flag behaviour? Is it even working?

> >
> > +--
> > +-- pg_get_wal_records_info_till_end_of_wal()
> > +--
> > +CREATE FUNCTION pg_get_wal_records_info_till_end_of_wal(IN start_lsn 
> > pg_lsn,
> > +OUT lsn pg_lsn,
> > +OUT prev_lsn pg_lsn,
> > +OUT xid xid,
> >
> > Why is this function required? Is pg_get_wal_records_info() alone not
> > enough? I think it is. See if we can make end_lsn optional in
> > pg_get_wal_records_info() and lets just have it alone. I think it can
> > do the job of pg_get_wal_records_info_till_end_of_wal function.
> >
> > ==
> >
> > +--
> > +-- pg_get_wal_stats_till_end_of_wal()
> > +--
> > +CREATE FUNCTION pg_get_wal_stats_till_end_of_wal(IN start_lsn pg_lsn,
> > +OUT resource_manager text,
> > +OUT count int8,
> >
> > Above comment applies to this function as well. Isn't pg_get_wal_stats() 
> > enough?
>
> I'm doing the following input validations for these functions to not
> cause any issues with invalid LSN. If I were to have the default value
> for end_lsn as 0/0, I can't perform input validations right? That is
> the reason I'm having separate functions {pg_get_wal_records_info,
> pg_get_wal_stats}_till_end_of_wal() versions.
>

You can do it. Please check pg_waldump to understand how it is done
there. There is no need for multiple functions doing different things
when one single function can do the whole job.

> > ==
> >
> >
> > +   if (loc <= read_upto)
> > +   break;
> > +
> > +   /* Let's not wait for WAL to be available if
> > indicated */
> > +   if (loc > read_upto &&
> > +   state->private_data != NULL)
> > +   {
> >
> > Why loc > read_upto? The first if condition is (loc <= read_upto)
> > followed by the second if condition - (loc > read_upto). Is the first
> > if condition (loc <= read_upto) not enough to indicate that loc >
> > read_upto?
>
> Yeah, that's unnecessary, I improved the comment there and removed loc
> > read_upto.
>
> > ==
> >
> > +#define IsEndOfWALReached(state) \
> > +   (state->private_data != NULL && \
> > +   (((ReadLocalXLOGPage2Private *)
> > xlogreader->private_data)->no_wait == true) && \
> > +   (((ReadLocalXLOGPage2Private *)
> > xlogreader->private_data)->reached_end_of_wal == true))
> >
> > I think we should either use state or xlogreader. First line says
> > state->private_data and second line x

Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-02 Thread Ashutosh Sharma
Some review comments on v5 patch (v5-0001-pg_walinspect.patch)

+--
+-- pg_get_wal_records_info()
+--
+CREATE FUNCTION pg_get_wal_records_info(IN start_lsn pg_lsn,
+IN end_lsn pg_lsn,
+IN wait_for_wal boolean DEFAULT false,
+OUT lsn pg_lsn,

What does the wait_for_wal flag mean here when one has already
specified the start and end lsn? AFAIU, If a user has specified a
start and stop LSN, it means that the user knows the extent to which
he/she wants to display the WAL records in which case we need to stop
once the end lsn has reached . So what is the meaning of wait_for_wal
flag? Does it look sensible to have the wait_for_wal flag here? To me
it doesn't.

==

+--
+-- pg_get_wal_records_info_till_end_of_wal()
+--
+CREATE FUNCTION pg_get_wal_records_info_till_end_of_wal(IN start_lsn pg_lsn,
+OUT lsn pg_lsn,
+OUT prev_lsn pg_lsn,
+OUT xid xid,

Why is this function required? Is pg_get_wal_records_info() alone not
enough? I think it is. See if we can make end_lsn optional in
pg_get_wal_records_info() and lets just have it alone. I think it can
do the job of pg_get_wal_records_info_till_end_of_wal function.

==

+--
+-- pg_get_wal_stats_till_end_of_wal()
+--
+CREATE FUNCTION pg_get_wal_stats_till_end_of_wal(IN start_lsn pg_lsn,
+OUT resource_manager text,
+OUT count int8,

Above comment applies to this function as well. Isn't pg_get_wal_stats() enough?

==


+   if (loc <= read_upto)
+   break;
+
+   /* Let's not wait for WAL to be available if
indicated */
+   if (loc > read_upto &&
+   state->private_data != NULL)
+   {

Why check loc > read_upto? The first if condition is (loc <=
read_upto), followed by the second if condition (loc > read_upto). If
the first condition didn't cause a break, then loc > read_upto already
holds, so the second check is redundant.
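In other words, once the break-guard has run, the condition is implied
by control flow. Reduced to a hypothetical standalone model of the
loop (names are illustrative, not the patch's):

```c
#include <assert.h>

/* Reduced model of the read loop: wait until enough WAL is available.
 * After the break-guard, loc > read_upto always holds, so re-testing
 * it before the no-wait handling is redundant. */
static int
waits_until_available(long loc, long read_upto)
{
    int waits = 0;

    for (;;)
    {
        if (loc <= read_upto)
            break;
        /* here loc > read_upto is already implied */
        waits++;
        read_upto++;            /* stands in for "more WAL arrived" */
    }
    return waits;
}
```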

==

+#define IsEndOfWALReached(state) \
+   (state->private_data != NULL && \
+   (((ReadLocalXLOGPage2Private *)
xlogreader->private_data)->no_wait == true) && \
+   (((ReadLocalXLOGPage2Private *)
xlogreader->private_data)->reached_end_of_wal == true))


I think we should either use state or xlogreader. First line says
state->private_data and second line xlogreader->private_data.

==

+   (((ReadLocalXLOGPage2Private *)
xlogreader->private_data)->reached_end_of_wal == true))
+

There is a new patch coming to make the end-of-WAL messages less
scary. It introduces an EOW flag in XLogReaderState; maybe we can use
that instead of introducing new flags in the private area to represent
the end of WAL.

==

+/*
+ * XLogReaderRoutine->page_read callback for reading local xlog files
+ *
+ * This function is same as read_local_xlog_page except that it works in both
+ * wait and no wait mode. The callers can specify about waiting in private_data
+ * of XLogReaderState.
+ */
+int
+read_local_xlog_page_2(XLogReaderState *state, XLogRecPtr targetPagePtr,
+  int reqLen, XLogRecPtr
targetRecPtr, char *cur_page)
+{
+   XLogRecPtr  read_upto,

Do we really need this function? Can't we make use of an existing WAL
reader function - read_local_xlog_page()?

--
With Regards,
Ashutosh Sharma.

On Fri, Feb 25, 2022 at 4:33 PM Bharath Rupireddy
 wrote:
>
> On Wed, Feb 16, 2022 at 9:04 AM Ashutosh Sharma  wrote:
> > I don't think that's the use case of this patch. Unless there is some
> > other valid reason, I would suggest you remove it.
>
> Removed the function pg_verify_raw_wal_record. Robert and Greg also
> voted for removal upthread.
>
> > > > Should we add a function that returns the pointer to the first and
> > > > probably the last WAL record in the WAL segment? This would help users
> > > > to inspect the wal records in the entire wal segment if they wish to
> > > > do so.
> > >
> > > Good point. One can do this already with pg_get_wal_records_info and
> > > pg_walfile_name_offset. Usually, the LSN format itself can give an
> > > idea about the WAL file it is in.
> > >
> > > postgres=# select lsn, pg_walfile_name_offset(lsn) from
> > > pg_get_wal_records_info('0/500', '0/5FF') order by lsn asc
> > > limit 1;
> > > lsn|pg_walfile_name_offset
> > > ---+---
> > >  0/538 | (00010005,56)
> > > (1 row)
> > >
> > > postgres=# select lsn, pg_walfile_name_offset(lsn) from
> > > pg_get_wal_records_info('0/500', '0/5FF') order by lsn desc
> > > limit 1;
> > > lsn|   pg_walfile_name_offset
> > > ---+

Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-23 Thread Ashutosh Sharma
+   if ((ckpt_flags &
+(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
+   {

This code (present in multiple places) looks a little ugly to me. What
we can do instead is add a macro, probably named IsShutdownCheckpoint(),
which does the above check, and use it in all the functions that have
this check. See below:

#define IsShutdownCheckpoint(flags) \
  (((flags) & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) != 0)

And then you may use this macro like:

if (IsBootstrapProcessingMode() || IsShutdownCheckpoint(flags))
return;

This change can be done in all these functions:

+void
+checkpoint_progress_start(int flags)

--

+ */
+void
+checkpoint_progress_update_param(int index, int64 val)

--

+ * Stop reporting progress of the checkpoint.
+ */
+void
+checkpoint_progress_end(void)
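
Incidentally, one thing to be careful about while writing such a macro: in
C, the != operator binds tighter than &, so the bit-test needs its own set
of parentheses. Here is a tiny self-contained illustration of the
difference (the flag values below are made up for the demo, not the real
PostgreSQL defines):

```c
#include <assert.h>

/* Hypothetical flag bits, for illustration only */
#define CHECKPOINT_IS_SHUTDOWN      0x0001
#define CHECKPOINT_END_OF_RECOVERY  0x0002
#define CHECKPOINT_CAUSE_TIME       0x0008

/* Correct: the bit-test is fully parenthesized before comparing with 0 */
#define IsShutdownCheckpoint(flags) \
    (((flags) & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) != 0)

/*
 * Buggy spelling: because != binds tighter than &, this actually
 * evaluates (flags & 1), so an end-of-recovery checkpoint is missed.
 */
#define IsShutdownCheckpointBuggy(flags) \
    ((flags) & ((CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY) != 0))
```

The buggy form happens to work for CHECKPOINT_IS_SHUTDOWN (bit 0x0001) but
silently misclassifies CHECKPOINT_END_OF_RECOVERY.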

==

+
pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT,
InvalidOid);
+
+   val[0] = XLogCtl->InsertTimeLineID;
+   val[1] = flags;
+   val[2] = PROGRESS_CHECKPOINT_PHASE_INIT;
+   val[3] = CheckpointStats.ckpt_start_t;
+
+   pgstat_progress_update_multi_param(4, index, val);
+   }

Any specific reason for recording the timelineID in the checkpoint stats
table? Will this ever change in our case?

--
With Regards,
Ashutosh Sharma.

On Wed, Feb 23, 2022 at 6:59 PM Nitin Jadhav
 wrote:
>
> > I will make use of pgstat_progress_update_multi_param() in the next
> > patch to replace multiple calls to checkpoint_progress_update_param().
>
> Fixed.
> ---
>
> > > The other progress tables use [type]_total as column names for counter
> > > targets (e.g. backup_total for backup_streamed, heap_blks_total for
> > > heap_blks_scanned, etc.). I think that `buffers_total` and
> > > `files_total` would be better column names.
> >
> > I agree and I will update this in the next patch.
>
> Fixed.
> ---
>
> > How about this "The checkpoint is started because max_wal_size is reached".
> >
> > "The checkpoint is started because checkpoint_timeout expired".
> >
> > "The checkpoint is started because some operation forced a checkpoint".
>
> I have used the above description. Kindly let me know if any changes
> are required.
> ---
>
> > > > +  checkpointing CommitTs pages
> > >
> > > CommitTs -> Commit time stamp
> >
> > I will handle this in the next patch.
>
> Fixed.
> ---
>
> > There are more scenarios where you can have a baackend requesting a 
> > checkpoint
> > and waiting for its completion, and there may be more than one backend
> > concerned, so I don't think that storing only one / the first backend pid is
> > ok.
>
> Thanks for this information. I am not considering backend_pid.
> ---
>
> > I think all the information should be exposed.  Only knowing why the current
> > checkpoint has been triggered without any further information seems a bit
> > useless.  Think for instance for cases like [1].
>
> I have supported all possible checkpoint kinds. Added
> pg_stat_get_progress_checkpoint_kind() to convert the flags (int) to a
> string representing a combination of flags and also checking for the
> flag update in ImmediateCheckpointRequested() which checks whether
> CHECKPOINT_IMMEDIATE flag is set or not. I did not find any other
> cases where the flags get changed (which changes the current
> checkpoint behaviour) during the checkpoint. Kindly let me know if I
> am missing something.
> ---
>
> > > I feel 'processes_wiating' aligns more with the naming conventions of
> > > the fields of the existing progres views.
> >
> > There's at least pg_stat_progress_vacuum.num_dead_tuples.  Anyway I don't 
> > have
> > a strong opinion on it, just make sure to correct the typo.
>
> More analysis is required to support this. I am planning to take care
> in the next patch.
> ---
>
> > If pg_is_in_recovery() is true, then it's a restartpoint, otherwise it's a
> > restartpoint if the checkpoint's timeline is different from the current
> > timeline?
>
> Fixed.
>
> Sharing the v2 patch. Kindly have a look and share your comments.
>
> Thanks & Regards,
> Nitin Jadhav
>
>
>
>
> On Tue, Feb 22, 2022 at 12:08 PM Nitin Jadhav
>  wrote:
> >
> > > > Thank you for sharing the information.  'triggering backend PID' (int)
> > > > - can be stored without any problem. 'checkpoint or restartpoint?'
> > > > (boolean) - can be stored as a integer value like
> > > > PROGRESS_CHECKPOINT_TYPE_CHECKPOINT(0) and
> > > > PROGRESS_CHECKPOINT_TYPE_RESTARTPOINT(1). 'elap

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-22 Thread Ashutosh Sharma
I'm not sure about the current status, but I found this while playing
around with the latest changes, so I thought of sharing it here.

+  
+   strategy
+   
+
+ This is used for copying the database directory.  Currently, we have
+ two strategies the WAL_LOG_BLOCK and the

Isn't it wal_log instead of wal_log_block?

I think that when users input a wrong strategy with the createdb command,
we should provide, along with the error message, a hint showing the
allowed values for the strategy types. This will be helpful for the users.
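
Something along the lines of the sketch below is what I have in mind. To
be clear, this is only an illustration of the shape of the check -- the
function names, the hint wording, and the helpers are my own stand-ins,
not code from the patch (the real code would raise the error via
ereport() with an errhint()):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical list of allowed strategy values, for illustration only */
static const char *const allowed_strategies[] = {"wal_log", "file_copy"};

/* Return 1 if the given strategy value is recognized, 0 otherwise. */
static int
createdb_strategy_is_valid(const char *strategy)
{
    for (size_t i = 0; i < sizeof(allowed_strategies) / sizeof(allowed_strategies[0]); i++)
    {
        if (strcmp(strategy, allowed_strategies[i]) == 0)
            return 1;
    }
    return 0;
}

/* The hint a user would see alongside the "invalid strategy" error. */
static const char *
createdb_strategy_hint(void)
{
    return "Valid strategies are \"wal_log\" and \"file_copy\".";
}
```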

On Tue, Feb 15, 2022 at 5:19 PM Dilip Kumar  wrote:
>
> On Tue, Feb 15, 2022 at 2:01 AM Maciek Sakrejda  wrote:
> >
> Here is the updated version of the patch, the changes are 1) Fixed
> review comments given by Robert and one open comment from Ashutosh.
> 2) Preserved the old create db method. 3) As agreed upthread for now
> we are using the new strategy only for createdb not for movedb so I
> have removed the changes in ForgetDatabaseSyncRequests() and
> DropDatabaseBuffers().  3) Provided a database creation strategy
> option as of now I have kept it as below.
>
> CREATE DATABASE ... WITH (STRATEGY = WAL_LOG);  -- default if
> option is omitted
> CREATE DATABASE ... WITH (STRATEGY = FILE_COPY);
>
> I have updated the document but I was not sure how much internal
> information to be exposed to the user so I will work on that based on
> feedback from others.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-22 Thread Ashutosh Sharma
+/* Kinds of checkpoint (as advertised via PROGRESS_CHECKPOINT_KIND) */
+#define PROGRESS_CHECKPOINT_KIND_WAL       0
+#define PROGRESS_CHECKPOINT_KIND_TIME      1
+#define PROGRESS_CHECKPOINT_KIND_FORCE     2
+#define PROGRESS_CHECKPOINT_KIND_UNKNOWN   3

On what basis have you classified the above into the various types of
checkpoints? AFAIK, the first two types are based on what triggered
the checkpoint (whether it was the checkpoint_timeout or max_wal_size
settings), while the third type indicates the forced checkpoint that can
happen when the checkpoint is triggered for various reasons, e.g.
during createdb or dropdb etc. It is quite possible that both the
PROGRESS_CHECKPOINT_KIND_TIME and PROGRESS_CHECKPOINT_KIND_FORCE flags
are set for the checkpoint because multiple checkpoint requests are
processed in one go, so what type of checkpoint would that be?
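
To make the concern concrete, below is a minimal standalone sketch of
this kind of if/else-if classification (the flag bits and kind codes are
made up for the demo and are not the real defines). When a merged
checkpoint request carries more than one cause, only the first matching
branch is reported and the other cause is silently dropped:

```c
#include <assert.h>

/* Hypothetical flag bits and kind codes, for illustration only */
#define CHECKPOINT_CAUSE_XLOG  0x0001
#define CHECKPOINT_CAUSE_TIME  0x0002
#define CHECKPOINT_FORCE       0x0004

#define KIND_WAL      0
#define KIND_TIME     1
#define KIND_FORCE    2
#define KIND_UNKNOWN  3

/* Mirrors the quoted if/else-if chain: the first matching cause wins. */
static int
checkpoint_kind(int flags)
{
    if (flags & CHECKPOINT_CAUSE_XLOG)
        return KIND_WAL;
    else if (flags & CHECKPOINT_CAUSE_TIME)
        return KIND_TIME;
    else if (flags & CHECKPOINT_FORCE)
        return KIND_FORCE;
    return KIND_UNKNOWN;
}
```

Note that a request with both the TIME and FORCE bits set is reported
simply as a timed checkpoint here.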

+*/
+   if ((flags & (CHECKPOINT_IS_SHUTDOWN |
CHECKPOINT_END_OF_RECOVERY)) == 0)
+   {
+
pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT,
InvalidOid);
+   checkpoint_progress_update_param(flags,
PROGRESS_CHECKPOINT_PHASE,
+
  PROGRESS_CHECKPOINT_PHASE_INIT);
+   if (flags & CHECKPOINT_CAUSE_XLOG)
+   checkpoint_progress_update_param(flags,
PROGRESS_CHECKPOINT_KIND,
+
  PROGRESS_CHECKPOINT_KIND_WAL);
+   else if (flags & CHECKPOINT_CAUSE_TIME)
+   checkpoint_progress_update_param(flags,
PROGRESS_CHECKPOINT_KIND,
+
  PROGRESS_CHECKPOINT_KIND_TIME);
+   else if (flags & CHECKPOINT_FORCE)
+   checkpoint_progress_update_param(flags,
PROGRESS_CHECKPOINT_KIND,
+
  PROGRESS_CHECKPOINT_KIND_FORCE);
+   else
+   checkpoint_progress_update_param(flags,
PROGRESS_CHECKPOINT_KIND,
+
  PROGRESS_CHECKPOINT_KIND_UNKNOWN);
+   }
+}

--
With Regards,
Ashutosh Sharma.

On Thu, Feb 10, 2022 at 12:23 PM Nitin Jadhav
 wrote:
>
> > > We need at least a trace of the number of buffers to sync (num_to_scan) 
> > > before the checkpoint start, instead of just emitting the stats at the 
> > > end.
> > >
> > > Bharat, it would be good to show the buffers synced counter and the total 
> > > buffers to sync, checkpointer pid, substep it is running, whether it is 
> > > on target for completion, checkpoint_Reason
> > > (manual/times/forced). BufferSync has several variables tracking the sync 
> > > progress locally, and we may need some refactoring here.
> >
> > I agree to provide above mentioned information as part of showing the
> > progress of current checkpoint operation. I am currently looking into
> > the code to know if any other information can be added.
>
> Here is the initial patch to show the progress of checkpoint through
> pg_stat_progress_checkpoint view. Please find the attachment.
>
> The information added to this view are pid - process ID of a
> CHECKPOINTER process, kind - kind of checkpoint indicates the reason
> for checkpoint (values can be wal, time or force), phase - indicates
> the current phase of checkpoint operation, total_buffer_writes - total
> number of buffers to be written, buffers_processed - number of buffers
> processed, buffers_written - number of buffers written,
> total_file_syncs - total number of files to be synced, files_synced -
> number of files synced.
>
> There are many operations happen as part of checkpoint. For each of
> the operation I am updating the phase field of
> pg_stat_progress_checkpoint view. The values supported for this field
> are initializing, checkpointing replication slots, checkpointing
> snapshots, checkpointing logical rewrite mappings, checkpointing CLOG
> pages, checkpointing CommitTs pages, checkpointing SUBTRANS pages,
> checkpointing MULTIXACT pages, checkpointing SLRU pages, checkpointing
> buffers, performing sync requests, performing two phase checkpoint,
> recycling old XLOG files and Finalizing. In case of checkpointing
> buffers phase, the fields total_buffer_writes, buffers_processed and
> buffers_written shows the detailed progress of writing buffers. In
> case of performing sync requests phase, the fields total_file_syncs
> and files_synced shows the detailed progress of syncing files. In
> other phases, only the phase field is getting updated and it is
> difficult to show the progress because we do not get the total number
> of files count without traversing the directory. It is not worth to
> calculate that as it affects the performance of the checkpoint. I also
> gave a thought to just mention the number of files processed, but this
> wont give a meaningful progress information (It can be treated as
> stati

Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-02-18 Thread Ashutosh Sharma
On Sat, Feb 19, 2022 at 2:24 AM Nathan Bossart  wrote:
>
> On Fri, Feb 18, 2022 at 10:48:10PM +0530, Ashutosh Sharma wrote:
> > Some review comments on the latest version:
>
> Thanks for the feedback!  Before I start spending more time on this one, I
> should probably ask if this has any chance of making it into v15.

I don't see any reason why it can't make it to v15. However, it is not
super urgent as the users facing this problem have a choice. They can
use non-exclusive mode.

--
With Regards,
Ashutosh Sharma.




Re: Make mesage at end-of-recovery less scary.

2022-02-18 Thread Ashutosh Sharma
On Thu, Feb 17, 2022 at 1:20 PM Kyotaro Horiguchi
 wrote:
>
> At Tue, 15 Feb 2022 20:17:20 +0530, Ashutosh Sharma  
> wrote in
> > OK. The v13 patch looks good. I have marked it as ready to commit.
> > Thank you for working on all my review comments.
>
> Thanks! But the recent xlog.c refactoring clashes with this patch.
> And I found a silly bug while rebasing.
>
> xlog.c:12463 / xlogrecovery.c:3168
> if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
> ..
> {
> +   Assert(!StandbyMode);
> ...
> +   xlogreader->EndOfWAL = true;
>
> Yeah, I forgot about promotion there..

Yes, we exit WaitForWALToBecomeAvailable() even in standby mode,
provided the user has requested promotion. So checking the
!StandbyMode condition alone was not enough.

So what I should have done is
> setting EndOfWAL according to StandbyMode.
>
> +   Assert(!StandbyMode || CheckForStandbyTrigger());
> ...
> +   /* promotion exit is not end-of-WAL */
> +   xlogreader->EndOfWAL = !StandbyMode;
>

The changes look good. Thanks!

--
With Regards,
Ashutosh Sharma.




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-02-18 Thread Ashutosh Sharma
Some review comments on the latest version:

+* runningBackups is a counter indicating the number of backups currently in
+* progress. forcePageWrites is set to true when either of these is
+* non-zero. lastBackupStart is the latest checkpoint redo location used as
+* a starting point for an online backup.
 */
-   ExclusiveBackupState exclusiveBackupState;
-   int nonExclusiveBackups;

What do you mean by "either of these is non-zero"? Earlier we used
to set forcePageWrites in the case of both exclusive and non-exclusive
backups, but we have just one type of backup now.

==

-* OK to update backup counters, forcePageWrites and session-level lock.
+* OK to update backup counters and forcePageWrites.
 *

We still update the status of the session-level lock, so I don't think
we should update the above comment. See below code:

if (XLogCtl->Insert.runningBackups == 0)
{
XLogCtl->Insert.forcePageWrites = false;
}

/*
 * Clean up session-level lock.
 *
 * You might think that WALInsertLockRelease() can be called before
 * cleaning up session-level lock because session-level lock doesn't need
 * to be protected with WAL insertion lock. But since
 * CHECK_FOR_INTERRUPTS() can occur in it, session-level lock must be
 * cleaned up before it.
 */
sessionBackupState = SESSION_BACKUP_NONE;

WALInsertLockRelease();

==

@@ -8993,18 +8686,16 @@ do_pg_abort_backup(int code, Datum arg)
boolemit_warning = DatumGetBool(arg);

/*
-* Quick exit if session is not keeping around a non-exclusive backup
-* already started.
+* Quick exit if session does not have a running backup.
 */
-   if (sessionBackupState != SESSION_BACKUP_NON_EXCLUSIVE)
+   if (sessionBackupState != SESSION_BACKUP_RUNNING)
return;

WALInsertLockAcquireExclusive();
-   Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
-   XLogCtl->Insert.nonExclusiveBackups--;
+   Assert(XLogCtl->Insert.runningBackups > 0);
+   XLogCtl->Insert.runningBackups--;

-   if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE &&
-   XLogCtl->Insert.nonExclusiveBackups == 0)
+   if (XLogCtl->Insert.runningBackups == 0)
{
XLogCtl->Insert.forcePageWrites = false;
}

I think we have a lot of common code in do_pg_abort_backup() and
pg_do_stop_backup(). So why not have a common function that can be
called from both these functions?
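
Roughly what I mean is a shared helper along the lines below. This is
just a hedged sketch with simplified stand-in names -- the real helper
would of course operate on XLogCtl->Insert while holding the WAL
insertion lock:

```c
#include <assert.h>

/* Simplified stand-in for the relevant XLogCtl->Insert fields */
typedef struct BackupCtl
{
    int runningBackups;
    int forcePageWrites;    /* bool in the real code */
} BackupCtl;

/*
 * Common bookkeeping shared by the backup-stop and backup-abort paths:
 * decrement the running-backups counter and clear forcePageWrites once
 * the last backup has finished.
 */
static void
finish_running_backup(BackupCtl *ctl)
{
    assert(ctl->runningBackups > 0);
    ctl->runningBackups--;
    if (ctl->runningBackups == 0)
        ctl->forcePageWrites = 0;
}
```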

==

+# Now delete the bogus backup_label file since it will interfere with startup
+unlink("$pgdata/backup_label")
+  or BAIL_OUT("unable to unlink $pgdata/backup_label");
+

Why do we need this additional change? Earlier this was not required.

--
With Regards,
Ashutosh Sharma.

On Thu, Feb 17, 2022 at 6:41 AM Nathan Bossart  wrote:
>
> Here is a rebased patch.
>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com




Re: Make mesage at end-of-recovery less scary.

2022-02-17 Thread Ashutosh Sharma
On Thu, Feb 17, 2022 at 1:20 PM Kyotaro Horiguchi
 wrote:
>
> At Tue, 15 Feb 2022 20:17:20 +0530, Ashutosh Sharma  
> wrote in
> > OK. The v13 patch looks good. I have marked it as ready to commit.
> > Thank you for working on all my review comments.
>
> Thanks! But the recent xlog.c refactoring clashes with this patch.
> And I found a silly bug while rebasing.
>

Thanks! I'll take a look at the new changes.

--
With Regards,
Ashutosh Sharma.




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-02-15 Thread Ashutosh Sharma
On Wed, Feb 16, 2022 at 1:01 AM Bharath Rupireddy
 wrote:
>
> On Mon, Feb 14, 2022 at 8:32 PM Ashutosh Sharma  wrote:
> >
> > Here are few comments:
>
> Thanks for reviewing the patches.
>
> > +/*
> > + * Verify the authenticity of the given raw WAL record.
> > + */
> > +Datum
> > +pg_verify_raw_wal_record(PG_FUNCTION_ARGS)
> > +{
> >
> >
> > Do we really need this function? I see that whenever the record is
> > read, we verify it. So could there be a scenario where any of these
> > functions would return an invalid WAL record?
>
> Yes, this function can be useful. Imagine a case where raw WAL records
> are fetched from one server using pg_get_wal_record_info and sent over
> the network to another server (for fixing some of the corrupted data
> pages or for whatever reasons), using pg_verify_raw_wal_record one can
> verify authenticity.
>

I don't think that's the use case of this patch. Unless there is some
other valid reason, I would suggest you remove it.

> > Should we add a function that returns the pointer to the first and
> > probably the last WAL record in the WAL segment? This would help users
> > to inspect the wal records in the entire wal segment if they wish to
> > do so.
>
> Good point. One can do this already with pg_get_wal_records_info and
> pg_walfile_name_offset. Usually, the LSN format itself can give an
> idea about the WAL file it is in.
>
> postgres=# select lsn, pg_walfile_name_offset(lsn) from
> pg_get_wal_records_info('0/500', '0/5FF') order by lsn asc
> limit 1;
> lsn|pg_walfile_name_offset
> ---+---
>  0/538 | (00010005,56)
> (1 row)
>
> postgres=# select lsn, pg_walfile_name_offset(lsn) from
> pg_get_wal_records_info('0/500', '0/5FF') order by lsn desc
> limit 1;
> lsn|   pg_walfile_name_offset
> ---+-
>  0/5C0 | (00010005,16777152)
> (1 row)
>

The workaround you are suggesting is not very user-friendly, and FYI,
pg_get_wal_records_info simply hangs at times when we specify the
upper and lower LSN limits of a WAL file.

To make things easier for the end users, I would suggest we add a
function that can return the first and last valid LSNs in a WAL file.
The output of this function can be used to inspect the WAL records in
the entire WAL file if they wish to do so, and I am sure they will. So,
it should be something like this:

select first_valid_lsn, last_valid_lsn from
pg_get_first_last_valid_wal_record('wal-segment-name');

And above function can directly be used with pg_get_wal_records_info() like

select 
pg_get_wal_records_info(pg_get_first_last_valid_wal_record('wal-segment'));

I think this is a pretty basic ask that users would expect to be
present in a module like this.

> > +PG_FUNCTION_INFO_V1(pg_get_raw_wal_record);
> > +PG_FUNCTION_INFO_V1(pg_get_first_valid_wal_record_lsn);
> > +PG_FUNCTION_INFO_V1(pg_verify_raw_wal_record);
> > +PG_FUNCTION_INFO_V1(pg_get_wal_record_info);
> > +PG_FUNCTION_INFO_V1(pg_get_wal_records_info);
> >
> > I think we should allow all these functions to be executed in wait and
> > *nowait* mode. If a user specifies nowait mode, the function should
> > return if no WAL data is present, rather than waiting for new WAL data
> > to become available, default behaviour could be anything you like.
>
> Currently, pg_walinspect uses read_local_xlog_page which waits in the
> while(1) loop if a future LSN is specified. As read_local_xlog_page is
> an implementation of XLogPageReadCB, which doesn't have a wait/nowait
> parameter, if we really need a wait/nowait mode behaviour, we need to
> do extra things(either add a backend-level global wait variable, set
> before XLogReadRecord, if set, read_local_xlog_page can just exit
> without waiting and reset after the XLogReadRecord or add an extra
> bool wait variable to XLogReaderState and use it in
> read_local_xlog_page).
>

I am not asking for any changes in the backend code. Please check how
pg_waldump does this when a user requests it to stop once the endptr
has been reached. If not for all functions, we can do this for at
least a few functions if it is doable.

>
> > +Datum
> > +pg_get_wal_records_info(PG_FUNCTION_ARGS)
> > +{
> > +#define PG_GET_WAL_RECORDS_INFO_COLS 10
> >
> >
> > We could probably have another variant of this function that would
> > work even if the end pointer is not specified, in which case the
> > default end pointer would be the last WAL record in the WAL segment.
> > Currently it mandates the use of an end pointer which slightly reduces
> > flexibility.
>
> Last WAL rec

Re: Identify missing publications from publisher while create/alter subscription.

2022-02-15 Thread Ashutosh Sharma
Thanks for working on the review comments. The changes in the new
patch look good to me.  I am marking it as ready to commit.

--
With Regards,
Ashutosh Sharma.

On Sun, Feb 13, 2022 at 7:34 PM vignesh C  wrote:
>
> On Fri, Feb 11, 2022 at 7:14 PM Ashutosh Sharma  wrote:
> >
> > I have spent little time looking at the latest patch. The patch looks
> > to be in good shape as it has already been reviewed by many people
> > here, although I did get some comments. Please take a look and let me
> > know your thoughts.
> >
> >
> > +   /* Try to connect to the publisher. */
> > +   wrconn = walrcv_connect(sub->conninfo, true, sub->name, );
> > +   if (!wrconn)
> > +   ereport(ERROR,
> > +   (errmsg("could not connect to the publisher: %s", err)));
> >
> > I think it would be good to also include the errcode
> > (ERRCODE_CONNECTION_FAILURE) here?
>
> Modified
>
> > --
> >
> > @@ -514,6 +671,8 @@ CreateSubscription(ParseState *pstate,
> > CreateSubscriptionStmt *stmt,
> >
> > PG_TRY();
> > {
> > +   check_publications(wrconn, publications, 
> > opts.validate_publication);
> > +
> >
> >
> > Instead of passing the opts.validate_publication argument to
> > check_publication function, why can't we first check if this option is
> > set or not and accordingly call check_publication function? For other
> > options I see that it has been done in the similar way for e.g. check
> > for opts.connect or opts.refresh or opts.enabled etc.
>
> Modified
>
> > --
> >
> > Above comment also applies for:
> >
> > @@ -968,6 +1130,8 @@ AlterSubscription(ParseState *pstate,
> > AlterSubscriptionStmt *stmt,
> > replaces[Anum_pg_subscription_subpublications - 1] = true;
> >
> > update_tuple = true;
> > +   connect_and_check_pubs(sub, stmt->publication,
> > +  opts.validate_publication);
> >
>
> Modified
>
> > --
> >
> > + 
> > +  When true, the command verifies if all the specified publications
> > +  that are being subscribed to are present in the publisher and 
> > throws
> > +  an error if any of the publication doesn't exist. The default is
> > +  false.
> >
> > publication -> publications (in the 4th line : throw an error if any
> > of the publication doesn't exist)
> >
> > This applies for both CREATE and ALTER subscription commands.
>
> Modified
>
> Thanks for the comments, the attached v14 patch has the changes for the same.
>
> Regard,s
> Vignesh




Re: Make mesage at end-of-recovery less scary.

2022-02-15 Thread Ashutosh Sharma
On Tue, Feb 15, 2022 at 7:52 AM Kyotaro Horiguchi
 wrote:
>
> At Mon, 14 Feb 2022 20:14:11 +0530, Ashutosh Sharma  
> wrote in
> > No, I haven't tried to compare archive recovery to PITR or vice versa,
> > instead I was trying to compare crash recovery with PITR. The message
> > you're emitting says just before entering into the archive recovery is
> > - "reached end-of-WAL on ... in pg_wal *during crash recovery*,
> > entering archive recovery".  This message is static and can be emitted
> > not only during crash recovery, but also during PITR. I think we can
>
> No. It is emitted *only* after crash recovery before starting archive
> recovery.  Another message this patch adds can be emitted after PITR
> or archive recovery.
>
> > not only during crash recovery, but also during PITR. I think we can
> > remove the "during crash recovery" part from this message, so "reached
> > the end of WAL at %X/%X on timeline %u in %s, entering archive
>
> What makes you think it can be emitted after other than crash recovery?
> (Please look at the code comment just above.)
>

Yep, that's right. We won't be coming here in the case of PITR.

> > recovery". Also I don't think we need format specifier %s here, it can
> > be hard-coded with pg_wal as in this case we can only enter archive
> > recovery after reading wal from pg_wal, so current WAL source has to
> > be pg_wal, isn't it?
>
> You're right that it can't be other than pg_wal.  It was changed just
> in accordance woth another message this patch adds and it would be a
> matter of taste.  I replaced to "pg_wal" in this version.
>

OK. I have verified the changes.

> > Thanks for the changes. Please note that I am not able to apply the
> > latest patch on HEAD. Could you please rebase it on HEAD and share the
> > new version. Thank you.
>
> A change on TAP script hit this.  The v13 attached is:
>

OK. The v13 patch looks good. I have marked it as ready to commit.
Thank you for working on all my review comments.

--
With Regards,
Ashutosh Sharma.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-02-14 Thread Ashutosh Sharma
Hi Dilip,

On Sun, Feb 13, 2022 at 12:04 PM Dilip Kumar  wrote:
>
> On Sun, Feb 13, 2022 at 10:12 AM Dilip Kumar  wrote:
> >
>
> I have done performance testing with different template DB sizes and
> different amounts of dirty shared buffers and I think as expected the
> bigger the dirty shared buffer the checkpoint approach becomes costly
> and OTOH the larger the template DB size the WAL log approach takes
> more time.
>
> I think it is very common to have larger shared buffers and of course,
> if somebody has configured such a large shared buffer then a good % of
> it will be dirty most of the time.  So IMHO in the future, the WAL log
> approach is going to be more usable in general.  However, this is just
> my opinion, and others may have completely different thoughts and
> anyhow we are keeping options for both the approaches so no worry.
>
> Next, I am planning to do some more tests, where we are having pgbench
> running and concurrently we do CREATEDB maybe every 1 minute and see
> what is the CREATEDB time as well as what is the impact on pgbench
> performance.  Because currently I have only measured CREATEDB time but
> we must be knowing the impact of createdb on the other system as well.
>
> Test setup:
> max_wal_size=64GB
> checkpoint_timeout=15min
> - CREATE base TABLE of size of Shared Buffers
> - CREATE template database and table in it of varying sizes (as per test)
> - CHECKPOINT (write out dirty buffers)
> - UPDATE 70% of tuple in base table (dirty 70% of shared buffers)
> - CREATE database using template db. (Actual test target)
>
> test1:
> 1 GB shared buffers, template DB size = 6MB, dirty shared buffer=70%
> Head: 2341.665 ms
> Patch: 85.229 ms
>
> test2:
> 1 GB shared buffers, template DB size = 1GB, dirty shared buffer=70%
> Head: 4044 ms
> Patch: 8376 ms
>
> test3:
> 8 GB shared buffers, template DB size = 1GB, dirty shared buffer=70%
> Head: 21398 ms
> Patch: 9834 ms
>
> test4:
> 8 GB shared buffers, template DB size = 10GB, dirty shared buffer=95%
> Head: 38574 ms
> Patch: 77160 ms
>
> test4:
> 32 GB shared buffers, template DB size = 10GB, dirty shared buffer=70%
> Head: 47656 ms
> Patch: 79767 ms
>

Is it possible to see the WAL size generated by these two statements:
UPDATE 70% of the tuples in the base table (dirty 70% of the shared
buffers) and CREATE DATABASE using the template DB (the actual test
target)? Just wanted to know if it can exceed the max_wal_size of
64GB. Also, is it possible to try with minimal wal_level? Sorry for
asking you this; I could try it myself, but I don't have any high-end
system to try it on.

--
With Regards,
Ashutosh Sharma.




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-02-14 Thread Ashutosh Sharma
Here are few comments:

+/*
+ * Verify the authenticity of the given raw WAL record.
+ */
+Datum
+pg_verify_raw_wal_record(PG_FUNCTION_ARGS)
+{


Do we really need this function? I see that whenever the record is
read, we verify it. So could there be a scenario where any of these
functions would return an invalid WAL record?

--

Should we add a function that returns the pointer to the first and
probably the last WAL record in the WAL segment? This would help users
to inspect the wal records in the entire wal segment if they wish to
do so.

--

+PG_FUNCTION_INFO_V1(pg_get_raw_wal_record);
+PG_FUNCTION_INFO_V1(pg_get_first_valid_wal_record_lsn);
+PG_FUNCTION_INFO_V1(pg_verify_raw_wal_record);
+PG_FUNCTION_INFO_V1(pg_get_wal_record_info);
+PG_FUNCTION_INFO_V1(pg_get_wal_records_info);

I think we should allow all these functions to be executed in wait and
*nowait* modes. If a user specifies nowait mode, the function should
return if no WAL data is present, rather than waiting for new WAL data
to become available; the default behaviour could be anything you like.

--

+Datum
+pg_get_wal_records_info(PG_FUNCTION_ARGS)
+{
+#define PG_GET_WAL_RECORDS_INFO_COLS 10


We could probably have another variant of this function that would
work even if the end pointer is not specified, in which case the
default end pointer would be the last WAL record in the WAL segment.
Currently it mandates the use of an end pointer which slightly reduces
flexibility.

--

+
+/*
+ * Get the first valid raw WAL record lsn.
+ */
+Datum
+pg_get_first_valid_wal_record_lsn(PG_FUNCTION_ARGS)


I think this function should return a pointer to the nearest valid WAL
record, which can be either the WAL record preceding the LSN entered by
the user or the one following it. If a user unknowingly enters an LSN
that does not exist, then in such cases we should probably return the
LSN of the previous WAL record instead of hanging or waiting for a new
WAL record to arrive.

--

Another important point I would like to mention here is: have we made
an attempt to share as much code as possible with pg_waldump, so that
any future changes in pg_waldump get applied here as well? Additionally,
it would also reduce the code duplication.

I haven't yet looked into the code in detail. I will have a look at it
ASAP. Thanks.

--
With Regards,
Ashutosh Sharma.

On Sat, Feb 12, 2022 at 5:03 PM Bharath Rupireddy
 wrote:
>
> On Thu, Feb 10, 2022 at 9:55 PM Peter Geoghegan  wrote:
> >
> > On Sun, Feb 6, 2022 at 7:45 AM Robert Haas  wrote:
> > > For what it's worth, I am generally in favor of having something like
> > > this in PostgreSQL. I think it's wrong of us to continue assuming that
> > > everyone has command-line access. Even when that's true, it's not
> > > necessarily convenient. If you choose to use a relational database,
> > > you may be the sort of person who likes SQL. And if you are, you may
> > > want to have the database tell you what's going on via SQL rather than
> > > command-line tools or operating system utilities. Imagine if we didn't
> > > have pg_stat_activity and you had to get that information by running a
> > > separate binary. Would anyone like that? Why is this case any
> > > different?
> >
> > +1. An SQL interface is significantly easier to work with. Especially
> > because it can use the built-in LSN type, pg_lsn.
> >
> > I don't find the slippery slope argument convincing. There aren't that
> > many other things that are like pg_waldump, but haven't already been
> > exposed via an SQL interface. Offhand, I can't think of any.
>
> On Sat, Feb 12, 2022 at 4:03 AM Andrew Dunstan  wrote:
> >
> > Almost completely off topic, but this reminded me of an incident about
> > 30 years ago at my first gig as an SA/DBA. There was an application
> > programmer who insisted on loading a set of values from a text file into
> > a temp table (it was Ingres, anyone remember that?). Why? Because he
> > knew how to write "Select * from mytable order by mycol" but didn't know
> > how to drive the Unix sort utility at the command line. When I was
> > unable to restrain myself from smiling at this he got very angry and
> > yelled at me loudly.
> >
> > So, yes, some people do like SQL and hate the command line.
>
> Thanks a lot  for the comments. I'm looking forward to the review of
> the latest v4 patches posted at [1].
>
> [1] 
> https://www.postgresql.org/message-id/CALj2ACUS9%2B54QGPtUjk76dcYW-AMKp3hPe-U%2BpQo2-GpE4kjtA%40mail.gmail.com
>
> Regards,
> Bharath Rupireddy.




Re: Identify missing publications from publisher while create/alter subscription.

2022-02-14 Thread Ashutosh Sharma
Thanks for working on my review comments. I didn't get a chance to
check out the new changes today as I was busy reviewing some other
patches, but I'll definitely take a look at the new patch in a day or
so and let you know my feedback, if any.

--
With Regards,
Ashutosh Sharma.

On Sun, Feb 13, 2022 at 7:34 PM vignesh C  wrote:
>
> On Fri, Feb 11, 2022 at 7:14 PM Ashutosh Sharma  wrote:
> >
> > I have spent little time looking at the latest patch. The patch looks
> > to be in good shape as it has already been reviewed by many people
> > here, although I did get some comments. Please take a look and let me
> > know your thoughts.
> >
> >
> > +   /* Try to connect to the publisher. */
> > +   wrconn = walrcv_connect(sub->conninfo, true, sub->name, &err);
> > +   if (!wrconn)
> > +   ereport(ERROR,
> > +   (errmsg("could not connect to the publisher: %s", err)));
> >
> > I think it would be good to also include the errcode
> > (ERRCODE_CONNECTION_FAILURE) here?
>
> Modified
>
> > --
> >
> > @@ -514,6 +671,8 @@ CreateSubscription(ParseState *pstate,
> > CreateSubscriptionStmt *stmt,
> >
> > PG_TRY();
> > {
> > +   check_publications(wrconn, publications, 
> > opts.validate_publication);
> > +
> >
> >
> > Instead of passing the opts.validate_publication argument to
> > check_publication function, why can't we first check if this option is
> > set or not and accordingly call check_publication function? For other
> > options I see that it has been done in the similar way for e.g. check
> > for opts.connect or opts.refresh or opts.enabled etc.
>
> Modified
>
> > --
> >
> > Above comment also applies for:
> >
> > @@ -968,6 +1130,8 @@ AlterSubscription(ParseState *pstate,
> > AlterSubscriptionStmt *stmt,
> > replaces[Anum_pg_subscription_subpublications - 1] = true;
> >
> > update_tuple = true;
> > +   connect_and_check_pubs(sub, stmt->publication,
> > +  opts.validate_publication);
> >
>
> Modified
>
> > --
> >
> > + 
> > +  When true, the command verifies if all the specified publications
> > +  that are being subscribed to are present in the publisher and 
> > throws
> > +  an error if any of the publication doesn't exist. The default is
> > +  false.
> >
> > publication -> publications (in the 4th line : throw an error if any
> > of the publication doesn't exist)
> >
> > This applies for both CREATE and ALTER subscription commands.
>
> Modified
>
> Thanks for the comments, the attached v14 patch has the changes for the same.
>
> Regards,
> Vignesh




Re: Identify missing publications from publisher while create/alter subscription.

2022-02-14 Thread Ashutosh Sharma
On Sun, Feb 13, 2022 at 7:32 PM vignesh C  wrote:
>
> On Thu, Feb 10, 2022 at 3:15 PM Ashutosh Sharma  wrote:
> >
> > On Wed, Feb 9, 2022 at 11:53 PM Euler Taveira  wrote:
> > >
> > > On Wed, Feb 9, 2022, at 12:06 PM, Ashutosh Sharma wrote:
> > >
> > > Just wondering if we should also be detecting the incorrect conninfo
> > > set with ALTER SUBSCRIPTION command as well. See below:
> > >
> > > -- try creating a subscription with incorrect conninfo. the command fails.
> > > postgres=# create subscription sub1 connection 'host=localhost
> > > port=5490 dbname=postgres' publication pub1;
> > > ERROR:  could not connect to the publisher: connection to server at
> > > "localhost" (::1), port 5490 failed: Connection refused
> > > Is the server running on that host and accepting TCP/IP connections?
> > > connection to server at "localhost" (127.0.0.1), port 5490 failed:
> > > Connection refused
> > > Is the server running on that host and accepting TCP/IP connections?
> > >
> > > That's because by default 'connect' parameter is true.
> > >
> >
> > So can we use this option with the ALTER SUBSCRIPTION command. I think
> > we can't, which means if the user sets wrong conninfo using ALTER
> > SUBSCRIPTION command then we don't have the option to detect it like
> > we have in case of CREATE SUBSCRIPTION command. Since this thread is
> > trying to add the ability to identify the wrong/missing publication
> > name specified with the ALTER SUBSCRIPTION command, can't we do the
> > same for the wrong conninfo?
>
> I felt this can be extended once this feature is committed. Thoughts?
>

I think that should be okay. I just wanted to bring it up with you
all in case it could be taken care of in this patch itself, but it's
ok if we address it later.

--
With Regards,
Ashutosh Sharma.




Re: Make mesage at end-of-recovery less scary.

2022-02-14 Thread Ashutosh Sharma
Hi,

On Thu, Feb 10, 2022 at 11:47 AM Kyotaro Horiguchi
 wrote:
>
> At Wed, 9 Feb 2022 17:31:02 +0530, Ashutosh Sharma  
> wrote in
> > On Wed, Feb 9, 2022 at 1:14 PM Kyotaro Horiguchi
> >  wrote:
> > > This means archive-recovery is requested but not started yet. That is,
> > > we've just finished crash recovery.  The existing comment cited
> > > together is mentioning that.
> > >
> > > At the end of PITR (or archive recovery), the other code works.
> > >
> >
> > This is quite understandable, the point here is that the message that
> > we are emitting says, we have just finished reading the wal files in
> > the pg_wal directory during crash recovery and are now entering
> > archive recovery when we are actually doing point-in-time recovery
> > which seems a bit misleading.
>
> Here is the messages.
>
> > 2022-02-08 18:00:44.367 IST [86185] LOG:  starting point-in-time
> > recovery to WAL location (LSN) "0/5227790"
> > 2022-02-08 18:00:44.368 IST [86185] LOG:  database system was not
> > properly shut down; automatic recovery in progress
> > 2022-02-08 18:00:44.369 IST [86185] LOG:  redo starts at 0/14DC8D8
> > 2022-02-08 18:00:44.978 IST [86185] DEBUG1:  reached end of WAL at
> > 0/3D0 on timeline 1 in pg_wal during crash recovery, entering
> > archive recovery
>
> In the first place the last DEBUG1 is not on my part, but one of the
> messages added by this patch says the same thing.  Is your point that
> archive recovery is different thing from PITR?  In regard to the
> difference, I think PITR is a form of archive recovery.
>

No, I wasn't trying to compare archive recovery with PITR; rather, I
was comparing crash recovery with PITR. The message you emit just
before entering archive recovery is - "reached end-of-WAL on ... in
pg_wal *during crash recovery*, entering archive recovery". This
message is static and can be emitted not only during crash recovery
but also during PITR. I think we can remove the "during crash
recovery" part from this message, leaving "reached the end of WAL at
%X/%X on timeline %u in %s, entering archive recovery". Also, I don't
think we need the format specifier %s here; it can be hard-coded as
pg_wal, since in this case we can only enter archive recovery after
reading WAL from pg_wal, so the current WAL source has to be pg_wal,
isn't it?

> That being said, after some thoughts on this, I changed my mind that
> we don't need to say what operation was being performed at the
> end-of-WAL.  So in the attached the end-of-WAL message is not
> accompanied by the kind of recovery.
>
> > LOG:  reached end of WAL at 0/300 on timeline 1
>
> I removed the archive-source part along with the operation mode.
> Because it make the message untranslatable.  It is now very simple but
> seems enough.
>
> While working on this, I noticed that we need to set EndOfWAL when
> WaitForWALToBecomeAvailable returned with failure.  That means the
> file does not exist at all so it is a kind of end-of-WAL.  In that
> sense the following existing comment in ReadRecord is a bit wrong.
>
> >* We only end up here without a message when XLogPageRead()
> >* failed - in that case we already logged something. In
> >* StandbyMode that only happens if we have been triggered, so we
> >* shouldn't loop anymore in that case.
>
> Actually there's a case we get there without a message and without
> logged something when a segment file is not found unless we're in
> standby mode.
>
> > > Well. I guess that the "automatic recovery" is ambiguous.  Does it
> > > make sense if the second line were like the follows instead?
> > >
> > > + 2022-02-08 18:00:44.368 IST [86185] LOG:  database system was not 
> > > properly shut down; crash recovery in progress
> > >
> >
> > Well, according to me the current message looks fine.
>
> Good to hear. (In the previos version I modified the message by accident..)
>
> > $chkptfile is declared twice in the same scope. We can probably remove
> > the first one.
>
> Ugh.. Fixed.  (I wonder why Perl doesn't complain on this..)
>
>
> In this version 12 I made the following changes.
>
> - Rewrote (halfly reverted) a comment in ReadRecord
>
> - Simplified the "reached end of WAL" message by removing recovery
>   mode and WAL source in ReadRecord.
>
> - XLogPageRead sets EndOfWAL flag in the ENOENT case.
>
> - Removed redundant declaration of the same variable in TAP script.
>

Thanks for the changes. Please note that I am not able to apply the
latest patch on HEAD. Could you please rebase it on HEAD and share the
new version? Thank you.

--
With Regards,
Ashutosh Sharma.




Re: Identify missing publications from publisher while create/alter subscription.

2022-02-11 Thread Ashutosh Sharma
I have spent a little time looking at the latest patch. The patch
looks to be in good shape as it has already been reviewed by many
people here, although I do have some comments. Please take a look and
let me know your thoughts.


+   /* Try to connect to the publisher. */
+   wrconn = walrcv_connect(sub->conninfo, true, sub->name, &err);
+   if (!wrconn)
+   ereport(ERROR,
+   (errmsg("could not connect to the publisher: %s", err)));

I think it would be good to also include the errcode
(ERRCODE_CONNECTION_FAILURE) here?

--

@@ -514,6 +671,8 @@ CreateSubscription(ParseState *pstate,
CreateSubscriptionStmt *stmt,

PG_TRY();
{
+   check_publications(wrconn, publications, opts.validate_publication);
+


Instead of passing the opts.validate_publication argument to
check_publication function, why can't we first check if this option is
set or not and accordingly call check_publication function? For other
options I see that it has been done in the similar way for e.g. check
for opts.connect or opts.refresh or opts.enabled etc.

--

Above comment also applies for:

@@ -968,6 +1130,8 @@ AlterSubscription(ParseState *pstate,
AlterSubscriptionStmt *stmt,
replaces[Anum_pg_subscription_subpublications - 1] = true;

update_tuple = true;
+   connect_and_check_pubs(sub, stmt->publication,
+  opts.validate_publication);


--

+ 
+  When true, the command verifies if all the specified publications
+  that are being subscribed to are present in the publisher and throws
+  an error if any of the publication doesn't exist. The default is
+  false.

publication -> publications (in the 4th line : throw an error if any
of the publication doesn't exist)

This applies for both CREATE and ALTER subscription commands.

--
With Regards,
Ashutosh Sharma.

On Sat, Nov 13, 2021 at 12:50 PM vignesh C  wrote:
>
> On Wed, Nov 10, 2021 at 11:16 AM Bharath Rupireddy
>  wrote:
> >
> > On Tue, Nov 9, 2021 at 9:27 PM vignesh C  wrote:
> > > Attached v12 version is rebased on top of Head.
> >
> > Thanks for the patch. Here are some comments on v12:
> >
> > 1) I think ERRCODE_TOO_MANY_ARGUMENTS isn't the right error code, the
> > ERRCODE_UNDEFINED_OBJECT is more meaningful. Please change.
> > + ereport(ERROR,
> > + (errcode(ERRCODE_TOO_MANY_ARGUMENTS),
> > + errmsg_plural("publication %s does not exist in the publisher",
> > +"publications %s do not exist in the publisher",
> >
> > The existing code using
> > ereport(ERROR,
> > (errcode(ERRCODE_UNDEFINED_OBJECT),
> >  errmsg("subscription \"%s\" does not exist", subname)));
>
> Modified
>
> > 2) Typo: It is "One of the specified publications exists."
> > +# One of the specified publication exist.
>
> Modified
>
> > 3) I think we can remove the test case "+# Specified publication does
> > not exist." because the "+# One of the specified publication exist."
> > covers the code.
>
> Modified
>
> > 4) Do we need the below test case? Even with refresh = false, it does
> > call connect_and_check_pubs() right? Please remove it.
> > +# Specified publication does not exist with refresh = false.
> > +($ret, $stdout, $stderr) = $node_subscriber->psql('postgres',
> > +   "ALTER SUBSCRIPTION mysub1 SET PUBLICATION non_existent_pub
> > WITH (REFRESH = FALSE, VALIDATE_PUBLICATION = TRUE)"
> > +);
> > +ok( $stderr =~
> > + m/ERROR:  publication "non_existent_pub" does not exist in
> > the publisher/,
> > +   "Alter subscription for non existent publication fails");
> > +
>
> Modified
>
> > 5) Change the test case names to different ones instead of the same.
> > Have something like:
> > "Create subscription fails with single non-existent publication");
> > "Create subscription fails with multiple non-existent publications");
> > "Create subscription fails with mutually exclusive options");
> > "Alter subscription add publication fails with non-existent publication");
> > "Alter subscription set publication fails with non-existent publication");
> > "Alter subscription set publication fails with connection to a
> > non-existent database");
>
> Modified
>
> Thanks for the comments, the attached v13 patch has the fixes for the same.
>
> Regards,
> Vignesh




Re: Identify missing publications from publisher while create/alter subscription.

2022-02-10 Thread Ashutosh Sharma
On Wed, Feb 9, 2022 at 11:53 PM Euler Taveira  wrote:
>
> On Wed, Feb 9, 2022, at 12:06 PM, Ashutosh Sharma wrote:
>
> Just wondering if we should also be detecting the incorrect conninfo
> set with ALTER SUBSCRIPTION command as well. See below:
>
> -- try creating a subscription with incorrect conninfo. the command fails.
> postgres=# create subscription sub1 connection 'host=localhost
> port=5490 dbname=postgres' publication pub1;
> ERROR:  could not connect to the publisher: connection to server at
> "localhost" (::1), port 5490 failed: Connection refused
> Is the server running on that host and accepting TCP/IP connections?
> connection to server at "localhost" (127.0.0.1), port 5490 failed:
> Connection refused
> Is the server running on that host and accepting TCP/IP connections?
>
> That's because by default 'connect' parameter is true.
>

So can we use this option with the ALTER SUBSCRIPTION command? I
think we can't, which means that if the user sets a wrong conninfo
using the ALTER SUBSCRIPTION command, we don't have the option to
detect it as we do in the case of the CREATE SUBSCRIPTION command.
Since this thread is trying to add the ability to identify a
wrong/missing publication name specified with the ALTER SUBSCRIPTION
command, can't we do the same for a wrong conninfo?

--
With Regards,
Ashutosh Sharma.




Re: Identify missing publications from publisher while create/alter subscription.

2022-02-09 Thread Ashutosh Sharma
Just wondering if we should also be detecting an incorrect conninfo
set with the ALTER SUBSCRIPTION command. See below:

-- try creating a subscription with incorrect conninfo. the command fails.
postgres=# create subscription sub1 connection 'host=localhost
port=5490 dbname=postgres' publication pub1;
ERROR:  could not connect to the publisher: connection to server at
"localhost" (::1), port 5490 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
connection to server at "localhost" (127.0.0.1), port 5490 failed:
Connection refused
Is the server running on that host and accepting TCP/IP connections?
postgres=#
postgres=#

-- this time the conninfo is correct and the command succeeded.
postgres=# create subscription sub1 connection 'host=localhost
port=5432 dbname=postgres' publication pub1;
NOTICE:  created replication slot "sub1" on publisher
CREATE SUBSCRIPTION
postgres=#
postgres=#

-- reset the conninfo in the subscription to some wrong value. the
command succeeds.
postgres=# alter subscription sub1 connection 'host=localhost
port=5490 dbname=postgres';
ALTER SUBSCRIPTION
postgres=#

postgres=# drop subscription sub1;
ERROR:  could not connect to publisher when attempting to drop
replication slot "sub1": connection to server at "localhost" (::1),
port 5490 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
connection to server at "localhost" (127.0.0.1), port 5490 failed:
Connection refused
Is the server running on that host and accepting TCP/IP connections?
HINT:  Use ALTER SUBSCRIPTION ... SET (slot_name = NONE) to
disassociate the subscription from the slot.

==

When creating a subscription we do connect to the publisher node hence
the incorrect connection info gets detected. But that's not the case
with alter subscription.

--
With Regards,
Ashutosh Sharma.

On Sat, Nov 13, 2021 at 6:27 PM Bharath Rupireddy
 wrote:
>
> On Sat, Nov 13, 2021 at 12:50 PM vignesh C  wrote:
> >
> > Thanks for the comments, the attached v13 patch has the fixes for the same.
>
> Thanks for the updated v13 patch. I have no further comments, it looks
> good to me.
>
> Regards,
> Bharath Rupireddy.
>
>




Re: Make mesage at end-of-recovery less scary.

2022-02-09 Thread Ashutosh Sharma
On Wed, Feb 9, 2022 at 1:14 PM Kyotaro Horiguchi
 wrote:
>
> Hi, Ashutosh.
>
> At Tue, 8 Feb 2022 18:35:34 +0530, Ashutosh Sharma  
> wrote in
> > Here are some of my review comments on the v11 patch:
>
> Thank you for taking a look on this.
>
> > -   (errmsg_internal("reached end of WAL in
> > pg_wal, entering archive recovery")));
> > +   (errmsg_internal("reached end of WAL at %X/%X
> > on timeline %u in %s during crash recovery, entering archive
> > recovery",
> > +LSN_FORMAT_ARGS(ErrRecPtr),
> > +replayTLI,
> > +xlogSourceNames[currentSource])));
> >
> > Why crash recovery? Won't this message get printed even during PITR?
>
> It is in the if-block with the following condition.
>
> >* If archive recovery was requested, but we were still doing
> >* crash recovery, switch to archive recovery and retry using the
> >* offline archive. We have now replayed all the valid WAL in
> >* pg_wal, so we are presumably now consistent.
> ...
> >if (!InArchiveRecovery && ArchiveRecoveryRequested)
>
> This means archive-recovery is requested but not started yet. That is,
> we've just finished crash recovery.  The existing comment cited
> together is mentioning that.
>
> At the end of PITR (or archive recovery), the other code works.
>

This is quite understandable; the point here is that the message we
are emitting says we have just finished reading the WAL files in the
pg_wal directory during crash recovery and are now entering archive
recovery, when we are actually doing point-in-time recovery, which
seems a bit misleading.

> > /*
> >  * If we haven't emit an error message, we have safely reached the
> >  * end-of-WAL.
> >  */
> > if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
> > {
> >   char   *fmt;
> >
> >   if (StandbyMode)
> >   fmt = gettext_noop("reached end of WAL at %X/%X on timeline 
> > %u in %s during standby mode");
> >   else if (InArchiveRecovery)
> >   fmt = gettext_noop("reached end of WAL at %X/%X on timeline 
> > %u in %s during archive recovery");
> >   else
> >   fmt = gettext_noop("reached end of WAL at %X/%X on timeline 
> > %u in %s during crash recovery");
>
> The last among the above messages is choosed when archive-recovery is
> not requested at all.
>
> > I just did a PITR and could see these messages in the logfile.
>
> Yeah, the log lines are describing that the server starting with crash
> recovery to run PITR.
>
> > 2022-02-08 18:00:44.367 IST [86185] LOG:  starting point-in-time
> > recovery to WAL location (LSN) "0/5227790"
> > 2022-02-08 18:00:44.368 IST [86185] LOG:  database system was not
> > properly shut down; automatic recovery in progress
>
> Well. I guess that the "automatic recovery" is ambiguous.  Does it
> make sense if the second line were like the follows instead?
>
> + 2022-02-08 18:00:44.368 IST [86185] LOG:  database system was not properly 
> shut down; crash recovery in progress
>

Well, the current message looks fine to me.

> > Lastly, are we also planning to backport this patch?
>
> This is apparent a behavioral change, not a bug fix, which I think we
> regard as not appropriate for back-patching.
>
>
> As the result, I made the following chages in the version 11.
>
> 1. Changed the condition for the "end-of-WAL" message from
>emode_for_corrupt_record to the EndOfWAL flag.
>
> 2. Corrected the wording of end-of-wal to end-of-WAL.
>
> 3. In the "reached end of WAL" message, report the LSN of the
>   beginning of failed record instead of the beginning of the
>   last-succeeded record.
>
> 4. In the changed message in walreceiver.c, I swapped LSN and timeline
>   so that they are in the same order with other similar messages.
>

Thanks for sharing this information.

==

Here is one more comment:

+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+my $chkptfile;
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;

$chkptfile is declared twice in the same scope. We can probably remove
the first one.

--
With Regards,
Ashutosh Sharma.




Re: Synchronizing slots from primary to standby

2022-02-08 Thread Ashutosh Sharma
On Tue, Feb 8, 2022 at 2:02 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-02-07 13:38:38 +0530, Ashutosh Sharma wrote:
> > Are you talking about this scenario - what if the logical replication
> > slot on the publisher is dropped, but is being referenced by the
> > standby where the slot is synchronized?
>
> It's a bit hard to say, because neither in this thread nor in the patch I've
> found a clear description of what the syncing needs to & tries to
> guarantee. It might be that that was discussed in one of the precursor
> threads, but...
>
> Generally I don't think we can permit scenarios where a slot can be in a
> "corrupt" state, i.e. missing required catalog entries, after "normal"
> administrative commands (i.e. not mucking around in catalog entries / on-disk
> files). Even if the sequence of commands may be a bit weird. All such cases
> need to be either prevented or detected.
>
>
> As far as I can tell, the way this patch keeps slots on physical replicas
> "valid" is solely by reorderbuffer.c blocking during replay via
> wait_for_standby_confirmation().
>
> Which means that if e.g. the standby_slot_names GUC differs from
> synchronize_slot_names on the physical replica, the slots synchronized on the
> physical replica are not going to be valid.  Or if the primary drops its
> logical slots.
>
>
> > Should the redo function for the drop replication slot have the capability
> > to drop it on standby and its subscribers (if any) as well?
>
> Slots are not WAL logged (and shouldn't be).
>
> I think you pretty much need the recovery conflict handling infrastructure I
> referenced upthread, which recognized during replay if a record has a conflict
> with a slot on a standby.  And then ontop of that you can build something like
> this patch.
>

OK. Understood, thanks Andres.

--
With Regards,
Ashutosh Sharma.




Re: Make mesage at end-of-recovery less scary.

2022-02-08 Thread Ashutosh Sharma
Hi,

Here are some of my review comments on the v11 patch:

-   (errmsg_internal("reached end of WAL in
pg_wal, entering archive recovery")));
+   (errmsg_internal("reached end of WAL at %X/%X
on timeline %u in %s during crash recovery, entering archive
recovery",
+LSN_FORMAT_ARGS(ErrRecPtr),
+replayTLI,
+xlogSourceNames[currentSource])));

Why crash recovery? Won't this message get printed even during PITR?

I just did a PITR and could see these messages in the logfile.

2022-02-08 18:00:44.367 IST [86185] LOG:  starting point-in-time
recovery to WAL location (LSN) "0/5227790"
2022-02-08 18:00:44.368 IST [86185] LOG:  database system was not
properly shut down; automatic recovery in progress
2022-02-08 18:00:44.369 IST [86185] LOG:  redo starts at 0/14DC8D8
2022-02-08 18:00:44.978 IST [86185] DEBUG1:  reached end of WAL at
0/3D0 on timeline 1 in pg_wal during crash recovery, entering
archive recovery

==

+   /*
+* If we haven't emit an error message, we have safely reached the
+* end-of-WAL.
+*/
+   if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+   {
+   char   *fmt;
+
+   if (StandbyMode)
+   fmt = gettext_noop("reached end of WAL at %X/%X on
timeline %u in %s during standby mode");
+   else if (InArchiveRecovery)
+   fmt = gettext_noop("reached end of WAL at %X/%X on
timeline %u in %s during archive recovery");
+   else
+   fmt = gettext_noop("reached end of WAL at %X/%X on
timeline %u in %s during crash recovery");
+
+   ereport(LOG,
+   (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+   xlogSourceNames[currentSource]),
+(errormsg ? errdetail_internal("%s", errormsg) : 0)));
+   }

Doesn't it make sense to add an assert statement inside this if-block
that will check for xlogreader->EndOfWAL?

==

-* We only end up here without a message when XLogPageRead()
-* failed - in that case we already logged something. In
-* StandbyMode that only happens if we have been triggered, so we
-* shouldn't loop anymore in that case.
+* If we get here for other than end-of-wal, emit the error
+* message right now. Otherwise the message if any is shown as a
+* part of the end-of-WAL message below.
 */

For consistency, I think we can replace "end-of-wal" with
"end-of-WAL". Please note that everywhere else in the comments you
have used "end-of-WAL". So why not the same here?

==

ereport(LOG,
-   (errmsg("replication terminated by
primary server"),
-errdetail("End of WAL reached on
timeline %u at %X/%X.",
-  startpointTLI,
-
LSN_FORMAT_ARGS(LogstreamResult.Write;
+   (errmsg("replication terminated by
primary server on timeline %u at %X/%X.",
+   startpointTLI,
+
LSN_FORMAT_ARGS(LogstreamResult.Write;

Is this change really required? I don't see any issue with the
existing error message.

==

Lastly, are we also planning to backport this patch?

--
With Regards,
Ashutosh Sharma.

On Wed, Feb 2, 2022 at 11:05 AM Kyotaro Horiguchi
 wrote:
>
> At Tue, 1 Feb 2022 12:38:01 +0400, Pavel Borisov  
> wrote in
> > Maybe it can be written little bit shorter:
> > pe = (char *) record + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
> > as
> > pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
> > ?
>
> That difference would be a matter of taste, but I found it looks
> cleaner that definition and assignment is separated for both p and pe.
> Now it is like the following.
>
> >   char   *p;
> >   char   *pe;
> >
> >   /* scan from the beginning of the record to the end of block */
> >   p  = (char *) record;
> >   pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
>
>
> > The problem that pgindent sometimes reflow formatting of unrelated blocks
> > is indeed existing. But I think it's right to manually leave pgindent-ed
> > code only on what is related to the patch. The leftover is pgindent-ed in a
> > scheduled manner sometimes, so don't need to bother.
>
> Yeah, I meant that it is a bit annoying to unpginden-ting unrelated
> edits:p
>
> > I'd like to set v10 as RfC.
>
> Thanks!  The suggested change is done in the attached v11.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center




Re: Synchronizing slots from primary to standby

2022-02-07 Thread Ashutosh Sharma
Hi Andres,

Are you talking about this scenario - what if the logical replication
slot on the publisher is dropped, but is being referenced by the
standby where the slot is synchronized? Should the redo function for
the drop replication slot have the capability to drop it on standby
and its subscribers (if any) as well?

--
With Regards,
Ashutosh Sharma.

On Sun, Feb 6, 2022 at 1:29 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-01-03 14:46:52 +0100, Peter Eisentraut wrote:
> > From ec00dc6ab8bafefc00e9b1c78ac9348b643b8a87 Mon Sep 17 00:00:00 2001
> > From: Peter Eisentraut 
> > Date: Mon, 3 Jan 2022 14:43:36 +0100
> > Subject: [PATCH v3] Synchronize logical replication slots from primary to
> >  standby
>
> I've just skimmed the patch and the related threads. As far as I can tell this
> cannot be safely used without the conflict handling in [1], is that correct?
>
>
> Greetings,
>
> Andres Freund
>
> [1] 
> https://postgr.es/m/CA%2BTgmoZd-JqNL1-R3RJ0jQRD%2B-dc94X0nPJgh%2BdwdDF0rFuE3g%40mail.gmail.com
>
>




Re: Synchronizing slots from primary to standby

2022-02-04 Thread Ashutosh Sharma
On Sat, Jan 22, 2022 at 4:33 AM Hsu, John  wrote:
>
>> I might be missing something but isn’t it okay even if the new primary
>> server is behind the subscribers? IOW, even if two slot's LSNs (i.e.,
>> restart_lsn and confirm_flush_lsn) are behind the subscriber's remote
>> LSN (i.e., pg_replication_origin.remote_lsn), the primary sends only
>> transactions that were committed after the remote_lsn. So the
>> subscriber can resume logical replication with the new primary without
>> any data loss.
>
> Maybe I'm misreading, but I thought the purpose of this to make
> sure that the logical subscriber does not have data that has not been
> replicated to the new primary. The use-case I can think of would be
> if synchronous_commit were enabled and fail-over occurs. If
> we didn't have this set, isn't it possible that this logical subscriber
> has extra commits that aren't present on the newly promoted primary?
>

This is very much possible if the new primary used to be an
asynchronous standby. But it seems like the current patch is trying to
hold back logical replication until the data has been replicated to
the physical standby when synchronous_slot_names is set. This will
ensure that the logical subscriber is never ahead of the new primary.
However, AFAIU that's not the primary use-case of this patch; instead,
it is to ensure that the logical subscribers continue getting data
from the new primary when a failover occurs.

>
> If a logical slot was dropped on the writer, should the worker drop 
> logical
> slots that it was previously synchronizing but are no longer present? Or
> should we leave that to the user to manage? I'm trying to think why users
> would want to sync logical slots to a reader but not have that be dropped
> as well if it's no longer present.
>

AFAIU this should be taken care of by the background worker used to
synchronize the replication slot.

--
With Regards,
Ashutosh Sharma.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-12-22 Thread Ashutosh Sharma
On Wed, Dec 22, 2021 at 5:06 PM Dilip Kumar  wrote:
>
> On Wed, Dec 22, 2021 at 4:26 PM Ashutosh Sharma  wrote:
>
> >> Basically, ALTER TABLE SET TABLESPACE, will register the
> >> SYNC_UNLINK_REQUEST for the table files w.r.t the old tablespace, but
> >> those will get unlinked during the next checkpoint.  Although the
> >> files must be truncated during commit itself but unlink might not have
> >> been processed until the next checkpoint.  This is the explanation for
> >> the behavior you found during your investigation, but I haven't looked
> >> into the issue so I will do it latest by tomorrow and send my
> >> analysis.
> >>
> >> Thanks for working on this.
> >
> >
> > Yeah the problem here is that the old rel file that needs to be unlinked 
> > still exists in the old tablespace. Earlier, without your changes we were 
> > doing force checkpoint before starting with the actual work for the alter 
> > database which unlinked/deleted the rel file from the old tablespace, but 
> > that is not the case here.  Now we have removed the force checkpoint from 
> > movedb() which means until the auto checkpoint happens the old rel file 
> > will remain in the old tablespace thereby creating this problem.
>
> One solution to this problem could be that, similar to mdpostckpt(),
> we invent one more function which takes dboid and dsttblspc oid as
> input and it will unlink all the requests which are w.r.t. the dboid
> and tablespaceoid, and before doing it we should also do
> ForgetDatabaseSyncRequests(), so that next checkpoint does not flush
> some old request.

I couldn't find the mdpostckpt() function. Are you talking about
SyncPostCheckpoint()? Anyway, as you have rightly said, we need to
unlink all the files available inside the dst_tablespaceoid/dst_dboid/
directory by scanning the pendingUnlinks list. And finally, since we
don't want the next checkpoint to unlink this file again and PANIC, we
have to update the entry for this unlinked rel file in the hash table,
i.e. cancel the sync request for it.

--
With Regards,
Ashutosh Sharma.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-12-22 Thread Ashutosh Sharma
On Wed, Dec 22, 2021 at 2:44 PM Dilip Kumar  wrote:

> On Tue, Dec 21, 2021 at 11:10 AM Ashutosh Sharma 
> wrote:
> >
> > I am getting the below error when running the same test-case that Neha
> shared in her previous email.
> >
> > ERROR:  55000: some relations of database "test1" are already in
> tablespace "tab1"
> > HINT:  You must move them back to the database's default tablespace
> before using this command.
> > LOCATION:  movedb, dbcommands.c:1555
> >
> > test-case:
> > 
> > create tablespace tab1 location '/home/ashu/test1';
> > create tablespace tab location '/home/ashu/test';
> >
> > create database test tablespace tab;
> > \c test
> >
> > create table t(a int primary key, b text);
> >
> > create or replace function large_val() returns text language sql as
> 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
> >
> > insert into t values (generate_series(1,10), large_val());
> >
> > alter table t set tablespace tab1 ;
> >
> > \c postgres
> > create database test1 template test;
> >
> > \c test1
> > alter table t set tablespace tab;
> >
> > \c postgres
> > alter database test1 set tablespace tab1; -- this fails with  the given
> error.
> >
> > Observations:
> > ===
> > Please note that before running above alter database statement, the
> table 't'  is moved to tablespace 'tab' from 'tab1' so not sure why
> ReadDir() is returning true when searching for table 't' in tablespace
> 'tab1'. It should have returned NULL here:
> >
> >  while ((xlde = ReadDir(dstdir, dst_dbpath)) != NULL)
> > {
> > if (strcmp(xlde->d_name, ".") == 0 ||
> > strcmp(xlde->d_name, "..") == 0)
> > continue;
> >
> > ereport(ERROR,
> > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >  errmsg("some relations of database \"%s\" are
> already in tablespace \"%s\"",
> > dbname, tblspcname),
> >  errhint("You must move them back to the database's
> default tablespace before using this command.")));
> > }
> >
> > Also, if I run the checkpoint explicitly before executing the above
> alter database statement, this error doesn't appear which means it only
> happens with the new changes because earlier we were doing the force
> checkpoint at the end of createdb statement.
> >
>
> Basically, ALTER TABLE SET TABLESPACE, will register the
> SYNC_UNLINK_REQUEST for the table files w.r.t the old tablespace, but
> those will get unlinked during the next checkpoint.  Although the
> files must be truncated during commit itself but unlink might not have
> been processed until the next checkpoint.  This is the explanation for
> the behavior you found during your investigation, but I haven't looked
> into the issue so I will do it latest by tomorrow and send my
> analysis.
>
> Thanks for working on this.
>

Yeah, the problem here is that the old rel file that needs to be unlinked
still exists in the old tablespace. Earlier, without your changes, we were
doing a forced checkpoint before starting the actual work for ALTER
DATABASE, which unlinked/deleted the rel file from the old tablespace, but
that is not the case here. Now that we have removed the forced checkpoint
from movedb(), the old rel file will remain in the old tablespace until
the next automatic checkpoint happens, thereby creating this problem.

--
With Regards,
Ashutosh Sharma.


Re: Checkpointer crashes with "PANIC: could not fsync file "pg_tblspc/.."

2021-12-22 Thread Ashutosh Sharma
On Wed, Dec 22, 2021 at 7:20 AM Dilip Kumar  wrote:

> On Wed, 22 Dec 2021 at 12:28 AM, Ashutosh Sharma 
> wrote:
>
>>
>> Is it okay to share the same tablespace (infact relfile) between the
>> primary and standby server? Perhaps NO.
>>
>
>> Oops, yeah absolutely they can never share the tablespace path.  So we
> can ignore this issue.
>

That's right, I agree. Thanks!

--
With Regards,
Ashutosh Sharma.


Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-12-22 Thread Ashutosh Sharma
Hi Dilip,

On Tue, Dec 21, 2021 at 11:10 AM Ashutosh Sharma 
wrote:

> I am getting the below error when running the same test-case that Neha
> shared in her previous email.
>
> ERROR:  55000: some relations of database "test1" are already in
> tablespace "tab1"
> HINT:  You must move them back to the database's default tablespace before
> using this command.
> LOCATION:  movedb, dbcommands.c:1555
>
> test-case:
> 
> create tablespace tab1 location '/home/ashu/test1';
> create tablespace tab location '/home/ashu/test';
>
> create database test tablespace tab;
> \c test
>
> create table t(a int primary key, b text);
>
> create or replace function large_val() returns text language sql as
> 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
>
> insert into t values (generate_series(1,10), large_val());
>
> alter table t set tablespace tab1 ;
>
> \c postgres
> create database test1 template test;
>
> \c test1
> alter table t set tablespace tab;
>
> \c postgres
> alter database test1 set tablespace tab1; -- this fails with  the given
> error.
>
> Observations:
> ===
> Please note that before running above alter database statement, the table
> 't'  is moved to tablespace 'tab' from 'tab1' so not sure why ReadDir() is
> returning true when searching for table 't' in tablespace 'tab1'. It should
> have returned NULL here:
>
>  while ((xlde = ReadDir(dstdir, dst_dbpath)) != NULL)
> {
> if (strcmp(xlde->d_name, ".") == 0 ||
> strcmp(xlde->d_name, "..") == 0)
> continue;
>
> ereport(ERROR,
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>  errmsg("some relations of database \"%s\" are already
> in tablespace \"%s\"",
> dbname, tblspcname),
>  errhint("You must move them back to the database's
> default tablespace before using this command.")));
> }
>
> Also, if I run the checkpoint explicitly before executing the above alter
> database statement, this error doesn't appear which means it only happens
> with the new changes because earlier we were doing the force checkpoint at
> the end of createdb statement.
>

Is this expected? I think it is not.

--
With Regards,
Ashutosh Sharma.


Re: Checkpointer crashes with "PANIC: could not fsync file "pg_tblspc/.."

2021-12-21 Thread Ashutosh Sharma
This is happening because, in the case of the primary server, we let the
checkpointer process unlink the first segment of the rel file, but
that's not the case with the standby server. In case of standby, the
startup/recovery process unlinks the first segment of the rel file
immediately during WAL replay. Now, in this case, as the tablespace path is
shared between the primary and standby servers, when the startup process
unlinks the first segment of the rel file, it gets unlinked/deleted for the
primary server as well. So, when we run the checkpoint on the primary
server, the subsequent fsync fails with the error "No such file..", which
causes the database system to PANIC.

Please have a look at the code below in mdunlinkfork().

if (isRedo || forkNum != MAIN_FORKNUM ||
RelFileNodeBackendIsTemp(rnode))
{
if (!RelFileNodeBackendIsTemp(rnode))
{
/* Prevent other backends' fds from holding on to the disk
space */
ret = do_truncate(path);

/* Forget any pending sync requests for the first segment */
register_forget_request(rnode, forkNum, 0 /* first seg */ );
}
else
ret = 0;

/* Next unlink the file, unless it was already found to be missing
*/
if (ret == 0 || errno != ENOENT)
{
ret = unlink(path);
if (ret < 0 && errno != ENOENT)
ereport(WARNING,
(errcode_for_file_access(),
 errmsg("could not remove file \"%s\": %m", path)));
}
}
else
{
/* Prevent other backends' fds from holding on to the disk space */
ret = do_truncate(path);

/* Register request to unlink first segment later */
register_unlink_segment(rnode, forkNum, 0 /* first seg */ );
}

==

Is it okay to share the same tablespace (infact relfile) between the
primary and standby server? Perhaps NO.

--
With Regards,
Ashutosh Sharma.

On Tue, Dec 21, 2021 at 4:47 PM Dilip Kumar  wrote:

> While testing the below case with the hot standby setup (with the
> latest code), I have noticed that the checkpointer process crashed
> with the $subject error. As per my observation, we have registered the
> SYNC_REQUEST when inserting some tuple into the table, and later on
> ALTER SET TABLESPACE we have registered the SYNC_UNLINK_REQUEST, which
> looks fine so far, then I have noticed that only when the standby is
> connected the underlying table file w.r.t the old tablespace is
> already deleted.  Now, in AbsorbFsyncRequests we don't do anything for
> the SYNC_REQUEST even though we have SYNC_UNLINK_REQUEST for the same
> file, but since the underlying file is already deleted the
> checkpointer cashed while processing the SYNC_REQUEST.
>
> I have spent some time on this but could not figure out how the
> relfilenodenode file w.r.t. to the old tablespace is getting deleted
> and if I disconnect the standby then it is not getting deleted, not
> sure how walsender is playing a role in deleting the file even before
> checkpointer process the unlink request.
>
> postgres[8905]=# create tablespace tab location
> '/home/dilipkumar/work/PG/install/bin/test';
> CREATE TABLESPACE
> postgres[8905]=# create tablespace tab1 location
> '/home/dilipkumar/work/PG/install/bin/test1';
> CREATE TABLESPACE
> postgres[8905]=# create database test tablespace tab;
> CREATE DATABASE
> postgres[8905]=# \c test
> You are now connected to database "test" as user "dilipkumar".
> test[8912]=# create table t( a int PRIMARY KEY,b text);
> CREATE TABLE
> test[8912]=# insert into t values (generate_series(1,10), 'aaa');
> INSERT 0 10
> test[8912]=# alter table t set tablespace tab1 ;
> ALTER TABLE
> test[8912]=# CHECKPOINT ;
> WARNING:  57P02: terminating connection because of crash of another
> server process
>
> log shows:
> PANIC:  could not fsync file
> "pg_tblspc/16384/PG_15_202112131/16386/16387": No such file or
> directory
>
> backtrace:
> #0  0x7f2f865ff387 in raise () from /lib64/libc.so.6
> #1  0x7f2f86600a78 in abort () from /lib64/libc.so.6
> #2  0x00b13da3 in errfinish (filename=0xcf283f "sync.c", ..
> #3  0x00978dc7 in ProcessSyncRequests () at sync.c:439
> #4  0x005949d2 in CheckPointGuts (checkPointRedo=67653624,
> flags=108) at xlog.c:9590
> #5  0x005942fe in CreateCheckPoint (flags=108) at xlog.c:9318
> #6  0x008a80b7 in CheckpointerMain () at checkpointer.c:444
>
> Note: This smaller test case is derived from one of the bigger
> scenarios raised by Neha Sharma [1]
>
> [1]
> https://www.postgresql.org/message-id/CANiYTQs0E8TcB11eU0C4eNN0tUd%3DSQqsqEtL1AVZP1%3DEnD-49A%40mail.gmail.com
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>
>
>


Re: Multi-Column List Partitioning

2021-12-20 Thread Ashutosh Sharma
On Mon, Dec 20, 2021 at 7:04 PM Amit Langote 
wrote:

> Hi,
>
> On Mon, Dec 13, 2021 at 11:37 PM Ashutosh Sharma 
> wrote:
> >
> > Hi,
> >
> > Is this okay?
> >
> > postgres=# CREATE TABLE t1 (a int, b int) PARTITION BY LIST ( a, a, a );
> > CREATE TABLE
> >
> > postgres=# CREATE TABLE t1_1 PARTITION OF t1 FOR VALUES IN ((1, 2, 3),
> (4, 5, 6));
> > CREATE TABLE
> >
> > postgres=# \d t1
> >Partitioned table "public.t1"
> >  Column |  Type   | Collation | Nullable | Default
> > +-+---+--+-
> >  a  | integer |   |  |
> >  b  | integer |   |  |
> > Partition key: LIST (a, a, a)
> > Number of partitions: 1 (Use \d+ to list them.)
>
> I'd say it's not okay for a user to expect this to work sensibly, and
> I don't think it would be worthwhile to write code to point that out
> to the user if that is what you were implying.
>

OK. As you wish.

--
With Regards,
Ashutosh Sharma.


Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-12-20 Thread Ashutosh Sharma
I am getting the below error when running the same test-case that Neha
shared in her previous email.

ERROR:  55000: some relations of database "test1" are already in tablespace
"tab1"
HINT:  You must move them back to the database's default tablespace before
using this command.
LOCATION:  movedb, dbcommands.c:1555

test-case:

create tablespace tab1 location '/home/ashu/test1';
create tablespace tab location '/home/ashu/test';

create database test tablespace tab;
\c test

create table t(a int primary key, b text);

create or replace function large_val() returns text language sql as 'select
array_agg(md5(g::text))::text from generate_series(1, 256) g';

insert into t values (generate_series(1,10), large_val());

alter table t set tablespace tab1 ;

\c postgres
create database test1 template test;

\c test1
alter table t set tablespace tab;

\c postgres
alter database test1 set tablespace tab1; -- this fails with the given
error.

Observations:
===
Please note that before running the above alter database statement, the
table 't' is moved from tablespace 'tab1' to 'tab', so it is not clear why
ReadDir() is returning an entry when searching for table 't' in tablespace
'tab1'. It should have returned NULL here:

 while ((xlde = ReadDir(dstdir, dst_dbpath)) != NULL)
{
if (strcmp(xlde->d_name, ".") == 0 ||
strcmp(xlde->d_name, "..") == 0)
continue;

ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 errmsg("some relations of database \"%s\" are already
in tablespace \"%s\"",
dbname, tblspcname),
 errhint("You must move them back to the database's
default tablespace before using this command.")));
}

Also, if I run the checkpoint explicitly before executing the above alter
database statement, this error doesn't appear which means it only happens
with the new changes because earlier we were doing the force checkpoint at
the end of createdb statement.

--
With Regards,
Ashutosh Sharma.

On Thu, Dec 16, 2021 at 9:26 PM Neha Sharma 
wrote:

> Hi,
>
> While testing the v8 patches in a hot-standby setup, it was observed the
> master is crashing with the below error;
>
> 2021-12-16 19:32:47.757 +04 [101483] PANIC:  could not fsync file
> "pg_tblspc/16385/PG_15_202112111/16386/16391": No such file or directory
> 2021-12-16 19:32:48.917 +04 [101482] LOG:  checkpointer process (PID
> 101483) was terminated by signal 6: Aborted
>
> Parameters configured at master:
> wal_level = hot_standby
> max_wal_senders = 3
> hot_standby = on
> max_standby_streaming_delay= -1
> wal_consistency_checking='all'
> max_wal_size= 10GB
> checkpoint_timeout= 1d
> log_min_messages=debug1
>
> Test Case:
> create tablespace tab1 location
> '/home/edb/PGsources/postgresql/inst/bin/test1';
> create tablespace tab location
> '/home/edb/PGsources/postgresql/inst/bin/test';
> create database test tablespace tab;
> \c test
> create table t( a int PRIMARY KEY,b text);
> CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS
> 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
> insert into t values (generate_series(1,10), large_val());
> alter table t set tablespace tab1 ;
> \c postgres
> create database test1 template test;
> \c test1
> alter table t set tablespace tab;
> \c postgres
> alter database test1 set tablespace tab1;
>
> --cancel the below command
> alter database test1 set tablespace pg_default; --press ctrl+c
> \c test1
> alter table t set tablespace tab1;
>
>
> Log file attached for reference.
>
> Thanks.
> --
> Regards,
> Neha Sharma
>
>
> On Thu, Dec 16, 2021 at 4:17 PM Dilip Kumar  wrote:
>
>> On Thu, Dec 16, 2021 at 12:15 AM Bruce Momjian  wrote:
>> >
>> > On Thu, Dec  2, 2021 at 07:19:50PM +0530, Dilip Kumar wrote:
>> > From the patch:
>> >
>> > > Currently, CREATE DATABASE forces a checkpoint, then copies all the
>> files,
>> > > then forces another checkpoint. The comments in the createdb()
>> function
>> > > explain the reasons for this. The attached patch fixes this problem
>> by making
>> > > create database completely WAL logged so that we can avoid the
>> checkpoints.
>> > >
>> > > This can also be useful for supporting the TDE. For example, if we
>> need different
>> > > encryption for the source and the target database then we can not
>> re-encrypt the
>> > > page data if we copy the whole directory.  But with this patch, we
>> are copying
>> > > page by page

Re: Add sub-transaction overflow status in pg_stat_activity

2021-12-14 Thread Ashutosh Sharma
Hi,

I have looked into the v2 patch and here are my comments:

+   PG_RETURN_INT32(local_beentry->subxact_overflowed);
+}

Should this be PG_RETURN_BOOL instead of PG_RETURN_INT32??

--

+{ oid => '6107', descr => 'statistics: cached subtransaction count of
backend',
+  proname => 'pg_stat_get_backend_subxact_count', provolatile => 's',
proparallel => 'r',
+  prorettype => 'int4', proargtypes => 'int4',
+  prosrc => 'pg_stat_get_backend_subxact_count' },
+{ oid => '6108', descr => 'statistics: subtransaction cache of backend
overflowed',
+  proname => 'pg_stat_get_backend_subxact_overflow', provolatile => 's',
proparallel => 'r',
+  prorettype => 'bool', proargtypes => 'int4',
+  prosrc => 'pg_stat_get_backend_subxact_overflow' },

The description says that the two new functions show the statistics for
"cached subtransaction count of backend" and "subtransaction cache of
backend overflowed". But when these functions are called, they show these
stats for non-backend processes like the checkpointer, walwriter, etc. as
well. Should that happen?

--

typedef struct LocalPgBackendStatus
 * not.
 */
TransactionId backend_xmin;
+
+   /*
+* Number of cached subtransactions in the current session.
+*/
+   int subxact_count;
+
+   /*
+* The number of subtransactions in the current session exceeded the
cached
+* subtransaction limit.
+*/
+   bool subxact_overflowed;

All the variables inside this LocalPgBackendStatus structure are prefixed
with the word "backend". Can we do the same for the newly added variables
as well?

--

+ * Get the xid and xmin, nsubxid and overflow status of the backend.
The

Should this be put as - "xid, xmin, nsubxid and overflow" instead of "xid
and xmin, nsubxid and overflow"?

--
With Regards,
Ashutosh Sharma.


On Mon, Dec 13, 2021 at 7:58 PM Dilip Kumar  wrote:

> On Tue, Dec 7, 2021 at 11:11 AM Justin Pryzby 
> wrote:
>
> >
> > You added this to pg_stat_activity, which already has a lot of fields.
> > We talked a few months ago about not adding more fields that weren't
> commonly
> > used.
> >
> https://www.postgresql.org/message-id/flat/20210426191811.sp3o77doinphyjhu%40alap3.anarazel.de#d96d0a116f0344301eead2676ea65b2e
> >
> > Since I think this field is usually not interesting to most users of
> > pg_stat_activity, maybe this should instead be implemented as a function
> like
> > pg_backend_get_subxact_status(pid).
> >
> > People who want to could use it like:
> > SELECT * FROM pg_stat_activity psa, pg_backend_get_subxact_status(pid)
> sub;
>
> I have provided two function, one for subtransaction counts and other
> whether subtransaction cache is overflowed or not, we can use like
> this,  if we think this is better way to do it then we can also add
> another function for the lastOverflowedXid
>
> postgres[43994]=# select id, pg_stat_get_backend_pid(id) as pid,
> pg_stat_get_backend_subxact_count(id) as nsubxact,
> pg_stat_get_backend_subxact_overflow(id) as overflowed from
> pg_stat_get_backend_idset() as id;
>  id |  pid  | nsubxact | overflowed
> +---+--+
>   1 | 43806 |0 | f
>   2 | 43983 |   64 | t
>   3 | 43994 |0 | f
>   4 | 44323 |   22 | f
>   5 | 43802 |0 | f
>   6 | 43801 |0 | f
>   7 | 43804 |0 | f
> (7 rows)
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


Re: Multi-Column List Partitioning

2021-12-13 Thread Ashutosh Sharma
Hi,

Is this okay?

postgres=# CREATE TABLE t1 (a int, b int) PARTITION BY LIST ( a, a, a );
CREATE TABLE

postgres=# CREATE TABLE t1_1 PARTITION OF t1 FOR VALUES IN ((1, 2, 3), (4,
5, 6));
CREATE TABLE

postgres=# \d t1
   Partitioned table "public.t1"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 a  | integer |   |  |
 b  | integer |   |  |
Partition key: LIST (a, a, a)
Number of partitions: 1 (Use \d+ to list them.)

--

Also, getting some compiler warnings when building the source. please check.

--
With Regards,
Ashutosh Sharma.

On Mon, Dec 6, 2021 at 7:27 PM Nitin Jadhav 
wrote:

> Thank you for reviewing the patch.
>
> > partbounds.c: In function ‘get_qual_for_list.isra.18’:
> > partbounds.c:4284:29: warning: ‘boundinfo’ may be used uninitialized
> > in this function [-Wmaybe-uninitialized]
> > datumCopy(bound_info->datums[i][j],
> >   ~~^~~~
> > partbounds.c:4335:21: note: ‘boundinfo’ was declared here
> >  PartitionBoundInfo boundinfo;
> > ^
> > partbounds.c: In function ‘partition_bounds_merge’:
> > partbounds.c:1305:12: warning: ‘inner_isnull’ may be used
> > uninitialized in this function [-Wmaybe-uninitialized]
> >   bool*inner_isnull;
> >^~~~
> > partbounds.c:1304:12: warning: ‘outer_isnull’ may be used
> > uninitialized in this function [-Wmaybe-uninitialized]
> >   bool*outer_isnull;
> >   ^~~~
>
> Fixed.
>
> > This function is unnecessarily complicated, I think you can avoid
> > inner for loops; simply replace for-loop-block with  "if
> > (equal(lfirst(cell), new_bound)) return true".
>
> Thank you for the suggestion. Fixed.
>
> > + char   **colname = (char **) palloc0(partnatts * sizeof(char *));
> > + Oid*coltype = palloc0(partnatts * sizeof(Oid));
> > + int32*coltypmod = palloc0(partnatts * sizeof(int));
> > + Oid*partcollation = palloc0(partnatts * sizeof(Oid));
> > +
> > This allocation seems to be worthless, read ahead.
> >
> > I think there is no need for this separate loop inside
> > transformPartitionListBounds, you can do that same in the next loop as
> > well. And instead of  get_partition_col_* calling and storing, simply
> > use that directly as an argument to transformPartitionBoundValue().
>
> Yes. The loop can be avoided and content of the above loop can be
> included in the next loop but the next loop iterates over a list of
> multi column datums. For each iteration, we need the information of
> all the columns. The above data (colname, coltype, coltypmod and
> partcollation) remains same for each iteration of the loop, If we
> modify as suggested, then the function to fetch these information has
> to be called every-time. To avoid this situation I have made a
> separate loop outside which only runs as many number of columns and
> stores in a variable which can be reused later. Please let me correct
> if I am wrong.
>
> > I think this should be inside the "else" block after "!IsA(rowexpr,
> > RowExpr)" error and you can avoid IsA() check too.
>
> This is required to handle the situation when one partition key is
> mentioned and multiple values are provided in the partition bound
> specification.
>
> > Looks difficult to understand at first glance, how about the following:
> >
> > if (b1->isnulls != b2->isnulls)
> >return false;
> >
> > if (b1->isnulls)
> > {
> >if (b1->isnulls[i][j] != b2->isnulls[i][j])
> >return false;
> >if (b1->isnulls[i][j])
> >continue;
> > }
> >
> > See how range partitioning infinite values are handled. Also, place
> > this before the comment block that was added for the "!datumIsEqual()"
> > case.
>
> Fixed. I feel the 'continue' block is not required and hence removed it.
>
> > Nothing wrong with this but if we could have checked "dest->isnulls"
> > instead of "src->isnulls" would be much better.
>
> Here we are copying the data from 'src' to 'dest'. If there is no data
> in 'src', it is unnecessary to copy. Hence checking 'src'.
>
> > Condition "key->strategy != PARTITION_STRATEGY_LIST" seems to be
> unnecessary.
>
> Fixed.
>
> > Can't be a single loop?
>
> Yes. Fixed.
>
>
>
> On Fri, Dec 3, 2021 at 7:26 PM Amul Sul  wrote:
> >
> > Hi,
> >
> > Few comments for v7 patch, note that I haven't been through the
> > previous

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-12-12 Thread Ashutosh Sharma
+   /*
+* If the relation is from the default tablespace then we need to
+* create it in the destinations db's default tablespace.
Otherwise,
+* we need to create in the same tablespace as it is in the source
+* database.
+*/

This comment looks a bit confusing to me, especially because when we say
destination db's default tablespace, people may think of the pg_default
tablespace (at least I think so). Basically, what you are trying to say
here is: "If the relation exists in the source database's default
tablespace, then in the destination db it should also go to the
destination's default tablespace", or something like that. So why not put
it that way instead of referring to it as the default tablespace? It's
just my view; if you disagree you can ignore it.

--

+   else if (src_dboid == dst_dboid)
+   continue;
+   else
+   dstrnode.spcNode = srcrnode.spcNode;;

There is an extra semicolon here.

--
With Regards,
Ashutosh Sharma.


On Sun, Dec 12, 2021 at 1:39 PM Dilip Kumar  wrote:

> On Fri, Dec 10, 2021 at 7:39 AM Ashutosh Sharma 
> wrote:
> >>
> >> Logfile Snippet:
> >> 2021-12-09 17:49:18.110 +04 [18151] PANIC:  could not fsync file
> "base/116398/116400": No such file or directory
> >> 2021-12-09 17:49:19.105 +04 [18150] LOG:  checkpointer process (PID
> 18151) was terminated by signal 6: Aborted
> >> 2021-12-09 17:49:19.105 +04 [18150] LOG:  terminating any other active
> server processes
> >
> >
> > This is different from the issue you raised earlier. As Dilip said, we
> need to unregister sync requests for files that got successfully copied to
> the target database, but the overall alter database statement failed. We
> are doing this when the database is created successfully, but not when it
> fails.
> > Probably doing the same inside the cleanup function
> movedb_failure_callback() should fix the problem.
>
> Correct, I have done this cleanup, apart from this we have dropped the
> fsyc request in create database failure case as well and also need to
> drop buffer in error case of creatdb as well as movedb.  I have also
> fixed the other issue for which you gave the patch (a bit differently)
> basically, in case of movedb the source and destination dboid are same
> so we don't need an additional parameter and also readjusted the
> conditions to avoid nested if.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


Re: WIP: WAL prefetch (another approach)

2021-12-10 Thread Ashutosh Sharma
Hi Thomas,

I am unable to apply these new set of patches on HEAD. Can you please share
the rebased patch or if you have any work branch can you please point it
out, I will refer to it for the changes.

--
With Regards,
Ashutosh Sharma.

On Tue, Nov 23, 2021 at 3:44 PM Thomas Munro  wrote:

> On Mon, Nov 15, 2021 at 11:31 PM Daniel Gustafsson 
> wrote:
> > Could you post an updated version of the patch which is for review?
>
> Sorry for taking so long to come back; I learned some new things that
> made me want to restructure this code a bit (see below).  Here is an
> updated pair of patches that I'm currently testing.
>
> Old problems:
>
> 1.  Last time around, an infinite loop was reported in pg_waldump.  I
> believe Horiguchi-san has fixed that[1], but I'm no longer depending
> on that patch.  I thought his patch set was a good idea, but it's
> complicated and there's enough going on here already... let's consider
> that independently.
>
> This version goes back to what I had earlier, though (I hope) it is
> better about how "nonblocking" states are communicated.  In this
> version, XLogPageRead() has a way to give up part way through a record
> if it doesn't have enough data and there are queued up records that
> could be replayed right now.  In that case, we'll go back to the
> beginning of the record (and occasionally, back a WAL page) next time
> we try.  That's the cost of not maintaining intra-record decoding
> state.
>
> 2.  Last time around, we could try to allocate a crazy amount of
> memory when reading garbage past the end of the WAL.  Fixed, by
> validating first, like in master.
>
> New work:
>
> Since last time, I went away and worked on a "real" AIO version of
> this feature.  That's ongoing experimental work for a future proposal,
> but I have a working prototype and I aim to share that soon, when that
> branch is rebased to catch up with recent changes.  In that version,
> the prefetcher starts actual reads into the buffer pool, and recovery
> receives already pinned buffers attached to the stream of records it's
> replaying.
>
> That inspired a couple of refactoring changes to this non-AIO version,
> to minimise the difference and anticipate the future work better:
>
> 1.  The logic for deciding which block to start prefetching next is
> moved into a new callback function in a sort of standard form (this is
> approximately how all/most prefetching code looks in the AIO project,
> ie sequential scans, bitmap heap scan, etc).
>
> 2.  The logic for controlling how many IOs are running and deciding
> when to call the above is in a separate component.  In this non-AIO
> version, it works using a simple ring buffer of LSNs to estimate the
> number of in flight I/Os, just like before.  This part would be thrown
> away and replaced with the AIO branch's centralised "streaming read"
> mechanism which tracks I/O completions based on a stream of completion
> events from the kernel (or I/O worker processes).
>
> 3.  In this version, the prefetcher still doesn't pin buffers, for
> simplicity.  That work did force me to study places where WAL streams
> need prefetching "barriers", though, so in this patch you can
> see that it's now a little more careful than it probably needs to be.
> (It doesn't really matter much if you call posix_fadvise() on a
> non-existent file region, or the wrong file after OID wraparound and
> reuse, but it would matter if you actually read it into a buffer, and
> if an intervening record might be trying to drop something you have
> pinned).
>
> Some other changes:
>
> 1.  I dropped the GUC recovery_prefetch_fpw.  I think it was a
> possibly useful idea but it's a niche concern and not worth worrying
> about for now.
>
> 2.  I simplified the stats.  Coming up with a good running average
> system seemed like a problem for another day (the numbers before were
> hard to interpret).  The new stats are super simple counters and
> instantaneous values:
>
> postgres=# select * from pg_stat_prefetch_recovery ;
> -[ RECORD 1 ]--+--
> stats_reset| 2021-11-10 09:02:08.590217+13
> prefetch   | 13605674 <- times we called posix_fadvise()
> hit| 24185289 <- times we found pages already cached
> skip_init  | 217215   <- times we did nothing because init, not read
> skip_new   | 192347   <- times we skipped because relation too small
> skip_fpw   | 27429<- times we skipped because fpw, not read
> wal_distance   | 10648<- how far ahead in WAL bytes
> block_distance | 134  <- how far ahead in block references
> io_depth   | 50   <- fadvise() calls not yet followed by pread()
>
> I als

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-12-09 Thread Ashutosh Sharma
On Thu, Dec 9, 2021 at 7:23 PM Neha Sharma 
wrote:

> On Thu, Dec 9, 2021 at 11:12 AM Ashutosh Sharma 
> wrote:
>
>> Hi,
>>
>> The issue here is that we are trying to create a table that exists inside
>> a non-default tablespace when doing ALTER DATABASE. I think this should be
>> skipped; otherwise we will run into an error like the one shown below:
>>
>> ashu@postgres=# alter database test set tablespace pg_default;
>> ERROR:  58P02: could not create file
>> "pg_tblspc/16385/PG_15_202111301/16386/16390": File exists
>>
>
> Thanks Ashutosh for the patch, the mentioned issue has been resolved with
> the patch.
>
> But I am still able to reproduce the crash consistently on top of this
> patch + v7 patches; just the test case has been modified.
>
> create tablespace tab1 location '/test1';
> create tablespace tab location '/test';
> create database test tablespace tab;
> \c test
> create table t( a int PRIMARY KEY,b text);
> CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS
> 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
> insert into t values (generate_series(1,10), large_val());
> alter table t set tablespace tab1 ;
> \c postgres
> create database test1 template test;
> \c test1
> alter table t set tablespace tab;
> \c postgres
> alter database test1 set tablespace tab1;
>
> --Cancel the below command after few seconds
> alter database test1 set tablespace pg_default;
>
> \c test1
> alter table t set tablespace tab1;
>
>
> Logfile Snippet:
> 2021-12-09 17:49:18.110 +04 [18151] PANIC:  could not fsync file
> "base/116398/116400": No such file or directory
> 2021-12-09 17:49:19.105 +04 [18150] LOG:  checkpointer process (PID 18151)
> was terminated by signal 6: Aborted
> 2021-12-09 17:49:19.105 +04 [18150] LOG:  terminating any other active
> server processes
>

This is different from the issue you raised earlier. As Dilip said, we need
to unregister sync requests for files that were successfully copied to the
target database when the overall ALTER DATABASE statement fails. We already
do this when the database is created successfully, but not when the
operation fails. Doing the same inside the cleanup function
movedb_failure_callback() should probably fix the problem.
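The fix described here, unregistering pending sync requests from the failure callback before the partially-copied files are removed, can be modeled with a toy request table as below. This is an illustrative sketch only; the structure and function names mimic but are not PostgreSQL's, and the real code routes requests through the checkpointer.

```c
#include <assert.h>

typedef unsigned int Oid;

#define MAX_REQUESTS 16

/* Toy model of the checkpointer's pending fsync requests: each entry
 * remembers which database/tablespace the dirtied file belongs to. */
typedef struct SyncRequest
{
    Oid dbid;
    Oid tsid;
    int valid;
} SyncRequest;

static SyncRequest pending[MAX_REQUESTS];

static void
remember_sync_request(Oid dbid, Oid tsid)
{
    for (int i = 0; i < MAX_REQUESTS; i++)
        if (!pending[i].valid)
        {
            pending[i].dbid = dbid;
            pending[i].tsid = tsid;
            pending[i].valid = 1;
            return;
        }
}

/* Analogue of ForgetDatabaseSyncRequests() extended with a tablespace
 * filter: drop requests for files the cleanup is about to delete. */
static void
forget_database_sync_requests(Oid dbid, Oid tsid)
{
    for (int i = 0; i < MAX_REQUESTS; i++)
        if (pending[i].valid &&
            pending[i].dbid == dbid && pending[i].tsid == tsid)
            pending[i].valid = 0;
}

/* Sketch of the failure callback: on error or cancel, unregister the
 * sync requests for the target tablespace before removing the copied
 * files, so the checkpointer never fsyncs a path that no longer
 * exists (the PANIC seen in the log above). */
static void
movedb_failure_callback(Oid dbid, Oid dst_tsid)
{
    forget_database_sync_requests(dbid, dst_tsid);
    /* ... then rmtree() the partially-copied directory ... */
}

static int
pending_count(void)
{
    int n = 0;
    for (int i = 0; i < MAX_REQUESTS; i++)
        n += pending[i].valid;
    return n;
}
```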

--
With Regards,
Ashutosh Sharma.


Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-12-08 Thread Ashutosh Sharma
Hi,

The issue here is that we are trying to create a table that exists inside a
non-default tablespace when doing ALTER DATABASE. I think this should be
skipped; otherwise we will run into an error like the one shown below:

ashu@postgres=# alter database test set tablespace pg_default;
ERROR:  58P02: could not create file
"pg_tblspc/16385/PG_15_202111301/16386/16390": File exists

I have taken the above from Neha's test-case.

--

Attached patch fixes this. I am passing a new boolean flag named *movedb*
to CopyDatabase() so that it can skip creating tables that exist in
non-default tablespaces when doing ALTER DATABASE. Alternatively, we can
also rename the boolean flag movedb to createdb and pass its value
accordingly from movedb() or createdb(). Either way looks fine to me.
Kindly check the attached patch for the changes.

Dilip, Could you please check the attached patch and let me know if it
looks fine or not?

Neha, can you please re-run the test-cases with the attached patch.

Thanks,

--
With Regards,
Ashutosh Sharma.

On Thu, Dec 9, 2021 at 8:43 AM Neha Sharma 
wrote:

>
>
>
> On Thu, Dec 9, 2021 at 4:26 AM Greg Nancarrow  wrote:
>
>> On Thu, Dec 9, 2021 at 6:57 AM Neha Sharma 
>> wrote:
>> >
>> > While testing the v7 patches, I am observing a crash with the below
>> test case.
>> >
>> > Test case:
>> > create tablespace tab location '/test_dir';
>> > create tablespace tab1 location '/test_dir1';
>> > create database test tablespace tab;
>> > \c test
>> > create table t( a int PRIMARY KEY,b text);
>> > CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS
>> 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
>> > insert into t values (generate_series(1,200), large_val());
>> > alter table t set tablespace tab1 ;
>> > \c postgres
>> > create database test1 template test;
>> > alter database test set tablespace pg_default;
>> > alter database test set tablespace tab;
>> > \c test1
>> > alter table t set tablespace tab;
>> >
>> >  Logfile says:
>> > 2021-12-08 23:31:58.855 +04 [134252] PANIC:  could not fsync file
>> "base/16386/4152": No such file or directory
>> > 2021-12-08 23:31:59.398 +04 [134251] LOG:  checkpointer process (PID
>> 134252) was terminated by signal 6: Aborted
>> >
>>
>> I tried to reproduce the issue using your test scenario, but I needed
>> to reduce the amount of inserted data (so reduced 200 to 2)
>> due to disk space.
>> I then consistently get an error like the following:
>>
>> postgres=# alter database test set tablespace pg_default;
>> ERROR:  could not create file
>> "pg_tblspc/16385/PG_15_202111301/16386/36395": File exists
>>
>> (this only happens when the patch is used)
>>
>>
> Yes, I was also getting this, and moving further we get a crash when we
> alter the table of database test1.
> Below is the output of the test at my end.
>
> postgres=# create tablespace tab1 location
> '/home/edb/PGsources/postgresql/inst/bin/rep_test1';
> CREATE TABLESPACE
> postgres=# create tablespace tab location
> '/home/edb/PGsources/postgresql/inst/bin/rep_test';
> CREATE TABLESPACE
> postgres=# create database test tablespace tab;
> CREATE DATABASE
> postgres=# \c test
> You are now connected to database "test" as user "edb".
> test=# create table t( a int PRIMARY KEY,b text);
> CREATE TABLE
> test=# CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS
> 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
> CREATE FUNCTION
> test=# insert into t values (generate_series(1,200), large_val());
> INSERT 0 200
> test=# alter table t set tablespace tab1 ;
> ALTER TABLE
> test=# \c postgres
> You are now connected to database "postgres" as user "edb".
> postgres=# create database test1 template test;
> CREATE DATABASE
> postgres=# alter database test set tablespace pg_default;
> ERROR:  could not create file
> "pg_tblspc/16384/PG_15_202111301/16386/2016395": File exists
> postgres=# alter database test set tablespace tab;
> ALTER DATABASE
> postgres=# \c test1
> You are now connected to database "test1" as user "edb".
> test1=# alter table t set tablespace tab;
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back the
> current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
> HINT:  In a moment you should be able to reconnect to the database and
> repeat your command.
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> !?>
>
>>
>> Regards,
>> Greg Nancarrow
>> Fujitsu Australia
>>
>


skip-table-creation-for-alter-database.patch
Description: Binary data


Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-12-06 Thread Ashutosh Sharma
Thanks Robert for sharing your thoughts.

On Mon, Dec 6, 2021 at 11:16 PM Robert Haas  wrote:

> On Mon, Dec 6, 2021 at 9:23 AM Ashutosh Sharma 
> wrote:
> > One last point - If we try to clone a huge database, as expected CREATE
> DATABASE emits a lot of WALs, causing a lot of intermediate checkpoints
> which seem to be affecting the performance slightly.
>
> Yes, I think this needs to be characterized better. If you have a big
> shared buffers setting and a lot of those buffers are dirty and the
> template database is small, all of which is fairly normal, then this
> new approach should be much quicker. On the other hand, what if the
> situation is reversed? Perhaps you have a small shared buffers and not
> much of it is dirty and the template database is gigantic. Then maybe
> this new approach will be slower. But right now I think we don't know
> where the crossover point is, and I think we should try to figure that
> out.
>

Yes I think so too.


>
> So for example, imagine tests with 1GB of shared_buffers, 8GB, and
> 64GB. And template databases with sizes of whatever the default is,
> 1GB, 10GB, 100GB. Repeatedly make 75% of the pages dirty and then
> create a new database from one of the templates. And then just measure
> the performance. Maybe for large databases this approach is just
> really the pits -- and if your max_wal_size is too small, it
> definitely will be. But, I don't know, maybe with reasonable settings
> it's not that bad. Writing everything to disk twice - once to WAL and
> once to the target directory - has to be more expensive than doing it
> once. But on the other hand, it's all sequential I/O and the data
> pages don't need to be fsync'd, so perhaps the overhead is relatively
> mild. I don't know.
>

So far, I haven't found much performance overhead with a few GB of data in
the template database. There is just a bit with the default settings;
perhaps setting a higher value of max_wal_size would reduce this overhead.

--
With Regards,
Ashutosh Sharma.


Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-12-06 Thread Ashutosh Sharma
Thank you, Dilip for the quick response. I am okay with the changes done in
the v7 patch.

One last point - If we try to clone a huge database, as expected CREATE
DATABASE emits a lot of WALs, causing a lot of intermediate checkpoints
which seem to be affecting the performance slightly.

--
With Regards,
Ashutosh Sharma.

On Mon, Dec 6, 2021 at 9:59 AM Dilip Kumar  wrote:

> On Mon, Dec 6, 2021 at 9:17 AM Ashutosh Sharma 
> wrote:
> >
> > Here are a few more review comments:
>
> Thanks for reviewing it.
>
> > 1) It seems that we are not freeing the memory allocated for buf.data in
> CreateDirAndVersionFile().
>
> Yeah this was a problem in v6 but I have fixed in v7, can you check that.
> >
> > + */
> > +static void
> > +CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
> > +{
> >
> > 2) Do we need to pass dbpath here? I mean why not reconstruct it from
> dbid and tsid.
>
> Yeah, we can do that, but computing dbpath has some cost, and since the
> caller already has it, why not pass it?
>
> >
> > 3) Not sure if this point has already been discussed, Will we be able to
> recover the data when wal_level is set to minimal because the following
> condition would be false with this wal level.
> >
> > +   use_wal = XLogIsNeeded() &&
> > +   (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
> >
>
> Since we are creating a new relfilenode this is fine; refer to "Skipping
> WAL for New RelFileNode" in src/backend/access/transam/README
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-12-05 Thread Ashutosh Sharma
Here are a few more review comments:

1) It seems that we are not freeing the memory allocated for buf.data in
CreateDirAndVersionFile().

--

+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{

2) Do we need to pass dbpath here? I mean why not reconstruct it from dbid
and tsid.

--

3) Not sure if this point has already been discussed, Will we be able to
recover the data when wal_level is set to minimal because the following
condition would be false with this wal level.

+   use_wal = XLogIsNeeded() &&
+   (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);

--
With Regards,
Ashutosh Sharma.

On Mon, Dec 6, 2021 at 9:12 AM Ashutosh Sharma 
wrote:

> On Fri, Dec 3, 2021 at 8:28 PM Dilip Kumar  wrote:
>
>> On Fri, Dec 3, 2021 at 7:38 PM Ashutosh Sharma 
>> wrote:
>> >
>> > I see that this patch is reducing the database creation time by almost
>> 3-4 times provided that the template database has some user data in it.
>> However, there are a couple of points to be noted:
>>
>> Thanks a lot for looking into the patches.
>> >
>> > 1) It makes the crash recovery a bit slower than before if the crash
>> has occurred after the execution of a create database statement. Moreover,
>> if the template database size is big, it might even generate a lot of WAL
>> files which the user needs to be aware of.
>>
>> Yes it will, but that is the only correct way to do it. Currently we log
>> a WAL record that just says "copy the source directory to the destination
>> directory", without noting down exactly what we wanted to copy, so we are
>> forced to checkpoint right after CREATE DATABASE because that record
>> cannot actually be replayed during crash recovery: the source directory
>> may well have different contents at REDO time than it had at DO time,
>> which would introduce inconsistencies during crash recovery. To avoid
>> that bug, the current code forces the checkpoint; and if you force a
>> checkpoint with this patch too, crash recovery will be equally fast. So I
>> would not say that we have made crash recovery slow; rather, we have
>> removed some bugs, and with that we no longer need to force the
>> checkpoint. Also note that in the current code, even with the forced
>> checkpoint, the bug is not completely avoided in all cases; check the
>> code comments below[1].
>>
>> > 2) This will put a lot of load on the first checkpoint that will occur
>> after creating the database statement. I will experiment around this to see
>> if this has any side effects.
>>
>> But now a checkpoint can happen at its own need and there is no need
>> to force a checkpoint like it was before patch.
>>
>> So the major goals of this patch are: 1) correctly WAL-log CREATE
>> DATABASE, which is a hack in the current system; 2) avoid forced
>> checkpoints; 3) copy page by page, which will support TDE: if the source
>> and destination databases use different encryption, we can re-encrypt
>> each page before copying it to the destination, something that is not
>> possible in the current system since we copy whole directories; and
>> 4) give the new database's pages the latest LSN, which is correct;
>> previously the new pages were copied over with their old LSNs.
>>
>
> OK. Understood, thanks!
>
> --
> With Regards,
> Ashutosh Sharma.
>


Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-12-05 Thread Ashutosh Sharma
On Fri, Dec 3, 2021 at 8:28 PM Dilip Kumar  wrote:

> On Fri, Dec 3, 2021 at 7:38 PM Ashutosh Sharma 
> wrote:
> >
> > I see that this patch is reducing the database creation time by almost
> 3-4 times provided that the template database has some user data in it.
> However, there are a couple of points to be noted:
>
> Thanks a lot for looking into the patches.
> >
> > 1) It makes the crash recovery a bit slower than before if the crash has
> occurred after the execution of a create database statement. Moreover, if
> the template database size is big, it might even generate a lot of WAL
> files which the user needs to be aware of.
>
> Yes it will, but that is the only correct way to do it. Currently we log
> a WAL record that just says "copy the source directory to the destination
> directory", without noting down exactly what we wanted to copy, so we are
> forced to checkpoint right after CREATE DATABASE because that record
> cannot actually be replayed during crash recovery: the source directory
> may well have different contents at REDO time than it had at DO time,
> which would introduce inconsistencies during crash recovery. To avoid
> that bug, the current code forces the checkpoint; and if you force a
> checkpoint with this patch too, crash recovery will be equally fast. So I
> would not say that we have made crash recovery slow; rather, we have
> removed some bugs, and with that we no longer need to force the
> checkpoint. Also note that in the current code, even with the forced
> checkpoint, the bug is not completely avoided in all cases; check the
> code comments below[1].
>
> > 2) This will put a lot of load on the first checkpoint that will occur
> after creating the database statement. I will experiment around this to see
> if this has any side effects.
>
> But now a checkpoint can happen at its own need and there is no need
> to force a checkpoint like it was before patch.
>
> So the major goals of this patch are: 1) correctly WAL-log CREATE
> DATABASE, which is a hack in the current system; 2) avoid forced
> checkpoints; 3) copy page by page, which will support TDE: if the source
> and destination databases use different encryption, we can re-encrypt
> each page before copying it to the destination, something that is not
> possible in the current system since we copy whole directories; and
> 4) give the new database's pages the latest LSN, which is correct;
> previously the new pages were copied over with their old LSNs.
>

OK. Understood, thanks!

--
With Regards,
Ashutosh Sharma.


Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-12-03 Thread Ashutosh Sharma
I see that this patch is reducing the database creation time by almost 3-4
times provided that the template database has some user data in it.
However, there are a couple of points to be noted:

1) It makes the crash recovery a bit slower than before if the crash has
occurred after the execution of a create database statement. Moreover, if
the template database size is big, it might even generate a lot of WAL
files which the user needs to be aware of.

2) This will put a lot of load on the first checkpoint that will occur
after creating the database statement. I will experiment around this to see
if this has any side effects.

--

Further, the code changes in the patch looks good. I just have few comments:

+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+   LOCKTAG tag;
+   LOCALLOCK  *locallock;
+   LockAcquireResult res;
+
+   SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);

Should there be an assertion statement here to ensure that relid->dbId
and relid->relId are valid?
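A minimal sketch of the suggested assertion, using stand-ins for PostgreSQL's Oid type and OidIsValid() macro (the real guard would use Assert() inside LockRelationId() before SET_LOCKTAG_RELATION):

```c
#include <assert.h>

typedef unsigned int Oid;

#define InvalidOid            ((Oid) 0)
#define OidIsValid(objectId)  ((objectId) != InvalidOid)

/* Illustrative guard: validate both ids before building the lock tag. */
static void
lock_relation_id_checked(Oid dbId, Oid relId)
{
    assert(OidIsValid(dbId));
    assert(OidIsValid(relId));
    /* SET_LOCKTAG_RELATION(tag, dbId, relId); ... acquire the lock ... */
}
```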

--

if (info == XLOG_DBASE_CREATE)
{
xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *)
XLogRecGetData(record);
-   char   *src_path;
-   char   *dst_path;
-   struct stat st;
-
-   src_path = GetDatabasePath(xlrec->src_db_id,
xlrec->src_tablespace_id);
-   dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+   char   *dbpath;

-   /*
-* Our theory for replaying a CREATE is to forcibly drop the target
-* subdirectory if present, then re-copy the source data. This may
be
-* more work than needed, but it is simple to implement.
-*/
-   if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
-   {
-   if (!rmtree(dst_path, true))
-   /* If this failed, copydir() below is going to error. */
-   ereport(WARNING,
-   (errmsg("some useless files may be left behind in
old database directory \"%s\"",
-   dst_path)));
-   }

I think this is a significant change and probably needs some kind of
explanation/comment as to why we just create a directory and copy the
version file when replaying the create database operation. Earlier, this
meant replaying the complete create database operation; that doesn't seem
to be the case now.
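For context, a rough self-contained sketch of what a CreateDirAndVersionFile()-style redo step amounts to under the proposal: page contents now arrive via ordinary block-level WAL records, so the create-database record itself only needs to re-create the directory and its PG_VERSION marker. Paths and the version string here are invented, and the real function would also fsync appropriately.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static int
create_dir_and_version_file(const char *dbpath, const char *version)
{
    char  path[512];
    FILE *fp;

    /* During redo the directory may already exist; that is fine. */
    if (mkdir(dbpath, 0700) != 0 && errno != EEXIST)
        return -1;

    /* Write the PG_VERSION marker file inside the new directory. */
    snprintf(path, sizeof(path), "%s/PG_VERSION", dbpath);
    fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    fprintf(fp, "%s\n", version);
    fclose(fp);
    return 0;
}
```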

--

Have you intentionally skipped copying the pg_internal.init file to the
target database?

--
With Regards,
Ashutosh Sharma.


On Thu, Dec 2, 2021 at 7:20 PM Dilip Kumar  wrote:

> On Wed, Dec 1, 2021 at 6:04 PM Dilip Kumar  wrote:
>
> > Thanks a lot for testing this. From the error, it seems like some of
> > the old buffer w.r.t. the previous tablespace is not dropped after the
> > movedb.  Actually, we are calling DropDatabaseBuffers() after copying
> > to a new tablespace and dropping all the buffers of this database
> > w.r.t the old tablespace.  But seems something is missing, I will
> > reproduce this and try to fix it by tomorrow.  I will also fix the
> > other review comments raised by you in the previous mail.
>
> Okay, I got the issue, basically we are dropping the database buffers
> but not unregistering the existing sync request for database buffers
> w.r.t old tablespace. Attached patch fixes that.  I also had to extend
> ForgetDatabaseSyncRequests so that we can delete the sync request of
> the database for the particular tablespace so added another patch for
> the same (0006).
>
> I will test the performance scenario next week, which is suggested by John.
>
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display

2021-11-12 Thread Ashutosh Sharma
On Thu, Nov 11, 2021 at 6:07 PM Alvaro Herrera 
wrote:

> On 2021-Nov-10, Bossart, Nathan wrote:
>
> > On 11/10/21, 9:43 AM, "Bharath Rupireddy" <
> bharath.rupireddyforpostg...@gmail.com> wrote:
> > > As discussed in [1], isn't it a better idea to add some of activity
> > > messages [2] such as recovery, archive, backup, streaming etc. to
> > > server logs at LOG level?
>
> > I think this would make the logs far too noisy for many servers.  For
> > archiving alone, this could cause tens of thousands more log messages
> > per hour on a busy system.  I think you can already see such
> > information at a debug level, anyway.
>
> Yeah.  If we had some sort of ring buffer in which to store these
> messages, the user could examine them through a view; they would still
> be accessible in a running server, but they would not be written to the
> server log.
>

That's a good idea. How about also adding some GUC(s) to control the
archive- and recovery-related log messages, just like we have for
checkpoints, autovacuum, etc.? Maybe something like log_archive,
log_recovery, etc.
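The ring-buffer idea suggested upthread could look roughly like the toy below: recent activity messages are kept in a fixed-size in-memory ring (overwriting the oldest) instead of being written to the server log, and a view could then expose the current contents. Sizes and names here are invented for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MSG_RING_SLOTS 4
#define MSG_MAX_LEN    128

static char msg_ring[MSG_RING_SLOTS][MSG_MAX_LEN];
static unsigned long msg_next = 0;   /* total messages ever stored */

/* Store a message, overwriting the oldest slot once the ring is full. */
static void
ring_store(const char *msg)
{
    snprintf(msg_ring[msg_next % MSG_RING_SLOTS], MSG_MAX_LEN, "%s", msg);
    msg_next++;
}

/* Return the i-th oldest message still in the ring, or NULL. */
static const char *
ring_read(unsigned long i)
{
    unsigned long kept = msg_next < MSG_RING_SLOTS ? msg_next
                                                   : MSG_RING_SLOTS;
    unsigned long first = msg_next - kept;

    if (i >= kept)
        return NULL;
    return msg_ring[(first + i) % MSG_RING_SLOTS];
}
```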

--
With Regards,
Ashutosh Sharma.

