Re: shared-memory based stats collector - v70

2022-08-22 Thread Andres Freund
Hi,

On 2022-08-22 11:32:14 +0900, Kyotaro Horiguchi wrote:
> At Wed, 10 Aug 2022 14:02:34 +0200, "Drouvot, Bertrand"  
> wrote in 
> > what about?
> > 
> > +   /*
> > +    * Acquire the LWLock directly instead of using
> > pg_stat_lock_entry_shared()
> > +    * which requires a reference.
> > +    */
> > 
> > 
> > I think that's more consistent with other comments mentioning LWLock
> > acquisition.
> 
> Sure. Thanks! I did that in the attached.

Pushed, thanks!

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-08-21 Thread Kyotaro Horiguchi
At Wed, 10 Aug 2022 14:02:34 +0200, "Drouvot, Bertrand"  
wrote in 
> what about?
> 
> +   /*
> +    * Acquire the LWLock directly instead of using
> pg_stat_lock_entry_shared()
> +    * which requires a reference.
> +    */
> 
> 
> I think that's more consistent with other comments mentioning LWLock
> acquisition.

Sure. Thanks! I did that in the attached.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 202ef49a8885f46e339a6d81c723ac3b0a7d55ca Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 22 Aug 2022 11:29:07 +0900
Subject: [PATCH v2] Acquire lock properly when building stats snapshot

It got lost somewhere while developing shared memory stats collector
but it is undoubtedly required.
---
 src/backend/utils/activity/pgstat.c       |  8 ++++++++
 src/backend/utils/activity/pgstat_shmem.c | 12 ++++++++++++
 src/include/utils/pgstat_internal.h       |  1 +
 3 files changed, 21 insertions(+)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 88e5dd1b2b..95f09cfcc7 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -844,9 +844,11 @@ pgstat_fetch_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 	else
 		stats_data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 		kind_info->shared_data_len);
+	pgstat_lock_entry_shared(entry_ref, false);
 	memcpy(stats_data,
 		   pgstat_get_entry_data(kind, entry_ref->shared_stats),
 		   kind_info->shared_data_len);
+	pgstat_unlock_entry(entry_ref);
 
 	if (pgstat_fetch_consistency > PGSTAT_FETCH_CONSISTENCY_NONE)
 	{
@@ -983,9 +985,15 @@ pgstat_build_snapshot(void)
 
 		entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 		 kind_info->shared_size);
+		/*
+		 * Acquire the LWLock directly instead of using
+		 * pg_stat_lock_entry_shared() which requires a reference.
+		 */
+		LWLockAcquire(&stats_data->lock, LW_SHARED);
 		memcpy(entry->data,
 			   pgstat_get_entry_data(kind, stats_data),
 			   kind_info->shared_size);
+		LWLockRelease(&stats_data->lock);
 	}
 	dshash_seq_term(&hstat);
 
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 89060ef29a..0bfa460af1 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -579,6 +579,18 @@ pgstat_lock_entry(PgStat_EntryRef *entry_ref, bool nowait)
 	return true;
 }
 
+bool
+pgstat_lock_entry_shared(PgStat_EntryRef *entry_ref, bool nowait)
+{
+	LWLock	   *lock = &entry_ref->shared_stats->lock;
+
+	if (nowait)
+		return LWLockConditionalAcquire(lock, LW_SHARED);
+
+	LWLockAcquire(lock, LW_SHARED);
+	return true;
+}
+
 void
 pgstat_unlock_entry(PgStat_EntryRef *entry_ref)
 {
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9303d05427..901d2041d6 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -581,6 +581,7 @@ extern void pgstat_detach_shmem(void);
 extern PgStat_EntryRef *pgstat_get_entry_ref(PgStat_Kind kind, Oid dboid, Oid objoid,
 			 bool create, bool *found);
 extern bool pgstat_lock_entry(PgStat_EntryRef *entry_ref, bool nowait);
+extern bool pgstat_lock_entry_shared(PgStat_EntryRef *entry_ref, bool nowait);
 extern void pgstat_unlock_entry(PgStat_EntryRef *entry_ref);
 extern bool pgstat_drop_entry(PgStat_Kind kind, Oid dboid, Oid objoid);
 extern void pgstat_drop_all_entries(void);
-- 
2.31.1



Re: shared-memory based stats collector - v70

2022-08-19 Thread Drouvot, Bertrand

Hi,

On 8/18/22 9:51 PM, Andres Freund wrote:

Hi,

On 2022-08-18 15:26:31 -0400, Greg Stark wrote:

And indexes of course. It's a bit frustrating since without the
catalog you won't know what table the index actually is for... But
they're pretty important stats.

FWIW, I think we should split relation stats into table and index
stats. Historically it'd have added a lot of complexity to separate the two,
but I don't think that's the case anymore. And we waste space for index stats
by having lots of table specific fields.


It seems to me that we should work on that first then, what do you 
think? (If so I can try to have a look at it).


And once done then resume the work to provide the APIs to get all 
tables/indexes from all the databases.


That way we'll be able to provide one API for the tables and one for the 
indexes (instead of one API for both like my current POC is doing).



On that note though... What do you think about having the capability
to add other stats kinds to the stats infrastructure?


I think that's a good idea and that would be great to have.


Getting closer to that was one of my goals working on the shared memory stats
stuff.



It would make a lot of sense for pg_stat_statements to add its entries here
instead of having to reimplement a lot of the same magic.

Yes, we should move pg_stat_statements over.

It's pretty easy to get massive contention on stats entries with
pg_stat_statements, because it doesn't have support for "batching" updates to
shared stats. And reimplementing the same logic in pg_stat_statements.c
doesn't make sense.

And the set of normalized queries could probably be stored in DSA as well - the
file based thing we have right now is problematic.



To do that I guess more of the code needs to be moved to be table
driven from the kind structs either with callbacks or with other meta
data.

Pretty much all of it already is. The only substantial missing bit is
reading/writing of stats files, but that should be pretty easy. And of course
making the callback array extensible.



So the kind record could contain tupledesc and the code to construct the
returned tuple so that these functions could return any custom entry as well
as the standard entries.

I don't see how this would work well - we don't have functions returning
variable kinds of tuples. And what would convert a struct to a tuple?

Nor do I think it's needed - if you have an extension providing a new stats
kind it can also provide accessors.


I think the same (the extension should be able to do that).

I really like the idea of being able to provide new stats kinds.

Regards,

--
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com





Re: shared-memory based stats collector - v70

2022-08-18 Thread Andres Freund
Hi,

On 2022-08-18 15:26:31 -0400, Greg Stark wrote:
> And indexes of course. It's a bit frustrating since without the
> catalog you won't know what table the index actually is for... But
> they're pretty important stats.

FWIW, I think we should split relation stats into table and index
stats. Historically it'd have added a lot of complexity to separate the two,
but I don't think that's the case anymore. And we waste space for index stats
by having lots of table specific fields.


> On that note though... What do you think about having the capability
> to add other stats kinds to the stats infrastructure?

Getting closer to that was one of my goals working on the shared memory stats
stuff.


> It would make a lot of sense for pg_stat_statements to add its entries here
> instead of having to reimplement a lot of the same magic.

Yes, we should move pg_stat_statements over.

It's pretty easy to get massive contention on stats entries with
pg_stat_statements, because it doesn't have support for "batching" updates to
shared stats. And reimplementing the same logic in pg_stat_statements.c
doesn't make sense.

And the set of normalized queries could probably be stored in DSA as well - the
file based thing we have right now is problematic.


> To do that I guess more of the code needs to be moved to be table
> driven from the kind structs either with callbacks or with other meta
> data.

Pretty much all of it already is. The only substantial missing bit is
reading/writing of stats files, but that should be pretty easy. And of course
making the callback array extensible.


> So the kind record could contain tupledesc and the code to construct the
> returned tuple so that these functions could return any custom entry as well
> as the standard entries.

I don't see how this would work well - we don't have functions returning
variable kinds of tuples. And what would convert a struct to a tuple?

Nor do I think it's needed - if you have an extension providing a new stats
kind it can also provide accessors.

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-08-18 Thread Greg Stark
On Thu, 18 Aug 2022 at 02:27, Drouvot, Bertrand  wrote:
>
> What I had in mind is to provide an API to retrieve stats for those that
> would need to connect to each database individually otherwise.
>
> That's why I focused on PGSTAT_KIND_RELATION that has
> PgStat_KindInfo.accessed_across_databases set to false.
>
> I think that another candidate could also be PGSTAT_KIND_FUNCTION.

And indexes of course. It's a bit frustrating since without the
catalog you won't know what table the index actually is for... But
they're pretty important stats.


On that note though... What do you think about having the capability
to add other stats kinds to the stats infrastructure? It would make a
lot of sense for pg_stat_statements to add its entries here instead of
having to reimplement a lot of the same magic. And I have in mind an
extension that allows adding other stats and it would be nice to avoid
having to reimplement any of this.

To do that I guess more of the code needs to be moved to be table
driven from the kind structs either with callbacks or with other meta
data. So the kind record could contain tupledesc and the code to
construct the returned tuple so that these functions could return any
custom entry as well as the standard entries.
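
The table-driven shape suggested above could look roughly like the following
standalone model. All names here (StatsKindInfo, register_stats_kind) are
invented for illustration and are not PostgreSQL's actual PgStat_KindInfo
machinery; it only sketches the idea of a registry of kind structs carrying
sizes and callbacks that generic code can drive.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stats-kind registry: each kind contributes its
 * payload size and a callback used by generic reporting code. */
typedef struct StatsKindInfo
{
	const char *name;            /* e.g. "relation", "my_extension_kind" */
	size_t      shared_data_len; /* size of the per-entry stats payload */
	/* callback converting raw entry bytes to user-visible text */
	void      (*to_text)(const void *entry, char *out, size_t outlen);
} StatsKindInfo;

#define MAX_KINDS 32
static StatsKindInfo kind_table[MAX_KINDS];
static int	n_kinds;

/* Returns an id (index) for the new kind, or -1 if the table is full. */
int
register_stats_kind(const StatsKindInfo *info)
{
	if (n_kinds >= MAX_KINDS)
		return -1;
	kind_table[n_kinds] = *info;
	return n_kinds++;
}

const StatsKindInfo *
lookup_stats_kind(int kind_id)
{
	return (kind_id >= 0 && kind_id < n_kinds) ? &kind_table[kind_id] : NULL;
}
```

An extension would call register_stats_kind() at load time; generic code then
needs only the kind id to find sizes and callbacks, never a hard-coded switch.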

-- 
greg




Re: shared-memory based stats collector - v70

2022-08-17 Thread Andres Freund
Hi,

On 2022-08-17 15:46:42 -0400, Greg Stark wrote:
> Isn't there also a local hash table used to find the entries to reduce
> traffic on the shared hash table? Even if you don't take a snapshot
> does it get entered there? There are definitely still parts of this
> I'm working on a pretty vague understanding of :/

Yes, there is. But it's more about code that generates stats, rather than
reporting functions. While there's backend local pending stats we need to have
a refcount on the shared stats item so that the stats item can't be dropped
and then revived when those local stats are flushed.

Relevant comments from pgstat.c:

 * To avoid contention on the shared hashtable, each backend has a
 * backend-local hashtable (pgStatEntryRefHash) in front of the shared
 * hashtable, containing references (PgStat_EntryRef) to shared hashtable
 * entries. The shared hashtable only needs to be accessed when no prior
 * reference is found in the local hashtable. Besides pointing to the
 * shared hashtable entry (PgStatShared_HashEntry) PgStat_EntryRef also
 * contains a pointer to the shared statistics data, as a process-local
 * address, to reduce access costs.
 *
 * The names for structs stored in shared memory are prefixed with
 * PgStatShared instead of PgStat. Each stats entry in shared memory is
 * protected by a dedicated lwlock.
 *
 * Most stats updates are first accumulated locally in each process as pending
 * entries, then later flushed to shared memory (just after commit, or by
 * idle-timeout). This practically eliminates contention on individual stats
 * entries. For most kinds of variable-numbered pending stats data is stored
 * in PgStat_EntryRef->pending. All entries with pending data are in the
 * pgStatPending list. Pending statistics updates are flushed out by
 * pgstat_report_stat().
 *
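
The pending/flush batching described in that comment can be sketched as a
standalone model: updates touch only a process-local pending entry with no
locking, and are folded into the shared entry under one short exclusive lock
at flush time. All names and fields below are illustrative stand-ins, not
PostgreSQL's actual structs.

```c
#include <assert.h>

typedef struct { long numscans; long tuples_returned; } Counters;

typedef struct
{
	Counters shared;     /* lives in shared memory in the real system */
	Counters pending;    /* process-local accumulation */
	int      lock_taken; /* counts exclusive "lock" acquisitions */
} StatsEntryModel;

/* Hot path: no lock, just bump local counters. */
void
count_scan(StatsEntryModel *e, long ntuples)
{
	e->pending.numscans++;
	e->pending.tuples_returned += ntuples;
}

/* Cold path: called after commit or on idle timeout. */
void
flush_pending(StatsEntryModel *e)
{
	e->lock_taken++;     /* stands in for taking the entry's lwlock */
	e->shared.numscans += e->pending.numscans;
	e->shared.tuples_returned += e->pending.tuples_returned;
	e->pending.numscans = 0;
	e->pending.tuples_returned = 0;
}
```

The point of the design is visible in the lock count: one acquisition per
flush rather than one per counted event.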

pgstat_internal.h has more details about the refcount aspect:

 * Per-object statistics are stored in the "shared stats" hashtable. That
 * table's entries (PgStatShared_HashEntry) contain a pointer to the actual 
stats
 * data for the object (the size of the stats data varies depending on the
 * kind of stats). The table is keyed by PgStat_HashKey.
 *
 * Once a backend has a reference to a shared stats entry, it increments the
 * entry's refcount. Even after stats data is dropped (e.g., due to a DROP
 * TABLE), the entry itself can only be deleted once all references have been
 * released.
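
A toy model of that refcount lifecycle (invented names; a boolean stands in
for actually reclaiming shared memory): an entry marked dropped survives
until its last reference is released.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct SharedEntry
{
	int		refcount;
	bool	dropped;   /* e.g. the underlying table was dropped */
	bool	freed;     /* stand-in for freeing the shared-memory entry */
} SharedEntry;

void
entry_acquire(SharedEntry *e)
{
	e->refcount++;
}

void
entry_mark_dropped(SharedEntry *e)
{
	e->dropped = true;
	if (e->refcount == 0)
		e->freed = true;        /* nobody holds it: reclaim immediately */
}

void
entry_release(SharedEntry *e)
{
	e->refcount--;
	if (e->refcount == 0 && e->dropped)
		e->freed = true;        /* last reference gone: safe to reclaim */
}
```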

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-08-17 Thread Greg Stark
On Tue, 16 Aug 2022 at 08:49, Drouvot, Bertrand  wrote:
>
>
> +   if (p->key.kind != PGSTAT_KIND_RELATION)
> +   continue;

Hm. So presumably this needs to be extended. Either to let the caller
decide which types of stats to return or to somehow return all the
stats intermixed. In my monitoring code I did the latter because I
didn't think going through the hash table repeatedly would be very
efficient. But it's definitely a pretty awkward API since I need a
switch statement that explicitly lists each case and casts the result.

> > 2) When I did the function attached above I tried to avoid returning
> > the whole set and make it possible to process them as they arrive.
>
> Is it the way it has been done? (did not look at your function yet)

I did it with callbacks. It was quick and easy and convenient for my
use case. But in general I don't really like callbacks and would think
some kind of iterator style api would be nicer.

I am handling the stats entries as they turn up. I'm constructing the
text output for each in a callback and buffering up the whole http
response in a string buffer.

I think that's ok but if I wanted to avoid buffering it up and do
network i/o then I would think the thing to do would be to build the
list of entry keys and then loop over that list doing a hash lookup
for each one and generating the response for each out and writing it
to the network. That way there wouldn't be anything locked, not even
the hash table, while doing network i/o. It would mean a lot of
traffic on the hash table though.
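
That two-phase pattern - snapshot just the keys under the lock, then take
one short lock per key while the slow per-entry work (network I/O) happens
unlocked - can be modeled in a standalone sketch. The array, lock counter,
and names below are stand-ins, not the dshash API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NENTRIES 3

typedef struct { int key; long value; } StatsEntry;

static StatsEntry shared_table[NENTRIES] = {{1, 10}, {2, 20}, {3, 30}};
static int	lock_hold_count;    /* counts how often the "lock" is taken */

static void lock_table(void)   { lock_hold_count++; }
static void unlock_table(void) { }

static const StatsEntry *
lookup(int key)
{
	for (int i = 0; i < NENTRIES; i++)
		if (shared_table[i].key == key)
			return &shared_table[i];
	return NULL;
}

/* Phase 1: copy out keys only. Phase 2: per-key lookup and emit. */
int
emit_all(char *out, size_t outlen)
{
	int		keys[NENTRIES], nkeys = 0;

	lock_table();
	for (int i = 0; i < NENTRIES; i++)
		keys[nkeys++] = shared_table[i].key;
	unlock_table();

	out[0] = '\0';
	for (int i = 0; i < nkeys; i++)
	{
		char	line[32];

		lock_table();           /* short lock, held only for the copy */
		const StatsEntry *e = lookup(keys[i]);
		snprintf(line, sizeof(line), "%d=%ld;", e->key, e->value);
		unlock_table();
		/* ...network write of `line` would happen here, lock not held... */
		strncat(out, line, outlen - strlen(out) - 1);
	}
	return nkeys;
}
```

The trade-off Greg describes shows up directly: the lock is taken once for
the key scan plus once per entry, so total hash-table traffic grows with the
number of entries, but it is never held across I/O.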

> > -- on that note I wonder if I've done something
> > wrong because I noted a few records with InvalidOid where I didn't
> > expect it.
>
> It looks like that InvalidOid for the dbid means that the entry is for a
> shared relation.

Ah yes. I had actually found that but forgotten it.

There's also a database entry with dboid=InvalidOid which is
apparently where background workers with no database attached report
stats.

> I've in mind to add some filtering on the dbid (I think it could be
> useful for monitoring tool with a persistent connection to one database
> but that wants to pull the stats database per database).
>
> I don't think a look up through the local cache will work if the
> entry/key is related to another database the API is launched from.

Isn't there also a local hash table used to find the entries to reduce
traffic on the shared hash table? Even if you don't take a snapshot
does it get entered there? There are definitely still parts of this
I'm working on a pretty vague understanding of :/

-- 
greg




Re: shared-memory based stats collector - v70

2022-08-15 Thread Greg Stark
On Thu, 11 Aug 2022 at 02:11, Drouvot, Bertrand  wrote:
>
> As Andres was not -1 about that idea (as it should be low cost to add
> and maintain) as long as somebody cares enough to write something: then
> I'll give it a try and submit a patch for it.

I agree it would be a useful feature. I think there may be things to
talk about here though.

1) Are you planning to go through the local hash table and
LocalSnapshot and obey the consistency mode? I was thinking a flag
passed to build_snapshot to request global mode might be sufficient
instead of a completely separate function.

2) When I did the function attached above I tried to avoid returning
the whole set and make it possible to process them as they arrive. I
actually was hoping to get to the point where I could start shipping
out network data as they arrive and not even buffer up the response,
but I think I need to be careful about hash table locking then.

3) They key difference here is that we're returning whatever stats are
in the hash table rather than using the catalog to drive a list of id
numbers to look up. I guess the API should make it clear this is what
is being returned -- on that note I wonder if I've done something
wrong because I noted a few records with InvalidOid where I didn't
expect it.

4) I'm currently looping over the hash table returning the records all
intermixed. Some users will probably want to do things like "return
all Relation records for all databases" or "return all Index records
for database id xxx". So some form of filtering may be best or perhaps
a way to retrieve just the keys so they can then be looked up one by
one (through the local cache?).

5) On that note I'm not clear how the local cache will interact with
these cross-database lookups. That should probably be documented...

-- 
greg




Re: shared-memory based stats collector - v70

2022-08-11 Thread Drouvot, Bertrand

Hi,

On 8/10/22 11:25 PM, Greg Stark wrote:

On Wed, 10 Aug 2022 at 04:05, Drouvot, Bertrand  wrote:

Hi,

On 8/9/22 6:00 PM, Greg Stark wrote:

On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand  wrote:

What do you think about adding a function in core PG to provide such
functionality? (means being able to retrieve all the stats (+ eventually
add some filtering) without the need to connect to each database).

I'm working on it myself too. I'll post a patch for discussion in a bit.

Great! Thank you!

So I was adding the code to pgstat.c because I had thought there were
some data types I needed and/or static functions I needed. However you
and Andres encouraged me to check again now. And indeed I was able,
after fixing a couple things, to make the code work entirely
externally.


Nice!

Though I still think that having an SQL API in core could be useful too.

As Andres was not -1 about that idea (as it should be low cost to add 
and maintain) as long as somebody cares enough to write something: then 
I'll give it a try and submit a patch for it.




This is definitely not polished and there's a couple obvious things
missing. But at the risk of embarrassment I've attached my WIP. Please
be gentle :) I'll post the github link in a bit when I've fixed up
some meta info.


Thanks! I will have a look at it on github (once you share the link).

Regards,

--
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com





Re: shared-memory based stats collector - v70

2022-08-10 Thread Andres Freund
Hi,

On 2022-08-10 15:48:15 -0400, Greg Stark wrote:
> One thing that's bugging me is that the names we use for these stats
> are *all* over the place.

Yes. I had a huge issue with this when polishing the patch. And Horiguchi-san
did as well.  I had to limit the amount of cleanup done to make it feasible to
get anything committed. I think it's a bit less bad than before, but by no
means good.


> * The pg_stat_bgwriter view returns data from two different fixed
> entries, the checkpointer and the bgwriter, is there a reason those
> are kept separately but then reported as if they're one thing?

Historical raisins. Checkpointer and bgwriter used to be one thing, but isn't
anymore.

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-08-10 Thread Andres Freund
Hi,

On 2022-08-10 14:18:25 -0400, Greg Stark wrote:
> > I don't think that's a large enough issue to worry about unless you're
> > polling at a very high rate, which'd be a bad idea in itself. If a backend
> > can't get the lock for some stats change it'll defer flushing the stats a 
> > bit,
> > so it'll not cause a lot of other problems.
> 
> Hm. I wonder if we're on the same page about what constitutes a "high rate".
> 
> I've seen people try to push prometheus or other similar systems to 5s
> poll intervals. That would be challenging for Postgres due to the
> volume of statistics. The default is 30s and people often struggle to
> even have that function for large fleets. But if you had a small
> fleet, perhaps an iot style system with a "one large table" type of
> schema you might well want stats every 5s or even every 1s.

That's probably fine. Although I think you might run into trouble not from the
stats subystem side, but from the "amount of data" side. On a system with a
lot of objects that can be a fair amount.  If you really want to do very low
latency stats reporting, I suspect you'd have to build an incremental system.


> > I'm *dead* set against including catalog names in shared memory stats. 
> > That'll
> > add a good amount of memory usage and complexity, without any sort of
> > comensurate gain.
> 
> Well it's pushing the complexity there from elsewhere. If the labels
> aren't in the stats structures then the exporter needs to connect to
> each database, gather all the names into some local cache and then it
> needs to worry about keeping it up to date. And if there are any
> database problems such as disk errors or catalog objects being locked
> then your monitoring breaks though perhaps it can be limited to just
> missing some object names or having out of date names.

Shrug. If the stats system state desynchronizes from an alter table rename
you'll also have a problem in monitoring.

And even if you can benefit from having all that information, it'd still be an
overhead born by everybody for a very small share of users.


> > > I also think it would be nice to have a change counter for every stat
> > > object, or perhaps a change time. Prometheus wouldn't be able to make
> > > use of it but other monitoring software might be able to receive only
> > > metrics that have changed since the last update which would really
> > > help on databases with large numbers of mostly static objects.
> >
> > I think you're proposing adding overhead that doesn't even have a real user.
> 
> I guess I'm just brainstorming here. I don't need to currently no. It
> doesn't seem like significant overhead though compared to the locking
> and copying though?

Yes, timestamps aren't cheap to determine (nor free to store, but that's a
lesser issue).

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-08-10 Thread Greg Stark
On Wed, 10 Aug 2022 at 04:05, Drouvot, Bertrand  wrote:
>
> Hi,
>
> On 8/9/22 6:00 PM, Greg Stark wrote:
> > On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand  wrote:
> >>
> >> What do you think about adding a function in core PG to provide such
> >> functionality? (means being able to retrieve all the stats (+ eventually
> >> add some filtering) without the need to connect to each database).
> > I'm working on it myself too. I'll post a patch for discussion in a bit.
>
> Great! Thank you!

So I was adding the code to pgstat.c because I had thought there were
some data types I needed and/or static functions I needed. However you
and Andres encouraged me to check again now. And indeed I was able,
after fixing a couple things, to make the code work entirely
externally.

This is definitely not polished and there's a couple obvious things
missing. But at the risk of embarrassment I've attached my WIP. Please
be gentle :) I'll post the github link in a bit when I've fixed up
some meta info.

I'm definitely not wedded to the idea of using callbacks, it was just
the most convenient way to get started, especially when I was putting
the main loop in pgstat.c.  Ideally I do want to keep open the
possibility of streaming the results out without buffering the whole
set in memory.

> Out of curiosity, would you be also interested by such a feature for
> previous versions (that will not get the patch in) ?

I always had trouble understanding the existing stats code so I was
hoping the new code would make it easier. It seems to have worked but
it's possible I'm wrong and it was always possible and the problem was
always just me :)


-- 
greg
/*-
 *
 * telemetry.c
 *
 * Most of this code was copied from pg_prewarm.c as a template. 
 *
 *
 *-
 */

#include "postgres.h"

#include 
#include 
#include 

#include "access/relation.h"
#include "access/xact.h"
#include "catalog/pg_class.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgworker.h"
#include "postmaster/interrupt.h"
#include "storage/buf_internals.h"
#include "storage/dsm.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "storage/lwlock.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
#include "storage/shmem.h"
#include "storage/smgr.h"
#include "tcop/tcopprot.h"
#include "utils/acl.h"
#include "utils/datetime.h"
#include "utils/guc.h"
#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/resowner.h"


#include "telemetry.h"
#include "telemetry_pgstat.h"

PG_MODULE_MAGIC;


/* We should already have included what we need to get uint64_t size_t
   fd_set socklen_t and struct sockaddr so don't let microhttpd
   include extra stuff for them */
#define MHD_PLATFORM_H
#include <microhttpd.h>

/* MHD_USE_EPOLL */
/* MHD_USE_AUTO */
/* MHD_OPTION_CONNECTION_LIMIT */
/* MHD_OPTION_CONNECTION_TIMEOUT */
/* MHD_OPTION_EXTERNAL_LOGGER */



/* Background worker harness */
void		_PG_init(void);

/* Actual internal background worker main entry point */
static void telemetry_start_worker(unsigned short port);

/* GUC variables. */
static int telemetry_port = 9187; /* TCP port to listen on for metrics */
static char *telemetry_listen_addresses; /* TCP listen addresses */
static bool telemetry = true; /* start worker automatically on default port */


static enum MHD_Result
telemetry_handler(void *cls,
		  struct MHD_Connection *connection,
		  const char *url,
		  const char *method,
		  const char *version,
		  const char *upload_data,
		  size_t *upload_data_size,
		  void **con_cls);

/*
 * Module load callback.
 */
void
_PG_init(void)
{
DefineCustomIntVariable("telemetry.port",
			"TCP Port to serve metrics on by default",
			NULL,
			&telemetry_port,
			9187, 1, 65535,
			PGC_SIGHUP,
			0, /* flags */
			NULL, NULL, NULL /* hooks */
			);


DefineCustomStringVariable("telemetry.listen_addresses",
			   "TCP Listen Addresses to serve metrics on by default",
			   NULL,
			   &telemetry_listen_addresses,
			   "*",
			   PGC_SIGHUP,
			   GUC_LIST_INPUT, /* flags */
			   NULL, NULL, NULL /* hooks */
			   );

if (!process_shared_preload_libraries_in_progress)
	return;

/* can't define PGC_POSTMASTER variable after startup */
DefineCustomBoolVariable("telemetry.start_server",
			 "Starts the telemetry worker on startup.",
			 NULL,
			 &telemetry,
			 true,
			 PGC_POSTMASTER,
			 0,
			 NULL,
			 NULL,
			 NULL);

EmitWarningsOnPlaceholders("telemetry");

/* Register telemetry worker, if enabled. */
if (telemetry)
	telemetry_start_worker(telemetry_port);
}


static void telemetry_logger(void *arg, const char *fmt, va_list ap);

struct MHD_OptionItem mhd_ops[] = {
{ MHD_OPTION_EXTERNAL_LOGGER, (intptr_t)telemetry_logger, NULL },
{ 

Re: shared-memory based stats collector - v70

2022-08-10 Thread Greg Stark
One thing that's bugging me is that the names we use for these stats
are *all* over the place.

The names go through three different stages

pgstat structs  ->  pgstatfunc tupledescs  ->  pg_stat_* views

(Followed by a fourth stage where pg_exporter or whatever names for
the monitoring software)

And for some reason both transitions (plus the exporter) felt the need
to fiddle with the names or values. And not in any sort of even
vaguely consistent way. So there are three (or four) different sets of
names for the same metrics :(

e.g.

* Some of the struct elements have abbreviated words which are
expanded in the tupledesc names or the view columns -- some have long
names which get abbreviated.

* Some struct members have n_ prefixes (presumably to avoid C keywords
or other namespace issues?) and then lose them at one of the other
stages. But then the relation stats do not have n_ prefixes and then
the pg_stat view *adds* n_ prefixes in the SQL view!

* Some columns are added together in the SQL view which seems like
gratuitously hiding information from the user. The pg_stat_*_tables
view actually looks up information from the indexes stats and combines
them to get idx_scan and idx_tup_fetch.

* The pg_stat_bgwriter view returns data from two different fixed
entries, the checkpointer and the bgwriter, is there a reason those
are kept separately but then reported as if they're one thing?


Some of the simpler renaming could be transparently fixed by making
the internal stats match the public facing names. But for many of them
I think the internal names are better. And the cases where the views
aggregate data in a way that loses information are not something I
want to reproduce.

I had intended to use the internal names directly, reasoning that
transparency and consistency are the direction to be headed. But in
some cases I think the current public names are the better choice -- I
certainly don't want to remove n_* prefixes from some names but then
add them to different names! And some of the cases where the data is
combined or modified do seem like they would be missed.

-- 
greg




Re: shared-memory based stats collector - v70

2022-08-10 Thread Greg Stark
On Tue, 9 Aug 2022 at 12:48, Andres Freund  wrote:

> > The reason I want a C function is I'm trying to get as far as I can
> > without a connection to a database, without a transaction, without
> > accessing the catalog, and as much as possible without taking locks.
>
> I assume you don't include lwlocks under locks?

I guess it depends on which lwlock :) I would be leery of a monitoring
system taking an lwlock that could interfere with regular transactions
doing work. Or taking a lock that is itself the cause of the problem
elsewhere that you really need stats to debug would be a deal breaker.

> I don't think that's a large enough issue to worry about unless you're
> polling at a very high rate, which'd be a bad idea in itself. If a backend
> can't get the lock for some stats change it'll defer flushing the stats a bit,
> so it'll not cause a lot of other problems.

Hm. I wonder if we're on the same page about what constitutes a "high rate".

I've seen people try to push prometheus or other similar systems to 5s
poll intervals. That would be challenging for Postgres due to the
volume of statistics. The default is 30s and people often struggle to
even have that function for large fleets. But if you had a small
fleet, perhaps an iot style system with a "one large table" type of
schema you might well want stats every 5s or even every 1s.

> I'm *dead* set against including catalog names in shared memory stats. That'll
> add a good amount of memory usage and complexity, without any sort of
> comensurate gain.

Well it's pushing the complexity there from elsewhere. If the labels
aren't in the stats structures then the exporter needs to connect to
each database, gather all the names into some local cache and then it
needs to worry about keeping it up to date. And if there are any
database problems such as disk errors or catalog objects being locked
then your monitoring breaks though perhaps it can be limited to just
missing some object names or having out of date names.



> > I also think it would be nice to have a change counter for every stat
> > object, or perhaps a change time. Prometheus wouldn't be able to make
> > use of it but other monitoring software might be able to receive only
> > metrics that have changed since the last update which would really
> > help on databases with large numbers of mostly static objects.
>
> I think you're proposing adding overhead that doesn't even have a real user.

I guess I'm just brainstorming here; I don't currently need it. It
doesn't seem like significant overhead compared to the locking
and copying, though?

-- 
greg




Re: shared-memory based stats collector - v70

2022-08-09 Thread Kyotaro Horiguchi
At Tue, 9 Aug 2022 09:53:19 -0700, Andres Freund  wrote in 
> Hi,
> 
> On 2022-08-09 17:24:35 +0900, Kyotaro Horiguchi wrote:
> > If I'm not missing something, it's strange that pgstat_lock_entry()
> > only takes LW_EXCLUSIVE.
> 
> I think it makes some sense, given that there's a larger number of callers for
> that in various stats-emitting code. Perhaps we could just add a separate
> function with a _shared() suffix?

Sure. That was an alternative I had in my mind.

> > The attached changes the interface of
> > pgstat_lock_entry(), but there's only one user, since the other read of
> > a shared stats entry does not use a reference. Thus the interface change
> > might be too much. If I just add a bare LWLockAcquire/Release() to
> > pgstat_fetch_entry(), the patch will be smaller.
> 
> Could you try the pgstat_lock_entry_shared() approach?

Of course. Please find the attached.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From f406fd384d7ca145e37810a2f6a7a410f74d5795 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 10 Aug 2022 10:53:19 +0900
Subject: [PATCH] Take lock while building stats snapshot

It got lost somewhere while developing shared memory stats collector
but it is undoubtedly required.
---
 src/backend/utils/activity/pgstat.c   |  8 
 src/backend/utils/activity/pgstat_shmem.c | 12 
 src/include/utils/pgstat_internal.h   |  1 +
 3 files changed, 21 insertions(+)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 88e5dd1b2b..c23142a670 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -844,9 +844,11 @@ pgstat_fetch_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 	else
 		stats_data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 		kind_info->shared_data_len);
+	pgstat_lock_entry_shared(entry_ref, false);
 	memcpy(stats_data,
 		   pgstat_get_entry_data(kind, entry_ref->shared_stats),
 		   kind_info->shared_data_len);
+	pgstat_unlock_entry(entry_ref);
 
 	if (pgstat_fetch_consistency > PGSTAT_FETCH_CONSISTENCY_NONE)
 	{
@@ -983,9 +985,15 @@ pgstat_build_snapshot(void)
 
 		entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 		 kind_info->shared_size);
+		/*
+		 * Take lwlock directly instead of using pg_stat_lock_entry_shared()
+		 * which requires a reference.
+		 */
+		LWLockAcquire(&stats_data->lock, LW_SHARED);
 		memcpy(entry->data,
 			   pgstat_get_entry_data(kind, stats_data),
 			   kind_info->shared_size);
+		LWLockRelease(&stats_data->lock);
 	}
 	dshash_seq_term(&hstat);
 
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 89060ef29a..0bfa460af1 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -579,6 +579,18 @@ pgstat_lock_entry(PgStat_EntryRef *entry_ref, bool nowait)
 	return true;
 }
 
+bool
+pgstat_lock_entry_shared(PgStat_EntryRef *entry_ref, bool nowait)
+{
+	LWLock	   *lock = &entry_ref->shared_stats->lock;
+
+	if (nowait)
+		return LWLockConditionalAcquire(lock, LW_SHARED);
+
+	LWLockAcquire(lock, LW_SHARED);
+	return true;
+}
+
 void
 pgstat_unlock_entry(PgStat_EntryRef *entry_ref)
 {
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9303d05427..901d2041d6 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -581,6 +581,7 @@ extern void pgstat_detach_shmem(void);
 extern PgStat_EntryRef *pgstat_get_entry_ref(PgStat_Kind kind, Oid dboid, Oid objoid,
 			 bool create, bool *found);
 extern bool pgstat_lock_entry(PgStat_EntryRef *entry_ref, bool nowait);
+extern bool pgstat_lock_entry_shared(PgStat_EntryRef *entry_ref, bool nowait);
 extern void pgstat_unlock_entry(PgStat_EntryRef *entry_ref);
 extern bool pgstat_drop_entry(PgStat_Kind kind, Oid dboid, Oid objoid);
 extern void pgstat_drop_all_entries(void);
-- 
2.31.1



Re: shared-memory based stats collector - v70

2022-08-09 Thread Andres Freund
Hi,

On 2022-08-09 17:24:35 +0900, Kyotaro Horiguchi wrote:
> At Mon, 8 Aug 2022 11:53:19 -0400, Greg Stark  wrote in
> > I'm trying to wrap my head around the shared memory stats collector
> > infrastructure from
> > <20220406030008.2qxipjxo776dw...@alap3.anarazel.de> committed in
> > 5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd.
> >
> > I have one question about locking -- afaics there's nothing protecting
> > reading the shared memory stats. There is an lwlock protecting
> > concurrent updates of the shared memory stats, but that lock isn't
> > taken when we read the stats. Are we ok relying on atomic 64-bit reads
> > or is there something else going on that I'm missing?

Yes, that's not right. Not sure how it ended up that way. There was a lot of
refactoring and pushing down the locking into different places, I guess it got
lost somewhere on the way :(. It's unlikely to be a large problem, but we
should fix it.


> If I'm not missing something, it's strange that pgstat_lock_entry()
> only takes LW_EXCLUSIVE.

I think it makes some sense, given that there's a larger number of callers for
that in various stats-emitting code. Perhaps we could just add a separate
function with a _shared() suffix?


> The attached changes the interface of
> pgstat_lock_entry(), but there's only one user, since the other read of
> a shared stats entry does not use a reference. Thus the interface change
> might be too much. If I just add a bare LWLockAcquire/Release() to
> pgstat_fetch_entry(), the patch will be smaller.

Could you try the pgstat_lock_entry_shared() approach?

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-08-09 Thread Andres Freund
Hi,

On 2022-08-09 12:00:46 -0400, Greg Stark wrote:
> I was more aiming at a C function that extensions could use directly
> rather than an SQL function -- though I suppose having the former it
> would be simple enough to implement the latter using it. (though it
> would have to be one for each stat type I guess)

I think such a C extension could exist today, without patching core code? It'd
be a bit ugly to include pgstat_internal.h, I guess, but other than that...


> The reason I want a C function is I'm trying to get as far as I can
> without a connection to a database, without a transaction, without
> accessing the catalog, and as much as possible without taking locks.

I assume you don't include lwlocks under locks?


> I think this is important for making monitoring highly reliable and low
> impact on production.

I'm doubtful about that, but whatever.


> The main problem with my current code is that I'm accessing the shared
> memory hash table directly. This means I'm possibly introducing
> locking contention on the shared memory hash table.

I don't think that's a large enough issue to worry about unless you're
polling at a very high rate, which'd be a bad idea in itself. If a backend
can't get the lock for some stats change it'll defer flushing the stats a bit,
so it'll not cause a lot of other problems.


> I'm thinking of separating the shared memory hash scan from the metric scan
> so the list can be quickly built minimizing the time the lock is held.

I'd really really want to see some evidence that any sort of complexity here
is worth it.


> I have a few things I would like to suggest for future improvements to
> this infrastructure. I haven't polished the details of it yet but the
> main thing I think I'm missing is the catalog name for the object. I
> don't want to have to fetch it from the catalog and in any case I
> think it would generally be useful and might regularize the
> replication slot handling too.

I'm *dead* set against including catalog names in shared memory stats. That'll
add a good amount of memory usage and complexity, without any sort of
commensurate gain.


> I also think it would be nice to have a change counter for every stat
> object, or perhaps a change time. Prometheus wouldn't be able to make
> use of it but other monitoring software might be able to receive only
> metrics that have changed since the last update which would really
> help on databases with large numbers of mostly static objects.

I think you're proposing adding overhead that doesn't even have a real user.

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-08-09 Thread Andres Freund
Hi,

On 2022-08-09 12:18:47 +0200, Drouvot, Bertrand wrote:
> What do you think about adding a function in core PG to provide such
> functionality? (means being able to retrieve all the stats (+ eventually add
> some filtering) without the need to connect to each database).

I'm not that convinced by the use case, but I think it's also low cost to add
and maintain, so if somebody cares enough to write something...

The only thing I would "request" is that such a function requires more
permissions than the default accessors do. I think it's a minor problem that
we allow so much access within a database right now, regardless of object
permissions, but it'd not be a great idea to expand that to other databases,
in bulk?

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-08-09 Thread Greg Stark
On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand  wrote:
>
>
> What do you think about adding a function in core PG to provide such
> functionality? (means being able to retrieve all the stats (+ eventually
> add some filtering) without the need to connect to each database).

I'm working on it myself too. I'll post a patch for discussion in a bit.

I was more aiming at a C function that extensions could use directly
rather than an SQL function -- though I suppose having the former it
would be simple enough to implement the latter using it. (though it
would have to be one for each stat type I guess)

The reason I want a C function is I'm trying to get as far as I can
without a connection to a database, without a transaction, without
accessing the catalog, and as much as possible without taking locks. I
think this is important for making monitoring highly reliable and low
impact on production. It's also kind of fundamental to accessing stats
for objects from other databases since we won't have easy access to
the catalogs for the other databases.

The main problem with my current code is that I'm accessing the shared
memory hash table directly. This means I'm possibly introducing
locking contention on the shared memory hash table. I'm thinking of
separating the shared memory hash scan from the metric scan so the
list can be quickly built, minimizing the time the lock is held. We
could possibly also rebuild that list at a lower frequency than
the metrics gathering, so new objects might not show up instantly.

I have a few things I would like to suggest for future improvements to
this infrastructure. I haven't polished the details of it yet but the
main thing I think I'm missing is the catalog name for the object. I
don't want to have to fetch it from the catalog and in any case I
think it would generally be useful and might regularize the
replication slot handling too.

I also think it would be nice to have a change counter for every stat
object, or perhaps a change time. Prometheus wouldn't be able to make
use of it but other monitoring software might be able to receive only
metrics that have changed since the last update which would really
help on databases with large numbers of mostly static objects. Even on
typical databases there are tons of builtin objects (especially
functions) that are probably never getting updates.

-- 
greg




Re: shared-memory based stats collector - v70

2022-08-09 Thread Kyotaro Horiguchi
At Mon, 8 Aug 2022 11:53:19 -0400, Greg Stark  wrote in 
> I'm trying to wrap my head around the shared memory stats collector
> infrastructure from
> <20220406030008.2qxipjxo776dw...@alap3.anarazel.de> committed in
> 5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd.
> 
> I have one question about locking -- afaics there's nothing protecting
> reading the shared memory stats. There is an lwlock protecting
> concurrent updates of the shared memory stats, but that lock isn't
> taken when we read the stats. Are we ok relying on atomic 64-bit reads
> or is there something else going on that I'm missing?
> 
> In particular I'm looking at pgstat.c:847 in pgstat_fetch_entry()
> which does this:
> 
> memcpy(stats_data,
>pgstat_get_entry_data(kind, entry_ref->shared_stats),
>kind_info->shared_data_len);
> 
> stats_data is the returned copy of the stats entry with all the
> statistics in it. But it's copied from the shared memory location
> directly using memcpy and there's no locking or change counter or
> anything protecting this memcpy that I can see.

We take LW_SHARED while creating a snapshot of fixed-numbered
stats. On the other hand we don't for variable-numbered stats.  I
agree to you, that we need that also for variable-numbered stats.

If I'm not missing something, it's strange that pgstat_lock_entry()
only takes LW_EXCLUSIVE. The attached changes the interface of
pgstat_lock_entry(), but there's only one user, since the other read of
a shared stats entry does not use a reference. Thus the interface change
might be too much. If I just add a bare LWLockAcquire/Release() to
pgstat_fetch_entry(), the patch will be smaller.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 88e5dd1b2b..7c4e5f0238 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -844,9 +844,11 @@ pgstat_fetch_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 	else
 		stats_data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 		kind_info->shared_data_len);
+	pgstat_lock_entry(entry_ref, LW_SHARED, false);
 	memcpy(stats_data,
 		   pgstat_get_entry_data(kind, entry_ref->shared_stats),
 		   kind_info->shared_data_len);
+	pgstat_unlock_entry(entry_ref);
 
 	if (pgstat_fetch_consistency > PGSTAT_FETCH_CONSISTENCY_NONE)
 	{
@@ -983,9 +985,15 @@ pgstat_build_snapshot(void)
 
 		entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 		 kind_info->shared_size);
+		/*
+		 * We're directly accessing the shared stats entry, not via a
+		 * reference, so pgstat_lock_entry() cannot be used here.
+		 */
+		LWLockAcquire(&stats_data->lock, LW_SHARED);
 		memcpy(entry->data,
 			   pgstat_get_entry_data(kind, stats_data),
 			   kind_info->shared_size);
+		LWLockRelease(&stats_data->lock);
 	}
 	dshash_seq_term(&hstat);
 
diff --git a/src/backend/utils/activity/pgstat_database.c b/src/backend/utils/activity/pgstat_database.c
index d9275611f0..fdf4d022c1 100644
--- a/src/backend/utils/activity/pgstat_database.c
+++ b/src/backend/utils/activity/pgstat_database.c
@@ -364,7 +364,7 @@ pgstat_database_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
 	pendingent = (PgStat_StatDBEntry *) entry_ref->pending;
 	sharedent = (PgStatShared_Database *) entry_ref->shared_stats;
 
-	if (!pgstat_lock_entry(entry_ref, nowait))
+	if (!pgstat_lock_entry(entry_ref, LW_EXCLUSIVE, nowait))
 		return false;
 
 #define PGSTAT_ACCUM_DBCOUNT(item)		\
diff --git a/src/backend/utils/activity/pgstat_function.c b/src/backend/utils/activity/pgstat_function.c
index 427d8c47fc..318db0b3c8 100644
--- a/src/backend/utils/activity/pgstat_function.c
+++ b/src/backend/utils/activity/pgstat_function.c
@@ -200,7 +200,7 @@ pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
 
 	/* localent always has non-zero content */
 
-	if (!pgstat_lock_entry(entry_ref, nowait))
+	if (!pgstat_lock_entry(entry_ref, LW_EXCLUSIVE, nowait))
 		return false;
 
 	shfuncent->stats.f_numcalls += localent->f_counts.f_numcalls;
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index a846d9ffb6..98dda726db 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -782,7 +782,7 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
 		return true;
 	}
 
-	if (!pgstat_lock_entry(entry_ref, nowait))
+	if (!pgstat_lock_entry(entry_ref, LW_EXCLUSIVE, nowait))
 		return false;
 
 	/* add the values to the shared entry. */
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 89060ef29a..fdd20d80a1 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -568,14 +568,14 @@ pgstat_release_entry_ref(PgStat_HashKey key, PgStat_EntryRef *entry_ref,
 }
 
 bool
-pgstat_lock_entry(PgStat_EntryRef *entry_ref, bool nowait)

Re: shared-memory based stats collector - v70

2022-08-08 Thread Greg Stark
I'm trying to wrap my head around the shared memory stats collector
infrastructure from
<20220406030008.2qxipjxo776dw...@alap3.anarazel.de> committed in
5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd.

I have one question about locking -- afaics there's nothing protecting
reading the shared memory stats. There is an lwlock protecting
concurrent updates of the shared memory stats, but that lock isn't
taken when we read the stats. Are we ok relying on atomic 64-bit reads
or is there something else going on that I'm missing?

In particular I'm looking at pgstat.c:847 in pgstat_fetch_entry()
which does this:

memcpy(stats_data,
   pgstat_get_entry_data(kind, entry_ref->shared_stats),
   kind_info->shared_data_len);

stats_data is the returned copy of the stats entry with all the
statistics in it. But it's copied from the shared memory location
directly using memcpy and there's no locking or change counter or
anything protecting this memcpy that I can see.

-- 
greg




Re: shared-memory based stats collector - v70

2022-07-21 Thread Greg Stark
On Wed, 20 Jul 2022 at 15:09, Andres Freund  wrote:
>
> Each backend only had stats for things it touched. But the stats collector 
> read all files at startup into hash tables and absorbed all generated stats 
> into those as well.

Fascinating. I'm surprised this didn't raise issues previously for
people with millions of tables. I wonder if it was causing issues
and we just didn't hear about them because there were other bigger
issues :)


> >All that said -- having all objects loaded in shared memory makes my
> >work way easier.
>
> What are you trying to do?

I'm trying to implement an exporter for prometheus/openmetrics/etc
that dumps directly from shared memory without going through the SQL
backend layer. I believe this will be much more reliable, lower
overhead, safer, and consistent than writing SQL queries.

Ideally I would want to dump out the stats without connecting to each
database. I suspect that would run into problems where the schema
really adds a lot of information (such as which table each index is on,
or which table a toast relation is for). There are also some things
people think of as stats that are maintained in the catalog, such as
reltuples and relpages. So I'm imagining this won't strictly stay true
in the end.

It seems like just having an interface to iterate over the shared hash
table and return entries one by one without filtering by database
would be fairly straightforward and I would be able to do most of what
I want just with that. There's actually enough meta information in the
stats entries to be able to handle them as they come, instead of trying
to look up specific stats one by one.


-- 
greg




Re: shared-memory based stats collector - v70

2022-07-20 Thread Andres Freund
Hi, 

On July 20, 2022 8:41:53 PM GMT+02:00, Greg Stark  wrote:
>On Wed, 20 Jul 2022 at 12:08, Tom Lane  wrote:
>>
>> AFAIR, the previous stats collector implementation had no such provision
>> either: it'd just keep adding hashtable entries as it received info about
>> new objects.  The only thing that's changed is that now those entries are
>> in shared memory instead of process-local memory.  We'd be well advised to
>> be sure that memory can be swapped out under pressure, but otherwise I'm
>> not seeing that things have gotten worse.
>
>Just to be clear I'm not looking for ways things have gotten worse.
>Just trying to understand what I'm reading and I guess I came in with
>assumptions that led me astray.
>
>But... adding entries as it received info about new objects isn't the
>same as having info on everything. I didn't really understand how the
>old system worked, but if you had a very large schema and each session
>only worked with a small subset, did the local stats data ever absorb
>info on the objects it never touched?

Each backend only had stats for things it touched. But the stats collector read 
all files at startup into hash tables and absorbed all generated stats into 
those as well.


>All that said -- having all objects loaded in shared memory makes my
>work way easier.

What are you trying to do?

>It actually seems feasible to dump out all the
>objects from shared memory and including objects from other databases
>and if I don't need a consistent snapshot it even seems like it would
>be possible to do that without having a copy of more than one stats
>entry at a time in local memory. I hope that doesn't cause huge
>contention on the shared hash table to be doing that regularly.

The stats accessors now default to not creating a full snapshot of stats data 
at first access (but that's configurable). So yes, that behavior is possible. 
E.g. autovac now uses a single object access like you describe.

Andres


-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: shared-memory based stats collector - v70

2022-07-20 Thread Greg Stark
On Wed, 20 Jul 2022 at 12:08, Tom Lane  wrote:
>
> AFAIR, the previous stats collector implementation had no such provision
> either: it'd just keep adding hashtable entries as it received info about
> new objects.  The only thing that's changed is that now those entries are
> in shared memory instead of process-local memory.  We'd be well advised to
> be sure that memory can be swapped out under pressure, but otherwise I'm
> not seeing that things have gotten worse.

Just to be clear I'm not looking for ways things have gotten worse.
Just trying to understand what I'm reading and I guess I came in with
assumptions that led me astray.

But... adding entries as it received info about new objects isn't the
same as having info on everything. I didn't really understand how the
old system worked, but if you had a very large schema and each session
only worked with a small subset, did the local stats data ever absorb
info on the objects it never touched?

All that said -- having all objects loaded in shared memory makes my
work way easier. It actually seems feasible to dump out all the
objects from shared memory and including objects from other databases
and if I don't need a consistent snapshot it even seems like it would
be possible to do that without having a copy of more than one stats
entry at a time in local memory. I hope that doesn't cause huge
contention on the shared hash table to be doing that regularly.

-- 
greg




Re: shared-memory based stats collector - v70

2022-07-20 Thread Andres Freund
Hi,

On 2022-07-20 12:08:35 -0400, Tom Lane wrote:
> AFAIR, the previous stats collector implementation had no such provision
> either: it'd just keep adding hashtable entries as it received info about
> new objects.

Yep.


> The only thing that's changed is that now those entries are in shared
> memory instead of process-local memory.  We'd be well advised to be
> sure that memory can be swapped out under pressure, but otherwise I'm
> not seeing that things have gotten worse.

FWIW, I ran a few memory usage benchmarks. Without stats accesses the
memory usage with shared memory stats was sometimes below, sometimes
above the "old" memory usage, depending on the number of objects. As
soon as there's stats access, it's well below (that includes things like
autovac workers).

I think there's quite a bit of memory usage reduction potential around
dsa.c - we occasionally end up with [nearly] unused dsm segments.

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-07-20 Thread Tom Lane
Melanie Plageman  writes:
> On Wed, Jul 20, 2022 at 11:35 AM Greg Stark  wrote:
>> It seems like if we really think the total number of database objects
>> is reasonably limited to scales that fit in RAM there would be a much
>> simpler database design that would just store the catalog tables in
>> simple in-memory data structures and map them all on startup without
>> doing all the work Postgres does to make relational storage scale.

> I think efforts to do such a thing have gotten caught up in solving
> issues around visibility and managing the relationship between local and
> global caches [1]. It doesn't seem like the primary technical concern
> was memory usage.

AFAIR, the previous stats collector implementation had no such provision
either: it'd just keep adding hashtable entries as it received info about
new objects.  The only thing that's changed is that now those entries are
in shared memory instead of process-local memory.  We'd be well advised to
be sure that memory can be swapped out under pressure, but otherwise I'm
not seeing that things have gotten worse.

regards, tom lane




Re: shared-memory based stats collector - v70

2022-07-20 Thread Andres Freund
Hi,

On 2022-07-20 11:35:13 -0400, Greg Stark wrote:
> Is it true that the shared memory allocation contains the hash table
> entry and body of every object in every database?

Yes. However, note that that was already the case with the old stats
collector - it also kept everything in memory. In addition every read
access to stats loaded a copy of the stats (well of the global stats and
the relevant per-database stats).

It might be worth doing something fancier at some point - the shared
memory stats was already a huge effort, cramming yet another change in
there would pretty much have guaranteed that it'd fail.

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-07-20 Thread Melanie Plageman
On Wed, Jul 20, 2022 at 11:35 AM Greg Stark  wrote:

> On the one hand the rest of Postgres seems to be designed on the
> assumption that the number of tables and database objects is limited
> only by disk space. The catalogs are stored in relational storage
> which is read through the buffer cache. On the other hand it's true
> that syscaches don't expire entries (though I think the assumption
> is that no one backend touches very much).
>
> It seems like if we really think the total number of database objects
> is reasonably limited to scales that fit in RAM there would be a much
> simpler database design that would just store the catalog tables in
> simple in-memory data structures and map them all on startup without
> doing all the work Postgres does to make relational storage scale.
>

I think efforts to do such a thing have gotten caught up in solving
issues around visibility and managing the relationship between local and
global caches [1]. It doesn't seem like the primary technical concern
was memory usage.

[1]
https://www.postgresql.org/message-id/flat/4E72940DA2BF16479384A86D54D0988A567B9245%40G01JPEXMBKW04


Re: shared-memory based stats collector - v70

2022-07-20 Thread Greg Stark
So I'm finally wrapping my head around this new code. There is
something I'm surprised by that perhaps I'm misreading or perhaps I
shouldn't be surprised by, not sure.

Is it true that the shared memory allocation contains the hash table
entry and body of every object in every database? I guess I was
assuming I would find some kind of LRU cache which loaded data from
disk on demand. But afaict it loads everything on startup and then
never loads from disk later. The disk is purely for recovering state
after a restart.

On the one hand the rest of Postgres seems to be designed on the
assumption that the number of tables and database objects is limited
only by disk space. The catalogs are stored in relational storage
which is read through the buffer cache. On the other hand it's true
that syscaches don't expire entries (though I think the assumption
is that no one backend touches very much).

It seems like if we really think the total number of database objects
is reasonably limited to scales that fit in RAM there would be a much
simpler database design that would just store the catalog tables in
simple in-memory data structures and map them all on startup without
doing all the work Postgres does to make relational storage scale.




Re: shared-memory based stats collector - v70

2022-04-14 Thread Andres Freund
Hi,

On 2022-04-13 17:55:18 -0700, Andres Freund wrote:
> On 2022-04-13 16:56:45 -0700, David G. Johnston wrote:
> > On Wed, Apr 13, 2022 at 4:34 PM David G. Johnston <
> > david.g.johns...@gmail.com> wrote:
> > Sorry, apparently this "2000-01-01" behavior only manifests after crash
> > recovery on v15 (didn't check v14); after a clean initdb on v15 I got the
> > same initdb timestamp.
> 
> > Feels like we should still report the "end of crash recovery timestamp" for
> > these instead of 2000-01-01 (which I guess is derived from 0) if we are not
> > willing to produce null (and it seems other parts of the system using these
> > stats assumes non-null).
> 
> Yes, that's definitely not correct. I see the bug (need to call
> pgstat_reset_after_failure(); in pgstat_discard_stats()). Stupid, but
> easy to fix - too fried to write a test tonight, but will commit the fix
> tomorrow.

Pushed the fix (including a test that previously failed). Thanks again!

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-04-13 Thread Michael Paquier
On Wed, Apr 13, 2022 at 04:56:45PM -0700, David G. Johnston wrote:
> Sorry, apparently this "2000-01-01" behavior only manifests after crash
> recovery on v15 (didn't check v14); after a clean initdb on v15 I got the
> same initdb timestamp.
> 
> Feels like we should still report the "end of crash recovery timestamp" for
> these instead of 2000-01-01 (which I guess is derived from 0) if we are not
> willing to produce null (and it seems other parts of the system using these
> stats assumes non-null).

I can see this timestamp as well after crash recovery.  This seems
rather misleading to me.  I have added an open item.
--
Michael




Re: shared-memory based stats collector - v70

2022-04-13 Thread Andres Freund
Hi,

On 2022-04-13 16:56:45 -0700, David G. Johnston wrote:
> On Wed, Apr 13, 2022 at 4:34 PM David G. Johnston <
> david.g.johns...@gmail.com> wrote:
> Sorry, apparently this "2000-01-01" behavior only manifests after crash
> recovery on v15 (didn't check v14); after a clean initdb on v15 I got the
> same initdb timestamp.

> Feels like we should still report the "end of crash recovery timestamp" for
> these instead of 2000-01-01 (which I guess is derived from 0) if we are not
> willing to produce null (and it seems other parts of the system using these
> stats assumes non-null).

Yes, that's definitely not correct. I see the bug (need to call
pgstat_reset_after_failure(); in pgstat_discard_stats()). Stupid, but
easy to fix - too fried to write a test tonight, but will commit the fix
tomorrow.

Thanks for catching!

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-04-13 Thread David G. Johnston
On Wed, Apr 13, 2022 at 4:34 PM David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Tue, Apr 5, 2022 at 8:00 PM Andres Freund  wrote:
>
>> Here comes v70:
>>
>>
> One thing I just noticed while peeking at pg_stat_slru:
>
> The stats_reset column for my newly initdb'd cluster is showing me
> "2000-01-01 00:00:00" (v15).  I was expecting null, though a non-null value
> restriction does make sense.  Neither choice is documented though.
>
> Based upon my expectation I checked to see if v14 reported null, and thus
> this was a behavior change.  v14 reports the initdb timestamp (e.g.,
> 2022-04-13 23:26:48.349115+00)
>
> Can we document the non-null aspect of this value (pg_stat_database is
> happy being null, this seems to be a "fixed" type behavior) but have it
> continue to report initdb as its initial value?
>
>
Sorry, apparently this "2000-01-01" behavior only manifests after crash
recovery on v15 (didn't check v14); after a clean initdb on v15 I got the
same initdb timestamp.

Feels like we should still report the "end of crash recovery timestamp" for
these instead of 2000-01-01 (which I guess is derived from 0) if we are not
willing to produce null (and it seems other parts of the system using these
stats assume non-null).

David J.



Re: shared-memory based stats collector - v70

2022-04-13 Thread David G. Johnston
On Tue, Apr 5, 2022 at 8:00 PM Andres Freund  wrote:

> Here comes v70:
>
>
One thing I just noticed while peeking at pg_stat_slru:

The stats_reset column for my newly initdb'd cluster is showing me
"2000-01-01 00:00:00" (v15).  I was expecting null, though a non-null value
restriction does make sense.  Neither choice is documented though.

Based upon my expectation I checked to see if v14 reported null, and thus
this was a behavior change.  v14 reports the initdb timestamp (e.g.,
2022-04-13 23:26:48.349115+00)

Can we document the non-null aspect of this value (pg_stat_database is
happy being null, this seems to be a "fixed" type behavior) but have it
continue to report initdb as its initial value?

David J.


Re: shared-memory based stats collector - v70

2022-04-09 Thread David G. Johnston
On Sat, Apr 9, 2022 at 12:07 PM Andres Freund  wrote:

>
> > ... specific counters.  In particular, replay will not increment
> > pg_stat_database or pg_stat_all_tables columns, and the startup process
> > will not report reads and writes for the pg_statio views.
> >
> > It would be helpful to give at least one specific example of what is being
> > recorded normally, especially since we give three of what is not.
>
> The second sentence is a set of examples - or do you mean examples for what
> actions by the startup process are counted?
>
>
Specific views that these statistics will be updating; like
pg_stat_database being the example of a view that is not updating.

David J.


Re: shared-memory based stats collector - v70

2022-04-09 Thread Andres Freund
Hi,

On 2022-04-07 21:39:55 -0700, David G. Johnston wrote:
> On Thu, Apr 7, 2022 at 8:59 PM Andres Freund  wrote:
> 
> >
> >   
> >Cumulative statistics are collected in shared memory. Every
> >PostgreSQL process collects statistics
> > locally
> >then updates the shared data at appropriate intervals.  When a server,
> >including a physical replica, shuts down cleanly, a permanent copy of
> > the
> >statistics data is stored in the pg_stat
> > subdirectory,
> >so that statistics can be retained across server restarts.  In contrast,
> >when starting from an unclean shutdown (e.g., after an immediate
> > shutdown,
> >a server crash, starting from a base backup, and point-in-time
> > recovery),
> >all statistics counters are reset.
> >   
> >
> 
> I like this.  My comment regarding using "i.e.," here stands though.

Argh. I'd used it in e.g., but not in i.e..


> >
> >
> > The cumulative statistics system is active during recovery. All scans,
> > reads, blocks, index usage, etc., will be recorded normally on the
> > standby. However, WAL replay will not increment relation and database
> > specific counters. I.e. replay will not increment pg_stat_all_tables
> > columns (like n_tup_ins), nor will reads or writes performed by the
> > startup process be tracked in the pg_statio views, nor will associated
> > pg_stat_database columns be incremented.
> >
> >
> >
> I like this too.  The second part with three nors is a bit rough.  Maybe:

Agreed. I tried to come up with a smoother formulation, but didn't (perhaps
because I was a tad tired).


> ... specific counters.  In particular, replay will not increment
> pg_stat_database or pg_stat_all_tables columns, and the startup process
> will not report reads and writes for the pg_statio views.
> 
> It would be helpful to give at least one specific example of what is being
> recorded normally, especially since we give three of what is not.

The second sentence is a set of examples - or do you mean examples for what
actions by the startup process are counted?

Greetings,

Andres Freund




Re: shared-memory based stats collector

2022-04-08 Thread Andres Freund
Hi, 

On April 8, 2022 4:49:48 AM PDT, Ranier Vilela  wrote:
>Hi,
>
>Per Coverity.
>
>pgstat_reset_entry does not check whether the lock was actually acquired.
>I think if shared_stat_reset_contents is called without the lock held,
>isn't that an issue?

I don't think so - the nowait parameter is set to false, so the lock 
acquisition is blocking.

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: shared-memory based stats collector

2022-04-08 Thread Ranier Vilela
Hi,

Per Coverity.

pgstat_reset_entry does not check whether the lock was actually acquired.
I think if shared_stat_reset_contents is called without the lock held,
isn't that an issue?

regards,

Ranier Vilela


0001-avoid-reset-stats-without-lock.patch
Description: Binary data


Re: shared-memory based stats collector - v70

2022-04-07 Thread Kyotaro Horiguchi
At Thu, 7 Apr 2022 20:59:21 -0700, Andres Freund  wrote in 
> Hi,
> 
> On 2022-04-08 11:10:14 +0900, Kyotaro Horiguchi wrote:
> > I can read it. But I'm not sure that the difference is obvious for
> > average users between "starting a standby from a basebackup" and
> > "starting a standby after a normal shutdown"..
> 
> Yea, that's what I was concerned about. How about:
> 
>   
>Cumulative statistics are collected in shared memory. Every
>PostgreSQL process collects statistics locally
>then updates the shared data at appropriate intervals.  When a server,
>including a physical replica, shuts down cleanly, a permanent copy of the
>statistics data is stored in the pg_stat subdirectory,
>so that statistics can be retained across server restarts.  In contrast,
>when starting from an unclean shutdown (e.g., after an immediate shutdown,
>a server crash, starting from a base backup, and point-in-time recovery),
>all statistics counters are reset.
>   

Looks perfect generally, and especially in regard to the concern.

> I think I like my version above a bit better?

Quite a bit.  Mine didn't address the concern.

> > > 2)
> > > The edit is not a problem, but it's hard to understand what the existing
> > > paragraph actually means?
> > > 
> > > diff --git a/doc/src/sgml/high-availability.sgml 
> > > b/doc/src/sgml/high-availability.sgml
> > > index 3247e056663..8bfb584b752 100644
> > > --- a/doc/src/sgml/high-availability.sgml
> > > +++ b/doc/src/sgml/high-availability.sgml
> > > @@ -,17 +,17 @@ HINT:  You can then restart the server after 
> > > making the necessary configuration
> > > ...
> > > 
> > > -The statistics collector is active during recovery. All scans, 
> > > reads, blocks,
> > > +The cumulative statistics system is active during recovery. All 
> > > scans, reads, blocks,
> > >  index usage, etc., will be recorded normally on the standby. Replayed
> > >  actions will not duplicate their effects on primary, so replaying an
> > >  insert will not increment the Inserts column of pg_stat_user_tables.
> > >  The stats file is deleted at the start of recovery, so stats from 
> > > primary
> > >  and standby will differ; this is considered a feature, not a bug.
> > > 
> > > 
> > > 
> > 
> > Agreed partially. It's too detailed.  It might not need to mention WAL
> > replay.
> 
> My concern is more that it seems halfway nonsensical. "Replayed actions will
> not duplicate their effects on primary" - I can guess what that means, but not
> more. There's no "Inserts" column of pg_stat_user_tables.
> 
> 
>
> The cumulative statistics system is active during recovery. All scans,
> reads, blocks, index usage, etc., will be recorded normally on the
> standby. However, WAL replay will not increment relation and database
> specific counters. I.e. replay will not increment pg_stat_all_tables
> columns (like n_tup_ins), nor will reads or writes performed by the
> startup process be tracked in the pg_statio views, nor will associated
> pg_stat_database columns be incremented.
>

Looks clearer since it mentions user-facing interfaces with concrete
example columns.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: shared-memory based stats collector - v70

2022-04-07 Thread David G. Johnston
On Thu, Apr 7, 2022 at 8:59 PM Andres Freund  wrote:

>
>   
>Cumulative statistics are collected in shared memory. Every
>PostgreSQL process collects statistics
> locally
>then updates the shared data at appropriate intervals.  When a server,
>including a physical replica, shuts down cleanly, a permanent copy of
> the
>statistics data is stored in the pg_stat
> subdirectory,
>so that statistics can be retained across server restarts.  In contrast,
>when starting from an unclean shutdown (e.g., after an immediate
> shutdown,
>a server crash, starting from a base backup, and point-in-time
> recovery),
>all statistics counters are reset.
>   
>

I like this.  My comment regarding using "i.e.," here stands though.


>
>
> The cumulative statistics system is active during recovery. All scans,
> reads, blocks, index usage, etc., will be recorded normally on the
> standby. However, WAL replay will not increment relation and database
> specific counters. I.e. replay will not increment pg_stat_all_tables
> columns (like n_tup_ins), nor will reads or writes performed by the
> startup process be tracked in the pg_statio views, nor will associated
> pg_stat_database columns be incremented.
>
>
>
I like this too.  The second part with three nors is a bit rough.  Maybe:

... specific counters.  In particular, replay will not increment
pg_stat_database or pg_stat_all_tables columns, and the startup process
will not report reads and writes for the pg_statio views.

It would helpful to give at least one specific example of what is being
recorded normally, especially since we give three of what is not.

David J.


Re: shared-memory based stats collector - v70

2022-04-07 Thread Andres Freund
On 2022-04-07 16:37:51 -0700, Andres Freund wrote:
> On 2022-04-07 00:28:45 -0700, Andres Freund wrote:
> > I've gotten through the main commits (and then a fix for the apparently
> > inevitable bug that's immediately highlighted by the buildfarm), and the 
> > first
> > test. I'll call it a night now, and work on the other tests & docs tomorrow.
> 
> I've gotten through the tests now. There's one known, not yet addressed, issue
> with the stats isolation test, see [1].

That has since been fixed, in d6c0db14836cd843d589372d909c73aab68c7a24




Re: shared-memory based stats collector - v70

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-07 20:59:21 -0700, Andres Freund wrote:
> 
>   
>Cumulative statistics are collected in shared memory. Every
>PostgreSQL process collects statistics locally
>then updates the shared data at appropriate intervals.  When a server,
>including a physical replica, shuts down cleanly, a permanent copy of the
>statistics data is stored in the pg_stat subdirectory,
>so that statistics can be retained across server restarts.  In contrast,
>when starting from an unclean shutdown (e.g., after an immediate shutdown,
>a server crash, starting from a base backup, and point-in-time recovery),
>all statistics counters are reset.
>   
> ...
>
> The cumulative statistics system is active during recovery. All scans,
> reads, blocks, index usage, etc., will be recorded normally on the
> standby. However, WAL replay will not increment relation and database
> specific counters. I.e. replay will not increment pg_stat_all_tables
> columns (like n_tup_ins), nor will reads or writes performed by the
> startup process be tracked in the pg_statio views, nor will associated
> pg_stat_database columns be incremented.
>

I went with these for now. My guess is that there's further improvements in
them, and in surrounding areas...


With that, I'll close this CF entry. It's been a while.

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-07 20:51:10 -0700, David G. Johnston wrote:
> On Thu, Apr 7, 2022 at 7:10 PM Kyotaro Horiguchi 
> wrote:
> > I can read it. But I'm not sure that the difference is obvious for
> > average users between "starting a standby from a basebackup" and
> > "starting a standby after a normal shutdown"..
> >
> > Other than that, it might be easier to read if the additional part
> > were moved out to the end of the paragraph, prefixing with "Note:
> > ". For example,
> >
> > ...
> > statistics can be retained across server restarts.  When crash recovery is
> > performed at server start (e.g., after immediate shutdown, server crash,
> > and point-in-time recovery), all statistics counters are reset. Note that
> > crash recovery is not performed when starting a standby that was shut
> > down normally then all counters are retained.
> >
> >
> Maybe:
> When the server shuts down cleanly a permanent copy of the statistics data
> is stored in the pg_stat subdirectory so that
> statistics can be retained across server restarts.  However, if crash
> recovery is performed (i.e., after immediate shutdown, server crash, or
> point-in-time recovery), all statistics counters are reset.  For any
> standby server, the initial startup to get the cluster initialized is a
> point-in-time crash recovery startup. For all subsequent startups it
> behaves like any other server.  For a hot standby server, statistics are
> retained during a failover promotion.
> 
> I'm pretty sure i.e., is correct since those three situations are not
> examples but rather the complete set.

I don't think the "initial startup ..." bit is quite correct. A standby can be
created for a shut down server, and IIRC there's some differences in how PITR
is handled too.


> Is crash recovery ever performed other than at server start?  If not, I
> choose to remove the redundancy.

No.


> I feel like some of this detail about standby servers is/should be covered
> elsewhere and we are at least missing a cross-reference chance even if we
> leave the material coverage as-is.

I didn't find anything good to reference...


> > 2)
> > > The edit is not a problem, but it's hard to understand what the existing
> > > paragraph actually means?
> > >
> > > diff --git a/doc/src/sgml/high-availability.sgml
> > b/doc/src/sgml/high-availability.sgml
> > > index 3247e056663..8bfb584b752 100644
> > > --- a/doc/src/sgml/high-availability.sgml
> > > +++ b/doc/src/sgml/high-availability.sgml
> > > @@ -,17 +,17 @@ HINT:  You can then restart the server after
> > making the necessary configuration
> > > ...
> > > 
> > > -The statistics collector is active during recovery. All scans,
> > reads, blocks,
> > > +The cumulative statistics system is active during recovery. All
> > scans, reads, blocks,
> > >  index usage, etc., will be recorded normally on the standby.
> > Replayed
> > >  actions will not duplicate their effects on primary, so replaying an
> > >  insert will not increment the Inserts column of pg_stat_user_tables.
> > >  The stats file is deleted at the start of recovery, so stats from
> > primary
> > >  and standby will differ; this is considered a feature, not a bug.
> > > 
> > >
> > > 
> >
> > Agreed partially. It's too detailed.  It might not need to mention WAL
> > replay.
> >
> >
> The insert example seems like a poor one...IIUC cumulative statistics are
> not WAL logged and while in recovery INSERT is prohibited, so how would
> replaying the insert in the WAL result in a duplicated effect on
> pg_stat_user_tables.inserts?

I agree, the sentence doesn't make much sense.

It doesn't really matter that stats aren't WAL logged, one could infer them
from the WAL at a decent level of accuracy. However, we can't actually
associate those actions to relations, we just know the relfilenode... And the
startup process can't read the catalog to figure the mapping out.


> I also have no idea what, in the fragment, "Replayed actions will not
> duplicate their effects on primary...", what "on primary" is supposed to
> mean.

I think it's trying to say "will not duplicate the effect on
pg_stat_user_tables they had on the primary".


> I would like to write the following but I don't believe it is sufficiently
> true:
> 
> "The cumulative statistics system records only the locally generated
> activity of the cluster, including while in recovery.  Activity happening
> only due to the replay of WAL is not considered local."
> 
> But to apply the WAL we have to fetch blocks from the local filesystem and
> write them back out.  That is local activity happening due to the replay of
> WAL which sounds like it is and should be reported ("All...blocks...", and
> the example given being logical DDL oriented).

That's not true today - the startup processes reads / writes aren't reflected
in pg_statio, pg_stat_database or whatnot.


> I cannot think of a better paragraph at the moment, the minimal change is
> good, and the detail it 

Re: shared-memory based stats collector - v70

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-08 11:10:14 +0900, Kyotaro Horiguchi wrote:
> At Thu, 7 Apr 2022 16:37:51 -0700, Andres Freund  wrote 
> in 
> > Hi,
> > 
> > On 2022-04-07 00:28:45 -0700, Andres Freund wrote:
> > > I've gotten through the main commits (and then a fix for the apparently
> > > inevitable bug that's immediately highlighted by the buildfarm), and the 
> > > first
> > > test. I'll call it a night now, and work on the other tests & docs 
> > > tomorrow.
> > 
> > I've gotten through the tests now. There's one known, not yet addressed, 
> > issue
> > with the stats isolation test, see [1].
> > 
> > 
> > Working on the docs. Found a few things worth raising:
> > 
> > 1)
> > Existing text:
> >When the server shuts down cleanly, a permanent copy of the statistics
> >data is stored in the pg_stat subdirectory, so that
> >statistics can be retained across server restarts.  When recovery is
> >performed at server start (e.g., after immediate shutdown, server crash,
> >and point-in-time recovery), all statistics counters are reset.
> > 
> > The existing docs patch hadn't updated yet. My current edit is
> > 
> >When the server shuts down cleanly, a permanent copy of the statistics
> >data is stored in the pg_stat subdirectory, so that
> >statistics can be retained across server restarts.  When crash recovery 
> > is
> >performed at server start (e.g., after immediate shutdown, server crash,
> >and point-in-time recovery, but not when starting a standby that was shut
> >down normally), all statistics counters are reset.
> > 
> > but I'm not sure the parenthetical is easy enough to understand?
> 
> I can read it. But I'm not sure that the difference is obvious for
> average users between "starting a standby from a basebackup" and
> "starting a standby after a normal shutdown"..

Yea, that's what I was concerned about. How about:

  
   Cumulative statistics are collected in shared memory. Every
   PostgreSQL process collects statistics locally
   then updates the shared data at appropriate intervals.  When a server,
   including a physical replica, shuts down cleanly, a permanent copy of the
   statistics data is stored in the pg_stat subdirectory,
   so that statistics can be retained across server restarts.  In contrast,
   when starting from an unclean shutdown (e.g., after an immediate shutdown,
   a server crash, starting from a base backup, and point-in-time recovery),
   all statistics counters are reset.
  


> Other than that, it might be easier to read if the additional part
> were moved out to the end of the paragraph, prefixing with "Note:
> ". For example,
> 
> ...
> statistics can be retained across server restarts.  When crash recovery is
> performed at server start (e.g., after immediate shutdown, server crash,
> and point-in-time recovery), all statistics counters are reset. Note that
> crash recovery is not performed when starting a standby that was shut
> down normally then all counters are retained.

I think I like my version above a bit better?


> > 2)
> > The edit is not a problem, but it's hard to understand what the existing
> > paragraph actually means?
> > 
> > diff --git a/doc/src/sgml/high-availability.sgml 
> > b/doc/src/sgml/high-availability.sgml
> > index 3247e056663..8bfb584b752 100644
> > --- a/doc/src/sgml/high-availability.sgml
> > +++ b/doc/src/sgml/high-availability.sgml
> > @@ -,17 +,17 @@ HINT:  You can then restart the server after making 
> > the necessary configuration
> > ...
> > 
> > -The statistics collector is active during recovery. All scans, reads, 
> > blocks,
> > +The cumulative statistics system is active during recovery. All scans, 
> > reads, blocks,
> >  index usage, etc., will be recorded normally on the standby. Replayed
> >  actions will not duplicate their effects on primary, so replaying an
> >  insert will not increment the Inserts column of pg_stat_user_tables.
> >  The stats file is deleted at the start of recovery, so stats from 
> > primary
> >  and standby will differ; this is considered a feature, not a bug.
> > 
> > 
> > 
> 
> Agreed partially. It's too detailed.  It might not need to mention WAL
> replay.

My concern is more that it seems halfway nonsensical. "Replayed actions will
not duplicate their effects on primary" - I can guess what that means, but not
more. There's no "Inserts" column of pg_stat_user_tables.


   
The cumulative statistics system is active during recovery. All scans,
reads, blocks, index usage, etc., will be recorded normally on the
standby. However, WAL replay will not increment relation and database
specific counters. I.e. replay will not increment pg_stat_all_tables
columns (like n_tup_ins), nor will reads or writes performed by the
startup process be tracked in the pg_statio views, nor will associated
pg_stat_database columns be incremented.
   



Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-04-07 Thread David G. Johnston
On Thu, Apr 7, 2022 at 7:10 PM Kyotaro Horiguchi 
wrote:

> At Thu, 7 Apr 2022 16:37:51 -0700, Andres Freund 
> wrote in
> > Hi,
> >
> > On 2022-04-07 00:28:45 -0700, Andres Freund wrote:
> > > I've gotten through the main commits (and then a fix for the apparently
> > > inevitable bug that's immediately highlighted by the buildfarm), and
> the first
> > > test. I'll call it a night now, and work on the other tests & docs
> tomorrow.
> >
> > I've gotten through the tests now. There's one known, not yet addressed,
> issue
> > with the stats isolation test, see [1].
> >
> >
> > Working on the docs. Found a few things worth raising:
> >
> > 1)
> > Existing text:
> >When the server shuts down cleanly, a permanent copy of the statistics
> >data is stored in the pg_stat subdirectory, so
> that
> >statistics can be retained across server restarts.  When recovery is
> >performed at server start (e.g., after immediate shutdown, server
> crash,
> >and point-in-time recovery), all statistics counters are reset.
> >
> > The existing docs patch hadn't updated yet. My current edit is
> >
> >When the server shuts down cleanly, a permanent copy of the statistics
> >data is stored in the pg_stat subdirectory, so
> that
> >statistics can be retained across server restarts.  When crash
> recovery is
> >performed at server start (e.g., after immediate shutdown, server
> crash,
> >and point-in-time recovery, but not when starting a standby that was
> shut
> >down normally), all statistics counters are reset.
> >
> > but I'm not sure the parenthetical is easy enough to understand?
>
> I can read it. But I'm not sure that the difference is obvious for
> average users between "starting a standby from a basebackup" and
> "starting a standby after a normal shutdown"..
>
> Other than that, it might be easier to read if the additional part
> were moved out to the end of the paragraph, prefixing with "Note:
> ". For example,
>
> ...
> statistics can be retained across server restarts.  When crash recovery is
> performed at server start (e.g., after immediate shutdown, server crash,
> and point-in-time recovery), all statistics counters are reset. Note that
> crash recovery is not performed when starting a standby that was shut
> down normally then all counters are retained.
>
>
Maybe:
When the server shuts down cleanly a permanent copy of the statistics data
is stored in the pg_stat subdirectory so that
statistics can be retained across server restarts.  However, if crash
recovery is performed (i.e., after immediate shutdown, server crash, or
point-in-time recovery), all statistics counters are reset.  For any
standby server, the initial startup to get the cluster initialized is a
point-in-time crash recovery startup. For all subsequent startups it
behaves like any other server.  For a hot standby server, statistics are
retained during a failover promotion.

I'm pretty sure i.e., is correct since those three situations are not
examples but rather the complete set.

Is crash recovery ever performed other than at server start?  If not, I
choose to remove the redundancy.

I feel like some of this detail about standby servers is/should be covered
elsewhere and we are at least missing a cross-reference chance even if we
leave the material coverage as-is.

> 2)
> > The edit is not a problem, but it's hard to understand what the existing
> > paragraph actually means?
> >
> > diff --git a/doc/src/sgml/high-availability.sgml
> b/doc/src/sgml/high-availability.sgml
> > index 3247e056663..8bfb584b752 100644
> > --- a/doc/src/sgml/high-availability.sgml
> > +++ b/doc/src/sgml/high-availability.sgml
> > @@ -,17 +,17 @@ HINT:  You can then restart the server after
> making the necessary configuration
> > ...
> > 
> > -The statistics collector is active during recovery. All scans,
> reads, blocks,
> > +The cumulative statistics system is active during recovery. All
> scans, reads, blocks,
> >  index usage, etc., will be recorded normally on the standby.
> Replayed
> >  actions will not duplicate their effects on primary, so replaying an
> >  insert will not increment the Inserts column of pg_stat_user_tables.
> >  The stats file is deleted at the start of recovery, so stats from
> primary
> >  and standby will differ; this is considered a feature, not a bug.
> > 
> >
> > 
>
> Agreed partially. It's too detailed.  It might not need to mention WAL
> replay.
>
>
The insert example seems like a poor one...IIUC cumulative statistics are
not WAL logged and while in recovery INSERT is prohibited, so how would
replaying the insert in the WAL result in a duplicated effect on
pg_stat_user_tables.inserts?

I also have no idea what, in the fragment, "Replayed actions will not
duplicate their effects on primary...", what "on primary" is supposed to
mean.

I would like to write the following but I don't believe it is sufficiently
true:

"The cumulative statistics system 

Re: shared-memory based stats collector - v70

2022-04-07 Thread Kyotaro Horiguchi
At Thu, 7 Apr 2022 16:37:51 -0700, Andres Freund  wrote in 
> Hi,
> 
> On 2022-04-07 00:28:45 -0700, Andres Freund wrote:
> > I've gotten through the main commits (and then a fix for the apparently
> > inevitable bug that's immediately highlighted by the buildfarm), and the 
> > first
> > test. I'll call it a night now, and work on the other tests & docs tomorrow.
> 
> I've gotten through the tests now. There's one known, not yet addressed, issue
> with the stats isolation test, see [1].
> 
> 
> Working on the docs. Found a few things worth raising:
> 
> 1)
> Existing text:
>When the server shuts down cleanly, a permanent copy of the statistics
>data is stored in the pg_stat subdirectory, so that
>statistics can be retained across server restarts.  When recovery is
>performed at server start (e.g., after immediate shutdown, server crash,
>and point-in-time recovery), all statistics counters are reset.
> 
> The existing docs patch hadn't updated yet. My current edit is
> 
>When the server shuts down cleanly, a permanent copy of the statistics
>data is stored in the pg_stat subdirectory, so that
>statistics can be retained across server restarts.  When crash recovery is
>performed at server start (e.g., after immediate shutdown, server crash,
>and point-in-time recovery, but not when starting a standby that was shut
>down normally), all statistics counters are reset.
> 
> but I'm not sure the parenthetical is easy enough to understand?

I can read it. But I'm not sure that the difference is obvious for
average users between "starting a standby from a basebackup" and
"starting a standby after a normal shutdown"..

Other than that, it might be easier to read if the additional part
were moved out to the end of the paragraph, prefixing with "Note:
". For example,

...
statistics can be retained across server restarts.  When crash recovery is
performed at server start (e.g., after immediate shutdown, server crash,
and point-in-time recovery), all statistics counters are reset. Note that
crash recovery is not performed when starting a standby that was shut
down normally then all counters are retained.

> 2)
> The edit is not a problem, but it's hard to understand what the existing
> paragraph actually means?
> 
> diff --git a/doc/src/sgml/high-availability.sgml 
> b/doc/src/sgml/high-availability.sgml
> index 3247e056663..8bfb584b752 100644
> --- a/doc/src/sgml/high-availability.sgml
> +++ b/doc/src/sgml/high-availability.sgml
> @@ -,17 +,17 @@ HINT:  You can then restart the server after making 
> the necessary configuration
> ...
> 
> -The statistics collector is active during recovery. All scans, reads, 
> blocks,
> +The cumulative statistics system is active during recovery. All scans, 
> reads, blocks,
>  index usage, etc., will be recorded normally on the standby. Replayed
>  actions will not duplicate their effects on primary, so replaying an
>  insert will not increment the Inserts column of pg_stat_user_tables.
>  The stats file is deleted at the start of recovery, so stats from primary
>  and standby will differ; this is considered a feature, not a bug.
> 
> 
> 

Agreed partially. It's too detailed.  It might not need to mention WAL
replay.

> I'll just commit the necessary bit, but we really ought to rephrase this.
> 
> 
> 
> 
> Greetings,
> 
> Andres Freund
> 
> [1] 
> https://www.postgresql.org/message-id/20220407165709.jgdkrzqlkcwue6ko%40alap3.anarazel.de

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: shared-memory based stats collector - v70

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-07 00:28:45 -0700, Andres Freund wrote:
> I've gotten through the main commits (and then a fix for the apparently
> inevitable bug that's immediately highlighted by the buildfarm), and the first
> test. I'll call it a night now, and work on the other tests & docs tomorrow.

I've gotten through the tests now. There's one known, not yet addressed, issue
with the stats isolation test, see [1].


Working on the docs. Found a few things worth raising:

1)
Existing text:
   When the server shuts down cleanly, a permanent copy of the statistics
   data is stored in the pg_stat subdirectory, so that
   statistics can be retained across server restarts.  When recovery is
   performed at server start (e.g., after immediate shutdown, server crash,
   and point-in-time recovery), all statistics counters are reset.

The existing docs patch hadn't updated yet. My current edit is

   When the server shuts down cleanly, a permanent copy of the statistics
   data is stored in the pg_stat subdirectory, so that
   statistics can be retained across server restarts.  When crash recovery is
   performed at server start (e.g., after immediate shutdown, server crash,
   and point-in-time recovery, but not when starting a standby that was shut
   down normally), all statistics counters are reset.

but I'm not sure the parenthetical is easy enough to understand?


2)
The edit is not a problem, but it's hard to understand what the existing
paragraph actually means?

diff --git a/doc/src/sgml/high-availability.sgml 
b/doc/src/sgml/high-availability.sgml
index 3247e056663..8bfb584b752 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -,17 +,17 @@ HINT:  You can then restart the server after making the necessary configuration
...

-The statistics collector is active during recovery. All scans, reads, blocks,
+The cumulative statistics system is active during recovery. All scans, reads, blocks,
 index usage, etc., will be recorded normally on the standby. Replayed
 actions will not duplicate their effects on primary, so replaying an
 insert will not increment the Inserts column of pg_stat_user_tables.
 The stats file is deleted at the start of recovery, so stats from primary
 and standby will differ; this is considered a feature, not a bug.




I'll just commit the necessary bit, but we really ought to rephrase this.




Greetings,

Andres Freund

[1] 
https://www.postgresql.org/message-id/20220407165709.jgdkrzqlkcwue6ko%40alap3.anarazel.de




Re: shared-memory based stats collector - v70

2022-04-07 Thread Kyotaro Horiguchi
At Thu, 7 Apr 2022 00:28:45 -0700, Andres Freund  wrote in 
> Hi,
> 
> On 2022-04-05 20:00:08 -0700, Andres Freund wrote:
> > It'll be a few hours to get to the main commit - but except for 0001 it
> > doesn't make sense to push without intending to push later changes too. I
> > might squash a few commits togther.
> 
> I've gotten through the main commits (and then a fix for the apparently
> inevitable bug that's immediately highlighted by the buildfarm), and the first
> test. I'll call it a night now, and work on the other tests & docs tomorrow.

Thank you very much for the great effort on this to make it get in!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: shared-memory based stats collector - v70

2022-04-07 Thread Kyotaro Horiguchi
At Wed, 6 Apr 2022 18:58:52 -0700, Andres Freund  wrote in 
> Hi,
> 
> On 2022-04-07 10:36:30 +0900, Kyotaro Horiguchi wrote:
> > At Wed, 6 Apr 2022 09:04:09 -0700, Andres Freund  wrote 
> > in 
> > > I think there's no switches left now, so it's not actually providing too 
> > > much.
> > 
> > (Ouch!)
> 
> I think it's great that there's no switches left - means we're pretty close to
> pgstat being runtime extensible...

Yes. I agree.

> It's now pgstat_get_kind_from_str().

I'm fine with it.

> It was harder to see earlier (I certainly didn't really see it) - because
> there were so many "violations" - but most of pgstat is
> pgstat__() or just _. I'd already moved most of
> the patch series over to that (maybe in v68 or so). Now I also did that with
> the internal functions.
> 
> There's a few functions breaking that pattern, partially because I added them
> :(, but since they're not touched in these patches I've not renamed them. But
> it's probably worth doing so tomorrow.

That being said, it gets far cleaner.  Thanks!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: shared-memory based stats collector - v70

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-05 20:00:08 -0700, Andres Freund wrote:
> It'll be a few hours to get to the main commit - but except for 0001 it
> doesn't make sense to push without intending to push later changes too. I
> might squash a few commits togther.

I've gotten through the main commits (and then a fix for the apparently
inevitable bug that's immediately highlighted by the buildfarm), and the first
test. I'll call it a night now, and work on the other tests & docs tomorrow.

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-04-06 Thread Andres Freund
Hi,

On 2022-04-07 10:36:30 +0900, Kyotaro Horiguchi wrote:
> At Wed, 6 Apr 2022 09:04:09 -0700, Andres Freund  wrote 
> in 
> > > +  * Don't define an INVALID value so switch() statements can warn if some
> > > +  * cases aren't covered. But define the first member to 1 so that
> > > +  * uninitialized values can be detected more easily.
> > > 
> > > FWIW, I like this.
> > 
> > I think there's no switches left now, so it's not actually providing too 
> > much.
> 
> (Ouch!)

I think it's great that there's no switches left - means we're pretty close to
pgstat being runtime extensible...


> > > 0010:
> > > (I didn't look this closer. The comments arised while looking other
> > > patches.)
> > > 
> > > +pgstat_kind_from_str(char *kind_str)
> > > 
> > > I don't think I like "str" so much.  Don't we spell it as
> > > "pgstat_kind_from_name"?
> >
> > name makes me think of NameData.  What do you dislike about str? We seem to 
> > use
> > str in plenty places?
> 
> For clarity, I don't dislike it so much.  So, I'm fine with the
> current name.
> 
> I found that you meant a type by the "str".  I thought it as an
> instance (I'm not sure I can express my feeling correctly here..) and
> the following functions were in my mind.
> 
> char *get_namespace/rel/collation/func_name(Oid someoid)
> char *pgstat_slru_name(int slru_idx)
> 
> Another instance of the same direction is
> 
> ForkNumber forkname_to_number(const char *forkName)

It's now pgstat_get_kind_from_str().

It was harder to see earlier (I certainly didn't really see it) - because
there were so many "violations" - but most of pgstat is
pgstat__() or just _. I'd already moved most of
the patch series over to that (maybe in v68 or so). Now I also did that with
the internal functions.

There's a few functions breaking that pattern, partially because I added them
:(, but since they're not touched in these patches I've not renamed them. But
it's probably worth doing so tomorrow.

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-04-06 Thread Kyotaro Horiguchi
At Wed, 6 Apr 2022 09:04:09 -0700, Andres Freund  wrote in 
> > +* Don't define an INVALID value so switch() statements can warn if some
> > +* cases aren't covered. But define the first member to 1 so that
> > +* uninitialized values can be detected more easily.
> > 
> > FWIW, I like this.
> 
> I think there's no switches left now, so it's not actually providing too much.

(Ouch!)

> > 0008:
> > 
> > +   xact_desc_stats(buf, "", parsed.nstats, parsed.stats);
> > +   xact_desc_stats(buf, "commit ", parsed.nstats, parsed.stats);
> > +   xact_desc_stats(buf, "abort ", parsed.nabortstats, parsed.abortstats);
> > 
> > I'm not sure I like this, but I don't object to this..
> 
> The string prefixes? Or the entire patch?

The string prefixes, since they are a limited set of fixed
strings. That being said, I don't think it's better to use an enum
instead, too.  So I don't object to pass the strings here.

> > 0010:
> > (I didn't look this closer. The comments arised while looking other
> > patches.)
> > 
> > +pgstat_kind_from_str(char *kind_str)
> > 
> > I don't think I like "str" so much.  Don't we spell it as
> > "pgstat_kind_from_name"?
>
> name makes me think of NameData.  What do you dislike about str? We seem to 
> use
> str in plenty places?

For clarity, I don't dislike it so much.  So, I'm fine with the
current name.

I found that you meant a type by the "str".  I thought it as an
instance (I'm not sure I can express my feeling correctly here..) and
the following functions were in my mind.

char *get_namespace/rel/collation/func_name(Oid someoid)
char *pgstat_slru_name(int slru_idx)

Another instance of the same direction is

ForkNumber forkname_to_number(const char *forkName)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: shared-memory based stats collector - v70

2022-04-06 Thread Andres Freund
Hi,

On 2022-04-06 17:01:17 -0700, David G. Johnston wrote:
> On Wed, Apr 6, 2022 at 4:12 PM Andres Freund  wrote:
>
> The fact there is just the one outlier here suggests that this is indeed the
> better option.

FWIW, the outlier also uses pgstat_reset(), just with a small wrapper doing
the translation from slot name to slot index.


> > What does "private" mean for you? They're exposed via pgstat.h not
> > pgstat_internal.h. But not to SQL.
> I was thinking specifically of the freedom to rename and not break
> extensions.  Namely, are these truly implementation details or something
> that, while unlikely to be used by extensions, still constitute an exposed
> API?  It was mainly a passing thought, I'm not looking for a crash-course
> in how all that works right now.

I doubt there are extension using these functions - and they'd have been
broken the way things were in v70, because the signature already had changed.

Generally, between major releases, we don't worry too much about changing C
APIs. Of course we try to avoid unnecessarily breaking things, particularly
when it's going to cause widespread breakage.

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-04-06 Thread David G. Johnston
On Wed, Apr 6, 2022 at 4:12 PM Andres Freund  wrote:

>
> On 2022-04-06 15:32:39 -0700, David G. Johnston wrote:
> > On Wednesday, April 6, 2022, Andres Freund  wrote:
> >
> >
> > I like having the SQL function paired with a matching implementation in
> > this scheme.
>
> It would have gotten things closer than it was before imo. We can't just
> rename the SQL functions, they're obviously exposed API.
>

Right, I meant the naming scheme proposed was acceptable.  Not that I
wanted to change the SQL functions too.

I've hacked up the above, but after doing so I think I found a cleaner
>
approach:
>
> I've introduced:
>
> pgstat_reset_of_kind(PgStat_Kind kind) which works for both fixed and
> variable
> numbered stats. That allows to remove pgstat_reset_subscription_counters(),
> pgstat_reset_replslot_counters(), pgstat_reset_shared_counters().
>
> pgstat_reset(PgStat_Kind kind, Oid dboid, Oid objoid), which removes the
> need
> for pgstat_reset_subscription_counter(),
> pgstat_reset_single_counter().



> pgstat_reset_replslot() is still needed, to do
> the name -> index lookup.
>
> That imo makes a lot more sense than requiring each variable-amount kind to
> have wrapper functions.
>
>
I can see benefits of both, or even possibly combining them.  Absent being
able to point to some other part of the system and saying "it is done this
way there, let's do the same here" I think the details will inform the
decision.  The fact there is just the one outlier here suggests that this
is indeed the better option.



> What does "private" mean for you? They're exposed via pgstat.h not
> pgstat_internal.h. But not to SQL.
>
>
I was thinking specifically of the freedom to rename and not break
extensions.  Namely, are these truly implementation details or something
that, while unlikely to be used by extensions, still constitute an exposed
API?  It was mainly a passing thought, I'm not looking for a crash-course
in how all that works right now.

David J.


Re: shared-memory based stats collector - v70

2022-04-06 Thread Andres Freund
Hi,

On 2022-04-06 15:32:39 -0700, David G. Johnston wrote:
> On Wednesday, April 6, 2022, Andres Freund  wrote:
> 
> >
> >
> > I'd go for
> > pgstat_reset_slru_counter() -> pgstat_reset_slru()
> > pgstat_reset_subscription_counter() -> pgstat_reset_subscription()
> > pgstat_reset_subscription_counters() -> pgstat_reset_all_subscriptions()
> > pgstat_reset_replslot_counter() -> pgstat_reset_replslot()
> > pgstat_reset_replslot_counters() -> pgstat_reset_all_replslots()
> 
> 
> I like having the SQL function paired with a matching implementation in
> this scheme.

It would have gotten things closer than it was before imo. We can't just
rename the SQL functions, they're obviously exposed API.

I'd like to remove the NULL -> all behaviour, but that should be discussed
separately.


I've hacked up the above, but after doing so I think I found a cleaner
approach:

I've introduced:

pgstat_reset_of_kind(PgStat_Kind kind) which works for both fixed and variable
numbered stats. That allows to remove pgstat_reset_subscription_counters(),
pgstat_reset_replslot_counters(), pgstat_reset_shared_counters().

pgstat_reset(PgStat_Kind kind, Oid dboid, Oid objoid), which removes the need
for pgstat_reset_subscription_counter(),
pgstat_reset_single_counter(). pgstat_reset_replslot() is still needed, to do
the name -> index lookup.

That imo makes a lot more sense than requiring each variable-amount kind to
have wrapper functions.


> > Not quite sure what to do with pgstat_reset_single_counter(). I'd either go
> > for the minimal pgstat_reset_single_counters() or pgstat_reset_one()?
> >
> 
> Why not add both pgstat_reset_function() and pgstat_reset_table() (to keep
> the pairing) and they can call the renamed pgstat_reset_function_or_table()
> internally (since the function indeed handles both paths and we’ve yet to
> come up with a label to use instead of “function and table stats”)?
> 
> These are private functions right?

What does "private" mean for you? They're exposed via pgstat.h not
pgstat_internal.h. But not to SQL.

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-04-06 Thread David G. Johnston
On Wednesday, April 6, 2022, Andres Freund  wrote:

>
>
> I'd go for
> pgstat_reset_slru_counter() -> pgstat_reset_slru()
> pgstat_reset_subscription_counter() -> pgstat_reset_subscription()
> pgstat_reset_subscription_counters() -> pgstat_reset_all_subscriptions()
> pgstat_reset_replslot_counter() -> pgstat_reset_replslot()
> pgstat_reset_replslot_counters() -> pgstat_reset_all_replslots()


I like having the SQL function paired with a matching implementation in
this scheme.


>
> We could leave out the _all_ and just use plural too, but I think it's a
> bit
> nicer with _all_ in there.


+1 to _all_


>
> Not quite sure what to do with pgstat_reset_single_counter(). I'd either go
> for the minimal pgstat_reset_single_counters() or pgstat_reset_one()?
>

Why not add both pgstat_reset_function() and pgstat_reset_table() (to keep
the pairing) and they can call the renamed pgstat_reset_function_or_table()
internally (since the function indeed handles both paths and we’ve yet to
come up with a label to use instead of “function and table stats”)?

These are private functions right?

David J.


Re: shared-memory based stats collector - v70

2022-04-06 Thread Andres Freund
Hi,

On 2022-04-05 20:00:08 -0700, Andres Freund wrote:
> It'll be a few hours to get to the main commit - but except for 0001 it
> doesn't make sense to push without intending to push later changes too. I
> might squash a few commits togther.

I just noticed an existing incoherency that I'm wondering about fixing as part
of 0007 "pgstat: prepare APIs used by pgstatfuncs for shared memory stats."

The SQL functions to reset function and relation stats
pg_stat_reset_single_table_counters() and
pg_stat_reset_single_function_counters() respectively both make use of
pgstat_reset_single_counter().

Note that the SQL function uses plural "counters" (which makes sense, it
resets all counters for that object), whereas the C function they call to
perform the reset uses singular.

Similarly pg_stat_reset_slru(), pg_stat_reset_replication_slot(),
pg_stat_reset_subscription_stats() SQL function use
pgstat_reset_subscription_counter(), pgstat_reset_replslot_counter() and
pgstat_reset_subscription_counter() to reset either the stats for one or all
SLRUs/slots.

This is relevant for the commit mentioned above because it separates the
functions to reset the stats for one slru / slot / sub from the function to
reset all slrus / slots / subs. Going with the existing naming I'd just named
them pgstat_reset_*_counters(). But that doesn't really make sense.


If it were just existing code I'd just not touch this for now. But because the
patch introduces further functions, I'd rather not introducing more weird
function names.

I'd go for
pgstat_reset_slru_counter() -> pgstat_reset_slru()
pgstat_reset_subscription_counter() -> pgstat_reset_subscription()
pgstat_reset_subscription_counters() -> pgstat_reset_all_subscriptions()
pgstat_reset_replslot_counter() -> pgstat_reset_replslot()
pgstat_reset_replslot_counters() -> pgstat_reset_all_replslots()

We could leave out the _all_ and just use plural too, but I think it's a bit
nicer with _all_ in there.

Not quite sure what to do with pgstat_reset_single_counter(). I'd either go
for the minimal pgstat_reset_single_counters() or pgstat_reset_one()?

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-04-06 Thread Lukas Fittl
On Wed, Apr 6, 2022 at 12:34 PM Justin Pryzby  wrote:

> On Wed, Apr 06, 2022 at 12:27:34PM -0700, Andres Freund wrote:
> > > > +   next use of statistical information will cause a new snapshot to be built
> > > > +   or accessed statistics to be cached.
> > >
> > > I believe this should be an "and", not an "or". (next access builds both a
> > > new snapshot and caches accessed statistics)
> >
> > I *think* or is correct? The new snapshot is when stats_fetch_consistency =
> > snapshot, the cached is when stats_fetch_consistency = cache. Not sure how to
> > make that clearer without making it a lot longer. Suggestions?
>
> I think it's correct.  Maybe it's clearer to say:
>
> +   next use of statistical information will (when in snapshot mode) cause a new snapshot to be built
> +   or (when in cache mode) accessed statistics to be cached.
>

Ah, yes, that does clarify what was meant.

+1 to Justin's edit, or something like it.

Thanks,
Lukas

-- 
Lukas Fittl


Re: shared-memory based stats collector - v70

2022-04-06 Thread Justin Pryzby
On Wed, Apr 06, 2022 at 12:27:34PM -0700, Andres Freund wrote:
> > > +   next use of statistical information will cause a new snapshot to be built
> > > +   or accessed statistics to be cached.
> > 
> > I believe this should be an "and", not an "or". (next access builds both a
> > new snapshot and caches accessed statistics)
> 
> I *think* or is correct? The new snapshot is when stats_fetch_consistency =
> snapshot, the cached is when stats_fetch_consistency = cache. Not sure how to
> make that clearer without making it a lot longer. Suggestions?

I think it's correct.  Maybe it's clearer to say:

+   next use of statistical information will (when in snapshot mode) cause a new snapshot to be built
+   or (when in cache mode) accessed statistics to be cached.





Re: shared-memory based stats collector - v70

2022-04-06 Thread Andres Freund
Hi,

On 2022-04-06 12:14:35 -0700, Lukas Fittl wrote:
> On Tue, Apr 5, 2022 at 8:00 PM Andres Freund  wrote:
> 
> > Here comes v70:
> >
> 
> Some small nitpicks on the docs:

Thanks!

> > From 13090823fc4c7fb94512110fb4d1b3e86fb312db Mon Sep 17 00:00:00 2001
> > From: Andres Freund 
> > Date: Sat, 2 Apr 2022 19:38:01 -0700
> > Subject: [PATCH v70 14/27] pgstat: update docs.
> > ...
> > diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> > -  These parameters control server-wide statistics collection features.
> > -  When statistics collection is enabled, the data that is produced can be
> > +  These parameters control server-wide cumulative statistics system.
> > +  When enabled, the data that is collected can be
> 
> Missing "the" ("These parameters control the server-wide cumulative
> statistics system").

> > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> > +   any of the accumulated statistics, acessed values are cached until the end
> 
> "acessed" => "accessed"

> > +   stats_fetch_consistency can be set
> > +   snapshot, at the price of increased memory usage for
> 
> Missing "to" ("can be set to snapshot")

Fixed.

> > +   caching not-needed statistics data.  Conversely, if it's known that statistics
> 
> Double space between "data." and "Conversely" (not sure if that matters)
> > +   current transaction's statistics snapshot or cached values (if any).  The
> 
> Double space between "(if any)." and "The" (not sure if that matters)

That's done pretty widely in the docs and comments.


> > +   next use of statistical information will cause a new snapshot to be built
> > +   or accessed statistics to be cached.
> 
> I believe this should be an "and", not an "or". (next access builds both a
> new snapshot and caches accessed statistics)

I *think* or is correct? The new snapshot is when stats_fetch_consistency =
snapshot, the cached is when stats_fetch_consistency = cache. Not sure how to
make that clearer without making it a lot longer. Suggestions?

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-04-06 Thread Lukas Fittl
On Tue, Apr 5, 2022 at 8:00 PM Andres Freund  wrote:

> Here comes v70:
>

Some small nitpicks on the docs:

> From 13090823fc4c7fb94512110fb4d1b3e86fb312db Mon Sep 17 00:00:00 2001
> From: Andres Freund 
> Date: Sat, 2 Apr 2022 19:38:01 -0700
> Subject: [PATCH v70 14/27] pgstat: update docs.
> ...
> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> -  These parameters control server-wide statistics collection features.
> -  When statistics collection is enabled, the data that is produced can be
> +  These parameters control server-wide cumulative statistics system.
> +  When enabled, the data that is collected can be

Missing "the" ("These parameters control the server-wide cumulative
statistics system").

> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> +   any of the accumulated statistics, acessed values are cached until the end

"acessed" => "accessed"

> +   stats_fetch_consistency can be set
> +   snapshot, at the price of increased memory usage for

Missing "to" ("can be set to snapshot")

> +   caching not-needed statistics data.  Conversely, if it's known that statistics

Double space between "data." and "Conversely" (not sure if that matters)

> +   current transaction's statistics snapshot or cached values (if any).  The

Double space between "(if any)." and "The" (not sure if that matters)

> +   next use of statistical information will cause a new snapshot to be built
> +   or accessed statistics to be cached.

I believe this should be an "and", not an "or". (next access builds both a
new snapshot and caches accessed statistics)

Thanks,
Lukas

-- 
Lukas Fittl


Re: shared-memory based stats collector - v70

2022-04-06 Thread Andres Freund
Hi,

On 2022-04-06 16:24:28 +0700, John Naylor wrote:
> On Wed, Apr 6, 2022 at 10:00 AM Andres Freund  wrote:
> > - while working on the above point, I noticed that hash_bytes() showed up
> >   noticeably in profiles, so I replaced it with a fixed-width function
> 
> I'm curious about this -- could you direct me to which patch introduces this?

Commit 0010, search for pgstat_hash_key_hash. For simplicity I'm including it
here inline:

/* helpers for dshash / simplehash hashtables */
static inline int
pgstat_hash_key_cmp(const void *a, const void *b, size_t size, void *arg)
{
AssertArg(size == sizeof(PgStat_HashKey) && arg == NULL);
return memcmp(a, b, sizeof(PgStat_HashKey));
}

static inline uint32
pgstat_hash_key_hash(const void *d, size_t size, void *arg)
{
const PgStat_HashKey *key = (PgStat_HashKey *)d;
uint32 hash;

AssertArg(size == sizeof(PgStat_HashKey) && arg == NULL);

hash = murmurhash32(key->kind);
hash = hash_combine(hash, murmurhash32(key->dboid));
hash = hash_combine(hash, murmurhash32(key->objoid));

return hash;
}

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-04-06 Thread Andres Freund
Hi,

On 2022-04-06 13:31:31 +0200, Alvaro Herrera wrote:
> Just skimming a bit here ...

Thanks!


> On 2022-Apr-05, Andres Freund wrote:
> 
> > From 0532b869033595202d5797b148f22c61e4eb4969 Mon Sep 17 00:00:00 2001
> > From: Andres Freund 
> > Date: Mon, 4 Apr 2022 16:53:16 -0700
> > Subject: [PATCH v70 10/27] pgstat: store statistics in shared memory.
> 
> > +  PgStatsData
> > +  Waiting fo shared memory stats data access
> > + 
> 
> Typo "fo" -> "for"

Oh, oops. I had fixed that in the wrong patch.


> > @@ -5302,7 +5317,9 @@ StartupXLOG(void)
> > performedWalRecovery = true;
> > }
> > else
> > +   {
> > performedWalRecovery = false;
> > +   }
> 
> Why? :-)

Damage from merging two commits yesterday. I'd left open where exactly we'd
reset stats, with the "main commit" implementing the current behaviour more
closely, and then a followup commit implementing something a bit
better. Nobody seemed to argue for keeping the behaviour 1:1, so I merged
them. Without removing the parens again :)


> Why give pgstat_get_entry_ref the responsibility of initializing
> created_entry to false?  The vast majority of callers don't care about
> that flag; it seems easier/cleaner to set it to false in
> pgstat_init_function_usage (the only caller that cares that I could
> find) before calling pgstat_prep_pending_entry.

It's annoying to have to initialize it, I agree. But I think it's bugprone for
the caller to know that it has to be pre-initialized to false.


> (I suggest pgstat_prep_pending_entry should have a comment line stating
> "*created_entry, if not NULL, is set true if the entry required to be
> created.", same as pgstat_get_entry_ref.)

Added something along those lines.

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-04-06 Thread Andres Freund
Hi,

On 2022-04-06 18:11:04 +0900, Kyotaro Horiguchi wrote:
> 0004:
> 
> I can see padding_pgstat_send and fun:pgstat_send in valgrind.supp

Those shouldn't be affected by the patch, I think? But I did indeed forget to
remove those in 0010.

> 0006:
> 
> I'm fine with the categorize for now.
> 
> +#define PGSTAT_KIND_LAST PGSTAT_KIND_WAL
> +#define PGSTAT_NUM_KINDS (PGSTAT_KIND_LAST + 1)
> 
> The number of kinds is 10. And PGSTAT_NUM_KINDS is 11?

Yea, it's not great. I think I'll introduce INVALID and rename
PGSTAT_KIND_FIRST to FIRST_VALID.


> +  * Don't define an INVALID value so switch() statements can warn if some
> +  * cases aren't covered. But define the first member to 1 so that
> +  * uninitialized values can be detected more easily.
> 
> FWIW, I like this.

I think there's no switches left now, so it's not actually providing too much.


> 0008:
> 
> + xact_desc_stats(buf, "", parsed.nstats, parsed.stats);
> + xact_desc_stats(buf, "commit ", parsed.nstats, parsed.stats);
> + xact_desc_stats(buf, "abort ", parsed.nabortstats, parsed.abortstats);
> 
> I'm not sure I like this, but I don't object to this..

The string prefixes? Or the entire patch?


> 0010:
> (I didn't look this closer. The comments arised while looking other
> patches.)
> 
> +pgstat_kind_from_str(char *kind_str)
> 
> I don't think I like "str" so much.  Don't we spell it as
> "pgstat_kind_from_name"?

name makes me think of NameData. What do you dislike about str? We seem to use
str in plenty places?


> 0019:
> 
> +my $og_stats = $datadir . '/' . "pg_stat" . '/' . "pgstat.stat";
> 
> It can be "$datadir/pg_stat/pgstat.stat" or $datadir . '/pg_stat/pgstat.stat'.
> Isn't it simpler?

Yes, will change.

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-04-06 Thread Alvaro Herrera
Just skimming a bit here ...

On 2022-Apr-05, Andres Freund wrote:

> From 0532b869033595202d5797b148f22c61e4eb4969 Mon Sep 17 00:00:00 2001
> From: Andres Freund 
> Date: Mon, 4 Apr 2022 16:53:16 -0700
> Subject: [PATCH v70 10/27] pgstat: store statistics in shared memory.

> +  PgStatsData
> +  Waiting fo shared memory stats data access
> + 

Typo "fo" -> "for"

> @@ -5302,7 +5317,9 @@ StartupXLOG(void)
>   performedWalRecovery = true;
>   }
>   else
> + {
>   performedWalRecovery = false;
> + }

Why? :-)

Why give pgstat_get_entry_ref the responsibility of initializing
created_entry to false?  The vast majority of callers don't care about
that flag; it seems easier/cleaner to set it to false in
pgstat_init_function_usage (the only caller that cares that I could
find) before calling pgstat_prep_pending_entry.

(I suggest pgstat_prep_pending_entry should have a comment line stating
"*created_entry, if not NULL, is set true if the entry required to be
created.", same as pgstat_get_entry_ref.)

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"How amazing is that? I call it a night and come back to find that a bug has
been identified and patched while I sleep."(Robert Davidson)
   http://archives.postgresql.org/pgsql-sql/2006-03/msg00378.php




Re: shared-memory based stats collector - v70

2022-04-06 Thread John Naylor
On Wed, Apr 6, 2022 at 10:00 AM Andres Freund  wrote:
> - while working on the above point, I noticed that hash_bytes() showed up
>   noticeably in profiles, so I replaced it with a fixed-width function

I'm curious about this -- could you direct me to which patch introduces this?

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: shared-memory based stats collector - v70

2022-04-06 Thread Kyotaro Horiguchi
At Tue, 5 Apr 2022 20:00:08 -0700, Andres Freund  wrote in 
> Hi,
> 
> Here comes v70:
> - extended / polished the architecture comment based on feedback from Melanie
>   and David
> - other polishing as suggested by David
> - addressed the open issue around pgstat_report_stat(), as described in
>   
> https://www.postgresql.org/message-id/20220405204019.6yj7ocmpw352c2u5%40alap3.anarazel.de
> - while working on the above point, I noticed that hash_bytes() showed up
>   noticeably in profiles, so I replaced it with a fixed-width function
> - found a few potential regression test instabilities by either *always*
>   flushing in pgstat_report_stat(), or only flushing when force = true.
> - random minor improvements
> - reordered commits some
> 
> I still haven't renamed pg_stat_exists_stat() yet - I'm leaning towards
> pg_stat_have_stats() or pg_stat_exists() right now. But it's an SQL function
> for testing, so it doesn't really matter.
> 
> I think this is basically ready, minus a a few comment adjustments here and
> there. Unless somebody protests I'm planning to start pushing things tomorrow
> morning.
> 
> It'll be a few hours to get to the main commit - but except for 0001 it
> doesn't make sense to push without intending to push later changes too. I
> might squash a few commits togther.
> 
> There's lots that can be done once all this is in place, both simplifying
> pre-existing code and easy new features, but that's for a later release.

I'm not sure it's in time but..

(Sorry in advance for possible duplicate or pointless comments.)

0001: Looks fine.

0002:

All references to "stats collector" or alike looks like eliminated
after all of the 24 patches are applied.  So this seems fine.

0003:

This is just moving around functions and variables. Looks fine.

0004:

I can see padding_pgstat_send and fun:pgstat_send in valgrind.supp

0005:

The function is changed later patch, and it looks fine.

0006:

I'm fine with the categorize for now.

+#define PGSTAT_KIND_LAST PGSTAT_KIND_WAL
+#define PGSTAT_NUM_KINDS (PGSTAT_KIND_LAST + 1)

The number of kinds is 10. And PGSTAT_NUM_KINDS is 11?

+* Don't define an INVALID value so switch() statements can warn if some
+* cases aren't covered. But define the first member to 1 so that
+* uninitialized values can be detected more easily.

FWIW, I like this.


0007:

(mmm no comments)

0008:

+   xact_desc_stats(buf, "", parsed.nstats, parsed.stats);
+   xact_desc_stats(buf, "commit ", parsed.nstats, parsed.stats);
+   xact_desc_stats(buf, "abort ", parsed.nabortstats, parsed.abortstats);

I'm not sure I like this, but I don't object to this..

0009: (skipped)

0010:
(I didn't look at this closely. The comments arose while looking at other
patches.)

+pgstat_kind_from_str(char *kind_str)

I don't think I like "str" so much.  Don't we spell it as
"pgstat_kind_from_name"?


0011:
  Looks fine.

0012:
  Looks like covering all related parts.


0013:
  Just fine.

0014:
  I believe it:p

0015:
  Function attributes seems correct. Looks fine.

0016:
(skipped, but looks fine by a quick look.)

0017:

 I don't find a problem with this.

0018: (skipped)

0019:

+my $og_stats = $datadir . '/' . "pg_stat" . '/' . "pgstat.stat";

It can be "$datadir/pg_stat/pgstat.stat" or $datadir . '/pg_stat/pgstat.stat'.
Isn't it simpler?


0020-24:(I believe them:p)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: shared-memory based stats collector - v69

2022-04-05 Thread David G. Johnston
On Tue, Apr 5, 2022 at 8:14 PM Andres Freund  wrote:

>
> On 2022-04-05 20:00:50 -0700, David G. Johnston wrote:
>
>  * Statistics are loaded from the filesystem during startup (by the startup
>  * process), unless preceded by a crash, in which case all stats are
>  * discarded. They are written out by the checkpointer process just before
>  * shutting down, except when shutting down in immediate mode.
>
>
Cool.  I was on the fence about the level of detail here, but mostly
excluded mentioning the checkpointer 'cause I didn't want to research the
correct answer tonight.

>
> >  * Each cumulative statistic producing system must construct a
> PgStat_Kind
> >  * datum in this file. The details are described elsewhere, but of
> >  * particular importance is that each kind is classified as having
> either a
> >  * fixed number of objects that it tracks, or a variable number.
> >  *
> >  * During normal operations, the different consumers of the API will have
> > their
> >  * accessed managed by the API, the protocol used is determined based
> upon
> > whether
> >  * the statistical kind is fixed-numbered or variable-numbered.
> >  * Readers of variable-numbered statistics will have the option to
> locally
> >  * cache the data, while writers may have their updates locally queued
> >  * and applied in a batch. Thus favoring speed over freshness.
> >  * The fixed-numbered statistics are faster to process and thus forgo
> >  * these mechanisms in favor of a light-weight lock.
>
> This feels a bit jumbled.


I had that inkling as well.  First draft and I needed to stop at some
point.  It didn't seem bad or wrong at least.

Of course something using an API will be managed by
> the API. I don't know what protocol really means?
>
>
Procedure, process, algorithm are synonyms.  Procedure probably makes more
sense here since it is a procedural language we are using.  I thought of
algorithm while writing this but it carried too much technical baggage for
me (compression, encryption, etc..) that this didn't seem to fit in with.

>
> > Additionally, both due to unclean shutdown or user
> > request,
> >  * statistics can be reset - meaning that their stored numeric values are
> > returned
> >  * to zero, and any non-numeric data that may be tracked (say a
> timestamp)
> > is cleared.
>
> I think this is basically covered in the above?
>
>
Yes and no.  The first paragraph says they are forced to reset due to
system error.  This paragraph basically says that resetting this kind of
statistic is an acceptable, and even expected, thing to do.  And in fact
can also be done intentionally and not only due to system error.  I am
pondering whether to mention this dynamic first and/or better blend it in -
but the minor repetition in the different contexts seems ok.

David J.


Re: shared-memory based stats collector - v70

2022-04-05 Thread Andres Freund
Hi,

On 2022-04-05 23:11:07 -0400, Greg Stark wrote:
> I've never tried to review a 24-patch series before. It's kind of
> intimidating... Is there a good place to start to get a good idea of
> the most important changes?

It was more at some point :). And believe me, I find this whole project
intimidating and exhausting. The stats collector is entangled in a lot of
places, and there was a lot of preparatory work to get this point.

Most of the commits aren't really interesting, I broke them out to make the
"main commit" a bit smaller, because it's exhausting to look at a *huge*
single commit. I wish I could have broken it down more, but I didn't find a
good way.

The interesting commit is
v70-0010-pgstat-store-statistics-in-shared-memory.patch
which actually replaces the stats collector by storing stats in shared
memory. It contains a, now hopefully decent, overview of how things work at
the top of pgstat.c.

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-04-05 Thread David G. Johnston
On Tue, Apr 5, 2022 at 8:11 PM Greg Stark  wrote:

> I've never tried to review a 24-patch series before. It's kind of
> intimidating... Is there a good place to start to get a good idea of
> the most important changes?
>

It isn't as bad as the number makes it sound - I just used "git am" to
apply the patches to a branch and skimmed each commit separately.  Most of
them are tests or other minor pieces.  The remaining few cover different
aspects of the major commit and you can choose them based upon your
experience and time.

David J.


Re: shared-memory based stats collector - v69

2022-04-05 Thread Andres Freund
Hi,

On 2022-04-05 20:00:50 -0700, David G. Johnston wrote:
> On Tue, Apr 5, 2022 at 4:16 PM Andres Freund  wrote:
> > On 2022-04-05 14:43:49 -0700, David G. Johnston wrote:
> > > On Tue, Apr 5, 2022 at 2:23 PM Andres Freund  wrote:
> > > > I guess I should add a paragraph about snapshots / fetch consistency.
> > > >
> > >
> > > I apparently confused/combined the two concepts just now so that would
> > help.
> >
> > Will add.

I at least tried...


> On a slightly different track, I took the time to write-up a "Purpose"
> section for pgstat.c :
> 
> It may possibly be duplicating some things written elsewhere as I didn't go
> looking for similar prior art yet, I just wanted to get thoughts down.

There's very very little prior documentation in this area.


> This is the kind of preliminary framing I've been constructing in my own
> mind as I try to absorb this patch.  I haven't formed an opinion whether
> the actual user-facing documentation should cover some or all of this
> instead of the preamble to pgstat.c (which could just point to the docs for
> prerequisite reading).

>  * The PgStat namespace defines an API that facilitates concurrent access
>  * to a shared memory region where cumulative statistical data is saved.
>  * At shutdown, one of the running system workers will initiate the writing
>  * of the data to file. Then, during startup (following a clean shutdown)
> the
>  * Postmaster process will early on ensure that the file is loaded into
> memory.

I added something roughly along those lines in the version I just sent, based
on a suggestion by Melanie over IM:

 * Statistics are loaded from the filesystem during startup (by the startup
 * process), unless preceded by a crash, in which case all stats are
 * discarded. They are written out by the checkpointer process just before
 * shutting down, except when shutting down in immediate mode.



>  * Each cumulative statistic producing system must construct a PgStat_Kind
>  * datum in this file. The details are described elsewhere, but of
>  * particular importance is that each kind is classified as having either a
>  * fixed number of objects that it tracks, or a variable number.
>  *
>  * During normal operations, the different consumers of the API will have
> their
>  * accessed managed by the API, the protocol used is determined based upon
> whether
>  * the statistical kind is fixed-numbered or variable-numbered.
>  * Readers of variable-numbered statistics will have the option to locally
>  * cache the data, while writers may have their updates locally queued
>  * and applied in a batch. Thus favoring speed over freshness.
>  * The fixed-numbered statistics are faster to process and thus forgo
>  * these mechanisms in favor of a light-weight lock.

This feels a bit jumbled. Of course something using an API will be managed by
the API. I don't know what protocol really means?


> Additionally, both due to unclean shutdown or user
> request,
>  * statistics can be reset - meaning that their stored numeric values are
> returned
>  * to zero, and any non-numeric data that may be tracked (say a timestamp)
> is cleared.

I think this is basically covered in the above?

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-04-05 Thread Greg Stark
I've never tried to review a 24-patch series before. It's kind of
intimidating... Is there a good place to start to get a good idea of
the most important changes?




Re: shared-memory based stats collector - v70

2022-04-05 Thread David G. Johnston
On Tue, Apr 5, 2022 at 8:00 PM Andres Freund  wrote:

>
> Here comes v70:
>
> I think this is basically ready, minus a few comment adjustments here and
> there. Unless somebody protests I'm planning to start pushing things
> tomorrow
> morning.
>
>
Nothing I've come across, given my area of experience, gives me pause.  I'm
mostly going to focus on docs and comments at this point - to try and help
the next person in my position (and end-users) have an easier go at
on-boarding.  Toward that end, I did just add a "Purpose" section writeup
to the v69 thread.

David J.


Re: shared-memory based stats collector - v69

2022-04-05 Thread David G. Johnston
On Tue, Apr 5, 2022 at 4:16 PM Andres Freund  wrote:

> On 2022-04-05 14:43:49 -0700, David G. Johnston wrote:
> > On Tue, Apr 5, 2022 at 2:23 PM Andres Freund  wrote:
>
> >
> > > I guess I should add a paragraph about snapshots / fetch consistency.
> > >
> >
> > I apparently confused/combined the two concepts just now so that would
> help.
>
> Will add.
>
>
Thank you.

On a slightly different track, I took the time to write-up a "Purpose"
section for pgstat.c :

It may possibly be duplicating some things written elsewhere as I didn't go
looking for similar prior art yet, I just wanted to get thoughts down.
This is the kind of preliminary framing I've been constructing in my own
mind as I try to absorb this patch.  I haven't formed an opinion whether
the actual user-facing documentation should cover some or all of this
instead of the preamble to pgstat.c (which could just point to the docs for
prerequisite reading).

David J.

 * Purpose:

 * The PgStat namespace defines an API that facilitates concurrent access
 * to a shared memory region where cumulative statistical data is saved.
 * At shutdown, one of the running system workers will initiate the writing
 * of the data to file. Then, during startup (following a clean shutdown)
the
 * Postmaster process will early on ensure that the file is loaded into
memory.
 *
 * Each cumulative statistic producing system must construct a PgStat_Kind
 * datum in this file. The details are described elsewhere, but of
 * particular importance is that each kind is classified as having either a
 * fixed number of objects that it tracks, or a variable number.
 *
 * During normal operations, the different consumers of the API will have
their
 * accessed managed by the API, the protocol used is determined based upon
whether
 * the statistical kind is fixed-numbered or variable-numbered.
 * Readers of variable-numbered statistics will have the option to locally
 * cache the data, while writers may have their updates locally queued
 * and applied in a batch. Thus favoring speed over freshness.
 * The fixed-numbered statistics are faster to process and thus forgo
 * these mechanisms in favor of a light-weight lock.
 *
 * Cumulative in this context means that processes must, for numeric data,
send
 * a delta (or change) value via the API which will then be added to the
 * stored value in memory. The system does not track individual changes,
only
 * their net effect. Additionally, both due to unclean shutdown or user
request,
 * statistics can be reset - meaning that their stored numeric values are
returned
 * to zero, and any non-numeric data that may be tracked (say a timestamp)
is cleared.


Re: shared-memory based stats collector - v69

2022-04-05 Thread Andres Freund
On 2022-04-05 14:43:49 -0700, David G. Johnston wrote:
> On Tue, Apr 5, 2022 at 2:23 PM Andres Freund  wrote:
> 
> >
> > On 2022-04-05 13:51:12 -0700, David G. Johnston wrote:
> >
> > >, but rather add to the shared queue
> >
> > Queue? Maybe you mean the hashtable?
> >
> 
> Queue implemented by a list...?  Anyway, I think I mean this:

> /*
>  * List of PgStat_EntryRefs with unflushed pending stats.
>  *
>  * Newly pending entries should only ever be added to the end of the list,
>  * otherwise pgstat_flush_pending_entries() might not see them immediately.
>  */
> static dlist_head pgStatPending = DLIST_STATIC_INIT(pgStatPending);

That's not in shared memory, but backend local...


> > >, and so the cache is effectively read-only.  It is also
> > transaction-scoped
> > >based upon the GUC and the nature of stats vis-a-vis transactions.
> >
> > No, that's not right. I think you might be thinking of
> > pgStatLocal.snapshot.stats?
> >
> >
> Probably...
> 
> 
> > I guess I should add a paragraph about snapshots / fetch consistency.
> >
> 
> I apparently confused/combined the two concepts just now so that would help.

Will add.


> >
> > > Even before I added the read-only and transaction-scoped I got a bit hung
> > > up on reading:
> > > "The shared hashtable only needs to be accessed when no prior reference
> > to
> > > the shared hashtable exists."
> >
> > > Thinking in terms of key seems to make more sense than value in this
> > > sentence - even if there is a one-to-one correspondence.
> >
> > Maybe "prior reference to the shared hashtable exists for the key"?
> >
> 
> I specifically dislike having two mentions of the "shared hashtable" in the
> same sentence, so I tried to phrase the second half in terms of the local
> hashtable.

You left two mentions of "shared hashtable" in the sentence prior though
:). I'll try to rephrase. But it's not the end of the world if this isn't the most elegant
prose...


> > Was that comma after e.g. intentional?
> >
> 
> It is.  That is the style I was taught, and that we seem to adhere to in
> user-facing documentation.  Source code is a mixed bag with no enforcement,
> but while we are here...

Looks a bit odd to me. But I guess I'll add it then...


> > >   *
> > > @@ -19,19 +19,21 @@
> > >   *
> > >   * All variable-numbered stats are addressed by PgStat_HashKey while
> > running.
> > >   * It is not possible to have statistics for an object that cannot be
> > > - * addressed that way at runtime. A wider identifier can be used when
> > > + * addressed that way at runtime. A alternate identifier can be used
> > when
> > >   * serializing to disk (used for replication slot stats).
> >
> > Not sure this improves things.
> >
> >
> It just seems odd that width is being mentioned when the actual struct is a
> combination of three subcomponents.  I do feel I'd need to understand
> exactly what replication slot stats are doing uniquely here, though, to
> make any point beyond that.

There's no real numeric identifier for replication slot stats. So I'm using
the "index" used in slot.c while running. But that can change during
start/stop.

Greetings,

Andres Freund




Re: shared-memory based stats collector - v69

2022-04-05 Thread David G. Johnston
On Tue, Apr 5, 2022 at 2:23 PM Andres Freund  wrote:

>
> On 2022-04-05 13:51:12 -0700, David G. Johnston wrote:
>
> >, but rather add to the shared queue
>
> Queue? Maybe you mean the hashtable?
>

Queue implemented by a list...?  Anyway, I think I mean this:

/*
 * List of PgStat_EntryRefs with unflushed pending stats.
 *
 * Newly pending entries should only ever be added to the end of the list,
 * otherwise pgstat_flush_pending_entries() might not see them immediately.
 */
static dlist_head pgStatPending = DLIST_STATIC_INIT(pgStatPending);


>
>
> >, and so the cache is effectively read-only.  It is also
> transaction-scoped
> >based upon the GUC and the nature of stats vis-a-vis transactions.
>
> No, that's not right. I think you might be thinking of
> pgStatLocal.snapshot.stats?
>
>
Probably...


> I guess I should add a paragraph about snapshots / fetch consistency.
>

I apparently confused/combined the two concepts just now so that would help.

>
> > Even before I added the read-only and transaction-scoped I got a bit hung
> > up on reading:
> > "The shared hashtable only needs to be accessed when no prior reference
> to
> > the shared hashtable exists."
>
> > Thinking in terms of key seems to make more sense than value in this
> > sentence - even if there is a one-to-one correspondence.
>
> Maybe "prior reference to the shared hashtable exists for the key"?
>

I specifically dislike having two mentions of the "shared hashtable" in the
same sentence, so I tried to phrase the second half in terms of the local
hashtable.


> > I am wondering why there are no mentions to the header files in this
> > architecture, only the .c files.
>
> Hm, I guess, but I'm not sure it'd add a lot? It's really just intended to
> give a starting point (and it can't be worse than explanation of the
> current
> system).
>

No need to try to come up with something.  More curious if there was a
general reason to avoid it before I looked to see if I felt anything in
them seemed worth including from my perspective.

>
>
> > diff --git a/src/backend/utils/activity/pgstat.c
> b/src/backend/utils/activity/pgstat.c
> > index bfbfe53deb..504f952c0e 100644
> > --- a/src/backend/utils/activity/pgstat.c
> > +++ b/src/backend/utils/activity/pgstat.c
> > @@ -4,9 +4,9 @@
> >   *
> >   *
> >   * PgStat_KindInfo describes the different types of statistics handled.
> Some
> > - * kinds of statistics are collected for fixed number of objects
> > - * (e.g. checkpointer statistics). Other kinds are statistics are
> collected
> > - * for variable-numbered objects (e.g. relations).
> > + * kinds of statistics are collected for a fixed number of objects
> > + * (e.g., checkpointer statistics). Other kinds of statistics are
> collected
>
> Was that comma after e.g. intentional?
>

It is.  That is the style I was taught, and that we seem to adhere to in
user-facing documentation.  Source code is a mixed bag with no enforcement,
but while we are here...


> > + * for a varying number of objects (e.g., relations).
> >   * Fixed-numbered stats are stored in plain (non-dynamic) shared memory.
>

Status quo works for me too, and matches up with the desired labelling we
are using here.



> >   *
> > @@ -19,19 +19,21 @@
> >   *
> >   * All variable-numbered stats are addressed by PgStat_HashKey while
> running.
> >   * It is not possible to have statistics for an object that cannot be
> > - * addressed that way at runtime. A wider identifier can be used when
> > + * addressed that way at runtime. A alternate identifier can be used
> when
> >   * serializing to disk (used for replication slot stats).
>
> Not sure this improves things.
>
>
It just seems odd that width is being mentioned when the actual struct is a
combination of three subcomponents.  I do feel I'd need to understand
exactly what replication slot stats are doing uniquely here, though, to
make any point beyond that.


>
> > - * Each statistics kind is handled in a dedicated file:
> > + * Each statistics kind is handled in a dedicated file, though their
> structs
> > + * are defined here for lack of better ideas.
>
> -0.5
>
>
Status-quo works for me.  Food for thought for other reviewers though.

David J.


Re: shared-memory based stats collector - v69

2022-04-05 Thread Andres Freund
Hi,

On 2022-04-05 13:51:12 -0700, David G. Johnston wrote:
> On Mon, Apr 4, 2022 at 8:05 PM Andres Freund  wrote:
> 
> > - added an architecture overview comment to the top of pgstat.c - not sure
> > if
> >   it makes sense to anybody but me (and perhaps Horiguchi-san)?
> >
> >
> I took a look at this, diff attached.

Thanks!


> Some typos and minor style stuff,
> plus trying to bring a bit more detail to the caching mechanism.  I may
> have gotten it wrong in adding more detail though.
> 
> + * read-only, backend-local, transaction-scoped, hashtable
> (pgStatEntryRefHash)
> + * in front of the shared hashtable, containing references
> (PgStat_EntryRef)
> + * to shared hashtable entries. The shared hashtable thus only needs to be
> + * accessed when the PgStat_HashKey is not present in the backend-local
> hashtable,
> + * or if stats_fetch_consistency = 'none'.
> 
> I'm under the impression, but didn't try to confirm, that the pending
> updates don't use the caching mechanism

They do.


>, but rather add to the shared queue

Queue? Maybe you mean the hashtable?


>, and so the cache is effectively read-only.  It is also transaction-scoped
>based upon the GUC and the nature of stats vis-a-vis transactions.

No, that's not right. I think you might be thinking of
pgStatLocal.snapshot.stats?

I guess I should add a paragraph about snapshots / fetch consistency.


> Even before I added the read-only and transaction-scoped I got a bit hung
> up on reading:
> "The shared hashtable only needs to be accessed when no prior reference to
> the shared hashtable exists."

> Thinking in terms of key seems to make more sense than value in this
> sentence - even if there is a one-to-one correspondence.

Maybe "prior reference to the shared hashtable exists for the key"?


> I am wondering why there are no mentions to the header files in this
> architecture, only the .c files.

Hm, I guess, but I'm not sure it'd add a lot? It's really just intended to
give a starting point (and it can't be worse than explanation of the current
system).


> diff --git a/src/backend/utils/activity/pgstat.c 
> b/src/backend/utils/activity/pgstat.c
> index bfbfe53deb..504f952c0e 100644
> --- a/src/backend/utils/activity/pgstat.c
> +++ b/src/backend/utils/activity/pgstat.c
> @@ -4,9 +4,9 @@
>   *
>   *
>   * PgStat_KindInfo describes the different types of statistics handled. Some
> - * kinds of statistics are collected for fixed number of objects
> - * (e.g. checkpointer statistics). Other kinds are statistics are collected
> - * for variable-numbered objects (e.g. relations).
> + * kinds of statistics are collected for a fixed number of objects
> + * (e.g., checkpointer statistics). Other kinds of statistics are collected

Was that comma after e.g. intentional?

Applied the rest.


> + * for a varying number of objects (e.g., relations).
>   * Fixed-numbered stats are stored in plain (non-dynamic) shared memory.
>   *
> @@ -19,19 +19,21 @@
>   *
>   * All variable-numbered stats are addressed by PgStat_HashKey while running.
>   * It is not possible to have statistics for an object that cannot be
> - * addressed that way at runtime. A wider identifier can be used when
> + * addressed that way at runtime. A alternate identifier can be used when
>   * serializing to disk (used for replication slot stats).

Not sure this improves things.



>   * The names for structs stored in shared memory are prefixed with
>   * PgStatShared instead of PgStat.
> @@ -53,15 +55,16 @@
>   * entry in pgstat_kind_infos, see PgStat_KindInfo for details.
>   *
>   *
> - * To keep things manageable stats handling is split across several
> + * To keep things manageable, stats handling is split across several

Done.


>   * files. Infrastructure pieces are in:
> - * - pgstat.c - this file, to tie it all together
> + * - pgstat.c - this file, which ties everything together

I liked that :)


> - * Each statistics kind is handled in a dedicated file:
> + * Each statistics kind is handled in a dedicated file, though their structs
> + * are defined here for lack of better ideas.

-0.5

Greetings,

Andres Freund




Re: shared-memory based stats collector - v69

2022-04-05 Thread David G. Johnston
On Mon, Apr 4, 2022 at 8:05 PM Andres Freund  wrote:

> - added an architecture overview comment to the top of pgstat.c - not sure
> if
>   it makes sense to anybody but me (and perhaps Horiguchi-san)?
>
>
I took a look at this, diff attached.  Some typos and minor style stuff,
plus trying to bring a bit more detail to the caching mechanism.  I may
have gotten it wrong in adding more detail though.

+ * read-only, backend-local, transaction-scoped, hashtable
(pgStatEntryRefHash)
+ * in front of the shared hashtable, containing references
(PgStat_EntryRef)
+ * to shared hashtable entries. The shared hashtable thus only needs to be
+ * accessed when the PgStat_HashKey is not present in the backend-local
hashtable,
+ * or if stats_fetch_consistency = 'none'.

I'm under the impression, but didn't try to confirm, that the pending
updates don't use the caching mechanism, but rather add to the shared
queue, and so the cache is effectively read-only.  It is also
transaction-scoped based upon the GUC and the nature of stats vis-a-vis
transactions.

Even before I added the read-only and transaction-scoped I got a bit hung
up on reading:
"The shared hashtable only needs to be accessed when no prior reference to
the shared hashtable exists."

Thinking in terms of key seems to make more sense than value in this
sentence - even if there is a one-to-one correspondence.

The code comment about having per-kind definitions in pgstat.c being
annoying is probably sufficient but it does seem like a valid comment to
leave in the architecture as well.  Having them in both places seems OK.

I am wondering why there are no mentions to the header files in this
architecture, only the .c files.

David J.


pgstat-architecture.diff
Description: Binary data


Re: shared-memory based stats collector - v66

2022-04-05 Thread Andres Freund
Hi,

On 2022-04-02 01:16:48 -0700, Andres Freund wrote:
> I just noticed that the code doesn't appear to actually work like that right
> now. Whenever the timeout is reached, pgstat_report_stat() is called with
> force = true.
> 
> And even if the backend is busy running queries, once there's contention, the
> next invocation of pgstat_report_stat() will return the timeout relative to
> pendin_since, which then will trigger a force report via a very short timeout
> soon.
> 
> It might actually make sense to only ever return PGSTAT_RETRY_MIN_INTERVAL
> (with a slightly different name) from pgstat_report_stat() when blocked
> (limiting the max reporting delay for an idle connection) and to continue
> calling pgstat_report_stat(force = true).  But to only trigger force
> "internally" in pgstat_report_stat() when PGSTAT_MAX_INTERVAL is reached.
> 
> I think that'd mean we'd report after max PGSTAT_RETRY_MIN_INTERVAL in an idle
> connection, and try reporting every PGSTAT_RETRY_MIN_INTERVAL (increasing up
> to PGSTAT_MAX_INTERVAL when blocked) on busy connections.
> 
> Makes sense?

I tried to come up with a workload producing a *lot* of stats (multiple
function calls within a transaction, multiple transactions pipelined) and ran
it with 1000 clients (on a machine with 2 x (10 cores / 20 threads)). To
reduce overhead I set
  default_transaction_isolation=repeatable read
  track_activities=false
MVCC Snapshot acquisition is the clear bottleneck otherwise, followed by
pgstat_report_activity() (which, as confusing as it may sound, is independent
of this patch).

I do see a *small* amount of contention if I lower PGSTAT_MIN_INTERVAL to
1ms. Too small to ever be captured in pg_stat_activity.wait_event, but just
about visible in a profiler.


Which leads me to conclude we can simplify the logic significantly. Here's my
current comment explaining the logic:

 * Unless called with 'force', pending stats updates are flushed at most once
 * per PGSTAT_MIN_INTERVAL (1000ms). When not forced, stats flushes do not
 * block on lock acquisition, except if stats updates have been pending for
 * longer than PGSTAT_MAX_INTERVAL (60000ms).
 *
 * Whenever pending stats updates remain at the end of pgstat_report_stat() a
 * suggested idle timeout is returned. Currently this is always
 * PGSTAT_IDLE_INTERVAL (10000ms). Callers can use the returned time to set up
 * a timeout after which to call pgstat_report_stat(true), but are not
 * required to do so.

Comments?

Greetings,

Andres Freund




Re: shared-memory based stats collector - v68

2022-04-05 Thread David G. Johnston
On Tuesday, April 5, 2022, Andres Freund  wrote:

> Hi,
>
> On 2022-04-05 08:49:36 -0700, David G. Johnston wrote:
> > On Mon, Apr 4, 2022 at 7:36 PM Andres Freund  wrote:
> >
> > >
> > > I think all this is going to achieve is to making code more
> complicated.
> > > There
> > > is a *single* non-assert use of accessed_across_databases and now a
> single
> > > assertion involving it.
> > >
> > > What would having PGSTAT_KIND_CLUSTER and PGSTAT_KIND_DATABASE achieve?
> > >
> >
> > So, I decided to see what this would look like; the results are attached,
> > portions of it also inlined below.
>
> > I'll admit this does introduce a terminology problem - but IMO these
> words
> > are much more meaningful to the reader and code than the existing
> > booleans.  I'm hopeful we can bikeshed something agreeable as I'm
> strongly
> > in favor of making this change.
>
> Sorry, I just don't agree. I'm happy to try to make it look better, but
> this
> isn't it.
>
> Do you think it should be your way strongly enough that you'd not want to
> get
> it in the current way?
>
>
Not that strongly; I’m good with the code as-is.  It’s not pervasive enough
to be hard to understand (I may ponder some code comments though) and the
system it is modeling has some legacy aspects that are more the root
problem and I don’t want to touch those here for sure.


>
> Oring PGSTAT_CLUSTER = 2 with PGSTAT_OBJECT = 3 yields 3 again. To do this
> kind of thing the different values need to have power-of-two values, and
> then
> the tests need to be done with &.


Thanks.


>
> Nicely demonstrated by the fact that with the patch applied initdb doesn't
> pass...



Yeah, I compiled but tried to run the tests and learned I still need to
figure out my setup for make check; then I forgot to make install…

It served its purpose at least.



>
> > @@ -1047,8 +1042,8 @@ pgstat_delete_pending_entry(PgStat_EntryRef
> *entry_ref)
> >   void   *pending_data = entry_ref->pending;
> >
> >   Assert(pending_data != NULL);
> > - /* !fixed_amount stats should be handled explicitly */
> > - Assert(!pgstat_kind_info_for(kind)->fixed_amount);
> > + /* global stats should be handled explicitly : why?*/
> > + Assert(pgstat_kind_info_for(kind)->kind_group == PGSTAT_DYNAMIC);
>
> The pending data infrastructure doesn't provide a way of dealing with fixed
> amount stats, and there's no PgStat_EntryRef for them (since they're not in
> the hashtable).
>
>
Thanks.

David J.


Re: shared-memory based stats collector - v68

2022-04-05 Thread Andres Freund
Hi,

On 2022-04-05 08:49:36 -0700, David G. Johnston wrote:
> On Mon, Apr 4, 2022 at 7:36 PM Andres Freund  wrote:
> 
> >
> > I think all this is going to achieve is making code more complicated.
> > There
> > is a *single* non-assert use of accessed_across_databases and now a single
> > assertion involving it.
> >
> > What would having PGSTAT_KIND_CLUSTER and PGSTAT_KIND_DATABASE achieve?
> >
> 
> So, I decided to see what this would look like; the results are attached,
> portions of it also inlined below.

> I'll admit this does introduce a terminology problem - but IMO these words
> are much more meaningful to the reader and code than the existing
> booleans.  I'm hopeful we can bikeshed something agreeable as I'm strongly
> in favor of making this change.

Sorry, I just don't agree. I'm happy to try to make it look better, but this
isn't it.

Do you think it should be your way strongly enough that you'd not want to get
it in the current way?



> The ability to create defines for subsets nicely resolves the problem that
> CLUSTER and DATABASE (now OBJECT to avoid DATABASE conflict in PgStat_Kind)
> are generally related together - they are now grouped under the DYNAMIC
> label (variable, if you want) while all of the fixed entries get associated
> with GLOBAL.  Thus the majority of usages, since accessed_across_databases
> is rare, end up being either DYNAMIC or GLOBAL.

FWIW, as-is DYNAMIC isn't correct:

> +typedef enum PgStat_KindGroup
> +{
> + PGSTAT_GLOBAL = 1,
> + PGSTAT_CLUSTER,
> + PGSTAT_OBJECT
> +} PgStat_KindGroup;
> +
> +#define PGSTAT_DYNAMIC (PGSTAT_CLUSTER | PGSTAT_OBJECT)

Oring PGSTAT_CLUSTER = 2 with PGSTAT_OBJECT = 3 yields 3 again. To do this
kind of thing the different members need to have power-of-two values, and then
the tests need to be done with &.

Nicely demonstrated by the fact that with the patch applied initdb doesn't
pass...
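To make the failure mode concrete, here is a minimal standalone sketch — the
group names come from the proposed patch, but the bit assignments and the
helper function are mine, purely illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch only: with the patch's sequential values (1, 2, 3),
 * PGSTAT_CLUSTER | PGSTAT_OBJECT == 2 | 3 == 3, which collapses back
 * onto PGSTAT_OBJECT. For an OR-able mask, each group must occupy its
 * own bit, as below.
 */
typedef enum PgStat_KindGroup
{
	PGSTAT_GLOBAL = 1 << 0,
	PGSTAT_CLUSTER = 1 << 1,
	PGSTAT_OBJECT = 1 << 2
} PgStat_KindGroup;

#define PGSTAT_DYNAMIC (PGSTAT_CLUSTER | PGSTAT_OBJECT)

/* membership must then be tested with &, not == */
static inline bool
kind_group_is_dynamic(PgStat_KindGroup group)
{
	return (group & PGSTAT_DYNAMIC) != 0;
}
```

With sequential values, the equivalent `==` test would wrongly reject
PGSTAT_CLUSTER entries.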


> @@ -909,7 +904,7 @@ pgstat_build_snapshot(void)
>*/
>   if (p->key.dboid != MyDatabaseId &&
>   p->key.dboid != InvalidOid &&
> - !kind_info->accessed_across_databases)
> + kind_info->kind_group == PGSTAT_OBJECT)
>   continue;
>  
>   if (p->dropped)

Imo this is far harder to interpret - !kind_info->accessed_across_databases
tells you why we're skipping in clear code. Your alternative doesn't.


> @@ -938,7 +933,7 @@ pgstat_build_snapshot(void)
>   {
>   const PgStat_KindInfo *kind_info = pgstat_kind_info_for(kind);
>  
> - if (!kind_info->fixed_amount)
> + if (kind_info->kind_group == PGSTAT_DYNAMIC)

These all would have to be kind_info->kind_group & PGSTAT_DYNAMIC, or even
(kind_group & PGSTAT_DYNAMIC) != 0, depending on the case.


> @@ -1047,8 +1042,8 @@ pgstat_delete_pending_entry(PgStat_EntryRef *entry_ref)
>   void   *pending_data = entry_ref->pending;
>  
>   Assert(pending_data != NULL);
> - /* !fixed_amount stats should be handled explicitly */
> - Assert(!pgstat_kind_info_for(kind)->fixed_amount);
> + /* global stats should be handled explicitly : why?*/
> + Assert(pgstat_kind_info_for(kind)->kind_group == PGSTAT_DYNAMIC);

The pending data infrastructure doesn't provide a way of dealing with fixed
amount stats, and there's no PgStat_EntryRef for them (since they're not in
the hashtable).


Greetings,

Andres Freund




Re: shared-memory based stats collector - v68

2022-04-05 Thread David G. Johnston
On Mon, Apr 4, 2022 at 7:36 PM Andres Freund  wrote:

>
> I think all this is going to achieve is to make code more complicated.
> There
> is a *single* non-assert use of accessed_across_databases and now a single
> assertion involving it.
>
> What would having PGSTAT_KIND_CLUSTER and PGSTAT_KIND_DATABASE achieve?
>

So, I decided to see what this would look like; the results are attached,
portions of it also inlined below.

I'll admit this does introduce a terminology problem - but IMO these words
are much more meaningful to the reader and code than the existing
booleans.  I'm hopeful we can bikeshed something agreeable as I'm strongly
in favor of making this change.

The ability to create defines for subsets nicely resolves the problem that
CLUSTER and DATABASE (now OBJECT to avoid DATABASE conflict in PgStat_Kind)
are generally related together - they are now grouped under the DYNAMIC
label (variable, if you want) while all of the fixed entries get associated
with GLOBAL.  Thus the majority of usages, since accessed_across_databases
is rare, end up being either DYNAMIC or GLOBAL.  The presence of any other
category should give one pause.  We could add an ALL define if we ever
decide to consolidate the API - but for now it's largely used to ensure
that stats of one type don't get processed by the other.  The boolean fixed
does that well enough but this just seems much cleaner and more
understandable to me.  Though having made up the terms and model myself,
that isn't too surprising.

The only existing usage of accessed_across_databases is in the negative
form, which translates to excluding objects, but only those from other
actual databases.

@@ -909,7 +904,7 @@ pgstat_build_snapshot(void)
  */
  if (p->key.dboid != MyDatabaseId &&
  p->key.dboid != InvalidOid &&
- !kind_info->accessed_across_databases)
+ kind_info->kind_group == PGSTAT_OBJECT)
  continue;

The only other usage of something other than GLOBAL or DYNAMIC is the
restriction on the behavior of reset_single_counter, which also has to be
an object in the current database (the later condition being enforced by
the presence of a valid object oid I presume).  The replacement for this
below is not behavior-preserving, the proposed behavior I believe we agree
is correct though.

@@ -652,7 +647,7 @@ pgstat_reset_single_counter(PgStat_Kind kind, Oid
objoid)

- Assert(!pgstat_kind_info_for(kind)->fixed_amount);
+ Assert(pgstat_kind_info_for(kind)->kind_group == PGSTAT_OBJECT);

Everything else is a straight conversion of fixed_amount to CLUSTER+OBJECT

@@ -728,7 +723,7 @@ pgstat_fetch_entry(PgStat_Kind kind, Oid dboid, Oid
objoid)

- AssertArg(!kind_info->fixed_amount);
+ AssertArg(kind_info->kind_group == PGSTAT_DYNAMIC);

and !fixed_amount to GLOBAL

@@ -825,7 +820,7 @@ pgstat_get_stat_snapshot_timestamp(bool *have_snapshot)
 bool
 pgstat_exists_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 {
- if (pgstat_kind_info_for(kind)->fixed_amount)
+ if (pgstat_kind_info_for(kind)->kind_group == PGSTAT_GLOBAL)
  return true;

  return pgstat_get_entry_ref(kind, dboid, objoid, false, NULL) != NULL;

David J.


rework-using-enums.diff
Description: Binary data


Re: shared-memory based stats collector - v68

2022-04-04 Thread Andres Freund
Hi,

On 2022-04-04 19:03:13 -0700, David G. Johnston wrote:
> > > (if this is true...but given this is an optimization category I'm
> > thinking
> > > maybe it doesn't actually matter...)
> >
> > It is true. Not sure what you mean with "optimization category"?
> >
> >
> I mean that distinguishing between stats that are fixed and those that are
> variable implies that fixed kinds have a better performance (speed, memory)
> characteristic than variable kinds (at least in part due to the presence of
> changecount).  If fixed kinds did not have a performance benefit then
> having the variable kind implementation simply handle fixed kinds as well
> (using the common struct header and storage in a hash table) would make the
> implementation simpler since all statistics would report through the same
> API.

Yes, fixed-numbered stats are faster.



> Coming back to this:
> """
> + /* cluster-scoped object stats having a variable number of entries */
> + PGSTAT_KIND_REPLSLOT = 1, /* per-slot statistics */
> + PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */
> + PGSTAT_KIND_DATABASE, /* database-wide statistics */ (I moved this to 3rd
> spot to be closer to the database-scoped options)
> +
> + /* database-scoped object stats having a variable number of entries */
> + PGSTAT_KIND_RELATION, /* per-table statistics */
> + PGSTAT_KIND_FUNCTION, /* per-function statistics */
> +
> + /* cluster-scoped stats having a fixed number of entries */ (maybe these
> should go first, the variable following?)
> + PGSTAT_KIND_ARCHIVER,
> + PGSTAT_KIND_BGWRITER,
> + PGSTAT_KIND_CHECKPOINTER,
> + PGSTAT_KIND_SLRU,
> + PGSTAT_KIND_WAL,
> """
> 
> I see three "KIND_GROUP" categories here:
> PGSTAT_KIND_CLUSTER (open to a different word here though...)
> PGSTAT_KIND_DATABASE (we seem to agree on this above)
> PGSTAT_KIND_GLOBAL (already used in the code)
> 
> This single enum can replace the two booleans that, in combination, would
> define 4 unique groups (of which only three are interesting -
> database+fixed doesn't seem interesting and so is not given a name/value
> here).

The more I think about it, the less I think a split like that makes sense. The
difference between PGSTAT_KIND_CLUSTER / PGSTAT_KIND_DATABASE is tiny. Nearly
all code just deals with both together.

I think all this is going to achieve is to make code more complicated. There
is a *single* non-assert use of accessed_across_databases and now a single
assertion involving it.

What would having PGSTAT_KIND_CLUSTER and PGSTAT_KIND_DATABASE achieve?

Greetings,

Andres Freund




Re: shared-memory based stats collector - v68

2022-04-04 Thread David G. Johnston
On Mon, Apr 4, 2022 at 3:44 PM Andres Freund  wrote:

> Hi,
>
> On 2022-04-04 15:24:24 -0700, David G. Johnston wrote:
> > Replacing the existing assert(!kind->fixed_amount) with
> > assert(!kind->accessed_across_databases) produces the same result as the
> > latter presently implies the former.
>
> I wasn't proposing to replace, but to add...
>

Right, but it seems redundant to have both when one implies the other.  But
I'm not hard set against it either, though my idea below makes them both
obsolete.

>
> > Now I start to dislike the behavioral aspect of the attribute and would
> > rather just name it: kind->is_cluster_scoped (or something else that is
> > descriptive of the stat category itself, not how it is used)
>
> I'm not in love with the name either. But cluster is just a badly
> overloaded
> word :(.
>
> system_wide? Or invert it and say: database_scoped? I think I like the
> latter.
>
>
I like database_scoped as well...but see my idea below that makes this
obsolete.

>
> > Then reorganize the Kind documentation to note and emphasize these two
> > primary descriptors:
> > variable, which can be cluster or database scoped
> > fixed, which are cluster scoped by definition
>
> Hm. There's not actually that much difference between cluster/non-cluster
> wide
> scope for most of the system. I'm not strongly against, but I'm also not
> really seeing the benefit.
>

Not married to it myself, something to come back to when the dust settles.


> > (if this is true...but given this is an optimization category I'm
> thinking
> > maybe it doesn't actually matter...)
>
> It is true. Not sure what you mean with "optimization category"?
>
>
I mean that distinguishing between stats that are fixed and those that are
variable implies that fixed kinds have a better performance (speed, memory)
characteristic than variable kinds (at least in part due to the presence of
changecount).  If fixed kinds did not have a performance benefit then
having the variable kind implementation simply handle fixed kinds as well
(using the common struct header and storage in a hash table) would make the
implementation simpler since all statistics would report through the same
API.  In that world, variability is simply a possibility that not every
actual reporter has to use.  That improved performance characteristic is
what I meant by "optimization category".  I question whether we should be
publishing "fixed" and "variable" as concrete properties.  I'm not
presently against the current choice to do so, but as you say above, I'm
also not really seeing the benefit.

(goes and looks at all the places that use the fixed_amount
field...sparking an idea)

Coming back to this:
"""
+ /* cluster-scoped object stats having a variable number of entries */
+ PGSTAT_KIND_REPLSLOT = 1, /* per-slot statistics */
+ PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */
+ PGSTAT_KIND_DATABASE, /* database-wide statistics */ (I moved this to 3rd
spot to be closer to the database-scoped options)
+
+ /* database-scoped object stats having a variable number of entries */
+ PGSTAT_KIND_RELATION, /* per-table statistics */
+ PGSTAT_KIND_FUNCTION, /* per-function statistics */
+
+ /* cluster-scoped stats having a fixed number of entries */ (maybe these
should go first, the variable following?)
+ PGSTAT_KIND_ARCHIVER,
+ PGSTAT_KIND_BGWRITER,
+ PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_SLRU,
+ PGSTAT_KIND_WAL,
"""

I see three "KIND_GROUP" categories here:
PGSTAT_KIND_CLUSTER (open to a different word here though...)
PGSTAT_KIND_DATABASE (we seem to agree on this above)
PGSTAT_KIND_GLOBAL (already used in the code)

This single enum can replace the two booleans that, in combination, would
define 4 unique groups (of which only three are interesting -
database+fixed doesn't seem interesting and so is not given a name/value
here).

While the succinctness of the booleans has appeal, the need for half of the
booleans to end up being negated quickly tarnishes it.  With the three
groups, every assertion is positive in nature indicating which of the three
groups are handled by the function.  While that is probably a few more
characters it seems like an easier read and is less complicated as it has
fewer independent parts.  At most you OR two kinds together which is
succinct enough I would think.  There are no gaps relative to the existing
implementation that defines fixed_amount and accessed_across_databases -
every call site using either of them can be transformed mechanically.
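The mechanical transformation is small enough to state as a sketch — the
group names are from this proposal, and the mapping function itself is
hypothetical, just spelling out the correspondence claimed above:

```c
#include <stdbool.h>

/*
 * Sketch: how the two existing booleans map onto the three proposed
 * groups.
 *
 *   fixed_amount                          -> PGSTAT_GLOBAL
 *   variable, accessed_across_databases   -> PGSTAT_CLUSTER
 *   variable, database-local              -> PGSTAT_OBJECT
 *
 * (fixed + accessed_across_databases is the uninteresting fourth
 * combination and gets no name of its own.)
 */
typedef enum PgStat_KindGroup
{
	PGSTAT_GLOBAL = 1,
	PGSTAT_CLUSTER = 2,
	PGSTAT_OBJECT = 4
} PgStat_KindGroup;

static PgStat_KindGroup
kind_group_from_flags(bool fixed_amount, bool accessed_across_databases)
{
	if (fixed_amount)
		return PGSTAT_GLOBAL;
	return accessed_across_databases ? PGSTAT_CLUSTER : PGSTAT_OBJECT;
}
```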

David J.


Re: shared-memory based stats collector - v67

2022-04-04 Thread Andres Freund
Hi,

On 2022-03-23 10:42:03 -0700, Andres Freund wrote:
> On 2022-03-23 17:27:50 +0900, Kyotaro Horiguchi wrote:
> > +   /*
> > +* When starting with crash recovery, reset pgstat data - it might not 
> > be
> > +* valid. Otherwise restore pgstat data. It's safe to do this here,
> > +* because postmaster will not yet have started any other processes
> > +*
> > +* TODO: With a bit of extra work we could just start with a pgstat file
> > +* associated with the checkpoint redo location we're starting from.
> > +*/
> > +   if (ControlFile->state == DB_SHUTDOWNED ||
> > +   ControlFile->state == DB_SHUTDOWNED_IN_RECOVERY)
> > +   pgstat_restore_stats();
> > +   else
> > +   pgstat_discard_stats();
> > +
> >
> > Before there, InitWalRecovery changes the state to
> > DB_IN_ARCHIVE_RECOVERY if it was either DB_SHUTDOWNED or
> > DB_IN_PRODUCTION. So the stat seems like always discarded on standby.
>
> Hm. I though it worked at some point. I guess there's a reason this commit is
> a separate commit marked WIP ;)

FWIW, it had gotten broken by

commit be1c00ab13a7c2c9299d60cb5a9d285c40e2506c
Author: Heikki Linnakangas 
Date:   2022-02-16 09:22:44 +0200

Move code around in StartupXLOG().

because that moved the spot where
ControlFile->state = DB_IN_CRASH_RECOVERY
is set to an earlier location.

Greetings,

Andres Freund




Re: shared-memory based stats collector - v68

2022-04-04 Thread Andres Freund
Hi,

On 2022-04-04 15:24:24 -0700, David G. Johnston wrote:
> Replacing the existing assert(!kind->fixed_amount) with
> assert(!kind->accessed_across_databases) produces the same result as the
> latter presently implies the former.

I wasn't proposing to replace, but to add...


> Now I start to dislike the behavioral aspect of the attribute and would
> rather just name it: kind->is_cluster_scoped (or something else that is
> descriptive of the stat category itself, not how it is used)

I'm not in love with the name either. But cluster is just a badly overloaded
word :(.

system_wide? Or invert it and say: database_scoped? I think I like the latter.



> Then reorganize the Kind documentation to note and emphasize these two
> primary descriptors:
> variable, which can be cluster or database scoped
> fixed, which are cluster scoped by definition

Hm. There's not actually that much difference between cluster/non-cluster wide
scope for most of the system. I'm not strongly against, but I'm also not
really seeing the benefit.


> (if this is true...but given this is an optimization category I'm thinking
> maybe it doesn't actually matter...)

It is true. Not sure what you mean with "optimization category"?


Greetings,

Andres Freund




Re: shared-memory based stats collector - v68

2022-04-04 Thread David G. Johnston
On Mon, Apr 4, 2022 at 2:54 PM Andres Freund  wrote:

>
> > As the existing function only handles functions and relations why not
> just
> > perform a specific Kind check for them?  Generalizing to assert on
> whether
> > or not the function works on fixed or variable Kinds seems beyond its
> > present state.  Or could it be used, as-is, for databases, replication
> > slots, and subscriptions today, and we just haven't migrated those areas
> to
> > use the now generalized function?
>
> It couldn't quite be used for those, because it really only makes sense for
> objects "within a database", because it wants to reset the timestamp of the
> pg_stat_database row too (I don't like that behaviour as-is, but that's the
> topic of another thread as you know...).
>
> It will work for other per-database stats though, once we have them.
>
>
> > Even then, unless we do expand the
> > definition of this publicly facing function it seems better to
> > precisely define what it requires as an input Kind by checking for
> RELATION
> > or FUNCTION specifically.
>
> I don't see a benefit in adding a restriction on it that we'd just have to
> lift again?
>
> How about adding a
> Assert(!pgstat_kind_info_for(kind)->accessed_across_databases)
>
> and extending the function comment to say that it's used for per-database
> stats and that it resets both the passed-in stats object as well as
> pg_stat_database?
>
>
I could live with adding that, but...

Replacing the existing assert(!kind->fixed_amount) with
assert(!kind->accessed_across_databases) produces the same result as the
latter presently implies the former.

Now I start to dislike the behavioral aspect of the attribute and would
rather just name it: kind->is_cluster_scoped (or something else that is
descriptive of the stat category itself, not how it is used)

Then reorganize the Kind documentation to note and emphasize these two
primary descriptors:
variable, which can be cluster or database scoped
fixed, which are cluster scoped by definition (if this is true...but given
this is an optimization category I'm thinking maybe it doesn't actually
matter...)

+ /* cluster-scoped object stats having a variable number of entries */
+ PGSTAT_KIND_REPLSLOT = 1, /* per-slot statistics */
+ PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */
+ PGSTAT_KIND_DATABASE, /* database-wide statistics */ (I moved this to 3rd
spot to be closer to the database-scoped options)
+
+ /* database-scoped object stats having a variable number of entries */
+ PGSTAT_KIND_RELATION, /* per-table statistics */
+ PGSTAT_KIND_FUNCTION, /* per-function statistics */
+
+ /* cluster-scoped stats having a fixed number of entries */ (maybe these
should go first, the variable following?)
+ PGSTAT_KIND_ARCHIVER,
+ PGSTAT_KIND_BGWRITER,
+ PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_SLRU,
+ PGSTAT_KIND_WAL,

David J.


Re: shared-memory based stats collector - v68

2022-04-04 Thread Andres Freund
Hi,

On 2022-04-04 14:25:57 -0700, David G. Johnston wrote:
> > You mentioned this as a restriction above - I'm not seeing it as such?  I'd
> > like to write out stats more often in the future (e.g. in the
> > checkpointer),
> > but then it'd not be written out with this function...
> >
> >
> Yeah, the idea only really works if you can implement "last one out, shut
> off the lights".  I think I was subconsciously wanting this to work that
> way, but the existing process is good.

Preserving stats more than we do today (the patch doesn't really affect that)
will require a good chunk more work. My idea for it is that we'd write the
file out as part of a checkpoint / restartpoint, with a name including the
redo-lsn. Then when recovery starts, it can use the stats file associated with
that to start from.  Then we'd lose at most 1 checkpoint's worth of stats
during a crash, not more.

There's a few non-trivial corner cases to solve, around stats objects getting
dropped concurrently with creating that serialized snapshot. Solvable, but not
trivial.
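A hypothetical sketch of the naming scheme described above — one serialized
stats file per checkpoint, keyed by the checkpoint's redo LSN, so recovery
can load the file matching the checkpoint it starts from. The directory, the
format string, and the helper are all assumptions of mine, not anything from
the patch:

```c
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;	/* stand-in for the real typedef */

/*
 * Build the per-checkpoint stats file name from the redo LSN, using
 * the usual high/low 32-bit hex rendering of an LSN. Purely a sketch.
 */
static void
stats_file_for_redo(char *path, size_t len, XLogRecPtr redo)
{
	snprintf(path, len, "pg_stat/pgstat.%08X%08X",
			 (uint32_t) (redo >> 32), (uint32_t) redo);
}
```

At recovery start, the checkpoint record's redo pointer would then select
exactly one candidate file, and any files for later (never-completed)
checkpoints could simply be discarded.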


> > > + * I also am unsure, off the top of my head, whether both replication
> > > slots and subscriptions,
> > > + * which are fixed, can be reset singly (today, and/or whether this
> > patch
> > > enables that capability)
> > > + */
> >
> > FWIW, neither are implemented as fixed amount stats.
> 
> 
> That was a typo, I meant to write variable.  My point was that of these 5
> kinds that will pass the assertion test only 2 of them are actually handled
> by the function today.
> 
> + PGSTAT_KIND_DATABASE = 1, /* database-wide statistics */
> + PGSTAT_KIND_RELATION, /* per-table statistics */
> + PGSTAT_KIND_FUNCTION, /* per-function statistics */
> + PGSTAT_KIND_REPLSLOT, /* per-slot statistics */
> + PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */

> As the existing function only handles functions and relations why not just
> perform a specific Kind check for them?  Generalizing to assert on whether
> or not the function works on fixed or variable Kinds seems beyond its
> present state.  Or could it be used, as-is, for databases, replication
> slots, and subscriptions today, and we just haven't migrated those areas to
> use the now generalized function?

It couldn't quite be used for those, because it really only makes sense for
objects "within a database", because it wants to reset the timestamp of the
pg_stat_database row too (I don't like that behaviour as-is, but that's the
topic of another thread as you know...).

It will work for other per-database stats though, once we have them.


> Even then, unless we do expand the
> definition of this publicly facing function it seems better to
> precisely define what it requires as an input Kind by checking for RELATION
> or FUNCTION specifically.

I don't see a benefit in adding a restriction on it that we'd just have to
lift again?

How about adding a
Assert(!pgstat_kind_info_for(kind)->accessed_across_databases)

and extending the function comment to say that it's used for per-database
stats and that it resets both the passed-in stats object as well as
pg_stat_database?

Greetings,

Andres Freund




Re: shared-memory based stats collector - v68

2022-04-04 Thread David G. Johnston
On Mon, Apr 4, 2022 at 2:06 PM Andres Freund  wrote:

>
> > My first encounter with pg_stat_exists_stat() didn't draw my attention as
> > being problematic so I'd say we just stick with it.  As a SQL user
> reading:
> > WHERE exists (...) is somewhat natural; using "have" or back-to-back
> > stat_stat is less appealing.
>
> There are a number of other *_exists functions, albeit not within
> pg_stat_*. Like jsonb_exists. Perhaps just pg_stat_exists()?
>
>
Works for me.


> A way to get back to the old behaviour seems good,
> and the function idea doesn't provide that.
>

Makes sense.

> (merged the typos that I hadn't already fixed based on Justin / Thomas'
> feedback)
>
>
> > @@ -371,7 +371,9 @@ pgstat_discard_stats(void)
> >  /*
> >   * pgstat_before_server_shutdown() needs to be called by exactly one
> > process
> >   * during regular server shutdowns. Otherwise all stats will be lost.
> > - *
> > + * XXX: What bad things happen if this is invoked by more than one
> process?
> > + *   I'd presume stats are not actually lost in that case.  Can we just
> > 'no-op'
> > + *   subsequent calls and say "at least once at shutdown, as late as
> > possible"
>
> What's the reason behind this question? There really shouldn't be a second
> call (and there's only a single callsite). As is you'd get an assertion
> failure about things already having been shutdown.
>

Mostly OCD, I guess: "exactly one" has two failure modes - zero, and > 1;
and the "Otherwise" only covers the zero mode.


> I don't think we want to relax that, because in all the realistic scenarios
> that I can think of that'd open us up to losing stats that were generated
> after the first writeout of the stats data.


> You mentioned this as a restriction above - I'm not seeing it as such?  I'd
> like to write out stats more often in the future (e.g. in the
> checkpointer),
> but then it'd not be written out with this function...
>
>
Yeah, the idea only really works if you can implement "last one out, shut
off the lights".  I think I was subconsciously wanting this to work that
way, but the existing process is good.


>
> > @@ -654,6 +656,14 @@ pgstat_reset_single_counter(PgStat_Kind kind, Oid
> > objoid)
> >
> >   Assert(!pgstat_kind_info_for(kind)->fixed_amount);
> >
> > + /*
> > + * More of a conceptual observation here - the fact that something is
> > fixed does not imply
> > + * that it is not fixed at a value greater than zero and thus could have
> > single subentries
> > + * that could be addressed.
>
> pgstat_reset_single_counter() is a pre-existing function (with a
> pre-existing
> name, but adapted signature in the patch), it's currently only used for
> functions and relation stats.
>
>
> > + * I also am unsure, off the top of my head, whether both replication
> > slots and subscriptions,
> > + * which are fixed, can be reset singly (today, and/or whether this
> patch
> > enables that capability)
> > + */
>
> FWIW, neither are implemented as fixed amount stats.


That was a typo, I meant to write variable.  My point was that of these 5
kinds that will pass the assertion test only 2 of them are actually handled
by the function today.

+ PGSTAT_KIND_DATABASE = 1, /* database-wide statistics */
+ PGSTAT_KIND_RELATION, /* per-table statistics */
+ PGSTAT_KIND_FUNCTION, /* per-function statistics */
+ PGSTAT_KIND_REPLSLOT, /* per-slot statistics */
+ PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */


> There's one example of fixed amount stats that can be reset more
> granularly,
> namely slru. That can be done via pg_stat_reset_slru().
>
>
Right, hence the conceptual disconnect.  It doesn't affect the
implementation, everything is working just fine, but is something to ponder
for future maintainers getting up to speed here.

As the existing function only handles functions and relations why not just
perform a specific Kind check for them?  Generalizing to assert on whether
or not the function works on fixed or variable Kinds seems beyond its
present state.  Or could it be used, as-is, for databases, replication
slots, and subscriptions today, and we just haven't migrated those areas to
use the now generalized function?  Even then, unless we do expand the
definition of this publicly facing function it seems better to
precisely define what it requires as an input Kind by checking for RELATION
or FUNCTION specifically.

David J.


Re: shared-memory based stats collector - v68

2022-04-04 Thread Andres Freund
Hi,

On 2022-04-04 13:45:40 -0700, David G. Johnston wrote:
> I didn't take the time to fixup all the various odd typos in the general
> code comments; none of them reduced comprehension appreciably.  I may do so
> when/if I do another pass.

Cool.


> My first encounter with pg_stat_exists_stat() didn't draw my attention as
> being problematic so I'd say we just stick with it.  As a SQL user reading:
> WHERE exists (...) is somewhat natural; using "have" or back-to-back
> stat_stat is less appealing.

There are a number of other *_exists functions, albeit not within
pg_stat_*. Like jsonb_exists. Perhaps just pg_stat_exists()?


> I would suggest we do away with stats_fetch_consistency "snapshot" mode and
> instead add a function that can be called that would accomplish the same
> thing but in "cache" mode.  Future iterations of that function could accept
> patterns, allowing for something between "one" and "everything".

I don't want to do that. We had a lot of discussion around what consistency
model we want, and Tom was adamant that there needs to be a mode that behaves
like the current consistency model (which is what snapshot behaves like, with
very minor differences).  A way to get back to the old behaviour seems good,
and the function idea doesn't provide that.


(merged the typos that I hadn't already fixed based on Justin / Thomas'
feedback)


> @@ -371,7 +371,9 @@ pgstat_discard_stats(void)
>  /*
>   * pgstat_before_server_shutdown() needs to be called by exactly one
> process
>   * during regular server shutdowns. Otherwise all stats will be lost.
> - *
> + * XXX: What bad things happen if this is invoked by more than one process?
> + *   I'd presume stats are not actually lost in that case.  Can we just
> 'no-op'
> + *   subsequent calls and say "at least once at shutdown, as late as
> possible"

What's the reason behind this question? There really shouldn't be a second
call (and there's only a single callsite). As is you'd get an assertion
failure about things already having been shutdown.

I don't think we want to relax that, because in all the realistic scenarios
that I can think of that'd open us up to losing stats that were generated
after the first writeout of the stats data.

You mentioned this as a restriction above - I'm not seeing it as such?  I'd
like to write out stats more often in the future (e.g. in the checkpointer),
but then it'd not be written out with this function...


> @@ -654,6 +656,14 @@ pgstat_reset_single_counter(PgStat_Kind kind, Oid
> objoid)
> 
>   Assert(!pgstat_kind_info_for(kind)->fixed_amount);
> 
> + /*
> + * More of a conceptual observation here - the fact that something is
> fixed does not imply
> + * that it is not fixed at a value greater than zero and thus could have
> single subentries
> + * that could be addressed.

pgstat_reset_single_counter() is a pre-existing function (with a pre-existing
name, but adapted signature in the patch), it's currently only used for
functions and relation stats.


> + * I also am unsure, off the top of my head, whether both replication
> slots and subscriptions,
> + * which are fixed, can be reset singly (today, and/or whether this patch
> enables that capability)
> + */

FWIW, neither are implemented as fixed amount stats. There's afaics no limit
at all for the number of existing subscriptions (although some would either
need to be disabled or you'd get errors). While there is a limit on the number
of slots, that's a configurable limit. So replication slot stats are also
implemented as variable amount stats (that used to be different, wasn't nice).

There's one example of fixed amount stats that can be reset more granularly,
namely slru. That can be done via pg_stat_reset_slru().

Thanks,

Andres




Re: shared-memory based stats collector - v68

2022-04-04 Thread David G. Johnston
On Sun, Apr 3, 2022 at 9:16 PM Andres Freund  wrote:

>
> Please take a look!
>
>
I didn't take the time to fixup all the various odd typos in the general
code comments; none of them reduced comprehension appreciably.  I may do so
when/if I do another pass.

I did skim over the entire patch set and, FWIW, found it to be quite
understandable.  I don't have the experience to comment on the lower-level
details like locking and such but the medium picture stuff makes sense to
me both as a user and a developer.  I did leave a couple of comments about
parts that at least piqued my interest (reset single stats) or seemed like
an undesirable restriction that was under addressed (before server shutdown
called exactly once).

I agree with Thomas's observation regarding PGSTAT_KIND_LAST.  I also think
that leaving it starting at 1 makes sense - maybe just fix the name and
comment to better reflect its actual usage in core.

I concur also with changing usages of " / " to ", or"

My first encounter with pg_stat_exists_stat() didn't draw my attention as
being problematic so I'd say we just stick with it.  As a SQL user reading:
WHERE exists (...) is somewhat natural; using "have" or back-to-back
stat_stat is less appealing.

I would suggest we do away with stats_fetch_consistency "snapshot" mode and
instead add a function that can be called that would accomplish the same
thing but in "cache" mode.  Future iterations of that function could accept
patterns, allowing for something between "one" and "everything".

I'm also not an immediate fan of "fetch_consistency"; with the function
suggestion it is basically "cache" and "no-cache" so maybe:
stats_use_transaction_cache ? (haven't thought hard or long on this one...)


 diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 22d0a1e491..e889c11d9e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2123,7 +2123,7 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss
11:34   0:00 postgres: ser
  
  
   PgStatsData
-  Waiting fo shared memory stats data access
+  Waiting for shared memory stats data access
  
  
   SerializableXactHash
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 2689d0962c..bc7bdf8064 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4469,7 +4469,7 @@ PostgresMain(const char *dbname, const char *username)

  /*
  * (4) turn off the idle-in-transaction, idle-session and
- * idle-state-update timeouts if active.  We do this before step (5) so
+ * idle-stats-update timeouts if active.  We do this before step (5) so
  * that any last-moment timeout is certain to be detected in step (5).
  *
  * At most one of these timeouts will be active, so there's no need to
diff --git a/src/backend/utils/activity/pgstat.c
b/src/backend/utils/activity/pgstat.c
index dbd55a065d..370638b33b 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -5,7 +5,7 @@
  * Provides the infrastructure to collect and access cumulative statistics,
  * e.g. per-table access statistics, of all backends in shared memory.
  *
- * Most statistics updates are first first accumulated locally in each
process
+ * Most statistics updates are first accumulated locally in each process
  * as pending entries, then later flushed to shared memory (just after
commit,
  * or by idle-timeout).
  *
@@ -371,7 +371,9 @@ pgstat_discard_stats(void)
 /*
  * pgstat_before_server_shutdown() needs to be called by exactly one
process
  * during regular server shutdowns. Otherwise all stats will be lost.
- *
+ * XXX: What bad things happen if this is invoked by more than one process?
+ *   I'd presume stats are not actually lost in that case.  Can we just
'no-op'
+ *   subsequent calls and say "at least once at shutdown, as late as
possible"
  * We currently only write out stats for proc_exit(0). We might want to
change
  * that at some point... But right now pgstat_discard_stats() would be
called
  * during the start after a disorderly shutdown, anyway.
@@ -654,6 +656,14 @@ pgstat_reset_single_counter(PgStat_Kind kind, Oid
objoid)

  Assert(!pgstat_kind_info_for(kind)->fixed_amount);

+ /*
+ * More of a conceptual observation here - the fact that something is
fixed does not imply
+ * that it is not fixed at a value greater than zero and thus could have
single subentries
+ * that could be addressed.
+ * I also am unsure, off the top of my head, whether both replication
slots and subscriptions,
+ * which are fixed, can be reset singly (today, and/or whether this patch
enables that capability)
+ */
+
  /* Set the reset timestamp for the whole database */
  pgstat_reset_database_timestamp(MyDatabaseId, ts);


David J.


Re: shared-memory based stats collector - v68

2022-04-04 Thread Andres Freund
Hi,

On 2022-04-03 21:15:16 -0700, Andres Freund wrote:
> - collect who reviewed earlier revisions

I found reviews by
- Tomas Vondra 
- Arthur Zakirov 
- Antonin Houska 

There's also reviews by Fujii and Alvaro, but afaics just for parts that were
separately committed.

Greetings,

Andres Freund




Re: shared-memory based stats collector - v68

2022-04-04 Thread Andres Freund
Hi,

On 2022-04-05 01:16:04 +1200, Thomas Munro wrote:
> On Mon, Apr 4, 2022 at 4:16 PM Andres Freund  wrote:
> > Please take a look!
> 
> A few superficial comments:
> 
> > [PATCH v68 01/31] pgstat: consistent function header formatting.
> > [PATCH v68 02/31] pgstat: remove some superflous comments from pgstat.h.
> 
> +1

Planning to commit these after making another coffee and proof reading them
some more.


> > [PATCH v68 03/31] dshash: revise sequential scan support.
> 
> Logic looks good.  That is,
> lock-0-and-ensure_valid_bucket_pointers()-only-once makes sense.  Just
> some comment trivia:
> 
> + * dshash_seq_term needs to be called when a scan finished.  The caller may
> + * delete returned elements midst of a scan by using dshash_delete_current()
> + * if exclusive = true.
> 
> s/scan finished/scan is finished/
> s/midst of/during/ (or /in the middle of/, ...)
> 
> > [PATCH v68 04/31] dsm: allow use in single user mode.
> 
> LGTM.


> +   Assert(IsUnderPostmaster || !IsPostmasterEnvironment);

> (Not this patch's fault, but I wish we had a more explicit way to say "am
> single user".)

Agreed.


> > [PATCH v68 05/31] pgstat: stats collector references in comments
> 
> LGTM.  I could think of some alternative suggested names for this subsystem,
> but don't think it would be helpful at this juncture so I will refrain :-)

Heh. I did start a thread about it a while ago :)


> > [PATCH v68 08/31] pgstat: introduce PgStat_Kind enum.
> 
> +#define PGSTAT_KIND_FIRST PGSTAT_KIND_DATABASE
> +#define PGSTAT_KIND_LAST PGSTAT_KIND_WAL
> +#define PGSTAT_NUM_KINDS (PGSTAT_KIND_LAST + 1)
> 
> It's a little confusing that PGSTAT_NUM_KINDS isn't really the number of 
> kinds,
> because there is no kind 0.  For the two users of it... maybe just use
> pgstat_kind_infos[] = {...}, and
> global_valid[PGSTAT_KIND_LAST + 1]?

Maybe the whole justification for not defining an invalid kind is moot
now. There's not a single switch covering all kinds of stats left, and I hope
that we don't introduce one again...


> > [PATCH v68 10/31] pgstat: scaffolding for transactional stats creation / 
> > drop.
> 
> +   /*
> +* Dropping the statistics for objects that dropped transactionally itself
> +* needs to be transactional. ...
> 
> Hard to parse.  How about:  "Objects are dropped transactionally, so
> related statistics need to be dropped transactionally too."

Not all objects are dropped transactionally. But I agree it reads awkwardly. I
now, incorporating feedback from Justin as well, rephrased it to:

/*
 * Statistics for transactionally dropped objects need to be
 * transactionally dropped as well. Collect the stats dropped in the
 * current (sub-)transaction and only execute the stats drop when we 
know
 * if the transaction commits/aborts. To handle replicas and crashes,
 * stats drops are included in commit / abort records.
 */

A few too many "drop"s in there, but maybe that's unavoidable.



> +/*
> + * Whenever the for a dropped stats entry could not be freed (because
> + * backends still have references), this is incremented, causing backends
> + * to run pgstat_gc_entry_refs(), allowing that memory to be reclaimed.
> + */
> +pg_atomic_uint64 gc_count;
> 
> Whenever the ...?

 * Whenever statistics for dropped objects could not be freed - because
 * backends still have references - the dropping backend calls
 * pgstat_request_entry_refs_gc() incrementing this counter. Eventually
 * that causes backends to run pgstat_gc_entry_refs(), allowing memory 
to
 * be reclaimed.


> Would it be better to call this variable gc_request_count?

Agreed.


> + * Initialize refcount to 1, marking it as valid / not tdroped. The entry
> 
> s/tdroped/dropped/
> 
> + * further if a longer lived references is needed.
> 
> s/references/reference/

Fixed.


> +/*
> + * There are legitimate cases where the old stats entry might not
> + * yet have been dropped by the time it's reused. The easiest 
> case
> + * are replication slot stats. But oid wraparound can lead to
> + * other cases as well. We just reset the stats to their plain
> + * state.
> + */
> +shheader = pgstat_reinit_entry(kind, shhashent);
> 
> This whole comment is repeated in pgstat_reinit_entry and its caller.

I guess I felt as indecisive about where to place it between the two locations
when I wrote it as I do now. Left it at the callsite for now.


> +/*
> + * XXX: Might be worth adding some frobbing of the allocation before
> + * freeing, to make it easier to detect use-after-free style bugs.
> + */
> +dsa_free(pgStatLocal.dsa, pdsa);
> 
> FWIW dsa_free() clobbers memory in assert builds, just like pfree().

Oh. I could swear I saw that not being the case a while ago. But clearly it is
the case. Removed.


> +static Size

Re: shared-memory based stats collector - v68

2022-04-04 Thread Thomas Munro
On Mon, Apr 4, 2022 at 4:16 PM Andres Freund  wrote:
> Please take a look!

A few superficial comments:

> [PATCH v68 01/31] pgstat: consistent function header formatting.
> [PATCH v68 02/31] pgstat: remove some superflous comments from pgstat.h.

+1

> [PATCH v68 03/31] dshash: revise sequential scan support.

Logic looks good.  That is,
lock-0-and-ensure_valid_bucket_pointers()-only-once makes sense.  Just
some comment trivia:

+ * dshash_seq_term needs to be called when a scan finished.  The caller may
+ * delete returned elements midst of a scan by using dshash_delete_current()
+ * if exclusive = true.

s/scan finished/scan is finished/
s/midst of/during/ (or /in the middle of/, ...)

> [PATCH v68 04/31] dsm: allow use in single user mode.

LGTM.

+   Assert(IsUnderPostmaster || !IsPostmasterEnvironment);

(Not this patch's fault, but I wish we had a more explicit way to say "am
single user".)

> [PATCH v68 05/31] pgstat: stats collector references in comments

LGTM.  I could think of some alternative suggested names for this subsystem,
but don't think it would be helpful at this juncture so I will refrain :-)

> [PATCH v68 06/31] pgstat: add pgstat_copy_relation_stats().
> [PATCH v68 07/31] pgstat: move transactional code into pgstat_xact.c.

LGTM.

> [PATCH v68 08/31] pgstat: introduce PgStat_Kind enum.

+#define PGSTAT_KIND_FIRST PGSTAT_KIND_DATABASE
+#define PGSTAT_KIND_LAST PGSTAT_KIND_WAL
+#define PGSTAT_NUM_KINDS (PGSTAT_KIND_LAST + 1)

It's a little confusing that PGSTAT_NUM_KINDS isn't really the number of kinds,
because there is no kind 0.  For the two users of it... maybe just use
pgstat_kind_infos[] = {...}, and
global_valid[PGSTAT_KIND_LAST + 1]?

> [PATCH v68 10/31] pgstat: scaffolding for transactional stats creation / drop.

+   /*
+* Dropping the statistics for objects that dropped transactionally itself
+* needs to be transactional. ...

Hard to parse.  How about:  "Objects are dropped transactionally, so
related statistics need to be dropped transactionally too."

> [PATCH v68 13/31] pgstat: store statistics in shared memory.

+ * Single-writer stats use the changecount mechanism to achieve low-overhead
+ * writes - they're obviously performance critical than reads. Check the
+ * definition of struct PgBackendStatus for some explanation of the
+ * changecount mechanism.

Missing word "more" after obviously?

+/*
+ * Whenever the for a dropped stats entry could not be freed (because
+ * backends still have references), this is incremented, causing backends
+ * to run pgstat_gc_entry_refs(), allowing that memory to be reclaimed.
+ */
+pg_atomic_uint64 gc_count;

Whenever the ...?

Would it be better to call this variable gc_request_count?

+ * Initialize refcount to 1, marking it as valid / not tdroped. The entry

s/tdroped/dropped/

+ * further if a longer lived references is needed.

s/references/reference/

+/*
+ * There are legitimate cases where the old stats entry might not
+ * yet have been dropped by the time it's reused. The easiest case
+ * are replication slot stats. But oid wraparound can lead to
+ * other cases as well. We just reset the stats to their plain
+ * state.
+ */
+shheader = pgstat_reinit_entry(kind, shhashent);

This whole comment is repeated in pgstat_reinit_entry and its caller.

+/*
+ * XXX: Might be worth adding some frobbing of the allocation before
+ * freeing, to make it easier to detect use-after-free style bugs.
+ */
+dsa_free(pgStatLocal.dsa, pdsa);

FWIW dsa_free() clobbers memory in assert builds, just like pfree().

+static Size
+pgstat_dsa_init_size(void)
+{
+Sizesz;
+
+/*
+ * The dshash header / initial buckets array needs to fit into "plain"
+ * shared memory, but it's beneficial to not need dsm segments
+ * immediately. A size of 256kB seems works well and is not
+ * disproportional compared to other constant sized shared memory
+ * allocations. NB: To avoid DSMs further, the user can configure
+ * min_dynamic_shared_memory.
+ */
+sz = 256 * 1024;

It kinda bothers me that the memory reserved by
min_dynamic_shared_memory might eventually fill up with stats, and not
be available for temporary use by parallel queries (which can benefit
more from fast acquire/release on DSMs, and probably also huge pages,
or maybe not...), and that's hard to diagnose.

+ * (4) turn off the idle-in-transaction, idle-session and
+ * idle-state-update timeouts if active.  We do this before step (5) so

s/idle-state-/idle-stats-/

+/*
+ * Some of the pending stats may have not been flushed due to lock
+ * contention.  If we have such pending stats here, let the caller know
+ * the retry interval.
+ */
+if (partial_flush)
+{

I think it's better for a comment that is outside the block to say "If
some of the pending...".  Or 

Re: shared-memory based stats collector - v66

2022-04-02 Thread Andres Freund
Hi,

On 2022-03-25 17:24:18 +0900, Kyotaro Horiguchi wrote:
> > * AFIXME: Should all the stats drop code be moved into pgstat_drop.c?
> 
> Or pgstat_xact.c?

I wasn't initially happy with that suggestion, but after running with it, it
looks pretty good.

I also moved a fair bit of code into pgstat_shm.c, which to me improved code
navigation a lot. I'm wondering about splitting it further even, into
pgstat_shm.c and pgstat_entry.c.

What do you think?

Greetings,

Andres Freund




Re: shared-memory based stats collector - v66

2022-04-02 Thread Andres Freund
Hi,

On 2022-03-25 17:24:18 +0900, Kyotaro Horiguchi wrote:
> > AFIXME: Isn't PGSTAT_MIN_INTERVAL way too long? What is the justification
> > for increasing it?
> 
> It is 1000ms in the comment just above but actually 10000ms.  The
> number came from a discussion that if we have 1000 clients and each
> backend writes stats once per 0.5 seconds, we would flush pending
> data to the shared area 2000 times per second in total, which is too
> frequent.  I raised it to 5000ms, then 10000ms.  So the expected
> maximum flush frequency is reduced to 100 times per second.  Of
> course this assumes the worst case, and 10000ms is apparently too
> long for the average case.
> 
> The current implementation of pgstat postpones flushing if a lock
> collision happens, then postpones by at most 60s.  This is a kind of
> auto-averaging mechanism.  It might be enough, and we could reduce
> PGSTAT_MIN_INTERVAL to 500ms or so.

I just noticed that the code doesn't appear to actually work like that right
now. Whenever the timeout is reached, pgstat_report_stat() is called with
force = true.

And even if the backend is busy running queries, once there's contention, the
next invocation of pgstat_report_stat() will return the timeout relative to
pending_since, which will then trigger a force report via a very short timeout
soon.

It might actually make sense to only ever return PGSTAT_RETRY_MIN_INTERVAL
(with a slightly different name) from pgstat_report_stat() when blocked
(limiting the max reporting delay for an idle connection) and to continue
calling pgstat_report_stat(force = true).  But to only trigger force
"internally" in pgstat_report_stat() when PGSTAT_MAX_INTERVAL is reached.

I think that'd mean we'd report after max PGSTAT_RETRY_MIN_INTERVAL in an idle
connection, and try reporting every PGSTAT_RETRY_MIN_INTERVAL (increasing up
to PGSTAT_MAX_INTERVAL when blocked) on busy connections.

Makes sense?


I think we need to do something with the pgstat_report_stat() calls outside of
postgres.c. Otherwise there's nothing limiting their reporting delay, because
they don't have the timeout logic postgres.c has.  None of them is ever hot
enough to be problematic, so I think we should just make them pass force=true?

Greetings,

Andres Freund




Re: shared-memory based stats collector - v67

2022-03-29 Thread Andres Freund
Hi,

On 2022-03-23 16:38:33 +0900, Kyotaro Horiguchi wrote:
> At Tue, 22 Mar 2022 11:56:40 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > Docs..  Yeah I'll try it.
> 
> This is the first cut, based on the earlier patchset.

Thanks!


> I didn't mention pgstat_force_next_flush() since I think it is a
> developer-only feature.

Yes, that makes sense.


Sorry for not yet getting back to looking at this.

One thing we definitely need to add documentation for is the
stats_fetch_consistency GUC. I think we should change its default to 'cache',
because that still gives the ability to "self-join", without the cost of the
current method.

Greetings,

Andres Freund




Re: shared-memory based stats collector - v66

2022-03-29 Thread Andres Freund
Hi,

On 2022-03-25 17:24:18 +0900, Kyotaro Horiguchi wrote:
> I'd like to dump out my humble thoughts about other AFIXMEs..

Thanks!  Please have another look at the code in
https://github.com/anarazel/postgres/tree/shmstat I just pushed a revised
version with a lot of [a]fixmes removed.


Most importantly I did move replication slot stats into the hash table, and
just generally revised the replication slot stats code substantially. I
think it does look better now.

But also there's a new commit allowing dsm use in single user mode. To be able
to rely on stats drops we need to perform them even in single user mode. The
only reason this didn't previously fail was that we allocated enough "static"
shared memory for single user mode to never need DSMs.

Thanks to Melanie's tests, and a few more additions by myself, the code is now
reasonably well covered. The big exception to that is recovery conflict stats,
and as Melanie noticed, that was broken (somehow pgstat_database_flush_cb()
didn't sum them up). I think she has some WIP tests...

Re the added tests: I did fix a few timing issues there. There's probably a
few more hiding somewhere.


I also found that unfortunately dshash_seq_next() as is isn't correct. I
included a workaround commit, but it's not correct. What we need to do is to
just always lock partition 0 in the initialization branch. Before we call
ensure_valid_bucket_pointers() status->hash_table->size_log2 isn't valid. And
ensure_valid_bucket_pointers can only be called with a lock...



Horiguchi-san, if you have time to look at the "XXX: The following could now be
generalized" in pgstat_read_statsfile(), pgstat_write_statsfile()... I think
that'd be nice to clean up.



> > AFIXME: Isn't PGSTAT_MIN_INTERVAL way too long? What is the justification
> > for increasing it?
> 
> It is 1000ms in the comment just above but actually 10000ms.  The
> number came from a discussion that if we have 1000 clients and each
> backend writes stats once per 0.5 seconds, we would flush pending
> data to the shared area 2000 times per second in total, which is too frequent.

Have you measured this (recently)? I tried to cause contention with a workload
targeted towards that, but couldn't see a problem with 1000ms. Of course
there's a problem with 1ms...

I think it's confusing to not report stats for 10s without a need.


> The current implementation of pgstat postpones flushing if a lock
> collision happens, then postpones by at most 60s.  This is a kind of
> auto-averaging mechanism.  It might be enough, and we could reduce
> PGSTAT_MIN_INTERVAL to 500ms or so.

Yea, I think the 60s part under contention is fine. I'd expect that to be
rarely reached.


> 
> > AFIXME: architecture explanation.
> 
> Mmm. next, please:p

Working on it. There's one more AFIXME that I want to resolve before, so I
don't end up with old type names strewn around (the one in pgstat_internal.h).


> 
> ( [PGSTAT_KIND_REPLSLOT] = {)
> > * AFIXME: With a bit of extra work this could now be a !fixed_amount
> > * stats kind.
> 
> Yeah.  The most bothersome point is the slot index is not persistent
> at all and the relationship between the index and name (or identity)
> is not stable even within a process life.  It can be resolved by
> allocating an object id to every replication slot.  I faintly remember
> of a discussion like that but I don't have a clear memory of the
> discussion.

I think it's resolved now. pgstat_report_replslot* all get the ReplicationSlot
as a parameter. They use the new ReplicationSlotIndex() to get an index from
that. pgstat_report_replslot_(create|acquire) ensure that the relevant index
doesn't somehow contain old stats.

To deal with indexes changing / slots getting removed during restart, there's
now a new callback made during pgstat_read_statsfile() to build the key from
the serialized NameStr. That can return false if a slot of that name is not
know, or use ReplicationSlotIndex() to get the index to store in-memory stats.


> > static Size
> > pgstat_dsa_init_size(void)
> > {
> > /*
> >  * AFIXME: What should we choose as an initial size? Should we make this
> >  * configurable? Maybe tune based on NBuffers?
> 
> > StatsShmemInit(void)
> >  * AFIXME: we need to guarantee this can be allocated in plain 
> > shared
> >  * memory, rather than allocating dsm segments.
> 
> I'm not sure that NBuffers is the ideal base for deciding the required
> size since it doesn't seem to be generally in proportion with the
> number of database objects.  If we made it manually-tunable, we will
> be able to emit a log when DSM segment allocation happens for this use
> as as the tuning aid..
> 
>WARNING: dsa allocation happened for activity statistics
>HINT: You might want to increase stat_dsa_initial_size if you see slow
>  down blah..

FWIW, I couldn't find any performance impact from using DSM. Because of the
"PgStatSharedRef" layer, there's not actually that much interaction with the
dsm 

Re: shared-memory based stats collector - v66

2022-03-25 Thread Kyotaro Horiguchi
At Fri, 25 Mar 2022 14:22:56 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Thu, 24 Mar 2022 13:21:33 -0400, Melanie Plageman 
>  wrote in 
> > On Thu, Mar 17, 2022 at 3:36 AM Andres Freund  wrote:
> > >
> > > The biggest todos are:
> > > - Address all the remaining AFIXMEs and XXXs
> > 
> > Attached is a patch that addresses three of the existing AFIXMEs.

I'd like to dump out my humble thoughts about other AFIXMEs..

> AFIXME: Isn't PGSTAT_MIN_INTERVAL way too long? What is the justification
> for increasing it?

It is 1000ms in the comment just above but actually 10000ms.  The
number came from a discussion that if we have 1000 clients and each
backend writes stats once per 0.5 seconds, we would flush pending
data to the shared area 2000 times per second in total, which is too
frequent.  I raised it to 5000ms, then 10000ms.  So the expected
maximum flush frequency is reduced to 100 times per second.  Of
course this assumes the worst case, and 10000ms is apparently too
long for the average case.

The current implementation of pgstat postpones flushing if a lock
collision happens, then postpones by at most 60s.  This is a kind of
auto-averaging mechanism.  It might be enough, and we could reduce
PGSTAT_MIN_INTERVAL to 500ms or so.


> AFIXME: architecture explanation.

Mmm. next, please:p


(   [PGSTAT_KIND_REPLSLOT] = {)
> * AFIXME: With a bit of extra work this could now be a !fixed_amount
> * stats kind.

Yeah.  The most bothersome point is the slot index is not persistent
at all and the relationship between the index and name (or identity)
is not stable even within a process life.  It can be resolved by
allocating an object id to every replication slot.  I faintly remember
of a discussion like that but I don't have a clear memory of the
discussion.

> static Size
> pgstat_dsa_init_size(void)
> {
>   /*
>* AFIXME: What should we choose as an initial size? Should we make this
>* configurable? Maybe tune based on NBuffers?

> StatsShmemInit(void)
>* AFIXME: we need to guarantee this can be allocated in plain 
> shared
>* memory, rather than allocating dsm segments.

I'm not sure that NBuffers is the ideal base for deciding the required
size since it doesn't seem to be generally in proportion with the
number of database objects.  If we made it manually tunable, we would
be able to emit a log message when DSM segment allocation happens for
this use, as a tuning aid:

   WARNING: dsa allocation happened for activity statistics
   HINT: You might want to increase stat_dsa_initial_size if you see slow
 down blah..


> * AFIXME: Should all the stats drop code be moved into pgstat_drop.c?

Or pgstat_xact.c?


>  * AFIXME: comment
>  * AFIXME: see notes about race conditions for functions in
>  * pgstat_drop_function().
>  */
> void
> pgstat_schedule_stat_drop(PgStatKind kind, Oid dboid, Oid objoid)


pgstat_drop_function() doesn't seem to have such a note.

I suppose the "race condition" means the case a stats entry for an
object is created just after the same object is dropped on another
backend.  It seems to me such a race condition is eliminated by the
transactional drop mechanism.  Are you intending to write an
explanation of that?


>   /*
>* pgStatSharedRefAge increments quite slowly than the time the 
> following
>* loop takes so this is expected to iterate no more than twice.
>*
>* AFIXME: Why is this a good place to do this?
>*/
>   while (pgstat_shared_refs_need_gc())
>   pgstat_shared_refs_gc();

Is the reason for the AFIXME that you think the GC check happens too
frequently?


> pgstat_shared_ref_release(PgStatHashKey key, PgStatSharedRef *shared_ref)
> {
...
> * AFIXME: this probably is racy. Another backend could look up the
> * stat, bump the refcount, as we free it.
>if (pg_atomic_fetch_sub_u32(&shared_ref->shared_entry->refcount, 1) == 1)
>{
...
>/* only dropped entries can reach a 0 refcount */
>Assert(shared_ref->shared_entry->dropped);

I didn't examine this deeply, but is that race condition avoidable by
preventing pgstat_shared_ref_get from incrementing the refcount of
dropped entries?



>  * AFIXME: This needs to be deduplicated with pgstat_shared_ref_release(). But
>  * it's not entirely trivial, because we can't use plain dshash_delete_entry()
>  * (but have to use dshash_delete_current()).
>  */
> static bool
> pgstat_drop_stats_entry(dshash_seq_status *hstat)
...
>* AFIXME: don't do this while holding the dshash lock.

Do the AFIXMEs mean that we should move the call to
pgstat_shared_ref_release() out of the dshash loop (in
pgstat_drop_database_and_contents) that calls this function?  Would it
be sensible to store the (key, ref) pairs for the to-be-released
shared_refs, then clean them up after exiting the loop?


>* Database stats contain other stats. Drop those as well when
>* dropping the 

Re: shared-memory based stats collector - v66

2022-03-24 Thread Kyotaro Horiguchi
At Thu, 24 Mar 2022 13:21:33 -0400, Melanie Plageman 
 wrote in 
> On Thu, Mar 17, 2022 at 3:36 AM Andres Freund  wrote:
> >
> > The biggest todos are:
> > - Address all the remaining AFIXMEs and XXXs
> 
> Attached is a patch that addresses three of the existing AFIXMEs.

Thanks!

+   .reset_timestamp_cb = pgstat_shared_reset_timestamp_noop,

(I once misunderstood "shared" to mean the shared memory area..)

The reset function is type-specific and it must be set.  So don't we
provide all to-be-required reset functions?


+   if (pgstat_shared_ref_get(kind, dboid, objoid, false, NULL))
+   {
+   Oid msg_oid = (kind == PGSTAT_KIND_DB) ? dboid : objoid;

Explicitly using PGSTAT_KIND_DB here is a bit of an annoyance.  Since we
always pass InvalidOid correctly as the parameters, and objoid alone
is not specific enough, shall we warn using both dboid and objoid,
without any special treatment?

Concretely, I propose to do the following instead.

+   if (pgstat_shared_ref_get(kind, dboid, objoid, false, NULL))
+   {
+   ereport(WARNING,
+   errmsg("resetting existing stats for type %s, db=%d, oid=%d",
+  pgstat_kind_info_for(kind)->name, dboid, objoid));




+pgstat_pending_delete(PgStatSharedRef *shared_ref)
+{
+   void   *pending_data = shared_ref->pending;
+   PgStatKind kind = shared_ref->shared_entry->key.kind;
+
+   Assert(pending_data != NULL);
+   Assert(!pgstat_kind_info_for(kind)->fixed_amount);
+
+   /* PGSTAT_KIND_TABLE has its own callback */
+   Assert(kind != PGSTAT_KIND_TABLE);
+

"kind" is used only in assertion, which requires PG_USED_FOR_ASSERTS_ONLY.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



