Re: Streaming I/O, vectored I/O (WIP)

2024-05-24 Thread Thomas Munro
On Wed, Nov 29, 2023 at 1:17 AM Thomas Munro  wrote:
> Done.  I like it, I just feel a bit bad about moving the p*v()
> replacement functions around a couple of times already!  I figured it
> might as well be static inline even if we use the fallback (= Solaris
> and Windows).

Just for the record, since I'd said things like the above a few times
while writing about this stuff: Solaris 11.4.69 has gained preadv()
and pwritev().  That's interesting because it means that there will
soon be no live Unixoid operating systems left without them, and the
fallback code in src/include/port/pg_iovec.h will, in practice, be
only for Windows.  I wondered if that might have implications for how
we code or comment stuff like that, but it still seems to make sense
as we have it.

(I don't think Windows can have a real synchronous implementation; the
kernel knows how to do scatter/gather, a feature implemented
specifically for databases, but only in asynchronous ("overlapped") +
direct I/O mode, a difference I don't know how to hide at this level.
In later AIO work we should be able to use it as intended, but not by
pretending to be Unix like this.)
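
For anyone wondering, the fallback is essentially just a loop of plain
pread()/pwrite() calls.  A sketch of the idea (not the exact
pg_iovec.h code; the hypothetical name avoids clashing with the real
one):

	#include <sys/types.h>
	#include <sys/uio.h>
	#include <unistd.h>

	static inline ssize_t
	preadv_fallback(int fd, const struct iovec *iov, int iovcnt, off_t offset)
	{
		ssize_t		sum = 0;

		for (int i = 0; i < iovcnt; ++i)
		{
			ssize_t		part = pread(fd, iov[i].iov_base, iov[i].iov_len,
									 offset + sum);

			if (part < 0)
				return sum > 0 ? sum : -1;	/* report what we got, or error */
			sum += part;
			if ((size_t) part < iov[i].iov_len)
				break;			/* short read: stop early */
		}
		return sum;
	}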




Re: Streaming I/O, vectored I/O (WIP)

2024-04-30 Thread Thomas Munro
On Wed, May 1, 2024 at 2:51 PM David Rowley  wrote:
> On Wed, 24 Apr 2024 at 14:32, David Rowley  wrote:
> > I've attached a patch with a few typo fixes and what looks like an
> > incorrect type for max_ios. It's an int16 and I think it needs to be
> > an int. Doing "max_ios = Min(max_ios, PG_INT16_MAX);" doesn't do
> > anything when max_ios is int16.
>
> No feedback, so I'll just push this in a few hours unless anyone has anything.

Patch looks correct, thanks.  Please do.  (Sorry, running a bit behind
on email ATM...  I also have a few more typos around here from an
off-list email from Mr Lakhin, will get to that soon...)




Re: Streaming I/O, vectored I/O (WIP)

2024-04-30 Thread David Rowley
On Wed, 24 Apr 2024 at 14:32, David Rowley  wrote:
> I've attached a patch with a few typo fixes and what looks like an
> incorrect type for max_ios. It's an int16 and I think it needs to be
> an int. Doing "max_ios = Min(max_ios, PG_INT16_MAX);" doesn't do
> anything when max_ios is int16.

No feedback, so I'll just push this in a few hours unless anyone has anything.

David




Re: Streaming I/O, vectored I/O (WIP)

2024-04-23 Thread David Rowley
I've attached a patch with a few typo fixes and what looks like an
incorrect type for max_ios. It's an int16 and I think it needs to be
an int. Doing "max_ios = Min(max_ios, PG_INT16_MAX);" doesn't do
anything when max_ios is int16.
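
To spell out why that line is a no-op (a standalone illustration, not
PostgreSQL code): the damage is already done by the narrowing
assignment, before the clamp ever runs.

	#include <stdint.h>
	#include <stdio.h>

	#define Min(x, y) ((x) < (y) ? (x) : (y))

	int
	main(void)
	{
		int			wide = 40000;	/* e.g. a large GUC-derived value */
		int16_t		narrow = wide;	/* implementation-defined: typically wraps */

		/* narrow can never exceed INT16_MAX, so this clamp cannot fire */
		narrow = Min(narrow, INT16_MAX);
		printf("%d\n", narrow);		/* typically prints -25536, not 32767 */
		return 0;
	}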

David
diff --git a/src/backend/storage/aio/read_stream.c 
b/src/backend/storage/aio/read_stream.c
index 634cf4f0d1..74b9bae631 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -26,12 +26,12 @@
  *
  * B) I/O is necessary, but fadvise is undesirable because the access is
  * sequential, or impossible because direct I/O is enabled or the system
- * doesn't support advice.  There is no benefit in looking ahead more than
- * io_combine_limit, because in this case only goal is larger read system
+ * doesn't support fadvise.  There is no benefit in looking ahead more than
+ * io_combine_limit, because in this case the only goal is larger read system
  * calls.  Looking further ahead would pin many buffers and perform
  * speculative work looking ahead for no benefit.
  *
- * C) I/O is necesssary, it appears random, and this system supports fadvise.
+ * C) I/O is necessary, it appears random, and this system supports fadvise.
  * We'll look further ahead in order to reach the configured level of I/O
  * concurrency.
  *
@@ -418,7 +418,7 @@ read_stream_begin_relation(int flags,
ReadStream *stream;
size_t  size;
int16   queue_size;
-   int16   max_ios;
+   int max_ios;
int strategy_pin_limit;
uint32  max_pinned_buffers;
Oid tablespace_id;
@@ -447,6 +447,8 @@ read_stream_begin_relation(int flags,
max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
else
max_ios = get_tablespace_io_concurrency(tablespace_id);
+
+   /* Cap to INT16_MAX to avoid overflowing below */
max_ios = Min(max_ios, PG_INT16_MAX);
 
/*


Re: Streaming I/O, vectored I/O (WIP)

2024-04-07 Thread Nazir Bilal Yavuz
Hi,

On Mon, 8 Apr 2024 at 00:01, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Sun, 7 Apr 2024 at 20:33, Nazir Bilal Yavuz  wrote:
> >
> > Hi,
> >
> > On Tue, 2 Apr 2024 at 11:40, Thomas Munro  wrote:
> > >
> > > I had been planning to commit v14 this morning but got cold feet with
> > > the BMR-based interface.  Heikki didn't like it much, and in the end,
> > > neither did I.  I have now removed it, and it seems much better.  No
> > > other significant changes, just parameter types and inlining details.
> > > For example:
> > >
> > >  * read_stream_begin_relation() now takes a Relation, like its name says
> > >  * StartReadBuffers()'s operation takes smgr and optional rel
> > >  * ReadBuffer_common() takes smgr and optional rel
> >
> > Read stream objects can be created only using Relations now. There
> > could be read stream users that have an SMgrRelation but no
> > Relation. So, I created another constructor for read streams that
> > uses an SMgrRelation instead of a Relation. The related patch is
> > attached.
>
> After sending this, I realized that I forgot to add a persistence value
> to the new constructor. While working on it I also realized that the
> current code sets persistence in the PinBufferForBlock() function, and
> this function is called for each block, which can be costly. So, I
> moved the setting of persistence out of the PinBufferForBlock() function.
>
> Setting persistence outside of the PinBufferForBlock() function (0001)
> and creating the new constructor that uses SMgrRelations (0002) are
> attached.

Melanie noticed there was a 'sgmr -> smgr' typo in 0002. Fixed in attached.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 04fd860ce8c4c57830930cb362799fd155c92613 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Sun, 7 Apr 2024 22:33:36 +0300
Subject: [PATCH 1/2] Refactor PinBufferForBlock() to remove if checks about
 persistence

There are if checks in the PinBufferForBlock() function to set the
persistence of the relation, and this function is called for each block
in the relation. Instead, set the persistence of the relation before
calling PinBufferForBlock().
---
 src/backend/storage/aio/read_stream.c |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 31 +++
 2 files changed, 14 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..d155dde5ce3 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
 	{
 		stream->ios[i].op.rel = rel;
 		stream->ios[i].op.smgr = RelationGetSmgr(rel);
-		stream->ios[i].op.smgr_persistence = 0;
+		stream->ios[i].op.smgr_persistence = rel->rd_rel->relpersistence;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
 	}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 06e9ffd2b00..b4fcefed78a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1067,24 +1067,10 @@ PinBufferForBlock(Relation rel,
 	BufferDesc *bufHdr;
 	IOContext	io_context;
 	IOObject	io_object;
-	char		persistence;
 
 	Assert(blockNum != P_NEW);
 
-	/*
-	 * If there is no Relation it usually implies recovery and thus permanent,
-	 * but we take an argmument because CreateAndCopyRelationData can reach us
-	 * with only an SMgrRelation for an unlogged relation that we don't want
-	 * to flag with BM_PERMANENT.
-	 */
-	if (rel)
-		persistence = rel->rd_rel->relpersistence;
-	else if (smgr_persistence == 0)
-		persistence = RELPERSISTENCE_PERMANENT;
-	else
-		persistence = smgr_persistence;
-
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
@@ -1101,7 +1087,7 @@ PinBufferForBlock(Relation rel,
 	   smgr->smgr_rlocator.locator.relNumber,
 	   smgr->smgr_rlocator.backend);
 
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
 		if (*foundPtr)
@@ -1109,7 +1095,7 @@ PinBufferForBlock(Relation rel,
 	}
 	else
 	{
-		bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
+		bufHdr = BufferAlloc(smgr, smgr_persistence, forkNum, blockNum,
 			 strategy, foundPtr, io_context);
 		if (*foundPtr)
 			pgBufferUsage.shared_blks_hit++;
@@ -1157,6 +1143,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 	ReadBuffersOperation operation;
 	Buffer		buffer;
 	int			flags;
+	char		persistence;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1195,7 +1182,15 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 		flags = 0;
 	operation.smgr = smgr;
 	operation.rel = rel;
-	operation.smgr_persistence = smgr_persistence;
+
+	if (rel)
+		persistence = rel->rd_rel->relpersistence;

Re: Streaming I/O, vectored I/O (WIP)

2024-04-07 Thread Nazir Bilal Yavuz
Hi,

On Sun, 7 Apr 2024 at 20:33, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Tue, 2 Apr 2024 at 11:40, Thomas Munro  wrote:
> >
> > I had been planning to commit v14 this morning but got cold feet with
> > the BMR-based interface.  Heikki didn't like it much, and in the end,
> > neither did I.  I have now removed it, and it seems much better.  No
> > other significant changes, just parameter types and inlining details.
> > For example:
> >
> > >  * read_stream_begin_relation() now takes a Relation, like its name says
> >  * StartReadBuffers()'s operation takes smgr and optional rel
> >  * ReadBuffer_common() takes smgr and optional rel
>
> Read stream objects can be created only using Relations now. There
> could be read stream users that have an SMgrRelation but no
> Relation. So, I created another constructor for read streams that
> uses an SMgrRelation instead of a Relation. The related patch is
> attached.

After sending this, I realized that I forgot to add a persistence value
to the new constructor. While working on it I also realized that the
current code sets persistence in the PinBufferForBlock() function, and
this function is called for each block, which can be costly. So, I
moved the setting of persistence out of the PinBufferForBlock() function.

Setting persistence outside of the PinBufferForBlock() function (0001)
and creating the new constructor that uses SMgrRelations (0002) are
attached.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 04fd860ce8c4c57830930cb362799fd155c92613 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Sun, 7 Apr 2024 22:33:36 +0300
Subject: [PATCH 1/2] Refactor PinBufferForBlock() to remove if checks about
 persistence

There are if checks in the PinBufferForBlock() function to set the
persistence of the relation, and this function is called for each block
in the relation. Instead, set the persistence of the relation before
calling PinBufferForBlock().
---
 src/backend/storage/aio/read_stream.c |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 31 +++
 2 files changed, 14 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..d155dde5ce3 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
 	{
 		stream->ios[i].op.rel = rel;
 		stream->ios[i].op.smgr = RelationGetSmgr(rel);
-		stream->ios[i].op.smgr_persistence = 0;
+		stream->ios[i].op.smgr_persistence = rel->rd_rel->relpersistence;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
 	}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 06e9ffd2b00..b4fcefed78a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1067,24 +1067,10 @@ PinBufferForBlock(Relation rel,
 	BufferDesc *bufHdr;
 	IOContext	io_context;
 	IOObject	io_object;
-	char		persistence;
 
 	Assert(blockNum != P_NEW);
 
-	/*
-	 * If there is no Relation it usually implies recovery and thus permanent,
-	 * but we take an argmument because CreateAndCopyRelationData can reach us
-	 * with only an SMgrRelation for an unlogged relation that we don't want
-	 * to flag with BM_PERMANENT.
-	 */
-	if (rel)
-		persistence = rel->rd_rel->relpersistence;
-	else if (smgr_persistence == 0)
-		persistence = RELPERSISTENCE_PERMANENT;
-	else
-		persistence = smgr_persistence;
-
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
@@ -1101,7 +1087,7 @@ PinBufferForBlock(Relation rel,
 	   smgr->smgr_rlocator.locator.relNumber,
 	   smgr->smgr_rlocator.backend);
 
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
 		if (*foundPtr)
@@ -1109,7 +1095,7 @@ PinBufferForBlock(Relation rel,
 	}
 	else
 	{
-		bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
+		bufHdr = BufferAlloc(smgr, smgr_persistence, forkNum, blockNum,
 			 strategy, foundPtr, io_context);
 		if (*foundPtr)
 			pgBufferUsage.shared_blks_hit++;
@@ -1157,6 +1143,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 	ReadBuffersOperation operation;
 	Buffer		buffer;
 	int			flags;
+	char		persistence;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1195,7 +1182,15 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 		flags = 0;
 	operation.smgr = smgr;
 	operation.rel = rel;
-	operation.smgr_persistence = smgr_persistence;
+
+	if (rel)
+		persistence = rel->rd_rel->relpersistence;
+	else if (smgr_persistence == 0)
+		persistence = RELPERSISTENCE_PERMANENT;
+	else
+		persistence = smgr_persistence;
+	operation.smgr_persistence = persistence;
+
	operation.forknum = forkNum;

Re: Streaming I/O, vectored I/O (WIP)

2024-04-07 Thread Melanie Plageman
On Sun, Apr 7, 2024 at 1:33 PM Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Tue, 2 Apr 2024 at 11:40, Thomas Munro  wrote:
> >
> > I had been planning to commit v14 this morning but got cold feet with
> > the BMR-based interface.  Heikki didn't like it much, and in the end,
> > neither did I.  I have now removed it, and it seems much better.  No
> > other significant changes, just parameter types and inlining details.
> > For example:
> >
> > >  * read_stream_begin_relation() now takes a Relation, like its name says
> >  * StartReadBuffers()'s operation takes smgr and optional rel
> >  * ReadBuffer_common() takes smgr and optional rel
>
> Read stream objects can be created only using Relations now. There
> could be read stream users that have an SMgrRelation but no
> Relation. So, I created another constructor for read streams that
> uses an SMgrRelation instead of a Relation. The related patch is
> attached.

This patch LGTM

- Melanie




Re: Streaming I/O, vectored I/O (WIP)

2024-04-07 Thread Nazir Bilal Yavuz
Hi,

On Tue, 2 Apr 2024 at 11:40, Thomas Munro  wrote:
>
> I had been planning to commit v14 this morning but got cold feet with
> the BMR-based interface.  Heikki didn't like it much, and in the end,
> neither did I.  I have now removed it, and it seems much better.  No
> other significant changes, just parameter types and inlining details.
> For example:
>
>  * read_stream_begin_relation() now takes a Relation, like its name says
>  * StartReadBuffers()'s operation takes smgr and optional rel
>  * ReadBuffer_common() takes smgr and optional rel

Read stream objects can be created only using Relations now. There
could be read stream users that have an SMgrRelation but no
Relation. So, I created another constructor for read streams that
uses an SMgrRelation instead of a Relation. The related patch is
attached.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 38b57ec7e54a54a0c7b0117a0ecaaf68c643e1b0 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Sun, 7 Apr 2024 20:16:00 +0300
Subject: [PATCH] Add a way to create read stream object by using SMgrRelation

Currently a read stream object can be created only by using a Relation;
there is no way to create one by using an SMgrRelation.

To achieve that, a read_stream_begin_impl() function is created and the
contents of read_stream_begin_relation() are moved into this function.

Both read_stream_begin_relation() and read_stream_begin_sgmr_relation()
call read_stream_begin_impl() with a Relation and an SMgrRelation,
respectively.
---
 src/include/storage/read_stream.h |  8 +++
 src/backend/storage/aio/read_stream.c | 74 ++-
 2 files changed, 69 insertions(+), 13 deletions(-)

diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index fae09d2b4cc..601a7fcf92b 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -15,6 +15,7 @@
 #define READ_STREAM_H
 
 #include "storage/bufmgr.h"
+#include "storage/smgr.h"
 
 /* Default tuning, reasonable for many users. */
 #define READ_STREAM_DEFAULT 0x00
@@ -56,6 +57,13 @@ extern ReadStream *read_stream_begin_relation(int flags,
 			  ReadStreamBlockNumberCB callback,
 			  void *callback_private_data,
 			  size_t per_buffer_data_size);
+extern ReadStream *read_stream_begin_sgmr_relation(int flags,
+   BufferAccessStrategy strategy,
+   SMgrRelation smgr,
+   ForkNumber forknum,
+   ReadStreamBlockNumberCB callback,
+   void *callback_private_data,
+   size_t per_buffer_data_size);
 extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_private);
 extern void read_stream_reset(ReadStream *stream);
 extern void read_stream_end(ReadStream *stream);
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..a9a3b0de6c9 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -406,14 +406,15 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
  * write extra data for each block into the space provided to it.  It will
  * also receive callback_private_data for its own purposes.
  */
-ReadStream *
-read_stream_begin_relation(int flags,
-		   BufferAccessStrategy strategy,
-		   Relation rel,
-		   ForkNumber forknum,
-		   ReadStreamBlockNumberCB callback,
-		   void *callback_private_data,
-		   size_t per_buffer_data_size)
+static ReadStream *
+read_stream_begin_impl(int flags,
+	   BufferAccessStrategy strategy,
+	   Relation rel,
+	   SMgrRelation smgr,
+	   ForkNumber forknum,
+	   ReadStreamBlockNumberCB callback,
+	   void *callback_private_data,
+	   size_t per_buffer_data_size)
 {
 	ReadStream *stream;
 	size_t		size;
@@ -422,9 +423,6 @@ read_stream_begin_relation(int flags,
 	int			strategy_pin_limit;
 	uint32		max_pinned_buffers;
 	Oid			tablespace_id;
-	SMgrRelation smgr;
-
-	smgr = RelationGetSmgr(rel);
 
 	/*
 	 * Decide how many I/Os we will allow to run at the same time.  That
@@ -434,7 +432,7 @@ read_stream_begin_relation(int flags,
 	 */
 	tablespace_id = smgr->smgr_rlocator.locator.spcOid;
 	if (!OidIsValid(MyDatabaseId) ||
-		IsCatalogRelation(rel) ||
+		(rel && IsCatalogRelation(rel)) ||
 		IsCatalogRelationOid(smgr->smgr_rlocator.locator.relNumber))
 	{
 		/*
@@ -548,7 +546,7 @@ read_stream_begin_relation(int flags,
 	for (int i = 0; i < max_ios; ++i)
 	{
 		stream->ios[i].op.rel = rel;
-		stream->ios[i].op.smgr = RelationGetSmgr(rel);
+		stream->ios[i].op.smgr = smgr;
 		stream->ios[i].op.smgr_persistence = 0;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
@@ -557,6 +555,56 @@ read_stream_begin_relation(int flags,
 	return stream;
 }
 
+/*
+ * Create a new read stream for reading a relation.
+ * See read_stream_begin_impl() for the detailed explanation.
+ */
+ReadStream *
+read_stream_begin_relation(int flags,
+		   

Re: Streaming I/O, vectored I/O (WIP)

2024-04-03 Thread Melanie Plageman
On Tue, Apr 2, 2024 at 8:32 PM Thomas Munro  wrote:
>
> Here are the remaining patches discussed in this thread.  They give
> tablespace-specific io_combine_limit, effective_io_readahead_window
> (is this useful?), and up-to-1MB io_combine_limit (is this useful?).
> I think the first two would probably require teaching reloption.c how
> to use guc.c's parse_int() and unit flags, but I won't have time to
> look at that for this release so I'll just leave these here.
>
> On the subject of guc.c, this is a terrible error message... did I do
> something wrong?
>
> postgres=# set io_combine_limit = '42MB';
> ERROR:  5376 8kB is outside the valid range for parameter
> "io_combine_limit" (1 .. 32)

Well, GUC_UNIT_BLOCKS interpolates the block limit into the error
message string (get_config_unit_name()). But, I can't imagine this
error message is clear for any of the GUCs using GUC_UNIT_BLOCKS. I
would think some combination of the two would be helpful, like "43008
kB (5376 blocks) is outside of the valid range for parameter". The
user can check what their block size is. I don't think we need to
interpolate and print the block size in the error message.

On another note, since io_combine_limit, when specified in size,
rounds up to the nearest multiple of blocksize, it might be worth
mentioning this in the io_combine_limit docs at some point. I checked
docs for another GUC_UNIT_BLOCKS guc, backend_flush_after, and it
alludes to this.
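
For example, with the default 8kB block size, something like

postgres=# SET io_combine_limit = '20kB';

can't map to a whole number of blocks (20kB is 2.5 blocks), so the
effective value ends up rounded to a block multiple -- exactly the
kind of surprise the docs could call out.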

- Melanie




Re: Streaming I/O, vectored I/O (WIP)

2024-04-02 Thread Thomas Munro
On Tue, Apr 2, 2024 at 9:39 PM Thomas Munro  wrote:
> So this is the version I'm going to commit shortly, barring objections.

And done, after fixing a small snafu with smgr-only reads coming from
CreateAndCopyRelationData() (BM_PERMANENT would be
incorrectly/unnecessarily set for unlogged tables).

Here are the remaining patches discussed in this thread.  They give
tablespace-specific io_combine_limit, effective_io_readahead_window
(is this useful?), and up-to-1MB io_combine_limit (is this useful?).
I think the first two would probably require teaching reloption.c how
to use guc.c's parse_int() and unit flags, but I won't have time to
look at that for this release so I'll just leave these here.

On the subject of guc.c, this is a terrible error message... did I do
something wrong?

postgres=# set io_combine_limit = '42MB';
ERROR:  5376 8kB is outside the valid range for parameter
"io_combine_limit" (1 .. 32)
From 84b8280481312cdd1efcb7efa1182d4647cbe00a Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 30 Mar 2024 19:09:44 +1300
Subject: [PATCH v16 1/4] ALTER TABLESPACE ... SET (io_combine_limit = ...).

This is the per-tablespace version of the GUC of the same name.

XXX reloptions.c lacks the ability to accept units eg '64kB'!  Which is
why I haven't included it with the main feature commit.

Suggested-by: Tomas Vondra 
Discussion: https://postgr.es/m/f603ac51-a7ff-496a-99c1-76673635692e%40enterprisedb.com
---
 doc/src/sgml/ref/alter_tablespace.sgml |  9 +++---
 src/backend/access/common/reloptions.c | 12 +++-
 src/backend/storage/aio/read_stream.c  | 39 --
 src/backend/utils/cache/spccache.c | 14 +
 src/bin/psql/tab-complete.c|  3 +-
 src/include/commands/tablespace.h  |  1 +
 src/include/utils/spccache.h   |  1 +
 7 files changed, 58 insertions(+), 21 deletions(-)

diff --git a/doc/src/sgml/ref/alter_tablespace.sgml b/doc/src/sgml/ref/alter_tablespace.sgml
index 6ec863400d1..faf0c6e7fbc 100644
--- a/doc/src/sgml/ref/alter_tablespace.sgml
+++ b/doc/src/sgml/ref/alter_tablespace.sgml
@@ -84,16 +84,17 @@ ALTER TABLESPACE name RESET ( ,
   ,
   ,
-  ).  This may be useful if
+  ),
+  ).  This may be useful if
   one tablespace is located on a disk which is faster or slower than the
   remainder of the I/O subsystem.
  
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d6eb5d85599..1e1c611fab2 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -371,6 +371,15 @@ static relopt_int intRelOpts[] =
 		0, 0, 0
 #endif
 	},
+	{
+		{
+			"io_combine_limit",
+			"Limit on the size of data reads and writes.",
+			RELOPT_KIND_TABLESPACE,
+			ShareUpdateExclusiveLock
+		},
+		-1, 1, MAX_IO_COMBINE_LIMIT
+	},
 	{
 		{
 			"parallel_workers",
@@ -2089,7 +2098,8 @@ tablespace_reloptions(Datum reloptions, bool validate)
 		{"random_page_cost", RELOPT_TYPE_REAL, offsetof(TableSpaceOpts, random_page_cost)},
 		{"seq_page_cost", RELOPT_TYPE_REAL, offsetof(TableSpaceOpts, seq_page_cost)},
 		{"effective_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, effective_io_concurrency)},
-		{"maintenance_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, maintenance_io_concurrency)}
+		{"maintenance_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, maintenance_io_concurrency)},
+		{"io_combine_limit", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, io_combine_limit)},
 	};
 
 	return (bytea *) build_reloptions(reloptions, validate,
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 4f21262ff5e..907c80e6bf9 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -114,6 +114,7 @@ struct ReadStream
 	int16		max_pinned_buffers;
 	int16		pinned_buffers;
 	int16		distance;
+	int16		io_combine_limit;
 	bool		advice_enabled;
 
 	/*
@@ -241,7 +242,7 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 
 	/* This should only be called with a pending read. */
 	Assert(stream->pending_read_nblocks > 0);
-	Assert(stream->pending_read_nblocks <= io_combine_limit);
+	Assert(stream->pending_read_nblocks <= stream->io_combine_limit);
 
 	/* We had better not exceed the pin limit by starting this read. */
 	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
@@ -329,7 +330,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 		int16		buffer_index;
 		void	   *per_buffer_data;
 
-		if (stream->pending_read_nblocks == io_combine_limit)
+		if (stream->pending_read_nblocks == stream->io_combine_limit)
 		{
 			read_stream_start_pending_read(stream, suppress_advice);
 			suppress_advice = false;
@@ -389,7 +390,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 	 * signaled end-of-stream, we start the read immediately.
 	 */
 	if (stream->pending_read_nblocks > 0 &&
-		

Re: Streaming I/O, vectored I/O (WIP)

2024-04-02 Thread Thomas Munro
I had been planning to commit v14 this morning but got cold feet with
the BMR-based interface.  Heikki didn't like it much, and in the end,
neither did I.  I have now removed it, and it seems much better.  No
other significant changes, just parameter types and inlining details.
For example:

 * read_stream_begin_relation() now takes a Relation, like its name says
 * StartReadBuffers()'s operation takes smgr and optional rel
 * ReadBuffer_common() takes smgr and optional rel
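
To illustrate how the two steps fit together now, a sketch of a caller
(rel and blocknum come from the caller's context; treat the exact
signature as approximate):

	ReadBuffersOperation operation;
	Buffer		buffers[8];
	int			nblocks = 8;

	/* the parts that don't change between reads */
	operation.rel = rel;
	operation.smgr = RelationGetSmgr(rel);
	operation.smgr_persistence = 0;
	operation.forknum = MAIN_FORKNUM;
	operation.strategy = NULL;

	/* returns true if I/O was started and must be waited for */
	if (StartReadBuffers(&operation, buffers, blocknum, &nblocks, 0))
		WaitReadBuffers(&operation);
	/* on return, nblocks says how many contiguous blocks were handled */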

ReadBuffer() (which calls ReadBuffer_common() which calls
StartReadBuffer() as before) now shows no regression in a tight loop
over ~1 million already-in-cache pages (something Heikki had observed
before and could only completely fix with a change that affected all
callers).  The same test using read_stream.c is still slightly slower,
~1 million already-in-cache pages, 301ms -> 308ms, which seems
acceptable to me and could perhaps be chased down with more study of
inlining/specialisation.  As mentioned before, it doesn't seem to be
measurable once you actually do something with the pages.

In some ways BMR was better than the "fake RelationData" concept
(another attempt at wrestling with the relation vs storage duality,
that is, the online vs recovery duality).  But in other ways it was
worse: a weird inconsistent mixture of pass-by-pointer and
pass-by-value interfaces that required several code paths to handle it
being only partially initialised, which turned out to be wasted cycles
implicated in regressions, despite which it is not even very nice to
use anyway.  I'm sure it could be made to work better, but I'm not yet
sure it's really needed.  In later work for recovery I will need to
add a separate constructor read_stream_begin_smgr_something() anyway
for other reasons (multi-relation streaming, different callback) and
perhaps also a separate StartReadBuffersSmgr() if it saves measurable
cycles to strip out branches.  Maybe it was all just premature
pessimisation.

So this is the version I'm going to commit shortly, barring objections.
From 4c3ec42aabaaf0b54f8c4393bef3411fed3a054f Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 2 Apr 2024 14:40:40 +1300
Subject: [PATCH v15 1/4] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two main advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the
kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.

A new GUC io_combine_limit is defined, and the functions for limiting
per-backend pin counts are now made into public APIs.  Those are
provided for the benefit of callers of StartReadBuffers(), who should
respect them when deciding how many buffers to read at once.  A later
commit will add a higher level mechanism for doing that automatically
with a more practical interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of just issuing
advice and leaving WaitReadBuffers() to do the work synchronously.

Author: Thomas Munro 
Author: Andres Freund  (some optimization tweaks)
Reviewed-by: Melanie Plageman 
Reviewed-by: Heikki Linnakangas 
Reviewed-by: Nazir Bilal Yavuz 
Reviewed-by: Dilip Kumar 
Reviewed-by: Andres Freund 
Tested-by: Tomas Vondra 
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6ut5tum2g...@mail.gmail.com
---
 doc/src/sgml/config.sgml  |  14 +
 src/backend/storage/buffer/bufmgr.c   | 720 --
 src/backend/storage/buffer/localbuf.c |  14 +-
 src/backend/utils/misc/guc_tables.c   |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/storage/bufmgr.h  |  50 ++
 src/tools/pgindent/typedefs.list  |   2 +
 7 files changed, 592 insertions(+), 223 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0e9617bcff4..624518e0b01 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2708,6 +2708,20 @@ include_dir 'conf.d'

   
 
+  
+   io_combine_limit (integer)
+   
+io_combine_limit configuration parameter
+   
+   
+   
+
+ Controls the largest I/O size in operations that combine I/O.
+ The default is 128kB.
+
+   
+  
+
   
max_worker_processes (integer)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c5..944ee271ba4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,10 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffer() -- as above, with separate wait step
+ * StartReadBuffers() -- multiple block version
+ * WaitReadBuffers() 

Re: Streaming I/O, vectored I/O (WIP)

2024-03-29 Thread Heikki Linnakangas

On 29/03/2024 09:01, Thomas Munro wrote:
> On Fri, Mar 29, 2024 at 9:45 AM Heikki Linnakangas  wrote:
>> master (213c959a29):  8.0 s
>> streaming-api v13:    9.5 s
>
> Hmm, that's not great, and I think I know one factor that has
> confounded my investigation and the conflicting reports I have
> received from a couple of people: some are using meson, which
> defaults to -O3, and others are using make, which gives you -O2 by
> default, but at -O2, GCC doesn't inline that StartReadBuffer
> specialisation that is used in the "fast path", and possibly more.
> Some of that gap is closed by using pg_attribute_always_inline.
> Clang fails to inline at any level.  So I should probably use the
> "always" macro there because that is the intention.  Still processing
> the rest of your email...

Ah yeah, I also noticed that the inlining didn't happen with some
compilers and flags. I use a mix of gcc and clang and meson and
autoconf in my local environment.

The above micro-benchmarks were with meson and gcc -O3. GCC version:

$ gcc --version
gcc (Debian 12.2.0-14) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Streaming I/O, vectored I/O (WIP)

2024-03-29 Thread Thomas Munro
On Fri, Mar 29, 2024 at 9:45 AM Heikki Linnakangas  wrote:
> master (213c959a29):  8.0 s
> streaming-api v13:    9.5 s

Hmm, that's not great, and I think I know one factor that has
confounded my investigation and the conflicting reports I have
received from a couple of people: some are using meson, which
defaults to -O3, and others are using make, which gives you -O2 by
default, but at -O2, GCC doesn't inline that StartReadBuffer
specialisation that is used in the "fast path", and possibly more.
Some of that gap is closed by using pg_attribute_always_inline.
Clang fails to inline at any level.  So I should probably use the
"always" macro there because that is the intention.  Still processing
the rest of your email...
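
(For reference, the macro boils down to roughly this on GCC/Clang; see
c.h for the real definition:

	#if defined(__GNUC__) || defined(__clang__)
	#define pg_attribute_always_inline __attribute__((always_inline)) inline
	#elif defined(_MSC_VER)
	#define pg_attribute_always_inline __forceinline
	#else
	#define pg_attribute_always_inline inline
	#endif

so it overrides the cost model at -O2, where plain "inline" is only a
hint.)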




Re: Streaming I/O, vectored I/O (WIP)

2024-03-28 Thread Thomas Munro
On Fri, Mar 29, 2024 at 12:06 AM Thomas Munro  wrote:
> Small bug fix: the condition in the final test at the end of
> read_stream_look_ahead() wasn't quite right.  In general when looking
> ahead, we don't need to start a read just because the pending read
> would bring us up to stream->distance if submitted now (we'd prefer to
> build it all the way up to size io_combine_limit if we can), but if
> that condition is met AND we have nothing pinned yet, then there is no
> chance for the read to grow bigger by a pinned buffer being consumed.
> Fixed, comment updated.

Oops, I sent the wrong/unfixed version.  This version has the fix
described above.


v13-0001-Provide-vectored-variant-of-ReadBuffer.patch
Description: Binary data


v13-0002-Provide-API-for-streaming-relation-data.patch
Description: Binary data


v13-0003-Use-streaming-I-O-in-pg_prewarm.patch
Description: Binary data


Re: Streaming I/O, vectored I/O (WIP)

2024-03-28 Thread Thomas Munro
Small bug fix: the condition in the final test at the end of
read_stream_look_ahead() wasn't quite right.  In general when looking
ahead, we don't need to start a read just because the pending read
would bring us up to stream->distance if submitted now (we'd prefer to
build it all the way up to size io_combine_limit if we can), but if
that condition is met AND we have nothing pinned yet, then there is no
chance for the read to grow bigger by a pinned buffer being consumed.
Fixed, comment updated.
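
In code form, the corrected condition reads roughly like this (a
sketch of the idea, not the exact committed test):

	/*
	 * Start the pending read if it has reached the full look-ahead
	 * distance and nothing is pinned yet (so it cannot grow further by
	 * a pinned buffer being consumed), or if the callback has signaled
	 * end-of-stream.
	 */
	if (stream->pending_read_nblocks > 0 &&
		((stream->pending_read_nblocks == stream->distance &&
		  stream->pinned_buffers == 0) ||
		 stream->distance == 0))
		read_stream_start_pending_read(stream, suppress_advice);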


v12-0001-Provide-vectored-variant-of-ReadBuffer.patch
Description: Binary data


v12-0002-Provide-API-for-streaming-relation-data.patch
Description: Binary data


v12-0003-Use-streaming-I-O-in-pg_prewarm.patch
Description: Binary data


Re: Streaming I/O, vectored I/O (WIP)

2024-03-27 Thread Thomas Munro
New version with some cosmetic/comment changes, and Melanie's
read_stream_reset() function merged, as required by her sequential
scan user patch.  I tweaked it slightly: it might as well share code
with read_stream_end().  I think setting distance = 1 is fine for now,
and we might later want to adjust that as we learn more about more
interesting users of _reset().
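
For anyone catching up, a caller of the three public functions now
looks roughly like this (a pg_prewarm-style sketch; the callback and
struct names here are made up):

	typedef struct PrewarmState
	{
		BlockNumber next;
		BlockNumber nblocks;
	} PrewarmState;

	static BlockNumber
	prewarm_next_block(ReadStream *stream, void *callback_private_data,
					   void *per_buffer_data)
	{
		PrewarmState *p = callback_private_data;

		if (p->next >= p->nblocks)
			return InvalidBlockNumber;	/* end of stream */
		return p->next++;
	}

	...
	stream = read_stream_begin_relation(READ_STREAM_FULL, NULL, rel,
										MAIN_FORKNUM, prewarm_next_block,
										&state, 0);
	while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
		ReleaseBuffer(buf);		/* just touch the page and unpin */
	read_stream_end(stream);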
From 6b66a6412c90c8f696a8b5890596ba1ab7477191 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v11 1/4] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

Callers should respect the new GUC io_combine_limit, and the limit on
per-backend pins which is now exposed as a public interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of advice.

Author: Thomas Munro 
Author: Andres Freund  (optimization tweaks)
Reviewed-by: Melanie Plageman 
Reviewed-by: Heikki Linnakangas 
Reviewed-by: Nazir Bilal Yavuz 
Reviewed-by: Dilip Kumar 
Reviewed-by: Andres Freund 
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6ut5tum2g...@mail.gmail.com
---
 doc/src/sgml/config.sgml  |  14 +
 src/backend/storage/buffer/bufmgr.c   | 701 --
 src/backend/storage/buffer/localbuf.c |  14 +-
 src/backend/utils/misc/guc_tables.c   |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/storage/bufmgr.h  |  41 +-
 src/tools/pgindent/typedefs.list  |   1 +
 7 files changed, 564 insertions(+), 222 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5468637e2ef..f3736000ad2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2719,6 +2719,20 @@ include_dir 'conf.d'

   
 
+  
+   io_combine_limit (integer)
+   
+io_combine_limit configuration parameter
+   
+   
+   
+
+ Controls the largest I/O size in operations that combine I/O.
+ The default is 128kB.
+
+   
+  
+
   
max_worker_processes (integer)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c5..7123cbbaa2a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -160,6 +165,9 @@ int			checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER;
 int			bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER;
 int			backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
 
+/* Limit on how many blocks should be handled in single I/O operations. */
+int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
+
 /* local state for LockBufferForCleanup */
 static BufferDesc *PinCountWaitBuf = NULL;
 
@@ -471,10 +479,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 ForkNumber forkNum, BlockNumber blockNum,
-ReadBufferMode mode, BufferAccessStrategy strategy,
-bool *hit);
+ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 		   ForkNumber fork,
 		   BufferAccessStrategy strategy,
@@ -500,7 +507,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 		  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 			  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -781,7 +788,6 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
    ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -794,15 +800,9 @@ 

Re: Streaming I/O, vectored I/O (WIP)

2024-03-27 Thread Thomas Munro
On Thu, Mar 28, 2024 at 2:02 PM Thomas Munro  wrote:
> ... In practice on a non-toy system, that's always going to be
> io_combine_limit.  ...

And to be more explicit about that: you're right that we initialise
max_pinned_buffers such that it's usually at least io_combine_limit,
but then if you have a very small buffer pool it gets clobbered back
down again by LimitAdditionalPins() and may finish up as low as 1.
You're not allowed to pin more than 1/Nth of the whole buffer pool,
where N is approximately max connections (well it's not exactly that
but that's the general idea).  So it's a degenerate case, but it can
happen that max_pinned_buffers is lower than io_combine_limit and then
it's important not to set distance higher or you'd exceed the allowed
limits (or more likely the circular data structure would implode).
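
Schematically (names simplified; the real logic is in
LimitAdditionalPins()):

	/* roughly: no backend may hold more than its share of the pool */
	uint32		fair_share = NBuffers / Max(MaxBackends, 1);

	max_pinned_buffers = Min(max_pinned_buffers, Max(fair_share, 1));
	/* ...which can land below io_combine_limit on a tiny buffer pool */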




Re: Streaming I/O, vectored I/O (WIP)

2024-03-27 Thread Thomas Munro
On Mon, Mar 25, 2024 at 2:02 AM Thomas Munro  wrote:
> On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas  wrote:
> > >   /*
> > >    * Skip the initial ramp-up phase if the caller says we're going to be
> > >    * reading the whole relation.  This way we start out doing full-sized
> > >    * reads.
> > >    */
> > >   if (flags & PGSR_FLAG_FULL)
> > >       pgsr->distance = Min(MAX_BUFFERS_PER_TRANSFER, pgsr->max_pinned_buffers);
> > >   else
> > >       pgsr->distance = 1;
> >
> > Should this be "Max(MAX_BUFFERS_PER_TRANSFER,
> > pgsr->max_pinned_buffers)"? max_pinned_buffers cannot be smaller than
> > MAX_BUFFERS_PER_TRANSFER though, given how it's initialized earlier. So
> > perhaps just 'pgsr->distance = pgsr->max_pinned_buffers' ?
>
> Right, done.

BTW I forgot to mention that in v10 I changed my mind and debugged my
way back to the original coding, which now looks like this:

/*
 * Skip the initial ramp-up phase if the caller says we're going to be
 * reading the whole relation.  This way we start out assuming we'll be
 * doing full io_combine_limit sized reads (behavior B).
 */
if (flags & READ_STREAM_FULL)
stream->distance = Min(max_pinned_buffers, io_combine_limit);
else
stream->distance = 1;

It's not OK for distance to exceed max_pinned_buffers.  But if
max_pinned_buffers is huge, remember that the goal here is to access
'behavior B' meaning wide read calls but no unnecessary extra
look-ahead beyond what is needed for that, so we also don't want to
exceed io_combine_limit.  Therefore we want the minimum of those two
numbers.  In practice on a non-toy system, that's always going to be
io_combine_limit.  But I'm not sure how many users of READ_STREAM_FULL
there will be, and I am starting to wonder if it's a good name for the
flag, or even generally very useful.  It's sort of saying "I expect to
do I/O, and it'll be sequential, and I won't give up until the end".
But how many users can really make those claims?  pg_prewarm is unusual
in that it contains an explicit assumption that the cache is cold and
we want to warm it up.  But maybe we should just let the adaptive
algorithm do its thing.  It only takes a few reads to go from 1 ->
io_combine_limit.

Thinking harder, if we're going to keep this and not just be fully
adaptive, perhaps there should be a flag READ_STREAM_COLD, where you
hint that the data is not expected to be cached, and you'd combine
that with the _SEQUENTIAL hint.  pg_prewarm hints _COLD | _SEQUENTIAL.
Then the initial distance would be something uses the flag
combinations to select initial behavior A, B, C (and we'll quickly
adjust if you're wrong):

if (!(flags & READ_STREAM_COLD))
stream->distance = 1;
else if (flags & READ_STREAM_SEQUENTIAL)
stream->distance = Min(max_pinned_buffers, io_combine_limit);
else
stream->distance = max_pinned_buffers;

But probably almost all users especially in the executor haven't
really got much of a clue what they're going to do so they'd use the
initial starting position of 1 (A) and we'd soon figure it out.  Maybe
overengineering for pg_prewarm is a waste of time and we should just
delete the flag instead and hard code 1.




Re: Streaming I/O, vectored I/O (WIP)

2024-03-27 Thread Thomas Munro
On Thu, Mar 28, 2024 at 10:52 AM Thomas Munro  wrote:
> I think 1 is good, as a rescan is even more likely to find the pages
> in cache, and if that turns out to be wrong it'll very soon adjust.

Hmm, no I take that back, it probably won't be due to the
strategy/ring...  I see your point now... when I had a separate flag,
the old distance was remembered across but now I'm zapping it.  I was
trying to minimise the number of variables that have to be tested in
the fast path by consolidating.  Hmm, it is signed -- would it be too
weird if we used a negative number for "finished", so we can just flip
it on reset?




Re: Streaming I/O, vectored I/O (WIP)

2024-03-27 Thread Thomas Munro
On Thu, Mar 28, 2024 at 9:43 AM Melanie Plageman  wrote:
> For sequential scan, I added a little reset function to the streaming
> read API (read_stream_reset()) that just releases all the buffers.
> Previously, it set finished to true before releasing the buffers (to
> indicate it was done) and then set it back to false after. Now, I'll
> set distance to 0 before releasing the buffers and !0 after. I could
> just restore whatever value distance had before I set it to 0. Or I
> could set it to 1. But, thinking about it, are we sure we want to ramp
> up in the same way on rescans? Maybe we want to use some information
> from the previous scan to determine what to set distance to? Maybe I'm
> overcomplicating it...

I think 1 is good, as a rescan is even more likely to find the pages
in cache, and if that turns out to be wrong it'll very soon adjust.




Re: Streaming I/O, vectored I/O (WIP)

2024-03-27 Thread Melanie Plageman
On Wed, Mar 27, 2024 at 10:11 AM Thomas Munro  wrote:
>
> I got rid of "finished" (now represented by distance == 0, I was
> removing branches and variables).  I got rid of "started", which can
> now be deduced (used for suppressing advice when you're calling
> _next() because you need a block and we need to read it immediately),
> see the function argument suppress_advice.

I started rebasing the sequential scan streaming read user over this
new version, and this change (finished now represented with distance
== 0) made me realize that I'm not sure what to set distance to on
rescan.

For sequential scan, I added a little reset function to the streaming
read API (read_stream_reset()) that just releases all the buffers.
Previously, it set finished to true before releasing the buffers (to
indicate it was done) and then set it back to false after. Now, I'll
set distance to 0 before releasing the buffers and !0 after. I could
just restore whatever value distance had before I set it to 0. Or I
could set it to 1. But, thinking about it, are we sure we want to ramp
up in the same way on rescans? Maybe we want to use some information
from the previous scan to determine what to set distance to? Maybe I'm
overcomplicating it...
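
In other words, something like this sketch (not final):

	void
	read_stream_reset(ReadStream *stream)
	{
		Buffer		buffer;

		/* distance == 0 now means "finished", so look-ahead stops */
		stream->distance = 0;

		/* unpin everything the consumer hadn't taken yet */
		while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
			ReleaseBuffer(buffer);

		/* re-arm for the rescan; 1 restarts the usual ramp-up */
		stream->distance = 1;
	}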

> Here is a new proposal for the names, updated in v10:
>
> read_stream_begin_relation()
> read_stream_next_buffer()
> void read_stream_end()

Personally, I'm happy with these.

- Melanie




Re: Streaming I/O, vectored I/O (WIP)

2024-03-27 Thread Thomas Munro
On Wed, Mar 27, 2024 at 1:40 AM Heikki Linnakangas  wrote:
> Is int16 enough though? It seems so, because:
>
>  max_pinned_buffers = Max(max_ios * 4, buffer_io_size);
>
> and max_ios is constrained by the GUC's maximum MAX_IO_CONCURRENCY, and
> buffer_io_size is constrained by MAX_BUFFER_IO_SIZE == PG_IOV_MAX == 32.
>
> If someone changes those constants though, int16 might overflow and fail
> in weird ways. I'd suggest being more careful here and explicitly clamp
> max_pinned_buffers at PG_INT16_MAX or have a static assertion or
> something. (I think it needs to be somewhat less than PG_INT16_MAX,
> because of the extra "overflow buffers" stuff and some other places
> where you do arithmetic.)

Clamp added.
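Something along these lines (approximate):

	/* don't let the int16 indexes overflow, leaving some slop */
	max_pinned_buffers = Min(max_pinned_buffers,
							 PG_INT16_MAX - io_combine_limit - 1);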

> >   /*
> >    * We gave a contiguous range of buffer space to StartReadBuffers(), but
> >    * we want it to wrap around at max_pinned_buffers.  Move values that
> >    * overflowed into the extra space.  At the same time, put -1 in the I/O
> >    * slots for the rest of the buffers to indicate no I/O.  They are covered
> >    * by the head buffer's I/O, if there is one.  We avoid a % operator.
> >    */
> >   overflow = (stream->next_buffer_index + nblocks) - stream->max_pinned_buffers;
> >   if (overflow > 0)
> >   {
> >       memmove(&stream->buffers[0],
> >               &stream->buffers[stream->max_pinned_buffers],
> >               sizeof(stream->buffers[0]) * overflow);
> >       for (int i = 0; i < overflow; ++i)
> >           stream->buffer_io_indexes[i] = -1;
> >       for (int i = 1; i < nblocks - overflow; ++i)
> >           stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
> >   }
> >   else
> >   {
> >       for (int i = 1; i < nblocks; ++i)
> >           stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
> >   }
>
> Instead of clearing buffer_io_indexes here, it might be cheaper/simpler
> to initialize the array to -1 in streaming_read_buffer_begin(), and
> reset buffer_io_indexes[io_index] = -1 in streaming_read_buffer_next(),
> after the WaitReadBuffers() call. In other words, except when an I/O is
> in progress, keep all the elements at -1, even the elements that are not
> currently in use.

Yeah that wasn't nice and I had already got as far as doing exactly
that ↑ on my own, but your second idea ↓ is better!

> Alternatively, you could remember the first buffer that the I/O applies
> to in the 'ios' array. In other words, instead of pointing from buffer
> to the I/O that it depends on, point from the I/O to the buffer that
> depends on it. The last attached patch implements that approach. I'm not
> wedded to it, but it feels a little simpler.

Yeah, nice improvement.

> >   if (stream->ios[io_index].flags & READ_BUFFERS_ISSUE_ADVICE)
> >   {
> >   /* Distance ramps up fast (behavior C). */
> >   ...
> >   }
> >   else
> >   {
> >   /* No advice; move towards full I/O size (behavior B). */
> >   ...
> >   }
>
> The comment on ReadBuffersOperation says "Declared in public header only
> to allow inclusion in other structs, but contents should not be
> accessed", but here you access the 'flags' field.
>
> You also mentioned that the StartReadBuffers() argument list is too
> long. Perhaps the solution is to redefine ReadBuffersOperation so that
> it consists of two parts: 1st part is filled in by the caller, and
> contains the arguments, and 2nd part is private to bufmgr.c. The
> signature for StartReadBuffers() would then be just:
>
> bool StartReadBuffers(ReadBuffersOperation *operation);

Yeah.  I had already got as far as doing this on the regression
hunting expedition, but I kept some arguments for frequently changing
things, eg blocknum.  It means that the stuff that never changes is in
there, and the stuff that changes each time doesn't have to be written
to memory at all.

> That would make it OK to read the 'flags' field. It would also allow
> reusing the same ReadBuffersOperation struct for multiple I/Os for the
> same relation; you only need to change the changing parts of the struct
> on each operation.

Right.  Done.

> In the attached patch set, the first three patches are your v9 with no
> changes. The last patch refactors away 'buffer_io_indexes' like I
> mentioned above. The others are fixes for some other trivial things that
> caught my eye.

Thanks, all squashed into the patch.

In an offline chat with Robert and Andres, we searched for a better
name for the GUC.  We came up with "io_combine_limit".  It's easier to
document a general purpose limit than to explain what "buffer_io_size"
does (particularly since the set of affected features will grow over
time but starts so small).  I'll feel better about using it to 

Re: Streaming I/O, vectored I/O (WIP)

2024-03-26 Thread Heikki Linnakangas

On 24/03/2024 15:02, Thomas Munro wrote:

On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas  wrote:

Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next',
for a shorter name.


Hmm.  The idea of 'buffer' appearing in a couple of names is that
there are conceptually other kinds of I/O that we might want to
stream, like raw files or buffers other than the buffer pool, maybe
even sockets, so this would be part of a family of similar interfaces.
I think it needs to be clear that this variant gives you buffers.  I'm
OK with removing "get" but I guess it would be better to keep the
words in the same order across the three functions?  What about these?

streaming_read_buffer_begin();
streaming_read_buffer_next();
streaming_read_buffer_end();

Tried like that in this version.  Other ideas would be to make
"stream" the main noun, buffered_read_stream_begin() or something.
Ideas welcome.


Works for me, although "streaming_read_buffer" is a pretty long prefix. 
The flags like "STREAMING_READ_MAINTENANCE" probably ought to be 
"STREAMING_READ_BUFFER_MAINTENANCE" as well.


Maybe "buffer_stream_next()"?


Here are some other changes:

* I'm fairly happy with the ABC adaptive distance algorithm so far, I
think, but I spent more time tidying up the way it is implemented.  I
didn't like the way each 'range' had buffer[MAX_BUFFERS_PER_TRANSFER],
so I created a new dense array stream->buffers that behaved as a
second circular queue.

* The above also made it trivial for MAX_BUFFERS_PER_TRANSFER to
become the GUC that it always wanted to be: buffer_io_size defaulting
to 128kB.  Seems like a reasonable thing to have?  Could also
influence things like bulk write?  (The main problem I have with the
GUC currently is choosing a category, async resources is wrong)

* By analogy, it started to look a bit funny that each range had room
for a ReadBuffersOperation, and we had enough ranges for
max_pinned_buffers * 1 block range.  So I booted that out to another
dense array, of size max_ios.

* At the same time, Bilal and Andres had been complaining privately
about 'range' management overheads showing up in perf and creating a
regression against master on fully cached scans that do nothing else
(eg pg_prewarm, where we lookup, pin, unpin every page and do no I/O
and no CPU work with the page, a somewhat extreme case but a
reasonable way to isolate the management costs); having made the above
change, it suddenly seemed obvious that I should make the buffers
array the 'main' circular queue, pointing off to another place for
information required for dealing with misses.  In this version, there
are no more range objects.  This feels better and occupies and touches
less memory.  See pictures below.


+1 for all that. Much better!


* Various indexes and sizes that couldn't quite fit in uint8_t but
couldn't possibly exceed a few thousand because they are bounded by
numbers deriving from range-limited GUCs are now int16_t (while I was
looking for low hanging opportunities to reduce memory usage...)


Is int16 enough though? It seems so, because:

max_pinned_buffers = Max(max_ios * 4, buffer_io_size);

and max_ios is constrained by the GUC's maximum MAX_IO_CONCURRENCY, and 
buffer_io_size is constrained by MAX_BUFFER_IO_SIZE == PG_IOV_MAX == 32.


If someone changes those constants though, int16 might overflow and fail 
in weird ways. I'd suggest being more careful here and explicitly clamp 
max_pinned_buffers at PG_INT16_MAX or have a static assertion or 
something. (I think it needs to be somewhat less than PG_INT16_MAX, 
because of the extra "overflow buffers" stuff and some other places 
where you do arithmetic.)



/*
 * We gave a contiguous range of buffer space to StartReadBuffers(), but
 * we want it to wrap around at max_pinned_buffers.  Move values that
 * overflowed into the extra space.  At the same time, put -1 in the I/O
 * slots for the rest of the buffers to indicate no I/O.  They are covered
 * by the head buffer's I/O, if there is one.  We avoid a % operator.
 */
overflow = (stream->next_buffer_index + nblocks) - stream->max_pinned_buffers;
if (overflow > 0)
{
	memmove(&stream->buffers[0],
			&stream->buffers[stream->max_pinned_buffers],
			sizeof(stream->buffers[0]) * overflow);
	for (int i = 0; i < overflow; ++i)
		stream->buffer_io_indexes[i] = -1;
	for (int i = 1; i < nblocks - overflow; ++i)
		stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
}
else
{
	for (int i = 1; i < nblocks; ++i)
		stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
}


Instead of clearing buffer_io_indexes here, it might be cheaper/simpler 
to initialize the array to -1 in streaming_read_buffer_begin(), and 
reset 
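
(The rest of that suggestion is cut off in the archive.  Presumably the
initialization half -- sketched here with the patch's field names, the
lost "reset" part being unknown -- would be something like:)

/* Done once in streaming_read_buffer_begin() instead of on every read. */
for (int i = 0; i < stream->max_pinned_buffers; i++)
	stream->buffer_io_indexes[i] = -1;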

Re: Streaming I/O, vectored I/O (WIP)

2024-03-24 Thread Thomas Munro
On Mon, Mar 25, 2024 at 6:30 AM Melanie Plageman
 wrote:
> I haven't reviewed the whole patch, but as I was rebasing
> bitmapheapscan streaming read user, I found callback_private confusing
> because it seems like it is a private callback, not private data
> belonging to the callback. Perhaps call it callback_private_data? Also

WFM.

> maybe mention what it is for in the comment above
> streaming_read_buffer_begin() and in the StreamingRead structure
> itself.

Yeah.  I've tried to improve the comments on all three public
functions.  I also moved the three public functions _begin(), _next(),
_end() to be next to each other after the static helper functions.

Working on perf regression/tuning reports today, more soon...
From edd3d078cf8d4b0c2f08df82295825f7107ec62b Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v9 1/4] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

Callers should respect the new GUC buffer_io_size, and the limit on
per-backend pins which is now exposed as a public interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of advice.

Reviewed-by: Melanie Plageman 
Reviewed-by: Heikki Linnakangas 
Reviewed-by: Nazir Bilal Yavuz 
Reviewed-by: Dilip Kumar 
Reviewed-by: Andres Freund 
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6ut5tum2g...@mail.gmail.com
---
 doc/src/sgml/config.sgml  |  14 +
 src/backend/storage/buffer/bufmgr.c   | 658 --
 src/backend/storage/buffer/localbuf.c |  14 +-
 src/backend/utils/misc/guc_tables.c   |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/storage/bufmgr.h  |  45 +-
 src/tools/pgindent/typedefs.list  |   1 +
 7 files changed, 535 insertions(+), 212 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 65a6e6c4086..3af86c59384 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2719,6 +2719,20 @@ include_dir 'conf.d'

   
 
+  
+   buffer_io_size (integer)
+   
+buffer_io_size configuration parameter
+   
+   
+   
+
+ Controls the target I/O size in operations that coalesce buffer I/O.
+ The default is 128kB.
+
+   
+  
+
   
max_worker_processes (integer)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c5..b5347678726 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -160,6 +165,12 @@ int			checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER;
 int			bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER;
 int			backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
 
+/*
+ * How many buffers should be coalesced into single I/O operations where
+ * possible.
+ */
+int			buffer_io_size = DEFAULT_BUFFER_IO_SIZE;
+
 /* local state for LockBufferForCleanup */
 static BufferDesc *PinCountWaitBuf = NULL;
 
@@ -471,10 +482,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 ForkNumber forkNum, BlockNumber blockNum,
-ReadBufferMode mode, BufferAccessStrategy strategy,
-bool *hit);
+ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 		   ForkNumber fork,
 		   BufferAccessStrategy strategy,
@@ -500,7 +510,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 		  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 			  

Re: Streaming I/O, vectored I/O (WIP)

2024-03-24 Thread Melanie Plageman
On Sun, Mar 24, 2024 at 9:02 AM Thomas Munro  wrote:
>
> On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas  wrote:
> > On 12/03/2024 15:02, Thomas Munro wrote:
> > > src/backend/storage/aio/streaming_read.c
> > > src/include/storage/streaming_read.h
> >
> > Standard file header comments missing.
>
> Fixed.
>
> > It would be nice to have a comment at the top of streaming_read.c,
> > explaining at a high level how the circular buffer, lookahead and all
> > that works. Maybe even some diagrams.
>
> Done.
>
> > For example, what is head and what is tail? Before reading the code, I
> > assumed that 'head' was the next block range to return in
> > pg_streaming_read_buffer_get_next(). But I think it's actually the other
> > way round?
>
> Yeah.  People seem to have different natural intuitions about head vs
> tail in this sort of thing, so I've switched to descriptive names:
> stream->{oldest,next}_buffer_index (see also below).
>
> > > /*
> > >  * Create a new streaming read object that can be used to perform the
> > >  * equivalent of a series of ReadBuffer() calls for one fork of one relation.
> > >  * Internally, it generates larger vectored reads where possible by looking
> > >  * ahead.
> > >  */
> > > PgStreamingRead *
> > > pg_streaming_read_buffer_alloc(int flags,
> > >                                void *pgsr_private,
> > >                                size_t per_buffer_data_size,
> > >                                BufferAccessStrategy strategy,
> > >                                BufferManagerRelation bmr,
> > >                                ForkNumber forknum,
> > >                                PgStreamingReadBufferCB next_block_cb)
> >
> > I'm not a fan of the name, especially the 'alloc' part. Yeah, most of
> > the work it does is memory allocation. But I'd suggest something like
> > 'pg_streaming_read_begin' instead.
>
> I like it.  Done.
>
> > Do we really need the pg_ prefix in these?
>
> Good question.  My understanding of our convention is that pg_ is
> needed for local replacements/variants/extensions of things with well
> known names (pg_locale_t, pg_strdup(), yada yada), and perhaps also in
> a few places where the word is very common/short and we want to avoid
> collisions and make sure it's obviously ours (pg_popcount?), and I
> guess places that reflect the name of a SQL identifier with a prefix,
> but this doesn't seem to qualify for any of those things.  It's a new
> thing, our own thing entirely, and sufficiently distinctive and
> unconfusable with standard stuff.  So, prefix removed.
>
> Lots of other patches on top of this one are using "pgsr" as a
> variable name, ie containing that prefix; perhaps they would use  "sr"
> or "streaming_read" or "stream".  I used "stream" in a few places in
> this version.
>
> Other names improved in this version IMHO: pgsr_private ->
> callback_private.  I find it clearer, as a way to indicate that the
> provider of the callback "owns" it.  I also reordered the arguments:
> now it's streaming_read_buffer_begin(..., callback, callback_private,
> per_buffer_data_size), to keep those three things together.

I haven't reviewed the whole patch, but as I was rebasing
bitmapheapscan streaming read user, I found callback_private confusing
because it seems like it is a private callback, not private data
belonging to the callback. Perhaps call it callback_private_data? Also
maybe mention what it is for in the comment above
streaming_read_buffer_begin() and in the StreamingRead structure
itself.

- Melanie




Re: Streaming I/O, vectored I/O (WIP)

2024-03-24 Thread Thomas Munro
On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas  wrote:
> On 12/03/2024 15:02, Thomas Munro wrote:
> > src/backend/storage/aio/streaming_read.c
> > src/include/storage/streaming_read.h
>
> Standard file header comments missing.

Fixed.

> It would be nice to have a comment at the top of streaming_read.c,
> explaining at a high level how the circular buffer, lookahead and all
> that works. Maybe even some diagrams.

Done.

> For example, what is head and what is tail? Before reading the code, I
> assumed that 'head' was the next block range to return in
> pg_streaming_read_buffer_get_next(). But I think it's actually the other
> way round?

Yeah.  People seem to have different natural intuitions about head vs
tail in this sort of thing, so I've switched to descriptive names:
stream->{oldest,next}_buffer_index (see also below).

> > /*
> >  * Create a new streaming read object that can be used to perform the
> >  * equivalent of a series of ReadBuffer() calls for one fork of one relation.
> >  * Internally, it generates larger vectored reads where possible by looking
> >  * ahead.
> >  */
> > PgStreamingRead *
> > pg_streaming_read_buffer_alloc(int flags,
> >                                void *pgsr_private,
> >                                size_t per_buffer_data_size,
> >                                BufferAccessStrategy strategy,
> >                                BufferManagerRelation bmr,
> >                                ForkNumber forknum,
> >                                PgStreamingReadBufferCB next_block_cb)
>
> I'm not a fan of the name, especially the 'alloc' part. Yeah, most of
> the work it does is memory allocation. But I'd suggest something like
> 'pg_streaming_read_begin' instead.

I like it.  Done.

> Do we really need the pg_ prefix in these?

Good question.  My understanding of our convention is that pg_ is
needed for local replacements/variants/extensions of things with well
known names (pg_locale_t, pg_strdup(), yada yada), and perhaps also in
a few places where the word is very common/short and we want to avoid
collisions and make sure it's obviously ours (pg_popcount?), and I
guess places that reflect the name of a SQL identifier with a prefix,
but this doesn't seem to qualify for any of those things.  It's a new
thing, our own thing entirely, and sufficiently distinctive and
unconfusable with standard stuff.  So, prefix removed.

Lots of other patches on top of this one are using "pgsr" as a
variable name, ie containing that prefix; perhaps they would use  "sr"
or "streaming_read" or "stream".  I used "stream" in a few places in
this version.

Other names improved in this version IMHO: pgsr_private ->
callback_private.  I find it clearer, as a way to indicate that the
provider of the callback "owns" it.  I also reordered the arguments:
now it's streaming_read_buffer_begin(..., callback, callback_private,
per_buffer_data_size), to keep those three things together.

> > Buffer
> > pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void 
> > **per_buffer_data)
>
> Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next',
> for a shorter name.

Hmm.  The idea of 'buffer' appearing in a couple of names is that
there are conceptually other kinds of I/O that we might want to
stream, like raw files or buffers other than the buffer pool, maybe
even sockets, so this would be part of a family of similar interfaces.
I think it needs to be clear that this variant gives you buffers.  I'm
OK with removing "get" but I guess it would be better to keep the
words in the same order across the three functions?  What about these?

streaming_read_buffer_begin();
streaming_read_buffer_next();
streaming_read_buffer_end();

Tried like that in this version.  Other ideas would be to make
"stream" the main noun, buffered_read_stream_begin() or something.
Ideas welcome.

It's also a bit grammatically weird to say StartReadBuffers() and
WaitReadBuffers() in the bufmgr API...  Hmm.  Perhaps we should just
call it ReadBuffers() and WaitForBufferIO()?  Maybe surprising because
the former isn't just like ReadBuffer() ... but on the other hand no
one said it has to be, and sometimes it even is (when it gets a hit).
I suppose there could even be a flag READ_BUFFERS_WAIT or the opposite
to make the asynchrony optional or explicit if someone has a problem
with that.

(Hmm, that'd be a bit like the Windows native file API, where
ReadFile() is synchronous or asynchronous depending on flags.)

> >
> >   /*
> >    * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
> >    * When it is full, there is an empty element between head and tail.  Head
> >    * can also be empty (nblocks == 0), therefore we need two extra elements
> >    * for non-occupied ranges, on top of 

Re: Streaming I/O, vectored I/O (WIP)

2024-03-19 Thread Heikki Linnakangas

Some quick comments:

On 12/03/2024 15:02, Thomas Munro wrote:

src/backend/storage/aio/streaming_read.c
src/include/storage/streaming_read.h


Standard file header comments missing.

It would be nice to have a comment at the top of streaming_read.c, 
explaining at a high level how the circular buffer, lookahead and all 
that works. Maybe even some diagrams.


For example, what is head and what is tail? Before reading the code, I 
assumed that 'head' was the next block range to return in 
pg_streaming_read_buffer_get_next(). But I think it's actually the other 
way round?



/*
 * Create a new streaming read object that can be used to perform the
 * equivalent of a series of ReadBuffer() calls for one fork of one relation.
 * Internally, it generates larger vectored reads where possible by looking
 * ahead.
 */
PgStreamingRead *
pg_streaming_read_buffer_alloc(int flags,
                               void *pgsr_private,
                               size_t per_buffer_data_size,
                               BufferAccessStrategy strategy,
                               BufferManagerRelation bmr,
                               ForkNumber forknum,
                               PgStreamingReadBufferCB next_block_cb)


I'm not a fan of the name, especially the 'alloc' part. Yeah, most of 
the work it does is memory allocation. But I'd suggest something like 
'pg_streaming_read_begin' instead.


Do we really need the pg_ prefix in these?


Buffer
pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)


Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next', 
for a shorter name.





/*
 * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
 * When it is full, there is an empty element between head and tail.  Head
 * can also be empty (nblocks == 0), therefore we need two extra elements
 * for non-occupied ranges, on top of max_pinned_buffers to allow for the
 * maximum possible number of occupied ranges of the smallest possible
 * size of one.
 */
size = max_pinned_buffers + 2;


I didn't understand this explanation for why it's + 2.


/*
 * Skip the initial ramp-up phase if the caller says we're going to be
 * reading the whole relation.  This way we start out doing full-sized
 * reads.
 */
if (flags & PGSR_FLAG_FULL)
	pgsr->distance = Min(MAX_BUFFERS_PER_TRANSFER, pgsr->max_pinned_buffers);
else
	pgsr->distance = 1;


Should this be "Max(MAX_BUFFERS_PER_TRANSFER, 
pgsr->max_pinned_buffers)"? max_pinned_buffers cannot be smaller than 
MAX_BUFFERS_PER_TRANSFER though, given how it's initialized earlier. So 
perhaps just 'pgsr->distance = pgsr->max_pinned_buffers' ?


--
Heikki Linnakangas
Neon (https://neon.tech)




Re: Streaming I/O, vectored I/O (WIP)

2024-03-18 Thread Nazir Bilal Yavuz
Hi,

On Sat, 16 Mar 2024 at 02:53, Thomas Munro  wrote:
>
> I am planning to push the bufmgr.c patch soon.  At that point the new
> API won't have any direct callers yet, but the traditional
> ReadBuffer() family of functions will internally reach
> StartReadBuffers(nblocks=1) followed by WaitReadBuffers(),
> ZeroBuffer() or nothing as appropriate.  Any more thoughts or
> objections?  Naming, semantics, correctness of buffer protocol,
> sufficiency of comments, something else?

+if (StartReadBuffers(bmr,
+ &buffer,
+ forkNum,
+ blockNum,
+ &nblocks,
+ strategy,
+ flags,
+ &operation))
+WaitReadBuffers(&operation);

I think we need to call WaitReadBuffers when 'mode !=
RBM_ZERO_AND_CLEANUP_LOCK && mode != RBM_ZERO_AND_LOCK' or am I
missing something?

Couple of nitpicks:

It would be nice to explain what the PrepareReadBuffer function does
with a comment.

+if (nblocks == 0)
+return;/* nothing to do */
It is guaranteed that nblocks will be bigger than 0. Can't we just use
Assert(operation->io_buffers_len > 0);?

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Streaming I/O, vectored I/O (WIP)

2024-03-15 Thread Thomas Munro
I am planning to push the bufmgr.c patch soon.  At that point the new
API won't have any direct callers yet, but the traditional
ReadBuffer() family of functions will internally reach
StartReadBuffers(nblocks=1) followed by WaitReadBuffers(),
ZeroBuffer() or nothing as appropriate.  Any more thoughts or
objections?  Naming, semantics, correctness of buffer protocol,
sufficiency of comments, something else?




Re: Streaming I/O, vectored I/O (WIP)

2024-03-12 Thread Dilip Kumar
On Tue, Mar 12, 2024 at 12:10 PM Thomas Munro  wrote:
>
> I think you'd be right if StartReadBuffers() were capable of
> processing a sequence consisting of a hit followed by misses, but
> currently it always gives up after the first hit.  That is, it always
> processes some number of misses (0-16) and then at most one hit.  So
> for now the variable would always turn out to be the same as blockNum.
>
Okay, then shouldn't this "if (found)" block immediately break the
loop so that when we hit the block we just return that block?  So it
makes sense what you explained but with the current code if there are
the first few hits followed by misses then we will issue the
smgrprefetch() for the initial hit blocks as well.

+ if (found)
+ {
+ /*
+ * Terminate the read as soon as we get a hit.  It could be a
+ * single buffer hit, or it could be a hit that follows a readable
+ * range.  We don't want to create more than one readable range,
+ * so we stop here.
+ */
+ actual_nblocks = operation->nblocks = *nblocks = i + 1;(Dilip: I
think we should break after this?)
+ }
+ else
+ {
+ /* Extend the readable range to cover this block. */
+ operation->io_buffers_len++;
+ }
+ }

> The reason is that I wanted to allow "full sized" read system calls
> to form.  If you said "hey please read these 16 blocks" (I'm calling
> that "full sized", AKA MAX_BUFFERS_PER_TRANSFER), and it found 2 hits,
> then it could only form a read of 14 blocks, but there might be more
> blocks that could be read after those.  We would have some arbitrary
> shorter read system calls, when we wanted to make them all as big as
> possible.  So in the current patch you say "hey please read these 16
> blocks" and it returns saying "only read 1", you call again with 15
> and it says "only read 1", and you call again and says "read 16!"
> (assuming 2 more were readable after the original range we started
> with).  Then physical reads are maximised.  Maybe there is some nice
> way to solve that, but I thought this way was the simplest (and if
> there is some instruction-cache-locality/tight-loop/perf reason why we
> should work harder to find ranges of hits, it could be for later).
> Does that make sense?

Understood, I think this makes sense.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Streaming I/O, vectored I/O (WIP)

2024-03-12 Thread Thomas Munro
On Tue, Mar 12, 2024 at 7:40 PM Thomas Munro  wrote:
> possible.  So in the current patch you say "hey please read these 16
> blocks" and it returns saying "only read 1", you call again with 15

Oops, typo worth correcting: s/15/16/.  Point being that the caller is
interested in more blocks after the original 16, so it uses 16 again
when it calls back (because that's the size of the Buffer array it
provides).




Re: Streaming I/O, vectored I/O (WIP)

2024-03-12 Thread Thomas Munro
On Tue, Mar 12, 2024 at 7:15 PM Dilip Kumar  wrote:
> I am planning to review this patch set, so started going through 0001,
> I have a question related to how we are issuing smgrprefetch in
> StartReadBuffers()

Thanks!

> + /*
> + * In theory we should only do this if PrepareReadBuffers() had to
> + * allocate new buffers above.  That way, if two calls to
> + * StartReadBuffers() were made for the same blocks before
> + * WaitReadBuffers(), only the first would issue the advice.
> + * That'd be a better simulation of true asynchronous I/O, which
> + * would only start the I/O once, but isn't done here for
> + * simplicity.  Note also that the following call might actually
> + * issue two advice calls if we cross a segment boundary; in a
> + * true asynchronous version we might choose to process only one
> + * real I/O at a time in that case.
> + */
> + smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
>   }
>
>  This is always issuing smgrprefetch starting with the input blockNum,
> shouldn't we pass the first blockNum which we did not find in the
>  Buffer pool?  So basically in the loop above this call where we are
> doing PrepareReadBuffer() we should track the first blockNum for which
>  the found is not true and pass that blockNum into the smgrprefetch()
> as a first block right?

I think you'd be right if StartReadBuffers() were capable of
processing a sequence consisting of a hit followed by misses, but
currently it always gives up after the first hit.  That is, it always
processes some number of misses (0-16) and then at most one hit.  So
for now the variable would always turn out to be the same as blockNum.

The reason is that I wanted to allow "full sized" read system calls
to form.  If you said "hey please read these 16 blocks" (I'm calling
that "full sized", AKA MAX_BUFFERS_PER_TRANSFER), and it found 2 hits,
then it could only form a read of 14 blocks, but there might be more
blocks that could be read after those.  We would have some arbitrary
shorter read system calls, when we wanted to make them all as big as
possible.  So in the current patch you say "hey please read these 16
blocks" and it returns saying "only read 1", you call again with 15
and it says "only read 1", and you call again and says "read 16!"
(assuming 2 more were readable after the original range we started
with).  Then physical reads are maximised.  Maybe there is some nice
way to solve that, but I thought this way was the simplest (and if
there is some instruction-cache-locality/tight-loop/perf reason why we
should work harder to find ranges of hits, it could be for later).
Does that make sense?
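
(The calling pattern described above, sketched with illustrative
variables around the thread's StartReadBuffers()/WaitReadBuffers() API;
the argument order follows the patch hunk quoted elsewhere in this
thread, but treat the whole loop as a sketch, not the patch's code:)

BlockNumber blocknum = first_block;		/* illustrative bounds */

while (blocknum < last_block)
{
	int			nblocks = Min(last_block - blocknum, MAX_BUFFERS_PER_TRANSFER);

	if (StartReadBuffers(bmr, buffers, forknum, blocknum, &nblocks,
						 strategy, flags, &operation))
		WaitReadBuffers(&operation);

	/* On return, nblocks holds how many blocks were actually covered. */
	blocknum += nblocks;
}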




Re: Streaming I/O, vectored I/O (WIP)

2024-03-12 Thread Dilip Kumar
On Sat, Mar 9, 2024 at 3:55 AM Thomas Munro  wrote:
>
Hi Thomas,

I am planning to review this patch set, so started going through 0001,
I have a question related to how we are issuing smgrprefetch in
StartReadBuffers()

+ if (operation->io_buffers_len > 0)
+ {
+ if (flags & READ_BUFFERS_ISSUE_ADVICE)
  {
- if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
- {
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- MemSet((char *) bufBlock, 0, BLCKSZ);
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
+ /*
+ * In theory we should only do this if PrepareReadBuffers() had to
+ * allocate new buffers above.  That way, if two calls to
+ * StartReadBuffers() were made for the same blocks before
+ * WaitReadBuffers(), only the first would issue the advice.
+ * That'd be a better simulation of true asynchronous I/O, which
+ * would only start the I/O once, but isn't done here for
+ * simplicity.  Note also that the following call might actually
+ * issue two advice calls if we cross a segment boundary; in a
+ * true asynchronous version we might choose to process only one
+ * real I/O at a time in that case.
+ */
+ smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
  }

 This is always issuing smgrprefetch starting with the input blockNum,
shouldn't we pass the first blockNum which we did not find in the
 Buffer pool?  So basically in the loop above this call where we are
doing PrepareReadBuffer() we should track the first blockNum for which
 the found is not true and pass that blockNum into the smgrprefetch()
as a first block right?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Streaming I/O, vectored I/O (WIP)

2024-02-26 Thread Robert Haas
On Tue, Feb 27, 2024 at 9:25 AM Thomas Munro  wrote:
> Here's the 2 step version.  The streaming_read.c API is unchanged, but
> the bufmgr.c API now has only the following extra functions:
>
>   bool StartReadBuffers(..., int *nblocks, ..., ReadBuffersOperation *op)
>   WaitReadBuffers(ReadBuffersOperation *op)

I wonder if there are semi-standard names that people use for this
kind of API. Somehow I like "start" and "finish" or "start" and
"complete" better than "start" and "wait". But I don't really know
what's best. If there's a usual practice, it'd be good to adhere to
it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Streaming I/O, vectored I/O (WIP)

2024-02-26 Thread Thomas Munro
On Wed, Feb 7, 2024 at 11:54 PM Nazir Bilal Yavuz  wrote:
> 0001-Provide-vectored-variant-of-ReadBuffer:
>
> - Do we need to pass the hit variable to ReadBuffer_common()? I think
> it can be just declared in the ReadBuffer_common() now.

Right, thanks!  Done, in the version I'll post shortly.

> 0003-Provide-API-for-streaming-reads-of-relations:
>
> - Do we need to re-think about getting a victim buffer logic?
> StrategyGetBuffer() function errors if it can not find any unpinned
> buffers, this can be more common in the async world since we pin
> buffers before completing the read (while looking ahead).

Hmm, well that is what the pin limit machinery is supposed to be the
solution to.  It has always been possible to see that error if
shared_buffers is too small, so that your backends can't pin what they
need to make progress because there are too many other backends, the
whole buffer pool is pinned and there is nothing available to
steal/evict.  Here, sure, we pin more stuff per backend, but not more
than a "fair share", that is, Buffers / max backends, so it's not any
worse, is it?  Well maybe it's marginally worse in some case, for
example if a query that uses many streams has one pinned buffer per
stream (which we always allow) where before we'd have acquired and
released pins in a slightly different sequence or whatever, but there
is always going to be a minimum shared_buffers that will work at all
for a given workload and we aren't changing it by much if at all here.
If you're anywhere near that limit, your performance must be so bad
that it'd only be a toy setting anyway.  Does that sound reasonable?
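
(For intuition, the "fair share" arithmetic is roughly the following
sketch; the function and variable names are illustrative, not
bufmgr.c's actual pin limit code:)

static int
fair_share_pin_limit(int nbuffers, int max_backends, int already_pinned)
{
	/* Each backend may pin about its share, but always at least one. */
	int			fair_share = Max(1, nbuffers / max_backends);

	return Max(0, fair_share - already_pinned);
}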

Note that this isn't the first to use multi-pin logic or that limit
mechanism: that was the new extension code that shipped in 16.  This
will do that more often, though.

> - If the returned block from the callback is an invalid block,
> pg_streaming_read_look_ahead() sets pgsr->finished = true. Could there
> be cases like the returned block being an invalid block but we should
> continue to read after this invalid block?

Yeah, I think there will be, and I think we should do it with some
kind of reset/restart function.  I don't think we need it for the
current users so I haven't included it yet (there is a maybe-related
discussion about reset for another reason; I think Melanie has an
idea about that), but I think something like that will be useful for
future stuff like streaming recovery, where you can run out of WAL to
read but more will come via the network soon.

> - max_pinned_buffers and pinned_buffers_trigger variables are set in
> the initialization part (in the
> pg_streaming_read_buffer_alloc_internal() function) then they do not
> change. In some cases there could be no acquirable buffers to pin
> while initializing the pgsr (LimitAdditionalPins() set
> max_pinned_buffers to 1) but while the read is continuing there could
> be chances to create larger reads (other consecutive reads are
> finished while this read is continuing). Do you think that trying to
> reset max_pinned_buffers and pinned_buffers_trigger to have higher
> values after the initialization to have larger reads make sense?

That sounds hard!  You're right that in the execution of a query there
might well be cases like that (inner and outer scan of a hash join
don't actually run at the same time, likewise for various other plan
shapes), and something that would magically and dynamically balance
resource usage might be ideal, but I don't know where to begin.
Concretely, as long as your buffer pool is measured in gigabytes and
your max backends is measured in hundreds, the per backend pin limit
should actually be fairly hard to hit anyway, as it would be in the
thousands.  So I don't think it is as important as other resource
usage balance problems that we also don't attempt (memory, CPU, I/O
bandwidth).

> +/* Is there a head range that we can't extend? */
> +head_range = &pgsr->ranges[pgsr->head];
> +if (head_range->nblocks > 0 &&
> +(!need_complete ||
> + !head_range->need_complete ||
> + head_range->blocknum + head_range->nblocks != blocknum))
> +{
> +/* Yes, time to start building a new one. */
> +head_range = pg_streaming_read_new_range(pgsr);
>
> - I think if both need_complete and head_range->need_complete are
> false, we can extend the head range regardless of the consecutiveness
> of the blocks.

Yeah, I think we can experiment with ideas like that.  Not done yet
but I'm thinking about it -- more shortly.

> 0006-Allow-streaming-reads-to-ramp-up-in-size:
>
> - ramp_up_pin_limit variable is declared as an int but we do not check
> the overflow while doubling it. This could be a problem in longer
> reads.

But it can't get very high, because eventually it exceeds
max_pinned_buffers, which is anchored to the ground by various small
limits.




Re: Streaming I/O, vectored I/O (WIP)

2024-02-07 Thread Nazir Bilal Yavuz
Hi,

Thanks for working on this!

On Wed, 10 Jan 2024 at 07:14, Thomas Munro  wrote:

> Thanks!  I committed the patches up as far as smgr.c before the
> holidays.  The next thing to commit would be the bufmgr.c one for
> vectored ReadBuffer(), v5-0001.  Here's my response to your review of
> that, which triggered quite a few changes.
>
> See also new version of the streaming_read.c patch, with change list
> at end.  (I'll talk about v5-0002, the SMgrRelation lifetime one, over
> on the separate thread about that where Heikki posted a better
> version.)

I have a couple of comments / questions.

0001-Provide-vectored-variant-of-ReadBuffer:

- Do we need to pass the hit variable to ReadBuffer_common()? I think
it can be just declared in the ReadBuffer_common() now.


0003-Provide-API-for-streaming-reads-of-relations:

- Do we need to re-think about getting a victim buffer logic?
StrategyGetBuffer() function errors if it can not find any unpinned
buffers, this can be more common in the async world since we pin
buffers before completing the read (while looking ahead).

- If the returned block from the callback is an invalid block,
pg_streaming_read_look_ahead() sets pgsr->finished = true. Could there
be cases like the returned block being an invalid block but we should
continue to read after this invalid block?

- max_pinned_buffers and pinned_buffers_trigger variables are set in
the initialization part (in the
pg_streaming_read_buffer_alloc_internal() function) then they do not
change. In some cases there could be no acquirable buffers to pin
while initializing the pgsr (LimitAdditionalPins() set
max_pinned_buffers to 1) but while the read is continuing there could
be chances to create larger reads (other consecutive reads are
finished while this read is continuing). Do you think that trying to
reset max_pinned_buffers and pinned_buffers_trigger to have higher
values after the initialization to have larger reads make sense?

+/* Is there a head range that we can't extend? */
+head_range = &pgsr->ranges[pgsr->head];
+if (head_range->nblocks > 0 &&
+(!need_complete ||
+ !head_range->need_complete ||
+ head_range->blocknum + head_range->nblocks != blocknum))
+{
+/* Yes, time to start building a new one. */
+head_range = pg_streaming_read_new_range(pgsr);

- I think if both need_complete and head_range->need_complete are
false, we can extend the head range regardless of the consecutiveness
of the blocks.


0006-Allow-streaming-reads-to-ramp-up-in-size:

- ramp_up_pin_limit variable is declared as an int but we do not check
the overflow while doubling it. This could be a problem in longer
reads.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Streaming I/O, vectored I/O (WIP)

2024-01-11 Thread Thomas Munro
On Fri, Jan 12, 2024 at 3:31 AM Heikki Linnakangas  wrote:
> Ok. It feels surprising to have three steps. I understand that you need
> two steps, one to start the I/O and another to wait for them to finish,
> but why do you need separate Prepare and Start steps? What can you do in
> between them? (You explained that. I'm just saying that that's my
> initial reaction when seeing that API. It is surprising.)

Actually I don't think I explained that very well.  First, some more
detail about how a two-step version would work:

* we only want to start one I/O in a StartReadBuffers() call, because
otherwise it is hard/impossible for the caller to cap I/O concurrency
* therefore StartReadBuffers() can handle sequences matching /^H*I*H*/
("H" = hit, "I" = miss, I/O) in one call (see the sketch after this list)
* in the asynchronous version, "I" in that pattern means we got
BM_IO_IN_PROGRESS
* in the synchronous version, "I" means that it's not valid, not
BM_IO_IN_PROGRESS, but we won't actually try to get BM_IO_IN_PROGRESS
until the later Complete/Wait call (and then the answer might change,
but we'll just deal with that by looping in the synchronous version)
* streaming_read.c has to deal with buffering up work that
StartReadBuffers() didn't accept
* that's actually quite easy, you just use the rest to create a new
range in the next slot
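
(To make the /^H*I*H*/ rule above concrete, here is a purely
illustrative helper -- not code from the patch -- that computes how
many leading blocks of a request such a call could accept, given an
array of hit flags; it assumes PostgreSQL's usual headers for bool:)

static int
accepted_length(const bool *hit, int nblocks)
{
	int			i = 0;

	while (i < nblocks && hit[i])	/* leading hits ("H*") */
		i++;
	while (i < nblocks && !hit[i])	/* misses forming one I/O ("I*") */
		i++;
	while (i < nblocks && hit[i])	/* trailing hits ("H*") */
		i++;
	return i;
}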

Previously I thought the requirement to deal with buffering future
stuff that StartReadBuffers() couldn't accept yet was a pain, and life
became so much simpler once I deleted all that and exposed
PrepareReadBuffer() to the calling code.  Perhaps I just hadn't done a
good enough job of that.

The other disadvantage you reminded me of was the duplicate buffer
lookup in certain unlucky patterns, which I had forgotten about in my
previous email.  But I guess it's not fatal to the idea and there is a
potential partial mitigation.  (See below).

A third thing was the requirement for communication between
StartReadBuffers() and CompleteReadBuffers() which I originally had an
"opaque" object that the caller has to keep around that held private
state.  It seemed nice to go back to talking just about buffer
numbers, but that's not really an argument for anything...

OK, I'm going to try the two-step version (again) with interfaces
along the lines you sketched out...  more soon.

> I see. When you're about to zero the page, there's not much point in
> splitting the operation into Prepare/Start/Complete stages anyway.
> You're not actually doing any I/O. Perhaps it's best to have a separate
> "Buffer ZeroBuffer(Relation, ForkNumber, BlockNumber, lockmode)"
> function that does the same as
> ReadBuffer(RBM_ZERO_AND_[LOCK|CLEANUP_LOCK]) today.

That makes sense, but... hmm, sometimes just allocating a page
generates I/O if it has to evict a dirty buffer.  Nothing in this code
does anything fancy about that, but imagine some hypothetical future
thing that manages to do that asynchronously -- then we might want to
take advantage of the ability to stream even a zeroed page, ie doing
something ahead of time?  Just a thought for another day, and perhaps
that is just an argument for including it in the streaming read API,
but it doesn't mean that the bufmgr.c API can't be as you say.

> One weakness is that if StartReadBufferRange() finds that the range is
> "chopped up", it needs to return and throw away the work it had to do to
> look up the next buffer. So in the extreme case that every other block
> in the range is in the buffer cache, each call would look up two buffers
> in the buffer cache, startBlk and startBlk + 1, but only return one
> buffer to the caller.

Yeah, right.  This was one of the observations that influenced my
PrepareReadBuffer() three-step thing that I'd forgotten.  To spell
that out with an example, suppose the buffer pool contains every odd
numbered block.  Successive StartReadBuffers() calls would process
"HMHm", "MHm", "MHm"... where "m" represents a miss that we can't do
anything with for a block we'll look up in the buffer pool again in
the next call.  With the PrepareReadBuffer() design, that miss just
starts a new range and we don't have to look it up again.  Hmm, I
suppose that could be mitigated somewhat with ReadRecentBuffer() if we
can find somewhere decent to store it.

BTW it was while thinking about and testing cases like that that I
found Palak Chaturvedi's https://commitfest.postgresql.org/46/4426/
extremely useful.  It can kick out every second page or any other
range-chopping scenario you can express in a WHERE clause.  I would
quite like to get that tool into the tree...




Re: Streaming I/O, vectored I/O (WIP)

2024-01-11 Thread Heikki Linnakangas

On 11/01/2024 05:19, Thomas Munro wrote:

On Thu, Jan 11, 2024 at 8:58 AM Heikki Linnakangas  wrote:

On 10/01/2024 06:13, Thomas Munro wrote:

Bikeshedding call: I am open to better suggestions for the names
PrepareReadBuffer() and CompleteReadBuffers(), they seem a little
grammatically clumsy.


How will these functions work in the brave new async I/O world? I assume
PrepareReadBuffer() will initiate the async I/O, and
CompleteReadBuffers() will wait for it to complete. How about
StartReadBuffer() and WaitReadBuffer()? Or StartBufferRead() and
WaitBufferRead()?


What we have imagined so far is that the asynchronous version would
probably have three steps, like this:

  * PrepareReadBuffer() -> pins one buffer, reports if found or IO/zeroing 
needed
  * StartReadBuffers() -> starts the I/O for n contiguous !found buffers
  * CompleteReadBuffers() -> waits, completes as necessary


Ok. It feels surprising to have three steps. I understand that you need 
two steps, one to start the I/O and another to wait for them to finish, 
but why do you need separate Prepare and Start steps? What can you do in 
between them? (You explained that. I'm just saying that that's my 
initial reaction when seeing that API. It is surprising.)



If StartReadBuffer() starts the async I/O, the idea that you can call
ZeroBuffer() instead of WaitReadBuffer() doesn't work. I think
StartReadBuffer() needs to take ReadBufferMode, and do the zeroing for
you in RBM_ZERO_* modes.


Yeah, good thoughts, and topics that have occupied me for some time
now.  I also thought that StartReadBuffer() should take
ReadBufferMode, but I came to the idea that it probably shouldn't like
this:

I started working on all this by trying to implement the most
complicated case I could imagine, streaming recovery, and then working
back to the easy cases that just do scans with RBM_NORMAL.  In
recovery, we can predict that a block will be zeroed using WAL flags,
and pre-existing cross-checks at redo time that enforce that the flags
and redo code definitely agree on that, but we can't predict which
exact ReadBufferMode the redo code will use, RBM_ZERO_AND_LOCK or
RBM_ZERO_AND_CLEANUP_LOCK (or mode=RBM_NORMAL and
get_cleanup_lock=true, as the comment warns them not to, but I
digress).

That's OK, because we can't take locks while looking ahead in recovery
anyway (the redo routine carefully controls lock order/protocol), so
the code to actually do the locking needs to be somewhere near the
output end of the stream when the redo code calls
XLogReadBufferForRedoExtended().  But if you try to use RBM_XXX in
these interfaces, it begins to look pretty funny: the streaming
callback needs to be able to say which ReadBufferMode, but anywhere
near Prepare*(), Start*() or even Complete*() is too soon, so maybe we
need to invent a new value RBM_WILL_ZERO that doesn't yet say which of
the zero modes to use, and then the redo routine needs to pass in the
RBM_ZERO_AND_{LOCK,CLEANUP_LOCK} value to
XLogReadBufferForRedoExtended() and it does it in a separate step
anyway, so we are ignoring ReadBufferMode.  But that feels just wrong
-- we'd be using RBM but implementing them only partially.


I see. When you're about to zero the page, there's not much point in 
splitting the operation into Prepare/Start/Complete stages anyway. 
You're not actually doing any I/O. Perhaps it's best to have a separate 
"Buffer ZeroBuffer(Relation, ForkNumber, BlockNumber, lockmode)" 
function that does the same as 
ReadBuffer(RBM_ZERO_AND_[LOCK|CLEANUP_LOCK]) today.



The _ZERO_ON_ERROR aspect is a case where CompleteReadBuffers() is the
right time and makes sense to process as a batch, so it becomes a
flag.


+1


Putting all that together, I propose:

/*
 * Initiate reading a block from disk to the buffer cache.
 *
 * XXX: Until we have async I/O, this just allocates the buffer in the
 * buffer cache. The actual I/O happens in WaitReadBuffer().
 */
Buffer
StartReadBuffer(BufferManagerRelation bmr,
				ForkNumber forkNum,
				BlockNumber blockNum,
				BufferAccessStrategy strategy,
				ReadBufferMode mode,
				bool *foundPtr);

/*
 * Wait for a read that was started earlier with StartReadBuffer() to
 * finish.
 *
 * XXX: Until we have async I/O, this is the function that actually
 * performs the I/O. If multiple I/Os have been started with
 * StartReadBuffer(), this will try to perform all of them in one syscall.
 * Subsequent calls to WaitReadBuffer(), for those other buffers, will
 * finish quickly.
 */
void
WaitReadBuffer(Buffer buf);


I'm confused about where the extra state lives that would allow the
communication required to build a larger I/O.  In the AIO branch, it
does look a little more like that, but there is more magic state and
machinery hiding behind the curtain: the backend's pending I/O list
builds up a chain of I/Os, 

Re: Streaming I/O, vectored I/O (WIP)

2024-01-10 Thread Thomas Munro
On Thu, Jan 11, 2024 at 8:58 AM Heikki Linnakangas  wrote:
> On 10/01/2024 06:13, Thomas Munro wrote:
> > Bikeshedding call: I am open to better suggestions for the names
> > PrepareReadBuffer() and CompleteReadBuffers(), they seem a little
> > grammatically clumsy.
>
> How will these functions work in the brave new async I/O world? I assume
> PrepareReadBuffer() will initiate the async I/O, and
> CompleteReadBuffers() will wait for it to complete. How about
> StartReadBuffer() and WaitReadBuffer()? Or StartBufferRead() and
> WaitBufferRead()?

What we have imagined so far is that the asynchronous version would
probably have three steps, like this:

 * PrepareReadBuffer() -> pins one buffer, reports if found or IO/zeroing needed
 * StartReadBuffers() -> starts the I/O for n contiguous !found buffers
 * CompleteReadBuffers() -> waits, completes as necessary

In the proposed synchronous version, the middle step is missing, but
streaming_read.c directly calls smgrprefetch() instead.  I thought
about shoving that inside a prosthetic StartReadBuffers() function,
but I backed out of simulating asynchronous I/O too fancifully.  The
streaming read API is where we really want to stabilise a nice API, so
we can moves things around behind it if required.

A bit of analysis of the one block -> nblocks change and the
synchronous -> asynchronous change:

Two things are new in a world with nblocks > 1.  (1) We needed to be
able to set BM_IO_IN_PROGRESS on more than one block at a time, but
commit 12f3867f already provided that, and (2) someone else might come
along and read a block in the middle of our range, effectively
chopping our range into subranges.  That's true also in master, but
when nblocks == 1 that's all-or-nothing, and now we have partial
cases.  In the proposed synchronous code, CompleteReadBuffers() claims
as many contiguous BM_IO_IN_PROGRESS flags as it can in the range, and
then loops to process the rest, skipping over any blocks that are
already done.  Further down in md.c, you might also cross a segment
boundary.  So there are two different reasons why a single call to
CompleteReadBuffers() might finish up generating zero or more than one
I/O system call, though very often it's one.

Hmm, while spelling that out, I noticed an obvious problem and
improvement to make to that part of v5.  If backend #1 is trying to
read blocks 101..103 and acquires BM_IO_IN_PROGRESS for 101, but
backend #2 comes along and starts reading block 102 first, backend
#1's StartBufferIO() call would wait for 102's I/O CV while it still
holds BM_IO_IN_PROGRESS for block 101, potentially blocking a third
backend #3 that wants to read block 101 even though no I/O is in
progress for that block yet!  At least that's deadlock free (because
always in block order), but it seems like undesirable lock chaining.
Here is my proposed improvement: StartBufferIO() gains a nowait flag.
For the head block we wait, but while trying to build a larger range
we don't.  We'll try 102 again in the next loop, with a wait.  Here is
a small fixup for that.
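
(A sketch of the claiming loop with that flag -- illustrative only, and
glossing over the distinction between "someone else is doing the I/O"
and "the I/O is already done" that real code must handle:)

int			nclaimed = 0;

for (int i = 0; i < nblocks; i++)
{
	bool		nowait = (i > 0);	/* wait only for the head block */

	if (!StartBufferIO(GetBufferDescriptor(buffers[i] - 1), true, nowait))
		break;				/* busy or done; retry it in the next loop */
	nclaimed++;
}
/* Issue one combined I/O covering buffers[0 .. nclaimed - 1]. */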

In an asynchronous version, that BM_IO_IN_PROGRESS negotiation would
take place in StartReadBuffers() instead, which would be responsible
for kicking off asynchronous I/Os (= asking a background worker to
call pread(), or equivalent fancy kernel async APIs).  One question is
what it does if it finds a block in the middle that chops the read up,
or for that matter a segment boundary.  I don't think we have to
decide now, but the two options seem to be that it starts one single
I/O and reports its size, making it the client's problem to call again
with the rest, or that it starts more than one I/O and they are
somehow chained together so that the caller doesn't know about that
and can later wait for all of them to finish using just one .

(The reason why I'm not 100% certain how it will look is that the
real, working code in the aio branch right now doesn't actually expose
a vector/nblocks bufmgr interface at all, yet.  Andres's original
prototype had a single-block Start*(), Complete*() design, but a lower
level of the AIO system notices if pending read operations are
adjacent and could be merged.  While discussing all this we decided it
was a bit strange to have lower-level code deal with allocating, chaining
and processing lots of separate I/O objects in shared memory, when
higher-level code could often work in bigger ranges up front, and then
interact with the AIO subsystem with many fewer objects and steps.
Also, the present synchronous proposal is simple and lightweight, and
lacks the whole subsystem that could do that by magic.)

> About the signature of those functions: Does it make sense for
> CompleteReadBuffers() (or WaitReadBuffers()) function to take a vector
> of buffers? If StartReadBuffer() initiates the async I/O immediately, is
> there any benefit to batching the waiting?
>
> If StartReadBuffer() starts the async I/O, the idea that you can call
> ZeroBuffer() instead of WaitReadBuffer() doesn't work. I 

Re: Streaming I/O, vectored I/O (WIP)

2024-01-10 Thread Heikki Linnakangas

On 10/01/2024 06:13, Thomas Munro wrote:

Bikeshedding call: I am open to better suggestions for the names
PrepareReadBuffer() and CompleteReadBuffers(), they seem a little
grammatically clumsy.


How will these functions work in the brave new async I/O world? I assume 
PrepareReadBuffer() will initiate the async I/O, and 
CompleteReadBuffers() will wait for it to complete. How about 
StartReadBuffer() and WaitReadBuffer()? Or StartBufferRead() and 
WaitBufferRead()?


About the signature of those functions: Does it make sense for 
CompleteReadBuffers() (or WaitReadBuffers()) function to take a vector 
of buffers? If StartReadBuffer() initiates the async I/O immediately, is 
there any benefit to batching the waiting?


If StartReadBuffer() starts the async I/O, the idea that you can call 
ZeroBuffer() instead of WaitReadBuffer() doesn't work. I think 
StartReadBuffer() needs to take ReadBufferMode, and do the zeroing for 
you in RBM_ZERO_* modes.



Putting all that together, I propose:

/*
 * Initiate reading a block from disk to the buffer cache.
 *
 * XXX: Until we have async I/O, this just allocates the buffer in the
 * buffer cache. The actual I/O happens in WaitReadBuffer().
 */
Buffer
StartReadBuffer(BufferManagerRelation bmr,
				ForkNumber forkNum,
				BlockNumber blockNum,
				BufferAccessStrategy strategy,
				ReadBufferMode mode,
				bool *foundPtr);

/*
 * Wait for a read that was started earlier with StartReadBuffer() to
 * finish.
 *
 * XXX: Until we have async I/O, this is the function that actually
 * performs the I/O. If multiple I/Os have been started with
 * StartReadBuffer(), this will try to perform all of them in one syscall.
 * Subsequent calls to WaitReadBuffer(), for those other buffers, will
 * finish quickly.
 */
void
WaitReadBuffer(Buffer buf);


I'm not sure how well this fits with the streaming read API. The 
streaming read code performs grouping of adjacent blocks to one 
CompleteReadBuffers() call. If WaitReadBuffer() does the batching, 
that's not really required. But does that make sense with async I/O? 
With async I/O, will you need a vectorized version of StartReadBuffer() too?
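
(For illustration, a caller of the proposed API above might look like
this, assuming the signatures exactly as written:)

bool		found;
Buffer		buf = StartReadBuffer(bmr, forkNum, blockNum, strategy,
								  RBM_NORMAL, &found);

if (!found)
	WaitReadBuffer(buf);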


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Streaming I/O, vectored I/O (WIP)

2023-12-31 Thread Melanie Plageman
I've written a new version of the vacuum streaming read user on top of
the rebased patch set [1]. It differs substantially from Andres' and
includes several refactoring patches that can apply on top of master.
As such, I've proposed those in a separate thread [2]. I noticed macOS
and Windows fail to build on CI for my branch with the streaming read
code. I haven't had a chance to investigate -- but I must have done
something wrong on rebase.

- Melanie

[1] https://github.com/melanieplageman/postgres/tree/stepwise_vac_streaming_read
[2] 
https://www.postgresql.org/message-id/CAAKRu_Yf3gvXGcCnqqfoq0Q8LX8UM-e-qbm_B1LeZh60f8WhWA%40mail.gmail.com




Re: Streaming I/O, vectored I/O (WIP)

2023-12-30 Thread Cédric Villemain

Le 11/12/2023 à 10:12, Thomas Munro a écrit :

3.  smgrreadv/smgrwritev patch:

  * improved ENOSPC handling
  * improve description of EOF and ENOSPC handling
  * fixed the sizes reported in dtrace static probes
  * fixed some words in the docs about that
  * changed error messages to refer to "blocks %u..%u"

4.  smgrprefetch-with-nblocks patch has no change, hasn't drawn any
comments hopefully because it is uncontroversial.

I'm planning to commit these fairly soon.



Thanks, very useful additions.
Not sure what you have already lined up to come next...

I have 2 small patches here:
* one to use range prefetching in pg_prewarm (smgrprefetch only at the
moment, with smgrreadv to come next).
* one to support nblocks=0 in smgrprefetch (posix_fadvise supports a
len=0 to apply the flag from the offset to the end of the file).


Should I add them to the commitfest?
---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D




Re: Streaming I/O, vectored I/O (WIP)

2023-12-11 Thread Thomas Munro
On Mon, Dec 11, 2023 at 10:28 PM Heikki Linnakangas  wrote:
> On 11/12/2023 11:12, Thomas Munro wrote:
> > 1.  I eventually figured out how to generalise
> > compute_remaining_iovec() (as I now call it) so that the existing
> > pg_pwritev_with_retry() in file_utils.c could also use it, so that's
> > now done in a patch of its own.
>
> In compute_remaining_iovec():
> > 'source' and 'destination' may point to the same array, in which
> > case it is adjusted in-place; otherwise 'destination' must have enough
> > space for 'iovcnt' elements.
> Is there any use case for not adjusting it in place?
> pg_pwritev_with_retry() takes a const iovec array, but maybe just remove
> the 'const' and document that it scribbles on it?

I guess I just wanted to preserve pg_pwritev_with_retry()'s existing
prototype, primarily because it matches standard pwritev()/writev().
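
(For readers following along, the helper's job can be sketched like
this; the real function is in src/common/file_utils.c, and this is only
an illustration of the idea, requiring <sys/uio.h> for struct iovec:)

static int
remaining_iovec_sketch(struct iovec *destination,
					   const struct iovec *source,
					   int iovcnt, size_t transferred)
{
	int			i = 0;

	/* Skip over iovecs that were fully transferred. */
	while (i < iovcnt && transferred >= source[i].iov_len)
	{
		transferred -= source[i].iov_len;
		i++;
	}
	if (i == iovcnt)
		return 0;				/* nothing remains */

	/* The first remaining iovec may be partially transferred. */
	destination[0].iov_base = (char *) source[i].iov_base + transferred;
	destination[0].iov_len = source[i].iov_len - transferred;
	for (int j = 1; i + j < iovcnt; j++)
		destination[j] = source[i + j];
	return iovcnt - i;
}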




Re: Streaming I/O, vectored I/O (WIP)

2023-12-11 Thread Heikki Linnakangas

On 11/12/2023 11:12, Thomas Munro wrote:

1.  I eventually figured out how to generalise
compute_remaining_iovec() (as I now call it) so that the existing
pg_pwritev_with_retry() in file_utils.c could also use it, so that's
now done in a patch of its own.


In compute_remaining_iovec():

'source' and 'destination' may point to the same array, in which
case it is adjusted in-place; otherwise 'destination' must have enough
space for 'iovcnt' elements.
Is there any use case for not adjusting it in place? 
pg_pwritev_with_retry() takes a const iovec array, but maybe just remove 
the 'const' and document that it scribbles on it?



I'm planning to commit these fairly soon.


+1

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Streaming I/O, vectored I/O (WIP)

2023-12-11 Thread Thomas Munro
On Sat, Dec 9, 2023 at 10:23 PM Heikki Linnakangas  wrote:
> Ok, works for me.

I finished up making a few more improvements:

1.  I eventually figured out how to generalise
compute_remaining_iovec() (as I now call it) so that the existing
pg_pwritev_with_retry() in file_utils.c could also use it, so that's
now done in a patch of its own.

2.  FileReadV/FileWriteV patch:

 * further simplification of the traditional ENOSPC 'guess'
 * unconstify() changed to raw cast (pending [1])
 * fixed the DO_DB()-wrapped debugging code

3.  smgrreadv/smgrwritev patch:

 * improved ENOSPC handling
 * improve description of EOF and ENOSPC handling
 * fixed the sizes reported in dtrace static probes
 * fixed some words in the docs about that
 * changed error messages to refer to "blocks %u..%u"

4.  smgrprefetch-with-nblocks patch has no change, hasn't drawn any
comments hopefully because it is uncontroversial.

I'm planning to commit these fairly soon.

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGK3OXFjkOyZiw-DgL2bUqk9by1uGuCnViJX786W%2BfyDSw%40mail.gmail.com


v4-0001-Provide-helper-routine-for-partial-vector-I-O-ret.patch
Description: Binary data


v4-0002-Provide-vectored-variants-of-FileRead-and-FileWri.patch
Description: Binary data


v4-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patch
Description: Binary data


v4-0004-Provide-multi-block-smgrprefetch.patch
Description: Binary data


Re: Streaming I/O, vectored I/O (WIP)

2023-12-09 Thread Heikki Linnakangas

On 09/12/2023 02:41, Thomas Munro wrote:

On Sat, Dec 9, 2023 at 7:25 AM Andres Freund  wrote:

On 2023-11-30 13:01:46 +1300, Thomas Munro wrote:

On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas  wrote:

Maybe we should bite the bullet and always retry short writes in
FileWriteV(). Is that what you meant by "handling them"?
If the total size is expensive to calculate, how about passing it as an
extra argument? Presumably it is cheap for the callers to calculate at
the same time that they build the iovec array?



There is another problem with pushing it down to fd.c, though.
Suppose you try to write 8192 bytes, and the kernel says "you wrote
4096 bytes" so your loop goes around again with the second half the
data and now the kernel says "-1, ENOSPC".  What are you going to do?
fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
so you'd either have to return -1, ENOSPC (converting short writes
into actual errors, a lie because you did write some data), or return
4096 (and possibly also set errno = ENOSPC as we have always done).
So you can't really handle this problem at this level, can you?
Unless you decide that fd.c should get into the business of raising
errors for I/O failures, which would be a bit of a departure.

That's why I did the retry higher up in md.c.


I think that's the right call. I think for AIO we can't do retry handling
purely in fd.c, or at least it'd be quite awkward. It doesn't seem like it'd
buy us that much in md.c anyway, we still need to handle the cross segment
case and such, from what I can tell?


Heikki, what do you think about this:  we could go with the v3 fd.c
and md.c patches, but move adjust_iovec_for_partial_transfer() into
src/common/file_utils.c, so that at least that slightly annoying part
of the job is available for re-use by future code that faces the same
problem?


Ok, works for me.

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Streaming I/O, vectored I/O (WIP)

2023-12-08 Thread Thomas Munro
On Sat, Dec 9, 2023 at 7:25 AM Andres Freund  wrote:
> On 2023-11-30 13:01:46 +1300, Thomas Munro wrote:
> > On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas  wrote:
> > > Maybe we should bite the bullet and always retry short writes in
> > > FileWriteV(). Is that what you meant by "handling them"?
> > > If the total size is expensive to calculate, how about passing it as an
> > > extra argument? Presumably it is cheap for the callers to calculate at
> > > the same time that they build the iovec array?

> > There is another problem with pushing it down to fd.c, though.
> > Suppose you try to write 8192 bytes, and the kernel says "you wrote
> > 4096 bytes" so your loop goes around again with the second half the
> > data and now the kernel says "-1, ENOSPC".  What are you going to do?
> > fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
> > so you'd either have to return -1, ENOSPC (converting short writes
> > into actual errors, a lie because you did write some data), or return
> > 4096 (and possibly also set errno = ENOSPC as we have always done).
> > So you can't really handle this problem at this level, can you?
> > Unless you decide that fd.c should get into the business of raising
> > errors for I/O failures, which would be a bit of a departure.
> >
> > That's why I did the retry higher up in md.c.
>
> I think that's the right call. I think for AIO we can't do retry handling
> purely in fd.c, or at least it'd be quite awkward. It doesn't seem like it'd
> buy us that much in md.c anyway, we still need to handle the cross segment
> case and such, from what I can tell?

Heikki, what do you think about this:  we could go with the v3 fd.c
and md.c patches, but move adjust_iovec_for_partial_transfer() into
src/common/file_utils.c, so that at least that slightly annoying part
of the job is available for re-use by future code that faces the same
problem?

Note that in file_utils.c we already have pg_pwritev_with_retry(),
which is clearly related to all this: that is a function that
guarantees to either complete the full pwritev() or throw an ERROR,
but leaves it undefined whether any data has been written on ERROR.
It has to add up the size too, and it adjusts the iovec array at the
same time, so it wouldn't use adjust_iovec_for_partial_transfer().
This is essentially the type of interface that I declined to put into
fd.c's FileWrite() and FileRead() because I feel like it doesn't fit
with the existing functions' primary business of adding vfd support to
well known basic I/O functions that return bytes transferred and set
errno.  Perhaps someone might later want to introduce File*WithRetry()
wrappers or something if that proves useful?  I wouldn't want them for
md.c though because I already know the size.
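
For concreteness, a sketch of that interface shape (my names and simplified
error handling, not the file_utils.c code, assuming the usual pg_iovec.h
declarations): complete the whole transfer or fail, summing the size once and
adjusting a private copy of the iovec array between calls:

/*
 * Sketch: write everything described by iov[0..iovcnt-1] at 'offset',
 * retrying after short writes.  Returns the total size on success, or -1
 * with errno set; some data may already be on disk in the -1 case.
 */
static ssize_t
pwritev_retry_sketch(int fd, const struct iovec *iov, int iovcnt, off_t offset)
{
    struct iovec iov_copy[PG_IOV_MAX];
    struct iovec *cur = iov_copy;
    size_t      remaining = 0;
    size_t      total;

    if (iovcnt <= 0 || iovcnt > PG_IOV_MAX)
    {
        errno = EINVAL;
        return -1;
    }

    /* Work on a copy so the caller's const array is left untouched. */
    memcpy(iov_copy, iov, sizeof(*iov) * iovcnt);
    for (int i = 0; i < iovcnt; i++)
        remaining += iov[i].iov_len;
    total = remaining;

    while (remaining > 0)
    {
        ssize_t     written = pg_pwritev(fd, cur, iovcnt, offset);

        if (written <= 0)
            return -1;          /* treat "no progress" as a failure too */
        offset += written;
        remaining -= written;

        /* Drop wholly written iovecs and trim a partially written one. */
        while (written > 0 && (size_t) written >= cur->iov_len)
        {
            written -= cur->iov_len;
            cur++;
            iovcnt--;
        }
        if (written > 0)
        {
            cur->iov_base = (char *) cur->iov_base + written;
            cur->iov_len -= written;
        }
    }
    return total;
}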




Re: Streaming I/O, vectored I/O (WIP)

2023-12-08 Thread Andres Freund
Hi,

On 2023-11-30 13:01:46 +1300, Thomas Munro wrote:
> On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas  wrote:
> > On 29/11/2023 21:39, Thomas Munro wrote:
> > > One thing I wasn't 100% happy with was the treatment of ENOSPC.  A few
> > > callers believe that short writes set errno: they error out with a
> > > message including %m.  We have historically set errno = ENOSPC inside
> > > FileWrite() if the write size was unexpectedly small AND the kernel
> > > didn't set errno to a non-zero value (having set it to zero ourselves
> > > earlier).  In FileWriteV(), I didn't want to do that because it is
> > > expensive to compute the total write size from the vector array and we
> > > managed to measure an effect due to that in some workloads.
> > >
> > > Note that the smgr patch actually handles short writes by continuing,
> > > instead of raising an error.  Short writes do already occur in the
> > > wild on various systems for various rare technical reasons other than
> > > ENOSPC I have heard (imagine transient failure to acquire some
> > > temporary memory that the kernel chooses not to wait for, stuff like
> > > that, though certainly many people and programs believe they should
> > > not happen[1]), and it seems like a good idea to actually handle them
> > > as our write sizes increase and the probability of short writes might
> > > presumably increase.
> >
> > Maybe we should bite the bullet and always retry short writes in
> > FileWriteV(). Is that what you meant by "handling them"?
> > If the total size is expensive to calculate, how about passing it as an
> > extra argument? Presumably it is cheap for the callers to calculate at
> > the same time that they build the iovec array?
> 
> It's cheap for md.c, because it already has nblocks_this_segment.
> That's one reason I chose to put the retry there.  If we push it down
> to fd.c in order to be able to help other callers, you're right that
> we could pass in the total size (and I guess assert that it's
> correct), but that is sort of annoyingly redundant and further from
> the interface we're wrapping.

> There is another problem with pushing it down to fd.c, though.
> Suppose you try to write 8192 bytes, and the kernel says "you wrote
> 4096 bytes" so your loop goes around again with the second half the
> data and now the kernel says "-1, ENOSPC".  What are you going to do?
> fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
> so you'd either have to return -1, ENOSPC (converting short writes
> into actual errors, a lie because you did write some data), or return
> 4096 (and possibly also set errno = ENOSPC as we have always done).
> So you can't really handle this problem at this level, can you?
> Unless you decide that fd.c should get into the business of raising
> errors for I/O failures, which would be a bit of a departure.
> 
> That's why I did the retry higher up in md.c.

I think that's the right call. I think for AIO we can't do retry handling
purely in fd.c, or at least it'd be quite awkward. It doesn't seem like it'd
buy us that much in md.c anyway, we still need to handle the cross segment
case and such, from what I can tell?

Greetings,

Andres Freund




Re: Streaming I/O, vectored I/O (WIP)

2023-11-29 Thread Thomas Munro
On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas  wrote:
> On 29/11/2023 21:39, Thomas Munro wrote:
> > One thing I wasn't 100% happy with was the treatment of ENOSPC.  A few
> > callers believe that short writes set errno: they error out with a
> > message including %m.  We have historically set errno = ENOSPC inside
> > FileWrite() if the write size was unexpectedly small AND the kernel
> > didn't set errno to a non-zero value (having set it to zero ourselves
> > earlier).  In FileWriteV(), I didn't want to do that because it is
> > expensive to compute the total write size from the vector array and we
> > managed to measure an effect due to that in some workloads.
> >
> > Note that the smgr patch actually handles short writes by continuing,
> > instead of raising an error.  Short writes do already occur in the
> > wild on various systems for various rare technical reasons other than
> > ENOSPC I have heard (imagine transient failure to acquire some
> > temporary memory that the kernel chooses not to wait for, stuff like
> > that, though certainly many people and programs believe they should
> > not happen[1]), and it seems like a good idea to actually handle them
> > as our write sizes increase and the probability of short writes might
> > presumably increase.
>
> Maybe we should bite the bullet and always retry short writes in
> FileWriteV(). Is that what you meant by "handling them"?
> If the total size is expensive to calculate, how about passing it as an
> extra argument? Presumably it is cheap for the callers to calculate at
> the same time that they build the iovec array?

It's cheap for md.c, because it already has nblocks_this_segment.
That's one reason I chose to put the retry there.  If we push it down
to fd.c in order to be able to help other callers, you're right that
we could pass in the total size (and I guess assert that it's
correct), but that is sort of annoyingly redundant and further from
the interface we're wrapping.

There is another problem with pushing it down to fd.c, though.
Suppose you try to write 8192 bytes, and the kernel says "you wrote
4096 bytes" so your loop goes around again with the second half the
data and now the kernel says "-1, ENOSPC".  What are you going to do?
fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
so you'd either have to return -1, ENOSPC (converting short writes
into actual errors, a lie because you did write some data), or return
4096 (and possibly also set errno = ENOSPC as we have always done).
So you can't really handle this problem at this level, can you?
Unless you decide that fd.c should get into the business of raising
errors for I/O failures, which would be a bit of a departure.

That's why I did the retry higher up in md.c.
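
Sketched out, the md.c-level retry is roughly this (illustrative only, using
the helper later named compute_remaining_iovec() and the FileWriteV()
interface from storage/fd.h; the real loop also has to deal with segment
boundaries):

/*
 * Sketch: write 'size' bytes described by iov[0..iovcnt-1] at 'seekpos',
 * looping on short writes because md.c already knows the intended size.
 * Returns 0 on success, -1 with errno set on a real failure.
 */
static int
mdwrite_all_sketch(File file, struct iovec *iov, int iovcnt, off_t seekpos,
                   size_t size, uint32 wait_event_info)
{
    size_t      transferred = 0;

    while (transferred < size)
    {
        int         nbytes = FileWriteV(file, iov, iovcnt, seekpos,
                                        wait_event_info);

        if (nbytes < 0)
            return -1;          /* a real error; errno is meaningful */

        transferred += nbytes;
        seekpos += nbytes;

        /* Short write: trim the iovec array and go around again. */
        iovcnt = compute_remaining_iovec(iov, iov, iovcnt, nbytes);
    }
    return 0;
}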

> > With the previous version of the patch, we'd have to change a couple
> > of other callers not to believe that short writes are errors and set
> > errno (callers are inconsistent on this point).  I don't really love
> > that we have "fake" system errors but I also want to stay focused
> > here, so in this new version V3 I tried a new approach: I realised I
> > can just always set errno without needing the total size, so that
> > (undocumented) aspect of the interface doesn't change.  The point
> > being that it doesn't matter if you clobber errno with a bogus value
> > when the write was non-short.  Thoughts?
>
> Feels pretty ugly, but I don't see anything outright wrong with that.

Cool.  I would consider cleaning up all the callers and getting rid of
this ENOSPC stuff in independent work, but I didn't want discussion of
that (eg what external/extension code knows about this API?) to derail
THIS project, hence desire to preserve existing behaviour.




Re: Streaming I/O, vectored I/O (WIP)

2023-11-29 Thread Heikki Linnakangas

On 29/11/2023 21:39, Thomas Munro wrote:

One thing I wasn't 100% happy with was the treatment of ENOSPC.  A few
callers believe that short writes set errno: they error out with a
message including %m.  We have historically set errno = ENOSPC inside
FileWrite() if the write size was unexpectedly small AND the kernel
didn't set errno to a non-zero value (having set it to zero ourselves
earlier).  In FileWriteV(), I didn't want to do that because it is
expensive to compute the total write size from the vector array and we
managed to measure an effect due to that in some workloads.

Note that the smgr patch actually handles short writes by continuing,
instead of raising an error.  Short writes do already occur in the
wild on various systems for various rare technical reasons other than
ENOSPC I have heard (imagine transient failure to acquire some
temporary memory that the kernel chooses not to wait for, stuff like
that, though certainly many people and programs believe they should
not happen[1]), and it seems like a good idea to actually handle them
as our write sizes increase and the probability of short writes might
presumably increase.


Maybe we should bite the bullet and always retry short writes in 
FileWriteV(). Is that what you meant by "handling them"?


If the total size is expensive to calculate, how about passing it as an 
extra argument? Presumably it is cheap for the callers to calculate at 
the same time that they build the iovec array?



With the previous version of the patch, we'd have to change a couple
of other callers not to believe that short writes are errors and set
errno (callers are inconsistent on this point).  I don't really love
that we have "fake" system errors but I also want to stay focused
here, so in this new version V3 I tried a new approach: I realised I
can just always set errno without needing the total size, so that
(undocumented) aspect of the interface doesn't change.  The point
being that it doesn't matter if you clobber errno with a bogus value
when the write was non-short.  Thoughts?


Feels pretty ugly, but I don't see anything outright wrong with that.

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Streaming I/O, vectored I/O (WIP)

2023-11-29 Thread Thomas Munro
On Wed, Nov 29, 2023 at 1:44 AM Heikki Linnakangas  wrote:
> LGTM. I think this 0001 patch is ready for commit, independently of the
> rest of the patches.

Done.

> In v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch, fd.h:
>
> > +/* Filename components */
> > +#define PG_TEMP_FILES_DIR "pgsql_tmp"
> > +#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
> > +
>
> These seem out of place, we already have them in common/file_utils.h.

Yeah, they moved from there in f39b2658 and I messed up the rebase.  Fixed.

> Other than that,
> v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch and
> v2-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patch look
> good to me.

One thing I wasn't 100% happy with was the treatment of ENOSPC.  A few
callers believe that short writes set errno: they error out with a
message including %m.  We have historically set errno = ENOSPC inside
FileWrite() if the write size was unexpectedly small AND the kernel
didn't set errno to a non-zero value (having set it to zero ourselves
earlier).  In FileWriteV(), I didn't want to do that because it is
expensive to compute the total write size from the vector array and we
managed to measure an effect due to that in some workloads.

Note that the smgr patch actually handles short writes by continuing,
instead of raising an error.  Short writes do already occur in the
wild on various systems for various rare technical reasons other than
ENOSPC I have heard (imagine transient failure to acquire some
temporary memory that the kernel chooses not to wait for, stuff like
that, though certainly many people and programs believe they should
not happen[1]), and it seems like a good idea to actually handle them
as our write sizes increase and the probability of short writes might
presumably increase.

With the previous version of the patch, we'd have to change a couple
of other callers not to believe that short writes are errors and set
errno (callers are inconsistent on this point).  I don't really love
that we have "fake" system errors but I also want to stay focused
here, so in this new version V3 I tried a new approach: I realised I
can just always set errno without needing the total size, so that
(undocumented) aspect of the interface doesn't change.  The point
being that it doesn't matter if you clobber errno with a bogus value
when the write was non-short.  Thoughts?

[1] https://utcc.utoronto.ca/~cks/space/blog/unix/WritesNotShortOften
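
In code, the V3 trick amounts to something like this sketch (the idea, not
the exact committed lines, assuming the usual pg_iovec.h declarations):

/* Sketch of the "always set errno" approach described above. */
static ssize_t
pwritev_with_enospc_errno(int fd, const struct iovec *iov, int iovcnt,
                          off_t offset)
{
    ssize_t     rc;

    errno = 0;
    rc = pg_pwritev(fd, iov, iovcnt, offset);

    /*
     * Promote a still-zero errno to ENOSPC for the benefit of callers that
     * report short writes with %m.  If the write was complete, the clobbered
     * value is never consulted, so the total requested size isn't needed.
     */
    if (rc >= 0 && errno == 0)
        errno = ENOSPC;

    return rc;
}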
From 7808d7241088f8dc02d7d828dc4ee0227e8e2a01 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 19 Jul 2023 12:32:51 +1200
Subject: [PATCH v3 1/7] Provide vectored variants of FileRead() and
 FileWrite().

FileReadV() and FileWriteV() adapt preadv() and pwritev() for
PostgreSQL's virtual file descriptors.  The traditional FileRead() and
FileWrite() functions are implemented in terms of the new functions.

Reviewed-by: Heikki Linnakangas 
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6ut5tum2g...@mail.gmail.com
---
 src/backend/storage/file/fd.c | 33 +++--
 src/include/storage/fd.h  | 30 --
 2 files changed, 51 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f691ba0932..aa89c755b3 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2110,8 +2110,8 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
 }
 
 int
-FileRead(File file, void *buffer, size_t amount, off_t offset,
-		 uint32 wait_event_info)
+FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+		  uint32 wait_event_info)
 {
 	int			returnCode;
 	Vfd		   *vfdP;
@@ -2131,7 +2131,7 @@ FileRead(File file, void *buffer, size_t amount, off_t offset,
 
 retry:
 	pgstat_report_wait_start(wait_event_info);
-	returnCode = pg_pread(vfdP->fd, buffer, amount, offset);
+	returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
 	pgstat_report_wait_end();
 
 	if (returnCode < 0)
@@ -2166,8 +2166,8 @@ retry:
 }
 
 int
-FileWrite(File file, const void *buffer, size_t amount, off_t offset,
-		  uint32 wait_event_info)
+FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+		   uint32 wait_event_info)
 {
 	int			returnCode;
 	Vfd		   *vfdP;
@@ -2195,7 +2195,14 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
 	 */
 	if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
 	{
-		off_t		past_write = offset + amount;
+		size_t		size = 0;
+		off_t		past_write;
+
+		/* Compute the total transfer size. */
+		for (int i = 0; i < iovcnt; ++i)
+			size += iov[i].iov_len;
+
+		past_write = offset + size;
 
 		if (past_write > vfdP->fileSize)
 		{
@@ -2213,11 +2220,17 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
 retry:
 	errno = 0;
 	pgstat_report_wait_start(wait_event_info);
-	returnCode = pg_pwrite(VfdCache[file].fd, buffer, amount, offset);
+	

Re: Streaming I/O, vectored I/O (WIP)

2023-11-28 Thread Melanie Plageman
On Wed, Nov 29, 2023 at 01:17:19AM +1300, Thomas Munro wrote:

Thanks for posting a new version. I've included a review of 0004.

> I've included just the pg_prewarm example user for now while we
> discuss the basic infrastructure.  The rest are rebased and in my
> public Github branch streaming-read (repo macdice/postgres) if anyone
> is interested (don't mind the red CI failures, they're just saying I
> ran out of monthly CI credits on the 29th, so close...)

I agree it makes sense to commit the interface with just prewarm as a
user. Then we can start new threads for the various streaming read users
(e.g. vacuum, sequential scan, bitmapheapscan).

> From db5de8ab5a1a804f41006239302fdce954cab331 Mon Sep 17 00:00:00 2001
> From: Thomas Munro 
> Date: Sat, 22 Jul 2023 17:31:54 +1200
> Subject: [PATCH v2 4/8] Provide vectored variant of ReadBuffer().
> 
> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index f7c67d504c..8ae3a72053 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -1046,175 +1048,326 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>          if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
>              flags |= EB_LOCK_FIRST;
>  
> -        return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
> -                                 forkNum, strategy, flags);
> +        *hit = false;
> +
> +        return ExtendBufferedRel(bmr, forkNum, strategy, flags);
>      }
>  
> -    TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
> -                                       smgr->smgr_rlocator.locator.spcOid,
> -                                       smgr->smgr_rlocator.locator.dbOid,
> -                                       smgr->smgr_rlocator.locator.relNumber,
> -                                       smgr->smgr_rlocator.backend);
> +    buffer = PrepareReadBuffer(bmr,
> +                               forkNum,
> +                               blockNum,
> +                               strategy,
> +                               hit,
> +                               &allocated);
> +
> +    /* At this point we do NOT hold any locks. */
> +
> +    if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
> +    {
> +        /* if we just want zeroes and a lock, we're done */
> +        ZeroBuffer(buffer, mode);
> +    }
> +    else if (!*hit)
> +    {
> +        /* we might need to perform I/O */
> +        CompleteReadBuffers(bmr,
> +                            &buffer,
> +                            forkNum,
> +                            blockNum,
> +                            1,
> +                            mode == RBM_ZERO_ON_ERROR,
> +                            strategy);
> +    }
> +
> +    return buffer;
> +}
> +
> +/*
> + * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
> + * the returned buffer can be used immediately.  Otherwise, a physical read
> + * should be completed with CompleteReadBuffers().  PrepareReadBuffer()
> + * followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the

ot -> to

> + * caller has the opportunity to coalesce reads of neighboring blocks into one
> + * CompleteReadBuffers() call.
> + *
> + * *found is set to true for a hit, and false for a miss.
> + *
> + * *allocated is set to true for a miss that allocates a buffer for the first
> + * time.  If there are multiple calls to PrepareReadBuffer() for the same
> + * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
> + * read, then only the first such call will receive *allocated == true, which
> + * the caller might use to issue just one prefetch hint.
> + */
> +Buffer
> +PrepareReadBuffer(BufferManagerRelation bmr,
> +                  ForkNumber forkNum,
> +                  BlockNumber blockNum,
> +                  BufferAccessStrategy strategy,
> +                  bool *found,
> +                  bool *allocated)
> +{
> +    BufferDesc *bufHdr;
> +    bool        isLocalBuf;
> +    IOContext   io_context;
> +    IOObject    io_object;
>  
> +    Assert(blockNum != P_NEW);
> +
> +    if (bmr.rel)
> +    {
> +        bmr.smgr = RelationGetSmgr(bmr.rel);
> +        bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
> +    }
> +
> +    isLocalBuf = SmgrIsTemp(bmr.smgr);
>   if 

Re: Streaming I/O, vectored I/O (WIP)

2023-11-28 Thread Heikki Linnakangas

On 28/11/2023 14:17, Thomas Munro wrote:

On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas  wrote:

in streaming_read.h:


typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
 uintptr_t pgsr_private,
 void *per_io_private,
 BufferManagerRelation *bmr,
 ForkNumber *forkNum,
 BlockNumber *blockNum,
 ReadBufferMode *mode);


I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
each read. I see that you used that in the WAL replay prefetching, so I
guess that makes sense.


In this version I have introduced an alternative simple callback.
It's approximately what we had already tried out in an earlier version
before I started streamifying recovery, but in this version you can
choose, so recovery can opt for the wider callback.


Ok. Two APIs is a bit redundant, but because most callers would prefer 
the simpler API, that's probably a good tradeoff.



I've added some ramp-up logic.  The idea is that after we streamify
everything in sight, we don't want to penalise users that don't really
need more than one or two blocks, but don't know that yet.  Here is
how the system calls look when you do pg_prewarm():

pread64(32, ..., 8192, 0) = 8192 <--- start with just one block
pread64(32, ..., 16384, 8192) = 16384
pread64(32, ..., 32768, 24576) = 32768
pread64(32, ..., 65536, 57344) = 65536
pread64(32, ..., 131072, 122880) = 131072   <--- soon reading 16 blocks at a time
pread64(32, ..., 131072, 253952) = 131072
pread64(32, ..., 131072, 385024) = 131072



I guess it could be done in quite a few different ways and I'm open to
better ideas.  This way inserts prefetching stalls but ramps up
quickly and is soon out of the way.  I wonder if we would want to make
that a policy that a caller can disable, if you want to skip the
ramp-up and go straight for the largest possible I/O size?  Then I
think we'd need a 'flags' argument to the streaming read constructor
functions.


I think a 'flags' argument and a way to opt-out of the slow start would 
make sense. pg_prewarm in particular knows that it will read the whole 
relation.
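
As a toy illustration of the doubling visible in that trace (hypothetical
names, not the patch's internals):

/*
 * Hypothetical sketch of the ramp-up: start with a one-block look-ahead and
 * double it after each read until the configured maximum, so callers that
 * only ever need one or two blocks never pay for a large window.
 */
static void
ramp_up(int *lookahead_blocks, int max_lookahead_blocks)
{
    if (*lookahead_blocks < max_lookahead_blocks)
        *lookahead_blocks = Min(*lookahead_blocks * 2, max_lookahead_blocks);
}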



A small detour:  While contemplating how this interacts with parallel
sequential scan, which also has a notion of ramping up, I noticed
another problem.  One parallel seq scan process does this:

fadvise64(32, 35127296, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 35127296) = 131072
preadv(32, [...], 2, 35258368) = 131072
fadvise64(32, 36175872, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 36175872) = 131072
preadv(32, [...], 2, 36306944) = 131072
...

We don't really want those fadvise() calls.  We don't get them with
parallelism disabled, because streaming_read.c is careful not to
generate advice for sequential workloads based on ancient wisdom from
this mailing list, re-confirmed on recent Linux: WILLNEED hints
actually get in the way of Linux's own prefetching and slow you down,
so we only want them for truly random access.  But the logic can't see
that another process is making holes in this process's sequence.


Hmm, aside from making the sequential pattern invisible to this process, 
are we defeating Linux's logic too, just by performing the reads from 
multiple processes? The processes might issue the reads to the kernel 
out-of-order.


How bad is the slowdown when you issue WILLNEED hints on sequential access?


The two obvious solutions are (1) pass in a flag at the start saying
"I promise this is sequential even if it doesn't look like it, no
hints please" and (2) invent "shared" (cross-process) streaming
reads, and teach all the parallel seq scan processes to get their
buffers from there.

Idea (2) is interesting to think about but even if it is a useful idea
(not sure) it is certainly overkill just to solve this little problem
for now.  So perhaps I should implement (1), which would be another
reason to add a flags argument.  It's not a perfect solution though
because some more 'data driven' parallel scans (indexes, bitmaps, ...)
have a similar problem that is less amenable to top-down kludgery.


(1) seems fine to me.

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Streaming I/O, vectored I/O (WIP)

2023-11-28 Thread Heikki Linnakangas

On 28/11/2023 14:17, Thomas Munro wrote:

On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas  wrote:

+ /* Avoid a slightly more expensive kernel call if there is no benefit. */
+ if (iovcnt == 1)
+ returnCode = pg_pread(vfdP->fd,
+   iov[0].iov_base,
+   iov[0].iov_len,
+   offset);
+ else
+ returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);


How about pushing down this optimization to pg_preadv() itself?
pg_preadv() is currently just a macro if the system provides preadv(),
but it could be a "static inline" that does the above dance. I think
that optimization is platform-dependent anyway, pread() might not be any
faster on some OSs. In particular, if the system doesn't provide
preadv() and we use the implementation in src/port/preadv.c, it's the
same kernel call anyway.


Done.  I like it, I just feel a bit bad about moving the p*v()
replacement functions around a couple of times already!  I figured it
might as well be static inline even if we use the fallback (= Solaris
and Windows).


LGTM. I think this 0001 patch is ready for commit, independently of the 
rest of the patches.


In v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch, fd.h:


+/* Filename components */
+#define PG_TEMP_FILES_DIR "pgsql_tmp"
+#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
+


These seem out of place, we already have them in common/file_utils.h. 
Other than that, 
v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch and 
v2-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patch look 
good to me.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Streaming I/O, vectored I/O (WIP)

2023-09-27 Thread Heikki Linnakangas

On 31/08/2023 07:00, Thomas Munro wrote:

Currently PostgreSQL reads (and writes) data files 8KB at a time.
That's because we call ReadBuffer() one block at a time, with no
opportunity for lower layers to do better than that.  This thread is
about a model where you say which block you'll want next with a
callback, and then you pull the buffers out of a "stream".


I love this idea! Makes it a lot easier to perform prefetch, as 
evidenced by the 011-WIP-Use-streaming-reads-in-bitmap-heapscan.patch:


 13 files changed, 289 insertions(+), 637 deletions(-)

I'm a bit disappointed and surprised by 
v1-0009-WIP-Use-streaming-reads-in-vacuum.patch though:


 4 files changed, 244 insertions(+), 78 deletions(-)

The current prefetching logic in vacuumlazy.c is pretty hairy, so I 
hoped that this would simplify it. I didn't look closely at that patch, 
so maybe it's simpler even though it's more code.



There are more kinds of streaming I/O that would be useful, such as
raw unbuffered files, and of course writes, and I've attached some
early incomplete demo code for writes (just for fun), but the main
idea I want to share in this thread is the idea of replacing lots of
ReadBuffer() calls with the streaming model.


All this makes sense. Some random comments on the patches:


+   /* Avoid a slightly more expensive kernel call if there is no benefit. */
+   if (iovcnt == 1)
+   returnCode = pg_pread(vfdP->fd,
+ iov[0].iov_base,
+ iov[0].iov_len,
+ offset);
+   else
+   returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);


How about pushing down this optimization to pg_preadv() itself? 
pg_preadv() is currently just a macro if the system provides preadv(), 
but it could be a "static inline" that does the above dance. I think 
that optimization is platform-dependent anyway, pread() might not be any 
faster on some OSs. In particular, if the system doesn't provide 
preadv() and we use the implementation in src/port/preadv.c, it's the 
same kernel call anyway.
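
Concretely, the suggestion amounts to something like this sketch (simplified;
HAVE_DECL_PREADV stands in for whatever configure test applies, and the
fallback mirrors src/port/preadv.c):

/* Sketch: push the iovcnt == 1 fast path down into pg_preadv() itself. */
static inline ssize_t
pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
{
    /* One vector: plain pread() may be marginally cheaper, never worse. */
    if (iovcnt == 1)
        return pread(fd, iov[0].iov_base, iov[0].iov_len, offset);

#if HAVE_DECL_PREADV
    return preadv(fd, iov, iovcnt, offset);
#else
    /* Fallback: loop over pread(), one call per iovec. */
    {
        ssize_t     sum = 0;

        for (int i = 0; i < iovcnt; ++i)
        {
            ssize_t     part = pread(fd, iov[i].iov_base, iov[i].iov_len,
                                     offset);

            if (part < 0)
                return i == 0 ? -1 : sum;
            sum += part;
            offset += part;
            if ((size_t) part < iov[i].iov_len)
                break;          /* short read: stop here */
        }
        return sum;
    }
#endif
}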



v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch


No smgrextendv()? I guess none of the patches here needed it.


/*
 * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
 * the returned buffer can be used immediately.  Otherwise, a physical read
 * should be completed with CompleteReadBuffers().  PrepareReadBuffer()
 * followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the
 * caller has the opportunity to coalesce reads of neighboring blocks into one
 * CompleteReadBuffers() call.
 *
 * *found is set to true for a hit, and false for a miss.
 *
 * *allocated is set to true for a miss that allocates a buffer for the first
 * time.  If there are multiple calls to PrepareReadBuffer() for the same
 * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
 * read, then only the first such call will receive *allocated == true, which
 * the caller might use to issue just one prefetch hint.
 */
Buffer
PrepareReadBuffer(BufferManagerRelation bmr,
  ForkNumber forkNum,
  BlockNumber blockNum,
  BufferAccessStrategy strategy,
  bool *found,
  bool *allocated)



If you decide you don't want to perform the read, after all, is there a 
way to abort it without calling CompleteReadBuffers()? Looking at the 
later patch that introduces the streaming read API, seems that it 
finishes all the reads, so I suppose we don't need an abort function. 
Does it all get cleaned up correctly on error?



/*
 * Convert an array of buffer address into an array of iovec objects, and
 * return the number that were required.  'iov' must have enough space for up
 * to PG_IOV_MAX elements.
 */
static int
buffers_to_iov(struct iovec *iov, void **buffers, int nblocks)
The comment is a bit inaccurate. There's an assertion that nblocks
<= PG_IOV_MAX, so while it's true that 'iov' must have enough space for
up to PG_IOV_MAX elements, that's only because we also assume that
nblocks <= PG_IOV_MAX.


I don't see anything in the callers (mdreadv() and mdwritev()) to 
prevent them from passing nblocks > PG_IOV_MAX.
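
For illustration, the conversion under discussion is essentially this (a
sketch assuming callers clamp nblocks to PG_IOV_MAX first, e.g. with Min()):

/*
 * Sketch: convert an array of buffer addresses into iovecs, merging buffers
 * that happen to be adjacent in memory into a single entry.
 */
static int
buffers_to_iov_sketch(struct iovec *iov, void **buffers, int nblocks)
{
    int         iovcnt = 0;

    Assert(nblocks >= 1 && nblocks <= PG_IOV_MAX);

    for (int i = 0; i < nblocks; ++i)
    {
        if (iovcnt > 0 &&
            (char *) iov[iovcnt - 1].iov_base + iov[iovcnt - 1].iov_len ==
            (char *) buffers[i])
            iov[iovcnt - 1].iov_len += BLCKSZ;  /* extend previous entry */
        else
        {
            iov[iovcnt].iov_base = buffers[i];
            iov[iovcnt].iov_len = BLCKSZ;
            iovcnt++;
        }
    }
    return iovcnt;
}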


in streaming_read.h:


typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
uintptr_t pgsr_private,
void *per_io_private,
BufferManagerRelation *bmr,
ForkNumber *forkNum,
BlockNumber *blockNum,
ReadBufferMode *mode);


I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on 
each read. I see that you used that in the WAL replay prefetching, so I 
guess that makes sense.



extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
extern Buffer 

Re: Streaming I/O, vectored I/O (WIP)

2023-09-27 Thread Thomas Munro
On Thu, Sep 28, 2023 at 9:13 AM Andres Freund  wrote:
> On 2023-09-27 21:33:15 +0300, Heikki Linnakangas wrote:
> > Looking at the later patch that introduces the streaming read API, seems
> > that it finishes all the reads, so I suppose we don't need an abort
> > function. Does it all get cleaned up correctly on error?
>
> I think it should.  The buffer error handling is one of the areas where I
> really would like to have some way of testing the various cases, it's easy to
> get things wrong, and basically impossible to write reliable tests for with
> our current infrastructure.

One thing to highlight is that this patch doesn't create a new state
in that case.  In master, we already have the concept of a buffer with
BM_TAG_VALID but not BM_VALID and not BM_IO_IN_PROGRESS, reachable if
there is an I/O error.  Eventually another reader will try the I/O
again, or the buffer will fall out of the pool.  With this patch it's
the same, it's just a wider window: more kinds of errors might be
thrown in code between Prepare() and Complete() before we even have
BM_IO_IN_PROGRESS.  So there is nothing extra to clean up.  Right?

Yeah, it would be nice to test buffer pool logic directly.  Perhaps
with a C unit test framework[1] and pluggable smgr[2] we could mock up
cases like I/O errors...

> > > typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead 
> > > *pgsr,
> > > uintptr_t pgsr_private,
> > > void *per_io_private,
> > > BufferManagerRelation *bmr,
> > > ForkNumber *forkNum,
> > > BlockNumber *blockNum,
> > > ReadBufferMode *mode);
> >
> > I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
> > each read. I see that you used that in the WAL replay prefetching, so I
> > guess that makes sense.
>
> Yea, that's the origin - I don't like it, but I don't really have a better
> idea.

Another idea I considered was that streams could be associated with a
single relation, but recovery could somehow manage a set of them.
From a certain point of view, that makes sense (we could be redoing
work that was created by multiple concurrent streams at 'do' time, and
with the approach shown here some clustering opportunities available
at do time are lost at redo time), but it's not at all clear that it's
worth the overheads or complexity, and I couldn't immediately figure
out how to do it.  But I doubt there would ever be any other users of
a single stream with multiple relations, and I agree that this is
somehow not quite satisfying...  Perhaps we should think about that
some more...

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN%2Bicg%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/flat/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK=pcnwe...@mail.gmail.com




Re: Streaming I/O, vectored I/O (WIP)

2023-09-27 Thread Andres Freund
Hi,

On 2023-09-27 21:33:15 +0300, Heikki Linnakangas wrote:
> I'm a bit disappointed and surprised by
> v1-0009-WIP-Use-streaming-reads-in-vacuum.patch though:
> 
>  4 files changed, 244 insertions(+), 78 deletions(-)
> 
> The current prefetching logic in vacuumlazy.c is pretty hairy, so I hoped
> that this would simplify it. I didn't look closely at that patch, so maybe
> it's simpler even though it's more code.

A good chunk of the changes is pretty boring stuff. A good chunk of the
remainder could be simplified a lot - it's partially there because vacuumlazy
changed a lot over the last couple years and because a bit more refactoring is
needed.  I do think it's actually simpler in some ways - besides being more
efficient...


> > v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch
> 
> No smgrextendv()? I guess none of the patches here needed it.

I can't really imagine needing it anytime soon - due to the desire to avoid
ENOSPC for pages in the buffer pool, the common pattern is to extend relations
with zeroes on disk, then populate those buffers in memory. It's possible that
you could use something like smgrextendv() when operating directly on the smgr
level - but then I suspect you're going to be better off to extend the
relation to the right size in one operation and then just use smgrwritev() to
write out the contents.


> > /*
> >  * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
> >  * the returned buffer can be used immediately.  Otherwise, a physical read
> >  * should be completed with CompleteReadBuffers().  PrepareReadBuffer()
> >  * followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the
> >  * caller has the opportunity to coalesce reads of neighboring blocks into one
> >  * CompleteReadBuffers() call.
> >  *
> >  * *found is set to true for a hit, and false for a miss.
> >  *
> >  * *allocated is set to true for a miss that allocates a buffer for the first
> >  * time.  If there are multiple calls to PrepareReadBuffer() for the same
> >  * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
> >  * read, then only the first such call will receive *allocated == true, which
> >  * the caller might use to issue just one prefetch hint.
> >  */
> > Buffer
> > PrepareReadBuffer(BufferManagerRelation bmr,
> >   ForkNumber forkNum,
> >   BlockNumber blockNum,
> >   BufferAccessStrategy strategy,
> >   bool *found,
> >   bool *allocated)
> > 
> 
> If you decide you don't want to perform the read, after all, is there a way
> to abort it without calling CompleteReadBuffers()?

When would that be needed?


> Looking at the later patch that introduces the streaming read API, seems
> that it finishes all the reads, so I suppose we don't need an abort
> function. Does it all get cleaned up correctly on error?

I think it should.  The buffer error handling is one of the areas where I
really would like to have some way of testing the various cases, it's easy to
get things wrong, and basically impossible to write reliable tests for with
our current infrastructure.


> > typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
> > uintptr_t pgsr_private,
> > void *per_io_private,
> > BufferManagerRelation *bmr,
> > ForkNumber *forkNum,
> > BlockNumber *blockNum,
> > ReadBufferMode *mode);
> 
> I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
> each read. I see that you used that in the WAL replay prefetching, so I
> guess that makes sense.

Yea, that's the origin - I don't like it, but I don't really have a better
idea.


> > extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
> > extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_io_private);
> > extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
> > extern void pg_streaming_read_free(PgStreamingRead *pgsr);
> 
> Do we need to expose pg_streaming_read_prefetch()? It's only used in the WAL
> replay prefetching patch, and only after calling pg_streaming_read_reset().
> Could pg_streaming_read_reset() call pg_streaming_read_prefetch() directly?
> Is there any need to "reset" the stream, without also starting prefetching?

Heh, I think this is a discussion Thomas and I were having before...

Greetings,

Andres Freund