Re: Do away with zero-padding assumption before WALRead()

2024-02-18 Thread Kyotaro Horiguchi
At Mon, 19 Feb 2024 11:02:39 +0530, Bharath Rupireddy 
 wrote in 
> > After all, the patch looks good to me.
> 
> Thanks. It was committed -
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=73f0a1326608ac3a7d390706fdeec59fe4dc42c0.

Yeah. I realied that after I had already sent the mail.. No harm done:p

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Do away with zero-padding assumption before WALRead()

2024-02-18 Thread Bharath Rupireddy
On Mon, Feb 19, 2024 at 8:26 AM Kyotaro Horiguchi
 wrote:
>
> On
> the flip side, SimpleXLogPageRead always reads a whole page and
> returns XLOG_BLCKSZ. However, as you know, the returned buffer doesn't
> contain random garbage bytes.

Is this assumption true when wal_init_zero is off? I think when
wal_init_zero is off, the last few bytes of the last page from the WAL
file may contain garbage bytes i.e. not zero bytes, no?

> Therefore, it's safe as long as the
> caller doesn't access beyond the returned count. As a result, the
> description you pointed out seems to be enough.

Right.

> After all, the patch looks good to me.

Thanks. It was committed -
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=73f0a1326608ac3a7d390706fdeec59fe4dc42c0.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Do away with zero-padding assumption before WALRead()

2024-02-18 Thread Kyotaro Horiguchi
At Mon, 19 Feb 2024 11:56:22 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Yeah, perhaps I was overly concerned. The removed comment made me
> think that someone could add code relying on the incorrect assumption
> that the remaining bytes beyond the returned count are cleared out. On
> the flip side, SimpleXLogPageRead always reads a whole page and
> returns XLOG_BLCKSZ. However, as you know, the returned buffer doesn't
> contain random garbage bytes. Therefore, it's safe as long as the

Forgot to mention that there is a case involving non-initialized
pages, but it doesn't affect the correctness of the description you
pointed out.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Do away with zero-padding assumption before WALRead()

2024-02-18 Thread Kyotaro Horiguchi
At Fri, 16 Feb 2024 19:50:00 +0530, Bharath Rupireddy 
 wrote in 
> On Fri, Feb 16, 2024 at 7:10 AM Kyotaro Horiguchi
>  wrote:
> > 1. It's useless to copy the whole page regardless of the 'count'. It's
> >   enough to copy only up to the 'count'. The patch looks good in this
> >   regard.
> 
> Yes, it's not needed to copy the whole page. Per my understanding
> about page transfers between disk and OS page cache - upon OS page
> cache miss, the whole page (of disk block size) gets fetched from disk
> even if we just read 'count' bytes (< disk block size).

Right, but with a possibly-different block size. Anyway that behavior
doesn't affect the result of this change. (Could affect performance
hereafter if it were not the case, though..)

> > 2. Maybe we need a comment that states the page_read callback
> >   functions leave garbage bytes beyond the returned count, due to
> >   partial copying without clearing the unused portion.
> 
> Isn't the comment around page_read callback at
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/include/access/xlogreader.h;h=2e9e5f43eb2de1ca9ba81afe76d21357065c61aa;hb=d57b7cc3338e9d9aa1d7c5da1b25a17c5a72dcce#l78
> enough?
> 
> "The callback shall return the number of bytes read (never more than
> XLOG_BLCKSZ), or -1 on failure."

Yeah, perhaps I was overly concerned. The removed comment made me
think that someone could add code relying on the incorrect assumption
that the remaining bytes beyond the returned count are cleared out. On
the flip side, SimpleXLogPageRead always reads a whole page and
returns XLOG_BLCKSZ. However, as you know, the returned buffer doesn't
contain random garbage bytes. Therefore, it's safe as long as the
caller doesn't access beyond the returned count. As a result, the
description you pointed out seems to be enough.

After all, the patch looks good to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Do away with zero-padding assumption before WALRead()

2024-02-16 Thread Bharath Rupireddy
On Fri, Feb 16, 2024 at 7:10 AM Kyotaro Horiguchi
 wrote:
>
> Good catch! The comment seems wrong also to me. The subsequent bytes
> can be written simultaneously, and it's very normal that there are
> unflushed bytes are in OS's page buffer. The objective of the comment
> seems to be to declare that there's no need to clear out the remaining
> bytes, here. I agree that it's not a problem for now. However, I think
> we need two fixes here.
>
> 1. It's useless to copy the whole page regardless of the 'count'. It's
>   enough to copy only up to the 'count'. The patch looks good in this
>   regard.

Yes, it's not needed to copy the whole page. Per my understanding
about page transfers between disk and OS page cache - upon OS page
cache miss, the whole page (of disk block size) gets fetched from disk
even if we just read 'count' bytes (< disk block size).

> 2. Maybe we need a comment that states the page_read callback
>   functions leave garbage bytes beyond the returned count, due to
>   partial copying without clearing the unused portion.

Isn't the comment around page_read callback at
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/include/access/xlogreader.h;h=2e9e5f43eb2de1ca9ba81afe76d21357065c61aa;hb=d57b7cc3338e9d9aa1d7c5da1b25a17c5a72dcce#l78
enough?

"The callback shall return the number of bytes read (never more than
XLOG_BLCKSZ), or -1 on failure."

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Do away with zero-padding assumption before WALRead()

2024-02-16 Thread Bharath Rupireddy
On Thu, Feb 15, 2024 at 3:49 PM Nazir Bilal Yavuz  wrote:
>
> > I'm wondering why the WALRead() callers are always reading XLOG_BLCKSZ
> > despite knowing exactly how much to read. Is it to tell the OS to
> > explicitly fetch the whole page from the disk? If yes, the OS will do
> > that anyway because the page transfers from disk to OS page cache are
> > always in terms of disk block sizes, no?
>
> I am curious about the same. The page size and disk block size could
> be different,

Yes, they can be different, but (see below)

> so the reason could be explicitly fetching the whole
> page from the disk as you said.

Upon OS page cache miss, the whole page (of disk block size) gets
fetched from disk even if we just read 'count' bytes (< disk block
size), no? This is my understanding about page transfers between disk
and OS page cache.

> Is this the reason or are there any
> other benefits of always reading XLOG_BLCKSZ instead of reading the
> sufficient part? I tried to search in older threads and code comments
> but I could not find an explanation.

FWIW, walsender for physical replication will just read as much as it
wants to read which can range from WAL of size < XLOG_BLCKSZ to
MAX_SEND_SIZE (XLOG_BLCKSZ * 16). I mean, it does not read the whole
page of bytes XLOG_BLCKSZ when it wants to read < XLOG_BLCKSZ.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Do away with zero-padding assumption before WALRead()

2024-02-15 Thread Kyotaro Horiguchi
At Tue, 13 Feb 2024 11:47:06 +0530, Bharath Rupireddy 
 wrote in 
> Hi,
> 
> I noticed an assumption [1] at WALRead() call sites expecting the
> flushed WAL page to be zero-padded after the flush LSN. I think this
> can't always be true as the WAL can get flushed after determining the
> flush LSN before reading it from the WAL file using WALRead(). I've
> hacked the code up a bit to check if that's true -

Good catch! The comment seems wrong also to me. The subsequent bytes
can be written simultaneously, and it's very normal that there are
unflushed bytes are in OS's page buffer. The objective of the comment
seems to be to declare that there's no need to clear out the remaining
bytes, here. I agree that it's not a problem for now. However, I think
we need two fixes here.

1. It's useless to copy the whole page regardless of the 'count'. It's
  enough to copy only up to the 'count'. The patch looks good in this
  regard.

2. Maybe we need a comment that states the page_read callback
  functions leave garbage bytes beyond the returned count, due to
  partial copying without clearing the unused portion.

> I'm wondering why the WALRead() callers are always reading XLOG_BLCKSZ
> despite knowing exactly how much to read. Is it to tell the OS to
> explicitly fetch the whole page from the disk? If yes, the OS will do
> that anyway because the page transfers from disk to OS page cache are
> always in terms of disk block sizes, no?

If I understand your question correctly, I guess that the whole-page
copy was expected to clear out the remaining bytes, as I mentioned
earlier.

> Although, there's no immediate problem with it right now, the
> assumption is going to be incorrect when reading WAL from WAL buffers
> using WALReadFromBuffers -
> https://www.postgresql.org/message-id/CALj2ACV=C1GZT9XQRm4iN1NV1T=hla_hsgwnx2y5-g+mswd...@mail.gmail.com.
>
> If we have no reason, can the WALRead() callers just read how much
> they want like walsender for physical replication? Attached a patch
> for the change.
> 
> Thoughts?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Do away with zero-padding assumption before WALRead()

2024-02-15 Thread Nazir Bilal Yavuz
Hi,

On Tue, 13 Feb 2024 at 09:17, Bharath Rupireddy
 wrote:
>
> Hi,
>
> I noticed an assumption [1] at WALRead() call sites expecting the
> flushed WAL page to be zero-padded after the flush LSN. I think this
> can't always be true as the WAL can get flushed after determining the
> flush LSN before reading it from the WAL file using WALRead(). I've
> hacked the code up a bit to check if that's true -
> https://github.com/BRupireddy2/postgres/tree/ensure_extra_read_WAL_page_is_zero_padded_at_the_end_WIP,
> the tests hit the Assert(false); added. Which means, the zero-padding
> comment around WALRead() call sites isn't quite right.
>
> I'm wondering why the WALRead() callers are always reading XLOG_BLCKSZ
> despite knowing exactly how much to read. Is it to tell the OS to
> explicitly fetch the whole page from the disk? If yes, the OS will do
> that anyway because the page transfers from disk to OS page cache are
> always in terms of disk block sizes, no?

I am curious about the same. The page size and disk block size could
be different, so the reason could be explicitly fetching the whole
page from the disk as you said. Is this the reason or are there any
other benefits of always reading XLOG_BLCKSZ instead of reading the
sufficient part? I tried to search in older threads and code comments
but I could not find an explanation.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Do away with zero-padding assumption before WALRead()

2024-02-12 Thread Bharath Rupireddy
Hi,

I noticed an assumption [1] at WALRead() call sites expecting the
flushed WAL page to be zero-padded after the flush LSN. I think this
can't always be true as the WAL can get flushed after determining the
flush LSN before reading it from the WAL file using WALRead(). I've
hacked the code up a bit to check if that's true -
https://github.com/BRupireddy2/postgres/tree/ensure_extra_read_WAL_page_is_zero_padded_at_the_end_WIP,
the tests hit the Assert(false); added. Which means, the zero-padding
comment around WALRead() call sites isn't quite right.

I'm wondering why the WALRead() callers are always reading XLOG_BLCKSZ
despite knowing exactly how much to read. Is it to tell the OS to
explicitly fetch the whole page from the disk? If yes, the OS will do
that anyway because the page transfers from disk to OS page cache are
always in terms of disk block sizes, no?

Although, there's no immediate problem with it right now, the
assumption is going to be incorrect when reading WAL from WAL buffers
using WALReadFromBuffers -
https://www.postgresql.org/message-id/CALj2ACV=C1GZT9XQRm4iN1NV1T=hla_hsgwnx2y5-g+mswd...@mail.gmail.com.

If we have no reason, can the WALRead() callers just read how much
they want like walsender for physical replication? Attached a patch
for the change.

Thoughts?

[1]
/*
 * Even though we just determined how much of the page can be validly read
 * as 'count', read the whole page anyway. It's guaranteed to be
 * zero-padded up to the page boundary if it's incomplete.
 */
if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
 ))

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v1-0001-Do-away-with-zero-padding-assumption-before-WALRe.patch
Description: Binary data