On Sun, 23 Mar 2025, Tom Lane wrote:
> Dimitrios Apostolou <ji...@gmx.net> writes:
>> On Thu, 20 Mar 2025, Tom Lane wrote:
>>> I am betting that the problem is that the dump's TOC (table of
>>> contents) lacks offsets to the actual data of the database objects,
>>> and thus the readers have to reconstruct that information by scanning
>>> the dump file. Normally, pg_dump will back-fill offset data in the
>>> TOC at completion of the dump, but if it's told to write to an
>>> un-seekable output file then it cannot do that.
>> Further questions:
>>
>> * The seeking pattern in pg_restore seems nonsensical to me: reading 4K,
>> jumping 8-12K, repeat for the whole file? Consuming 15K IOPS for an
>> hour. /Maybe/ something to improve there... Where can I read more about
>> the format?
> It's reading data blocks (or at least the headers thereof), which have
> a limited size. I don't think that size has changed since circa 1999,
> so maybe we could consider increasing it; but I doubt we could move
> the needle very far that way.
If it could be substantially increased, it would cut down the number
of seeks by a big factor and definitely make a difference. But would it
have a negative impact on the size of small backups?
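
To make sure I'm picturing the scan correctly, here is a rough sketch of
the loop I have in mind. The header layout, field names and sizes are
made up for illustration; this is not the real archive format nor the
actual pg_restore code.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* hypothetical per-block header, NOT the real custom-format layout */
struct block_header
{
    uint32_t    dump_id;        /* which TOC entry the block belongs to */
    uint64_t    payload_len;    /* bytes of data following the header */
};

static void
scan_archive(const char *path)
{
    struct block_header hdr;
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
        return;

    /* one small read plus one seek per data block: many IOPS, little I/O */
    while (read(fd, &hdr, sizeof(hdr)) == (ssize_t) sizeof(hdr))
    {
        printf("dump id %u: %llu bytes of data\n",
               hdr.dump_id, (unsigned long long) hdr.payload_len);

        /* skip the payload we don't need while rebuilding the TOC */
        if (lseek(fd, (off_t) hdr.payload_len, SEEK_CUR) < 0)
            break;
    }
    close(fd);
}

int
main(int argc, char **argv)
{
    if (argc > 1)
        scan_archive(argv[1]);
    return 0;
}
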
However, I would have expected the kernel to handle this read-then-seek
pattern better: the 4K read() should trigger a large pre-fetch (read-ahead)
that covers the next seek and read. I'm not sure why I don't see that
happening; it's a rather typical setup (NVMe SSD with an ext4 filesystem
on Linux 6.11).
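
For completeness, this is the kind of explicit hint I was expecting would
help, in case the heuristic read-ahead is being defeated by the seeks. I
don't know whether pg_restore already issues anything like it; just a
sketch:

#include <fcntl.h>
#include <unistd.h>

static void
hint_readahead(int fd)
{
    /*
     * Declare up front that the whole file will be read roughly
     * sequentially (len = 0 means "to end of file"), instead of relying
     * on the kernel detecting the pattern on its own.
     */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* or ask it to start prefetching the range right away */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
}

int
main(int argc, char **argv)
{
    if (argc > 1)
    {
        int     fd = open(argv[1], O_RDONLY);

        if (fd >= 0)
        {
            hint_readahead(fd);
            close(fd);
        }
    }
    return 0;
}
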
As a workaround, pg_restore could read() the whole archive sequentially to
construct the TOC, instead of jumping around. I'm a bit split on this one,
as it would be much faster on the compressed archive (~800GB) but not that
much faster on the uncompressed one (>5TB).
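
Concretely, I mean replacing the lseek() in the sketch above with reads
that simply get discarded, so the access pattern stays purely sequential.
Again just a sketch, not actual pg_restore code:

#include <stdint.h>
#include <unistd.h>

/* skip 'len' bytes by reading and discarding them in large chunks */
static int
skip_by_reading(int fd, uint64_t len)
{
    char        buf[128 * 1024];

    while (len > 0)
    {
        size_t      chunk = len < sizeof(buf) ? (size_t) len : sizeof(buf);
        ssize_t     n = read(fd, buf, chunk);

        if (n <= 0)
            return -1;          /* EOF or error */
        len -= (uint64_t) n;
    }
    return 0;
}
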
I'm also under the impression that all the pg_restore child processes are
doing the same thing: seeking and reading through the whole file, each one
independently. Can you verify or disprove this?
Thank you for the detailed response.
Dimitris