Problem Statement
=================
Large object replication is currently unsupported in logical
replication, which is a significant barrier for applications that
rely heavily on large objects.
Background
==========
The primary challenge preventing support for this feature stems from
large objects being stored within the catalog. Catalog decoding is
intentionally unsupported because replicating a catalog entry, such as
a pg_class row, is meaningless without the underlying storage and
related catalog entries (e.g., pg_attribute, pg_constraint, etc.).
Nevertheless, the data within pg_largeobject is different; it behaves
more like regular user table data than typical catalog data.
While it is technically possible to enable logical logging and
replication for rows in pg_largeobject by modifying certain macros,
this approach has two main drawbacks: 1) The decoded data needs a
logical representation, since a single pg_largeobject row is
meaningless outside of PostgreSQL (see the example query below). 2)
The apply worker may not have the necessary permissions to insert
directly into the pg_largeobject catalog if the subscription was not
created by a superuser.
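For reference, the underlying storage is simply one row per
LOBLKSIZE-byte page, keyed by (loid, pageno); for example:

  SELECT loid, pageno, length(data)
  FROM pg_largeobject
  ORDER BY loid, pageno;

A single such page row, replicated verbatim, carries no information
about which logical large-object operation produced it, which is why
a higher-level representation is needed.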
Proposal
=======
The proposed solution introduces a configuration parameter that
allows a large object to be logically logged, similar to a standard
user table. The decoder then translates the single-row insert/update
WAL records generated for the internal pg_largeobject table into
logical operations of the form LOWRITE: OID, OFFSET, DATA, LENGTH,
where the OFFSET is computed as pageno (from the pg_largeobject row)
* LOBLKSIZE. The apply worker on the subscriber side then replays
each such operation using lo_open(), lo_lseek(), and lowrite(). There
is room for further optimization by generating the LOWRITE operation
only for the modified bytes instead of the entire LOBLKSIZE page, but
the current behavior is consistent with how updates are logged for
standard user tables.
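For illustration, replaying the decoded operation shown in the
example session below (LO_WRITE: loid: 1000 offset: 0 datalen: 22
data: try decoding test data) would be roughly equivalent to the
following SQL sequence on the subscriber; the apply worker would of
course use the backend-internal large object routines rather than
SQL, so this is only a sketch:

  BEGIN;
  -- open large object 1000 for read/write; remember the descriptor in :fd
  SELECT lo_open(1000, CAST(x'20000' | x'40000' AS integer)) AS fd \gset
  -- seek to OFFSET (pageno * LOBLKSIZE, 0 in this example), whence 0 = SEEK_SET
  SELECT lo_lseek(:fd, 0, 0);
  SELECT lowrite(:fd, 'try decoding test data');
  COMMIT;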
I have a POC patch for the decoding part of this implementation. I
would like feedback on the overall strategy before putting effort
into cleanup, testing, and further development. The patch still
requires the following work:
a) Substantial cleanup and testing, plus support for large object
operations other than lowrite.
b) The 'apply' side of the implementation.
c) A configuration parameter to conditionally enable logical logging
of large objects (currently they are always logged).
Open points to be discussed
======================
1. I propose that logically logging the pg_largeobject tuple data
should be controlled by a configuration parameter, rather than being
implemented unconditionally. Making it configurable allows users to
opt in to this behavior, preventing unnecessary logging overhead for
users who do not require large object replication. Thoughts?
2. Supporting lo_create() operations: We should consider extending
this proposal to support the lo_create() operation as well. If the
large object replication configuration is enabled, we could generate a
new WAL record directly from lo_create(). Alternatively, we could
leverage the existing WAL record for the insertion into
pg_largeobject_metadata and decode it into a logical lo_create
operation.
While some may categorize lo_create() as a DDL operation, it behaves
practically like a DML operation. When a table contains a large object
column, new large objects are created frequently, often for every row
insertion, making it a commonplace runtime event rather than a schema
design activity.
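As a reference point for the second alternative above, lo_create() by
itself only inserts a row into pg_largeobject_metadata; data pages
appear in pg_largeobject only once something is written. So decoding
that single metadata insert should be enough to reconstruct a logical
lo_create operation:

  SELECT lo_create(1000);
  SELECT oid, lomowner, lomacl
  FROM pg_largeobject_metadata WHERE oid = 1000;
  -- no rows yet; pg_largeobject is populated only by the first lowrite()
  SELECT count(*) FROM pg_largeobject WHERE loid = 1000;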
Acknowledgements
================
I would like to express my sincere thanks to Amit Kapila, Michael
Bautin, Hannu Krosing, Noah Misch, and Joe Conway for their valuable
input, including discussing various alternatives and suggesting
different approaches to this.
I have also added some tests using test_decoding to show how it
works; here is one example:
postgres[1000776]=# select lo_create(1000);
 lo_create
-----------
      1000
(1 row)

postgres[1000776]=# SELECT lowrite(lo_open(1000, CAST(x'20000' |
x'40000' AS integer)), 'try decoding test data');
 lowrite
---------
      22
(1 row)

postgres[1000776]=# SELECT data FROM
pg_logical_slot_get_changes('regression_slot', NULL, NULL,
'include-xids', '0', 'skip-empty-xacts', '1');
                                  data
-------------------------------------------------------------------------
 BEGIN
 LO_WRITE: loid: 1000 offset: 0 datalen: 22 data: try decoding test data
 COMMIT
(3 rows)
--
Regards,
Dilip Kumar
Google
POC-0001-Support-large-object-decoding.patch
