Re: [Maria-developers] Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)

2023-05-22 Thread Vladislav Vaintroub
Hi Marko and Kristian,

>> I also had the idea to use fibers/coroutines as is mentioned in the MDEV
>> description, but if that can be avoided, so much the better.

> I too like https://quoteinvestigator.com/2011/05/13/einstein-simple/
> or the KISS principle.

About MDEV-24341: it was not implemented with fibers or coroutines after
all. Instead, do_command() is a restartable function that simply uses goto
to jump to the place where it left off last time. Luckily, there is only one
restart point: at the end of the query, just before reporting OK to the
client, where we would otherwise need to wait for the InnoDB group commit
lead to make the changes persistent.

The "coroutine-like" behaviour is only active if the threadpool is used:
do_command() completes only partially and frees the worker thread, so that
it can accept requests from other connections. The group commit lead thread
takes care of resuming unfinished do_command() calls after the redo log has
been written and flushed.

With thread-per-connection, the coroutine-like logic above makes no sense.
Still, the wait for persistent redo is delayed until the last possible
moment, when the server writes to the network to report OK to the client.
This delay either shortens the wait or avoids it entirely, depending on how
fast the group commit lead finishes its work.

However, if the binlog is used, none of the above happens. Lacking binlog
knowledge, I took a conservative approach: with the binlog on, all waits for
persistent InnoDB redo are "synchronous", i.e. they happen inside InnoDB's
log_write_up_to() function.

Wlad
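The restartable-via-goto shape described above can be sketched as follows. This is a minimal illustration, not MariaDB's actual code: the state enum, the `Connection` struct, and the `redo_durable` flag (standing in for the group commit lead's notification) are all hypothetical.

```cpp
#include <cassert>

// Hypothetical sketch: a single resume point recorded in per-connection
// state, reached with goto when the function is re-entered after the
// group commit lead has made the redo log durable.
enum class Stage { Start, AwaitDurable, Done };

struct Connection {
  Stage resume_at = Stage::Start;
  bool redo_durable = false;   // set by the group commit lead thread
  bool ok_sent = false;
};

// Returns true once the command has fully completed (OK sent to client).
bool do_command(Connection &c) {
  if (c.resume_at == Stage::AwaitDurable)
    goto await_durable;               // jump to where we left off last time

  // ... parse and execute the query, write redo into the log buffer ...

  c.resume_at = Stage::AwaitDurable;  // the single restart point
await_durable:
  if (!c.redo_durable)
    return false;                     // free the worker thread (threadpool)

  c.ok_sent = true;                   // report OK only after durability
  c.resume_at = Stage::Done;
  return true;
}
```

With the threadpool, the `false` return frees the worker; the group commit lead later calls `do_command()` again after flushing the redo log.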

___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp


Re: [Maria-developers] Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)

2023-05-22 Thread Marko Mäkelä
On Sat, May 20, 2023 at 11:07 PM Kristian Nielsen
 wrote:
> I agree that a function parameter seems simpler. We're requesting to be
> notified when a specific commit is durable in the engine, better specify
> _which_ commit than for InnoDB to guess, as you say :-).

Given that this code is only executed when switching binlog files
(every 1 gigabyte or so of written binlog), or during RESET MASTER,
the slight performance degradation due to the "unnecessary but
sufficient" condition on InnoDB should not matter much.

> We need to be careful about lifetime. The transaction may no longer exist as
> a THD or trx_t (?) in memory. But innobase_commit_ordered() could return the
> corresponding LSN, as was suggested in another mail, and then
> commit_checkpoint_request() could pass that value.

Right. The trx_t objects are being allocated from a special memory
pool that facilitates fast reuse. The object of an old transaction
could have been reused for something else. So, it would be better to
let the storage engine somehow return its logical time. 64 bits for
that could be sufficient for all storage engines, and the special
value 0 could imply that the current logic (ask the storage engine to
write everything) will be used.
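The proposed API shape above could look roughly like this. Everything here is a hypothetical sketch of the discussion, not existing MariaDB code: the engine's "fast commit" returns its 64-bit logical time (an LSN for InnoDB), and the checkpoint request passes that value back, with 0 preserving today's "write everything" behaviour.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical engine-side sketch of the discussed API change.
struct Engine {
  uint64_t current_lsn = 100;

  // "Fast part" of commit: returns the logical time of this commit.
  uint64_t commit_ordered() { return ++current_lsn; }

  // Resolve the LSN a checkpoint request should wait for:
  // 0 means "everything the engine has written so far".
  uint64_t checkpoint_target(uint64_t target) const {
    return target ? target : current_lsn;
  }
};
```

The key property is that the server never interprets the value; it only hands back what the engine returned, keeping the API storage-engine independent.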

[snip]
> Yes. A single-engine transaction is surely the important use case to
> optimise. It's nice if multi-engine transactions still work, but if they
> still require multiple fsyncs, I think that's perfectly fine, not something
> to allocate a lot of resources to optimise for.

It's nice that we agree here.

> I also had the idea to use fibers/coroutines as is mentioned in the MDEV
> description, but if that can be avoided, so much the better.

I too like https://quoteinvestigator.com/2011/05/13/einstein-simple/
or the KISS principle.

Example: If the buf_pool.mutex or fil_system.mutex is a bottleneck,
fix the bottlenecks (MDEV-15053, MDEV-23855) instead of introducing
complex things such as multiple buffer pool instances and multiple
page cleaner threads (removed in MDEV-15058), or introducing a
Fil_shard (MySQL 8.0). Or if the log_sys.mutex is a bottleneck, do not
introduce a "jungle of threads" that will write to multiple log files,
but just reduce the bottlenecks with increased use of std::atomic or
with a file format change that allows more clever locking
(MDEV-27774). Thread context switches can be expensive when system
calls such as mutex waits are involved, and when not, race conditions
in lock-free algorithms are hard to diagnose (often invisible to tools
like https://rr-project.org). Even when there are no system calls
involved in inter-thread communication, "cache line ping-pong" can
quickly become expensive, especially on NUMA systems.

Marko
-- 
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc



Re: [Maria-developers] Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)

2023-05-20 Thread Kristian Nielsen
Marko Mäkelä  writes:

>> recent binlog files. We want to know when scanning the old binlog file is no
>> longer necessary; then we will log a CHECKPOINT_EVENT recording this fact.
>> But we don't want to stall waiting for everything to be flushed to disk
>> immediately; we can just log the CHECKPOINT_EVENT later when appropriate.
>
> Is this the original purpose and motivation of the
> handlerton::commit_checkpoint_request?

Yes.

> I think that for InnoDB to guess the last binlog-committed transaction
> would require adding a thread-local variable, which could cause
> trouble with the connection thread pool. An added function parameter
> would be much simpler.

I agree that a function parameter seems simpler. We're requesting to be
notified when a specific commit is durable in the engine, better specify
_which_ commit than for InnoDB to guess, as you say :-).

And an engine is free to ignore the parameter, and just make sure all
existing transactions at that point are committed (eg. current lsn), as my
original InnoDB implementation did.

We need to be careful about lifetime. The transaction may no longer exist as
a THD or trx_t (?) in memory. But innobase_commit_ordered() could return the
corresponding LSN, as was suggested in another mail, and then
commit_checkpoint_request() could pass that value.

Something like that seems indeed a better API.

>> There seem to be some lock-free operations, I wonder if that makes sense,
>> this is not timing-critical code (it's called once per binlog rotation).
>
> It was a long time ago when I worked on this. The InnoDB log_sys.mutex
> certainly was a bottleneck before some InnoDB log related data
> structures were refactored to use C++11 std::atomic.

I was thinking that the log_flush_request list does not need anything fancy,
and could just use a normal mutex, it is so rarely accessed. And any more
critical code (eg. using log_sys.mutex) could just do an unsafe check for
NULL log_flush_request list, at the worst skipping a binlog checkpoint
notification until the next flush.
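The scheme proposed above (plain mutex for the rare path, a cheap unsafe emptiness check on the hot path) could be sketched like this. The type and field names are hypothetical stand-ins for the log_flush_request list, not actual server code.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

// Sketch: the rarely touched request list is guarded by an ordinary
// mutex, while the hot log-flush path first does only a cheap atomic
// "might it be non-empty?" check, at worst deferring a notification
// until the next flush.
struct FlushRequests {
  std::mutex m;
  std::vector<uint64_t> pending;           // LSNs awaiting notification
  std::atomic<bool> maybe_nonempty{false};

  void add(uint64_t lsn) {
    std::lock_guard<std::mutex> g(m);
    pending.push_back(lsn);
    maybe_nonempty.store(true, std::memory_order_release);
  }

  // Called from the hot flush path; returns how many were notified.
  int notify_up_to(uint64_t flushed_lsn) {
    if (!maybe_nonempty.load(std::memory_order_relaxed))
      return 0;                            // cheap skip; next flush catches up
    std::lock_guard<std::mutex> g(m);
    int notified = 0;
    std::vector<uint64_t> keep;
    for (uint64_t lsn : pending) {
      if (lsn <= flushed_lsn) ++notified;  // would call the notify hook here
      else keep.push_back(lsn);
    }
    pending.swap(keep);
    maybe_nonempty.store(!pending.empty());
    return notified;
  }
};
```

As the text notes, this only works if a skipped notification is tolerated (which the RESET MASTER wait currently is not).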

But all that becomes more complicated when RESET MASTER will hang until the
checkpoint notification arrives. Then a lost checkpoint notification becomes
a problem on an idle server. That's why I'm unhappy about the way RESET
MASTER is done now (even though I implemented it, IIRC). It really was my
intention that skipping a binlog checkpoint from time to time should be
allowed to simplify things in the engine.

> The 10.4 innobase_mysql_log_notify() follows the anti-pattern
> https://github.com/google/sanitizers/wiki/ThreadSanitizerPopularDataRaces#double-checked-locking
>
> The comment that claims that it is safe might not hold for
> architectures that use a weak memory model, such as typical
> implementations of ARM, POWER, and RISC-V (RVWMO).
> I believe that the
> https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Acquire_ordering
> that is employed by the 10.5 innodb_log_flush_request() and
> log_flush_notify() should be safer.

Right. As long as RESET MASTER doesn't tolerate delayed notifications, all
of this becomes a concern :-(

> If the last write to the log was an InnoDB internal one, it could be
> done by log_write_up_to(lsn, false).
> The log_sys.flushed_to_disk_lsn would only be advanced (to somewhere
> closer to log_sys.lsn) if there was a subsequent call to
> log_write_up_to(lsn, true).

What I found in the test case I added to MDEV-25611 is that InnoDB
flushes the redo log every second (assuming I read the code right); this
is done by the function srv_sync_log_buffer_in_background(). Hm, but
actually this interval can be controlled by the option
srv_flush_log_at_timeout, so my test case could be improved to set it to a
high value instead of disabling srv_sync_log_buffer_in_background() in the
code.

> Right, I did not think of multi-engine transactions. Perhaps there
> could be a special single-engine transaction mode? If the binlog is

Yes. A single-engine transaction is surely the important use case to
optimise. It's nice if multi-engine transactions still work, but if they
still require multiple fsyncs, I think that's perfectly fine, not something
to allocate a lot of resources to optimise for.

> One more thing: The post
> https://knielsen-hq.org/w/fixing-mysql-group-commit-part-3/ is missing
> the fact that when the "fast part" of an InnoDB transaction state
> change has finished, the LSN of the transaction COMMIT (or XA PREPARE
> or XA ROLLBACK or XA COMMIT) will be determined. It is not possible to
> reorder transactions after that point. In fact, after the completion
> of the "fast part", other transactions in the system will observe the
> state change of the transaction, even before it becomes durable.

Thanks. Yes, this sounds like what I would expect.

> Another thing: Could replication make use of the log write completion
> mechanism https://jira.mariadb.org/browse/MDEV-24341 that was added to
> MariaDB Server 10.6.0?

Ah, this is interesting, thanks for the pointer.
I 

Re: [Maria-developers] Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)

2023-05-16 Thread Marko Mäkelä
On Tue, May 16, 2023 at 1:37 PM Andrei Elkin  wrote:
> Can you comment on the chances of `flush_lsn` never advancing to the value
> of `write_lsn` when/because InnoDB has nothing to commit (so at such a time
> RM (or PURGE or FLUSH LOGS, which have been mentioned too) might be run)?

If the last write to the log was an InnoDB internal one, it could be
done by log_write_up_to(lsn, false).
The log_sys.flushed_to_disk_lsn would only be advanced (to somewhere
closer to log_sys.lsn) if there was a subsequent call to
log_write_up_to(lsn, true).
Log checkpoint would advance log_sys.flushed_to_disk_lsn, as should a
user transaction commit. This could be a possible explanation.
-- 
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc



Re: [Maria-developers] Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)

2023-05-16 Thread Marko Mäkelä
On Fri, May 12, 2023 at 8:32 PM Kristian Nielsen
 wrote:
> The transaction of interest is the last one that called commit_ordered()
> prior to this call of commit_checkpoint_request(). I guess binlog code could
> save it (it's trivial as commit_ordered() calls are serialised), or InnoDB
> could do it itself.

I think that for InnoDB to guess the last binlog-committed transaction
would require adding a thread-local variable, which could cause
trouble with the connection thread pool. An added function parameter
would be much simpler.

> But let's take it from what we actually want to achieve. We have created a
> new binlog file, and now crash recovery will have to scan the two most
> recent binlog files. We want to know when scanning the old binlog file is no
> longer necessary; then we will log a CHECKPOINT_EVENT recording this fact.
> But we don't want to stall waiting for everything to be flushed to disk
> immediately; we can just log the CHECKPOINT_EVENT later when appropriate.

Is this the original purpose and motivation of the
handlerton::commit_checkpoint_request?

> There seem to be some lock-free operations, I wonder if that makes sense,
> this is not timing-critical code (it's called once per binlog rotation).

It was a long time ago when I worked on this. The InnoDB log_sys.mutex
certainly was a bottleneck before some InnoDB log related data
structures were refactored to use C++11 std::atomic.
I might have originally misunderstood the source code comment in
sql/handler.h that was added in MDEV-232.

The 10.4 innobase_mysql_log_notify() follows the anti-pattern
https://github.com/google/sanitizers/wiki/ThreadSanitizerPopularDataRaces#double-checked-locking

The comment that claims that it is safe might not hold for
architectures that use a weak memory model, such as typical
implementations of ARM, POWER, and RISC-V (RVWMO).
I believe that the
https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Acquire_ordering
that is employed by the 10.5 innodb_log_flush_request() and
log_flush_notify() should be safer.
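The release-acquire pattern referenced above can be shown in miniature. This is an illustration of the memory-ordering idea, not the actual 10.5 code; the variable names echo the InnoDB ones but the surrounding logic is invented.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// The flusher publishes the flushed LSN with a release store; a reader
// that observes the new value via an acquire load is guaranteed to also
// see every plain write made before the store. A non-atomic
// "double-checked" read gives no such guarantee on weakly ordered
// hardware (ARM, POWER, RISC-V RVWMO).
std::atomic<uint64_t> flushed_to_disk_lsn{0};
uint64_t last_notified_lsn = 0;   // ordinary data published by the store

void flusher_publishes(uint64_t lsn) {
  last_notified_lsn = lsn;                                    // plain write
  flushed_to_disk_lsn.store(lsn, std::memory_order_release);  // publish
}

bool reader_sees(uint64_t wanted) {
  // Acquire pairs with the release above: if this load observes a value
  // >= wanted, the plain write to last_notified_lsn is also visible.
  return flushed_to_disk_lsn.load(std::memory_order_acquire) >= wanted;
}
```
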

> There's also some complex logic to avoid missing a
> commit_checkpoint_notify_ha(). I can see how that is important and could
> occur in a completely idle server. But I think it can be done simpler, by
> simply scheduling a new check of this every second or so as long as there
> are pending checkpoint requests. This is all asynchronous, no point in
> trying to report the checkpoint immediately.

According to a comment in https://jira.mariadb.org/browse/MDEV-25611
there is an https://rr-project.org/ trace of a hang available to
Andrei Elkin. It is also pretty easy to reproduce the hang by using
some debug instrumentation that makes InnoDB try harder to avoid
issuing log checkpoints (which would otherwise trigger another call to
log_flush_notify() and thereby "rescue" the hang). It would be very
interesting to know what happened concurrently with or after the last
call to innodb_log_flush_request().

> If InnoDB is running non-durably, I think there is no point in checking the
> log at all. It doesn't help that we scan the required log files if the
> transactions to be recovered from InnoDB were not prepared durably and are
> unrecoverable anyway. So in non-durable mode InnoDB could just call
> commit_checkpoint_notify_ha() immediately?

I do not think that we should spend too much effort to optimize an
unsafe, corruption-prone mode of operation. We should be safe by
default (https://jira.mariadb.org/browse/MDEV-16589) and optimize that
case.

> I notice that RESET MASTER also does a commit_checkpoint_request() and waits
> for the notification. I can understand why I did it this way, but in
> hindsight it doesn't seem like a good idea. Binlog checkpoint is async, so
> we should not wait on it. RESET MASTER surely cannot be time critical, so
> probably it would be better to just ask InnoDB to flush everything to its
> log, skipping the checkpoint mechanism. Or just do nothing, does it make any
> sense to recover the binlog into a consistent state with InnoDB when we just
> deleted all of the binlog?!?

If RESET MASTER can rewind the binlog position, then it might make
sense to ensure that the new binlog position is durably written to the
persistent InnoDB transaction metadata storage.

> Anyway, these are comments on the current code. I still didn't fully
> understand what the problem is with the current API wrt. InnoDB?

Based on your description, it does not feel like a big design problem.
I would very much like to see an analysis and fix of the MDEV-25611
hang. It could be something fairly trivial, or a sign of a more
serious problem in the design.

> I have an old worklog where this idea was explored in some detail. It is not
> trivial, there are many corner cases where replaying binlog events will not
> reliably recover the state (eg. statement-based binlogging). People have
> learned to live with this for replication, but now that will be extended to
> the binlog. One 

Re: [Maria-developers] Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)

2023-05-12 Thread Kristian Nielsen
Marko Mäkelä  writes:

> Do you know how MyRocks implements ACID? There is a function

Sorry, I do not. Maybe Sergey Petrunya does.

> That function does not currently take any parameter (such as THD) to
> identify the transaction of interest, and it cannot indicate that the
> most recent state change of the transaction has already been durably
> written.

The transaction of interest is the last one that called commit_ordered()
prior to this call of commit_checkpoint_request(). I guess binlog code could
save it (it's trivial as commit_ordered() calls are serialised), or InnoDB
could do it itself.

But let's take it from what we actually want to achieve. We have created a
new binlog file, and now crash recovery will have to scan the two most
recent binlog files. We want to know when scanning the old binlog file is no
longer necessary; then we will log a CHECKPOINT_EVENT recording this fact.
But we don't want to stall waiting for everything to be flushed to disk
immediately; we can just log the CHECKPOINT_EVENT later when appropriate.

We call commit_checkpoint_request() into all storage engines to receive a
notification when all currently (binlog) committed transactions have become
durable. Once the engine has synced past its corresponding LSN, it replies
with commit_checkpoint_notify_ha(), and when all engines have done so, the
CHECKPOINT_EVENT is binlogged.

In 10.4 innobase_checkpoint_request() the code saves the current LSN:

lsn = log_get_lsn();

An extra check is made whether lsn <= log_get_flush_lsn() already holds; if
so, commit_checkpoint_notify_ha() is called immediately. Otherwise, we check
again whenever the log is flushed, until the recorded lsn has been flushed.

The code in 11.1 is different, though it seems to be doing something
similar.
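The 10.4 flow described above can be condensed into a small state machine. The struct and its fields are hypothetical stand-ins for log_get_lsn()/log_get_flush_lsn() and the notification call, just to make the control flow concrete.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the 10.4 innobase_checkpoint_request() flow: record the
// current LSN; if it is already flushed, notify immediately; otherwise
// remember it and notify from the flush path once the flushed LSN has
// caught up.
struct CheckpointRequest {
  uint64_t current_lsn;     // stands in for log_get_lsn()
  uint64_t flushed_lsn;     // stands in for log_get_flush_lsn()
  uint64_t pending = 0;     // recorded LSN awaiting flush (0 = none)
  int notifications = 0;    // stands in for commit_checkpoint_notify_ha()

  void request() {
    uint64_t lsn = current_lsn;          // lsn = log_get_lsn();
    if (lsn <= flushed_lsn)
      ++notifications;                   // already durable: notify now
    else
      pending = lsn;                     // re-check on every log flush
  }

  void on_log_flush(uint64_t new_flushed_lsn) {
    flushed_lsn = new_flushed_lsn;
    if (pending && pending <= flushed_lsn) {
      ++notifications;                   // recorded lsn is now flushed
      pending = 0;
    }
  }
};
```
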

There seem to be some lock-free operations, I wonder if that makes sense,
this is not timing-critical code (it's called once per binlog rotation).

There's also some complex logic to avoid missing a
commit_checkpoint_notify_ha(). I can see how that is important and could
occur in a completely idle server. But I think it can be done simpler, by
simply scheduling a new check of this every second or so as long as there
are pending checkpoint requests. This is all asynchronous, no point in
trying to report the checkpoint immediately.

If InnoDB is running non-durably, I think there is no point in checking the
log at all. It doesn't help that we scan the required log files if the
transactions to be recovered from InnoDB were not prepared durably and are
unrecoverable anyway. So in non-durable mode InnoDB could just call
commit_checkpoint_notify_ha() immediately?

I notice that RESET MASTER also does a commit_checkpoint_request() and waits
for the notification. I can understand why I did it this way, but in
hindsight it doesn't seem like a good idea. Binlog checkpoint is async, so
we should not wait on it. RESET MASTER surely cannot be time critical, so
probably it would be better to just ask InnoDB to flush everything to its
log, skipping the checkpoint mechanism. Or just do nothing, does it make any
sense to recover the binlog into a consistent state with InnoDB when we just
deleted all of the binlog?!?

Anyway, these are comments on the current code. I still didn't fully
understand what the problem is with the current API wrt. InnoDB?

But I don't see a big problem with changing it either if InnoDB needs
something different. All that is needed from the binlog side is to get a
notification at some point that an old binlog file is no longer needed for
recovery. If we can find a simpler way to do this, then that's good.
Removing the checkpointing from RESET MASTER completely might be a good
start.

> Where should a notification be initiated if all changes have already
> been written at the time handlerton::commit_checkpoint_request is
> called?

Then commit_checkpoint_notify_ha() could be called immediately before
returning from commit_checkpoint_request().

>> Since the binlog writing is essentially sequential in nature, it is more
>> efficient to do it in a single thread for all waiting threads, and this is
>> how the MariaDB binlog group commit is done.
>
> Yes, optimizing that would require a file format change, to replace
> the append-only-to-last-file with something else, such as one or
> multiple preallocated files, possibly arranged as a ring buffer.

Agree, pre-allocated would be more efficient.

> One related idea that has been floating around is to use the InnoDB
> log as a "doublewrite buffer" for short enough binlog event groups.
> The maximum mini-transaction size (dictated by the InnoDB recovery
> code) is something like 4 megabytes. On an InnoDB log checkpoint, any
> buffered binlog event groups would have to be durably written to the
> binlog. Likewise, if a binlog event group is too large to be buffered,
> it would have to be written and fdatasync()ed in advance.
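The routing rule implied by the idea above could be as simple as a size check. The constants here are assumptions for illustration (the text only says the mini-transaction limit is "something like 4 megabytes"); the names are invented.

```cpp
#include <cassert>
#include <cstdint>

// Sketch: an event group small enough to fit in one InnoDB
// mini-transaction can be buffered in the redo log and copied to the
// binlog lazily; anything larger must be written and fdatasync()ed to
// the binlog up front.
constexpr uint64_t kMaxMtrBytes = 4 << 20;   // assumed ~4 MiB recovery limit

enum class BinlogPath { BufferInRedo, WriteAndSyncNow };

BinlogPath route_event_group(uint64_t bytes) {
  return bytes <= kMaxMtrBytes ? BinlogPath::BufferInRedo
                               : BinlogPath::WriteAndSyncNow;
}
```

An InnoDB log checkpoint would additionally force any still-buffered groups out to the binlog, since checkpointing discards the redo they live in.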

Ah, yes, this is an interesting idea actually.

>> Yes, it is an interesting idea. What do you 

Re: [Maria-developers] Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)

2023-05-12 Thread Marko Mäkelä
Hi Kristian,

Thank you for your detailed reply. Before I read the linked blog
posts, here are a few quick comments.

> 1. The APIs (handlerton methods and server calls) are designed to be
> storage-engine independent. So this is one reason why we would not see an
> InnoDB LSN anywhere in the server part.

Every transactional storage engine under MariaDB should have some
concept of logical time. It is not necessarily totally ordered like
the InnoDB or Aria LSN. The Aria LSN consists of a 32-bit log file
number and a 32-bit offset within a log file. The InnoDB LSN is a
64-bit offset, but the log file is written as a ring buffer, so the
LSN will be mapped to a file offset using some modulo arithmetic.
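The two LSN shapes just described can be illustrated with a couple of lines of arithmetic. The constants (header size, log capacity) are made-up values for the sketch, not the real InnoDB or Aria layouts.

```cpp
#include <cassert>
#include <cstdint>

// Aria packs a 32-bit log file number and a 32-bit offset into one
// value; an InnoDB-style 64-bit LSN maps onto a ring-buffer log file
// with modulo arithmetic.
constexpr uint64_t kLogHeaderBytes = 0x800;   // assumed header size
constexpr uint64_t kLogCapacity = 1 << 20;    // assumed 1 MiB of payload

uint64_t aria_lsn(uint32_t file_no, uint32_t offset) {
  return (uint64_t{file_no} << 32) | offset;  // not totally ordered by file
}

uint64_t ring_file_offset(uint64_t lsn) {
  return kLogHeaderBytes + lsn % kLogCapacity;  // the modulo mapping
}
```
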

Do you know how MyRocks implements ACID? There is a function
rocksdb_flush_wal(), but it is commented out in MariaDB. In LevelDB, I
see some SST and log files, but they may be specific to a single LSM
tree, and I would suppose that there are multiple LSM trees in
MyRocks, instead of the data from all tables being stored in a single
LSM tree.

> commit_checkpoint_request() is called when the binlog wants to mark a point
> before which binlog files no longer need to be scanned (a "binlog
> checkpoint"). We can see in innobase_checkpoint_request() that this is
> mapped to an InnoDB LSN (using log_get_lsn()). This follows point (1) above,
> where the server-level API (binlog) is kept separate from storage engine
> internals (InnoDB LSN).

That function does not currently take any parameter (such as THD) to
identify the transaction of interest, and it cannot indicate that the
most recent state change of the transaction has already been durably
written.

> I wonder if this is a real mismatch between the API and InnoDB, or a
> misunderstanding about what is intended from commit_checkpoint_request().
> The binlog checkpoint mechanism is completely asynchronous, and it is not a
> request for the storage engine to take any action to make things durable. It
> is merely a request for InnoDB to conservatively pick an LSN that is known
> to contain any transaction that has completed commit_ordered(), it may be
> any later LSN that is cheaper to obtain. Then later, when that LSN is
> eventually made durable on the engine's own initiative, the engine will call
> commit_checkpoint_notify_ha() to complete the binlog checkpoint.

Where should a notification be initiated if all changes have already
been written at the time handlerton::commit_checkpoint_request is
called?

> So if I understand correctly, records from parallel transactions are written
> concurrently into the InnoDB log in an interleaved fashion. So one thread
> can just put log records into a buffer as they are generated and continue
> query execution, and later they may be written or synced asynchronously by
> another thread.

Yes. Typically, durable writes are only cared about when there is a
tangible change of persistent state, such as changing the state of a
user transaction, or advancing the InnoDB log checkpoint. If a
transaction is rolled back, redo log will be written, but nothing in
that transaction should wait for write_lock or flush_lock. The
rollback could have a side effect of making the commits other
transactions durable.

> Since the binlog writing is essentially sequential in nature, it is more
> efficient to do it in a single thread for all waiting threads, and this is
> how the MariaDB binlog group commit is done.

Yes, optimizing that would require a file format change, to replace
the append-only-to-last-file with something else, such as one or
multiple preallocated files, possibly arranged as a ring buffer.

> Another problem is that the binlog is a logical log (SQL statements), and
> it is fragile to base recovery, which _must_ be 100% reliable, on it.
> to recover InnoDB transactions by replaying binlog replication event groups,
> then we anyway need an fdatasync() inside InnoDB in the XA prepare step
> before we can be sure the transaction is durable, don't we?

Yes, it is a balance between writing more upfront, or potentially
executing more recovery.

> Another idea would be to go in the opposite direction, where the storage
> engine could provide a logging service to the server (just like it provides
> tables). This way the binlog code could call into InnoDB to write the binlog
> records into the InnoDB log. But this requires changes on the InnoDB side
> (log archiving for one), and reading binlogs for replication will be slower
> as it's interleaved with unrelated log records.

That would work too, and in fact, I had proposed that years ago. But
like you point out, there are obvious drawbacks with that.

One related idea that has been floating around is to use the InnoDB
log as a "doublewrite buffer" for short enough binlog event groups.
The maximum mini-transaction size (dictated by the InnoDB recovery
code) is something like 4 megabytes. On an InnoDB log checkpoint, any
buffered binlog event groups would have to be durably written to the
binlog. Likewise, if a binlog event group is too large to be buffered,
it would have to be written and fdatasync()ed in advance.

Re: [Maria-developers] Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)

2023-05-12 Thread Kristian Nielsen
Marko Mäkelä  writes:

> later. I understand that you were an early contributor to MariaDB and
> implemented some durability I/O optimization between the "distributed
> transaction" of binlog and the InnoDB write-ahead log (redo log,
> ib_logfile0).

Indeed, that was the group commit work:

  
https://m.facebook.com/notes/mark-callaghan/group-commit-in-mariadb-is-fast/10150211546215933/
  https://knielsen-hq.org/w/fixing-mysql-group-commit-part-1/
  https://knielsen-hq.org/w/fixing-mysql-group-commit-part-2/
  https://knielsen-hq.org/w/fixing-mysql-group-commit-part-3/
  https://knielsen-hq.org/w/fixing-mysql-group-commit-part-4-of-3/

> I think that some improvement in this area is long overdue. I find it

Agree. I think it would be fantastic to combine your extensive knowledge of
InnoDB with my binlog/replication expertise to improve this! There's a lot
of very interesting points in your mail:

> hard to read the high-level descriptions. Nowhere in
> https://jira.mariadb.org/browse/MDEV-232 or
> https://jira.mariadb.org/browse/MDEV-532 do I see any mention of the key
> terminology:

Indeed, my understanding of InnoDB internals is far from complete (though I
do understand the concept of LSN at a high level of course).

From the binlog side, there are some basic design goals that influence how
the binlog group commit is implemented.

1. The APIs (handlerton methods and server calls) are designed to be
storage-engine independent. So this is one reason why we would not see an
InnoDB LSN anywhere in the server part.

2. Transactions (the replication term is "event group") are always written
into the binlog as a single atomic (kind of) block. So no interleaving of
transactions in the binlog, or transactions that span multiple binlog files.

3. Transactions in the binlogs are strictly ordered, even across different
servers in the same replication topology (this is used for GTID and in-order
optimistic parallel replication).

4. Identical commit order is enforced between the binlog and the storage
engine. I think this is related to getting consistent replication slave
position from an innobackup or LVM/BTRFS snapshot backup.

> and trx_t::must_flush_log_later. There is also the confusingly named
> handlerton::commit_checkpoint_request that has nothing to do with log
> checkpoints, but with some kind of a durability request that

This was an optimisation to remove one fsync from commits when binlog is
enabled:

  https://knielsen-hq.org/w/even-faster-group-commit/

If the server crashes, then on restart the binlog transaction coordinator
needs to recover the XA state between the binlog and the engines. For this
it scans the storage engines and the binlog files for any pending
transactions. Any binlog file that contains pending XA transactions
needs to be scanned. Once all transactions in a binlog file are durably
committed in the engines, it is preferred to not scan it for efficiency
reasons.

commit_checkpoint_request() is called when the binlog wants to mark a point
before which binlog files no longer need to be scanned (a "binlog
checkpoint"). We can see in innobase_checkpoint_request() that this is
mapped to an InnoDB LSN (using log_get_lsn()). This follows point (1) above,
where the server-level API (binlog) is kept separate from storage engine
internals (InnoDB LSN).

> fails to
> identify the minimum logical time up to which everything is desired to
> be durable. (My talk

I wonder if this is a real mismatch between the API and InnoDB, or a
misunderstanding about what is intended from commit_checkpoint_request().
The binlog checkpoint mechanism is completely asynchronous, and it is not a
request for the storage engine to take any action to make things durable. It
is merely a request for InnoDB to conservatively pick an LSN that is known
to contain any transaction that has completed commit_ordered(), it may be
any later LSN that is cheaper to obtain. Then later, when that LSN is
eventually made durable on the engine's own initiative, the engine will call
commit_checkpoint_notify_ha() to complete the binlog checkpoint.

The intention is that this will be very easy for an engine to implement,
without any changes to how it manages internal logs and durability. It is
just a way for the engine to periodically communicate that some LSN is now
durable, in a way that can match the LSN to a binlog position without
assuming a particular storage engine or LSN format. But if this intention
fails, we should look into it.

> Some years ago, in MDEV-21534 Vladislav Vaintroub introduced
> write_lock and flush_lock in InnoDB. This works pretty well: a

> If
> everything up to "my" time has been written or durably written, I do
> not care to wait.

So if I understand correctly, records from parallel transactions are written
concurrently into the InnoDB log in an interleaved fashion. So one thread
can just put log records into a buffer as they are generated and continue
query execution, and later they may be written or synced asynchronously by
another thread.

[Maria-developers] Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)

2023-05-11 Thread Marko Mäkelä
Hi Kristian,

I am happy to see you being active on the maria-developers list. I
don’t think we have collaborated in the past. I have been working on
the InnoDB storage engine under MySQL 4.0 through 5.7 (and some 8.0)
since 2003, and since 2016 under MariaDB Server, mostly 10.1 and
later. I understand that you were an early contributor to MariaDB and
implemented some durability I/O optimization between the "distributed
transaction" of binlog and the InnoDB write-ahead log (redo log,
ib_logfile0).

I think that some improvement in this area is long overdue. I find it
hard to read the high-level descriptions. Nowhere in
https://jira.mariadb.org/browse/MDEV-232 or
https://jira.mariadb.org/browse/MDEV-532 do I see any mention of the key
terminology:

* InnoDB log sequence number (LSN): a logical point of time for any
persistent InnoDB changes. Measured in bytes. Mini-transactions
(atomic changes of InnoDB pages) start and end at some LSN.
Corresponds to the Oracle system change number (SCN).
* Binlog offset: a logical point of time for any changes at the
replication level. Measured in bytes. Replication events such as
statements or row-level events start and end at some binlog offset.
* The LSN or binlog offset of a COMMIT, XA PREPARE, XA ROLLBACK, or XA COMMIT.

There is quite a bit of code in InnoDB that I would like to get rid
of. Some of it is related to the data members trx_t::flush_log_later
and trx_t::must_flush_log_later. There is also the confusingly named
handlerton::commit_checkpoint_request, which has nothing to do with log
checkpoints but with some kind of durability request that fails to
identify the minimum logical time up to which everything is desired to
be durable. (My talk
https://fosdem.org/2022/schedule/event/mariadb_innodb/ explains what a
log checkpoint is and how recovery works.)

Some years ago, in MDEV-21534 Vladislav Vaintroub introduced
write_lock and flush_lock in InnoDB. This works pretty well: a
Sysbench workload with innodb_flush_log_at_trx_commit=1 will linearly
improve throughput as you add more client connection threads. I think
that something similar would be beneficial for the binlog, to keep
track of the latest written offset and the latest offset up to which
fdatasync() has completed. I am not familiar with the binlog
code. A colleague recently claimed to me that LOCK_commit_ordered is a
huge bottleneck. As you can see, write_lock and flush_lock are
better because the lock requests include a notion of logical time: if
everything up to "my" time has been written or durably written, I do
not care to wait. In MDEV-27774 (MariaDB Server 10.8) this was
improved further: multiple mini-transactions can copy their local log
buffer to the shared log_sys.buf concurrently.

It seems to me that the missing tracking of logical time in the data
structures is causing hangs like MDEV-25611. I feel that the design is
based on the wrong assumption that binlog-covered transactions are the
only source of InnoDB log writes. That has never been the case: for
example, there have always been background threads and tasks that
purge the history of committed transactions.

The task MDEV-18959 (ensuring that one fdatasync() per durable
transaction suffices) is also related to this. I think that the
simplest solution (assuming that moving the binlog to a bunch of
transactional tables is out of the question) is to ensure that one of
the logs (the binlog or the write-ahead logs) is always ahead. If we
choose that the binlog must be ahead, then in InnoDB the write_lock
acquisition would have to be preceded by a binlog_write_lock
acquisition ensuring that any relevant change of persistent state
(COMMIT, XA PREPARE, XA ROLLBACK, XA COMMIT) that could be part of
the InnoDB log write has been durably written to the binlog.

With best regards,

Marko
-- 
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc

___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp