Re: [Maria-developers] Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)
Hi Marko and Kristian,

>> I also had the idea to use fibers/coroutines as is mentioned in the MDEV
>> description, but if that can be avoided, so much the better.

> I too like https://quoteinvestigator.com/2011/05/13/einstein-simple/
> or the KISS principle.

About MDEV-24341: that was not implemented as fibers or coroutines after all. Instead, do_command() is a restartable function that uses goto to jump to the place where it left off last time. Luckily, there is only one restartability point: at the end of the query, just before reporting OK to the client, where we would otherwise need to wait for the InnoDB group commit lead to make the changes persistent.

This "coroutine-like" behavior is only active if the threadpool is used: do_command() completes only partially and frees the worker thread, so that it can accept requests from other connections. The group commit lead thread takes care of resuming unfinished do_command() calls after the redo log is written/flushed.

With thread-per-connection, the coroutine-like logic above makes no sense. But the wait for persistent redo is still delayed until the last possible moment, when the server writes to the network to report OK to the client. This delay either decreases the wait time or avoids it entirely, depending on how fast the group commit lead finishes its work.

However, if the binlog is used, none of the above happens. Lacking binlog knowledge, I took a conservative approach. Therefore, with the binlog on, all waits for persistent InnoDB redo are "synchronous", i.e. they happen inside InnoDB's log_write_up_to() function.

Wlad
___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help : https://help.launchpad.net/ListHelp
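The goto-based restartable do_command() described in the message above might look roughly like this. This is a hypothetical sketch with invented names, not the actual MariaDB code; it only shows the control-flow idea of a single resume point reached via goto, with the threadpool's group commit lead re-invoking the function once the redo log is durable.

```cpp
#include <cstdio>

// Hypothetical sketch (names invented) of a restartable function in the
// style described above: a goto-based resume point instead of a coroutine.
struct connection_state {
  bool resumed_after_group_commit = false;  // set before the second call
};

enum class dispatch_result { DONE, SUSPENDED };

dispatch_result do_command_sketch(connection_state &st) {
  if (st.resumed_after_group_commit)
    goto resume_point;  // jump to where we left off last time

  // ... parse and execute the query, generate redo log ...
  std::puts("query executed; redo not yet durable");

  // With the threadpool, suspend here and free the worker thread.
  // The group commit lead re-invokes us once the redo log is durable.
  st.resumed_after_group_commit = true;
  return dispatch_result::SUSPENDED;

resume_point:
  // Redo log is now durable; safe to report OK to the client.
  std::puts("OK sent to client");
  st.resumed_after_group_commit = false;
  return dispatch_result::DONE;
}
```

With thread-per-connection there is no suspension; the same wait simply happens inline just before the OK packet is written.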
Re: [Maria-developers] Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)
On Sat, May 20, 2023 at 11:07 PM Kristian Nielsen wrote:

> I agree that a function parameter seems simpler. We're requesting to be
> notified when a specific commit is durable in the engine, better specify
> _which_ commit than for InnoDB to guess, as you say :-).

Given that this code is only executed when switching binlog files (every 1 gigabyte or so of written binlog), or during RESET MASTER, the slight performance degradation due to the "unnecessary but sufficient" condition on InnoDB should not matter much.

> We need to be careful about lifetime. The transaction may no longer exist as
> a THD or trx_t (?) in memory. But innobase_commit_ordered() could return the
> corresponding LSN, as was suggested in another mail, and then
> commit_checkpoint_request() could pass that value.

Right. The trx_t objects are allocated from a special memory pool that facilitates fast reuse. The object of an old transaction could have been reused for something else. So, it would be better to let the storage engine somehow return its logical time. 64 bits should be sufficient for all storage engines, and the special value 0 could imply that the current logic (ask the storage engine to write everything) will be used.

[snip]

> Yes. Single-engine transactions are surely the important use case to optimise.
> It's nice if multi-engine transactions still work, but if they require
> multiple fsync still, I think that's perfectly fine, not something to
> allocate a lot of resources to optimise for.

It's nice that we agree here.

> I also had the idea to use fibers/coroutines as is mentioned in the MDEV
> description, but if that can be avoided, so much the better.

I too like https://quoteinvestigator.com/2011/05/13/einstein-simple/ or the KISS principle.
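The 64-bit logical-time idea discussed above could be sketched like this. All names here are invented for illustration; this is not the MariaDB handlerton interface, just a model of "commit reports an opaque engine time (an LSN for InnoDB), and 0 falls back to making everything durable":

```cpp
#include <cstdint>

// Hypothetical sketch of the API change discussed above: the engine reports
// an opaque 64-bit logical time per commit, and a durability check accepts
// that time back. The special value 0 means "wait for everything".
using engine_time_t = std::uint64_t;

struct engine_sketch {
  engine_time_t durable_time = 0;  // last durably written logical time
  engine_time_t current_time = 0;  // logical time of the latest change

  // Commit the transaction's "fast part" and report its logical time.
  engine_time_t commit_ordered_sketch() { return ++current_time; }

  // Is logical time `t` durable yet? 0 conservatively means "everything".
  bool is_durable(engine_time_t t) const {
    if (t == 0)
      t = current_time;  // current behavior: wait for all changes
    return durable_time >= t;
  }
};
```

The point of the 0 sentinel is that existing callers that do not know any logical time keep today's semantics unchanged.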
Example: If the buf_pool.mutex or fil_system.mutex is a bottleneck, fix the bottlenecks (MDEV-15053, MDEV-23855) instead of introducing complex things such as multiple buffer pool instances and multiple page cleaner threads (removed in MDEV-15058), or introducing a Fil_shard (MySQL 8.0). Or if the log_sys.mutex is a bottleneck, do not introduce a "jungle of threads" that will write to multiple log files, but just reduce the bottlenecks with increased use of std::atomic or with a file format change that allows more clever locking (MDEV-27774).

Thread context switches can be expensive when system calls such as mutex waits are involved, and when not, race conditions in lock-free algorithms are hard to diagnose (often invisible to tools like https://rr-project.org). Even when there are no system calls involved in inter-thread communication, "cache line ping-pong" can quickly become expensive, especially on NUMA systems.

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
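The "cache line ping-pong" hazard mentioned above can be illustrated with a small standalone example (not MariaDB code): two counters updated by different threads should not share a cache line, and C++17 provides a portable hint for the padding needed to keep them apart.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <new>

// Illustration of avoiding "cache line ping-pong" (false sharing): each
// hot counter gets its own cache line. hardware_destructive_interference_size
// is the standard hint; 64 bytes is a common fallback on x86.
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kCacheLine = 64;
#endif

struct shared_counters {
  alignas(kCacheLine) std::atomic<std::uint64_t> writes{0};
  alignas(kCacheLine) std::atomic<std::uint64_t> flushes{0};  // own line
};

static_assert(alignof(shared_counters) >= kCacheLine,
              "counters must not share a cache line");
```

Without the alignas padding, concurrent fetch_add() on the two counters would bounce the same cache line between cores, which is exactly the cost the message warns about on NUMA systems.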
Re: [Maria-developers] Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)
Marko Mäkelä writes:

>> recent binlog files. We want to know when scanning the old binlog file is no
>> longer necessary; then we will log a CHECKPOINT_EVENT recording this fact.
>> But we don't want to stall waiting for everything to be flushed to disk
>> immediately; we can just log the CHECKPOINT_EVENT later when appropriate.
>
> Is this the original purpose and motivation of the
> handlerton::commit_checkpoint_request?

Yes.

> I think that for InnoDB to guess the last binlog-committed transaction
> would require adding a thread-local variable, which could cause
> trouble with the connection thread pool. An added function parameter
> would be much simpler.

I agree that a function parameter seems simpler. We're requesting to be notified when a specific commit is durable in the engine, better specify _which_ commit than for InnoDB to guess, as you say :-). And an engine is free to ignore the parameter and just make sure all transactions existing at that point are committed (eg. current lsn), as my original InnoDB implementation did.

We need to be careful about lifetime. The transaction may no longer exist as a THD or trx_t (?) in memory. But innobase_commit_ordered() could return the corresponding LSN, as was suggested in another mail, and then commit_checkpoint_request() could pass that value. Something like that seems indeed a better API.

>> There seem to be some lock-free operations, I wonder if that makes sense,
>> this is not timing-critical code (it's called once per binlog rotation).
>
> It was a long time ago when I worked on this. The InnoDB log_sys.mutex
> certainly was a bottleneck before some InnoDB log related data
> structures were refactored to use C++11 std::atomic.

I was thinking that the log_flush_request list does not need anything fancy and could just use a normal mutex; it is so rarely accessed. And any more critical code (eg. using log_sys.mutex) could just do an unsafe check for a NULL log_flush_request list, at the worst skipping a binlog checkpoint notification until the next flush.

But all that becomes more complicated when RESET MASTER will hang until the checkpoint notification arrives. Then a lost checkpoint notification becomes a problem on an idle server. That's why I'm unhappy about the way RESET MASTER is done now (even though I implemented it, IIRC). It really was my intention that skipping a binlog checkpoint from time to time should be allowed, to simplify things in the engine.

> The 10.4 innobase_mysql_log_notify() follows the anti-pattern
> https://github.com/google/sanitizers/wiki/ThreadSanitizerPopularDataRaces#double-checked-locking
>
> The comment that claims that it is safe might not hold for
> architectures that use a weak memory model, such as typical
> implementations of ARM, POWER, and RISC-V (RVWMO).
> I believe that the
> https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Acquire_ordering
> that is employed by the 10.5 innodb_log_flush_request() and
> log_flush_notify() should be safer.

Right. As long as RESET MASTER doesn't tolerate delayed notifications, all of this remains a concern :-(

> If the last write to the log was an InnoDB internal one, it could be
> done by log_write_up_to(lsn, false).
> The log_sys.flushed_to_disk_lsn would only be advanced (to somewhere
> closer to log_sys.lsn) if there was a subsequent call to
> log_write_up_to(lsn, true).

What I found in the test case I added to MDEV-25611 is that InnoDB flushes the redo log every second (assuming I read the code right). This is the function srv_sync_log_buffer_in_background(). Hm, but actually this can be controlled by the option srv_flush_log_at_timeout, so my test case could be improved to set this to a high value instead of disabling srv_sync_log_buffer_in_background() in the code.

> Right, I did not think of multi-engine transactions. Perhaps there
> could be a special single-engine transaction mode?

Yes. Single-engine transactions are surely the important use case to optimise. It's nice if multi-engine transactions still work, but if they require multiple fsync still, I think that's perfectly fine, not something to allocate a lot of resources to optimise for.

> One more thing: The post
> https://knielsen-hq.org/w/fixing-mysql-group-commit-part-3/ is missing
> the fact that when the "fast part" of an InnoDB transaction state
> change has finished, the LSN of the transaction COMMIT (or XA PREPARE
> or XA ROLLBACK or XA COMMIT) will be determined. It is not possible to
> reorder transactions after that point. In fact, after the completion
> of the "fast part", other transactions in the system will observe the
> state change of the transaction, even before it becomes durable.

Thanks. Yes, this sounds like what I would expect.

> Another thing: Could replication make use of the log write completion
> mechanism https://jira.mariadb.org/browse/MDEV-24341 that was added to
> MariaDB Server 10.6.0?

Ah, this is interesting, thanks for the pointer. I
Re: [Maria-developers] Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)
On Tue, May 16, 2023 at 1:37 PM Andrei Elkin wrote:

> Can you comment on the chances of `flush_lsn` never advancing to the value of
> `write_lsn` when/because InnoDB has nothing to commit (so at such a time
> RM (or PURGE or FLUSH LOGS that have been mentioned too) might be run)?

If the last write to the log was an InnoDB internal one, it could be done by log_write_up_to(lsn, false). The log_sys.flushed_to_disk_lsn would only be advanced (to somewhere closer to log_sys.lsn) if there was a subsequent call to log_write_up_to(lsn, true). A log checkpoint would advance log_sys.flushed_to_disk_lsn, as should a user transaction commit. This could be a possible explanation.

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
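The two log watermarks discussed in the message above can be modelled in a few lines. This is an illustrative model with simplified names, not actual InnoDB code: `log_write_up_to(lsn, false)` only advances the written position, while `log_write_up_to(lsn, true)` also advances the durable (fdatasync'ed) position, so internal writes with durable=false can leave flushed_to_disk_lsn behind write_lsn indefinitely.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative model (not actual InnoDB code) of log_write_up_to() and the
// write/flush watermarks discussed above.
struct log_model {
  std::uint64_t lsn = 0;                  // current end of the log
  std::uint64_t write_lsn = 0;            // written to the OS
  std::uint64_t flushed_to_disk_lsn = 0;  // durably on disk

  void append(std::uint64_t bytes) { lsn += bytes; }

  // durable=false: write only. durable=true: write and fdatasync.
  void write_up_to(std::uint64_t target, bool durable) {
    target = std::min(target, lsn);
    write_lsn = std::max(write_lsn, target);
    if (durable)
      flushed_to_disk_lsn = std::max(flushed_to_disk_lsn, write_lsn);
  }
};
```

In this model, a sequence of durable=false writes reproduces exactly the situation Andrei asked about: write_lsn advances while flushed_to_disk_lsn stays put until a checkpoint or a commit requests durability.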
Re: [Maria-developers] Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)
On Fri, May 12, 2023 at 8:32 PM Kristian Nielsen wrote:

> The transaction of interest is the last one that called commit_ordered()
> prior to this call of commit_checkpoint_request(). I guess binlog code could
> save it (it's trivial as commit_ordered() calls are serialised), or InnoDB
> could do it itself.

I think that for InnoDB to guess the last binlog-committed transaction would require adding a thread-local variable, which could cause trouble with the connection thread pool. An added function parameter would be much simpler.

> But let's take it from what we actually want to achieve. We have created a
> new binlog file, and now crash recovery will have to scan the two most
> recent binlog files. We want to know when scanning the old binlog file is no
> longer necessary; then we will log a CHECKPOINT_EVENT recording this fact.
> But we don't want to stall waiting for everything to be flushed to disk
> immediately; we can just log the CHECKPOINT_EVENT later when appropriate.

Is this the original purpose and motivation of the handlerton::commit_checkpoint_request?

> There seem to be some lock-free operations, I wonder if that makes sense,
> this is not timing-critical code (it's called once per binlog rotation).

It was a long time ago when I worked on this. The InnoDB log_sys.mutex certainly was a bottleneck before some InnoDB log related data structures were refactored to use C++11 std::atomic. I might have originally misunderstood the source code comment in sql/handler.h that was added in MDEV-232.

The 10.4 innobase_mysql_log_notify() follows the anti-pattern https://github.com/google/sanitizers/wiki/ThreadSanitizerPopularDataRaces#double-checked-locking

The comment that claims that it is safe might not hold for architectures that use a weak memory model, such as typical implementations of ARM, POWER, and RISC-V (RVWMO).
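The hazard and its repair can be sketched as follows. This is an illustrative example with invented names, not the actual InnoDB list: a plain pointer read outside the mutex (the 10.4-style check) is a data race on weakly ordered hardware, whereas publishing and consuming the list head with release/acquire ordering (the 10.5-style approach) is well-defined.

```cpp
#include <atomic>
#include <mutex>

// Illustrative sketch (invented names) of the data-race hazard described
// above, and its release/acquire repair.
struct flush_request {
  flush_request *next;
};

std::mutex list_mutex;
std::atomic<flush_request *> list_head{nullptr};

void enqueue(flush_request *req) {
  std::lock_guard<std::mutex> g(list_mutex);
  req->next = list_head.load(std::memory_order_relaxed);
  // Release store: the fields of *req become visible to any acquire load
  // that observes the new head, even on ARM/POWER/RISC-V.
  list_head.store(req, std::memory_order_release);
}

flush_request *peek() {
  // Acquire load pairs with the release store above. Reading a plain
  // (non-atomic) pointer here instead would be the double-checked-locking
  // anti-pattern that the sanitizers wiki warns about.
  return list_head.load(std::memory_order_acquire);
}
```

The cheap lock-free peek() is only safe because every reachable node was published with release ordering; with a plain pointer the compiler and CPU are both free to reorder the initialization of *req past the store of the head.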
I believe that the https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Acquire_ordering that is employed by the 10.5 innodb_log_flush_request() and log_flush_notify() should be safer.

> There's also some complex logic to avoid missing a
> commit_checkpoint_notify_ha(). I can see how that is important and could
> occur in a completely idle server. But I think it can be done simpler, by
> simply scheduling a new check of this every second or so as long as there
> are pending checkpoint requests. This is all asynchronous, no point in
> trying to report the checkpoint immediately.

According to a comment in https://jira.mariadb.org/browse/MDEV-25611 there is an https://rr-project.org/ trace of a hang available to Andrei Elkin. It is also pretty easy to reproduce the hang by using some debug instrumentation that makes InnoDB try harder to avoid issuing log checkpoints (which would otherwise trigger another call to log_flush_notify() and thus "rescue" the hang). It would be very interesting to know what happened concurrently with or after the last call to innodb_log_flush_request().

> If InnoDB is running non-durably, I think there is no point in checking the
> log at all. It doesn't help that we scan the required log files if the
> transactions to be recovered from InnoDB were not prepared durably and are
> unrecoverable anyway. So in non-durable mode InnoDB could just call
> commit_checkpoint_notify_ha() immediately?

I do not think that we should spend too much effort to optimize an unsafe, corruption-prone mode of operation. We should be safe by default (https://jira.mariadb.org/browse/MDEV-16589) and optimize that case.

> I notice that RESET MASTER also does a commit_checkpoint_request() and waits
> for the notification. I can understand why I did it this way, but in
> hindsight it doesn't seem like a good idea. Binlog checkpoint is async, so
> we should not wait on it. RESET MASTER surely cannot be time critical, so
> probably it would be better to just ask InnoDB to flush everything to its
> log, skipping the checkpoint mechanism. Or just do nothing, does it make any
> sense to recover the binlog into a consistent state with InnoDB when we just
> deleted all of the binlog?!?

If RESET MASTER can rewind the binlog position, then it might make sense to ensure that the new binlog position is durably written to the persistent InnoDB transaction metadata storage.

> Anyway, these are comments on the current code. I still didn't fully
> understand what the problem is with the current API wrt. InnoDB?

Based on your description, it does not feel like a big design problem. I would very much like to see an analysis and fix of the MDEV-25611 hang. It could be something fairly trivial, or a sign of a more serious problem in the design.

> I have an old worklog where this idea was explored in some detail. It is not
> trivial, there are many corner cases where replaying binlog events will not
> reliably recover the state (eg. statement-based binlogging). People have
> learned to live with this for replication, but now that will be extended to
> the binlog.

One
Re: [Maria-developers] Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)
Marko Mäkelä writes:

> Do you know how MyRocks implements ACID? There is a function

Sorry, I do not. Maybe Sergey Petrunya does.

> That function does not currently take any parameter (such as THD) to
> identify the transaction of interest, and it cannot indicate that the
> most recent state change of the transaction has already been durably
> written.

The transaction of interest is the last one that called commit_ordered() prior to this call of commit_checkpoint_request(). I guess binlog code could save it (it's trivial as commit_ordered() calls are serialised), or InnoDB could do it itself.

But let's take it from what we actually want to achieve. We have created a new binlog file, and now crash recovery will have to scan the two most recent binlog files. We want to know when scanning the old binlog file is no longer necessary; then we will log a CHECKPOINT_EVENT recording this fact. But we don't want to stall waiting for everything to be flushed to disk immediately; we can just log the CHECKPOINT_EVENT later when appropriate.

We call commit_checkpoint_request() in all storage engines to receive a notification when all currently (binlog) committed transactions have become durable. Once the engine has synced past its corresponding LSN, it replies with commit_checkpoint_notify_ha(), and when all engines have done so, the CHECKPOINT_EVENT is binlogged.

In 10.4 innobase_checkpoint_request() the code saves the current LSN: lsn = log_get_lsn(). An extra check is made whether already lsn <= log_get_flush_lsn(); if so, commit_checkpoint_notify_ha() is called immediately. Else, we check again whenever the log is flushed, until the recorded lsn has been flushed. The code in 11.1 is different, though it seems to be doing something similar.

There seem to be some lock-free operations, I wonder if that makes sense; this is not timing-critical code (it's called once per binlog rotation). There's also some complex logic to avoid missing a commit_checkpoint_notify_ha(). I can see how that is important and could occur in a completely idle server. But I think it can be done simpler, by simply scheduling a new check of this every second or so as long as there are pending checkpoint requests. This is all asynchronous, no point in trying to report the checkpoint immediately.

If InnoDB is running non-durably, I think there is no point in checking the log at all. It doesn't help that we scan the required log files if the transactions to be recovered from InnoDB were not prepared durably and are unrecoverable anyway. So in non-durable mode InnoDB could just call commit_checkpoint_notify_ha() immediately?

I notice that RESET MASTER also does a commit_checkpoint_request() and waits for the notification. I can understand why I did it this way, but in hindsight it doesn't seem like a good idea. Binlog checkpoint is async, so we should not wait on it. RESET MASTER surely cannot be time critical, so probably it would be better to just ask InnoDB to flush everything to its log, skipping the checkpoint mechanism. Or just do nothing, does it make any sense to recover the binlog into a consistent state with InnoDB when we just deleted all of the binlog?!?

Anyway, these are comments on the current code. I still didn't fully understand what the problem is with the current API wrt. InnoDB? But I don't see a big problem with changing it either if InnoDB needs something different. All that is needed from the binlog side is to get a notification at some point that an old binlog file is no longer needed for recovery. If we can find a simpler way to do this, then that's good. Removing the checkpointing from RESET MASTER completely might be a good start.

> Where should a notification be initiated if all changes have already
> been written at the time handlerton::commit_checkpoint_request is
> called?

Then commit_checkpoint_notify_ha() could be called immediately before returning from commit_checkpoint_request().
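The 10.4-style request/notify flow described above can be condensed into a small model. The names here are invented; the real innobase_checkpoint_request() and log_flush_notify() differ, but the shape is the same: record the current LSN, notify immediately if it is already durable, otherwise remember the request and re-check it on every subsequent log flush.

```cpp
#include <cstdint>
#include <vector>

// Condensed model (hypothetical names) of the binlog checkpoint request
// flow described above.
struct checkpoint_tracker {
  std::uint64_t current_lsn = 0;
  std::uint64_t flushed_lsn = 0;
  std::vector<std::uint64_t> pending;  // LSNs awaiting notification
  int notifications = 0;  // stands in for commit_checkpoint_notify_ha()

  void checkpoint_request() {
    std::uint64_t lsn = current_lsn;  // lsn = log_get_lsn();
    if (lsn <= flushed_lsn)
      ++notifications;     // already durable: notify immediately
    else
      pending.push_back(lsn);  // re-check on the next log flush
  }

  void on_log_flush(std::uint64_t new_flushed_lsn) {
    flushed_lsn = new_flushed_lsn;
    auto it = pending.begin();
    while (it != pending.end()) {
      if (*it <= flushed_lsn) {
        ++notifications;  // the recorded lsn is now durable
        it = pending.erase(it);
      } else {
        ++it;
      }
    }
  }
};
```

The hang scenario under discussion corresponds to a pending entry that never gets a matching on_log_flush() call on an otherwise idle server.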
>> Since the binlog writing is essentially sequential in nature, it is more
>> efficient to do it in a single thread for all waiting threads, and this is
>> how the MariaDB binlog group commit is done.
>
> Yes, optimizing that would require a file format change, to replace
> the append-only-to-last-file with something else, such as one or
> multiple preallocated files, possibly arranged as a ring buffer.

Agree, pre-allocated would be more efficient.

> One related idea that has been floating around is to use the InnoDB
> log as a "doublewrite buffer" for short enough binlog event groups.
> The maximum mini-transaction size (dictated by the InnoDB recovery
> code) is something like 4 megabytes. On an InnoDB log checkpoint, any
> buffered binlog event groups would have to be durably written to the
> binlog. Likewise, if a binlog event group is too large to be buffered,
> it would have to be written and fdatasync()ed in advance.

Ah, yes, this is an interesting idea actually.

>> Yes, it is an interesting idea. What do you
Re: [Maria-developers] Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)
Hi Kristian,

Thank you for your detailed reply. Before I read the linked blog posts, here are a few quick comments.

> 1. The APIs (handlerton methods and server calls) are designed to be
> storage-engine independent. So this is one reason why we would not see an
> InnoDB LSN anywhere in the server part.

Every transactional storage engine under MariaDB should have some concept of logical time. It is not necessarily totally ordered like the InnoDB or Aria LSN. The Aria LSN consists of a 32-bit log file number and a 32-bit offset within a log file. The InnoDB LSN is a 64-bit offset, but the log file is written as a ring buffer, so the LSN will be mapped to a file offset using some modulo arithmetic.

Do you know how MyRocks implements ACID? There is a function rocksdb_flush_wal(), but it is commented out in MariaDB. In LevelDB, I see some SST and log files, but they may be specific to a single LSM tree, and I would suppose that there are multiple LSM trees in MyRocks, instead of the data from all tables being stored in a single LSM tree.

> commit_checkpoint_request() is called when the binlog wants to mark a point
> before which binlog files no longer need to be scanned (a "binlog
> checkpoint"). We can see in innobase_checkpoint_request() that this is
> mapped to an InnoDB LSN (using log_get_lsn()). This follows point (1) above,
> where the server-level API (binlog) is kept separate from storage engine
> internals (InnoDB LSN).

That function does not currently take any parameter (such as THD) to identify the transaction of interest, and it cannot indicate that the most recent state change of the transaction has already been durably written.

> I wonder if this is a real mismatch between the API and InnoDB, or a
> misunderstanding about what is intended from commit_checkpoint_request().
> The binlog checkpoint mechanism is completely asynchronous, and it is not a
> request for the storage engine to take any action to make things durable. It
> is merely a request for InnoDB to conservatively pick an LSN that is known
> to contain any transaction that has completed commit_ordered(), it may be
> any later LSN that is cheaper to obtain. Then later, when that LSN is
> eventually made durable on the engine's own initiative, the engine will call
> commit_checkpoint_notify_ha() to complete the binlog checkpoint.

Where should a notification be initiated if all changes have already been written at the time handlerton::commit_checkpoint_request is called?

> So if I understand correctly, records from parallel transactions are written
> concurrently into the InnoDB log in an interleaved fashion. So one thread
> can just put log records into a buffer as they are generated and continue
> query execution, and later they may be written or synced asynchronously by
> another thread.

Yes. Typically, durable writes are only cared about when there is a tangible change of persistent state, such as changing the state of a user transaction, or advancing the InnoDB log checkpoint. If a transaction is rolled back, redo log will be written, but nothing in that transaction should wait for write_lock or flush_lock. The rollback could have a side effect of making the commits of other transactions durable.

> Since the binlog writing is essentially sequential in nature, it is more
> efficient to do it in a single thread for all waiting threads, and this is
> how the MariaDB binlog group commit is done.

Yes, optimizing that would require a file format change, to replace the append-only-to-last-file with something else, such as one or multiple preallocated files, possibly arranged as a ring buffer.

> Another problem is that the binlog is a logical log (SQL statements), and
> fragile to base recovery on that _must_ be 100% reliable. If we don't want
> to recover InnoDB transactions by replaying binlog replication event groups,
> then we anyway need an fdatasync() inside InnoDB in the XA prepare step
> before we can be sure the transaction is durable, don't we?

Yes, it is a balance between writing more upfront, or potentially executing more recovery.

> Another idea would be to go in the opposite direction, where the storage
> engine could provide a logging service to the server (just like it provides
> tables). This way the binlog code could call into InnoDB to write the binlog
> records into the InnoDB log. But this requires changes on the InnoDB side
> (log archiving for one), and reading binlogs for replication will be slower
> as it's interleaved with unrelated log records.

That would work too, and in fact, I had proposed that years ago. But like you point out, there are obvious drawbacks with that.

One related idea that has been floating around is to use the InnoDB log as a "doublewrite buffer" for short enough binlog event groups. The maximum mini-transaction size (dictated by the InnoDB recovery code) is something like 4 megabytes. On an InnoDB log checkpoint, any buffered binlog event groups would have to be durably
Re: [Maria-developers] Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)
Marko Mäkelä writes:

> later. I understand that you were an early contributor to MariaDB and
> implemented some durability I/O optimization between the "distributed
> transaction" of binlog and the InnoDB write-ahead log (redo log,
> ib_logfile0).

Indeed, that was the group commit work:

https://m.facebook.com/notes/mark-callaghan/group-commit-in-mariadb-is-fast/10150211546215933/
https://knielsen-hq.org/w/fixing-mysql-group-commit-part-1/
https://knielsen-hq.org/w/fixing-mysql-group-commit-part-2/
https://knielsen-hq.org/w/fixing-mysql-group-commit-part-3/
https://knielsen-hq.org/w/fixing-mysql-group-commit-part-4-of-3/

> I think that some improvement in this area is long overdue. I find it

Agree. I think it would be fantastic to combine your extensive knowledge of InnoDB with my binlog/replication expertise to improve this! There's a lot of very interesting points in your mail:

> hard to read the high-level descriptions. Nowhere in
> https://jira.mariadb.org/browse/MDEV-232 or
> https://jira.mariadb.org/browse/MDEV-532 do I see any mention of any key
> terminology:

Indeed, my understanding of InnoDB internals is far from complete (though I do understand the concept of LSN at a high level of course). From the binlog side, there are some basic design goals that influence how the binlog group commit is implemented:

1. The APIs (handlerton methods and server calls) are designed to be storage-engine independent. So this is one reason why we would not see an InnoDB LSN anywhere in the server part.

2. Transactions (the replication term is "event group") are always written into the binlog as a single atomic (kind of) block. So no interleaving of transactions in the binlog, or transactions that span multiple binlog files.

3. Transactions in the binlogs are strictly ordered, even across different servers in the same replication topology (this is used for GTID and in-order optimistic parallel replication).

4. Identical commit order is enforced between the binlog and the storage engine. I think this is related to getting a consistent replication slave position from an innobackup or LVM/BTRFS snapshot backup.

> and trx_t::must_flush_log_later. There is also the confusingly named
> handlerton::commit_checkpoint_request that has nothing to do with log
> checkpoints, but with some kind of a durability request that

This was an optimisation to remove one fsync from commits when the binlog is enabled: https://knielsen-hq.org/w/even-faster-group-commit/

If the server crashes, then on restart the binlog transaction coordinator needs to recover the XA state between the binlog and the engines. For this it scans the storage engines and the binlog files for any pending transactions. Any binlog file that contains pending XA transactions needs to be scanned. Once all transactions in a binlog file are durably committed in the engines, it is preferred to not scan it, for efficiency reasons.

commit_checkpoint_request() is called when the binlog wants to mark a point before which binlog files no longer need to be scanned (a "binlog checkpoint"). We can see in innobase_checkpoint_request() that this is mapped to an InnoDB LSN (using log_get_lsn()). This follows point (1) above, where the server-level API (binlog) is kept separate from storage engine internals (InnoDB LSN).

> fails to
> identify the minimum logical time up to which everything is desired to
> be durable. (My talk

I wonder if this is a real mismatch between the API and InnoDB, or a misunderstanding about what is intended from commit_checkpoint_request(). The binlog checkpoint mechanism is completely asynchronous, and it is not a request for the storage engine to take any action to make things durable. It is merely a request for InnoDB to conservatively pick an LSN that is known to contain any transaction that has completed commit_ordered(); it may be any later LSN that is cheaper to obtain. Then later, when that LSN is eventually made durable on the engine's own initiative, the engine will call commit_checkpoint_notify_ha() to complete the binlog checkpoint.

The intention is that this will be very easy for an engine to implement, without any changes to how it manages internal logs and durability. It is just a way for the engine to periodically communicate that some LSN is now durable, in a way that can match the LSN to a binlog position without assuming a particular storage engine or LSN format. But if this intention fails, we should look into it.

> Some years ago, in MDEV-21534 Vladislav Vaintroub introduced
> write_lock and flush_lock in InnoDB. This works pretty well: a

> If
> everything up to "my" time has been written or durably written, I do
> not care to wait.

So if I understand correctly, records from parallel transactions are written concurrently into the InnoDB log in an interleaved fashion. So one thread can just put log records into a buffer as they are generated and continue query execution, and later they may be written or synced
[Maria-developers] Understanding binlog group commit (MDEV-232, MDEV-532, MDEV-25611, MDEV-18959)
Hi Kristian,

I am happy to see you being active on the maria-developers list. I don't think we have collaborated in the past. I have been working on the InnoDB storage engine under MySQL 4.0 through 5.7 (and some 8.0) since 2003, and since 2016 under MariaDB Server, mostly 10.1 and later. I understand that you were an early contributor to MariaDB and implemented some durability I/O optimization between the "distributed transaction" of binlog and the InnoDB write-ahead log (redo log, ib_logfile0).

I think that some improvement in this area is long overdue. I find it hard to read the high-level descriptions. Nowhere in https://jira.mariadb.org/browse/MDEV-232 or https://jira.mariadb.org/browse/MDEV-532 do I see any mention of any key terminology:

* InnoDB log sequence number (LSN): a logical point of time for any persistent InnoDB changes. Measured in bytes. Mini-transactions (atomic changes of InnoDB pages) start and end at some LSN. Corresponds to the Oracle system change number (SCN).

* Binlog offset: a logical point of time for any changes at the replication level. Measured in bytes. Replication events such as statements or row-level events start and end at some binlog offset.

* The LSN or binlog offset of a COMMIT, XA PREPARE, XA ROLLBACK, or XA COMMIT.

There is quite a bit of code in InnoDB that I would like to get rid of. Some of it is related to the data members trx_t::flush_log_later and trx_t::must_flush_log_later. There is also the confusingly named handlerton::commit_checkpoint_request, which has nothing to do with log checkpoints, but with some kind of a durability request that fails to identify the minimum logical time up to which everything is desired to be durable. (My talk https://fosdem.org/2022/schedule/event/mariadb_innodb/ explains what a log checkpoint is and how recovery works.)

Some years ago, in MDEV-21534 Vladislav Vaintroub introduced write_lock and flush_lock in InnoDB. This works pretty well: a Sysbench workload with innodb_flush_log_at_trx_commit=1 will linearly improve throughput as you add more client connection threads. I think that something similar would be beneficial for the binlog, to keep track of the latest written offset and the latest offset up to which fdatasync() has completed. I am not familiar with the binlog code. A colleague recently claimed to me that LOCK_commit_ordered is a huge bottleneck. As you can see, the write_lock and flush_lock are better because the lock requests include a notion of logical time: if everything up to "my" time has been written or durably written, I do not care to wait. In MDEV-27774 (MariaDB Server 10.8) this was improved further: multiple mini-transactions can copy their local log buffer to the shared log_sys.buf concurrently.

It seems to me that the missing tracking of logical time in the data structures is causing hangs like MDEV-25611. I feel that the design is based on a wrong assumption that the binlog-covered transactions are the only source of InnoDB log writes. That has never been the case: for example, there have always been background threads or tasks that purge the history of committed transactions. The task MDEV-18959 (ensuring that one fdatasync() per durable transaction suffices) is also related to this.

I think that the simplest solution (assuming that moving the binlog to a bunch of transactional tables is out of the question) is to ensure that one of the logs (binlog and other write-ahead logs) is always ahead. If we choose that the binlog must be ahead, then in InnoDB the write_lock acquisition would have to be preceded by a binlog_write_lock acquisition that will ensure that any relevant changes of persistent state (COMMIT, XA PREPARE, XA ROLLBACK, XA COMMIT) that could be part of the InnoDB log write have been durably written to the binlog.
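The "binlog must be ahead" rule proposed above can be sketched in a few lines. All names here are invented (binlog_ahead_model is not a MariaDB interface); the sketch only models the ordering constraint: before the redo log is made durable up to some LSN, the binlog must be durable up to the offset of every commit at or below that LSN.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>

// Sketch (invented names) of the ordering rule proposed above: any redo
// write up to `lsn` first makes durable the binlog offset of every commit
// at or below `lsn`, so the binlog is never behind the redo log.
struct binlog_ahead_model {
  std::map<std::uint64_t, std::uint64_t> lsn_to_binlog;  // commit LSN -> binlog offset
  std::uint64_t binlog_durable = 0;
  std::uint64_t redo_durable = 0;

  void redo_write_up_to(std::uint64_t lsn) {
    // Find the highest binlog offset of any commit at or below lsn...
    std::uint64_t need = 0;
    for (auto it = lsn_to_binlog.begin();
         it != lsn_to_binlog.end() && it->first <= lsn; ++it)
      need = std::max(need, it->second);
    // ...and durably write the binlog that far before the redo log.
    binlog_durable = std::max(binlog_durable, need);
    redo_durable = std::max(redo_durable, lsn);
  }
};
```

In InnoDB terms, the lookup step corresponds to the proposed binlog_write_lock acquisition that would precede the write_lock acquisition.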
With best regards,

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc