Re: AdvanceXLInsertBuffers() vs wal_sync_method=open_datasync
Hi,

On 2023-11-10 17:16:35 +0200, Heikki Linnakangas wrote:
> On 10/11/2023 05:54, Andres Freund wrote:
> > In this case I had used wal_sync_method=open_datasync - it's often faster
> > and if we want to scale WAL writes more we'll have to use it more widely
> > (you can't have multiple fdatasyncs in progress and reason about which one
> > affects what, but you can have multiple DSYNC writes in progress at the
> > same time).
>
> Not sure I understand that. If you issue an fdatasync, it will sync all
> writes that were complete before the fdatasync started. Right? If you have
> multiple fdatasyncs in progress, that's true for each fdatasync. Or is there
> a bottleneck in the kernel with multiple in-progress fdatasyncs or
> something?

Many filesystems only allow a single fdatasync to really be in progress at a
time - they eventually acquire an inode-specific lock. More problematic cases
include things like a write followed by an fdatasync, followed by a write of
the same block in another process/thread - there's very little guarantee
about which contents of that block are now durable.

But more importantly, using fdatasync doesn't scale, because it effectively
has to flush the entire write cache on the device - which often contains
plenty of other dirty data. Whereas O_DSYNC can use FUA writes, which make
just the individual WAL writes write through the cache, while leaving the
rest of the cache "unaffected".

> > After a bit of confused staring and debugging I figured out that the
> > problem is that the RequestXLogSwitch() within the code for starting a
> > basebackup was triggering writing back the WAL in individual 8kB writes
> > via GetXLogBuffer()->AdvanceXLInsertBuffer(). With open_datasync each of
> > these writes is durable - on this drive each takes about 1ms.
>
> I see. So the assumption in AdvanceXLInsertBuffer() is that XLogWrite() is
> relatively fast. But with open_datasync, it's not.
I'm not sure that was an explicit assumption rather than just how it worked
out.

> > To fix this, I suspect we need to make
> > GetXLogBuffer()->AdvanceXLInsertBuffer() flush more aggressively. In this
> > specific case, we even know for sure that we are going to fill a lot more
> > buffers, so no heuristic would be needed. In other cases however we need
> > some heuristic to know how much to write out.
>
> +1. Maybe use the same logic as in XLogFlush().

I've actually been wondering about moving all the handling of WALWriteLock
into XLogWrite() and/or a new function called from all the places calling
XLogWrite().

I suspect we can't quite use the same logic in AdvanceXLInsertBuffer() as we
do in XLogFlush() - e.g. we don't ever want to trigger flushing out a
partially filled page, nor do we ever want to unnecessarily wait for a WAL
insertion to complete when we don't have to.

> I wonder if the 'flexible' argument to XLogWrite() is too inflexible. It
> would be nice to pass a hard minimum XLogRecPtr that it must write up to,
> but still allow it to write more than that if it's convenient.

Yes, I've also thought that. In the AIOified WAL code I ended up tracking
"minimum" and "optimal" write/flush locations.

Greetings,

Andres Freund
Re: AdvanceXLInsertBuffers() vs wal_sync_method=open_datasync
On 10/11/2023 05:54, Andres Freund wrote:
> In this case I had used wal_sync_method=open_datasync - it's often faster
> and if we want to scale WAL writes more we'll have to use it more widely
> (you can't have multiple fdatasyncs in progress and reason about which one
> affects what, but you can have multiple DSYNC writes in progress at the
> same time).

Not sure I understand that. If you issue an fdatasync, it will sync all
writes that were complete before the fdatasync started. Right? If you have
multiple fdatasyncs in progress, that's true for each fdatasync. Or is there
a bottleneck in the kernel with multiple in-progress fdatasyncs or
something?

> After a bit of confused staring and debugging I figured out that the
> problem is that the RequestXLogSwitch() within the code for starting a
> basebackup was triggering writing back the WAL in individual 8kB writes
> via GetXLogBuffer()->AdvanceXLInsertBuffer(). With open_datasync each of
> these writes is durable - on this drive each takes about 1ms.

I see. So the assumption in AdvanceXLInsertBuffer() is that XLogWrite() is
relatively fast. But with open_datasync, it's not.

> To fix this, I suspect we need to make
> GetXLogBuffer()->AdvanceXLInsertBuffer() flush more aggressively. In this
> specific case, we even know for sure that we are going to fill a lot more
> buffers, so no heuristic would be needed. In other cases however we need
> some heuristic to know how much to write out.

+1. Maybe use the same logic as in XLogFlush().

I wonder if the 'flexible' argument to XLogWrite() is too inflexible. It
would be nice to pass a hard minimum XLogRecPtr that it must write up to,
but still allow it to write more than that if it's convenient.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
AdvanceXLInsertBuffers() vs wal_sync_method=open_datasync
Hi,

I just created a primary with wal_segment_size=512. Then tried to create a
standby via pg_basebackup. The pg_basebackup appeared to just hang, for
quite a while, but did eventually complete. Over a minute for an empty
cluster, when using -c fast.

In this case I had used wal_sync_method=open_datasync - it's often faster
and if we want to scale WAL writes more we'll have to use it more widely
(you can't have multiple fdatasyncs in progress and reason about which one
affects what, but you can have multiple DSYNC writes in progress at the same
time).

After a bit of confused staring and debugging I figured out that the problem
is that the RequestXLogSwitch() within the code for starting a basebackup
was triggering writing back the WAL in individual 8kB writes via
GetXLogBuffer()->AdvanceXLInsertBuffer(). With open_datasync each of these
writes is durable - on this drive each takes about 1ms.

Normally we write out WAL in bigger chunks - but as it turns out, we don't
have any logic for doing larger writes when AdvanceXLInsertBuffer() is
called from within GetXLogBuffer(). We just try to make enough space so that
one buffer can be replaced.

The times for a single SELECT pg_switch_wal() on this system, when using
open_datasync and a 512MB segment, are:

wal_buffers   time for pg_switch_wal()
16            64s
100           53s
400           13s
600           1.3s

That's pretty bad. We don't really benefit from more buffering here, it just
avoids flushing in tiny increments. With a smaller wal_buffers, the large
record written by pg_switch_wal() needs to replace buffers it itself
inserted, and does so one by one. If we never re-encounter a buffer we
inserted ourselves earlier, due to a larger wal_buffers, the problem isn't
present.

This can bite with smaller segments too; it doesn't require large ones.
The reason this doesn't constantly become an issue is that walwriter
normally tries to write out WAL, and if it does, the AdvanceXLInsertBuffer()
calls in backends don't need to (walsender also calls
AdvanceXLInsertBuffer(), but it won't ever write out data).

In my case, walwriter is actually trying to do something - but it never gets
WALWriteLock. The semaphore does get set after AdvanceXLInsertBuffer()
releases WALWriteLock, but on this system walwriter never succeeds in taking
the lwlock before AdvanceXLInsertBuffer() re-acquires it.

I think it might be a lucky accident that the problem was visible this
blatantly in this one case - I suspect this behaviour is encountered during
normal operation in the wild, but is much harder to pinpoint, because it
doesn't happen "exclusively". E.g. I see a lot higher throughput
bulk-loading data with larger wal_buffers when using open_datasync, but
basically no difference when using fdatasync. And there are a lot of
wal_buffers_full writes.

To fix this, I suspect we need to make
GetXLogBuffer()->AdvanceXLInsertBuffer() flush more aggressively. In this
specific case, we even know for sure that we are going to fill a lot more
buffers, so no heuristic would be needed. In other cases however we need
some heuristic to know how much to write out.

Given how *extremely* aggressive we are about flushing out nearly all
pending WAL in XLogFlush(), I'm not sure there's much point in not also
being somewhat aggressive in GetXLogBuffer()->AdvanceXLInsertBuffer().

Greetings,

Andres Freund