Re: [HACKERS] WAL SHM principles
>> When you mmap, you don't use write()! mlock actually locks the page in memory, and as long as the page is locked the OS doesn't attempt to store the dirty page. It is also intended for security apps, to ensure that sensitive data is not written to insecure storage (hdd). That is the definition of mlock, so you can probably be sure of it.
>
> News to me ... can you please point to such a definition? I see no reference to such semantics in the mlock() manual page in UNIX98 (Single Unix Standard, version 2).

Sorry, maybe I'm biased toward Linux. The statement above is from Linux's man page, and from what I saw in the mm code it seems to be right. I'm not sure about other unices.

> mlock() guarantees that the locked address space is in memory. This doesn't imply that updates are not written to the backing file.

Yes, it probably depends on the OS in question. In the Linux kernel the page is not written while mlocked (but I'm not sure about msync here).

> I would expect an OS that doesn't have a unified buffer cache but tries to keep a consistent view for mmap() and read()/write() to update the file.

Hmm, but why mlock the page then? Only to be sure the page is not swapped out?

regards, devik

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] WAL SHM principles
> When you mmap, you don't use write()! mlock actually locks the page in memory, and as long as the page is locked the OS doesn't attempt to store the dirty page. It is also intended for security apps, to ensure that sensitive data is not written to insecure storage (hdd). That is the definition of mlock, so you can probably be sure of it.

News to me ... can you please point to such a definition? I see no reference to such semantics in the mlock() manual page in UNIX98 (Single Unix Standard, version 2).

mlock() guarantees that the locked address space is in memory. This doesn't imply that updates are not written to the backing file.

I would expect an OS that doesn't have a unified buffer cache but tries to keep a consistent view for mmap() and read()/write() to update the file.

Regards, Giles
Re: [HACKERS] WAL SHM principles
Giles Lean [EMAIL PROTECTED] wrote:

>> When you mmap, you don't use write()! mlock actually locks the page in memory, and as long as the page is locked the OS doesn't attempt to store the dirty page. It is also intended for security apps, to ensure that sensitive data is not written to insecure storage (hdd). That is the definition of mlock, so you can probably be sure of it.
>
> News to me ... can you please point to such a definition? I see no reference to such semantics in the mlock() manual page in UNIX98 (Single Unix Standard, version 2).
>
> mlock() guarantees that the locked address space is in memory. This doesn't imply that updates are not written to the backing file.

I've wondered about this myself. It _is_ true on Linux that mlock prevents writes to the backing store, and this is used as a security feature for cryptography software. The code for gnupg assumes that if you have mlock() on any operating system, it does mean this--which doesn't mean it's true, but perhaps whoever wrote it does have good reason to think so.

But I don't know about other systems. Does anybody know what the POSIX.1b standard says?

It was even suggested to me on the linux-fsdev mailing list that mlock() was a good way to ensure the write-ahead condition.

Ken Hirsch
Re: [HACKERS] WAL SHM principles
On Tue, 13 Mar 2001, Ken Hirsch wrote:

>> mlock() guarantees that the locked address space is in memory. This doesn't imply that updates are not written to the backing file.
>
> I've wondered about this myself. It _is_ true on Linux that mlock prevents writes to the backing store,

I don't believe that this is true. The manpage offers no such promises, and the semantics are not useful.

> and this is used as a security feature for cryptography software.

mlock() is used to prevent pages being swapped out. Its use for crypto software is essentially restricted to anon memory (allocated via brk() or mmap() of /dev/zero).

If my understanding is accurate, before 2.4 Linux would never swap out pages which had a backing store. It would simply write them back or drop them (if clean). (This is why you need around twice as much swap with 2.4.)

> The code for gnupg assumes that if you have mlock() on any operating system, it does mean this--which doesn't mean it's true, but perhaps whoever wrote it does have good reason to think so.

strace on gpg startup says:

    mmap(0, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40015000
    getuid() = 500
    mlock(0x40015000) = -1 EPERM (Operation not permitted)

so whatever the authors think, it does not require this semantic.

Matthew.
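The pattern visible in that strace output can be sketched in C roughly as follows. This is only an illustration, not gnupg's actual code: `alloc_secret` and `free_secret` are hypothetical names, and the fallback policy (carry on unlocked when mlock() is refused, as in the EPERM case above) is an assumption.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: allocate `len` bytes of anonymous memory (no backing
 * file) and try to pin it in RAM. Returns the mapping, or NULL if mmap()
 * fails. *locked is set to 1 if mlock() succeeded and 0 if it was refused,
 * e.g. with EPERM for an unprivileged user. */
void *alloc_secret(size_t len, int *locked)
{
    assert(len > 0 && locked != NULL);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* Anonymous pages can only ever go to swap; mlock() keeps them out. */
    *locked = (mlock(p, len) == 0);
    return p;
}

/* Hypothetical counterpart: unlock (if it was locked) and unmap. */
void free_secret(void *p, size_t len, int locked)
{
    if (locked)
        munlock(p, len);
    munmap(p, len);
}
```

Because the mapping is anonymous, this says nothing about whether mlock() delays write-back of a file-backed mapping, which is the point in dispute in this thread.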
Re: [HACKERS] WAL SHM principles
* Matthew Kirkwood [EMAIL PROTECTED] [010313 13:12] wrote:

> On Tue, 13 Mar 2001, Ken Hirsch wrote:
>
>>> mlock() guarantees that the locked address space is in memory. This doesn't imply that updates are not written to the backing file.
>>
>> I've wondered about this myself. It _is_ true on Linux that mlock prevents writes to the backing store,
>
> I don't believe that this is true. The manpage offers no such promises, and the semantics are not useful.

Afaik FreeBSD's Linux emulator:

    revision 1.13
    date: 2001/02/28 04:30:27; author: dillon; state: Exp; lines: +3 -1
    Linux does not filesystem-sync file-backed writable mmap pages on a regular basis. Adjust our linux emulation to conform. This will cause more dirty pages to be left for the pagedaemon to deal with, but our new low-memory handling code can deal with it. The linux way appears to be a trend, and we may very well make MAP_NOSYNC the default for FreeBSD as well (once we have reasonable sequential write-behind heuristics for random faults).
    (will be MFC'd prior to 4.3 freeze)
    Suggested by: Andrew Gallatin

Basically any mmap'd data doesn't seem to get sync()'d out on a regular basis.

>> and this is used as a security feature for cryptography software.
>
> mlock() is used to prevent pages being swapped out. Its use for crypto software is essentially restricted to anon memory (allocated via brk() or mmap() of /dev/zero).

What about userland device drivers that want to send parts of a disk backed file to a driver's dma routine?

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] WAL SHM principles
On Tue, 13 Mar 2001, Alfred Perlstein wrote:

[..]

> Linux does not filesystem-sync file-backed writable mmap pages on a regular basis.

Very interesting. I'm not sure that is necessarily the case in 2.4, though -- my understanding is that the new all-singing, all-dancing page cache makes very little distinction between mapped and unmapped dirty pages.

> Basically any mmap'd data doesn't seem to get sync()'d out on a regular basis.

Hmm.. I'd call that a bug, anyway.

>>> and this is used as a security feature for cryptography software.
>>
>> mlock() is used to prevent pages being swapped out. Its use for crypto software is essentially restricted to anon memory (allocated via brk() or mmap() of /dev/zero).
>
> What about userland device drivers that want to send parts of a disk backed file to a driver's dma routine?

And realtime software. I'm not disputing that mlock is useful, but what it can do for security software is not that huge. The Linux manpage says:

    Memory locking has two main applications: real-time algorithms and high-security data processing.

Matthew.
Re: [HACKERS] WAL SHM principles
"Mikheev, Vadim" [EMAIL PROTECTED] wrote:

>> It is possible to build a logging system so that you mostly don't care when the data blocks get written; a particular data block on disk is considered garbage until the next checkpoint, so that you
>
> How to know if a particular data page was modified if there is no log record for that modification? (Ie how to know where is garbage? -:))

You could store a log sequence number in the data page header that indicates the log address of the last log record that was applied to the page. This is described in Bernstein and Newcomer's book (sec 8.5 operation logging).

Sorry if I'm misunderstanding the question. Back to lurking mode...
RE: [HACKERS] WAL SHM principles
>>> It is possible to build a logging system so that you mostly don't care when the data blocks get written; a particular data block on disk is considered garbage until the next checkpoint, so that you
>>
>> How to know if a particular data page was modified if there is no log record for that modification? (Ie how to know where is garbage? -:))
>
> You could store a log sequence number in the data page header that indicates the log address of the last log record that was applied to the page.

We do. But how to know at the time of recovery that there is a page in a multi-Gb index file with a tuple pointing to an uninserted table row?

Well, actually we could make some improvements in this area: a buffer without a "first after checkpoint" modification could be written without flushing log records: the entire block will be rewritten on recovery. Not sure how much we get, though -:)

Vadim
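A minimal sketch of the page-header LSN check discussed above, assuming a toy page and log record layout. The names `Page`, `LogRecord`, and `redo` are illustrative only and are not PostgreSQL's actual structures.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t Lsn;

/* Toy data page: the header stores the LSN of the last log record
 * that was applied to the page. */
typedef struct {
    Lsn  page_lsn;
    char data[64];
} Page;

/* Toy redo record: "put `len` bytes at `offset`", stamped with its LSN. */
typedef struct {
    Lsn      lsn;
    uint16_t offset;
    uint16_t len;
    char     bytes[16];
} LogRecord;

/* Replay one record during recovery. Returns 1 if the record was applied,
 * 0 if the page already reflects it (page_lsn >= record LSN). */
int redo(Page *page, const LogRecord *rec)
{
    assert(rec->len <= sizeof rec->bytes);
    assert((size_t)rec->offset + rec->len <= sizeof page->data);

    if (page->page_lsn >= rec->lsn)
        return 0;                      /* change already on the page */

    memcpy(page->data + rec->offset, rec->bytes, rec->len);
    page->page_lsn = rec->lsn;         /* remember how far this page got */
    return 1;
}
```

The check only tells recovery whether a logged change reached a given page. It does not by itself answer Vadim's question about an index page that reached disk with no log record at all.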
Re: [HACKERS] WAL SHM principles
Sorry for taking so long to reply...

On Wed, Mar 07, 2001 at 01:27:34PM -0800, Mikheev, Vadim wrote:

> Nathan wrote:
>> It is possible to build a logging system so that you mostly don't care when the data blocks get written [after being changed, as long as they get written by an fsync]; a particular data block on disk is considered garbage until the next checkpoint, so that you
>
> How to know if a particular data page was modified if there is no log record for that modification? (Ie how to know where is garbage? -:))

In such a scheme, any block on disk not referenced up to (and including) the last checkpoint is garbage, and is either blank or reflects a recent logged or soon-to-be-logged change. Everything written (except in the log) after the checkpoint thus has to happen in blocks not otherwise referenced from on-disk -- except in other post-checkpoint blocks.

During recovery, the log contents get written to those pages during startup. Blocks that actually got written before the crash are not changed by being overwritten from the log, but that's ok. If they got written before the corresponding log entry, too, nothing references them, so they are considered blank.

>> might as well allow the blocks to be written any time, even before the log entry.
>
> And what to do with index tuples pointing to unupdated heap pages after that?

Maybe index pages are cached in shm and copied to mmapped blocks after it is ok for them to be written.

What platforms does PG run on that don't have mmap()?

Nathan Myers [EMAIL PROTECTED]
Re: [HACKERS] WAL SHM principles
>> BTW, what means "bummer" ?
>
> Sorry, it means, "Oh, I am disappointed."

thanks :)

>> But for many OSes you CAN control when to write data - you can mlock individual pages.
>
> mlock() controls locking in physical memory. I don't see it controlling write().

When you mmap, you don't use write()! mlock actually locks the page in memory, and as long as the page is locked the OS doesn't attempt to store the dirty page. It is also intended for security apps, to ensure that sensitive data is not written to insecure storage (hdd). That is the definition of mlock, so you can probably be sure of it.

There is a way to do it without mlock (fallback): You definitely need some kind of page headers. The header should have info on whether the page can be mmaped or is in the "dirty pool". Pages in the dirty pool are pages which are dirty but not written yet, and are waiting for the appropriate log record to be flushed. When the log is flushed, the data in the dirty pool can be copied to its regular mmap location and discarded. If the dirty pool gets too large, simply sync the log and the whole pool can be discarded.

The mmap version could be faster when loading data from hdd and will result in better utilization of memory (because you are working directly with data in the OS' page cache instead of having duplicates in pg's buffer cache). Also, page cache expiration is handled by the OS, and it will allow pg to use as much memory as is available (no need to specify buffer page size).

devik
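The "dirty pool" fallback described above could look roughly like the following sketch. Everything here is hypothetical: the names (`DirtyPool`, `pool_put`, `pool_release`), the tiny fixed sizes, and the assumption that each page has at most one pending image in the pool at a time.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define POOL_BLOCK_SIZE 64   /* toy page size, for illustration only */
#define POOL_SLOTS      8

/* One modified page image waiting for its log record to be flushed. */
typedef struct {
    int            used;
    uint64_t       lsn;                    /* log must be durable up to here */
    unsigned char *home;                   /* the page's regular (mmap'ed) location */
    unsigned char  copy[POOL_BLOCK_SIZE];  /* private, not-yet-visible image */
} PoolSlot;

typedef struct {
    PoolSlot slots[POOL_SLOTS];
} DirtyPool;

/* Stash a modified image instead of touching the mmap'ed page directly.
 * Returns 0 on success, -1 if the pool is full; in that case the caller
 * would sync the log and call pool_release() to empty the pool. */
int pool_put(DirtyPool *pool, unsigned char *home,
             const unsigned char *image, uint64_t lsn)
{
    assert(pool != NULL && home != NULL && image != NULL);
    for (int i = 0; i < POOL_SLOTS; i++) {
        PoolSlot *s = &pool->slots[i];
        if (!s->used) {
            s->used = 1;
            s->lsn  = lsn;
            s->home = home;
            memcpy(s->copy, image, POOL_BLOCK_SIZE);
            return 0;
        }
    }
    return -1;
}

/* Once the log is flushed through `flushed_lsn`, copy every eligible image
 * to its home location and free the slot. Returns the number released. */
int pool_release(DirtyPool *pool, uint64_t flushed_lsn)
{
    int released = 0;
    for (int i = 0; i < POOL_SLOTS; i++) {
        PoolSlot *s = &pool->slots[i];
        if (s->used && s->lsn <= flushed_lsn) {
            memcpy(s->home, s->copy, POOL_BLOCK_SIZE);
            s->used = 0;
            released++;
        }
    }
    return released;
}
```

In this sketch the mapped page is never dirtied before its log record is durable, so the OS is free to write mapped pages whenever it likes. The cost is an extra copy per modified page, and readers would need to look in the pool first.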
RE: [HACKERS] WAL SHM principles
>> Pros: upper layers can think that buffers are always safe/logged and there is no special handling for indices; very simple/fast redo
>> Cons: can't implement undo - but in non-overwriting is not needed (?)
>
> But needed if we want to get rid of vacuum and have savepoints.

Hmm. How do you implement savepoints? When there is a rollback to a savepoint, do you use xlog to undo all changes which the particular transaction has done? Hmmm, it seems nice ... these records are locked by that transaction, so it can safely undo them :-) Am I right?

But how can you use xlog to get rid of vacuum? Do you treat all delete log records as candidates for free space?

regards, devik
RE: [HACKERS] WAL SHM principles
>> But needed if we want to get rid of vacuum and have savepoints.
>
> Hmm. How do you implement savepoints? When there is a rollback to a savepoint, do you use xlog to undo all changes which the particular transaction has done? Hmmm, it seems nice ... these records are locked by that transaction, so it can safely undo them :-) Am I right?

Yes, but there are no savepoints in 7.1 - hopefully in 7.2

> But how can you use xlog to get rid of vacuum? Do you treat all delete log records as candidates for free space?

Vacuum removes deleted records *and* records inserted by aborted transactions - the latter will be removed by UNDO.

Vadim
Re: [HACKERS] WAL SHM principles
On Thu, 8 Mar 2001, Martin Devera wrote:

>>> Unfortunately, this alone is a *fatal* objection. See nearby discussions about WAL behavior: we must be able to control the relative timing of WAL write/flush and data page writes.
>>
>> Bummer.
>
> BTW, what means "bummer" ?

It's a Postgres-specific extension to the SQL standard. It means "I am disappointed". As far as I can tell, you _may_ use it as a column or table name. :-)

Tim

-- 
Tim Allen [EMAIL PROTECTED]
Proximity Pty Ltd http://www.proximity.com.au/
http://www4.tpg.com.au/users/rita_tim/
Re: [HACKERS] WAL SHM principles
>>> Bruce Momjian [EMAIL PROTECTED] writes:
>>>> The only problem is that we would no longer have control over which pages made it to disk. The OS would perhaps write pages as we modified them. Not sure how important that is.
>>>
>>> Unfortunately, this alone is a *fatal* objection. See nearby discussions about WAL behavior: we must be able to control the relative timing of WAL write/flush and data page writes.
>>
>> Bummer.
>
> BTW, what means "bummer" ?

Sorry, it means, "Oh, I am disappointed."

> But for many OSes you CAN control when to write data - you can mlock individual pages.

mlock() controls locking in physical memory. I don't see it controlling write().

-- 
Bruce Momjian
[HACKERS] WAL SHM principles
Hello,

maybe I missed something, but in the last few days I was thinking about how I would write my own sql server. I got several ideas, and because these are not used in PG they are probably bad - but I can't figure out why.

1) WAL

We have a buffer manager, ok. So why not use WAL as part of it, and log not INSERT/UPDATE/DELETE xlog records but the changes to buffer pages directly? When someone dirties a page it has to inform bmgr about the dirty region, and bmgr would formulate the xlog record. The record could be, for example, a fixed bitmap where each bit corresponds to a part of the page (of size pgsize/no-of-bits) which was changed. The changed regions follow the bitmap. Multiple writes (by multiple backends) can be coalesced together as long as their transactions overlap and there is enough memory to keep the changed buffer pages in memory.

Pros: upper layers can think that buffers are always safe/logged and there is no special handling for indices; very simple/fast redo
Cons: can't implement undo - but in non-overwriting is not needed (?)

2) SHM vs. MMAP

Why not use mmap to share pages (instead of shm)? There would be no problem with tuning pg's buffer cache size - it is balanced by the OS. When using SHM there are often two copies of a page: one in the OS' page cache and one in SHM (waste of memory). When using mmap the data goes (almost) directly from HDD into your memory page - now you need to copy it from the OS' page to PG's page.

There is one problem: how to ensure that a dirtied page is not flushed before its xlog. One can use mlock, but you often need root privileges to use it. Another way is to implement our own COW (copy on write) to create intermediate buffers used only until the xlog is flushed.

Are these considerations correct?

regards, devik
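The bitmap-style xlog record from point 1 might be sketched like this. The sizes (an 8192-byte page split into 32 regions) and the names `DeltaRecord`, `mark_dirty`, `build_record`, and `apply_record` are assumptions made for illustration, not an existing PostgreSQL format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PG_PAGE_SIZE 8192                          /* assumed page size */
#define NUM_REGIONS  32                            /* one bitmap bit per region */
#define REGION_SIZE  (PG_PAGE_SIZE / NUM_REGIONS)  /* 256 bytes */

/* Log record: a fixed bitmap, followed by the changed regions in bit order. */
typedef struct {
    uint32_t      bitmap;
    size_t        len;                     /* bytes used in payload */
    unsigned char payload[PG_PAGE_SIZE];
} DeltaRecord;

/* Called when a backend dirties bytes [off, off+len) of a page: set the
 * bit of every region the write touches. Bits only accumulate, so writes
 * from several backends coalesce into one bitmap. */
uint32_t mark_dirty(uint32_t bitmap, size_t off, size_t len)
{
    assert(len > 0 && off + len <= PG_PAGE_SIZE);
    size_t first = off / REGION_SIZE;
    size_t last  = (off + len - 1) / REGION_SIZE;
    for (size_t i = first; i <= last; i++)
        bitmap |= (uint32_t)1 << i;
    return bitmap;
}

/* Build the record: copy each dirty region out of the page. */
void build_record(DeltaRecord *rec, const unsigned char *page, uint32_t bitmap)
{
    rec->bitmap = bitmap;
    rec->len = 0;
    for (size_t i = 0; i < NUM_REGIONS; i++) {
        if (bitmap & ((uint32_t)1 << i)) {
            memcpy(rec->payload + rec->len, page + i * REGION_SIZE, REGION_SIZE);
            rec->len += REGION_SIZE;
        }
    }
}

/* Redo: copy each logged region back over the page. */
void apply_record(unsigned char *page, const DeltaRecord *rec)
{
    size_t src = 0;
    for (size_t i = 0; i < NUM_REGIONS; i++) {
        if (rec->bitmap & ((uint32_t)1 << i)) {
            memcpy(page + i * REGION_SIZE, rec->payload + src, REGION_SIZE);
            src += REGION_SIZE;
        }
    }
}
```

Redo is a plain byte copy, which matches the "very simple/fast redo" claim. The record holds only after-images, so it carries nothing that could be used for undo.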
Re: [HACKERS] WAL SHM principles
> This was brought up a week ago, and I consider it an interesting idea. The only problem is that we would no longer have control over which pages made it to disk. The OS would perhaps write pages as we modified them. Not sure how important that is.

Yes. As I work on the linux kernel I know something about it. When a page is accessed the CPU sets one bit in the PTE. The OS writes the page when it needs the page frame. It also tries to launder pages periodically, but the actual algorithm changes too often in recent kernels ;-)

Also, a page write is not atomic - several buffer heads are filled for the page and asynchronously posted for write. The elevator then sorts and coalesces these buffer heads and creates the actual scsi/ide write requests. But there is no guarantee that buffer heads from one page will be coalesced into one write request ...

You can call mlock (PageLock on Win32) to lock a page in memory. You can postpone the write using it. It is ok under Win32 and many unices, but under linux only admin or one with CAP_MEMLOCK (not exact name) can mlock.

> The good news is that most/all OS's are smart enough that if two processes mmap() the same file, they see each other's changes, so in a

yes, when using the SHARED flag to mmap then IMHO it is mandatory for an OS

> sense it is shared memory, but a much larger, smarter pool of shared memory than what we have now. We would still need buffer headers and stuff because we need to synchronize access to the buffers.

Also some smart algorithm which tries to mmap several pages in one continuous block. You can mmap each page on its own, but OSes store mmap information per page range. You need to minimize the number of such ranges.

devik
Re: [HACKERS] WAL SHM principles
Bruce Momjian [EMAIL PROTECTED] writes:

> The only problem is that we would no longer have control over which pages made it to disk. The OS would perhaps write pages as we modified them. Not sure how important that is.

Unfortunately, this alone is a *fatal* objection. See nearby discussions about WAL behavior: we must be able to control the relative timing of WAL write/flush and data page writes.

regards, tom lane
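The ordering constraint referred to here, that a data page may reach disk only after the log covering its latest change is durable, can be illustrated with a small in-memory model. The `Wal` struct and both functions are hypothetical stand-ins; no real I/O is performed and this is not PostgreSQL's code.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the log's state. */
typedef struct {
    uint64_t insert_lsn;   /* end of the log records generated so far */
    uint64_t flushed_lsn;  /* log is durable on disk up to here */
    int      page_writes;  /* data page writes issued (for illustration) */
} Wal;

/* Stand-in for write()+fsync() of the log up to `upto`. */
void wal_flush(Wal *wal, uint64_t upto)
{
    assert(upto <= wal->insert_lsn);
    if (upto > wal->flushed_lsn)
        wal->flushed_lsn = upto;
}

/* Write-ahead rule: before a data page goes to disk, the log must be
 * flushed at least up to the LSN of the last record that changed it. */
void write_data_page(Wal *wal, uint64_t page_lsn)
{
    wal_flush(wal, page_lsn);
    assert(wal->flushed_lsn >= page_lsn);
    wal->page_writes++;    /* stand-in for the real page write */
}
```

With a private buffer pool the database chooses when `write_data_page` runs. With a plain shared mmap of the data files, the OS may write a dirty page at any time, so nothing enforces the `wal_flush` step first.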
Re: [HACKERS] WAL SHM principles
> Bruce Momjian [EMAIL PROTECTED] writes:
>> The only problem is that we would no longer have control over which pages made it to disk. The OS would perhaps write pages as we modified them. Not sure how important that is.
>
> Unfortunately, this alone is a *fatal* objection. See nearby discussions about WAL behavior: we must be able to control the relative timing of WAL write/flush and data page writes.

Bummer.

-- 
Bruce Momjian
Re: [HACKERS] WAL SHM principles
On Wed, Mar 07, 2001 at 11:21:37AM -0500, Tom Lane wrote:

> Bruce Momjian [EMAIL PROTECTED] writes:
>> The only problem is that we would no longer have control over which pages made it to disk. The OS would perhaps write pages as we modified them. Not sure how important that is.
>
> Unfortunately, this alone is a *fatal* objection. See nearby discussions about WAL behavior: we must be able to control the relative timing of WAL write/flush and data page writes.

Not so fast!

It is possible to build a logging system so that you mostly don't care when the data blocks get written; a particular data block on disk is considered garbage until the next checkpoint, so that you might as well allow the blocks to be written any time, even before the log entry.

Letting the OS manage sharing of disk block images via mmap should be an enormous win vs. a fixed shm and manual scheduling by PG. If that requires changes in the logging protocol, it's worth it.

(What supported platforms don't have mmap?)

Nathan Myers [EMAIL PROTECTED]
RE: [HACKERS] WAL SHM principles
> It is possible to build a logging system so that you mostly don't care when the data blocks get written; a particular data block on disk is considered garbage until the next checkpoint, so that you

How to know if a particular data page was modified if there is no log record for that modification? (Ie how to know where is garbage? -:))

> might as well allow the blocks to be written any time, even before the log entry.

And what to do with index tuples pointing to unupdated heap pages after that?

Vadim
RE: [HACKERS] WAL SHM principles
> 1) WAL
>
> We have a buffer manager, ok. So why not use WAL as part of it, and log not INSERT/UPDATE/DELETE xlog records but the changes to buffer pages directly? When someone dirties a page it has to inform bmgr about the dirty region, and bmgr would formulate the xlog record. The record could be, for example, a fixed bitmap where each bit corresponds to a part of the page (of size pgsize/no-of-bits) which was changed. The changed regions follow the bitmap. Multiple writes (by multiple backends) can be coalesced together as long as their transactions overlap and there is enough memory to keep the changed buffer pages in memory.
>
> Pros: upper layers can think that buffers are always safe/logged and there is no special handling for indices; very simple/fast redo
> Cons: can't implement undo - but in non-overwriting is not needed (?)

But needed if we want to get rid of vacuum and have savepoints.

Vadim