Re: SLES7 mmap problem on s390
Ben Marzinski wrote:

> There is a reproducible memory mapping problem with the s390 SuSE linux setup we have. The bug occurs when two processes have private, read-only mappings of the same file and both processes page in the same page at the same time. The PTE for that page gets incorrectly marked dirty, which causes the page to be marked dirty, and the writepage() address space operation to be called. Nothing that the processes have done should have caused the page to be written back to the file. The file is modified even if the whole filesystem is mounted Read-Only.

There does indeed appear to be a race condition in the s390-specific memory management backend that could explain the symptom you're seeing.

The problem is that the S/390 does not have a dirty bit that resides in the PTE like most platforms, and which gets set whenever the processor writes to the page via the mapping defined by this PTE. Instead we have a dirty bit in the 'storage key'; there is one storage key associated with every physical page. The dirty bit there is set by the memory subsystem on every write to the page, no matter via which PTE, and even on DMA writes by hardware devices.

This means that pages from memory-mapped files would usually start out with the dirty bit set (because the page was read in from backing store, and doing so would access the physical page and set the dirty bit). So we have to reset the dirty bit, which we do when the page is first mapped into a user address space. (Note we must only do so on the *first* mapping to user space, subsequent mapping of a page must not reset a dirty bit that might have been set in the mean time.)

Unfortunately, this logic would appear to exhibit a race condition in the situation you're describing. When a process faults in a page from a memory-mapped file, the following steps happen in sequence:

- the page is looked up in the page cache; if not found, a new page is allocated and added to the page cache. In any case, at this point the page reference count is incremented.

- if the page is not uptodate, a read-in from backing store is triggered and the process sleeps until the read has completed

- finally, a page table entry referring to the page is placed into the process' page tables. At this point, our platform-specific hook is triggered; we check whether the page count is 1, and if so, reset the dirty bit in the storage key.

If the same file is mapped into multiple address spaces, this is broken. Consider two processes faulting in the page at (nearly) the same time; both increment the page count, and none of them will see a page count of 1 when updating the page table, therefore the dirty bit will not be reset.

In later kernels (e.g. SLES-8), this problem is less visible because we completely ignore the storage key on private read-only mappings as those can never be dirty anyway. This was intended as a performance improvement only (we save storage key operations, which can be expensive), which is why we didn't backport the new logic. However, it would have prevented your symptoms. (The race is still there, but only on read-write mappings where it has less obvious effects; some pages may be written back although there isn't really a need to.)

A proper fix of the race is probably not possible without changes to common code; a good place to reset the storage key dirty bit might be SetPageUptodate, but there's no arch-dependent hook there. We'll have to think about this ...

> Has anyone else seen this? Does anyone know of any patches to deal with this? If anyone wants to see if they can reproduce this, I can send them a copy of the program that we wrote to do Step 4 from above. It's less than 100 lines of C code.

I'd appreciate it if you could send me this program.

Mit freundlichen Gruessen / Best Regards

Ulrich Weigand

--
  Dr. Ulrich Weigand
  Linux for S/390 Design & Development
  IBM Deutschland Entwicklung GmbH, Schoenaicher Str. 220, 71032 Boeblingen
  Phone: +49-7031/16-3727   ---   Email: [EMAIL PROTECTED]
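
[Illustration] The three fault steps Ulrich lists, and the way the "page count is 1" test misses when two faults overlap, can be modelled in a few lines of user-space C. This is only a toy sketch under stated assumptions: struct page_model, fault_begin(), fault_finish() and dirty_after_two_faults() are hypothetical names invented here, the reference count is simplified to count only the faulting processes, and none of this is the actual s390 kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one page-cache page. Hypothetical; not kernel code. */
struct page_model {
    int  count;      /* page reference count (simplified)              */
    bool uptodate;   /* contents have been read from backing store     */
    bool key_dirty;  /* dirty bit in the storage key of the page frame */
};

/* Steps 1 and 2: look the page up (taking a reference) and read it in. */
static void fault_begin(struct page_model *pg)
{
    pg->count++;
    if (!pg->uptodate) {
        pg->key_dirty = true;   /* the read-in stores into the frame */
        pg->uptodate = true;
    }
}

/* Step 3: install the PTE; the hook resets the key only on a count of 1. */
static void fault_finish(struct page_model *pg)
{
    assert(pg->count >= 1);
    if (pg->count == 1)
        pg->key_dirty = false;
}

/* Run two read faults on a fresh page and return the final dirty bit. */
bool dirty_after_two_faults(bool interleaved)
{
    struct page_model pg = { 0, false, false };

    if (interleaved) {
        fault_begin(&pg);    /* process A takes its reference      */
        fault_begin(&pg);    /* process B takes its reference      */
        fault_finish(&pg);   /* A sees count == 2, no reset        */
        fault_finish(&pg);   /* B sees count == 2, no reset        */
    } else {
        fault_begin(&pg);    /* A completes its whole fault first  */
        fault_finish(&pg);   /* count == 1, dirty bit is reset     */
        fault_begin(&pg);    /* B maps the already clean page      */
        fault_finish(&pg);   /* count == 2, bit correctly kept     */
    }
    return pg.key_dirty;
}
```

In this model dirty_after_two_faults(false) leaves the page clean, while dirty_after_two_faults(true) leaves the storage-key dirty bit set although neither process wrote to the page, which matches the symptom described in the report.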
Re: SLES7 mmap problem on s390
Bob:

> Did you test this with one process accessing the file? In your description of the problem I got the impression that you didn't.

The problem occurs when multiple processes have the file mmaped.

> What version is your kernel?

2.4.7-SuSE-53 kernel.

> In case you didn't know: the truncate will not cause any space to be allocated to the file. There must be more going on here than just marking the page dirty, otherwise there would be no place to write the dirty page.

It is definitely the incorrect calling of writepage that is causing us problems. Like I said, we made a patch to ext2_writepage() and even without a file with a hole, Linux writes to it when it shouldn't.

> The more I think about this, the more I think this may be intentional. The mmap data may need to be backed someplace.

When it's mmapped PRIVATE READONLY, it shouldn't. It doesn't on other kernels.

> Anyone: I would be interested in knowing if this problem can be reproduced on another architecture. Can anyone test this on a PC with the same version of the Linux kernel?

I know it doesn't happen on a kernel.org 2.4.7 kernel on a PC. I haven't tried it with this specific kernel on a PC though...

> That would give me a clue if it's in the architecture dependent code or not.

Good Idea.

Thanks.
-Ben
[EMAIL PROTECTED]

> Unfortunately I don't have Linux running on anything right now, or I would test it.

-----Original Message-----
From: Ben Marzinski [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, January 29, 2003 5:02 PM
To: [EMAIL PROTECTED]
Subject: SLES7 mmap problem on s390

There is a reproducible memory mapping problem with the s390 SuSE linux setup we have. The bug occurs when two processes have private, read-only mappings of the same file and both processes page in the same page at the same time. The PTE for that page gets incorrectly marked dirty, which causes the page to be marked dirty, and the writepage() address space operation to be called. Nothing that the processes have done should have caused the page to be written back to the file. The file is modified even if the whole filesystem is mounted Read-Only.

Our setup is:
  A 31-bit s390
  A 2-processor virtual machine with 128MB of RAM
  SuSE SLES7 with the timer-patched kernel.
  A 2.5 GB dasd

The problem can be reproduced by doing the following:

1) Make an Ext2 filesystem on a spare device. Mount it.
2) On the new filesystem, create a file that is larger than the available memory and nothing but a hole.
   # touch file; perl -e 'truncate("file", 209715200);'
3) Remount the filesystem Read-Only
4) Run a program that mmaps the file, and then forks a couple processes that keep on printing out random parts of the mmaped file.
5) Watch the number of Used blocks in the filesystem grow with df.

We also wrote a patch to ext2_writepage to prove that it was getting called.

Has anyone else seen this? Does anyone know of any patches to deal with this? If anyone wants to see if they can reproduce this, I can send them a copy of the program that we wrote to do Step 4 from above. It's less than 100 lines of C code.

Thanks
-Ben Marzinski
[EMAIL PROTECTED]
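
[Illustration] The program Ben offers for Step 4 is not included in the thread. What follows is only a guess at its general shape, assuming a POSIX system: it maps the file with PROT_READ and MAP_PRIVATE, forks a few readers, and has each print bytes from randomly chosen pages. The function names make_hole_file() and run_readers(), and their parameters, are hypothetical; this is not the original test program, and the readers here stop after a fixed number of reads instead of running indefinitely.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Step 2: create (or resize) a file that is nothing but a hole. */
int make_hole_file(const char *path, off_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    int rc = ftruncate(fd, size);
    close(fd);
    return rc;
}

/* Step 4: map the file private and read-only, then fork nprocs readers
 * that each print one byte from reads_per_proc randomly chosen pages.
 * Returns 0 if every reader exited cleanly, -1 otherwise. */
int run_readers(const char *path, int nprocs, int reads_per_proc)
{
    struct stat sb;
    int fd = open(path, O_RDONLY);
    if (fd < 0 || nprocs <= 0)
        return -1;
    if (fstat(fd, &sb) < 0 || sb.st_size <= 0) {
        close(fd);
        return -1;
    }
    long pagesz = sysconf(_SC_PAGESIZE);
    assert(pagesz > 0);
    size_t len = (size_t)sb.st_size;
    size_t npages = len / (size_t)pagesz;
    if (npages == 0)
        npages = 1;               /* short file: only offset 0 is read */

    const unsigned char *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                    /* the mapping outlives the descriptor */
    if (map == MAP_FAILED)
        return -1;

    pid_t *pids = calloc((size_t)nprocs, sizeof *pids);
    if (pids == NULL) {
        munmap((void *)map, len);
        return -1;
    }
    fflush(stdout);               /* do not duplicate buffered output */
    int started = 0;
    for (; started < nprocs; started++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) {           /* child: touch random pages */
            srand((unsigned)getpid());
            for (int r = 0; r < reads_per_proc; r++) {
                size_t page = (size_t)rand() % npages;
                printf("%d: page %zu = %d\n", (int)getpid(), page,
                       map[page * (size_t)pagesz]);
            }
            fflush(stdout);
            _exit(0);
        }
        pids[started] = pid;
    }
    int rc = (started == nprocs) ? 0 : -1;
    for (int i = 0; i < started; i++) {
        int status;
        if (waitpid(pids[i], &status, 0) < 0 ||
            !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            rc = -1;
    }
    free(pids);
    munmap((void *)map, len);
    return rc;
}
```

A call such as run_readers("file", 2, 100000) on the read-only mount would correspond to Step 4; whether it triggers the symptom depends on the kernel and on how often two readers fault on the same page at the same time.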
Re: SLES7 mmap problem on s390
This is very curious. I bet that there is some problem in the VM subsystem having to do with pages that are mapped to more than one process.

I am also wondering why the filesystem allowed the write to a disk that is mounted read only.

-----Original Message-----
From: Ben Marzinski [mailto:[EMAIL PROTECTED]]
Sent: Thursday, January 30, 2003 12:17 PM
To: [EMAIL PROTECTED]
Subject: Re: SLES7 mmap problem on s390

Bob:

> Did you test this with one process accessing the file? In your description of the problem I got the impression that you didn't.

The problem occurs when multiple processes have the file mmaped.

> What version is your kernel?

2.4.7-SuSE-53 kernel.

> In case you didn't know: the truncate will not cause any space to be allocated to the file. There must be more going on here than just marking the page dirty, otherwise there would be no place to write the dirty page.

It is definitely the incorrect calling of writepage that is causing us problems. Like I said, we made a patch to ext2_writepage() and even without a file with a hole, Linux writes to it when it shouldn't.

> The more I think about this, the more I think this may be intentional. The mmap data may need to be backed someplace.

When it's mmapped PRIVATE READONLY, it shouldn't. It doesn't on other kernels.

> Anyone: I would be interested in knowing if this problem can be reproduced on another architecture. Can anyone test this on a PC with the same version of the Linux kernel?

I know it doesn't happen on a kernel.org 2.4.7 kernel on a PC. I haven't tried it with this specific kernel on a PC though...

> That would give me a clue if it's in the architecture dependent code or not.

Good Idea.

Thanks.
-Ben
[EMAIL PROTECTED]

> Unfortunately I don't have Linux running on anything right now, or I would test it.
Re: SLES7 mmap problem on s390
Ext2 lets the stuff get written to disk because it assumes linux won't mark pages from files in a RO filesystem dirty. Since ext2 assumes this, it doesn't bother to check in ext2_writepage(), which is perfectly sensible. It just happens to be wrong in this case.

-Ben

On Thu, Jan 30, 2003 at 12:56:30PM -0800, Fargusson.Alan wrote:
> This is very curious. I bet that there is some problem in the VM subsystem having to do with pages that are mapped to more than one process.
>
> I am also wondering why the filesystem allowed the write to a disk that is mounted read only.