Re: SLES7 mmap problem on s390

2003-01-31 Thread Ulrich Weigand
Ben Marzinski wrote:

> There is a reproducible memory mapping problem with the s390 SuSE Linux
> setup we have.  The bug occurs when two processes have private, read-only
> mappings of the same file and both processes page in the same page at
> the same time. The PTE for that page gets incorrectly marked dirty, which
> causes the page to be marked dirty, and the writepage() address space
> operation to be called. Nothing that the processes have done should have
> caused the page to be written back to the file. The file is modified even
> if the whole filesystem is mounted Read-Only.

There does indeed appear to be a race condition in the s390-specific
memory management backend that could explain the symptom you're seeing.

The problem is that the S/390 does not have a dirty bit that resides
in the PTE like most platforms, and which gets set whenever the processor
writes to the page via the mapping defined by this PTE.  Instead we have a
dirty bit in the 'storage key'; there is one storage key associated with
every physical page.  The dirty bit there is set by the memory subsystem
on every write to the page, no matter via which PTE, and even on DMA
writes by hardware devices.
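
To make the difference concrete, here is a toy model in C. It is a sketch only: `struct frame`, `struct pte`, `cpu_write` and `dma_write` are invented names for illustration and do not correspond to kernel structures or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: illustrative types, not kernel data structures. */
struct frame {              /* one physical page */
    bool key_dirty;         /* dirty bit in the page's storage key */
};

struct pte {                /* one mapping of a physical page */
    struct frame *frame;
    bool pte_dirty;         /* per-mapping dirty bit, as on most platforms */
};

/* A CPU store through a mapping.  On most platforms only pte_dirty
 * would exist; in the storage-key scheme only key_dirty exists. */
static void cpu_write(struct pte *pte)
{
    assert(pte->frame != NULL);
    pte->pte_dirty = true;
    pte->frame->key_dirty = true;
}

/* A device DMA write into the page: no PTE is involved at all,
 * so only the storage key can record it. */
static void dma_write(struct frame *f)
{
    f->key_dirty = true;
}
```

In this model a read-in from backing store behaves like `dma_write`: the storage key ends up dirty although no mapping has written to the page.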

This means that pages from memory-mapped files would usually start out
with the dirty bit set (because the page was read in from backing store,
and doing so would access the physical page and set the dirty bit).
So we have to reset the dirty bit, which we do when the page is first
mapped into a user address space.  (Note we must only do so on the *first*
mapping to user space; subsequent mappings of a page must not reset a
dirty bit that might have been set in the meantime.)

Unfortunately, this logic would appear to exhibit a race condition
in the situation you're describing.  When a process faults in a page
from a memory-mapped file, the following steps happen in sequence:

- the page is looked up in the page cache; if not found, a new page is
  allocated and added to the page cache.  In any case, at this point
  the page reference count is incremented.

- if the page is not uptodate, a read-in from backing store is triggered
  and the process sleeps until the read has completed.

- finally, a page table entry referring to the page is placed into the
  process' page tables.  At this point, our platform-specific hook is
  triggered; we check whether the page count is 1, and if so, reset the
  dirty bit in the storage key.

If the same file is mapped into multiple address spaces, this is broken.
Consider two processes faulting in the page at (nearly) the same time:
both increment the page count before either of them updates its page
table, so neither of them sees a page count of 1 at that point, and the
dirty bit is never reset.
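
The interleaving can be replayed deterministically in a small C model. This is a sketch under stated assumptions: the three `fault_*` functions are hypothetical stand-ins for the three steps listed above, and the reference counting is simplified to exactly what this mail describes (a check for a count of 1), not what the kernel source does in detail.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a page-cache page; not a kernel structure. */
struct page {
    int count;          /* reference count */
    bool uptodate;      /* contents have been read from backing store */
    bool key_dirty;     /* dirty bit in the storage key */
};

/* Step 1: look up (or allocate) the page and take a reference. */
static void fault_lookup(struct page *p)
{
    p->count++;
}

/* Step 2: read-in from backing store, which dirties the storage key. */
static void fault_read_in(struct page *p)
{
    if (!p->uptodate) {
        p->key_dirty = true;
        p->uptodate = true;
    }
}

/* Step 3: platform hook run when the PTE is installed. */
static void fault_map(struct page *p)
{
    assert(p->count > 0);
    if (p->count == 1)
        p->key_dirty = false;
}

/* Process A finishes its fault before process B starts. */
static bool dirty_after_sequential_faults(void)
{
    struct page p = { 0, false, false };
    fault_lookup(&p); fault_read_in(&p); fault_map(&p);   /* A */
    fault_lookup(&p); fault_read_in(&p); fault_map(&p);   /* B */
    return p.key_dirty;
}

/* A and B both take their reference before either maps the page. */
static bool dirty_after_racing_faults(void)
{
    struct page p = { 0, false, false };
    fault_lookup(&p);       /* A: count becomes 1 */
    fault_lookup(&p);       /* B: count becomes 2 */
    fault_read_in(&p);      /* A: read-in sets the key dirty */
    fault_read_in(&p);      /* B: already uptodate, nothing to do */
    fault_map(&p);          /* A: sees count 2, no reset */
    fault_map(&p);          /* B: sees count 2, no reset */
    return p.key_dirty;
}
```

In this model `dirty_after_sequential_faults()` returns false and `dirty_after_racing_faults()` returns true, although neither process ever wrote to the page.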

In later kernels (e.g. SLES-8), this problem is less visible because
we completely ignore the storage key on private read-only mappings as
those can never be dirty anyway.  This was intended as a performance
improvement only (we save storage key operations, which can be expensive),
which is why we didn't backport the new logic.  However, it would have
prevented your symptoms.  (The race is still there, but only on read-
write mappings where it has less obvious effects; some pages may be
written back although there isn't really a need to.)
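
As a rough illustration of that newer behaviour, the decision could be modelled as below. This is only a sketch: `consult_storage_key` is a hypothetical name, and the userspace `PROT_*`/`MAP_*` constants stand in for whatever flags the kernel actually inspects.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

/* Hypothetical helper: should the storage key's dirty bit be looked at
 * for a mapping created with these protection and mapping flags? */
static bool consult_storage_key(int prot, int flags)
{
    /* A mapping is either private or shared. */
    assert((flags & MAP_PRIVATE) || (flags & MAP_SHARED));

    bool private_read_only = (flags & MAP_PRIVATE) && !(prot & PROT_WRITE);

    /* A private read-only mapping can never dirty the page,
     * so the (possibly stale) storage key is ignored for it. */
    return !private_read_only;
}
```

With this rule a stale dirty bit left behind by the race is never seen for the private read-only mappings in the report, which is why the symptom disappears even though the race itself remains.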

A proper fix of the race is probably not possible without changes
to common code; a good place to reset the storage key dirty bit might
be SetPageUptodate, but there's no arch-dependent hook there.  We'll
have to think about this ...
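
To show why that spot looks attractive, here is the same kind of toy model with the reset moved to the point where the page becomes uptodate. This is a thought experiment, not a patch: `set_page_uptodate` is a stand-in for the common-code operation named above, and the reset inside it is exactly the arch hook that does not exist today.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a page-cache page; not a kernel structure. */
struct page {
    int count;
    bool uptodate;
    bool key_dirty;     /* dirty bit in the storage key */
};

/* Hypothetical: clear the storage key exactly once, when read-in ends. */
static void set_page_uptodate(struct page *p)
{
    p->key_dirty = false;
    p->uptodate = true;
}

static void fault_lookup(struct page *p)
{
    p->count++;
}

static void fault_read_in(struct page *p)
{
    if (p->uptodate)
        return;
    p->key_dirty = true;        /* the I/O stores into the page */
    set_page_uptodate(p);
}

/* Two faults racing as in the report; mapping needs no key logic. */
static bool dirty_after_racing_faults(void)
{
    struct page p = { 0, false, false };
    fault_lookup(&p);           /* A */
    fault_lookup(&p);           /* B */
    fault_read_in(&p);          /* A: read-in, then key cleared */
    fault_read_in(&p);          /* B: already uptodate */
    return p.key_dirty;
}

/* A genuine store after read-in must still be remembered. */
static bool dirty_after_real_write(void)
{
    struct page p = { 0, false, false };
    fault_lookup(&p);
    fault_read_in(&p);
    p.key_dirty = true;         /* some mapping writes to the page */
    fault_lookup(&p);           /* a later fault by another process */
    fault_read_in(&p);          /* uptodate, so the key is untouched */
    return p.key_dirty;
}
```

Because the reset no longer depends on the page count, the order in which the two processes take their references does not matter in this model, and a later genuine write is not lost.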

> Has anyone else seen this? Does anyone know of any patches to deal with this?
> If anyone wants to see if they can reproduce this, I can send them a copy
> of the program that we wrote to do Step 4 from above. It's less than
> 100 lines of C code.

I'd appreciate it if you could send me this program.


Mit freundlichen Gruessen / Best Regards

Ulrich Weigand

--
  Dr. Ulrich Weigand
  Linux for S/390 Design & Development
  IBM Deutschland Entwicklung GmbH, Schoenaicher Str. 220, 71032 Boeblingen
  Phone: +49-7031/16-3727   ---   Email: [EMAIL PROTECTED]



Re: SLES7 mmap problem on s390

2003-01-30 Thread Ben Marzinski
> Bob:
> Did you test this with one process accessing the file?  In your description of the
> problem I got the impression that you didn't.

The problem occurs when multiple processes have the file mmapped.

> What version is your kernel?

2.4.7-SuSE-53 kernel.

> In case you didn't know: the truncate will not cause any space to be allocated to
> the file.  There must be more going on here than just marking the page dirty,
> otherwise there would be no place to write the dirty page.

It is definitely the incorrect calling of writepage that is causing us problems.
Like I said, we made a patch to ext2_writepage() and even without a file
with a hole, Linux writes to it when it shouldn't.

> The more I think about this, the more I think this may be intentional.  The mmap
> data may need to be backed someplace.

When it's mmapped PRIVATE READONLY, it shouldn't.  It doesn't on other kernels.

> Anyone:
> I would be interested in knowing if this problem can be reproduced on another
> architecture.  Can anyone test this on a PC with the same version of the Linux kernel?

I know it doesn't happen on a kernel.org 2.4.7 kernel on a PC.  I haven't
tried it with this specific kernel on a PC though... That would give me a clue
if it's in the architecture dependent code or not.  Good Idea.
Thanks.

-Ben

[EMAIL PROTECTED]

> Unfortunately I don't have Linux running on anything right now, or I would test it.
>
> -----Original Message-----
> From: Ben Marzinski [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, January 29, 2003 5:02 PM
> To: [EMAIL PROTECTED]
> Subject: SLES7 mmap problem on s390
>
>
> There is a reproducible memory mapping problem with the s390 SuSE Linux
> setup we have.  The bug occurs when two processes have private, read-only
> mappings of the same file and both processes page in the same page at
> the same time. The PTE for that page gets incorrectly marked dirty, which
> causes the page to be marked dirty, and the writepage() address space
> operation to be called. Nothing that the processes have done should have
> caused the page to be written back to the file. The file is modified even
> if the whole filesystem is mounted Read-Only.
>
> Our setup is:
>
> A 31-bit s390
> A 2-processor virtual machine with 128MB of RAM
> SuSE SLES7 with the timer-patched kernel.
> A 2.5 GB dasd
>
> The problem can be reproduced by doing the following:
>
> 1) Make an Ext2 filesystem on a spare device. Mount it.
> 2) On the new filesystem, create a file that is larger than the available
>    memory and nothing but a hole.
>    # touch file; perl -e 'truncate("file", 209715200);'
> 3) Remount the filesystem Read-Only.
> 4) Run a program that mmaps the file, and then forks a couple of processes
>    that keep on printing out random parts of the mmapped file.
> 5) Watch the number of Used blocks in the filesystem grow with df.
>
> We also wrote a patch to ext2_writepage to prove that it was getting called.
>
> Has anyone else seen this? Does anyone know of any patches to deal with this?
> If anyone wants to see if they can reproduce this, I can send them a copy
> of the program that we wrote to do Step 4 from above. It's less than
> 100 lines of C code.
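
The program itself was not posted to the list. For readers who want to try Step 4, the following is a minimal sketch of what such a program might look like; it is not the original. The function names `run_readers` and `read_random_pages` and their parameters are made up for this example, and each child prints a checksum of the bytes it touched rather than the file contents.

```c
#define _XOPEN_SOURCE 700   /* for random(), srandom() under strict modes */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child: touch one byte in randomly chosen pages of the mapping,
 * forcing those pages to be faulted in. */
static void read_random_pages(const unsigned char *map, size_t len,
                              long iterations)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;
    unsigned long sum = 0;

    assert(npages > 0);
    srandom((unsigned)getpid());
    for (long i = 0; i < iterations; i++) {
        size_t idx = (size_t)random() % npages;
        sum += map[idx * page];
    }
    printf("pid %d: checksum %lu\n", (int)getpid(), sum);
    fflush(stdout);
}

/* Map `path` private and read-only, fork `nprocs` readers that share
 * the mapping, and wait for them.  Returns 0 on success, -1 on error. */
int run_readers(const char *path, int nprocs, long iterations)
{
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        perror("open");
        return -1;
    }
    if (fstat(fd, &st) < 0) {
        perror("fstat");
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {
        fprintf(stderr, "%s: file is empty\n", path);
        close(fd);
        return -1;
    }

    size_t len = (size_t)st.st_size;
    unsigned char *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid */
    if (map == MAP_FAILED) {
        perror("mmap");
        return -1;
    }

    int started = 0;
    for (int i = 0; i < nprocs; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            read_random_pages(map, len, iterations);
            _exit(0);
        }
        if (pid > 0)
            started++;
        else
            perror("fork");
    }

    int rc = (started == nprocs) ? 0 : -1;
    for (int i = 0; i < started; i++) {
        int status;
        if (wait(&status) < 0 || !WIFEXITED(status) ||
            WEXITSTATUS(status) != 0)
            rc = -1;
    }
    munmap(map, len);
    return rc;
}
```

A small `main` that calls, say, `run_readers("file", 2, 1000000)` on the read-only mount would correspond to Step 4; the process count and iteration count here are arbitrary choices, not values from the report.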

> Thanks
>
> -Ben Marzinski
>
> [EMAIL PROTECTED]



Re: SLES7 mmap problem on s390

2003-01-30 Thread Fargusson.Alan
This is very curious.  I bet that there is some problem in the VM subsystem having to 
do with pages that are mapped to more than one process.  I am also wondering why the 
filesystem allowed the write to a disk that is mounted read only.




Re: SLES7 mmap problem on s390

2003-01-30 Thread Ben Marzinski
Ext2 lets the data get written to disk because it assumes Linux won't mark
pages from files in a read-only filesystem dirty.  Since ext2 assumes this, it
doesn't bother to check in ext2_writepage(), which is perfectly sensible. It
just happens to be wrong in this case.

-Ben

On Thu, Jan 30, 2003 at 12:56:30PM -0800, Fargusson.Alan wrote:
> This is very curious.  I bet that there is some problem in the VM subsystem having
> to do with pages that are mapped to more than one process.  I am also wondering why
> the filesystem allowed the write to a disk that is mounted read only.
