Re: PC speaker driver patch for 2.4.0-test10-pre3

2000-10-24 Thread Oliver Xymoron

On Tue, 24 Oct 2000, Pavel Machek wrote:

> Hi!
> 
> > > [EMAIL PROTECTED] said:
> > > >  You can also pretty trivially keep track of an error term so that the
> > > > clock is right on average: 
> > > 
> > > True, but I don't want 'right on average'. I want 'not screwed with at all'.
> > > Shifting the timer tick onto the RTC will give me that. 
> > > 
> > > Even if we _do_ get the maths right, fudging with the frequency of the 
> > > system timer is still an ugly hack.
> > 
> > I personally think the system timer is already an ugly hack. HZ is
> > arbitrary, slow by modern standards, and introduces latency.
> > As the comment you quoted points out, it's also not very accurate.  
> > Much better would be an agenda structure with one shot timers between
> > events and jiffies based on cycle counters. This works on modern hardware
> > and scales well for higher processor speeds. 
> 
> ...and breaks horribly when your CPU frequency changes... like on most
> current notebooks.

The unit of time is arbitrary. Cycle counts just happens to be convenient
on most machines and very high resolution. The important point is that
hanging tasks off a periodic low-resolution timer is not scaling with CPU
clock speeds at all and there are ways to throw it out.

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.." 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: PC speaker driver patch for 2.4.0-test10-pre3

2000-10-24 Thread Pavel Machek

Hi!

> > [EMAIL PROTECTED] said:
> > >  You can also pretty trivially keep track of an error term so that the
> > > clock is right on average: 
> > 
> > True, but I don't want 'right on average'. I want 'not screwed with at all'.
> > Shifting the timer tick onto the RTC will give me that. 
> > 
> > Even if we _do_ get the maths right, fudging with the frequency of the 
> > system timer is still an ugly hack.
> 
> I personally think the system timer is already an ugly hack. HZ is
> arbitrary, slow by modern standards, and introduces latency.
> As the comment you quoted points out, it's also not very accurate.  
> Much better would be an agenda structure with one shot timers between
> events and jiffies based on cycle counters. This works on modern hardware
> and scales well for higher processor speeds. 

...and breaks horribly when your CPU frequency changes... like on most
current notebooks.
Pavel
-- 
I'm [EMAIL PROTECTED] "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-24 Thread Alexander Viro



On Tue, 24 Oct 2000, Petr Vandrovec wrote:

> On 23 Oct 00 at 23:05, Alexander Viro wrote:
> 
> > Oh, crap... Who introduced ->i_mmap_shared/->i_mmap separation and what
> > analysis had been done? Petr, can you reproduce the problem on -test7?
> > Unfortunately, clean test would take the backport of ext2 changes
> > (truncate-related, happened around the same time), but IIRC -test7 was
> > relatively stable...
> 
> Just FYI - I was able to reproduce it with 2.4.0-test5 and all laters
> (different pre-s of test6/7/8/9/10) I have. I was not able to reproduce 
> it with 2.2.17-pre13 (although this kernel had really bad behavior when
> I run out of space with backing file - sendmail suspended itself due
> to load of 37... on machine which has usually load 0.01). 
> 
> I do not have kernels between 2.2.17 and 2.4.0-test5 installed.
> 
> If you have anything I should check...

I see the race that can lead to screwups you've seen, but I don't see a
really clear way to close it.

Race in question:
truncate_inode_pages() removes the page from pagecache.
sync_pte is called before we get to killing the pagetable
references.

Potentially, the following might be more or less OK, but I'm not too
happy with it:
* All ->nopage() methods mark page uptodate (upon success, that
is).
* do_no_page() fails if page is not uptodate.
* vmtruncate starts with setting the size and going through the
pagecache, removing all fully truncated pages from the hash chains and
gathering them in the list and removing uptodate flag.
* then it does vmtruncate_list() calls, zapping the pagetable
references to the pages in question.
* *then* it goes through the list and does the rest of
truncate_inode_pages() work.

The theory being:
once the i_size is set, no new pages will be bound to affected
part of file.
when we unhash all affected pages and mark them non-uptodate there
will be no new pagetable references to them (they will not be found, and
any pagefault in progress will fail since we mark them non-uptodate).
any possible IO-in-progress on these pages will not be disturbed -
it doesn't depend on the page being hashed. Aliases do not matter, since
they won't be bound to the relevant part of file anyway.
once we kill all pagetable references (and no new references are
going to appear at that point) we can be sure that no process will
initiate _any_ operations on these pages. Notice that potential dirty data
on them doesn't matter - it's going to die anyway (file is being
truncated).
*now* we grab the page lock on each of these pages, thus letting
any method-in-progress to be completed and that point we can safely kill
->mapping.

Linus, could you comment on that variant?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-24 Thread Petr Vandrovec

On 23 Oct 00 at 23:05, Alexander Viro wrote:

> Oh, crap... Who introduced ->i_mmap_shared/->i_mmap separation and what
> analysis had been done? Petr, can you reproduce the problem on -test7?
> Unfortunately, clean test would take the backport of ext2 changes
> (truncate-related, happened around the same time), but IIRC -test7 was
> relatively stable...

Just FYI - I was able to reproduce it with 2.4.0-test5 and all laters
(different pre-s of test6/7/8/9/10) I have. I was not able to reproduce 
it with 2.2.17-pre13 (although this kernel had really bad behavior when
I run out of space with backing file - sendmail suspended itself due
to load of 37... on machine which has usually load 0.01). 

I do not have kernels between 2.2.17 and 2.4.0-test5 installed.

If you have anything I should check...
Thanks,
Petr Vandrovec
[EMAIL PROTECTED]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-24 Thread Petr Vandrovec

On 23 Oct 00 at 23:05, Alexander Viro wrote:

 Oh, crap... Who introduced -i_mmap_shared/-i_mmap separation and what
 analysis had been done? Petr, can you reproduce the problem on -test7?
 Unfortunately, clean test would take the backport of ext2 changes
 (truncate-related, happened around the same time), but IIRC -test7 was
 relatively stable...

Just FYI - I was able to reproduce it with 2.4.0-test5 and all laters
(different pre-s of test6/7/8/9/10) I have. I was not able to reproduce 
it with 2.2.17-pre13 (although this kernel had really bad behavior when
I run out of space with backing file - sendmail suspended itself due
to load of 37... on machine which has usually load 0.01). 

I do not have kernels between 2.2.17 and 2.4.0-test5 installed.

If you have anything I should check...
Thanks,
Petr Vandrovec
[EMAIL PROTECTED]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-24 Thread Alexander Viro



On Tue, 24 Oct 2000, Petr Vandrovec wrote:

 On 23 Oct 00 at 23:05, Alexander Viro wrote:
 
  Oh, crap... Who introduced -i_mmap_shared/-i_mmap separation and what
  analysis had been done? Petr, can you reproduce the problem on -test7?
  Unfortunately, clean test would take the backport of ext2 changes
  (truncate-related, happened around the same time), but IIRC -test7 was
  relatively stable...
 
 Just FYI - I was able to reproduce it with 2.4.0-test5 and all laters
 (different pre-s of test6/7/8/9/10) I have. I was not able to reproduce 
 it with 2.2.17-pre13 (although this kernel had really bad behavior when
 I run out of space with backing file - sendmail suspended itself due
 to load of 37... on machine which has usually load 0.01). 
 
 I do not have kernels between 2.2.17 and 2.4.0-test5 installed.
 
 If you have anything I should check...

I see the race that can lead to screwups you've seen, but I don't see a
really clear way to close it.

Race in question:
truncate_inode_pages() removes the page from pagecache.
sync_pte is called before we get to killing the pagetable
references.

Potentially, the following might be more or less OK, but I'm not too
happy with it:
* All -nopage() methods mark page uptodate (upon success, that
is).
* do_no_page() fails if page is not uptodate.
* vmtruncate starts with setting the size and going through the
pagecache, removing all fully truncated pages from the hash chains and
gathering them in the list and removing uptodate flag.
* then it does vmtruncate_list() calls, zapping the pagetable
references to the pages in question.
* *then* it goes through the list and does the rest of
truncate_inode_pages() work.

The theory being:
once the i_size is set, no new pages will be bound to affected
part of file.
when we unhash all affected pages and mark them non-uptodate there
will be no new pagetable references to them (they will not be found, and
any pagefault in progress will fail since we mark them non-uptodate).
any possible IO-in-progress on these pages will not be disturbed -
it doesn't depend on the page being hashed. Aliases do not matter, since
they won't be bound to the relevant part of file anyway.
once we kill all pagetable references (and no new references are
going to appear at that point) we can be sure that no process will
initiate _any_ operations on these pages. Notice that potential dirty data
on them doesn't matter - it's going to die anyway (file is being
truncated).
*now* we grab the page lock on each of these pages, thus letting
any method-in-progress to be completed and that point we can safely kill
-mapping.

Linus, could you comment on that variant?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: PC speaker driver patch for 2.4.0-test10-pre3

2000-10-24 Thread Pavel Machek

Hi!

  [EMAIL PROTECTED] said:
You can also pretty trivially keep track of an error term so that the
   clock is right on average: 
  
  True, but I don't want 'right on average'. I want 'not screwed with at all'.
  Shifting the timer tick onto the RTC will give me that. 
  
  Even if we _do_ get the maths right, fudging with the frequency of the 
  system timer is still an ugly hack.
 
 I personally think the system timer is already an ugly hack. HZ is
 arbitrary, slow by modern standards, and introduces latency.
 As the comment you quoted points out, it's also not very accurate.  
 Much better would be an agenda structure with one shot timers between
 events and jiffies based on cycle counters. This works on modern hardware
 and scales well for higher processor speeds. 

...and breaks horribly when your CPU frequency changes... like on most
current notebooks.
Pavel
-- 
I'm [EMAIL PROTECTED] "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: PC speaker driver patch for 2.4.0-test10-pre3

2000-10-24 Thread Oliver Xymoron

On Tue, 24 Oct 2000, Pavel Machek wrote:

 Hi!
 
   [EMAIL PROTECTED] said:
 You can also pretty trivially keep track of an error term so that the
clock is right on average: 
   
   True, but I don't want 'right on average'. I want 'not screwed with at all'.
   Shifting the timer tick onto the RTC will give me that. 
   
   Even if we _do_ get the maths right, fudging with the frequency of the 
   system timer is still an ugly hack.
  
  I personally think the system timer is already an ugly hack. HZ is
  arbitrary, slow by modern standards, and introduces latency.
  As the comment you quoted points out, it's also not very accurate.  
  Much better would be an agenda structure with one shot timers between
  events and jiffies based on cycle counters. This works on modern hardware
  and scales well for higher processor speeds. 
 
 ...and breaks horribly when your CPU frequency changes... like on most
 current notebooks.

The unit of time is arbitrary. Cycle counts just happens to be convenient
on most machines and very high resolution. The important point is that
hanging tasks off a periodic low-resolution timer is not scaling with CPU
clock speeds at all and there are ways to throw it out.

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.." 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro



On Mon, 23 Oct 2000, Linus Torvalds wrote:

> Note that if there really are only 9 "nopage" routines, then it is a lot
> easier to just add the single "SetPageUptodate(page)" into those 9
> routines, and thus let the VM know of the race.

Works for me. And yes, the list of ->nopage instances that can return
success is that short:
drm_vm_shm_nopage
drm_vm_shm_nopage_lock
drm_vm_dma_nopage
sgi_graphics_nopage
via_mm_nopage
ncp_file_mmap_nopage
shm_nopage
shmzero_nopage
filemap_nopage
- sorry, it's not 9+filemap_nopage, it's 8+filemap_nopage.

However, it still leaves a window for the race: we invalidate first and
remove from pagetables later. And ClearPageUptodate() is obviously
truncate_inode_pages() work. Grr...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Tue, 24 Oct 2000, Alexander Viro wrote:
> 
> It's not the only problem, but I would feel _much_ safer if pagefault
> wouldn't rely on pagecache miss. Actually... Hey. Why don't we do the
> insertion into page tables _within_ ->nopage()?

NO!

We used to do this a LOONG time ago.

Distributing the VM information on how to properly insert a pte into the
page table into drivers is _NOT_ something I want to do again. It was a
pain to get it properly modularized, we're not going back to the horror.

Note that if there really are only 9 "nopage" routines, then it is a lot
easier to just add the single "SetPageUptodate(page)" into those 9
routines, and thus let the VM know of the race.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro



On Mon, 23 Oct 2000, Linus Torvalds wrote:

> 
> 
> On Mon, 23 Oct 2000, Alexander Viro wrote:
> > 
> > Oh, crap... Who introduced ->i_mmap_shared/->i_mmap separation and what
> > analysis had been done? Petr, can you reproduce the problem on -test7?
> 
> I don't think that is it - that code looks very straightforward (and is
> needed on some silly architectures that cannot easily otherwise see if
> they need to be coherent wrt user space - mainly sparc and virtual
> caches).

OK, I see where the race can happen. Yes, vmtruncate() tries to kill the
mappings. Right. However, it does that _after_ truncate_inode_pages().
And there is a window when the only lock we are holding is ->i_sem.
sync_pte in that window ==> we are fucked.

So the question being: WTF do we postpone zapping the page tables until
after the truncate_inode_pages()? The following rules might make life
simpler, AFAICS:
* as soon as ->i_size is set, no new pagetable references to
off-limits pages can appear.
* as soon as we are don with vmtruncate_list() there is no
pagetable references.
* truncate_inode_pages() never has to deal with pages refered from
pagetables.
* ->i_size can't increase until we return from vmtruncate().

It's not the only problem, but I would feel _much_ safer if pagefault
wouldn't rely on pagecache miss. Actually... Hey. Why don't we do the
insertion into page tables _within_ ->nopage()? Look: let's take the tail
of do_no_page() into helper function and just call it from the end of
every bloody ->nopage() out there. It _is_ easy: we have only 9 instances
in the tree not counting filemap_nopage(). Moreover,
do_anonymous_page() will become symmetrical to the rest of the crowd.
Comments?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Alexander Viro wrote:
> 
> Oh, crap... Who introduced ->i_mmap_shared/->i_mmap separation and what
> analysis had been done? Petr, can you reproduce the problem on -test7?

I don't think that is it - that code looks very straightforward (and is
needed on some silly architectures that cannot easily otherwise see if
they need to be coherent wrt user space - mainly sparc and virtual
caches).

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro



Oh, crap... Who introduced ->i_mmap_shared/->i_mmap separation and what
analysis had been done? Petr, can you reproduce the problem on -test7?
Unfortunately, clean test would take the backport of ext2 changes
(truncate-related, happened around the same time), but IIRC -test7 was
relatively stable...


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro



On Mon, 23 Oct 2000, Linus Torvalds wrote:

> Also, the fact that Petr didn't see anything trigger in nopage() makes me
> nervous again. Even if the problem happened during read-ahead, it should
> have gotten into the address space only through nopage. Maybe there is
> some vma that isn't added to the right inode VM list - so that we end up
> missing part of the vmtruncate() stuff?
> 
> Al, any ideas?

Just one: let's slow down a bit and try to write down the rules. Close to
release or not, let's understand WTF happens in that part of VM before
deciding on the choice of band-aids/fixes. Frankly, right now I don't feel
that area - too many changes during the last month. So I'm going to sit
down and read it through. IMO it's worth the delay - I have a gut feeling
that band-aids are going to cost more.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Linus Torvalds wrote:
> 
> I'm starting to suspect that we leave this path as-is, and just fix the
> mapping case (and PageUptodate() can work there). That should also avoid
> the nasties.

..and even that looks like I'd have to do the quick-and-dirty case with
the race still there on SMP. Adding the page Uptodate logic to the VM
layer proper is too painful at this point - every single nopage function
would have to be updated to mark its page up-to-date, as they don't
generally do that currently.

Also, the fact that Petr didn't see anything trigger in nopage() makes me
nervous again. Even if the problem happened during read-ahead, it should
have gotten into the address space only through nopage. Maybe there is
some vma that isn't added to the right inode VM list - so that we end up
missing part of the vmtruncate() stuff?

Al, any ideas?

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Petr Vandrovec wrote:
> 
> With ClearPageDirty() kernel locked up (but no watchdog, so probably
> some livelock) during bootup after fsck /

Actually, it turns out that even with this issue fixed, there's the more
serious issue that the page _has_ to be removed from the page cache once
we get to the point that we're freeing the whole inode (which also calls
truncate_inode_pages).

Which means that we cannot take the easy way out and say "let's delay the
freeing until everything is ok" - even if a buffer is busy due to some
unfortunate timing (so that we turn the page into anonymous buffers), we'd
better get rid of the page from the page cache.

I'm starting to suspect that we leave this path as-is, and just fix the
mapping case (and PageUptodate() can work there). That should also avoid
the nasties.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Petr Vandrovec wrote:
> 
> With ClearPageDirty() kernel locked up (but no watchdog, so probably
> some livelock) during bootup after fsck /.

Yeah, the way the truncate logic works right now truncate_whole_page() has
to remove the page from the inode list - otherwise truncate ends up
looping forever trying to truncate that page ;).

And there was a off-by-one error in my first untested version anyway: the
page_count() should be tested against "2", not "1", as the truncate logic
has elevated the count anyway (probably unnecessarily: once we get the
page lock nobody else can race to remove it from the page cache anyway, so
it's not as if it could go away).

> Should I try ClearPageUptodate() instead?

No, I'll have to fix the truncate logic to allow for this all.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Petr Vandrovec

On 23 Oct 00 at 14:34, Linus Torvalds wrote:
> On Mon, 23 Oct 2000, Alexander Viro wrote:
> > On Mon, 23 Oct 2000, Linus Torvalds wrote:
> > > 
> > > Nope, that just makes the race window smaller. We should check for i_size
> > > after we've gotten the page table lock and just before actually entering
> > > the page into the page tables. Otherwise we'll still race on SMP (a _very_
> > > hard window to get into, admittedly).
> > 
> > Umm... I would probably remove Uptodate upon truncate() and check _that_
> > in the place you've mentioned.
> 
> Works for me..
> 
> > > ClearPageDirty(page);
> >   ClearPageUptodate(page);
> > 
> > How about that?
> 
> Makes sense. 

With ClearPageDirty() kernel locked up (but no watchdog, so probably
some livelock) during bootup after fsck /. Unfortunately, 'PC' column
shows complete garbage - init,kswapd,kupdate and rcS in 'R', with
'PC'=current for rcS, 0xc1459f28 for init, 0xc146ffa8 for kswapd and
0xc1469fc8 for kupdate (needless to say that my kernel does not have
20MB... whee what's with PC? it now prints esp instead of eip?). Should 
I try ClearPageUptodate() instead?

Petr Vandrovec
[EMAIL PROTECTED]

P.S.: I have to go home. Continuing after 10 hours.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Alexander Viro wrote:
> 
> On Mon, 23 Oct 2000, Linus Torvalds wrote:
> > 
> > Nope, that just makes the race window smaller. We should check for i_size
> > after we've gotten the page table lock and just before actually entering
> > the page into the page tables. Otherwise we'll still race on SMP (a _very_
> > hard window to get into, admittedly).
> 
> Umm... I would probably remove Uptodate upon truncate() and check _that_
> in the place you've mentioned.

Works for me..

> > ClearPageDirty(page);
>   ClearPageUptodate(page);
> 
> How about that?

Makes sense. 

Note that I'd actually like to hear from Petr first _without_ any added
code in the nopage() handler - the issue of having a page mapped after
nopage() that we shouldn't have mapped is a separate one, and I'd first
like to hear if the problem really goes away even if the mapping bug is
still there..

And _then_ we fix the fact that we should not allow anybody to have a page
mapped past i_size.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Petr Vandrovec wrote:
> 
> Yes. With sleep(60) no oops occur (it takes ~45 secs to exit child). 
> This signals to me: should not vmtruncate_list acquire mm->mmap_sem, 
> if it modifies page tables?

No. 

It should get the page_table lock, but that is sufficient for anybody who
_clears_ page tables (and is pretty much the same case as paging something
out of somebody elses page tables).

>I cannot find anything what prevents doing 
> vmtruncate in one task and filemap_sync in another - neither
> page_table_lock spinlock, nor mmap_sem semaphore...

filemap_sync() does hold the page table lock. 

(Which is certainly not to say that it's necessarily bug-free, but I don't
see any obvious problems off-hand).

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro



On Mon, 23 Oct 2000, Linus Torvalds wrote:

> On Mon, 23 Oct 2000, Alexander Viro wrote:
> > 
> > That's fine, but I'm afraid that we'll need a bit more than that. A couple of
> > obvious ones:
> > * filemap_nopage() needs the second check for ->i_size. Upon exit.
> 
> Nope, that just makes the race window smaller. We should check for i_size
> after we've gotten the page table lock and just before actually entering
> the page into the page tables. Otherwise we'll still race on SMP (a _very_
> hard window to get into, admittedly).

Umm... I would probably remove Uptodate upon truncate() and check _that_
in the place you've mentioned.

> So how about this truncate_complete_page() implementation:
> 
>   /*
>* Try to get rid of a page.. Clear it if it fails
>* for some reason. The page must be locked upon calling
>* this function.
>*
>* We remove the page from the page cache _after_ we have
>* destroyed all buffer-cache references to it. Otherwise some
>* other process might think this inode page is not in the
>* page cache and creates a buffer-cache alias to it causing
>* all sorts of fun problems ...
>*/
>   static inline void truncate_complete_page(struct page *page)
>   {
>   /* Try to get rid of buffers */
>   if (page->buffers)
>   block_flushpage(page, 0);
>   
>   spin_lock(_lock);
>   spin_lock(_lru_lock);
>   
>   if (page_count(page) != 1) {
>   memclear_highpage_flush(page, 0, PAGE_CACHE_SIZE);
>   } else {
>   ClearPageDirty(page);
ClearPageUptodate(page);

How about that?

>   __lru_cache_del(page);
>   __remove_inode_page(page);
>   page_cache_release(page);
>   }
>   spin_unlock(_lru_lock);
>   spin_unlock(_lock);
>   }
> 
> we should probably special-case the "block_flushpage()" failed case, but
> the above should do reasonable things with it (because page_count() will
> be > 1 due to buffers).

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Alexander Viro wrote:
> 
> That's fine, but I'm afraid that we'll need a bit more than that. A couple of
> obvious ones:
>   * filemap_nopage() needs the second check for ->i_size. Upon exit.

Nope, that just makes the race window smaller. We should check for i_size
after we've gotten the page table lock and just before actually entering
the page into the page tables. Otherwise we'll still race on SMP (a _very_
hard window to get into, admittedly).

> Moreover, what is (area->vm_mm == current->mm) doing in the existing check?

It's for ptrace. You can do ugly things with ptrace that aren't possible
for the process itself.

>   * truncate() should zero the page out if it doesn't remove it from
> cache.

So how about this truncate_complete_page() implementation:

/*
 * Try to get rid of a page.. Clear it if it fails
 * for some reason. The page must be locked upon calling
 * this function.
 *
 * We remove the page from the page cache _after_ we have
 * destroyed all buffer-cache references to it. Otherwise some
 * other process might think this inode page is not in the
 * page cache and creates a buffer-cache alias to it causing
 * all sorts of fun problems ...
 */
static inline void truncate_complete_page(struct page *page)
{
/* Try to get rid of buffers */
if (page->buffers)
block_flushpage(page, 0);

spin_lock(_lock);
spin_lock(_lru_lock);

if (page_count(page) != 1) {
memclear_highpage_flush(page, 0, PAGE_CACHE_SIZE);
} else {
ClearPageDirty(page);
__lru_cache_del(page);
__remove_inode_page(page);
page_cache_release(page);
}
spin_unlock(_lru_lock);
spin_unlock(_lock);
}

we should probably special-case the "block_flushpage()" failed case, but
the above should do reasonable things with it (because page_count() will
be > 1 due to buffers).

The above is obviously completely and utterly untested. Petr? Willing to
give this a go?

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Petr Vandrovec

On 23 Oct 00 at 13:57, Linus Torvalds wrote:
> On Mon, 23 Oct 2000, Petr Vandrovec wrote:
> 
> > First page->mapping == NULL entry in syslog is dated 22:23:58, but
> > couple of entries was lost before (probably I should print only '.' for
> > each such page; this run there was more than 100 such pages)
> > Another question is why SIGCHLD was delivered to parent AFTER ftruncate,
> > but exit was invoked couple of seconds before - maybe it syncs
> > child address space to disk?
> 
> exit() basically does do a msync(MSASYNC), so that could be it. 

Yes. With sleep(60) no oops occur (it takes ~45 secs to exit child). 
This signals to me: should not vmtruncate_list acquire mm->mmap_sem, 
if it modifies page tables? I cannot find anything what prevents doing 
vmtruncate in one task and filemap_sync in another - neither
page_table_lock spinlock, nor mmap_sem semaphore...
Thanks,
Petr Vandrovec
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro

On Mon, 23 Oct 2000, Linus Torvalds wrote:

> On Mon, 23 Oct 2000, Alexander Viro wrote:
> 
> > On Mon, 23 Oct 2000, Linus Torvalds wrote:
> > 
> > > Al, any ideas? I have this feeling that the simplest fix is just to leave
> > > the race open, and make truncate_complete_page() just leave such a "racy"
> > > page in the page cache. It will still race, and the invalid page will
> > > still exist, but the end result should be harmless.
> > 
> > Provided that we clean it - why the hell do we want to take it out of
> > the pagecache? I don't see any fundamental reasons to prohibit pages
> > past the ->i_size being hashed.
> 
> In fact, we used to allow them.
> 
> We do want to remove it from the page cache under normal circumstances: it
> makes for much better MM behaviour (ie free pages that are truly useless).
> But I suspect that just adding a test for "page_count(page) == 1" (ie same
> as for the page cache invalidation) before freeing it should give that
> advantage, along with avoiding the problematic unlikely case...

That's fine, but I'm afraid that we'll need a bit more than that. A couple of
obvious ones:
* filemap_nopage() needs the second check for ->i_size. Upon exit.
Moreover, what is (area->vm_mm == current->mm) doing in the existing check?
* truncate() should zero the page out if it doesn't remove it from
cache.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Petr Vandrovec wrote:
> 
> Yes. Bad news. No problem was catched in filemap_nopage, but one
> (of 57000) pages was dirty and had page->mapping == NULL... (maybe
> only one was caused that this was just after bootup, with plenty of memory)
> Maybe I should look at readahead code?

Yes, you're right, read-ahead is in fact even more likely to catch the
race.

> In case of truncate it is going irrevocable away. Accesses after truncate
> should (and sometime give you) SIGBUS...

Yes. But I'd rather have a small race where we under certain strange
circumstances forget to raise a SIGBUS than have a kernel bug that causes
oops and possible filesystem corruption. 

So I'm not just looking for a fix, I'm also looking for a SIMPLE fix, with
perhaps some major surgery later (things like making the inode semaphore a
read-write semaphore and just always synchronizing 100% on i_size - which
would fix the problem, but is not a 2.4.x thing, no way Jose).

> First page->mapping == NULL entry in syslog is dated 22:23:58, but
> couple of entries was lost before (probably I should print only '.' for
> each such page; this run there was more than 100 such pages)
> Another question is why SIGCHLD was delivered to parent AFTER ftruncate,
> but exit was invoked couple of seconds before - maybe it syncs
> child address space to disk?

exit() basically does do a msync(MSASYNC), so that could be it. 

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Alexander Viro wrote:

> On Mon, 23 Oct 2000, Linus Torvalds wrote:
> 
> > Al, any ideas? I have this feeling that the simplest fix is just to leave
> > the race open, and make truncate_complete_page() just leave such a "racy"
> > page in the page cache. It will still race, and the invalid page will
> > still exist, but the end result should be harmless.
> 
> Provided that we clean it - why the hell do we want to take it out of
> the pagecache? I don't see any fundamental reasons to prohibit pages
> past the ->i_size being hashed.

In fact, we used to allow them.

We do want to remove it from the page cache under normal circumstances: it
makes for much better MM behaviour (ie free pages that are truly useless).
But I suspect that just adding a test for "page_count(page) == 1" (ie same
as for the page cache invalidation) before freeing it should give that
advantage, along with avoiding the problematic unlikely case...

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Petr Vandrovec

On 23 Oct 00 at 16:13, Alexander Viro wrote:
> On Mon, 23 Oct 2000, Linus Torvalds wrote:
> 
> > Al, any ideas? I have this feeling that the simplest fix is just to leave
> > the race open, and make truncate_complete_page() just leave such a "racy"
> > page in the page cache. It will still race, and the invalid page will
> > still exist, but the end result should be harmless.
> 
> Provided that we clean it - why the hell do we want to take it out of
> the pagecache? I don't see any fundamental reasons to prohibit pages
> past the ->i_size being hashed. Methods _must_ check for ->i_size,
> but they do it anyway. All race-prevention is based on page locks and
> ->i_sem.
> 
> Yes, filemap_nopage() should check for i_size at the very end and fail if
> the page became off-limits. But that's completely unrelated issue - it's
> mmap semantics, not pagecache one.

Yes. Bad news. No problem was catched in filemap_nopage, but one
(of 57000) pages was dirty and had page->mapping == NULL... (maybe
only one was caused that this was just after bootup, with plenty of memory)
Maybe I should look at readahead code? Although to be clear I do not 
know why. Unless there is bug in logic in test program, it should 
first dirty pages, and AFTER that it should truncate - and unmap and 
exit, without ever touching pages of mapping... My first testcases were 
with this race (and with raw devices), but then I found (by removing 
more and more code) that no race (and no raw devices) are required...
 
> The point being: we should _never_ drop ->mapping unless the page is
> irrevocably going away. We can (and probably should) drop the off-limits
> page as soon as ->count hits zero, but we should not do it before that.

In case of truncate it is going irrevocable away. Accesses after truncate
should (and sometime give you) SIGBUS...

 total   used   free sharedbuffers cached
Mem:255768  42208 213560  0496  18420
-/+ buffers/cache:  23292 232476
Swap:   530136  13200 516936

Strace of another run:

1688  22:23:41.748438 execve("./oopsdemo", ["./oopsdemo"], [/* 18 vars */]) = 0
1688  22:23:41.749058 brk(0)= 0x8049ae8
1688  22:23:41.749399 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40016000
1688  22:23:41.749641 open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file 
or directory)
1688  22:23:41.749861 open("/etc/ld.so.cache", O_RDONLY) = 4
1688  22:23:41.750011 fstat(4, {st_mode=S_IFREG|0644, st_size=43818, ...}) = 0
1688  22:23:41.750253 old_mmap(NULL, 43818, PROT_READ, MAP_PRIVATE, 4, 0) = 0x40017000
1688  22:23:41.750408 close(4)  = 0
1688  22:23:41.750538 open("/lib/libc.so.6", O_RDONLY) = 4
1688  22:23:41.750676 fstat(4, {st_mode=S_IFREG|0755, st_size=1057576, ...}) = 0
1688  22:23:41.750878 read(4, 
"\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\224\314"..., 4096) = 4096
1688  22:23:41.751173 old_mmap(NULL, 1072484, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) 
= 0x40022000
1688  22:23:41.751327 mprotect(0x4011e000, 40292, PROT_NONE) = 0
1688  22:23:41.751441 old_mmap(0x4011e000, 24576, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED, 4, 0xfb000) = 0x4011e000
1688  22:23:41.751633 old_mmap(0x40124000, 15716, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x40124000
1688  22:23:41.751809 close(4)  = 0
1688  22:23:41.753291 munmap(0x40017000, 43818) = 0
1688  22:23:41.753554 getpid()  = 1688
1688  22:23:41.753785 fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(4, 1), ...}) = 0
1688  22:23:41.753993 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40017000
1688  22:23:41.754162 ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
1688  22:23:41.754430 write(1, "Go\n", 3) = 3
1688  22:23:41.754727 open("ram0", O_RDWR|O_CREAT, 0600) = 4
1688  22:23:41.754949 unlink("ram0")= 0
1688  22:23:41.755103 ftruncate(4, 234881024) = 0
1688  22:23:41.755228 old_mmap(NULL, 234881024, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 
0) = 0x40128000
1688  22:23:41.755700 pipe([5, 6])  = 0
1688  22:23:41.755847 pipe([7, 8])  = 0
1688  22:23:41.756024 fork()= 1689
1689  22:23:41.756735 close(6 
1688  22:23:41.756798 close(5 
1689  22:23:41.756844 <... close resumed> ) = 0
1688  22:23:41.756894 <... close resumed> ) = 0
1689  22:23:41.756949 close(7 
1688  22:23:41.756997 close(8 
1689  22:23:41.757041 <... close resumed> ) = 0
1688  22:23:41.757089 <... close resumed> ) = 0
1689  22:23:41.757139 close(4 
1688  22:23:41.757188 write(6, "\0", 1 
1689  22:23:41.757250 <... close resumed> ) = 0
1688  22:23:41.757301 <... write resumed> ) = 1
1689  22:23:41.757355 read(5,  
1688  22:23:41.757405 read(7,  
1689  22:23:41.757450 <... read resumed> "\0", 1) = 1
1689  22:23:49.260756 write(8, "\0", 1) = 1
1689  22:23:49.436204 read(5,  
1688  22:23:49.442969 <... read resumed> 

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro

On Mon, 23 Oct 2000, Linus Torvalds wrote:

> Al, any ideas? I have this feeling that the simplest fix is just to leave
> the race open, and make truncate_complete_page() just leave such a "racy"
> page in the page cache. It will still race, and the invalid page will
> still exist, but the end result should be harmless.

Provided that we clean it - why the hell do we want to take it out of
the pagecache? I don't see any fundamental reasons to prohibit pages
past the ->i_size being hashed. Methods _must_ check for ->i_size,
but they do it anyway. All race-prevention is based on page locks and
->i_sem.

Yes, filemap_nopage() should check for i_size at the very end and fail if
the page became off-limits. But that's completely unrelated issue - it's
mmap semantics, not pagecache one.

The point being: we should _never_ drop ->mapping unless the page is
irrevocably going away. We can (and probably should) drop the off-limits
page as soon as ->count hits zero, but we should not do it before that.

Comments?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Petr Vandrovec wrote:
> > I'll take a better look at the truncate case (I consider the invalidate
> > case closed). Do you have a simple test-program around?
> 
> Well, I cannot say simple. As I was not able to reproduce it with only
> one task, code below:

Ok, without running this I can already guess at what's up.

One process does the "truncate()", and races with another process that
does a page-in.

The truncate code does

notify_change(ATTR_SIZE);

which will basically cause a "vmtruncate(inode, newsize)".

Just before this happened, the other process gets a page fault, and does a
page_cache_read(). Now, because we are low on memory (this is why it only
shows up when you're swapping), that read will take a while. During that
time, the vmtruncate() starts to execute, and sets inode->i_size to the
new value (that would cause us to no longer accept a page fault - but we
already got past that check in the faulting process). It will then
invalidate all the inode pages > i_size.

Now, the page faulting process comes back, and puts the page into the VM
space. Never mind that it has in the meantime gotten i_mapping = NULL due
to the other process doing the truncate.

Now, if I'm right, you should be able to add something like

if (!old_page->mapping)
printk("mapping went away from under us\n");

to just before the "return old_page()" case in the success path of
filemap_nopage() (mm/filemap.c), and you should see that printk() trigger
when the bug happens.

(There are other users too that are not synchronized wrt inode size
changes and could get an access to a page past the end of the file this
way - I just think that filemap_nopage is probably the one where this is
most easily seen).

The above is obviously not a bug-fix, it's just a validation of the
theory.

Al, any ideas? I have this feeling that the simplest fix is just to leave
the race open, and make truncate_complete_page() just leave such a "racy"
page in the page cache. It will still race, and the invalid page will
still exist, but the end result should be harmless.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Petr Vandrovec

On 23 Oct 00 at 12:09, Linus Torvalds wrote:
> On Mon, 23 Oct 2000, Petr Vandrovec wrote:
> > 
> > Hi Linus,
> >   unfortunately, this does not explain problem I reported. In my case,
> > underlying fs is ext2, and there is no locking around - only one task
> > does ftruncate() on (big) shareable mapped file (original code does it to
> > prevent dirty pages to be written to disk after their contents is no
> > longer relevant).
> 
> Yes. Yours is a separate problem, and is not due to invalidate_pages() at
> all, but apparently due to truncate_complete_page().
> 
> Basically, all the same old arguments do apply - we cannot just remove the
> page from the page cache if it is mapped.
> 
> Now, the truncate() case is different, because the code _tries_ to also
> zap the actual mappings. The fact that it fails to do so is a bit
> unnerving, actually. 
> 
> I'll take a better look at the truncate case (I consider the invalidate
> case closed). Do you have a simple test-program around?

Well, I cannot say simple. As I was not able to reproduce it with only
one task, code below:
(1) mlocks itself in memory (you can remove it)
(2) creates 224MB file and maps it shareable into memory (224 is
for 256MB machine; I was not able to reproduce it with less than 100MB
file on 256MB machine) (and you must have 224MB of free disk space
in current directory, otherwise...) and unlinks it
(3) fork
(4) child process dirties memory
(5) parent signals child to exit and wait 5 secs
(6) child exits (hopefully...)
(7) parent unmaps data
(8) parent exits
You'll always get page->mapping == NULL dereference... And btw, if
I start code below on my machine (dual PIII/450), after 2 secs of run
whole machine stops - only ping and console switching works - and stays
that way for more than 20 secs. After that program finishes (killing
kernel if no antioops in filemap_write_page present...) and everything
returns to normal (if "-/+ buffers/cache: 28848  226920" is normal ;-) ).

  You can look at "http://platan.vc.cvut.cz/listit" for dump of "struct page"
at time when page->mapping == NULL was catched in filemap_write_page,
or through linux-kernel@ archive searching for 'page->mapping == NULL'
during last month, either from me or from Quintella.
Thanks,
Petr Vandrovec
[EMAIL PROTECTED]


#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

unsigned char zero;

#define MSIZE 0x0E00

void x4768(void) {
int fd;
pid_t pid;
int x4778_1[2];
int x4778_2[2];
int from4778;
int to4778;
char* mma[65536];
unsigned int mml[65536];
unsigned int mmi = 0;
unsigned int ln = 0;
unsigned int x;

#define MAL2(a,l) ftruncate(fd, ln+l); mml[mmi] = l; mma[mmi++] = mmap(a, l, 
PROT_READ|PROT_WRITE, MAP_SHARED, fd, ln); ln += l; mma[mmi-1][0] = 99;
#define MAL(l) MAL2(NULL,l)

fd = open("ram0", O_RDWR | O_CREAT, 0600);
unlink("ram0");
MAL(MSIZE);
pipe(x4778_1);
pipe(x4778_2);
pid = fork();
if (!pid) {
int from4768 = x4778_1[0];
int to4768 = x4778_2[1];
close(x4778_1[1]);
close(x4778_2[0]);
close(fd);
read(from4768, , 1);
for (x = 0; x < mml[mmi - 1]; x += 4096)
mma[mmi-1][x] = 98;
write(to4768, , 1);
read(from4768, , 1);
exit(0);

} else if (pid < 0) {
perror("fork failed");
exit(222);
}
to4778 = x4778_1[1];
from4778 = x4778_2[0];
close(x4778_1[0]);
close(x4778_2[1]);

{
write(to4778, , 1);
read(from4778, , 1);

write(to4778, , 1);
sleep(5);
ftruncate(fd, 0);
}
while (mmi--) {
munmap(mma[mmi], mml[mmi]);
}
exit(0);
}

int main(int argc, char* argv[]) {
printf("Go\n");
mlockall(MCL_CURRENT);

x4768();
return 0;
}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Petr Vandrovec wrote:
> 
> Hi Linus,
>   unfortunately, this does not explain problem I reported. In my case,
> underlying fs is ext2, and there is no locking around - only one task
> does ftruncate() on (big) shareable mapped file (original code does it to
> prevent dirty pages to be written to disk after their contents is no
> longer relevant).

Yes. Yours is a separate problem, and is not due to invalidate_pages() at
all, but apparently due to truncate_complete_page().

Basically, all the same old arguments do apply - we cannot just remove the
page from the page cache if it is mapped.

Now, the truncate() case is different, because the code _tries_ to also
zap the actual mappings. The fact that it fails to do so is a bit
unnerving, actually. 

I'll take a better look at the truncate case (I consider the invalidate
case closed). Do you have a simple test-program around?

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Petr Vandrovec

On 23 Oct 00 at 10:33, Linus Torvalds wrote:
> Trond Myklebust  <[EMAIL PROTECTED]> wrote:
> >
> >As for simply settling for a self-consistent mmap() rather than
> >tackling the problem of rereading; the main crime is that you're
> >rendering file locking unusable.
...
> This is not a crime.
> 
...
> And yes, I know that file locking can try to be nice in both of these
> cases.  I know that there are systems that try to revert shared mappings
> etc upon file locking (or not allow file locking at all if mappings are
> present).  Linux is not one of them, and never has been. Neither are
> most UNIXes out there.
> Linus

Hi Linus,
  unfortunately, this does not explain problem I reported. In my case,
underlying fs is ext2, and there is no locking around - only one task
does ftruncate() on (big) shareable mapped file (original code does it to
prevent dirty pages to be written to disk after their contents is no
longer relevant). And for some reason, not all instance of shared mapping 
are invalidated - some are still connected to underyling pages, but 
these pages have page->mapping = NULL... (and no, in testcase there
is no vmware around). So on task exit kernel dies with page->mapping
dereference in filemap_write_page (->filemap_sync->filemap_unmap->exit_mmap->
->mmput->do_exit)...
Thanks,
Petr Vandrovec
[EMAIL PROTECTED]


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
Trond Myklebust  <[EMAIL PROTECTED]> wrote:
>
>As for simply settling for a self-consistent mmap() rather than
>tackling the problem of rereading; the main crime is that you're
>rendering file locking unusable.

This is not a crime.

Anybody who uses file locking over NFS is buggy and nobody sane should
expect it to work reliably. There are good reasons why mail agents etc
depend on other kinds of locking over NFS.

Furthermore, anybody who expects file locking to work _anywhere_ (ie
including local filesystems) in the presense of file mapping is just so
out to lunch that it's ridiculous. 

Neither of these issues have anything to do with Linux. They are just
statements of fact.

And yes, I know that file locking can try to be nice in both of these
cases.  I know that there are systems that try to revert shared mappings
etc upon file locking (or not allow file locking at all if mappings are
present).  Linux is not one of them, and never has been. Neither are
most UNIXes out there.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: PC speaker driver patch for 2.4.0-test10-pre3

2000-10-23 Thread Oliver Xymoron

On Mon, 23 Oct 2000, David Woodhouse wrote:

> 
> [EMAIL PROTECTED] said:
> >  You can also pretty trivially keep track of an error term so that the
> > clock is right on average: 
> 
> True, but I don't want 'right on average'. I want 'not screwed with at all'.
> Shifting the timer tick onto the RTC will give me that. 
> 
> Even if we _do_ get the maths right, fudging with the frequency of the 
> system timer is still an ugly hack.

I personally think the system timer is already an ugly hack. HZ is
arbitrary, slow by modern standards, and introduces latency.
As the comment you quoted points out, it's also not very accurate.  
Much better would be an agenda structure with one shot timers between
events and jiffies based on cycle counters. This works on modern hardware
and scales well for higher processor speeds. 

As for the current inaccuracy, the jitter introduced by using an error
term can never be more than a single real interrupt cycle, which is
already well below the average userspace latency.

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.." 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: PC speaker driver patch for 2.4.0-test10-pre3

2000-10-23 Thread David Woodhouse


[EMAIL PROTECTED] said:
>  You can also pretty trivially keep track of an error term so that the
> clock is right on average: 

True, but I don't want 'right on average'. I want 'not screwed with at all'.
Shifting the timer tick onto the RTC will give me that. 

Even if we _do_ get the maths right, fudging with the frequency of the 
system timer is still an ugly hack.



--
dwmw2


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: PC speaker driver patch for 2.4.0-test10-pre3

2000-10-23 Thread Oliver Xymoron

On Mon, 23 Oct 2000, David Woodhouse wrote:

> See arch/i386/kernel/time.c:
> 
> /* This function must be called with interrupts disabled 
>  * It was inspired by Steve McCanne's microtime-i386 for BSD.  -- jrs
>  * 
>  * However, the pc-audio speaker driver changes the divisor so that
>  * it gets interrupted rather more often - it loads 64 into the
>  * counter rather than 11932! This has an adverse impact on
>  * do_gettimeoffset() -- it stops working! What is also not
>  * good is that the interval that our timer function gets called
>  * is no longer 10.0002 ms, but 9.9767 ms. To get around this
>  * would require using a different timing source.

You can also pretty trivially keep track of an error term so that the
clock is right on average:

#define CLK_HZ 1193200 /* or whatever */
#define ET_PRECISION 1000
#define ET_TIMER_TICKS CLK_HZ*ET_PRECISION/HZ
#define ET_PCSP_TICKS 64*ET_PRECISION
...
static long et=0;

...
et+=ET_PCSP_TICKS;
if(et>ET_TIMER_TICKS) {
et-=ET_TIMER_TICKS;
simulate_timer_tick();
}

This keeps track of the residual timing error to a few decimal places to
several decimal places and will get closer to HZ than even the current
code.

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.." 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: PC speaker driver patch for 2.4.0-test10-pre3

2000-10-23 Thread David Woodhouse


[EMAIL PROTECTED] said:
> Apparently Linus doesn't like the way it handles interrupts or
> something, and is therefore 'wrong.' ::shrug:: As long as it's
> available though, I'll use it ;p 

It's not just 'wrong'. It's 'sick'. But it works. The correct fix is to
shift the system timer onto the RTC, leaving the 8253 timer available for us
to abuse for the PC speaker.

See arch/i386/kernel/time.c:

/* This function must be called with interrupts disabled 
 * It was inspired by Steve McCanne's microtime-i386 for BSD.  -- jrs
 * 
 * However, the pc-audio speaker driver changes the divisor so that
 * it gets interrupted rather more often - it loads 64 into the
 * counter rather than 11932! This has an adverse impact on
 * do_gettimeoffset() -- it stops working! What is also not
 * good is that the interval that our timer function gets called
 * is no longer 10.0002 ms, but 9.9767 ms. To get around this
 * would require using a different timing source. Maybe someone
 * could use the RTC - I know that this can interrupt at frequencies
 * ranging from 8192Hz to 2Hz. If I had the energy, I'd somehow fix
 * it so that at startup, the timer code in sched.c would select
 * using either the RTC or the 8253 timer. The decision would be
 * based on whether there was any other device around that needed
 * to trample on the 8253. I'd set up the RTC to interrupt at 1024 Hz,
 * and then do some jiggery to have a version of do_timer that 
 * advanced the clock by 1/1024 s. Every time that reached over 1/100
 * of a second, then do all the old code. If the time was kept correct
 * then do_gettimeoffset could just return 0 - there is no low order
 * divider that can be accessed.
 *
 * Ideally, you would be able to use the RTC for the speaker driver,
 * but it appears that the speaker driver really needs interrupt more
 * often than every 120 us or so.
 *
 * Anyway, this needs more thought  pjsg (1993-08-28)
 * 
 * If you are really that interested, you should be reading
 * comp.protocols.time.ntp!
 */


--
dwmw2


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: PC speaker driver patch for 2.4.0-test10-pre3

2000-10-23 Thread David Woodhouse


[EMAIL PROTECTED] said:
 Apparently Linus doesn't like the way it handles interrupts or
 something, and is therefore 'wrong.' ::shrug:: As long as it's
 available though, I'll use it ;p 

It's not just 'wrong'. It's 'sick'. But it works. The correct fix is to
shift the system timer onto the RTC, leaving the 8253 timer available for us
to abuse for the PC speaker.

See arch/i386/kernel/time.c:

/* This function must be called with interrupts disabled 
 * It was inspired by Steve McCanne's microtime-i386 for BSD.  -- jrs
 * 
 * However, the pc-audio speaker driver changes the divisor so that
 * it gets interrupted rather more often - it loads 64 into the
 * counter rather than 11932! This has an adverse impact on
 * do_gettimeoffset() -- it stops working! What is also not
 * good is that the interval that our timer function gets called
 * is no longer 10.0002 ms, but 9.9767 ms. To get around this
 * would require using a different timing source. Maybe someone
 * could use the RTC - I know that this can interrupt at frequencies
 * ranging from 8192Hz to 2Hz. If I had the energy, I'd somehow fix
 * it so that at startup, the timer code in sched.c would select
 * using either the RTC or the 8253 timer. The decision would be
 * based on whether there was any other device around that needed
 * to trample on the 8253. I'd set up the RTC to interrupt at 1024 Hz,
 * and then do some jiggery to have a version of do_timer that 
 * advanced the clock by 1/1024 s. Every time that reached over 1/100
 * of a second, then do all the old code. If the time was kept correct
 * then do_gettimeoffset could just return 0 - there is no low order
 * divider that can be accessed.
 *
 * Ideally, you would be able to use the RTC for the speaker driver,
 * but it appears that the speaker driver really needs interrupt more
 * often than every 120 us or so.
 *
 * Anyway, this needs more thought  pjsg (1993-08-28)
 * 
 * If you are really that interested, you should be reading
 * comp.protocols.time.ntp!
 */


--
dwmw2


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: PC speaker driver patch for 2.4.0-test10-pre3

2000-10-23 Thread Oliver Xymoron

On Mon, 23 Oct 2000, David Woodhouse wrote:

 See arch/i386/kernel/time.c:
 
 /* This function must be called with interrupts disabled 
  * It was inspired by Steve McCanne's microtime-i386 for BSD.  -- jrs
  * 
  * However, the pc-audio speaker driver changes the divisor so that
  * it gets interrupted rather more often - it loads 64 into the
  * counter rather than 11932! This has an adverse impact on
  * do_gettimeoffset() -- it stops working! What is also not
  * good is that the interval that our timer function gets called
  * is no longer 10.0002 ms, but 9.9767 ms. To get around this
  * would require using a different timing source.

You can also pretty trivially keep track of an error term so that the
clock is right on average:

#define CLK_HZ 1193200 /* or whatever */
#define ET_PRECISION 1000
#define ET_TIMER_TICKS CLK_HZ*ET_PRECISION/HZ
#define ET_PCSP_TICKS 64*ET_PRECISION
...
static long et=0;

...
et+=ET_PCSP_TICKS;
if(etET_TIMER_TICKS) {
et-=ET_TIMER_TICKS;
simulate_timer_tick();
}

This keeps track of the residual timing error to a few decimal places to
several decimal places and will get closer to HZ than even the current
code.

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.." 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: PC speaker driver patch for 2.4.0-test10-pre3

2000-10-23 Thread David Woodhouse


[EMAIL PROTECTED] said:
  You can also pretty trivially keep track of an error term so that the
 clock is right on average: 

True, but I don't want 'right on average'. I want 'not screwed with at all'.
Shifting the timer tick onto the RTC will give me that. 

Even if we _do_ get the maths right, fudging with the frequency of the 
system timer is still an ugly hack.



--
dwmw2


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: PC speaker driver patch for 2.4.0-test10-pre3

2000-10-23 Thread Oliver Xymoron

On Mon, 23 Oct 2000, David Woodhouse wrote:

 
 [EMAIL PROTECTED] said:
   You can also pretty trivially keep track of an error term so that the
  clock is right on average: 
 
 True, but I don't want 'right on average'. I want 'not screwed with at all'.
 Shifting the timer tick onto the RTC will give me that. 
 
 Even if we _do_ get the maths right, fudging with the frequency of the 
 system timer is still an ugly hack.

I personally think the system timer is already an ugly hack. HZ is
arbitrary, slow by modern standards, and introduces latency.
As the comment you quoted points out, it's also not very accurate.  
Much better would be an agenda structure with one shot timers between
events and jiffies based on cycle counters. This works on modern hardware
and scales well for higher processor speeds. 

As for the current inaccuracy, the jitter introduced by using an error
term can never be more than a single real interrupt cycle, which is
already well below the average userspace latency.

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.." 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds

In article [EMAIL PROTECTED],
Trond Myklebust  [EMAIL PROTECTED] wrote:

As for simply settling for a self-consistent mmap() rather than
tackling the problem of rereading; the main crime is that you're
rendering file locking unusable.

This is not a crime.

Anybody who uses file locking over NFS is buggy and nobody sane should
expect it to work reliably. There are good reasons why mail agents etc
depend on other kinds of locking over NFS.

Furthermore, anybody who expects file locking to work _anywhere_ (ie
including local filesystems) in the presense of file mapping is just so
out to lunch that it's ridiculous. 

Neither of these issues have anything to do with Linux. They are just
statements of fact.

And yes, I know that file locking can try to be nice in both of these
cases.  I know that there are systems that try to revert shared mappings
etc upon file locking (or not allow file locking at all if mappings are
present).  Linux is not one of them, and never has been. Neither are
most UNIXes out there.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Petr Vandrovec

On 23 Oct 00 at 10:33, Linus Torvalds wrote:
 Trond Myklebust  [EMAIL PROTECTED] wrote:
 
 As for simply settling for a self-consistent mmap() rather than
 tackling the problem of rereading; the main crime is that you're
 rendering file locking unusable.
...
 This is not a crime.
 
...
 And yes, I know that file locking can try to be nice in both of these
 cases.  I know that there are systems that try to revert shared mappings
 etc upon file locking (or not allow file locking at all if mappings are
 present).  Linux is not one of them, and never has been. Neither are
 most UNIXes out there.
 Linus

Hi Linus,
  unfortunately, this does not explain problem I reported. In my case,
underlying fs is ext2, and there is no locking around - only one task
does ftruncate() on (big) shareable mapped file (original code does it to
prevent dirty pages to be written to disk after their contents is no
longer relevant). And for some reason, not all instance of shared mapping 
are invalidated - some are still connected to underyling pages, but 
these pages have page-mapping = NULL... (and no, in testcase there
is no vmware around). So on task exit kernel dies with page-mapping
dereference in filemap_write_page (-filemap_sync-filemap_unmap-exit_mmap-
-mmput-do_exit)...
Thanks,
Petr Vandrovec
[EMAIL PROTECTED]


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Petr Vandrovec wrote:
 
 Hi Linus,
   unfortunately, this does not explain problem I reported. In my case,
 underlying fs is ext2, and there is no locking around - only one task
 does ftruncate() on (big) shareable mapped file (original code does it to
 prevent dirty pages to be written to disk after their contents is no
 longer relevant).

Yes. Yours is a separate problem, and is not due to invalidate_pages() at
all, but apparently due to truncate_complete_page().

Basically, all the same old arguments do apply - we cannot just remove the
page from the page cache if it is mapped.

Now, the truncate() case is different, because the code _tries_ to also
zap the actual mappings. The fact that it fails to do so is a bit
unnerving, actually. 

I'll take a better look at the truncate case (I consider the invalidate
case closed). Do you have a simple test-program around?

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Petr Vandrovec

On 23 Oct 00 at 12:09, Linus Torvalds wrote:
 On Mon, 23 Oct 2000, Petr Vandrovec wrote:
  
  Hi Linus,
unfortunately, this does not explain problem I reported. In my case,
  underlying fs is ext2, and there is no locking around - only one task
  does ftruncate() on (big) shareable mapped file (original code does it to
  prevent dirty pages to be written to disk after their contents is no
  longer relevant).
 
 Yes. Yours is a separate problem, and is not due to invalidate_pages() at
 all, but apparently due to truncate_complete_page().
 
 Basically, all the same old arguments do apply - we cannot just remove the
 page from the page cache if it is mapped.
 
 Now, the truncate() case is different, because the code _tries_ to also
 zap the actual mappings. The fact that it fails to do so is a bit
 unnerving, actually. 
 
 I'll take a better look at the truncate case (I consider the invalidate
 case closed). Do you have a simple test-program around?

Well, I cannot say simple. As I was not able to reproduce it with only
one task, code below:
(1) mlocks itself in memory (you can remove it)
(2) creates 224MB file and maps it shareable into memory (224 is
for 256MB machine; I was not able to reproduce it with less than 100MB
file on 256MB machine) (and you must have 224MB of free disk space
in current directory, otherwise...) and unlinks it
(3) fork
(4) child process dirties memory
(5) parent signals child to exit and wait 5 secs
(6) child exits (hopefully...)
(7) parent unmaps data
(8) parent exits
You'll always get page-mapping == NULL dereference... And btw, if
I start code below on my machine (dual PIII/450), after 2 secs of run
whole machine stops - only ping and console switching works - and stays
that way for more than 20 secs. After that program finishes (killing
kernel if no antioops in filemap_write_page present...) and everything
returns to normal (if "-/+ buffers/cache: 28848  226920" is normal ;-) ).

  You can look at "http://platan.vc.cvut.cz/listit" for dump of "struct page"
at time when page-mapping == NULL was catched in filemap_write_page,
or through linux-kernel@ archive searching for 'page-mapping == NULL'
during last month, either from me or from Quintella.
Thanks,
Petr Vandrovec
[EMAIL PROTECTED]


#include sys/mman.h
#include sys/stat.h
#include sys/types.h
#include errno.h
#include stdio.h
#include unistd.h
#include fcntl.h
#include sys/shm.h
#include sys/ioctl.h
#include sys/raw.h

unsigned char zero;

#define MSIZE 0x0E00

void x4768(void) {
int fd;
pid_t pid;
int x4778_1[2];
int x4778_2[2];
int from4778;
int to4778;
char* mma[65536];
unsigned int mml[65536];
unsigned int mmi = 0;
unsigned int ln = 0;
unsigned int x;

#define MAL2(a,l) ftruncate(fd, ln+l); mml[mmi] = l; mma[mmi++] = mmap(a, l, 
PROT_READ|PROT_WRITE, MAP_SHARED, fd, ln); ln += l; mma[mmi-1][0] = 99;
#define MAL(l) MAL2(NULL,l)

fd = open("ram0", O_RDWR | O_CREAT, 0600);
unlink("ram0");
MAL(MSIZE);
pipe(x4778_1);
pipe(x4778_2);
pid = fork();
if (!pid) {
int from4768 = x4778_1[0];
int to4768 = x4778_2[1];
close(x4778_1[1]);
close(x4778_2[0]);
close(fd);
read(from4768, zero, 1);
for (x = 0; x  mml[mmi - 1]; x += 4096)
mma[mmi-1][x] = 98;
write(to4768, zero, 1);
read(from4768, zero, 1);
exit(0);

} else if (pid  0) {
perror("fork failed");
exit(222);
}
to4778 = x4778_1[1];
from4778 = x4778_2[0];
close(x4778_1[0]);
close(x4778_2[1]);

{
write(to4778, zero, 1);
read(from4778, zero, 1);

write(to4778, zero, 1);
sleep(5);
ftruncate(fd, 0);
}
while (mmi--) {
munmap(mma[mmi], mml[mmi]);
}
exit(0);
}

int main(int argc, char* argv[]) {
printf("Go\n");
mlockall(MCL_CURRENT);

x4768();
return 0;
}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Petr Vandrovec wrote:
  I'll take a better look at the truncate case (I consider the invalidate
  case closed). Do you have a simple test-program around?
 
 Well, I cannot say simple. As I was not able to reproduce it with only
 one task, code below:

Ok, without running this I can already guess at what's up.

One process does the "truncate()", and races with another process that
does a page-in.

The truncate code does

notify_change(ATTR_SIZE);

which will basically cause a "vmtruncate(inode, newsize)".

Just before this happened, the other process gets a page fault, and does a
page_cache_read(). Now, because we are low on memory (this is why it only
shows up when you're swapping), that read will take a while. During that
time, the vmtruncate() starts to execute, and sets inode-i_size to the
new value (that would cause us to no longer accept a page fault - but we
already got past that check in the faulting process). It will then
invalidate all the inode pages  i_size.

Now, the page faulting process comes back, and puts the page into the VM
space. Never mind that it has in the meantime gotten i_mapping = NULL due
to the other process doing the truncate.

Now, if I'm right, you should be able to add something like

if (!old_page-mapping)
printk("mapping went away from under us\n");

to just before the "return old_page()" case in the success path of
filemap_nopage() (mm/filemap.c), and you should see that printk() trigger
when the bug happens.

(There are other users too that are not synchronized wrt inode size
changes and could get an access to a page past the end of the file this
way - I just think that filemap_nopage is probably the one where this is
most easily seen).

The above is obviously not a bug-fix, it's just a validation of the
theory.

Al, any ideas? I have this feeling that the simplest fix is just to leave
the race open, and make truncate_complete_page() just leave such a "racy"
page in the page cache. It will still race, and the invalid page will
still exist, but the end result should be harmless.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro

On Mon, 23 Oct 2000, Linus Torvalds wrote:

 Al, any ideas? I have this feeling that the simplest fix is just to leave
 the race open, and make truncate_complete_page() just leave such a "racy"
 page in the page cache. It will still race, and the invalid page will
 still exist, but the end result should be harmless.

Provided that we clean it - why the hell do we want to take it out of
the pagecache? I don't see any fundamental reasons to prohibit pages
past the -i_size being hashed. Methods _must_ check for -i_size,
but they do it anyway. All race-prevention is based on page locks and
-i_sem.

Yes, filemap_nopage() should check for i_size at the very end and fail if
the page became off-limits. But that's completely unrelated issue - it's
mmap semantics, not pagecache one.

The point being: we should _never_ drop -mapping unless the page is
irrevocably going away. We can (and probably should) drop the off-limits
page as soon as -count hits zero, but we should not do it before that.

Comments?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Petr Vandrovec

On 23 Oct 00 at 16:13, Alexander Viro wrote:
 On Mon, 23 Oct 2000, Linus Torvalds wrote:
 
  Al, any ideas? I have this feeling that the simplest fix is just to leave
  the race open, and make truncate_complete_page() just leave such a "racy"
  page in the page cache. It will still race, and the invalid page will
  still exist, but the end result should be harmless.
 
 Provided that we clean it - why the hell do we want to take it out of
 the pagecache? I don't see any fundamental reasons to prohibit pages
 past the -i_size being hashed. Methods _must_ check for -i_size,
 but they do it anyway. All race-prevention is based on page locks and
 -i_sem.
 
 Yes, filemap_nopage() should check for i_size at the very end and fail if
 the page became off-limits. But that's completely unrelated issue - it's
 mmap semantics, not pagecache one.

Yes. Bad news. No problem was catched in filemap_nopage, but one
(of 57000) pages was dirty and had page-mapping == NULL... (maybe
only one was caused that this was just after bootup, with plenty of memory)
Maybe I should look at readahead code? Although to be clear I do not 
know why. Unless there is bug in logic in test program, it should 
first dirty pages, and AFTER that it should truncate - and unmap and 
exit, without ever touching pages of mapping... My first testcases were 
with this race (and with raw devices), but then I found (by removing 
more and more code) that no race (and no raw devices) are required...
 
 The point being: we should _never_ drop -mapping unless the page is
 irrevocably going away. We can (and probably should) drop the off-limits
 page as soon as -count hits zero, but we should not do it before that.

In case of truncate it is going irrevocable away. Accesses after truncate
should (and sometime give you) SIGBUS...

 total   used   free sharedbuffers cached
Mem:255768  42208 213560  0496  18420
-/+ buffers/cache:  23292 232476
Swap:   530136  13200 516936

Strace of another run:

1688  22:23:41.748438 execve("./oopsdemo", ["./oopsdemo"], [/* 18 vars */]) = 0
1688  22:23:41.749058 brk(0)= 0x8049ae8
1688  22:23:41.749399 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40016000
1688  22:23:41.749641 open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file 
or directory)
1688  22:23:41.749861 open("/etc/ld.so.cache", O_RDONLY) = 4
1688  22:23:41.750011 fstat(4, {st_mode=S_IFREG|0644, st_size=43818, ...}) = 0
1688  22:23:41.750253 old_mmap(NULL, 43818, PROT_READ, MAP_PRIVATE, 4, 0) = 0x40017000
1688  22:23:41.750408 close(4)  = 0
1688  22:23:41.750538 open("/lib/libc.so.6", O_RDONLY) = 4
1688  22:23:41.750676 fstat(4, {st_mode=S_IFREG|0755, st_size=1057576, ...}) = 0
1688  22:23:41.750878 read(4, 
"\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\224\314"..., 4096) = 4096
1688  22:23:41.751173 old_mmap(NULL, 1072484, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) 
= 0x40022000
1688  22:23:41.751327 mprotect(0x4011e000, 40292, PROT_NONE) = 0
1688  22:23:41.751441 old_mmap(0x4011e000, 24576, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED, 4, 0xfb000) = 0x4011e000
1688  22:23:41.751633 old_mmap(0x40124000, 15716, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x40124000
1688  22:23:41.751809 close(4)  = 0
1688  22:23:41.753291 munmap(0x40017000, 43818) = 0
1688  22:23:41.753554 getpid()  = 1688
1688  22:23:41.753785 fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(4, 1), ...}) = 0
1688  22:23:41.753993 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40017000
1688  22:23:41.754162 ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
1688  22:23:41.754430 write(1, "Go\n", 3) = 3
1688  22:23:41.754727 open("ram0", O_RDWR|O_CREAT, 0600) = 4
1688  22:23:41.754949 unlink("ram0")= 0
1688  22:23:41.755103 ftruncate(4, 234881024) = 0
1688  22:23:41.755228 old_mmap(NULL, 234881024, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 
0) = 0x40128000
1688  22:23:41.755700 pipe([5, 6])  = 0
1688  22:23:41.755847 pipe([7, 8])  = 0
1688  22:23:41.756024 fork()= 1689
1689  22:23:41.756735 close(6 unfinished ...
1688  22:23:41.756798 close(5 unfinished ...
1689  22:23:41.756844 ... close resumed ) = 0
1688  22:23:41.756894 ... close resumed ) = 0
1689  22:23:41.756949 close(7 unfinished ...
1688  22:23:41.756997 close(8 unfinished ...
1689  22:23:41.757041 ... close resumed ) = 0
1688  22:23:41.757089 ... close resumed ) = 0
1689  22:23:41.757139 close(4 unfinished ...
1688  22:23:41.757188 write(6, "\0", 1 unfinished ...
1689  22:23:41.757250 ... close resumed ) = 0
1688  22:23:41.757301 ... write resumed ) = 1
1689  22:23:41.757355 read(5,  unfinished ...
1688  22:23:41.757405 read(7,  unfinished ...
1689  22:23:41.757450 ... read resumed "\0", 1) = 1
1689  22:23:49.260756 write(8, "\0", 1) = 1
1689  

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Alexander Viro wrote:

 On Mon, 23 Oct 2000, Linus Torvalds wrote:
 
  Al, any ideas? I have this feeling that the simplest fix is just to leave
  the race open, and make truncate_complete_page() just leave such a "racy"
  page in the page cache. It will still race, and the invalid page will
  still exist, but the end result should be harmless.
 
 Provided that we clean it - why the hell do we want to take it out of
 the pagecache? I don't see any fundamental reasons to prohibit pages
 past the -i_size being hashed.

In fact, we used to allow them.

We do want to remove it from the page cache under normal circumstances: it
makes for much better MM behaviour (ie free pages that are truly useless).
But I suspect that just adding a test for "page_count(page) == 1" (ie same
as for the page cache invalidation) before freeing it should give that
advantage, along with avoiding the problematic unlikely case...

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Petr Vandrovec wrote:
 
 Yes. Bad news. No problem was catched in filemap_nopage, but one
 (of 57000) pages was dirty and had page-mapping == NULL... (maybe
 only one was caused that this was just after bootup, with plenty of memory)
 Maybe I should look at readahead code?

Yes, you're right, read-ahead is in fact even more likely to catch the
race.

 In case of truncate it is going irrevocable away. Accesses after truncate
 should (and sometime give you) SIGBUS...

Yes. But I'd rather have a small race where we under certain strange
circumstances forget to raise a SIGBUS than have a kernel bug that causes
oops and possible filesystem corruption. 

So I'm not just looking for a fix, I'm also looking for a SIMPLE fix, with
perhaps some major surgery later (things like making the inode semaphore a
read-write semaphore and just always synchronizing 100% on i_size - which
would fix the problem, but is not a 2.4.x thing, no way Jose).

 First page-mapping == NULL entry in syslog is dated 22:23:58, but
 couple of entries was lost before (probably I should print only '.' for
 each such page; this run there was more than 100 such pages)
 Another question is why SIGCHLD was delivered to parent AFTER ftruncate,
 but exit was invoked couple of seconds before - maybe it syncs
 child address space to disk?

exit() basically does do a msync(MSASYNC), so that could be it. 

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro

On Mon, 23 Oct 2000, Linus Torvalds wrote:

 On Mon, 23 Oct 2000, Alexander Viro wrote:
 
  On Mon, 23 Oct 2000, Linus Torvalds wrote:
  
   Al, any ideas? I have this feeling that the simplest fix is just to leave
   the race open, and make truncate_complete_page() just leave such a "racy"
   page in the page cache. It will still race, and the invalid page will
   still exist, but the end result should be harmless.
  
  Provided that we clean it - why the hell do we want to take it out of
  the pagecache? I don't see any fundamental reasons to prohibit pages
  past the -i_size being hashed.
 
 In fact, we used to allow them.
 
 We do want to remove it from the page cache under normal circumstances: it
 makes for much better MM behaviour (ie free pages that are truly useless).
 But I suspect that just adding a test for "page_count(page) == 1" (ie same
 as for the page cache invalidation) before freeing it should give that
 advantage, along with avoiding the problematic unlikely case...

That's fine, but I'm afraid that we'll need a bit more than that. A couple of
obvious ones:
* filemap_nopage() needs the second check for -i_size. Upon exit.
Moreover, what is (area-vm_mm == current-mm) doing in the existing check?
* truncate() should zero the page out if it doesn't remove it from
cache.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Petr Vandrovec

On 23 Oct 00 at 13:57, Linus Torvalds wrote:
 On Mon, 23 Oct 2000, Petr Vandrovec wrote:
 
  First page-mapping == NULL entry in syslog is dated 22:23:58, but
  couple of entries was lost before (probably I should print only '.' for
  each such page; this run there was more than 100 such pages)
  Another question is why SIGCHLD was delivered to parent AFTER ftruncate,
  but exit was invoked couple of seconds before - maybe it syncs
  child address space to disk?
 
 exit() basically does do a msync(MSASYNC), so that could be it. 

Yes. With sleep(60) no oops occur (it takes ~45 secs to exit child). 
This signals to me: should not vmtruncate_list acquire mm-mmap_sem, 
if it modifies page tables? I cannot find anything what prevents doing 
vmtruncate in one task and filemap_sync in another - neither
page_table_lock spinlock, nor mmap_sem semaphore...
Thanks,
Petr Vandrovec
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Alexander Viro wrote:
 
 That's fine, but I'm afraid that we'll need a bit more than that. A couple of
 obvious ones:
   * filemap_nopage() needs the second check for -i_size. Upon exit.

Nope, that just makes the race window smaller. We should check for i_size
after we've gotten the page table lock and just before actually entering
the page into the page tables. Otherwise we'll still race on SMP (a _very_
hard window to get into, admittedly).

 Moreover, what is (area-vm_mm == current-mm) doing in the existing check?

It's for ptrace. You can do ugly things with ptrace that aren't possible
for the process itself.

   * truncate() should zero the page out if it doesn't remove it from
 cache.

So how about this truncate_complete_page() implementation:

/*
 * Try to get rid of a page.. Clear it if it fails
 * for some reason. The page must be locked upon calling
 * this function.
 *
 * We remove the page from the page cache _after_ we have
 * destroyed all buffer-cache references to it. Otherwise some
 * other process might think this inode page is not in the
 * page cache and creates a buffer-cache alias to it causing
 * all sorts of fun problems ...
 */
static inline void truncate_complete_page(struct page *page)
{
/* Try to get rid of buffers */
if (page-buffers)
block_flushpage(page, 0);

spin_lock(pagecache_lock);
spin_lock(pagemap_lru_lock);

if (page_count(page) != 1) {
memclear_highpage_flush(page, 0, PAGE_CACHE_SIZE);
} else {
ClearPageDirty(page);
__lru_cache_del(page);
__remove_inode_page(page);
page_cache_release(page);
}
spin_unlock(pagemap_lru_lock);
spin_unlock(pagecache_lock);
}

we should probably special-case the "block_flushpage()" failed case, but
the above should do reasonable things with it (because page_count() will
be  1 due to buffers).

The above is obviously completely and utterly untested. Petr? Willing to
give this a go?

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro



On Mon, 23 Oct 2000, Linus Torvalds wrote:

 On Mon, 23 Oct 2000, Alexander Viro wrote:
  
  That's fine, but I'm afraid that we'll need a bit more than that. A couple of
  obvious ones:
  * filemap_nopage() needs the second check for -i_size. Upon exit.
 
 Nope, that just makes the race window smaller. We should check for i_size
 after we've gotten the page table lock and just before actually entering
 the page into the page tables. Otherwise we'll still race on SMP (a _very_
 hard window to get into, admittedly).

Umm... I would probably remove Uptodate upon truncate() and check _that_
in the place you've mentioned.

 So how about this truncate_complete_page() implementation:
 
   /*
* Try to get rid of a page.. Clear it if it fails
* for some reason. The page must be locked upon calling
* this function.
*
* We remove the page from the page cache _after_ we have
* destroyed all buffer-cache references to it. Otherwise some
* other process might think this inode page is not in the
* page cache and creates a buffer-cache alias to it causing
* all sorts of fun problems ...
*/
   static inline void truncate_complete_page(struct page *page)
   {
   /* Try to get rid of buffers */
   if (page-buffers)
   block_flushpage(page, 0);
   
   spin_lock(pagecache_lock);
   spin_lock(pagemap_lru_lock);
   
   if (page_count(page) != 1) {
   memclear_highpage_flush(page, 0, PAGE_CACHE_SIZE);
   } else {
   ClearPageDirty(page);
ClearPageUptodate(page);

How about that?

   __lru_cache_del(page);
   __remove_inode_page(page);
   page_cache_release(page);
   }
   spin_unlock(pagemap_lru_lock);
   spin_unlock(pagecache_lock);
   }
 
 we should probably special-case the "block_flushpage()" failed case, but
 the above should do reasonable things with it (because page_count() will
 be  1 due to buffers).

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Petr Vandrovec wrote:
 
 Yes. With sleep(60) no oops occur (it takes ~45 secs to exit child). 
 This signals to me: should not vmtruncate_list acquire mm-mmap_sem, 
 if it modifies page tables?

No. 

It should get the page_table lock, but that is sufficient for anybody who
_clears_ page tables (and is pretty much the same case as paging something
out of somebody elses page tables).

I cannot find anything what prevents doing 
 vmtruncate in one task and filemap_sync in another - neither
 page_table_lock spinlock, nor mmap_sem semaphore...

filemap_sync() does hold the page table lock. 

(Which is certainly not to say that it's necessarily bug-free, but I don't
see any obvious problems off-hand).

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Alexander Viro wrote:
 
 On Mon, 23 Oct 2000, Linus Torvalds wrote:
  
  Nope, that just makes the race window smaller. We should check for i_size
  after we've gotten the page table lock and just before actually entering
  the page into the page tables. Otherwise we'll still race on SMP (a _very_
  hard window to get into, admittedly).
 
 Umm... I would probably remove Uptodate upon truncate() and check _that_
 in the place you've mentioned.

Works for me..

  ClearPageDirty(page);
   ClearPageUptodate(page);
 
 How about that?

Makes sense. 

Note that I'd actually like to hear from Petr first _without_ any added
code in the nopage() handler - the issue of having a page mapped after
nopage() that we shouldn't have mapped is a separate one, and I'd first
like to hear if the problem really goes away even if the mapping bug is
still there..

And _then_ we fix the fact that we should not allow anybody to have a page
mapped past i_size.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Petr Vandrovec

On 23 Oct 00 at 14:34, Linus Torvalds wrote:
 On Mon, 23 Oct 2000, Alexander Viro wrote:
  On Mon, 23 Oct 2000, Linus Torvalds wrote:
   
   Nope, that just makes the race window smaller. We should check for i_size
   after we've gotten the page table lock and just before actually entering
   the page into the page tables. Otherwise we'll still race on SMP (a _very_
   hard window to get into, admittedly).
  
  Umm... I would probably remove Uptodate upon truncate() and check _that_
  in the place you've mentioned.
 
 Works for me..
 
   ClearPageDirty(page);
ClearPageUptodate(page);
  
  How about that?
 
 Makes sense. 

With ClearPageDirty() kernel locked up (but no watchdog, so probably
some livelock) during bootup after fsck /. Unfortunately, 'PC' column
shows complete garbage - init,kswapd,kupdate and rcS in 'R', with
'PC'=current for rcS, 0xc1459f28 for init, 0xc146ffa8 for kswapd and
0xc1469fc8 for kupdate (needless to say that my kernel does not have
20MB... whee what's with PC? it now prints esp instead of eip?). Should 
I try ClearPageUptodate() instead?

Petr Vandrovec
[EMAIL PROTECTED]

P.S.: I have to go home. Continuing after 10 hours.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Petr Vandrovec wrote:
 
 With ClearPageDirty() kernel locked up (but no watchdog, so probably
 some livelock) during bootup after fsck /.

Yeah, the way the truncate logic works right now truncate_whole_page() has
to remove the page from the inode list - otherwise truncate ends up
looping forever trying to truncate that page ;).

And there was a off-by-one error in my first untested version anyway: the
page_count() should be tested against "2", not "1", as the truncate logic
has elevated the count anyway (probably unnecessarily: once we get the
page lock nobody else can race to remove it from the page cache anyway, so
it's not as if it could go away).

 Should I try ClearPageUptodate() instead?

No, I'll have to fix the truncate logic to allow for this all.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Petr Vandrovec wrote:
 
 With ClearPageDirty() kernel locked up (but no watchdog, so probably
 some livelock) during bootup after fsck /

Actually, it turns out that even with this issue fixed, there's the more
serious issue that the page _has_ to be removed from the page cache once
we get to the point that we're freeing the whole inode (which also calls
truncate_inode_pages).

Which means that we cannot take the easy way out and say "let's delay the
freeing until everything is ok" - even if a buffer is busy due to some
unfortunate timing (so that we turn the page into anonymous buffers), we'd
better get rid of the page from the page cache.

I'm starting to suspect that we leave this path as-is, and just fix the
mapping case (and PageUptodate() can work there). That should also avoid
the nasties.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Linus Torvalds wrote:
 
 I'm starting to suspect that we leave this path as-is, and just fix the
 mapping case (and PageUptodate() can work there). That should also avoid
 the nasties.

..and even that looks like I'd have to do the quick-and-dirty case with
the race still there on SMP. Adding the page Uptodate logic to the VM
layer proper is too painful at this point - every single nopage function
would have to be updated to mark its page up-to-date, as they don't
generally do that currently.

Also, the fact that Petr didn't see anything trigger in nopage() makes me
nervous again. Even if the problem happened during read-ahead, it should
have gotten into the address space only through nopage. Maybe there is
some vma that isn't added to the right inode VM list - so that we end up
missing part of the vmtruncate() stuff?

Al, any ideas?

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro



On Mon, 23 Oct 2000, Linus Torvalds wrote:

 Also, the fact that Petr didn't see anything trigger in nopage() makes me
 nervous again. Even if the problem happened during read-ahead, it should
 have gotten into the address space only through nopage. Maybe there is
 some vma that isn't added to the right inode VM list - so that we end up
 missing part of the vmtruncate() stuff?
 
 Al, any ideas?

Just one: let's slow down a bit and try to write down the rules. Close to
release or not, let's understand WTF happens in that part of VM before
deciding on the choice of band-aids/fixes. Frankly, right now I don't feel
that area - too many changes during the last month. So I'm going to sit
down and read it through. IMO it's worth the delay - I have a gut feeling
that band-aids are going to cost more.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro



Oh, crap... Who introduced -i_mmap_shared/-i_mmap separation and what
analysis had been done? Petr, can you reproduce the problem on -test7?
Unfortunately, clean test would take the backport of ext2 changes
(truncate-related, happened around the same time), but IIRC -test7 was
relatively stable...


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Alexander Viro wrote:
 
 Oh, crap... Who introduced -i_mmap_shared/-i_mmap separation and what
 analysis had been done? Petr, can you reproduce the problem on -test7?

I don't think that is it - that code looks very straightforward (and is
needed on some silly architectures that cannot easily otherwise see if
they need to be coherent wrt user space - mainly sparc and virtual
caches).

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro



On Mon, 23 Oct 2000, Linus Torvalds wrote:

 
 
 On Mon, 23 Oct 2000, Alexander Viro wrote:
  
  Oh, crap... Who introduced -i_mmap_shared/-i_mmap separation and what
  analysis had been done? Petr, can you reproduce the problem on -test7?
 
 I don't think that is it - that code looks very straightforward (and is
 needed on some silly architectures that cannot easily otherwise see if
 they need to be coherent wrt user space - mainly sparc and virtual
 caches).

OK, I see where the race can happen. Yes, vmtruncate() tries to kill the
mappings. Right. However, it does that _after_ truncate_inode_pages().
And there is a window when the only lock we are holding is -i_sem.
sync_pte in that window == we are fucked.

So the question being: WTF do we postpone zapping the page tables until
after the truncate_inode_pages()? The following rules might make life
simpler, AFAICS:
* as soon as -i_size is set, no new pagetable references to
off-limits pages can appear.
* as soon as we are don with vmtruncate_list() there is no
pagetable references.
* truncate_inode_pages() never has to deal with pages refered from
pagetables.
* -i_size can't increase until we return from vmtruncate().

It's not the only problem, but I would feel _much_ safer if pagefault
wouldn't rely on pagecache miss. Actually... Hey. Why don't we do the
insertion into page tables _within_ -nopage()? Look: let's take the tail
of do_no_page() into helper function and just call it from the end of
every bloody -nopage() out there. It _is_ easy: we have only 9 instances
in the tree not counting filemap_nopage(). Moreover,
do_anonymous_page() will become symmetrical to the rest of the crowd.
Comments?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds



On Tue, 24 Oct 2000, Alexander Viro wrote:
 
 It's not the only problem, but I would feel _much_ safer if pagefault
 wouldn't rely on pagecache miss. Actually... Hey. Why don't we do the
 insertion into page tables _within_ -nopage()?

NO!

We used to do this a LOONG time ago.

Distributing the VM information on how to properly insert a pte into the
page table into drivers is _NOT_ something I want to do again. It was a
pain to get it properly modularized, we're not going back to the horror.

Note that if there really are only 9 "nopage" routines, then it is a lot
easier to just add the single "SetPageUptodate(page)" into those 9
routines, and thus let the VM know of the race.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro



On Mon, 23 Oct 2000, Linus Torvalds wrote:

 Note that if there really are only 9 "nopage" routines, then it is a lot
 easier to just add the single "SetPageUptodate(page)" into those 9
 routines, and thus let the VM know of the race.

Works for me. And yes, the list of -nopage instances that can return
success is that short:
drm_vm_shm_nopage
drm_vm_shm_nopage_lock
drm_vm_dma_nopage
sgi_graphics_nopage
via_mm_nopage
ncp_file_mmap_nopage
shm_nopage
shmzero_nopage
filemap_nopage
- sorry, it's not 9+filemap_nopage, it's 8+filemap_nopage.

However, it still leaves a window for the race: we invalidate first and
remove from pagetables later. And ClearPageUptodate() is obviously
truncate_inode_pages() work. Grr...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: PC speaker driver patch for 2.4.0-test10-pre3

2000-10-22 Thread Mike A. Harris

On Sun, 22 Oct 2000, Dr. Kelsey Hudson wrote:

>Date: Sun, 22 Oct 2000 13:39:09 -0700 (PDT)
>From: Dr. Kelsey Hudson <[EMAIL PROTECTED]>
>To: Mike A. Harris <[EMAIL PROTECTED]>
>Cc: David Woodhouse <[EMAIL PROTECTED]>,
> Linux Kernel mailing list <[EMAIL PROTECTED]>
>Content-Type: TEXT/PLAIN; charset=US-ASCII
>Subject: Re: PC speaker driver patch for 2.4.0-test10-pre3
>
>> Is there a major compelling reason that this patch isn't included
>> in the standard kernel tree?
>> 
>> 
>> --
>>   Mike A. Harris  -  Linux advocate  -  Open source advocate
>>   Computer Consultant - Capslock Consulting
>>  Copyright 2000 all rights reserved
>> --
>
>Apparently Linus doesn't like the way it handles interrupts or something,
>and is therefore 'wrong.' ::shrug:: As long as it's available though, I'll
>use it ;p

Someone has explained to me why it isn't included, and it made
decent sense to me.  Probably best left as a patch...


--
  Mike A. Harris  -  Linux advocate  -  Open source advocate
  Computer Consultant - Capslock Consulting
 Copyright 2000 all rights reserved
--

#[Mike A. Harris bash tip #1 - separate history files per virtual console]
# Put the following at the bottom of your ~/.bash_profile
[ ! -d ~/.bash_histdir ] && mkdir ~/.bash_histdir
tty |grep "^/dev/tty[0-9]" >& /dev/null && \
export HISTFILE=~/.bash_histdir/.$(tty | sed -e 's/.*\///')

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: PC speaker driver patch for 2.4.0-test10-pre3

2000-10-22 Thread Dr. Kelsey Hudson

> Is there a major compelling reason that this patch isn't included
> in the standard kernel tree?
> 
> 
> --
>   Mike A. Harris  -  Linux advocate  -  Open source advocate
>   Computer Consultant - Capslock Consulting
>  Copyright 2000 all rights reserved
> --

Apparently Linus doesn't like the way it handles interrupts or something,
and is therefore 'wrong.' ::shrug:: As long as it's available though, I'll
use it ;p

 Kelsey Hudson   [EMAIL PROTECTED] 
 Software Engineer
 Compendium Technologies, Inc   (619) 725-0771
--- 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: PC speaker driver patch for 2.4.0-test10-pre3

2000-10-22 Thread Dr. Kelsey Hudson

 Is there a major compelling reason that this patch isn't included
 in the standard kernel tree?
 
 
 --
   Mike A. Harris  -  Linux advocate  -  Open source advocate
   Computer Consultant - Capslock Consulting
  Copyright 2000 all rights reserved
 --

Apparently Linus doesn't like the way it handles interrupts or something,
and is therefore 'wrong.' ::shrug:: As long as it's available though, I'll
use it ;p

 Kelsey Hudson   [EMAIL PROTECTED] 
 Software Engineer
 Compendium Technologies, Inc   (619) 725-0771
--- 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: PC speaker driver patch for 2.4.0-test10-pre3

2000-10-22 Thread Mike A. Harris

On Sun, 22 Oct 2000, Dr. Kelsey Hudson wrote:

Date: Sun, 22 Oct 2000 13:39:09 -0700 (PDT)
From: Dr. Kelsey Hudson [EMAIL PROTECTED]
To: Mike A. Harris [EMAIL PROTECTED]
Cc: David Woodhouse [EMAIL PROTECTED],
 Linux Kernel mailing list [EMAIL PROTECTED]
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: Re: PC speaker driver patch for 2.4.0-test10-pre3

 Is there a major compelling reason that this patch isn't included
 in the standard kernel tree?
 
 
 --
   Mike A. Harris  -  Linux advocate  -  Open source advocate
   Computer Consultant - Capslock Consulting
  Copyright 2000 all rights reserved
 --

Apparently Linus doesn't like the way it handles interrupts or something,
and is therefore 'wrong.' ::shrug:: As long as it's available though, I'll
use it ;p

Someone has explained to me why it isn't included, and it made
decent sense to me.  Probably best left as a patch...


--
  Mike A. Harris  -  Linux advocate  -  Open source advocate
  Computer Consultant - Capslock Consulting
 Copyright 2000 all rights reserved
--

#[Mike A. Harris bash tip #1 - separate history files per virtual console]
# Put the following at the bottom of your ~/.bash_profile
[ ! -d ~/.bash_histdir ]  mkdir ~/.bash_histdir
tty |grep "^/dev/tty[0-9]"  /dev/null  \
export HISTFILE=~/.bash_histdir/.$(tty | sed -e 's/.*\///')

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



HD Benchmark 2.2.17(ide_patch)/2.4.0-test10-pre3 AMD v1.2 ide driver

2000-10-21 Thread bruj0

Greetings all:
I was doing some benchmark between kernels and i came out with
this results, for what I can see, 2.2.17 with the ide_patch is faster,
although it uses more CPU, but then again i could be wrong, any ideas?
-Rodrigo

2.4.0-test10-pre3 FS=ext2 AMD v1.2 (viper) IDE DRIVE

Version 1.00d   --Sequential Output-- --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
Ono-Sendai.lin 248M  6713  98 23372  17  7515   9  6814  94 16644  12  40.4 0
--Sequential Create-- RandomCreate
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 30   256  99 + +++  8282 100   258  99 + +++  1005 99
--
2.2.17-ide_patch-reiser_patch FS=ext2
Version 1.00d   
--Sequential Output-- --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
Ono-Sendai.lin 248M  6458  94 23786  19  6662  13  6941  95 17266  11  43.1   0
   --Sequential Create-- Random Create
   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
   30   261  99   636  99  6466  99   259  99   801 100   373  57
--
lspci output:
00:07.1 IDE interface: Advanced Micro Devices [AMD]: Unknown device 7409 (rev 07)
128mb RAM
/dev/hda: (western digital 20gig udma(66))

 Model=WDC WD205AA, FwRev=05.05B05, SerialNo=WD-WMA0W1681543
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40
 BuffType=3(DualPortCache), BuffSize=2048kB, MaxMultSect=16, MultSect=off
 DblWordIO=no, maxPIO=2(fast), DMA=yes, maxDMA=0(slow)
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes
 LBA CHS=1023/256/63 Remapping, LBA=yes, LBAsects=40079088
 tDMA={min:120,rec:120}, DMA modes: mword0 mword1 mword2
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, PIO modes: mode3 mode4
 UDMA modes: mode0 mode1 mode2

-
Webmaster of http://www.securityportal.com.ar
[EMAIL PROTECTED] 
 /"\
 \ / ASCII Ribbon Campaign
  X  Against HTML Mail
 / \
  Proud member of http://www.undersec.com
-


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: test10-pre3

2000-10-21 Thread Rik van Riel

On 16 Oct 2000, H. Peter Anvin wrote:
> Followup to:  <[EMAIL PROTECTED]>
> By author:[EMAIL PROTECTED]
> In newsgroup: linux.dev.kernel
> > 
> > > but intel refuse to guarantee they wont produce an actual '786' class
> > > CPU.
> > 
> > Worry about that if/when it happens ?
> 
> Dare one guess the 786 is actually the Itanic in x86 mode?

In that case I guess we won't have to worry about it
happening ;)

*runs like hell*

cheers,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



2.4.0-test10-pre3:mkdir() returns -EIO when ext2 filesystem full

2000-10-21 Thread Russell King

Hi,

Something weird is going on with VFS return codes with kernel
2.4.0-test10-pre3:

[root@sturm glibc-2.1.92]# df /var/tmp
Filesystem   1k-blocks  Used Available Use% Mounted on
/dev/hda3   127383127383 0 100% /var
[root@sturm glibc-2.1.92]# df -i /var/tmp
FilesystemInodes   IUsed   IFree IUse% Mounted on
/dev/hda3  329122936   299769% /var
[root@sturm glibc-2.1.92]# mkdir /var/tmp/tst
mkdir: cannot create directory `/var/tmp/tst': Input/output error
[root@sturm glibc-2.1.92]#

Just a guess, but I think this is caused by:

dir_block = ext2_bread (inode, 0, 1, );
if (!dir_block) {
inode->i_nlink--; /* is this nlink == 0? */
mark_inode_dirty(inode);
iput (inode);
return -EIO;
}

in fs/ext2/namei.c:ext2_mkdir().  Note that I can't find any cases where
ext2_alloc_block() would actually give a ENOSPC return.

Note (again) the dodgy return error by reference, and then we totally
ignore 'err', so even if ext2_alloc_block() was fixed, it wouldn't
matter unless this and probably other places were also fixed to return
err instead of EIO.
   _
  |_| - ---+---+-
  |   | Russell King[EMAIL PROTECTED]  --- ---
  | | | | http://www.arm.linux.org.uk/personal/aboutme.html   /  /  |
  | +-+-+ --- -+-
  /   |   THE developer of ARM Linux  |+| /|\
 /  | | | ---  |
+-+-+ -  /\\\  |
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: test10-pre3

2000-10-21 Thread Rik van Riel

On 16 Oct 2000, H. Peter Anvin wrote:
 Followup to:  [EMAIL PROTECTED]
 By author:[EMAIL PROTECTED]
 In newsgroup: linux.dev.kernel
  
   but intel refuse to guarantee they wont produce an actual '786' class
   CPU.
  
  Worry about that if/when it happens ?
 
 Dare one guess the 786 is actually the Itanic in x86 mode?

In that case I guess we won't have to worry about it
happening ;)

*runs like hell*

cheers,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



HD Benchmark 2.2.17(ide_patch)/2.4.0-test10-pre3 AMD v1.2 ide driver

2000-10-21 Thread bruj0

Greetings all:
I was doing some benchmark between kernels and i came out with
this results, for what I can see, 2.2.17 with the ide_patch is faster,
although it uses more CPU, but then again i could be wrong, any ideas?
-Rodrigo

2.4.0-test10-pre3 FS=ext2 AMD v1.2 (viper) IDE DRIVE

Version 1.00d   --Sequential Output-- --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
Ono-Sendai.lin 248M  6713  98 23372  17  7515   9  6814  94 16644  12  40.4 0
--Sequential Create-- RandomCreate
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 30   256  99 + +++  8282 100   258  99 + +++  1005 99
--
2.2.17-ide_patch-reiser_patch FS=ext2
Version 1.00d   
--Sequential Output-- --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
Ono-Sendai.lin 248M  6458  94 23786  19  6662  13  6941  95 17266  11  43.1   0
   --Sequential Create-- Random Create
   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
   30   261  99   636  99  6466  99   259  99   801 100   373  57
--
lspci output:
00:07.1 IDE interface: Advanced Micro Devices [AMD]: Unknown device 7409 (rev 07)
128mb RAM
/dev/hda: (western digital 20gig udma(66))

 Model=WDC WD205AA, FwRev=05.05B05, SerialNo=WD-WMA0W1681543
 Config={ HardSect NotMFM HdSw15uSec SpinMotCtl Fixed DTR5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40
 BuffType=3(DualPortCache), BuffSize=2048kB, MaxMultSect=16, MultSect=off
 DblWordIO=no, maxPIO=2(fast), DMA=yes, maxDMA=0(slow)
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes
 LBA CHS=1023/256/63 Remapping, LBA=yes, LBAsects=40079088
 tDMA={min:120,rec:120}, DMA modes: mword0 mword1 mword2
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, PIO modes: mode3 mode4
 UDMA modes: mode0 mode1 mode2

-
Webmaster of http://www.securityportal.com.ar
[EMAIL PROTECTED] 
 /"\
 \ / ASCII Ribbon Campaign
  X  Against HTML Mail
 / \
  Proud member of http://www.undersec.com
-


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Alexander Viro



On Thu, 19 Oct 2000, Linus Torvalds wrote:

> You'd have to do something like
> 
>   LockPage(page); /* Nobody gets to write to this page (except 
>through mmaps, ugh) */
>   gather_all_mmap_users(page);/* THIS is the nasty one */

Wait a second. invalidate_inode_pages() has no idea of range, right? Finding
all VMAs of shared mappings for given inode is not a big deal. Sure,
repeating it for each bloody page would be painful at extreme, but
"make sure that every access to address within _that_ VMA will result in
pagefault" looks like a reasonable operation. Basically, we have
two sides - pagecache and many VMAs. And loop over VMAs with per-VMA
operations sounds more reasonable than loop over pages. We _can_ block
pageins without messing with pages themselves (it starts with finding VMA,
after all) and we can block new shared mappings (not a big deal).

Comments? Basically, I propose per-VMA rw-semaphore taken on page-in for
read and on that "flush and make sure we'll reread" operation for write +
rw-sem on i_mmap_shared. We could even make invalidation work for less
than whole file, all we need for that is skipping the VMAs out of the
range. Ingo, Linus?

>   nfs_wb_page(page);  /* force write-back on this page */
>   ClearPageUptodate(page);/* mark it not up-to-date to force a read-in 
>next time */
>   UnlockPage(page);   /* Ok, now the client can go wild */
> 
> where everything but the "gather_all_mmap_users()" part is fairly
> straightforward.  The "gather" phase is nasty - it would need to figure
> out every place the page is mapped, make sure those are synchronized (ie
> something like marking the page table entry write-protected and causing a
> TLB invalidate SMP cross-call - at which point the resulting page fault
> and the page lock will catch anybody who tries to write to the page)..
 
> In no case could you do something like what the current
> invalidate_inode_pages() does, which is to just try to drop the page from
> the cache - that really only works if we're the only user of that page,
> which the "page_count != 1" test now enforces.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Trond Myklebust

> " " == Alexander Viro <[EMAIL PROTECTED]> writes:

 > Again, consider the case when two processes share the
 > mapping. Process A has page faulted in. Page is
 > invalidated. Process B tries to access the same page. If you
 > leave it in page tables of A you _MUST_ leave it in
 > cache. Period. Otherwise A and B will have different instances
 > of the page.

Even so, you want to reread it. When I said automatically msync(), I
meant 'schedule a write of what has changed and then do whatever you
need to do to get the page reread'. (You explicitly asked for
semantics, not an implementation)

 > And rereading the thing might be tolerable _only_ if there is
 > another client that had changed the file.  Even if you msync()
 > everything, you have to deal with plain and boring memory
 > modifications done by a process that did that bloody mmap(). If
 > they happen while you are reading the data from server - too
 > fscking bad, you'ld better have a good excuse for destroying
 > the data. write() from another client _is_ a good excuse. But
 > from my reading of fs/nfs/* it looks like we do that (cache
 > invalidation) left, right and center in cases that have nothing
 > to another clients.

That's because under NFS you don't have a cache consistency
protocol. Nothing tells you that the file/directory has changed and
that you have to resync your cache. Instead, you have to infer it from
the fact that some operation has returned file attributes that are
screwy.
In addition you may want to force a reread, because some operation
just changed a directory on the server, and you don't know what else
changed beforehand.

 > IOW, I think that invalidate_inode_pages() is bogus. There is
 > only one situation when we have a right to remove page from
 > pagecache - when it is not mapped anywhere.

The issue is not about removing pages. It's about forcing a reread of
the cached data from the server. Removing the actual pages from the
cache has so far been the only race-free method for doing this (since
pre-2.2.x at least) while ensuring that at least generic 'read',
'readdir' and 'write' work as expected.
Yes it screws up mmap() and should be fixed but without breaking what
little that works please.

As for simply settling for a self-consistent mmap() rather than
tackling the problem of rereading; the main crime is that you're
rendering file locking unusable.
Locking is the case in which you have to issue a guarantee that the
cache is consistent between client and server within the area covered
by the lock. In all other cases you *could* get away with the partial
cache invalidation implementation by arguing that there no consistency
guarantees inherent in the protocol.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Alexander Viro



On Fri, 20 Oct 2000, Trond Myklebust wrote:

> For the general case of the page cache I think we can keep them quite
> simple:
> 
> + We do in any case want to drop all pages that are unreferenced. (The
> reason for flushing may be that the file size has changed.)
> 
> + For pages that are referenced (and unlocked) we would like to force
> them to get read in anew ASAP. How this is done in practice is
> irrelevant as far as NFS is concerned provided that we don't sleep on
> any I/O while in nfs_zap_caches()/invalidate_inode_pages().
> 
> The lower level stuff can and will sort out the business of flushing
> out pending writebacks that conflict with the read, so that isn't a
> problem for the VFS/VM.
> 
> The problem lies with writes that haven't yet been msync()ed (and
> hence do not have writebacks). For shared mappings, one should perhaps
> schedule an automatic msync() of the dirty pages (???). For private
> mappings, perhaps the best thing would be to defer the read?

Again, consider the case when two processes share the mapping. Process A
has page faulted in. Page is invalidated. Process B tries to access the
same page. If you leave it in page tables of A you _MUST_ leave it in
cache. Period. Otherwise A and B will have different instances of the
page.

It's not about writebacks. If you map something with MAP_SHARED and
fork() afterwards you _MUST_ have the same data at the address returned by
mmap() until one of the processes unmaps the thing.

And rereading the thing might be tolerable _only_ if there is another
client that had changed the file.  Even if you msync() everything, you
have to deal with plain and boring memory modifications done by a process
that did that bloody mmap(). If they happen while you are reading the data
from server - too fscking bad, you'ld better have a good excuse for
destroying the data. write() from another client _is_ a good
excuse. But from my reading of fs/nfs/* it looks like we do that (cache
invalidation) left, right and center in cases that have nothing to another
clients.

IOW, I think that invalidate_inode_pages() is bogus. There is only one
situation when we have a right to remove page from pagecache - when it is
not mapped anywhere.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Alexander Viro



On 20 Oct 2000, Trond Myklebust wrote:

> > " " == Russell King <[EMAIL PROTECTED]> writes:
> 
>  > invalidate_inode_pages nfs_zap_caches nfs_lock fcntl_setlk
>  > do_fcntl sys_fcntl
> 
>  > So I guess that NFS locking is really bad if the region is
>  > mmapped!
> 
> Yep, but that's a symptom, not a cause. We want to be able to run
> invalidate_inode_pages() safely at any moment, since the need can be
> triggered externally (because the server and client page caches
> disagree).

So what exactly do you want it to do when page is mapped by user process?
Should it remain visible or not? What should happen if process writes to
that page?

Trond, I'm not asking about implementation - the question being what
semantics do you want for nfs_zap_caches() wrt user-mapped pages.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Alexander Viro



On 20 Oct 2000, Trond Myklebust wrote:

> Under NFS the problem is that pages can (and *should*) be invalidated
> despite there being pending write backs. The server can trigger the
> need for a cache invalidation at any time.

OK, so what should happen if user does mmap() on NFS file, dirties the
page and server tells that page should be invalidated? Should we still see
that page in process' address space? Silently making it anonymous is the
worst thing possible - it's still accessible through process page tables
and you (a) still see the (allegedly) invalidated data and (b) have no way
in hell to get the cache coherency between processes. I.e. you are in for
situations when we do mmap(), fork(), write data in child and attempt to
read from that address in parent and child yield different
results. Moreover, future attempts to modify the data by either of them
will be invisible in another.

IOW, what do you really want from the invalidation? It looks like the
right thing _might_ be "remove from all page tables", but that can become
really interesting with mlock() and friends.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Alexander Viro



On Fri, 20 Oct 2000, Roger Larsson wrote:

> Is it legal/good practice to unmap the file after closing it?
Yes.

> (Since the sharing needs the fd to mmap it)

It doesn't. Mapping needs struct file * and it doesn't care about
fd. mmap() takes a reference to struct file by fd you've passed and after
that we can forget about descriptors - vma_struct holds a reference to
file and that's it.

> Successful unlinking a file should probably free pages directly to
> free list - might be worth optimizing for.

_Definitely_ no. Hell, unlink() doesn't mean that anything happens with
data - unlink an opened file and if that affects the read/write/lseek
you've found a bug. And mapping is equivalent to having opened descriptors
in that respect.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Alexander Viro



On Fri, 20 Oct 2000, Trond Myklebust wrote:

> > " " == David S Miller <[EMAIL PROTECTED]> writes:
> 
>  > Actually, judging by the trace you provided Russell, I'd say
>  > this is some peculiarity with NFS silly rename handling, and
>  > it'd be best to look for the bug in that code (early inode
>  > reference loss, for example?)
> 
> Russel's trace indicates that the unlink actually has completed and
> has become a negative dentry since the file is labelled '(deleted)'.

No, it doesn't. Proof:

exec 3>/tmp/foo
rm /tmp/foo
ls -l /proc/$$/fd/3

and dentry is _not_ negative.

> That means that the dentry count must have been zero, so that
> dentry_iput() was called.

It doesn't and it wasn't.

> I don't see how dentry_iput() can be called on an open file. In
> principle the dentry count should always be >= 1, so unless there is
> some place where we're calling d_delete() without get()ing the dentry
> first, there should be no path for early inode loss.

d_delete() will not make dentry negative if there are other
users. However, it will make it unhashed in such case, ergo
(deleted) thing.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Linus Torvalds


[ Final comment, and then I'll shut up ]

On Thu, 19 Oct 2000, Linus Torvalds wrote:
> 
> You'd have to do something like
> 
>   LockPage(page); /* Nobody gets to write to this page (except 
>through mmaps, ugh) */
>   gather_all_mmap_users(page);/* THIS is the nasty one */
>   nfs_wb_page(page);  /* force write-back on this page */
>   ClearPageUptodate(page);/* mark it not up-to-date to force a read-in 
>next time */
>   UnlockPage(page);   /* Ok, now the client can go wild */

Note that one approach would be to just make invalidate_inode_pages() do
exactly the above.

Now, the gather_all_mmap_users() part is definitely a post-2.4.x thing,
and we can't do nfs_wb_page() in invalidate_inode_page() either, but what
we CAN do is to clear the Uptodate flag (and we do need the page lock to
do so).

The advantage of clearing the uptodate flag (as opposed to doing what we
do now - dropping the page altogether) is that there would be no cache
aliasing issues, and there would be no issues with a page and its
associated data just "disappearing" from under somebody. It would cause
the page to be read in again the next time it is faulted in or somebody
does a read() on it, and that's exactly what we want.

However, we _do_ need to WB the page data some way - but the decision on
whether to invalidate write-backs or finish them would have to be up to
the low-level filesystem.

(We should _not_ allow people to read in the page, and leave some stale
write-back information there, so that we'd write back stuff that we just
read because somebody else noticed that the page was not up-to-date. That
would be just an endless source of (a) confusion and (b) unnecessary
network traffic.)

We could, of course, have some combination of the two: if there is only
one user (us), we can just drop the page, otherwise we can mark it
non-up-to-date to force future readers to re-validate the dang thing.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Linus Torvalds



On Thu, 19 Oct 2000, Linus Torvalds wrote:
> 
> I'm saying that we're much much better off guaranteeing local consistency
> over knowingly breaking local consistency over a uncertain global
> consistency issue. Especially as NFS has never guaranteed global
> consistency in the first place, and was not designed to do that.

Btw, if you really worry about NFS v3 and want to add consistency
guarantees to NFS v3, you can probably do that. It's just that you cannot
do it with a simple cache invalidation - you'd need to be much much more
careful (the same way a simple CPU cache invalidate does not guarantee SMP
cache consistency - it only guarantees data corruption due to missed
writes).

I don't know what the best way to do a true cache consistency protocol on
a page leve would be, especially as you will have a _really_ hard time
getting hold of an exclusive pointer to a page with multiple mmap's, but
you could probably get something that guarantees cache consistency as long
as people do _not_ expect mmap to always be 100% consistent.

(The reason you cannot get mmap consistency is just that mmap doesn't have
any kernel synchronization point unless you're willing to shoot down every
single mapping - which is expensive as hell, but doable).

You'd have to do something like

LockPage(page); /* Nobody gets to write to this page (except 
through mmaps, ugh) */
gather_all_mmap_users(page);/* THIS is the nasty one */
nfs_wb_page(page);  /* force write-back on this page */
ClearPageUptodate(page);/* mark it not up-to-date to force a read-in 
next time */
UnlockPage(page);   /* Ok, now the client can go wild */

where everything but the "gather_all_mmap_users()" part is fairly
straightforward.  The "gather" phase is nasty - it would need to figure
out every place the page is mapped, make sure those are synchronized (ie
something like marking the page table entry write-protected and causing a
TLB invalidate SMP cross-call - at which point the resulting page fault
and the page lock will catch anybody who tries to write to the page)..

In no case could you do something like what the current
invalidate_inode_pages() does, which is to just try to drop the page from
the cache - that really only works if we're the only user of that page,
which the "page_count != 1" test now enforces.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Trond Myklebust

> " " == Linus Torvalds <[EMAIL PROTECTED]> writes:

 > which is really really bad, because now you have the case that
 > you have 'n' copies of the same page in memory, with 'n' users,
 > out of which 'n-1' users have the wrong page. And those 'n-1'
 > users don't even have any way of _knowing_ that they have the
 > wrong page.

 > Which is why we MUST NOT drop a page that has users. Really.

 > I'm telling you that cases #4 and #5 are _much_ worse than your
 > "solution" to case #2. And you argue that your solution is good
 > only because you're completely ignoring cases #4 and #5.

No. I'm arguing (at 4:40am and while trying to keep one eye on our
detector's data acquisition) on the basis that whoever holds the file
lock has to have a guarantee of obtaining 100% accuracy on the locked
region.

I agree that dropping pages is ugly and that it will always give
problems with shared mmap(), so if it can be shown that clearing
PG_uptodate and rereading the same page will give the required
guarantee on locking, then I'm not going to complain.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Linus Torvalds



On 20 Oct 2000, Trond Myklebust wrote:
> 
> The problem here is that NFS pages have 3 rather than 2 states:
>   1) mmapped & correct.
>   2) mmapped & incorrect. (but possibly dirty)
>   3) Unmapped
> 
> For case 1), we clearly want to have the page in inode->i_mapping.
> For cases 2) & 3) we don't.

I think you're WRONG, WRONG, WRONG.

Your case #2 is not at all a state.

It's "mapped, possibly dirty, and locally correct. But _maybe_ globally
incorrect".

If you remove the page in case #2 from the hashing, you will have case #4,
which you're ignoring:

   4) Totally, and utterly incorrect. Without any way for the application
  to even _know_ that it's incorrect.

Note that case 4 is accompanied by case #5, later on:

   5) two separate pages both mapped, one hashed and up-to-date, the other
  mapped in other processes but incorrect.

which is really really bad, because now you have the case that you have
'n' copies of the same page in memory, with 'n' users, out of which 'n-1'
users have the wrong page. And those 'n-1' users don't even have any way
of _knowing_ that they have the wrong page.

Which is why we MUST NOT drop a page that has users. Really.

I'm telling you that cases #4 and #5 are _much_ worse than your "solution"
to case #2. And you argue that your solution is good only because you're
completely ignoring cases #4 and #5.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Linus Torvalds



On 20 Oct 2000, Trond Myklebust wrote:
> 
> Under NFS the problem is that pages can (and *should*) be invalidated
> despite there being pending write backs. The server can trigger the
> need for a cache invalidation at any time.
> The existence of file locks that aren't page aligned, as well as
> partial page writeback ensure that we cannot make the equivalence
>page has pending write == page is correct.

Note that this is not what my patch really says at all.

My patch says "ok, we should invalidate this page, because somebody thinks
it may be bad. HOWEVER, we cannot reasonably do that, because we have
users of the page (which may be completely unrelated to pending writes),
and invalidating this page is _certain_ to cause corruption".

Basically think of the case of somebody having a shared mmap over NFS. He
has dirty data in his pages, and he expects those to be written out to the
server. There is NO QUESTION about this fact.

Imagine somebody (possibly even the same person) then releasing or getting
a file lock, possibly in some other part of the file. 

Without that added test, all those dirty local pages just got completely
thrown away. 

Now, you have a choice: you can KNOW that you're doing something horribly
and utterly wrong (throwing away data), or you can SUSPECT that you might
cause non-coherency between the client and the server. You cannot get
both.

I'm saying that we're much much better off guaranteeing local consistency
over knowingly breaking local consistency over a uncertain global
consistency issue. Especially as NFS has never guaranteed global
consistency in the first place, and was not designed to do that.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Trond Myklebust

> " " == Linus Torvalds <[EMAIL PROTECTED]> writes:


 > The advantage of clearing the uptodate flag (as opposed to
 > doing what we do now - dropping the page altogether) is that
 > there would be no cache aliasing issues, and there would be no
 > issues with a page and its associated data just "disappearing"
 > from under somebody. It would cause the page to be read in
 > again the next time it is faulted in or somebody does a read()
 > on it, and that's exactly what we want.

I've proposed this in the past, but there you refused on the grounds
that it breaks assumptions in the VFS. I'm otherwise happy with this
sort of thing. It ensures both shared mmap() and cache consistency.

The only problem I can think of would be for dirty pages that haven't
yet been msync()ed, however I'm far from being competent to evaluate
any side-effects on the VM.

 > However, we _do_ need to WB the page data some way - but the
 > decision on whether to invalidate write-backs or finish them
 > would have to be up to the low-level filesystem.

This is already in place.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Trond Myklebust

> " " == Linus Torvalds <[EMAIL PROTECTED]> writes:

 > Btw, that "invalidate_inode_pages()" thing is just wrong - we
 > can't just remove pages that are mapped etc, because that would
 > result in no end of fun aliasing problems etc.


 > How about adding a test in invalidate_inode_pages() like

 >  /* We cannot invalidate a locked page */ if
 >  (TryLockPage(page))
 >  continue;

 > + /* We cannot invalidate a page that is in use */
 > + if (page_count(page) != 1) {
 > + UnlockPage(page);
 > + continue;
 > + }
 > +
 >  __lru_cache_del(page); __remove_inode_page(page);

The problem here is that NFS pages have 3 rather than 2 states:
  1) mmapped & correct.
  2) mmapped & incorrect. (but possibly dirty)
  3) Unmapped

For case 1), we clearly want to have the page in inode->i_mapping.
For cases 2) & 3) we don't.

However for case 2) we still have a weak association to the inode
itself, and we want to be able to reference inode metadata etc.  Would
it make sense then to remove these pages from i_mapping, but to hang
them onto a new struct address_space (call it i_unmapped for want of a
better name)?

That would allow you to keep a consistent state for the page, while
still allowing you to 'invalidate' it (by removing it from the
i_mapping) and hence maintain a consistent cache.

invalidate_inode_pages() would then reduce to

   remove_page_from_inode_queue(page);
   remove_page_from_hash_queue(page);
   if (page_count(page))
 add_page_to_inode_unmapped(page);

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Russell King

Alexander Viro writes:
> Trond, I'm not asking about implementation - the question being what
> semantics do you want for nfs_zap_caches() wrt user-mapped pages.

Ok, looking through sendmail, and then db2, the situation is
created by the db2 library.  If the process does the following:

1. creates NFS file
2. sets file size by lseek + write
3. maps it
4. fcntl locks the file (writes above data back to file)
5. write to the mapping (makes pages dirty)
6. fcntl unlocks it (dirty page data is NOT written back)
7. closes it
8. unlinks it
9. some time later unmaps it(causes dirty data to be written back)

Note that (6) doesn't act as a barrier to synchronise writes in this case,
but it does for any normal write()s.  Surely NFS should cause any dirty
data associated with the file to be written back to the server no matter
what?

Although Linus' fix seems to prevent the problem, I get the feeling that it
is a sticky plaster over a much bigger problem.
   _
  |_| - ---+---+-
  |   | Russell King[EMAIL PROTECTED]  --- ---
  | | | | http://www.arm.linux.org.uk/personal/aboutme.html   /  /  |
  | +-+-+ --- -+-
  /   |   THE developer of ARM Linux  |+| /|\
 /  | | | ---  |
+-+-+ -  /\\\  |
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Petr Vandrovec

On 19 Oct 00 at 16:32, Linus Torvalds wrote:

> How about adding a test in invalidate_inode_pages() like
> 
> /* We cannot invalidate a locked page */
> if (TryLockPage(page))
> continue;
> 
> +   /* We cannot invalidate a page that is in use */
> +   if (page_count(page) != 1) {
> +   UnlockPage(page);
> +   continue;
> +   }
> +
> __lru_cache_del(page);
> __remove_inode_page(page);
 
Hi Linus,
  this did not fix problem with my testcase - 
mmap(shared),fork,dirty_it,ftruncate,finish - does NOT go through
this code (invalidate_inode_pages) at all (should it?).

  I instrumented both swapout in page_io, and __remove_inode_page,
and pages are going through __remove_inode_page with strange use
count. I do not know whether it is good or no... During normal life
of system, pages with count==2 && index==0 are passed to __remove_inode_page.
As soon as I start my test, non-zero index pages with count > 2 are
passed here. Later this pages appear in filemap_sync...
Thanks,
Petr Vandrovec
[EMAIL PROTECTED]

vana:~# free
 total   used   free sharedbuffers cached
Mem:255768 203232  52536  0   4084  35496
-/+ buffers/cache: 163652  92116
Swap:   530136  0 530136
vana:~# free
 total   used   free sharedbuffers cached
Mem:255768 203376  52392  0   4084  35636
-/+ buffers/cache: 163656  92112
Swap:   530136  0 530136

vana:~# free
 total   used   free sharedbuffers cached
Mem:255768  72172 183596  0   4084  32888
-/+ buffers/cache:  35200 220568
Swap:   530136  0 530136
vana:~# free
 total   used   free sharedbuffers cached
Mem:255768  72180 183588  0   4084  32888
-/+ buffers/cache:  35208 220560
Swap:   530136  0 530136


11:57:44 Bad page count in __remove_inode_page!
11:57:44   page: c1349ef8
11:57:44 mapping:  cc65c6dc
11:57:44 index:0
11:57:44 nexthash: 
11:57:44 count:2
11:57:44 flags:0x0009
11:57:44 age:  0
11:57:44 pprevhash: 
11:57:44 buffers:  
11:57:44 virtual:  cc61a000
11:57:44 zone: c021f578
11:57:44 pagedump done
11:57:44 Bad page count in __remove_inode_page!
11:57:44   page: c1120394
11:57:44 mapping:  cc6b2edc
11:57:44 index:32767
11:57:44 nexthash: 
11:57:44 count:3
11:57:44 flags:0x0009
11:57:44 age:  5
11:57:44 pprevhash: 
11:57:44 buffers:  
11:57:44 virtual:  c43d1000
11:57:44 zone: c021f578
11:57:44 pagedump done

11:57:46 Bad page count in __remove_inode_page!
11:57:46   page: c1266298
11:57:46 mapping:  cc6b2edc
11:57:46 index:13635
11:57:46 nexthash: 
11:57:46 count:3
11:57:46 flags:0x0009
11:57:46 age:  5
11:57:46 pprevhash: 
11:57:46 buffers:  
11:57:46 virtual:  c9082000
11:57:46 zone: c021f578
11:57:46 pagedump done

11:57:46 page->mapping == NULL
11:57:46 error:  -22
11:57:46 ptep:   c92f6a2c
11:57:46 pteval: 09062027
11:57:46 vma:ce926420
11:57:46   vm_mm:ccd67d60
11:57:46   vm_start: 40128000
11:57:46   vm_end:   48128000
11:57:46   vm_next:  c1490b20
11:57:46   vm_avl_height:  1
11:57:46   vm_avl_left:
11:57:46   vm_avl_right:   
11:57:46   vm_next_share:  
11:57:46   vm_pprev_share: ccdee684
11:57:46   vm_operations_struct: c021f1a0
11:57:46   vm_pgoff:  
11:57:46   vm_file:   cd4270e0
11:57:46   vm_raend:  
11:57:46   vm_private_data:  
11:57:46 address:4368B000
11:57:46 flags:  0001
11:57:46   file: cd4270e0
11:57:46 dentry:   cc8ae440
11:57:46   inode: cc6b2e40
11:57:46   num: 848677
11:57:46   dev: 0x0302
11:57:46   path: /usr/src/tst/ram0 (deleted)
11:57:46 vfsmount: c14ca8c0
11:57:46 op:   c0221bc0
11:57:46 count:4
11:57:46 flags:0x0002
11:57:46 mode: 03
11:57:46 pos:  0
11:57:46 reada:0
11:57:46 ramax:0
11:57:46 raend:0
11:57:46 ralen:0
11:57:46 rawin:0
11:57:46 owner.pid: 0
11:57:46 owner.uid: 0
11:57:46 owner.euid: 0
11:57:46 owner.signum: 0
11:57:46 uid:  0
11:57:46 gid:  0
11:57:46 error:0
11:57:46 version:  11641
11:57:46 private_data: 
11:57:46   page: c1265a18
11:57:46 mapping:  
11:57:46 index:   

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Trond Myklebust

> " " == Alexander Viro <[EMAIL PROTECTED]> writes:

 > So what exactly do you want it to do when page is mapped by
 > user process?  Should it remain visible or not? What should
 > happen if process writes to that page?

 > Trond, I'm not asking about implementation - the question being
 > what semantics do you want for nfs_zap_caches() wrt user-mapped
 > pages.

For the general case of the page cache I think we can keep them quite
simple:

+ We do in any case want to drop all pages that are unreferenced. (The
reason for flushing may be that the file size has changed.)

+ For pages that are referenced (and unlocked) we would like to force
them to get read in anew ASAP. How this is done in practice is
irrelevant as far as NFS is concerned provided that we don't sleep on
any I/O while in nfs_zap_caches()/invalidate_inode_pages().

The lower level stuff can and will sort out the business of flushing
out pending writebacks that conflict with the read, so that isn't a
problem for the VFS/VM.

The problem lies with writes that haven't yet been msync()ed (and
hence do not have writebacks). For shared mappings, one should perhaps
schedule an automatic msync() of the dirty pages (???). For private
mappings, perhaps the best thing would be to defer the read?

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Linus Torvalds



On Fri, 20 Oct 2000, Trond Myklebust wrote:
> 
> The problem lies with writes that haven't yet been msync()ed (and
> hence do not have writebacks). For shared mappings, one should perhaps
> schedule an automatic msync() of the dirty pages (???). For private
> mappings, perhaps the best thing would be to defer the read?

Note that NONE of this is going to happen for 2.4.x.

We've never _ever_ done this before, there's no point in even suggesting
that this is suddenly a "critical" bug. It's not.

I want to know what the suggestion for 2.4.x is. Right now that's the "if
the count is elevated, we don't invalidate". 

Quite frankly, I don't see any other option. Doing the !Uptodate version
will lose local data as it stands now - in fact right now you'd lose data
that way even if you are the only client accessing the file, which is
obviously complete crap and _completely_ unacceptable.

I'm open to suggestions, but I haven't heard anything realistic.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Linus Torvalds



On Fri, 20 Oct 2000, Trond Myklebust wrote:
 
 The problem lies with writes that haven't yet been msync()ed (and
 hence do not have writebacks). For shared mappings, one should perhaps
 schedule an automatic msync() of the dirty pages (???). For private
 mappings, perhaps the best thing would be to defer the read?

Note that NONE of this is going to happen for 2.4.x.

We've never _ever_ done this before, there's no point in even suggesting
that this is suddenly a "critical" bug. It's not.

I want to know what the suggestion for 2.4.x is. Right now that's the "if
the count is elevated, we don't invalidate". 

Quite frankly, I don't see any other option. Doing the !Uptodate version
will lose local data as it stands now - in fact right now you'd lose data
that way even if you are the only client accessing the file, which is
obviously complete crap and _completely_ unacceptable.

I'm open to suggestions, but I haven't heard anything realistic.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Trond Myklebust

 " " == Alexander Viro [EMAIL PROTECTED] writes:

  So what exactly do you want it to do when page is mapped by
  user process?  Should it remain visible or not? What should
  happen if process writes to that page?

  Trond, I'm not asking about implementation - the question being
  what semantics do you want for nfs_zap_caches() wrt user-mapped
  pages.

For the general case of the page cache I think we can keep them quite
simple:

+ We do in any case want to drop all pages that are unreferenced. (The
reason for flushing may be that the file size has changed.)

+ For pages that are referenced (and unlocked) we would like to force
them to get read in anew ASAP. How this is done in practice is
irrelevant as far as NFS is concerned provided that we don't sleep on
any I/O while in nfs_zap_caches()/invalidate_inode_pages().

The lower level stuff can and will sort out the business of flushing
out pending writebacks that conflict with the read, so that isn't a
problem for the VFS/VM.

The problem lies with writes that haven't yet been msync()ed (and
hence do not have writebacks). For shared mappings, one should perhaps
schedule an automatic msync() of the dirty pages (???). For private
mappings, perhaps the best thing would be to defer the read?

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Petr Vandrovec

On 19 Oct 00 at 16:32, Linus Torvalds wrote:

 How about adding a test in invalidate_inode_pages() like
 
 /* We cannot invalidate a locked page */
 if (TryLockPage(page))
 continue;
 
 +   /* We cannot invalidate a page that is in use */
 +   if (page_count(page) != 1) {
 +   UnlockPage(page);
 +   continue;
 +   }
 +
 __lru_cache_del(page);
 __remove_inode_page(page);
 
Hi Linus,
  this did not fix problem with my testcase - 
mmap(shared),fork,dirty_it,ftruncate,finish - does NOT go through
this code (invalidate_inode_pages) at all (should it?).

  I instrumented both swapout in page_io, and __remove_inode_page,
and pages are going through __remove_inode_page with strange use
count. I do not know whether it is good or no... During normal life
of system, pages with count==2  index==0 are passed to __remove_inode_page.
As soon as I start my test, non-zero index pages with count  2 are
passed here. Later this pages appear in filemap_sync...
Thanks,
Petr Vandrovec
[EMAIL PROTECTED]

vana:~# free
 total   used   free sharedbuffers cached
Mem:255768 203232  52536  0   4084  35496
-/+ buffers/cache: 163652  92116
Swap:   530136  0 530136
vana:~# free
 total   used   free sharedbuffers cached
Mem:255768 203376  52392  0   4084  35636
-/+ buffers/cache: 163656  92112
Swap:   530136  0 530136
here testcode finished
vana:~# free
 total   used   free sharedbuffers cached
Mem:255768  72172 183596  0   4084  32888
-/+ buffers/cache:  35200 220568
Swap:   530136  0 530136
vana:~# free
 total   used   free sharedbuffers cached
Mem:255768  72180 183588  0   4084  32888
-/+ buffers/cache:  35208 220560
Swap:   530136  0 530136


11:57:44 Bad page count in __remove_inode_page!
11:57:44   page: c1349ef8
11:57:44 mapping:  cc65c6dc
11:57:44 index:0
11:57:44 nexthash: 
11:57:44 count:2
11:57:44 flags:0x0009
11:57:44 age:  0
11:57:44 pprevhash: 
11:57:44 buffers:  
11:57:44 virtual:  cc61a000
11:57:44 zone: c021f578
11:57:44 pagedump done
11:57:44 Bad page count in __remove_inode_page!
11:57:44   page: c1120394
11:57:44 mapping:  cc6b2edc
11:57:44 index:32767
11:57:44 nexthash: 
11:57:44 count:3
11:57:44 flags:0x0009
11:57:44 age:  5
11:57:44 pprevhash: 
11:57:44 buffers:  
11:57:44 virtual:  c43d1000
11:57:44 zone: c021f578
11:57:44 pagedump done
snip index going down ...
11:57:46 Bad page count in __remove_inode_page!
11:57:46   page: c1266298
11:57:46 mapping:  cc6b2edc
11:57:46 index:13635
11:57:46 nexthash: 
11:57:46 count:3
11:57:46 flags:0x0009
11:57:46 age:  5
11:57:46 pprevhash: 
11:57:46 buffers:  
11:57:46 virtual:  c9082000
11:57:46 zone: c021f578
11:57:46 pagedump done
missing part of log
11:57:46 page-mapping == NULL
11:57:46 error:  -22
11:57:46 ptep:   c92f6a2c
11:57:46 pteval: 09062027
11:57:46 vma:ce926420
11:57:46   vm_mm:ccd67d60
11:57:46   vm_start: 40128000
11:57:46   vm_end:   48128000
11:57:46   vm_next:  c1490b20
11:57:46   vm_avl_height:  1
11:57:46   vm_avl_left:
11:57:46   vm_avl_right:   
11:57:46   vm_next_share:  
11:57:46   vm_pprev_share: ccdee684
11:57:46   vm_operations_struct: c021f1a0
11:57:46   vm_pgoff:  
11:57:46   vm_file:   cd4270e0
11:57:46   vm_raend:  
11:57:46   vm_private_data:  
11:57:46 address:4368B000
11:57:46 flags:  0001
11:57:46   file: cd4270e0
11:57:46 dentry:   cc8ae440
11:57:46   inode: cc6b2e40
11:57:46   num: 848677
11:57:46   dev: 0x0302
11:57:46   path: /usr/src/tst/ram0 (deleted)
11:57:46 vfsmount: c14ca8c0
11:57:46 op:   c0221bc0
11:57:46 count:4
11:57:46 flags:0x0002
11:57:46 mode: 03
11:57:46 pos:  0
11:57:46 reada:0
11:57:46 ramax:0
11:57:46 raend:0
11:57:46 ralen:0
11:57:46 rawin:0
11:57:46 owner.pid: 0
11:57:46 owner.uid: 0
11:57:46 owner.euid: 0
11:57:46 owner.signum: 0
11:57:46 uid:  0
11:57:46 gid:  0
11:57:46 error:0
11:57:46 version:  11641
11:57:46 private_data: 
11:57:46   page: c1265a18

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Russell King

Alexander Viro writes:
 Trond, I'm not asking about implementation - the question being what
 semantics do you want for nfs_zap_caches() wrt user-mapped pages.

Ok, looking through sendmail, and then db2, the situation is
created by the db2 library.  If the process does the following:

1. creates NFS file
2. sets file size by lseek + write
3. maps it
4. fcntl locks the file (writes above data back to file)
5. write to the mapping (makes pages dirty)
6. fcntl unlocks it (dirty page data is NOT written back)
7. closes it
8. unlinks it
9. some time later unmaps it(causes dirty data to be written back)

Note that (6) doesn't act as a barrier to synchronise writes in this case,
but it does for any normal write()s.  Surely NFS should cause any dirty
data associated with the file to be written back to the server no matter
what?

Although Linus' fix seems to prevent the problem, I get the feeling that it
is a sticky plaster over a much bigger problem.
   _
  |_| - ---+---+-
  |   | Russell King[EMAIL PROTECTED]  --- ---
  | | | | http://www.arm.linux.org.uk/personal/aboutme.html   /  /  |
  | +-+-+ --- -+-
  /   |   THE developer of ARM Linux  |+| /|\
 /  | | | ---  |
+-+-+ -  /\\\  |
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Trond Myklebust

 " " == Linus Torvalds [EMAIL PROTECTED] writes:

  Btw, that "invalidate_inode_pages()" thing is just wrong - we
  can't just remove pages that are mapped etc, because that would
  result in no end of fun aliasing problems etc.

snip
  How about adding a test in invalidate_inode_pages() like

   /* We cannot invalidate a locked page */ if
   (TryLockPage(page))
   continue;

  + /* We cannot invalidate a page that is in use */
  + if (page_count(page) != 1) {
  + UnlockPage(page);
  + continue;
  + }
  +
   __lru_cache_del(page); __remove_inode_page(page);

The problem here is that NFS pages have 3 rather than 2 states:
  1) mmapped  correct.
  2) mmapped  incorrect. (but possibly dirty)
  3) Unmapped

For case 1), we clearly want to have the page in inode-i_mapping.
For cases 2)  3) we don't.

However for case 2) we still have a weak association to the inode
itself, and we want to be able to reference inode metadata etc.  Would
it make sense then to remove these pages from i_mapping, but to hang
them onto a new struct address_space (call it i_unmapped for want of a
better name)?

That would allow you to keep a consistent state for the page, while
still allowing you to 'invalidate' it (by removing it from the
i_mapping) and hence maintain a consistent cache.

invalidate_inode_pages() would then reduce to

   remove_page_from_inode_queue(page);
   remove_page_from_hash_queue(page);
   if (page_count(page))
 add_page_to_inode_unmapped(page);

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Linus Torvalds



On 20 Oct 2000, Trond Myklebust wrote:
 
 Under NFS the problem is that pages can (and *should*) be invalidated
 despite there being pending write backs. The server can trigger the
 need for a cache invalidation at any time.
 The existence of file locks that aren't page aligned, as well as
 partial page writeback ensure that we cannot make the equivalence
page has pending write == page is correct.

Note that this is not what my patch really says at all.

My patch says "ok, we should invalidate this page, because somebody thinks
it may be bad. HOWEVER, we cannot reasonably do that, because we have
users of the page (which may be completely unrelated to pending writes),
and invalidating this page is _certain_ to cause corruption".

Basically think of the case of somebody having a shared mmap over NFS. He
has dirty data in his pages, and he expects those to be written out to the
server. There is NO QUESTION about this fact.

Imagine somebody (possibly even the same person) then releasing or getting
a file lock, possibly in some other part of the file. 

Without that added test, all those dirty local pages just got completely
thrown away. 

Now, you have a choice: you can KNOW that you're doing something horribly
and utterly wrong (throwing away data), or you can SUSPECT that you might
cause non-coherency between the client and the server. You cannot get
both.

I'm saying that we're much much better off guaranteeing local consistency
over knowingly breaking local consistency over a uncertain global
consistency issue. Especially as NFS has never guaranteed global
consistency in the first place, and was not designed to do that.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-20 Thread Linus Torvalds



On 20 Oct 2000, Trond Myklebust wrote:
 
 The problem here is that NFS pages have 3 rather than 2 states:
   1) mmapped  correct.
   2) mmapped  incorrect. (but possibly dirty)
   3) Unmapped
 
 For case 1), we clearly want to have the page in inode-i_mapping.
 For cases 2)  3) we don't.

I think you're WRONG, WRONG, WRONG.

Your case #2 is not at all a state.

It's "mapped, possibly dirty, and locally correct. But _maybe_ globally
incorrect".

If you remove the page in case #2 from the hashing, you will have case #4,
which you're ignoring:

   4) Totally, and utterly incorrect. Without any way for the application
  to even _know_ that it's incorrect.

Note that case 4 is accompanied by case #5, later on:

   5) two separate pages both mapped, one hashed and up-to-date, the other
  mapped in other processes but incorrect.

which is really really bad, because now you have the case that you have
'n' copies of the same page in memory, with 'n' users, out of which 'n-1'
users have the wrong page. And those 'n-1' users don't even have any way
of _knowing_ that they have the wrong page.

Which is why we MUST NOT drop a page that has users. Really.

I'm telling you that cases #4 and #5 are _much_ worse than your "solution"
to case #2. And you argue that your solution is good only because you're
completely ignoring cases #4 and #5.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



  1   2   3   >