metadata operation reordering with regard to crashes

2018-09-14 Thread
Hi, all,

Probably a bit of a complex question:
Do today's practical filesystems, e.g., extX and btrfs, preserve metadata
operation order across a crash/power failure?

What I know is that modern filesystems ensure metadata consistency
after a crash/power failure. Journaling filesystems like extX do that by
write-ahead logging of metadata operations as transactions. Other
filesystems do it in various ways; btrfs, for example, does it by COW.

What I'm not yet clear about is whether these filesystems preserve
metadata operation order across a crash.

For example,
op 1.  rename(A, B)
op 2.  rename(C, D)

As mentioned above, metadata consistency is ensured after a crash.
Thus, B is either the original B (or does not exist) or has been replaced by A.
The same holds for D.

Is it possible that, after a crash, D has been replaced by C while B is still
the original file (or does not exist)?

Or, from the implementation point of view, before the crash:
- in a journaling filesystem,
is the atomic transaction `rename(C, D)` permitted to be written to the
on-disk journal before the transaction `rename(A, B)`?
- in other filesystems, say btrfs,
are the atomic operations `rename(C, D)` and `rename(A, B)` permitted to be
reordered when hitting the disk?

The question matters because many applications do something like this:

if (flag_file_says_need_generate_data) {
    open_write_sync_close(data_tmp);
    rename(data_tmp, data);

    open_write_sync_close(flag_file_tmp, no_need_to_generate_data);
    rename(flag_file_tmp, flag_file);
}
use_data_file();

If, after a crash, the flag file is there but the data is not, that is a problem.
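
If the filesystem gives no ordering promise, the only portable fix I can
see is to force each step to disk before starting the next. Below is a
minimal sketch of that idea; fsync_dir() and commit_rename() are made-up
helper names here, not existing APIs:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Persist a completed rename by fsync()ing the parent directory,
   which is where the rename is actually recorded. */
static int fsync_dir(const char *dirpath)
{
    int fd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;
    int ret = fsync(fd);
    close(fd);
    return ret;
}

/* rename() and do not return until the rename itself is durable. */
static int commit_rename(const char *tmp, const char *newname,
                         const char *dir)
{
    if (rename(tmp, newname) != 0)
        return -1;
    return fsync_dir(dir);
}

If commit_rename(data_tmp, data, dir) completes before flag_file_tmp is
even written, a crash can no longer leave the flag in place without the
data, whatever order the filesystem commits its transactions in.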

Thanks,
Trol


Re: POSIX violation by writeback error

2018-09-06 Thread
On Wed, Sep 5, 2018 at 4:09 PM 焦晓冬  wrote:
>
> On Tue, Sep 4, 2018 at 11:44 PM Jeff Layton  wrote:
> >
> > On Tue, 2018-09-04 at 22:56 +0800, 焦晓冬 wrote:
> > > On Tue, Sep 4, 2018 at 7:09 PM Jeff Layton  wrote:
> > > >
> > > > On Tue, 2018-09-04 at 16:58 +0800, Trol wrote:
> > > > > On Tue, Sep 4, 2018 at 3:53 PM Rogier Wolff  
> > > > > wrote:
> > > > >
> > > > > ...
> > > > > > >
> > > > > > > Jlayton's patch is simple but wonderful idea towards correct error
> > > > > > > reporting. It seems one crucial thing is still here to be fixed. 
> > > > > > > Does
> > > > > > > anyone have some idea?
> > > > > > >
> > > > > > > The crucial thing may be that a read() after a successful
> > > > > > > open()-write()-close() may return old data.
> > > > > > >
> > > > > > > That may happen where an async writeback error occurs after 
> > > > > > > close()
> > > > > > > and the inode/mapping get evicted before read().
> > > > > >
> > > > > > Suppose I have 1Gb of RAM. Suppose I open a file, write 0.5Gb to it
> > > > > > and then close it. Then I repeat this 9 times.
> > > > > >
> > > > > > Now, when writing those files to storage fails, there is 5Gb of data
> > > > > > to remember and only 1Gb of RAM.
> > > > > >
> > > > > > I can choose any part of that 5Gb and try to read it.
> > > > > >
> > > > > > Please make a suggestion about where we should store that data?
> > > > >
> > > > > That is certainly not possible to be done. But at least, shall we 
> > > > > report
> > > > > error on read()? Silently returning wrong data may cause further 
> > > > > damage,
> > > > > such as removing wrong files since it was marked as garbage in the 
> > > > > old file.
> > > > >
> > > >
> > > > Is the data wrong though? You tried to write and then that failed.
> > > > Eventually we want to be able to get at the data that's actually in the
> > > > file -- what is that point?
> > >
> > > The point is silently data corruption is dangerous. I would prefer 
> > > getting an
> > > error back to receive wrong data.
> > >
> >
> > Well, _you_ might like that, but there are whole piles of applications
> > that may fall over completely in this situation. Legacy usage matters
> > here.
> >
> > > A practical and concrete example may be,
> > > A disk cleaner program that first searches for garbage files that won't 
> > > be used
> > > anymore and save the list in a file (open()-write()-close()) and wait for 
> > > the
> > > user to confirm the list of files to be removed.  A writeback error occurs
> > > and the related page/inode/address_space gets evicted while the user is
> > > taking a long thought about it. Finally, the user hits enter and the
> > > cleaner begin
> > > to open() read() the list again. But what gets removed is the old list
> > > of files that
> > > was generated several months ago...
> > >
> > > Another example may be,
> > > An email editor and a busy mail sender. A well written mail to my boss is
> > > composed by this email editor and is saved in a file 
> > > (open()-write()-close()).
> > > The mail sender gets notified with the path of the mail file to queue it 
> > > and
> > > send it later. A writeback error occurs and the related
> > > page/inode/address_space gets evicted while the mail is still waiting in 
> > > the
> > > queue of the mail sender. Finally, the mail file is open() read() by the 
> > > sender,
> > > but what is sent is the mail to my girlfriend that was composed 
> > > yesterday...
> > >
> > > In both cases, the files are not meant to be persisted onto the disk.
> > > So, fsync()
> > > is not likely to be called.
> > >
> >
> > So at what point are you going to give up on keeping the data? The
> > fundamental problem here is an open-ended commitment. We (justifiably)
> > avoid those in kernel development because it might leave the system
> > without a way out of a resource crunch.
> >
> > > >
> > &

Re: POSIX violation by writeback error

2018-09-05 Thread
On Wed, Sep 5, 2018 at 4:04 PM Rogier Wolff  wrote:
>
> On Wed, Sep 05, 2018 at 09:39:58AM +0200, Martin Steigerwald wrote:
> > Rogier Wolff - 05.09.18, 09:08:
> > > So when a mail queuer puts mail in the mailq files and the mail processor
> > > can get them out of there intact, nobody is going to notice.  (I know
> > > mail queuers should call fsync and report errors when that fails, but
> > > there are bound to be applications where calling fsync is not
> > > appropriate (*))
> >
> > AFAIK at least Postfix MDA only reports mail as being accepted over SMTP
> > once fsync() on the mail file completed successfully. And I'd expect
> > every sensible MDA to do this. I don't know how Dovecot MDA, which I
> > currently use for sieve support, does this though.
>

Is every mail editor implementation really going to call fsync()? Why
would they call fsync(), when fsync() is meant to persist the file on
disk, which is apparently unnecessary if the deliver-to-SMTP task
won't start again after a reboot?
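
For reference, the pattern Martin describes, acknowledging the mail only
after fsync() has succeeded, looks roughly like this (a hedged sketch; the
real Postfix code differs, and the "250 OK" plumbing is elided):

#include <sys/types.h>
#include <unistd.h>

/* Sketch only: accept a message, but report success to the SMTP client
   only once the queue file is known to be on stable storage. */
static int accept_message(int qfd, const char *msg, size_t len)
{
    if (write(qfd, msg, len) != (ssize_t)len)
        return -1;      /* tell the client to retry later */
    if (fsync(qfd) != 0)
        return -1;      /* writeback errors surface here, in time to
                           reject the mail instead of losing it */
    return 0;           /* only now reply "250 OK" to the client */
}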

> Yes. That's why I added the remark that mailers will call fsync and know
> about it on the write side. I encountered a situation in the last few
> days where a developer who runs into this while developing would have
> been led to write:
>   /* Calling this fsync causes unacceptable performance */
>   // fsync (fd);
>
> I know of an application somewhere that does realtime-gathering of
> call-records (number X called Y for Z seconds). They come in from a
> variety of sources, get de-duplicated, standardized, and written to
> files. Then different output modules push the data to the different
> consumers within the company. Billing among them.
>
> Now getting old data there would be pretty bad. And calling fsync
> all the time might have performance issues.
>
> That's the situation where "old data is really bad".
>
> But when apt-get upgrade replaces your /bin/sh and gets a write error,
> returning an error on subsequent reads is really bad.

At this point, /bin/sh may be partially old and partially new. Executing
this corrupted binary is also dangerous, though.

>
> It is more difficult than you think.
>
> Roger.
>
> --
> ** r.e.wo...@bitwizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
> **Delftechpark 26 2628 XH  Delft, The Netherlands. KVK: 27239233**
> *-- BitWizard writes Linux device drivers for any device you may have! --*
> The plan was simple, like my brother-in-law Phil. But unlike
> Phil, this plan just might work.


Re: POSIX violation by writeback error

2018-09-05 Thread
On Wed, Sep 5, 2018 at 4:18 AM Jeff Layton  wrote:
>
> On Tue, 2018-09-04 at 14:54 -0400, J. Bruce Fields wrote:
> > On Tue, Sep 04, 2018 at 06:23:48PM +0200, Rogier Wolff wrote:
> > > On Tue, Sep 04, 2018 at 12:12:03PM -0400, J. Bruce Fields wrote:
> > > > Well, I think the point was that in the above examples you'd prefer that
> > > > the read just fail--no need to keep the data.  A bit marking the file
> > > > (or even the entire filesystem) unreadable would satisfy posix, I guess.
> > > > Whether that's practical, I don't know.
> > >
> > > When you would do it like that (mark the whole filesystem as "in
> > > error") things go from bad to worse even faster. The Linux kernel
> > > tries to keep the system up even in the face of errors.
> > >
> > > With that suggestion, having one application run into a writeback
> > > error would effectively crash the whole system because the filesystem
> > > may be the root filesystem and stuff like "sshd" that you need to
> > > diagnose the problem needs to be read from the disk
> >
> > Well, the absolutist position on posix compliance here would be that a
> > crash is still preferable to returning the wrong data.  And for the
> > cases 焦晓冬 gives, that sounds right?  Maybe it's the wrong balance in
> > general, I don't know.  And we do already have filesystems with
> > panic-on-error options, so if they aren't used, then maybe users
> > have already voted against that level of strictness.
> >
>
> Yeah, idk. The problem here is that this is squarely in the domain of
> implementation defined behavior. I do think that the current "policy"
> (if you call it that) of what to do after a wb error is weird and wrong.
> What we probably ought to do is start considering how we'd like it to
> behave.
>
> How about something like this?
>
> Mark the pages as "uncleanable" after a writeback error. We'll satisfy
> reads from the cached data until someone calls fsync, at which point
> we'd return the error and invalidate the uncleanable pages.

Totally agree with you.

>
> If no one calls fsync and scrapes the error, we'll hold on to it for as
> long as we can (or up to some predefined limit) and then after that
> we'll invalidate the uncleanable pages and start returning errors on
> reads. If someone eventually calls fsync afterward, we can return to
> normal operation.

Agree with you except that using fsync() as `clear_error_mark()` seems
weird and counter-intuitive.

>
> As always though...what about mmap? Would we need to SIGBUS at the point
> where we'd start returning errors on read()?

I think SIGBUS for mmap() is the equivalent of EIO for read().

>
> Would that approximate the current behavior enough and make sense?
> Implementing it all sounds non-trivial though...

No.
No problem is reported nowadays because we are relying on the
underlying disk drives: they transparently remap bad sectors and
use S.M.A.R.T. to warn us long before a real EIO could be seen.
As for network filesystems, if I'm not mistaken, the close() op calls
fsync() internally, so there is also no problem there.
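
Still, to make sure I understand the proposal, a toy model of the page
lifecycle described above (illustration only, not kernel code; the state
names and the -EIO plumbing are invented here):

enum page_state { CLEAN, DIRTY, UNCLEANABLE, INVALIDATED };

/* Writeback of a dirty page failed: keep the cached data so reads still
   reflect earlier writes, but remember it can never be cleaned. */
static enum page_state on_writeback_error(enum page_state s)
{
    return (s == DIRTY) ? UNCLEANABLE : s;
}

/* fsync() scrapes the error and drops the stale cache; afterwards the
   file returns to normal operation. */
static enum page_state on_fsync(enum page_state s, int *err)
{
    if (s == UNCLEANABLE || s == INVALIDATED) {
        *err = -5;      /* -EIO, reported exactly once */
        return CLEAN;
    }
    *err = 0;
    return s;
}

/* Nobody called fsync() within the predefined limit: give up and drop
   the pages; read() starts failing from here on. */
static enum page_state on_giveup(enum page_state s)
{
    return (s == UNCLEANABLE) ? INVALIDATED : s;
}

/* read() serves cached data in DIRTY/UNCLEANABLE, does normal I/O in
   CLEAN, and returns -EIO in INVALIDATED. */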

>
> --
> Jeff Layton 
>


Re: POSIX violation by writeback error

2018-09-05 Thread
On Tue, Sep 4, 2018 at 11:44 PM Jeff Layton  wrote:
>
> On Tue, 2018-09-04 at 22:56 +0800, 焦晓冬 wrote:
> > On Tue, Sep 4, 2018 at 7:09 PM Jeff Layton  wrote:
> > >
> > > On Tue, 2018-09-04 at 16:58 +0800, Trol wrote:
> > > > On Tue, Sep 4, 2018 at 3:53 PM Rogier Wolff  
> > > > wrote:
> > > >
> > > > ...
> > > > > >
> > > > > > Jlayton's patch is simple but wonderful idea towards correct error
> > > > > > reporting. It seems one crucial thing is still here to be fixed. 
> > > > > > Does
> > > > > > anyone have some idea?
> > > > > >
> > > > > > The crucial thing may be that a read() after a successful
> > > > > > open()-write()-close() may return old data.
> > > > > >
> > > > > > That may happen where an async writeback error occurs after close()
> > > > > > and the inode/mapping get evicted before read().
> > > > >
> > > > > Suppose I have 1Gb of RAM. Suppose I open a file, write 0.5Gb to it
> > > > > and then close it. Then I repeat this 9 times.
> > > > >
> > > > > Now, when writing those files to storage fails, there is 5Gb of data
> > > > > to remember and only 1Gb of RAM.
> > > > >
> > > > > I can choose any part of that 5Gb and try to read it.
> > > > >
> > > > > Please make a suggestion about where we should store that data?
> > > >
> > > > That is certainly not possible to be done. But at least, shall we report
> > > > error on read()? Silently returning wrong data may cause further damage,
> > > > such as removing wrong files since it was marked as garbage in the old 
> > > > file.
> > > >
> > >
> > > Is the data wrong though? You tried to write and then that failed.
> > > Eventually we want to be able to get at the data that's actually in the
> > > file -- what is that point?
> >
> > The point is silently data corruption is dangerous. I would prefer getting 
> > an
> > error back to receive wrong data.
> >
>
> Well, _you_ might like that, but there are whole piles of applications
> that may fall over completely in this situation. Legacy usage matters
> here.
>
> > A practical and concrete example may be,
> > A disk cleaner program that first searches for garbage files that won't be 
> > used
> > anymore and save the list in a file (open()-write()-close()) and wait for 
> > the
> > user to confirm the list of files to be removed.  A writeback error occurs
> > and the related page/inode/address_space gets evicted while the user is
> > taking a long thought about it. Finally, the user hits enter and the
> > cleaner begin
> > to open() read() the list again. But what gets removed is the old list
> > of files that
> > was generated several months ago...
> >
> > Another example may be,
> > An email editor and a busy mail sender. A well written mail to my boss is
> > composed by this email editor and is saved in a file 
> > (open()-write()-close()).
> > The mail sender gets notified with the path of the mail file to queue it and
> > send it later. A writeback error occurs and the related
> > page/inode/address_space gets evicted while the mail is still waiting in the
> > queue of the mail sender. Finally, the mail file is open() read() by the 
> > sender,
> > but what is sent is the mail to my girlfriend that was composed yesterday...
> >
> > In both cases, the files are not meant to be persisted onto the disk.
> > So, fsync()
> > is not likely to be called.
> >
>
> So at what point are you going to give up on keeping the data? The
> fundamental problem here is an open-ended commitment. We (justifiably)
> avoid those in kernel development because it might leave the system
> without a way out of a resource crunch.
>
> > >
> > > If I get an error back on a read, why should I think that it has
> > > anything at all to do with writes that previously failed? It may even
> > > have been written by a completely separate process that I had nothing at
> > > all to do with.
> > >
> > > > As I can see, that is all about error reporting.
> > > >
> > > > As for suggestion, maybe the error flag of inode/mapping, or the entire 
> > > > inode
> > > > should not be evicted if there was an error. That hopefully won't take 
> > > > much
> >

Re: POSIX violation by writeback error

2018-09-04 Thread
On Tue, Sep 4, 2018 at 7:09 PM Jeff Layton  wrote:
>
> On Tue, 2018-09-04 at 16:58 +0800, Trol wrote:
> > On Tue, Sep 4, 2018 at 3:53 PM Rogier Wolff  wrote:
> >
> > ...
> > > >
> > > > Jlayton's patch is simple but wonderful idea towards correct error
> > > > reporting. It seems one crucial thing is still here to be fixed. Does
> > > > anyone have some idea?
> > > >
> > > > The crucial thing may be that a read() after a successful
> > > > open()-write()-close() may return old data.
> > > >
> > > > That may happen where an async writeback error occurs after close()
> > > > and the inode/mapping get evicted before read().
> > >
> > > Suppose I have 1Gb of RAM. Suppose I open a file, write 0.5Gb to it
> > > and then close it. Then I repeat this 9 times.
> > >
> > > Now, when writing those files to storage fails, there is 5Gb of data
> > > to remember and only 1Gb of RAM.
> > >
> > > I can choose any part of that 5Gb and try to read it.
> > >
> > > Please make a suggestion about where we should store that data?
> >
> > That is certainly not possible to be done. But at least, shall we report
> > error on read()? Silently returning wrong data may cause further damage,
> > such as removing wrong files since it was marked as garbage in the old file.
> >
>
> Is the data wrong though? You tried to write and then that failed.
> Eventually we want to be able to get at the data that's actually in the
> file -- what is that point?

The point is that silent data corruption is dangerous. I would rather get an
error back than receive wrong data.

A practical and concrete example: a disk cleaner program first searches for
garbage files that won't be used anymore, saves the list in a file
(open()-write()-close()), and waits for the user to confirm the list of
files to be removed. A writeback error occurs and the related
page/inode/address_space gets evicted while the user is taking a long time
to think it over. Finally, the user hits enter and the cleaner begins to
open() and read() the list again. But what gets removed is the old list of
files that was generated several months ago...

Another example: an email editor and a busy mail sender. A well-written mail
to my boss is composed in this email editor and saved to a file
(open()-write()-close()). The mail sender is notified with the path of the
mail file so it can queue the mail and send it later. A writeback error
occurs and the related page/inode/address_space gets evicted while the mail
is still waiting in the mail sender's queue. Finally, the mail file is
open()ed and read() by the sender, but what is sent is the mail to my
girlfriend that was composed yesterday...

In both cases, the files are not meant to be persisted to disk, so fsync()
is not likely to be called.
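
Written out as a timeline, the disk cleaner scenario looks like this
(illustrative only; whether stale data is actually observed depends on the
kernel version, the filesystem, and memory pressure):

#include <fcntl.h>
#include <unistd.h>

static char buf[4096];

static void cleaner_race(void)
{
    const char new_list[] = "/home/user/garbage-file\n";

    int fd = open("list", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, new_list, sizeof(new_list) - 1);  /* page cache only */
    close(fd);                  /* succeeds: without fsync(), no error
                                   can possibly be reported yet */

    /* ... async writeback fails with EIO; under memory pressure the
       page and then the inode/address_space are evicted ... */

    fd = open("list", O_RDONLY);
    read(fd, buf, sizeof(buf)); /* cache is cold: re-reads the disk,
                                   which still holds the old list */
    close(fd);
}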

>
> If I get an error back on a read, why should I think that it has
> anything at all to do with writes that previously failed? It may even
> have been written by a completely separate process that I had nothing at
> all to do with.
>
> > As I can see, that is all about error reporting.
> >
> > As for suggestion, maybe the error flag of inode/mapping, or the entire 
> > inode
> > should not be evicted if there was an error. That hopefully won't take much
> > memory. On extreme conditions, where too much error inode requires staying
> > in memory, maybe we should panic rather then spread the error.
> >
> > >
> > > In the easy case, where the data easily fits in RAM, you COULD write a
> > > solution. But when the hardware fails, the SYSTEM will not be able to
> > > follow the posix rules.
> >
> > Nope, we are able to follow the rules. The above is one way that follows the
> > POSIX rules.
> >
>
> This is something we discussed at LSF this year.
>
> We could attempt to keep dirty data around for a little while, at least
> long enough to ensure that reads reflect earlier writes until the errors
> can be scraped out by fsync. That would sort of redefine fsync from
> being "ensure that my writes are flushed" to "synchronize my cache with
> the current state of the file".
>
> The problem of course is that applications are not required to do fsync
> at all. At what point do we give up on it, and toss out the pages that
> can't be cleaned?
>
> We could allow for a tunable that does a kernel panic if writebacks fail
> and the errors are never fetched via fsync, and we run out of memory. I
> don't think that is something most users would want though.
>
> Another thought: maybe we could OOM kill any process that has the file
> open and then toss out the page data in that situation?
>
> I'm wide open to (good) ideas here.

As I said above, silent data corruption is dangerous, and maybe we really
should report errors to user space even in desperate cases.

One possible approach may be:

- When a writeback error occurs, mark the page clean and remember the error
in the inode/address_space of the file.
I think that is what the kernel currently does.

- If the following read() could be served by a page in 

Re: POSIX violation by writeback error

2018-09-04 Thread
On Tue, Sep 4, 2018 at 5:29 PM Rogier Wolff  wrote:
>
> On Tue, Sep 04, 2018 at 04:58:59PM +0800, 焦晓冬 wrote:
>
> > As for suggestion, maybe the error flag of inode/mapping, or the entire 
> > inode
> > should not be evicted if there was an error. That hopefully won't take much
> > memory. On extreme conditions, where too much error inode requires staying
> > in memory, maybe we should panic rather then spread the error.
>
> Again you are hoping it will fit in memory. In an extreme case it
> won't fit in memory. Trying to come up with heuristics about when to
> remember and when to forget such things from the past is very
> difficult.

The key point is to report errors rather than hide them from user space, to
prevent further errors/damage, and that is also what POSIX wants.

And storing the inode/mapping error flag in memory is quite different from
storing the data itself: the flags are tiny and grow per inode rather than
per errored page.
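
A toy model of that bookkeeping (a sketch only; the kernel's actual
mechanism since jlayton's work is the errseq_t field in the address_space,
which packs the same idea into a single word):

struct inode_error {
    int      err;   /* last writeback error, e.g. -EIO, or 0 */
    unsigned seq;   /* bumped whenever a new error is recorded */
};

static void record_wb_error(struct inode_error *ie, int err)
{
    ie->err = err;
    ie->seq++;
}

/* Each opener keeps its own snapshot of seq, so every file description
   observes a given error exactly once. */
static int check_wb_error(const struct inode_error *ie, unsigned *snapshot)
{
    if (ie->seq == *snapshot)
        return 0;               /* nothing new since we last checked */
    *snapshot = ie->seq;        /* "scrape" the error */
    return ie->err;
}

Whatever the exact encoding, the cost is a word or two per inode, not per
dirty page.
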
>
> Think of my comments as: "it's harder than you think", not as "can't
> be done".
>
> Roger.
>
> --
> ** r.e.wo...@bitwizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
> **Delftechpark 26 2628 XH  Delft, The Netherlands. KVK: 27239233**
> *-- BitWizard writes Linux device drivers for any device you may have! --*
> The plan was simple, like my brother-in-law Phil. But unlike
> Phil, this plan just might work.


Re: POSIX violation by writeback error

2018-09-04 Thread
On Tue, Sep 4, 2018 at 3:53 PM Rogier Wolff  wrote:

...
> >
> > Jlayton's patch is simple but wonderful idea towards correct error
> > reporting. It seems one crucial thing is still here to be fixed. Does
> > anyone have some idea?
> >
> > The crucial thing may be that a read() after a successful
> > open()-write()-close() may return old data.
> >
> > That may happen where an async writeback error occurs after close()
> > and the inode/mapping get evicted before read().
>
> Suppose I have 1Gb of RAM. Suppose I open a file, write 0.5Gb to it
> and then close it. Then I repeat this 9 times.
>
> Now, when writing those files to storage fails, there is 5Gb of data
> to remember and only 1Gb of RAM.
>
> I can choose any part of that 5Gb and try to read it.
>
> Please make a suggestion about where we should store that data?

That is certainly not possible. But at the very least, shall we report an
error on read()? Silently returning wrong data may cause further damage,
such as removing the wrong files because they were marked as garbage in the
old file.

As far as I can see, this is all about error reporting.

As for a suggestion: maybe the error flag of the inode/mapping, or the
entire inode, should not be evicted if there was an error. That hopefully
won't take much memory. In extreme conditions, where too many errored
inodes would need to stay in memory, maybe we should panic rather than
spread the error.

>
> In the easy case, where the data easily fits in RAM, you COULD write a
> solution. But when the hardware fails, the SYSTEM will not be able to
> follow the posix rules.

Nope, we are able to follow the rules. The above is one way to follow the
POSIX rules.

>
> Roger.
>
> --
> ** r.e.wo...@bitwizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
> **Delftechpark 26 2628 XH  Delft, The Netherlands. KVK: 27239233**
> *-- BitWizard writes Linux device drivers for any device you may have! --*
> The plan was simple, like my brother-in-law Phil. But unlike
> Phil, this plan just might work.


POSIX violation by writeback error

2018-09-04 Thread
Hi,

After reading several writeback error handling articles on LWN, I have
become worried about writeback error handling.

Jlayton's patch is a simple but wonderful idea for correct error
reporting. It seems one crucial thing still remains to be fixed. Does
anyone have an idea?

The crucial thing may be that a read() after a successful
open()-write()-close() may return old data.

That may happen when an async writeback error occurs after close()
and the inode/mapping get evicted before read().

That violates POSIX, which requires that a read() that can be proven
to occur after a write() has returned must return the new data.

Regards,

Trol


Re: resend, finish_wait with list_empty_careful maybe fragile or buggy

2018-03-12 Thread
+ CC Boqun
in case you are interested in this topic

Best Regards,
Trol

> Sorry, this is a resend because the previous one was messed
> up by my editor and was hard to read.
>
> void finish_wait(struct wait_queue_head *wq_head,
>   struct wait_queue_entry *wq_entry)
> {
>
> ->if (!list_empty_careful(&wq_entry->entry)) {
> ->spin_lock_irqsave(&wq_head->lock, flags);
> ->list_del_init(&wq_entry->entry);
> ->spin_unlock_irqrestore(&wq_head->lock, flags);
> ->}
> }
>
> After a careful look into the stop/wakeup code, I found the above
> code fragile or even buggy. This code was introduced at least 14 years
> ago and, after years of community study of SMP synchronization, still
> seems fragile or buggy.
>
> I understand this code is used a lot and no bug seems to have emerged.
> But, as I'll explain, it depends on a lot of unreliable implementation
> details.
>
> Firstly,
>
> static inline int list_empty_careful(const struct list_head *head)
> {
>         struct list_head *next = head->next;
>         return (next == head) && (next == head->prev);
> }
>
> Note that the reads of head->next and head->prev are not protected by
> READ_ONCE(), so the reads are free to be optimized out entirely.
> Luckily, this optimization is hard for compilers today, since all other
> accesses are outside finish_wait. And still, GCC won't split aligned
> accesses into multiple instructions, so the reads are atomic so far.
>
> Secondly,
>
> if (!list_empty_careful(&wq_entry->entry)) {
>  [remove entry with spinning-lock]
> }
> [ends stack frame of the function calling finish_wait]
> [overwrites wq_entry with another frame]
>
> and
>
> __wake_up_common()   -->
> [read wq_entry->func]  -->
> [read wq_entry->flags]  -->
> autoremove_wake_function()  -->
> [remove wq_entry->entry from wait queue]  -->
>
> are not properly ordered for SMP, so [read wq_entry->flags]
> may be reordered after [remove wq_entry->entry from wait queue],
> since no dependency or memory barrier forbids it. This may let
> [overwrite wq_entry with another frame] on one CPU take place
> before [read wq_entry->flags] on another CPU, causing
> [read wq_entry->flags] to return a bad value.
>
> That this behavior has not been reported may be thanks to:
> - little code using autoremove_wake_function
> - CPU pipelines not being deep enough to make this emerge
>
> Best regards,
> Trol
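
For concreteness, the READ_ONCE()-protected variant being argued for would
look something like this (a sketch, not a submitted patch):

#include <linux/compiler.h>
#include <linux/list.h>

static inline int list_empty_careful_once(const struct list_head *head)
{
        struct list_head *next = READ_ONCE(head->next);

        /* READ_ONCE() keeps the compiler from eliding or tearing the
           loads that the current code performs as plain accesses. */
        return (next == head) && (next == READ_ONCE(head->prev));
}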


Re: smp_mb__after_spinlock requirement too strong?

2018-03-12 Thread
On Mon, Mar 12, 2018 at 9:24 PM, Andrea Parri  wrote:
> Hi Trol,
>
> [...]
>
>
>> But this is just one special case of what acquire-release chains promise us.
>>
>> A=B=0 as initial
>>
>>   CPU0          CPU1          CPU2          CPU3
>>   write A=1
>>                 read A=1
>>                 write B=1
>>                 release X
>>                               acquire X
>>                               read A=?
>>                               release Y
>>                                             acquire Y
>>                                             read B=?
>>
>> assurance 1: CPU3 will surely see the B=1 written by CPU1, and
>> assurance 2: CPU2 will also see the A=1 written by CPU0, as a special case
>>
>> The second assurance holds both in theory and on real hardware.
>>
>> As for theory, the C++11 memory model, which is a potential formal model
>> for kernel memory model as
>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0124r4.html
>> describes, states that:
>>
>> If a value computation A of an atomic object M happens before a value
>> computation B of M, and A takes its value from a side effect X on M, then
>> the value computed by B shall either be the value stored by X or the value
>> stored by a side effect Y on M, where Y follows X in the modification
>> order of M.
>
> A formal memory consistency model for the Linux kernel is now available at:
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git lkmm
>
> Commit
>
>   1c27b644c0fdbc61e113b8faee14baeb8df32486
>   ("Automate memory-barriers.txt; provide Linux-kernel memory model")
>
> provides some information (and references) on the development of this work.
>
> ---
>
> You can check the above observation against this model: unless I mis-typed
> your snippet,
>
> andrea@andrea:~/linux-rcu/tools/memory-model$ cat trol0.litmus
> C trol0
>
> {}
>
> P0(int *a)
> {
> WRITE_ONCE(*a, 1);
> }
>
> P1(int *a, int *b, int *x)
> {
> int r0;
>
> r0 = READ_ONCE(*a);
> WRITE_ONCE(*b, 1);
> smp_store_release(x, 1);
> }
>
> P2(int *a, int *x, int *y)
> {
> int r0;
> int r1;
>
> r0 = smp_load_acquire(x);
> r1 = READ_ONCE(*a);
> smp_store_release(y, 1);
> }
>
> P3(int *b, int *y)
> {
> int r0;
> int r1;
>
> r0 = smp_load_acquire(y);
> r1 = READ_ONCE(*b);
> }
>
> exists (1:r0=1 /\ 2:r0=1 /\ 3:r0=1 /\ (2:r1=0 \/ 3:r1=0))
>
> andrea@andrea:~/linux-rcu/tools/memory-model$ herd7 -conf linux-kernel.cfg 
> trol0.litmus
> Test trol0 Allowed
> States 25
> 1:r0=0; 2:r0=0; 2:r1=0; 3:r0=0; 3:r1=0;
> 1:r0=0; 2:r0=0; 2:r1=0; 3:r0=0; 3:r1=1;
> 1:r0=0; 2:r0=0; 2:r1=0; 3:r0=1; 3:r1=0;
> 1:r0=0; 2:r0=0; 2:r1=0; 3:r0=1; 3:r1=1;
> 1:r0=0; 2:r0=0; 2:r1=1; 3:r0=0; 3:r1=0;
> 1:r0=0; 2:r0=0; 2:r1=1; 3:r0=0; 3:r1=1;
> 1:r0=0; 2:r0=0; 2:r1=1; 3:r0=1; 3:r1=0;
> 1:r0=0; 2:r0=0; 2:r1=1; 3:r0=1; 3:r1=1;
> 1:r0=0; 2:r0=1; 2:r1=0; 3:r0=0; 3:r1=0;
> 1:r0=0; 2:r0=1; 2:r1=0; 3:r0=0; 3:r1=1;
> 1:r0=0; 2:r0=1; 2:r1=0; 3:r0=1; 3:r1=1;
> 1:r0=0; 2:r0=1; 2:r1=1; 3:r0=0; 3:r1=0;
> 1:r0=0; 2:r0=1; 2:r1=1; 3:r0=0; 3:r1=1;
> 1:r0=0; 2:r0=1; 2:r1=1; 3:r0=1; 3:r1=1;
> 1:r0=1; 2:r0=0; 2:r1=0; 3:r0=0; 3:r1=0;
> 1:r0=1; 2:r0=0; 2:r1=0; 3:r0=0; 3:r1=1;
> 1:r0=1; 2:r0=0; 2:r1=0; 3:r0=1; 3:r1=0;
> 1:r0=1; 2:r0=0; 2:r1=0; 3:r0=1; 3:r1=1;
> 1:r0=1; 2:r0=0; 2:r1=1; 3:r0=0; 3:r1=0;
> 1:r0=1; 2:r0=0; 2:r1=1; 3:r0=0; 3:r1=1;
> 1:r0=1; 2:r0=0; 2:r1=1; 3:r0=1; 3:r1=0;
> 1:r0=1; 2:r0=0; 2:r1=1; 3:r0=1; 3:r1=1;
> 1:r0=1; 2:r0=1; 2:r1=1; 3:r0=0; 3:r1=0;
> 1:r0=1; 2:r0=1; 2:r1=1; 3:r0=0; 3:r1=1;
> 1:r0=1; 2:r0=1; 2:r1=1; 3:r0=1; 3:r1=1;
> No
> Witnesses
> Positive: 0 Negative: 25
> Condition exists (1:r0=1 /\ 2:r0=1 /\ 3:r0=1 /\ (2:r1=0 \/ 3:r1=0))
> Observation trol0 Never 0 25
> Time trol0 0.03
> Hash=21369772c98e442dd382bd84b43067ee
>
> Please see "tools/memory-model/README" or "tools/memory-model/Documentation/"
> for further information about these tools/model.
>
> Best,
>   Andrea
>

This work is amazingly great, Andrea.
I'd like to study on it.

>
>>
>> at
>> $1.10 rule 18, on page 14
>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf
>>
>> As for real hardware, Luc provided detailed test and explanation on
>> ARM and POWER in 5.1 Cumulative Barriers for WRC  on page 19
>> in this paper:
>>
>> A Tutorial Introduction to the ARM and POWER Relaxed Memory Models
>> https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
>>
>> So, I think we may remove RCsc from smp_mb__after_spinlock which is
>> really confusing.
>>
>> Best Regards,
>> Trol
>>
>> >
>> >> And for stopped tasks,
>> >>
>> >>  CPU0 CPU1CPU2
>> >>
>> >> 
>> >>
>> >> lock(rq0)
>> >> schedule out A
>> >> remove A from rq0
>> >> store-release(A->on_cpu)
>> >> unock(rq0)
>> >>
>> >>   load_acquire(A->on_cpu)
>> >>   set_task_cpu(A, 2)
>> >>
>> >>   lock(rq2)
>> >>   add 

Re: smp_mb__after_spinlock requirement too strong?

2018-03-12 Thread
On Mon, Mar 12, 2018 at 9:24 PM, Andrea Parri  wrote:
> Hi Trol,
>
> [...]
>
>
>> But this is just one special case that acquire-release chains promise us.
>>
>> A=B=0 as initial
>>
>>   CPU0          CPU1          CPU2          CPU3
>>   write A=1
>>                 read A=1
>>                 write B=1
>>                 release X
>>                               acquire X
>>                               read A=?
>>                               release Y
>>                                             acquire Y
>>                                             read B=?
>>
>> assurance 1: CPU3 will surely see the write B=1 by CPU1, and
>> assurance 2: CPU2 will also see the write A=1 by CPU0, as a special case
>>
>> The second assurance holds both in theory and on real hardware.
>>
>> As for theory, the C++11 memory model, which is a potential formal model
>> for the kernel memory model, as
>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0124r4.html
>> describes, states that:
>>
>> If a value computation A of an atomic object M happens before a value
>> computation B of M, and A takes its value from a side effect X on M, then
>> the value computed by B shall either be the value stored by X or the value
>> stored by a side effect Y on M, where Y follows X in the modification
>> order of M.
>
> A formal memory consistency model for the Linux kernel is now available at:
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git lkmm
>
> Commit
>
>   1c27b644c0fdbc61e113b8faee14baeb8df32486
>   ("Automate memory-barriers.txt; provide Linux-kernel memory model")
>
> provides some information (and references) on the development of this work.
>
> ---
>
> You can check the above observation against this model: unless I mis-typed
> your snippet,
>
> andrea@andrea:~/linux-rcu/tools/memory-model$ cat trol0.litmus
> C trol0
>
> {}
>
> P0(int *a)
> {
> WRITE_ONCE(*a, 1);
> }
>
> P1(int *a, int *b, int *x)
> {
> int r0;
>
> r0 = READ_ONCE(*a);
> WRITE_ONCE(*b, 1);
> smp_store_release(x, 1);
> }
>
> P2(int *a, int *x, int *y)
> {
> int r0;
> int r1;
>
> r0 = smp_load_acquire(x);
> r1 = READ_ONCE(*a);
> smp_store_release(y, 1);
> }
>
> P3(int *b, int *y)
> {
> int r0;
> int r1;
>
> r0 = smp_load_acquire(y);
> r1 = READ_ONCE(*b);
> }
>
> exists (1:r0=1 /\ 2:r0=1 /\ 3:r0=1 /\ (2:r1=0 \/ 3:r1=0))
>
> andrea@andrea:~/linux-rcu/tools/memory-model$ herd7 -conf linux-kernel.cfg trol0.litmus
> Test trol0 Allowed
> States 25
> 1:r0=0; 2:r0=0; 2:r1=0; 3:r0=0; 3:r1=0;
> 1:r0=0; 2:r0=0; 2:r1=0; 3:r0=0; 3:r1=1;
> 1:r0=0; 2:r0=0; 2:r1=0; 3:r0=1; 3:r1=0;
> 1:r0=0; 2:r0=0; 2:r1=0; 3:r0=1; 3:r1=1;
> 1:r0=0; 2:r0=0; 2:r1=1; 3:r0=0; 3:r1=0;
> 1:r0=0; 2:r0=0; 2:r1=1; 3:r0=0; 3:r1=1;
> 1:r0=0; 2:r0=0; 2:r1=1; 3:r0=1; 3:r1=0;
> 1:r0=0; 2:r0=0; 2:r1=1; 3:r0=1; 3:r1=1;
> 1:r0=0; 2:r0=1; 2:r1=0; 3:r0=0; 3:r1=0;
> 1:r0=0; 2:r0=1; 2:r1=0; 3:r0=0; 3:r1=1;
> 1:r0=0; 2:r0=1; 2:r1=0; 3:r0=1; 3:r1=1;
> 1:r0=0; 2:r0=1; 2:r1=1; 3:r0=0; 3:r1=0;
> 1:r0=0; 2:r0=1; 2:r1=1; 3:r0=0; 3:r1=1;
> 1:r0=0; 2:r0=1; 2:r1=1; 3:r0=1; 3:r1=1;
> 1:r0=1; 2:r0=0; 2:r1=0; 3:r0=0; 3:r1=0;
> 1:r0=1; 2:r0=0; 2:r1=0; 3:r0=0; 3:r1=1;
> 1:r0=1; 2:r0=0; 2:r1=0; 3:r0=1; 3:r1=0;
> 1:r0=1; 2:r0=0; 2:r1=0; 3:r0=1; 3:r1=1;
> 1:r0=1; 2:r0=0; 2:r1=1; 3:r0=0; 3:r1=0;
> 1:r0=1; 2:r0=0; 2:r1=1; 3:r0=0; 3:r1=1;
> 1:r0=1; 2:r0=0; 2:r1=1; 3:r0=1; 3:r1=0;
> 1:r0=1; 2:r0=0; 2:r1=1; 3:r0=1; 3:r1=1;
> 1:r0=1; 2:r0=1; 2:r1=1; 3:r0=0; 3:r1=0;
> 1:r0=1; 2:r0=1; 2:r1=1; 3:r0=0; 3:r1=1;
> 1:r0=1; 2:r0=1; 2:r1=1; 3:r0=1; 3:r1=1;
> No
> Witnesses
> Positive: 0 Negative: 25
> Condition exists (1:r0=1 /\ 2:r0=1 /\ 3:r0=1 /\ (2:r1=0 \/ 3:r1=0))
> Observation trol0 Never 0 25
> Time trol0 0.03
> Hash=21369772c98e442dd382bd84b43067ee
>
> Please see "tools/memory-model/README" or "tools/memory-model/Documentation/"
> for further information about these tools/model.
>
> Best,
>   Andrea
>

This work is amazingly great, Andrea.
I'd like to study it.
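
For cross-reference, the same chain can be sketched in portable C11, whose
release/acquire semantics this thread appeals to. A minimal sketch, assuming
a C11 <threads.h> implementation; the assert encodes the litmus test's
"exists" clause, which should never fire:

#include <stdatomic.h>
#include <threads.h>
#include <assert.h>

static atomic_int A, B, X, Y;        /* all zero-initialized */
static int r1a, r2x, r2a, r3y, r3b;  /* per-thread results   */

static int p0(void *arg) {
    (void)arg;
    atomic_store_explicit(&A, 1, memory_order_relaxed);
    return 0;
}

static int p1(void *arg) {
    (void)arg;
    r1a = atomic_load_explicit(&A, memory_order_relaxed);
    atomic_store_explicit(&B, 1, memory_order_relaxed);
    atomic_store_explicit(&X, 1, memory_order_release);
    return 0;
}

static int p2(void *arg) {
    (void)arg;
    r2x = atomic_load_explicit(&X, memory_order_acquire);
    r2a = atomic_load_explicit(&A, memory_order_relaxed);
    atomic_store_explicit(&Y, 1, memory_order_release);
    return 0;
}

static int p3(void *arg) {
    (void)arg;
    r3y = atomic_load_explicit(&Y, memory_order_acquire);
    r3b = atomic_load_explicit(&B, memory_order_relaxed);
    return 0;
}

int main(void) {
    thrd_t t[4];

    thrd_create(&t[0], p0, NULL);
    thrd_create(&t[1], p1, NULL);
    thrd_create(&t[2], p2, NULL);
    thrd_create(&t[3], p3, NULL);
    for (int i = 0; i < 4; i++)
        thrd_join(t[i], NULL);

    /* forbidden outcome: both acquires observed the releases, P1 saw
     * A==1, yet P2 missed A==1 or P3 missed B==1 */
    assert(!(r1a == 1 && r2x == 1 && r3y == 1 && (r2a == 0 || r3b == 0)));
    return 0;
}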

>
>>
>> at
>> §1.10, rule 18, on page 14
>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf
>>
>> As for real hardware, Luc provides detailed tests and explanations for
>> ARM and POWER in §5.1, "Cumulative Barriers for WRC", on page 19
>> of this paper:
>>
>> A Tutorial Introduction to the ARM and POWER Relaxed Memory Models
>> https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
>>
>> So, I think we may remove the RCsc requirement from smp_mb__after_spinlock,
>> which is really confusing.
>>
>> Best Regards,
>> Trol
>>
>> >
>> >> And for stopped tasks,
>> >>
>> >>  CPU0                 CPU1                    CPU2
>> >>
>> >>  [accesses before schedule out A]
>> >>
>> >>  lock(rq0)
>> >>  schedule out A
>> >>  remove A from rq0
>> >>  store-release(A->on_cpu)
>> >>  unlock(rq0)
>> >>
>> >>                       load_acquire(A->on_cpu)
>> >>                       set_task_cpu(A, 2)
>> >>
>> >>                       lock(rq2)
>> >>                       add A into rq2
>> >> [...]

Re: smp_mb__after_spinlock requirement too strong?

2018-03-12 Thread
On Mon, Mar 12, 2018 at 4:56 PM, Peter Zijlstra  wrote:
> On Mon, Mar 12, 2018 at 04:56:00PM +0800, Boqun Feng wrote:
>> So I think the purpose of smp_mb__after_spinlock() is to provide RCsc
>> locks, it's just the comments before that may be misleading. We want
>> RCsc locks in schedule code because we want writes in different critical
>> sections to be ordered even outside the critical sections, for cases like:
>>
>>   CPU 0            CPU 1            CPU 2
>>
>>   {A = 0, B = 0}
>>   lock(rq0);
>>   write A=1;
>>   unlock(rq0);
>>
>>                    lock(rq0);
>>                    read A=1;
>>                    write B=2;
>>                    unlock(rq0);
>>
>>                                     read B=2;
>>                                     smp_rmb();
>>                                     read A=1;
>>
>> I think we need to fix the comments rather than loosen the requirement.
>> Peter?
>
> Yes, ISTR people relying on schedule() being RCsc, and I just picked a
> bad example.

Hi, Peter,
If the fixed comment could point out where this RCsc property is relied upon, that would be great.
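
Boqun's example above can be encoded as a litmus test for the LKMM tools
mentioned earlier in this thread. A sketch only, untested: herd7 would show
whether the locks alone forbid the outcome, and adding
smp_mb__after_spinlock() right after P1's spin_lock() should forbid it in
any case, provided the model version knows that primitive:

C boqun-rcsc-lock

{}

P0(int *a, spinlock_t *rq0)
{
	spin_lock(rq0);
	WRITE_ONCE(*a, 1);
	spin_unlock(rq0);
}

P1(int *a, int *b, spinlock_t *rq0)
{
	int r0;

	spin_lock(rq0);
	r0 = READ_ONCE(*a);
	WRITE_ONCE(*b, 2);
	spin_unlock(rq0);
}

P2(int *a, int *b)
{
	int r0;
	int r1;

	r0 = READ_ONCE(*b);
	smp_rmb();
	r1 = READ_ONCE(*a);
}

exists (1:r0=1 /\ 2:r0=2 /\ 2:r1=0)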


Re: smp_mb__after_spinlock requirement too strong?

2018-03-12 Thread
>> Peter pointed out in this patch https://patchwork.kernel.org/patch/9771921/
>> that the spinlock used at __schedule() should be RCsc to ensure
>> visibility of writes prior to __schedule() when the task is migrated to
>> another CPU.
>>
>> And this is emphasized in the comment on the newly introduced
>> smp_mb__after_spinlock(),
>>
>>  * This barrier must provide two things:
>>  *
>>  *   - it must guarantee a STORE before the spin_lock() is ordered against a
>>  * LOAD after it, see the comments at its two usage sites.
>>  *
>>  *   - it must ensure the critical section is RCsc.
>>  *
>>  * The latter is important for cases where we observe values written by other
>>  * CPUs in spin-loops, without barriers, while being subject to scheduling.
>>  *
>>  * CPU0                 CPU1                    CPU2
>>  *
>>  *                      for (;;) {
>>  *                        if (READ_ONCE(X))
>>  *                          break;
>>  *                      }
>>  * X=1
>>  *                      <sched-out>
>>  *                                              <sched-in>
>>  *                                              r = X;
>>  *
>>  * without transitivity it could be that CPU1 observes X!=0 breaks the loop,
>>  * we get migrated and CPU2 sees X==0.
>>
>> which is used at,
>>
>> __schedule(bool preempt) {
>> ...
>> rq_lock(rq, &rf);
>> smp_mb__after_spinlock();
>> ...
>> }
>> .
>>
>> If I didn't miss something, I found that this kind of visibility does
>> __not__ necessarily depend on the spinlock at __schedule() being RCsc.
>>
>> In fact, as for runnable task A, the migration would be,
>>
>>  CPU0                 CPU1                    CPU2
>>
>>  [accesses before schedule out A]
>>
>>  lock(rq0)
>>  schedule out A
>>  unlock(rq0)
>>
>>                       lock(rq0)
>>                       remove A from rq0
>>                       unlock(rq0)
>>
>>                       lock(rq2)
>>                       add A into rq2
>>                       unlock(rq2)
>>                                              lock(rq2)
>>                                              schedule in A
>>                                              unlock(rq2)
>>
>>                                              [accesses after schedule in A]
>>
>>  [accesses before schedule out A] happens-before
>>  unlock(rq0) happens-before
>>  lock(rq0) happens-before
>>  unlock(rq2) happens-before
>>  lock(rq2) happens-before
>>  [accesses after schedule in A]
>>
>
> But without an RCsc lock, you cannot guarantee that a write propagates to
> CPU 0 and CPU 2 at the same time, so the same write may propagate to
> CPU 0 before [...] but propagate to CPU 2 after
> [...]. So..
>
> Regards,
> Boqun

Thank you for pointing out this case, Boqun.
But this is just one special case that acquire-release chains promise us.

A=B=0 as initial

  CPU0          CPU1          CPU2          CPU3
  write A=1
                read A=1
                write B=1
                release X
                              acquire X
                              read A=?
                              release Y
                                            acquire Y
                                            read B=?

assurance 1: CPU3 will surely see the write B=1 by CPU1, and
assurance 2: CPU2 will also see the write A=1 by CPU0, as a special case

The second assurance holds both in theory and on real hardware.

As for theory, the C++11 memory model, which is a potential formal model
for the kernel memory model, as
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0124r4.html
describes, states that:

If a value computation A of an atomic object M happens before a value
computation B of M, and A takes its value from a side effect X on M, then
the value computed by B shall either be the value stored by X or the value
stored by a side effect Y on M, where Y follows X in the modification
order of M.

at
§1.10, rule 18, on page 14
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf

As for real hardware, Luc provides detailed tests and explanations for
ARM and POWER in §5.1, "Cumulative Barriers for WRC", on page 19
of this paper:

A Tutorial Introduction to the ARM and POWER Relaxed Memory Models
https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
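
The WRC shape discussed in that section can also be checked directly
against the kernel memory model cited earlier in this thread. A sketch
only, untested, but given the A-cumulativity of smp_store_release() in the
model I would expect herd7 to report Never here, which is exactly
assurance 2:

C wrc-rel-acq

{}

P0(int *a)
{
	WRITE_ONCE(*a, 1);
}

P1(int *a, int *x)
{
	int r0;

	r0 = READ_ONCE(*a);
	smp_store_release(x, 1);
}

P2(int *x, int *a)
{
	int r0;
	int r1;

	r0 = smp_load_acquire(x);
	r1 = READ_ONCE(*a);
}

exists (1:r0=1 /\ 2:r0=1 /\ 2:r1=0)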

So, I think we may remove the RCsc requirement from smp_mb__after_spinlock,
which is really confusing.

Best Regards,
Trol

>
>> And for stopped tasks,
>>
>>  CPU0                 CPU1                    CPU2
>>
>>  [accesses before schedule out A]
>>
>>  lock(rq0)
>>  schedule out A
>>  remove A from rq0
>>  store-release(A->on_cpu)
>>  unlock(rq0)
>>
>>                       load_acquire(A->on_cpu)
>>                       set_task_cpu(A, 2)
>>
>>                       lock(rq2)
>>                       add A into rq2
>>                       unlock(rq2)
>>
>>                                              lock(rq2)
>>                                              schedule in A
>>                                              unlock(rq2)
>>
>>                                              [accesses after schedule in A]
>>
>>  [accesses before schedule out A] happens-before
>>  store-release(A->on_cpu) happens-before
>>  load_acquire(A->on_cpu) happens-before
>>  unlock(rq2) happens-before
>>  lock(rq2) happens-before
>>  [accesses after schedule in A]
>>
>> So, I think the only requirement on smp_mb__after_spinlock is
>> to guarantee that a STORE before the spin_lock() is ordered
>> against a LOAD after it. So we could remove the RCsc requirement
>> to allow a more efficient implementation.
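
The stopped-task hand-off in the quote above can likewise be sketched as a
litmus test, modeling [accesses before schedule out A] as a plain write and
the rq2 hand-off with a lock. My own encoding, untested; per the
release-acquire chain argument I would expect herd7 to report Never:

C migration-chain

{
	on_cpu=1;
}

P0(int *x, int *on_cpu)
{
	WRITE_ONCE(*x, 1);             /* access before schedule out A */
	smp_store_release(on_cpu, 0);  /* store-release(A->on_cpu)     */
}

P1(int *on_cpu, int *y, spinlock_t *rq2)
{
	int r0;

	r0 = smp_load_acquire(on_cpu); /* load_acquire(A->on_cpu) */
	spin_lock(rq2);
	WRITE_ONCE(*y, 1);             /* add A into rq2          */
	spin_unlock(rq2);
}

P2(int *x, int *y, spinlock_t *rq2)
{
	int r0;
	int r1;

	spin_lock(rq2);                /* schedule in A              */
	r1 = READ_ONCE(*y);
	r0 = READ_ONCE(*x);            /* access after schedule in A */
	spin_unlock(rq2);
}

exists (1:r0=0 /\ 2:r1=1 /\ 2:r0=0)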

Re: resend, finish_wait with list_empty_careful maybe fragile or buggy

2018-03-11 Thread
> Sorry, this is a resend because the previous one was messed
> up by my editor and was hard to read.
>
> void finish_wait(
>     struct wait_queue_head *wq_head,
>     struct wait_queue_entry *wq_entry)
> {
>     if (!list_empty_careful(&wq_entry->entry)) {
>         spin_lock_irqsave(&wq_head->lock, flags);
>         list_del_init(&wq_entry->entry);
>         spin_unlock_irqrestore(&wq_head->lock, flags);
>     }
> }
>
> After a careful look into the stop/wakeup code, I found
> the above code fragile or even buggy. It was introduced
> at least 14 years ago, and after years of community study
> of SMP synchronization it now looks fragile or buggy.
>
> I understand this code is heavily used and no bug seems
> to have emerged. But, as I'll explain, it depends on a lot
> of unreliable implementation details.
>
> Firstly,
>
> static inline int list_empty_careful(const struct list_head *head)
> {
>     struct list_head *next = head->next;
>     return (next == head) && (next == head->prev);
> }
>
> note that the reads of head->next and head->prev are not
> protected by READ_ONCE. So the reads are free to be
> optimized out entirely. Luckily, this optimization is hard
> for compilers today, since all other accesses are outside
> finish_wait. And, so far, GCC won't split an aligned access
> into multiple instructions, so the read stays atomic.
>
> Secondly,
>
> if ( ! list_empty_careful(&wq_entry->entry) ) {
>     [lock; list_del_init(&wq_entry->entry); unlock]
> }
> [finish_wait returns]
> [wq_entry's stack frame is overwritten by another call]
>
> and
>
> __wake_up_common()  -->
> [read wq_entry->func]  -->
> [read wq_entry->flags]  -->
> autoremove_wake_function()  -->
> [remove wq_entry->entry from wait queue]  -->
>
> are not properly ordered for SMP so that [read wq_entry->flags]
> may be reordered after [remove wq_entry->entry from wait queue]
> since no dependency or memory barrier forbids it. This may cause
> [overwrites wq_entry with another frame] on one CPU to take place
> before [read wq_entry->flags] on another CPU and cause
> [read wq_entry->flags] to return a bad value.
>
> That this behavior has not been reported may be thanks to:
> - little code uses autoremove_wake_function
> - CPU pipelines are not deep enough to make this emerge
>

Hi mingo, peterz,

Would you please have a look at this, if it won't take too much of your time?
Thanks a lot.

CC scheduler maintainers

Best regards,
Patrick Trol
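
As to the "Firstly" point, a minimal sketch of the obvious hardening: wrap
both loads in READ_ONCE() so the compiler may neither re-fetch nor tear
them. list_empty_careful_once is a made-up name, not a submitted patch:

static inline int list_empty_careful_once(const struct list_head *head)
{
    struct list_head *next = READ_ONCE(head->next);

    return (next == head) && (next == READ_ONCE(head->prev));
}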


resend, finish_wait with list_empty_careful maybe fragile or buggy

2018-03-11 Thread
Sorry, this is a resend because the previous one was messed
up by my editor and was hard to read.

void finish_wait(
    struct wait_queue_head *wq_head,
    struct wait_queue_entry *wq_entry)
{
    if (!list_empty_careful(&wq_entry->entry)) {
        spin_lock_irqsave(&wq_head->lock, flags);
        list_del_init(&wq_entry->entry);
        spin_unlock_irqrestore(&wq_head->lock, flags);
    }
}

After a careful look into the stop/wakeup code, I found
the above code fragile or even buggy. It was introduced
at least 14 years ago, and after years of community study
of SMP synchronization it now looks fragile or buggy.

I understand this code is heavily used and no bug seems
to have emerged. But, as I'll explain, it depends on a lot
of unreliable implementation details.

Firstly,

static inline int list_empty_careful(const struct list_head *head)
{
    struct list_head *next = head->next;
    return (next == head) && (next == head->prev);
}

note that the reads of head->next and head->prev are not
protected by READ_ONCE. So the reads are free to be
optimized out entirely. Luckily, this optimization is hard
for compilers today, since all other accesses are outside
finish_wait. And, so far, GCC won't split an aligned access
into multiple instructions, so the read stays atomic.

Secondly,

if ( ! list_empty_careful(&wq_entry->entry) ) {
    [lock; list_del_init(&wq_entry->entry); unlock]
}
[finish_wait returns]
[wq_entry's stack frame is overwritten by another call]

and

__wake_up_common()  -->
[read wq_entry->func]  -->
[read wq_entry->flags]  -->
autoremove_wake_function()  -->
[remove wq_entry->entry from wait queue]  -->

are not properly ordered for SMP so that [read wq_entry->flags]
may be reordered after [remove wq_entry->entry from wait queue]
since no dependency or memory barrier forbids it. This may cause
[overwrites wq_entry with another frame] on one CPU to take place
before [read wq_entry->flags] on another CPU and cause
[read wq_entry->flags] to return a bad value.

That this behavior has not been reported may be thanks to:
- little code uses autoremove_wake_function
- CPU pipelines are not deep enough to make this emerge

Best regards,
Patrick Trol
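
A conservative alternative, sketched for discussion only: skip the lockless
fast path entirely, so finish_wait relies on none of the orderings above.
finish_wait_locked is a hypothetical name, and the unconditional lock has a
cost in the common case:

void finish_wait_locked(struct wait_queue_head *wq_head,
                        struct wait_queue_entry *wq_entry)
{
    unsigned long flags;

    __set_current_state(TASK_RUNNING);
    spin_lock_irqsave(&wq_head->lock, flags);
    /* safe even if a waker already did list_del_init() */
    list_del_init(&wq_entry->entry);
    spin_unlock_irqrestore(&wq_head->lock, flags);
}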


finish_wait with list_empty_careful maybe fragile or buggy

2018-03-11 Thread
void finish_wait(struct wait_queue_head *wq_head,
                 struct wait_queue_entry *wq_entry)
{
    if (!list_empty_careful(&wq_entry->entry)) {
        spin_lock_irqsave(&wq_head->lock, flags);
        list_del_init(&wq_entry->entry);
        spin_unlock_irqrestore(&wq_head->lock, flags);
    }
}

After a careful look into the stop/wakeup code, I found the above code
fragile or even buggy. It was introduced at least 14 years ago, and after
years of community study of SMP synchronization it now looks fragile or
buggy.

I understand this code is heavily used and no bug seems to have emerged.
But, as I'll explain, it depends on a lot of unreliable implementation
details.

Firstly,

static inline int list_empty_careful(const struct list_head *head)
{
    struct list_head *next = head->next;
    return (next == head) && (next == head->prev);
}

note that the reads of head->next and head->prev are not protected by
READ_ONCE. So the reads are free to be optimized out entirely. Luckily,
this optimization is hard for compilers today, since all other accesses
are outside finish_wait. And, so far, GCC won't split an aligned access
into multiple instructions, so the read stays atomic.

Secondly,

if ( ! list_empty_careful(&wq_entry->entry) ) {
    [lock; list_del_init(&wq_entry->entry); unlock]
}
[finish_wait returns]
[wq_entry's stack frame is overwritten by another call]

and

__wake_up_common()  -->
[read wq_entry->func]  -->
[read wq_entry->flags]  -->
autoremove_wake_function()  -->
[remove wq_entry->entry from wait queue]  -->

are not properly ordered for SMP so that [read wq_entry->flags]
may be reordered after [remove wq_entry->entry from wait queue]
since no dependency or memory barrier forbids it. This may cause
[overwrites wq_entry with another frame] on one CPU to take place
before [read wq_entry->flags] on another CPU and cause
[read wq_entry->flags] to return a bad value.

That this behavior has not been reported may be thanks to:
- little code uses autoremove_wake_function
- CPU pipelines are not deep enough to make this emerge

Best regards,
Patrick Trol
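
The "Secondly" point can be modeled as a two-CPU litmus test for the LKMM
tools, collapsing the list removal and the frame reuse into single stores.
My own toy encoding; this is the classic load-buffering shape, which the
model allows when nothing orders the accesses:

C finish-wait-lb

{
	entry=1;
	flags=1;
}

P0(int *flags, int *entry)
{
	int r0;

	r0 = READ_ONCE(*flags);  /* waker: [read wq_entry->flags]   */
	WRITE_ONCE(*entry, 0);   /* waker: [remove wq_entry->entry] */
}

P1(int *flags, int *entry)
{
	int r0;

	r0 = READ_ONCE(*entry);  /* waiter: lockless finish_wait check */
	WRITE_ONCE(*flags, 2);   /* waiter: frame reused after return  */
}

exists (0:r0=2 /\ 1:r0=0)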


smp_mb__after_spinlock requirement too strong?

2018-03-10 Thread
Peter pointed out in this patch https://patchwork.kernel.org/patch/9771921/
that the spinlock used at __schedule() should be RCsc to ensure
visibility of writes prior to __schedule() when the task is migrated to
another CPU.

And this is emphasized in the comment on the newly introduced
smp_mb__after_spinlock(),

 * This barrier must provide two things:
 *
 *   - it must guarantee a STORE before the spin_lock() is ordered against a
 * LOAD after it, see the comments at its two usage sites.
 *
 *   - it must ensure the critical section is RCsc.
 *
 * The latter is important for cases where we observe values written by other
 * CPUs in spin-loops, without barriers, while being subject to scheduling.
 *
 * CPU0                 CPU1                    CPU2
 *
 *                      for (;;) {
 *                        if (READ_ONCE(X))
 *                          break;
 *                      }
 * X=1
 *                      <sched-out>
 *                                              <sched-in>
 *                                              r = X;
 *
 * without transitivity it could be that CPU1 observes X!=0 breaks the loop,
 * we get migrated and CPU2 sees X==0.

which is used at,

__schedule(bool preempt) {
    ...
    rq_lock(rq, &rf);
    smp_mb__after_spinlock();
    ...
}
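
For context, the barrier itself is a no-op where spin_lock() is already
strong enough, and a full barrier where it is not. Paraphrased from memory
of the kernel sources of that era, so treat the exact locations as
approximate:

/* include/linux/spinlock.h: default, for architectures whose
 * spin_lock() already provides the required ordering */
#ifndef smp_mb__after_spinlock
#define smp_mb__after_spinlock()	do { } while (0)
#endif

/* powerpc, whose lock acquisition is not a full barrier: */
#define smp_mb__after_spinlock()	smp_mb()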

If I didn't miss something, I found that this kind of visibility does
__not__ necessarily depend on the spinlock at __schedule() being RCsc.

In fact, as for runnable task A, the migration would be,

 CPU0                   CPU1                    CPU2

 [accesses before schedule out A]

 lock(rq0)
 schedule out A
 unlock(rq0)

                        lock(rq0)
                        remove A from rq0
                        unlock(rq0)

                        lock(rq2)
                        add A into rq2
                        unlock(rq2)
                                                lock(rq2)
                                                schedule in A
                                                unlock(rq2)

                                                [accesses after schedule in A]

 [accesses before schedule out A] happens-before
 unlock(rq0) happens-before
 lock(rq0) happens-before
 unlock(rq2) happens-before
 lock(rq2) happens-before
 [accesses after schedule in A]

And for stopped tasks,

 CPU0                   CPU1                    CPU2

 [accesses before schedule out A]

 lock(rq0)
 schedule out A
 remove A from rq0
 store-release(A->on_cpu)
 unlock(rq0)

                        load_acquire(A->on_cpu)
                        set_task_cpu(A, 2)

                        lock(rq2)
                        add A into rq2
                        unlock(rq2)

                                                lock(rq2)
                                                schedule in A
                                                unlock(rq2)

                                                [accesses after schedule in A]

 [accesses before schedule out A] happens-before
 store-release(A->on_cpu) happens-before
 load_acquire(A->on_cpu) happens-before
 unlock(rq2) happens-before
 lock(rq2) happens-before
 [accesses after schedule in A]

So, I think the only requirement on smp_mb__after_spinlock is
to guarantee that a STORE before the spin_lock() is ordered
against a LOAD after it. So we could remove the RCsc requirement
to allow a more efficient implementation.

Did I miss something, or does this RCsc requirement not really matter?

