Re: [FYI] tux3: Core changes

2015-08-18 Thread Rik van Riel

On 07/31/2015 01:27 PM, Daniel Phillips wrote:

On Friday, July 31, 2015 8:37:35 AM PDT, Raymond Jennings wrote:

Returning ENOSPC when you have free space you can't yet prove is safer than
not returning it and risking a data loss when you get hit by a write/commit
storm. :)


Remember when delayed allocation was scary and unproven, because proving
that ENOSPC will always be returned when needed is extremely difficult?
But the performance advantage was compelling, so we just worked at it
until it worked. There were times when it didn't work properly, but the
code was in the tree so it got fixed.

It's like that now with page forking - a new technique with compelling
advantages, and some challenges. In the past, we (the Linux community)
would rise to the challenge and err on the side of pushing optimizations
in early. That was our mojo, and that is how Linux became the dominant
operating system it is today. Do we, the Linux community, still have that
mojo?


Do you have the mojo to come up with a proposal on how
to make things work, in a way that ensures data consistency
for Linux users?

Yes, we know page forking is not compatible with the way
Linux currently uses refcounts.

The question is, does anyone have an idea on how we could
fix that?

Not necessarily an implementation yet, just an idea might
be enough to move forward at this stage.

However, if nobody wants to work on even an idea, page
forking may simply not be a safe thing to do.


___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: [FYI] tux3: Core changes

2015-08-16 Thread OGAWA Hirofumi
Jan Kara j...@suse.cz writes:

 On Sun 09-08-15 22:42:42, OGAWA Hirofumi wrote:
 Jan Kara j...@suse.cz writes:
 
  I'm not sure about which ENOSPC issue you are speaking BTW. Can you
  please elaborate?
 
 1. GUP simulate page fault, and prepare to modify
 2. writeback clear dirty, and make PTE read-only
 3. snapshot/reflink make block cow

 I assume by point 3. you mean that snapshot / reflink happens now and thus
 the page / block is marked as COW. Am I right?

Right.

 4. driver called GUP modifies page, and dirty page without simulate page fault

 OK, but this doesn't hit ENOSPC because as you correctly write in point 4.,
 the page gets modified without triggering another page fault so COW for the
 modified page isn't triggered. Modified page contents will be in both the
 original and the reflinked file, won't it?

And the above result can be ENOSPC too, depending on the implementation and
the race condition. Also, if the FS converts zeroed blocks to holes like
hammerfs, ENOSPC simply happens: another process uses all the space, but
then there is no ->page_mkwrite() callback to check for ENOSPC.

 And I agree that the fact that snapshotted file's original contents can
 still get modified is a bug. One which is difficult to fix.

Yes, that is why I think this logic is an issue, even before page forking.

 So it sounds like yet another stable page, i.e. unpredictable
 performance. (BTW, recalling stable pages, I noticed that stable pages
 would not provide stable page data for that logic either.)

 Well, assuming elevated refcount == threshold + waitq/wakeup, IMO it is
 not attractive. Rather, it is the last option as a design choice if there
 are no others.

 I agree the performance will be less predictable and that is not good. But
 changing what is visible in the file when writeback races with GUP is a
 worse problem to me.

 Maybe GUP could mark pages it got a ref for, so that we could trigger the
 slow behavior only for them (Peter Zijlstra proposed in [1] an
 infrastructure so that pages pinned by get_user_pages() would be properly
 accounted, and then we could use PG_mlocked and elevated refcount as a
 more reliable indication of pages that need special handling).

I haven't fully read Peter's patchset yet, but it looks good, and it is
maybe similar to the strategy currently in my mind. I'm also thinking of
adding a callback for the FS at the start and end of GUP's pin window.
(Just as an example, the callback could be used by the FS to stop writeback
if the FS wants.)

Thanks.
-- 
OGAWA Hirofumi hirof...@mail.parknet.co.jp



Re: [FYI] tux3: Core changes

2015-08-10 Thread Jan Kara
On Sun 09-08-15 22:42:42, OGAWA Hirofumi wrote:
 Jan Kara j...@suse.cz writes:
 
  I'm not sure about which ENOSPC issue you are speaking BTW. Can you
  please elaborate?
 
 1. GUP simulate page fault, and prepare to modify
 2. writeback clear dirty, and make PTE read-only
 3. snapshot/reflink make block cow

I assume by point 3. you mean that snapshot / reflink happens now and thus
the page / block is marked as COW. Am I right?

 4. driver called GUP modifies page, and dirty page without simulate page fault

OK, but this doesn't hit ENOSPC because as you correctly write in point 4.,
the page gets modified without triggering another page fault so COW for the
modified page isn't triggered. Modified page contents will be in both the
original and the reflinked file, won't it?

And I agree that the fact that snapshotted file's original contents can
still get modified is a bug. One which is difficult to fix.

  If your claim is that there is strange logic widely used already, and of
  course we can't simply break it because of compatibility, I would be able
  to agree. But your claim sounds like that logic is sane and well-designed
  behavior. So I disagree.
 
  To me the rule "Do not detach a page from a radix tree if it has an
  elevated refcount unless explicitly requested by a syscall" looks like a
  sane one.
  Yes.
 
   And frankly I fail to see why you and Daniel care so much about this
   corner case because from performance POV it's IMHO a non-issue and you
   bother with page forking because of performance, don't you?
  
  Penalizing the corner-case path, instead of the normal path, is what we
  should try first. Penalizing the normal path to allow the corner-case
  path is basically insane.

  Making the normal path faster and more reliable is what we are trying to do.
 
  Elevated refcount of a page is in my opinion a corner case path. That's
  why I think that penalizing that case by waiting for IO instead of
  forking is an acceptable cost for the improved compatibility &
  maintainability of the code.
 
 What is an elevated refcount? What is the difference from a normal
 refcount? Are you saying refcount >= specified threshold + waitq/wakeup or
 such? If so, it is not a path. It is a state. IOW, some group may not hit
 it much, but some group may hit it a lot, on the normal path.

Yes, by elevated refcount I meant refcount > 2 (one for pagecache, one for
your code inspecting the page).

 So it sounds like yet another stable page, i.e. unpredictable
 performance. (BTW, recalling stable pages, I noticed that stable pages
 would not provide stable page data for that logic either.)

 Well, assuming elevated refcount == threshold + waitq/wakeup, IMO it is
 not attractive. Rather, it is the last option as a design choice if there
 are no others.

I agree the performance will be less predictable and that is not good. But
changing what is visible in the file when writeback races with GUP is a
worse problem to me.

Maybe GUP could mark pages it got a ref for, so that we could trigger the
slow behavior only for them (Peter Zijlstra proposed in [1] an
infrastructure so that pages pinned by get_user_pages() would be properly
accounted, and then we could use PG_mlocked and elevated refcount as a more
reliable indication of pages that need special handling).
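As an illustrative sketch (toy Python, invented names, not kernel code),
the policy under discussion, fork the page during writeback only when
nothing else holds a reference and fall back to waiting for IO otherwise,
could look like this:

```python
# Toy sketch of the proposed writeback policy: fork only unpinned pages,
# wait for IO on pages whose refcount suggests a GUP pin. Invented names;
# not kernel code.

PAGECACHE_REFS = 2  # one ref for the page cache, one for the inspecting code

def writeback_action(page_refcount, wait_for_io):
    """Decide what writeback does with one dirty page."""
    if page_refcount > PAGECACHE_REFS:
        # Possibly pinned by get_user_pages(): forking could make the
        # pinned copy diverge from what reaches disk, so wait instead.
        wait_for_io()
        return "waited"
    # No extra references: safe to fork the page and write the frozen copy.
    return "forked"

waits = []
assert writeback_action(2, lambda: waits.append("io")) == "forked"
assert writeback_action(3, lambda: waits.append("io")) == "waited"
assert waits == ["io"]
```

The trade-off being debated is visible here: the "waited" branch has
unpredictable latency (stable-page-like behavior), while the "forked"
branch keeps writeback non-blocking.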

Honza

[1] http://thread.gmane.org/gmane.linux.kernel.mm/117679
-- 
Jan Kara j...@suse.com
SUSE Labs, CR



Re: [FYI] tux3: Core changes

2015-08-09 Thread OGAWA Hirofumi
Jan Kara j...@suse.cz writes:

 I'm not sure about which ENOSPC issue you are speaking BTW. Can you
 please elaborate?

1. GUP simulate page fault, and prepare to modify
2. writeback clear dirty, and make PTE read-only
3. snapshot/reflink make block cow
4. driver called GUP modifies page, and dirty page without simulate page fault
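As an illustrative sketch (toy Python, invented names, not kernel code),
the four steps can be modeled to show where the ENOSPC check is lost: the
filesystem's space check lives only in the fault path, and step 4 dirties
the page without a fault. The hole-conversion twist follows the hammerfs
variant mentioned elsewhere in this thread:

```python
# Toy model of the race above. The FS checks free space only in its
# fault-path callback (modeled loosely on ->page_mkwrite()).

class Page:
    def __init__(self):
        self.pte_writable = False
        self.dirty = False

class FS:
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks
        self.mkwrite_calls = 0

    def page_mkwrite(self, page):
        # Fault path: the only place the FS can reserve space or COW.
        self.mkwrite_calls += 1
        if self.free_blocks == 0:
            raise OSError("ENOSPC")
        self.free_blocks -= 1
        page.pte_writable = True
        page.dirty = True

fs = FS(free_blocks=1)
page = Page()

fs.page_mkwrite(page)          # 1. GUP simulates a page fault
page.dirty = False             # 2. writeback cleans the page...
page.pte_writable = False      #    ...and makes the PTE read-only
fs.free_blocks += 1            #    (FS converts the zeroed block to a hole)
fs.free_blocks = 0             # 3. another consumer takes the last space
page.dirty = True              # 4. driver dirties the page via its GUP
                               #    reference: no fault, so no space check

assert fs.mkwrite_calls == 1   # the FS never saw the second modification
assert page.dirty and fs.free_blocks == 0
```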

 If your claim is that there is strange logic widely used already, and of
 course we can't simply break it because of compatibility, I would be able
 to agree. But your claim sounds like that logic is sane and well-designed
 behavior. So I disagree.

 To me the rule "Do not detach a page from a radix tree if it has an
 elevated refcount unless explicitly requested by a syscall" looks like a
 sane one.
 Yes.

  And frankly I fail to see why you and Daniel care so much about this
  corner case because from performance POV it's IMHO a non-issue and you
  bother with page forking because of performance, don't you?
 
 Penalizing the corner-case path, instead of the normal path, is what we
 should try first. Penalizing the normal path to allow the corner-case path
 is basically insane.

 Making the normal path faster and more reliable is what we are trying to do.

 Elevated refcount of a page is in my opinion a corner case path. That's
 why I think that penalizing that case by waiting for IO instead of forking
 is an acceptable cost for the improved compatibility & maintainability of
 the code.

What is an elevated refcount? What is the difference from a normal
refcount? Are you saying refcount >= specified threshold + waitq/wakeup or
such? If so, it is not a path. It is a state. IOW, some group may not hit
it much, but some group may hit it a lot, on the normal path.

So it sounds like yet another stable page, i.e. unpredictable
performance. (BTW, recalling stable pages, I noticed that stable pages
would not provide stable page data for that logic either.)

Well, assuming elevated refcount == threshold + waitq/wakeup, IMO it is not
attractive. Rather, it is the last option as a design choice if there are
no others.

Thanks.
-- 
OGAWA Hirofumi hirof...@mail.parknet.co.jp



Re: [FYI] tux3: Core changes

2015-08-03 Thread Jan Kara
On Fri 31-07-15 17:16:45, Daniel Phillips wrote:
 On Friday, July 31, 2015 5:00:43 PM PDT, Daniel Phillips wrote:
 Note: Hirofumi's email is clear, logical and speaks to the
 question. This branch of the thread is largely pointless, though
 it essentially says the same thing in non-technical terms. Perhaps
 your next response should be to Hirofumi, and perhaps it should be
 technical.
 
 Now, let me try to lead the way, but being specific. RDMA was raised
 as a potential failure case for Tux3 page forking. But the RDMA api
 does not let you use memory mmaped by Tux3 as a source or destination
 of IO. Instead, it sets up its own pages and hands them out to the
 RDMA app from a pool. So no issue. One down, right?

Can you please tell me how you arrived to that conclusion? As far as I'm
looking at the code in drivers/infiniband/ I don't see anything there
preventing userspace from passing in mmapped memory...

Honza
-- 
Jan Kara j...@suse.com
SUSE Labs, CR



Re: [FYI] tux3: Core changes

2015-08-01 Thread Elifarley Callado Coelho Cruz
My gosh! This is driving me crazy. Please let's make it crystal clear,
in technical and precise terms, devoid of any ad hominem attacks or the
like, what is preventing Tux3 from being merged.

Maybe a list of issues in GitHub, so that each issue can be scrutinized
more easily. Like this one: https://github.com/tux3fs/tux3-merging/issues/1

Daniel, thank you so much for spending so much energy to fight not only for
tux3, but for logic, reason and rationality, and for a saner dev process in
Linux.

Elifarley Cruz

-

  Do not believe anything because it is said by an authority, or if it  is
said to come from angels, or from Gods, or from an inspired source.
Believe it only if you have explored it in your own heart and mind and body
and found it to be true.  Work out your own path, through diligence.
- Gautama Buddha

On Fri, Jul 31, 2015 at 7:27 PM, David Lang da...@lang.hm wrote:

 On Fri, 31 Jul 2015, Daniel Phillips wrote:

 On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:

 We, the Linux Community have less tolerance for losing people's data and
 preventing them from operating than we used to when it was all tinkerers'
 personal data and secondary systems.

 So rather than pushing optimizations out to everyone and seeing what
 breaks, we now do more testing and checking for failures before pushing
 things out.


 By the way, I am curious about whose data you think will get lost
 as a result of pushing out Tux3 with a possible theoretical bug
 in a wildly improbable scenario that has not actually been
 described with sufficient specificity to falsify, let alone
 demonstrated.


 you weren't asking about any particular feature of Tux, you were asking if
 we were still willing to push out stuff that breaks for users and fix it
 later.

 Especially for filesystems that can lose the data of whoever is using it,
 the answer seems to be a clear no.

 there may be bugs in what's pushed out that we don't know about. But we
 don't push out potential data corruption bugs that we do know about (or
 think we do)

 so if you think this should be pushed out with this known corner case
 that's not handled properly, you have to convince people that it's _so_
 improbable that they shouldn't care about it.

 David Lang




Re: [FYI] tux3: Core changes

2015-07-31 Thread Daniel Phillips

On Friday, July 31, 2015 5:00:43 PM PDT, Daniel Phillips wrote:

Note: Hirofumi's email is clear, logical and speaks to the
question. This branch of the thread is largely pointless, though
it essentially says the same thing in non-technical terms. Perhaps
your next response should be to Hirofumi, and perhaps it should be
technical.


Now, let me try to lead the way, but being specific. RDMA was raised
as a potential failure case for Tux3 page forking. But the RDMA api
does not let you use memory mmaped by Tux3 as a source or destination
of IO. Instead, it sets up its own pages and hands them out to the
RDMA app from a pool. So no issue. One down, right?

Regards,

Daniel



Re: [FYI] tux3: Core changes

2015-07-31 Thread Raymond Jennings
Returning ENOSPC when you have free space you can't yet prove is safer than
not returning it and risking a data loss when you get hit by a write/commit
storm. :)

On Thu, Jul 30, 2015 at 9:44 PM, OGAWA Hirofumi hirof...@mail.parknet.co.jp wrote:

 Jan Kara j...@suse.cz writes:

   Yes, if userspace truncates the file, the situation we end up with is
   basically the same. However for truncate to happen some malicious process
   has to come and truncate the file - a failure scenario that is acceptable
   for most use cases since it doesn't happen unless someone is actively
   trying to screw you. With page forking it is enough for flusher thread
   to start writeback for that page to trigger the problem - event that is
   basically bound to happen without any other userspace application
   interfering.
 
  Where does the acceptable conclusion come from? That pseudocode logic
  doesn't say anything about usage. And even if we assume it is acceptable,
  as far as I can see, for example /proc/sys/vm/drop_caches is enough to
  trigger it, or a page on a non-existent block (sparse file, i.e. a
  missing disk space check in your logic). And if there is really no
  lock/check, there would be other races.
 
  So drop_caches won't cause any issues because it avoids mmaped pages.
  Also page reclaim or page migration don't cause any issues because
  they avoid pages with increased refcount (and increased refcount would
  stop drop_caches from reclaiming the page as well if it was not for the
  mmaped check before). Generally, elevated page refcount currently
  guarantees page isn't migrated, reclaimed, or otherwise detached from the
  mapping (except for truncate where the combination of mapping->index
  becomes invalid) and your page forking would change that assumption -
  which IMHO has a big potential for some breakage somewhere.

 Lifetime and visibility to the user are different topics.  The issue here
 is visibility. Of course, those are related more or less, but refcount
 doesn't stop a page from being dropped from the radix tree at all.

 Well, anyway, your claim seems to assume the userspace app works around
 the issues. And it sounds like it still doesn't work around the ENOSPC
 issue (validate at page fault/GUP) even if we assume userspace behaves
 perfectly. Calling that a kernel assumption is strange.

 If your claim is that there is strange logic widely used already, and of
 course we can't simply break it because of compatibility, I would be able
 to agree. But your claim sounds like that logic is sane and well-designed
 behavior. So I disagree.

  And frankly I fail to see why you and Daniel care so much about this
  corner case because from performance POV it's IMHO a non-issue and you
  bother with page forking because of performance, don't you?

 Penalizing the corner-case path, instead of the normal path, is what we
 should try first. Penalizing the normal path to allow the corner-case path
 is basically insane.

 Making the normal path faster and more reliable is what we are trying to do.

  So you can have a look for example at
  drivers/media/v4l2-core/videobuf2-dma-contig.c which implements setting
  up of a video device buffer at virtual address specified by user. Now I
  don't know whether there really is any userspace video program that sets
  up the video buffer in mmaped file. I would agree with you that it would
  be a strange thing to do but I've seen enough strange userspace code that
  I would not be too surprised.

  Another example of similar kind is at
  drivers/infiniband/core/umem.c where we again set up buffer for
  infiniband cards at users specified virtual address. And there are more
  drivers in kernel like that.

 Unfortunately, I'm not looking at those yet. I guess they would be helpful
 to see the details.

 Thanks.
 --
 OGAWA Hirofumi hirof...@mail.parknet.co.jp



Re: [FYI] tux3: Core changes

2015-07-31 Thread Daniel Phillips

On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:

If you define this as losing our mojo, then yes we have.


A pity. There remains so much to do that simply will not get
done in the absence of mojo.

Regards,

Daniel




Re: [FYI] tux3: Core changes

2015-07-31 Thread David Lang

On Fri, 31 Jul 2015, Daniel Phillips wrote:


Subject: Re: [FYI] tux3: Core changes

On Friday, July 31, 2015 8:37:35 AM PDT, Raymond Jennings wrote:

Returning ENOSPC when you have free space you can't yet prove is safer than
not returning it and risking a data loss when you get hit by a write/commit
storm. :)


Remember when delayed allocation was scary and unproven, because proving
that ENOSPC will always be returned when needed is extremely difficult?
But the performance advantage was compelling, so we just worked at it
until it worked. There were times when it didn't work properly, but the
code was in the tree so it got fixed.

It's like that now with page forking - a new technique with compelling
advantages, and some challenges. In the past, we (the Linux community)
would rise to the challenge and err on the side of pushing optimizations
in early. That was our mojo, and that is how Linux became the dominant
operating system it is today. Do we, the Linux community, still have that
mojo?


We, the Linux Community, have less tolerance for losing people's data and 
preventing them from operating than we used to when it was all tinkerers' 
personal data and secondary systems.


So rather than pushing optimizations out to everyone and seeing what breaks, we 
now do more testing and checking for failures before pushing things out.


This means that when something new is introduced, we default to the safe, 
slightly slower way initially (there will be enough other bugs to deal with in 
any case), and then as we gain experience from the tinkerers enabling the 
performance optimizations, we make those optimizations reliable and only then 
push them out to all users.


If you define this as losing our mojo, then yes we have. But most people see 
the pace of development as still being high, just with more testing and 
polishing before it gets out to users.


David Lang



Re: [FYI] tux3: Core changes

2015-07-31 Thread Daniel Phillips

On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:
We, the Linux Community, have less tolerance for losing people's 
data and preventing them from operating than we used to when it 
was all tinkerers' personal data and secondary systems.


So rather than pushing optimizations out to everyone and seeing 
what breaks, we now do more testing and checking for failures 
before pushing things out.


By the way, I am curious about whose data you think will get lost
as a result of pushing out Tux3 with a possible theoretical bug
in a wildly improbable scenario that has not actually been
described with sufficient specificity to falsify, let alone
demonstrated.

Regards,

Daniel




Re: [FYI] tux3: Core changes

2015-07-31 Thread Daniel Phillips

On Friday, July 31, 2015 3:27:12 PM PDT, David Lang wrote:

On Fri, 31 Jul 2015, Daniel Phillips wrote:


On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote: ...


you weren't asking about any particular feature of Tux, you 
were asking if we were still willing to push out stuff that 
breaks for users and fix it later.


I think you left a key word out of my ask: theoretical.

Especially for filesystems that can lose the data of whoever 
is using it, the answer seems to be a clear no.


there may be bugs in what's pushed out that we don't know 
about. But we don't push out potential data corruption bugs that 
we do know about (or think we do)


so if you think this should be pushed out with this known 
corner case that's not handled properly, you have to convince 
people that it's _so_ improbable that they shouldn't care about 
it.


There should also be an onus on the person posing the worry
to prove their case beyond a reasonable doubt, which has not been
done in case we are discussing here. Note: that is a technical
assessment to which a technical response is appropriate.

I do think that we should put a cap on this fencing and make
a real effort to get Tux3 into mainline. We should at least
set a ground rule that a problem should be proved real before it
becomes a reason to derail a project in the way that our project
has been derailed. Otherwise, it's hard to see what interest is
served.

OK, let's get back to the program. I accept your assertion that
we should convince people that the issue is improbable. To do
that, I need a specific issue to address. So far, no such issue
has been provided with specificity. Do you see why this is
frustrating?

Please, community. Give us specific issues to address, or give us
some way out of this eternal limbo. Or better, let's go back to the
old way of doing things in Linux, which is what got us where we
are today. Not this.

Note: Hirofumi's email is clear, logical and speaks to the
question. This branch of the thread is largely pointless, though
it essentially says the same thing in non-technical terms. Perhaps
your next response should be to Hirofumi, and perhaps it should be
technical.

Regards,

Daniel




Re: [FYI] tux3: Core changes

2015-07-31 Thread David Lang

On Fri, 31 Jul 2015, Daniel Phillips wrote:


On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:
We, the Linux Community, have less tolerance for losing people's data and 
preventing them from operating than we used to when it was all tinkerers' 
personal data and secondary systems.


So rather than pushing optimizations out to everyone and seeing what 
breaks, we now do more testing and checking for failures before pushing 
things out.


By the way, I am curious about whose data you think will get lost
as a result of pushing out Tux3 with a possible theoretical bug
in a wildly improbable scenario that has not actually been
described with sufficient specificity to falsify, let alone
demonstrated.


you weren't asking about any particular feature of Tux, you were asking if we 
were still willing to push out stuff that breaks for users and fix it later.


Especially for filesystems that can lose the data of whoever is using it, the 
answer seems to be a clear no.


there may be bugs in what's pushed out that we don't know about. But we don't 
push out potential data corruption bugs that we do know about (or think we do)


so if you think this should be pushed out with this known corner case that's not 
handled properly, you have to convince people that it's _so_ improbable that 
they shouldn't care about it.


David Lang



Re: [FYI] tux3: Core changes

2015-07-30 Thread OGAWA Hirofumi
Jan Kara j...@suse.cz writes:

  Yes, if userspace truncates the file, the situation we end up with is
  basically the same. However for truncate to happen some malicious process
  has to come and truncate the file - a failure scenario that is acceptable
  for most use cases since it doesn't happen unless someone is actively
  trying to screw you. With page forking it is enough for flusher thread
  to start writeback for that page to trigger the problem - event that is
  basically bound to happen without any other userspace application
  interfering.
 
 Where does the acceptable conclusion come from? That pseudocode logic
 doesn't say anything about usage. And even if we assume it is acceptable,
 as far as I can see, for example /proc/sys/vm/drop_caches is enough to
 trigger it, or a page on a non-existent block (sparse file, i.e. a missing
 disk space check in your logic). And if there is really no lock/check,
 there would be other races.

 So drop_caches won't cause any issues because it avoids mmaped pages.
 Also page reclaim or page migration don't cause any issues because
 they avoid pages with increased refcount (and increased refcount would stop
 drop_caches from reclaiming the page as well if it was not for the mmaped
 check before). Generally, elevated page refcount currently guarantees page
 isn't migrated, reclaimed, or otherwise detached from the mapping (except
 for truncate where the combination of mapping->index becomes invalid) and
 your page forking would change that assumption - which IMHO has a big
 potential for some breakage somewhere.

Lifetime and visibility to the user are different topics.  The issue here
is visibility. Of course, those are related more or less, but refcount
doesn't stop a page from being dropped from the radix tree at all.

Well, anyway, your claim seems to assume the userspace app works around the
issues. And it sounds like it still doesn't work around the ENOSPC issue
(validate at page fault/GUP) even if we assume userspace behaves perfectly.
Calling that a kernel assumption is strange.

If your claim is that there is strange logic widely used already, and of
course we can't simply break it because of compatibility, I would be able
to agree. But your claim sounds like that logic is sane and well-designed
behavior. So I disagree.

 And frankly I fail to see why you and Daniel care so much about this
 corner case because from performance POV it's IMHO a non-issue and you
 bother with page forking because of performance, don't you?

Penalizing the corner-case path, instead of the normal path, is what we
should try first. Penalizing the normal path to allow the corner-case path
is basically insane.

Making the normal path faster and more reliable is what we are trying to do.

 So you can have a look for example at
 drivers/media/v4l2-core/videobuf2-dma-contig.c which implements setting up
 of a video device buffer at virtual address specified by user. Now I don't
 know whether there really is any userspace video program that sets up the
 video buffer in mmaped file. I would agree with you that it would be a
 strange thing to do but I've seen enough strange userspace code that I
 would not be too surprised.

 Another example of similar kind is at
 drivers/infiniband/core/umem.c where we again set up buffer for infiniband
 cards at users specified virtual address. And there are more drivers in
 kernel like that.

Unfortunately, I'm not looking at those yet. I guess they would be helpful
to see the details.

Thanks.
-- 
OGAWA Hirofumi hirof...@mail.parknet.co.jp



Re: [FYI] tux3: Core changes

2015-07-09 Thread Jan Kara
On Sun 05-07-15 21:54:45, OGAWA Hirofumi wrote:
 Jan Kara j...@suse.cz writes:
 
  I'm not sure I'm understanding your pseudocode logic correctly though.
  This logic doesn't seem to be a page forking specific issue.  And
  this pseudocode logic seems to be missing the locking and revalidation
  of the page.
  
  If you can show more details, it would be helpful to see more, and
  discuss the issue of page forking, or we can think about how to handle
  the corner cases.
  
  Well, before that, why are more details needed?
  
  For example, replace the page fork at (4) with truncate, punch
  hole, or invalidate page.
  
  Those operations remove the old page from radix tree, so the
  userspace's write creates the new page, and HW still references the
  old page.  (I.e. situation should be same with page forking, in my
  understand of this pseudocode logic.)
 
  Yes, if userspace truncates the file, the situation we end up with is
  basically the same. However for truncate to happen some malicious process
  has to come and truncate the file - a failure scenario that is acceptable
  for most use cases since it doesn't happen unless someone is actively
  trying to screw you. With page forking it is enough for flusher thread
  to start writeback for that page to trigger the problem - event that is
  basically bound to happen without any other userspace application
  interfering.
 
 Where does the acceptable conclusion come from? That pseudocode logic
 doesn't say anything about usage. And even if we assume it is acceptable,
 as far as I can see, for example /proc/sys/vm/drop_caches is enough to
 trigger it, or a page on a non-existent block (sparse file, i.e. a missing
 disk space check in your logic). And if there is really no lock/check,
 there would be other races.

So drop_caches won't cause any issues because it avoids mmaped pages.
Also page reclaim or page migration don't cause any issues because
they avoid pages with increased refcount (and increased refcount would stop
drop_caches from reclaiming the page as well if it was not for the mmaped
check before). Generally, elevated page refcount currently guarantees page
isn't migrated, reclaimed, or otherwise detached from the mapping (except
for truncate, where the mapping-index combination becomes invalid) and
your page forking would change that assumption - which IMHO has a big
potential for some breakage somewhere. And frankly I fail to see why you
and Daniel care so much about this corner case because from performance POV
it's IMHO a non-issue and you bother with page forking because of
performance, don't you?

  IOW, this pseudocode logic seems to be broken even without page forking
  if there is no lock and revalidate.  Usually, we prevent unpleasant I/O
  by lock_page or PG_writeback, and an obsolete page is revalidated under
  lock_page.
 
  Well, good luck with converting all the get_user_pages() users in kernel to
  use lock_page() or PG_writeback checks to avoid issues with page forking. I
  don't think that's really feasible.
 
 What does converting all get_user_pages() users mean? Well, that may be
 more or less right; I also think there is an issue in/around
 get_user_pages() that we have to tackle.
 
 
 IMO, if there is real code that does what that pseudocode logic does, that
 code is the breakage. Calling it an acceptable limitation and giving up on
 fixing it is not, I think, the right way to go. If there really is code
 broken like your logic, we should fix it.
 
 Could you point to code that actually uses your logic? It seems so racy
 that I find it hard to believe such code really exists.

So you can have a look for example at
drivers/media/v4l2-core/videobuf2-dma-contig.c which implements setting up
of a video device buffer at virtual address specified by user. Now I don't
know whether there really is any userspace video program that sets up the
video buffer in mmaped file. I would agree with you that it would be a
strange thing to do but I've seen enough strange userspace code that I
would not be too surprised.

Another example of similar kind is at
drivers/infiniband/core/umem.c where we again set up buffer for infiniband
cards at users specified virtual address. And there are more drivers in
kernel like that.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: [FYI] tux3: Core changes

2015-07-05 Thread OGAWA Hirofumi
Jan Kara j...@suse.cz writes:

 I'm not sure I'm understanding your pseudocode logic correctly though.
 This logic doesn't seem to be a page forking specific issue.  And
 this pseudocode logic seems to be missing the locking and revalidation
 of the page.
 
 If you can show more details, it would be helpful to see more, and
 discuss the issue of page forking, or we can think about how to handle
 the corner cases.
 
 Well, before that, why do we need more details?
 
 For example, replace the page fork at (4) with truncate, punch
 hole, or invalidate page.
 
 Those operations remove the old page from the radix tree, so the
 userspace's write creates a new page, and HW still references the
 old page.  (I.e., the situation should be the same as with page forking,
 in my understanding of this pseudocode logic.)

 Yes, if userspace truncates the file, the situation we end up with is
 basically the same. However for truncate to happen some malicious process
 has to come and truncate the file - a failure scenario that is acceptable
 for most use cases since it doesn't happen unless someone is actively
 trying to screw you. With page forking it is enough for the flusher thread
 to start writeback for that page to trigger the problem - an event that is
 basically bound to happen without any other userspace application
 interfering.

Where does the conclusion that it is acceptable come from? That pseudocode
logic doesn't say anything about usage at all. And even if we assume it is
acceptable, as far as I can see, for example /proc/sys/vm/drop_caches is
enough to trigger it, or a page on a non-existent block (a sparse file,
i.e. a missing disk-space check in your logic). And if there really is no
lock/check at all, there would be other races.

 IOW, this pseudocode logic seems to be broken even without page forking
 if there is no lock and revalidate.  Usually, we prevent unpleasant I/O
 by lock_page or PG_writeback, and an obsolete page is revalidated under
 lock_page.

 Well, good luck with converting all the get_user_pages() users in kernel to
 use lock_page() or PG_writeback checks to avoid issues with page forking. I
 don't think that's really feasible.

What does converting all get_user_pages() users mean? Well, that may be
more or less right; I also think there is an issue in/around
get_user_pages() that we have to tackle.


IMO, if there is real code that does what that pseudocode logic does, that
code is the breakage. Calling it an acceptable limitation and giving up on
fixing it is not, I think, the right way to go. If there really is code
broken like your logic, we should fix it.

Could you point to code that actually uses your logic? It seems so racy
that I find it hard to believe such code really exists.

 For page forking, we may also be able to prevent a similar situation by
 locking, flags, and revalidation. But those details might differ from the
 current code, because the page states are different.

 Sorry, I don't understand what you mean in this paragraph. Can you
 explain it a bit more?

This just means that a forked page (the old page) and a truncated page have
different sets of flags and state, so we may have to adjust revalidation.

Thanks.
-- 
OGAWA Hirofumi hirof...@mail.parknet.co.jp



Re: [FYI] tux3: Core changes

2015-06-23 Thread Jan Kara
On Mon 22-06-15 00:36:00, OGAWA Hirofumi wrote:
 Jan Kara j...@suse.cz writes:
  So there are a few things to have in mind:
  1) There is nothing like a writeable page. Page is always writeable (at
  least on x86 architecture). When a page is mapped into some virtual address
  space (or more of them), this *mapping* can be either writeable or 
  read-only.
  mkwrite changes the mapping from read-only to writeable but kernel /
  hardware is free to write to the page regardless of the mapping.
 
  2) When kernel / hardware writes to the page, it first modifies the page
  and then marks it dirty.
   
  So what can happen in this scenario is:
 
  1) You hand kernel a part of a page as a buffer. page_mkwrite() happens,
 page is dirtied, kernel notes a PFN of the page somewhere internally.
 
  2) Writeback comes and starts writeback for the page.
 
  3) Kernel ships the PFN to the hardware.
 
  4) Userspace comes and wants to write to the page (different part than the
 HW is instructed to use). page_mkwrite is called, page is forked.
 Userspace writes to the forked page.
 
  5) HW stores its data in the original page.
 
  Userspace never sees data from the HW! Data corrupted where without page
  forking everything would work just fine.
 
 I'm not sure I'm understanding your pseudocode logic correctly though.
 This logic doesn't seem to be a page forking specific issue.  And
 this pseudocode logic seems to be missing the locking and revalidation
 of the page.
 
 If you can show more details, it would be helpful to see more, and
 discuss the issue of page forking, or we can think about how to handle
 the corner cases.
 
 Well, before that, why do we need more details?
 
 For example, replace the page fork at (4) with truncate, punch
 hole, or invalidate page.
 
 Those operations remove the old page from the radix tree, so the
 userspace's write creates a new page, and HW still references the
 old page.  (I.e., the situation should be the same as with page forking,
 in my understanding of this pseudocode logic.)

Yes, if userspace truncates the file, the situation we end up with is
basically the same. However for truncate to happen some malicious process
has to come and truncate the file - a failure scenario that is acceptable
for most use cases since it doesn't happen unless someone is actively
trying to screw you. With page forking it is enough for the flusher thread
to start writeback for that page to trigger the problem - an event that is
basically bound to happen without any other userspace application
interfering.

 IOW, this pseudocode logic seems to be broken even without page forking
 if there is no lock and revalidate.  Usually, we prevent unpleasant I/O
 by lock_page or PG_writeback, and an obsolete page is revalidated under
 lock_page.

Well, good luck with converting all the get_user_pages() users in kernel to
use lock_page() or PG_writeback checks to avoid issues with page forking. I
don't think that's really feasible.
 
 For page forking, we may also be able to prevent a similar situation by
 locking, flags, and revalidation. But those details might differ from the
 current code, because the page states are different.

Sorry, I don't understand what you mean in this paragraph. Can you
explain it a bit more?
 
  Another possible scenario:
 
  1) Userspace app tells kernel to setup a HW buffer in a page.
 
  2) Userspace app fills page with data - page_mkwrite is called, page is
 dirtied.
 
  3) Userspace app tells kernel to ship buffer to video HW.
 
  4) Writeback comes and starts writeback for the page
 
  5) Video HW is done with the page. Userspace app fills new set of data into
 the page - page_mkwrite is called, page is forked.
 
  6) Userspace app tells kernel to ship buffer to video HW. But HW gets the
 old data from the original page.
 
  Again a data corruption issue where previously things were working fine.
 
 This logic seems to be same as above. Replace the page fork at (5).
 With no revalidate of page, (6) will use the old page.

  Yes, the same arguments as above apply...

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR



Re: [FYI] tux3: Core changes

2015-06-21 Thread OGAWA Hirofumi
Jan Kara j...@suse.cz writes:

Hi,

 So there are a few things to have in mind:
 1) There is nothing like a writeable page. Page is always writeable (at
 least on x86 architecture). When a page is mapped into some virtual address
 space (or more of them), this *mapping* can be either writeable or read-only.
 mkwrite changes the mapping from read-only to writeable but kernel /
 hardware is free to write to the page regardless of the mapping.

 2) When kernel / hardware writes to the page, it first modifies the page
 and then marks it dirty.
  
 So what can happen in this scenario is:

 1) You hand kernel a part of a page as a buffer. page_mkwrite() happens,
page is dirtied, kernel notes a PFN of the page somewhere internally.

 2) Writeback comes and starts writeback for the page.

 3) Kernel ships the PFN to the hardware.

 4) Userspace comes and wants to write to the page (different part than the
HW is instructed to use). page_mkwrite is called, page is forked.
Userspace writes to the forked page.

 5) HW stores its data in the original page.

 Userspace never sees data from the HW! Data corrupted where without page
 forking everything would work just fine.

I'm not sure I'm understanding your pseudocode logic correctly though.
This logic doesn't seem to be a page forking specific issue.  And
this pseudocode logic seems to be missing the locking and revalidation
of the page.

If you can show more details, it would be helpful to see more, and
discuss the issue of page forking, or we can think about how to handle
the corner cases.

Well, before that, why do we need more details?

For example, replace the page fork at (4) with truncate, punch
hole, or invalidate page.

Those operations remove the old page from the radix tree, so the
userspace's write creates a new page, and HW still references the
old page.  (I.e., the situation should be the same as with page forking,
in my understanding of this pseudocode logic.)

IOW, this pseudocode logic seems to be broken even without page forking
if there is no lock and revalidate.  Usually, we prevent unpleasant I/O
by lock_page or PG_writeback, and an obsolete page is revalidated under
lock_page.

For page forking, we may also be able to prevent a similar situation by
locking, flags, and revalidation. But those details might differ from the
current code, because the page states are different.

 Another possible scenario:

 1) Userspace app tells kernel to setup a HW buffer in a page.

 2) Userspace app fills page with data - page_mkwrite is called, page is
dirtied.

 3) Userspace app tells kernel to ship buffer to video HW.

 4) Writeback comes and starts writeback for the page

 5) Video HW is done with the page. Userspace app fills new set of data into
the page - page_mkwrite is called, page is forked.

 6) Userspace app tells kernel to ship buffer to video HW. But HW gets the
old data from the original page.

 Again a data corruption issue where previously things were working fine.

This logic seems to be same as above. Replace the page fork at (5).
With no revalidate of page, (6) will use the old page.

Thanks.
--
OGAWA Hirofumi hirof...@mail.parknet.co.jp



Re: [FYI] tux3: Core changes

2015-05-27 Thread Pavel Machek
On Wed 2015-05-27 11:09:25, Daniel Phillips wrote:
 On Wednesday, May 27, 2015 12:41:37 AM PDT, Pavel Machek wrote:
 On Fri 2015-05-15 02:38:33, Daniel Phillips wrote:
 On 05/14/2015 08:06 PM, Rik van Riel wrote: ...
 
 Umm. Why do you think it is only issue for executable files?
 
 I meant: files with code in them, that will be executed. Please excuse
 me for colliding with the chmod sense. I will say code files to avoid
 ambiguity.
 
 I'm free to mmap() any file, and then execute from it.
 
 /lib/ld-linux.so /path/to/binary
 
 is known way to exec programs that do not have x bit set.
 
 So... why would I write to a code file at the same time as stepping
 through it with ptrace? Should I expect ptrace to work perfectly if
 I do that? What would work perfectly mean, if the code is changing
 at the same time as being traced?

Do you have any imagination at all?

Reasons I should expect ptrace to work perfectly if I'm writing to
file:

1) it used to work before

2) it used to work before

3) it used to work before and regressions are not allowed

4) some kind of just in time compiler

5) some kind of malware, playing tricks so that you have trouble
analyzing it

and of course,

6) it used to work before.

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



Re: [FYI] tux3: Core changes

2015-05-27 Thread Daniel Phillips


On 05/27/2015 02:37 PM, Pavel Machek wrote:
 On Wed 2015-05-27 11:09:25, Daniel Phillips wrote:
 On Wednesday, May 27, 2015 12:41:37 AM PDT, Pavel Machek wrote:
 On Fri 2015-05-15 02:38:33, Daniel Phillips wrote:
 On 05/14/2015 08:06 PM, Rik van Riel wrote: ...

 Umm. Why do you think it is only issue for executable files?

 I meant: files with code in them, that will be executed. Please excuse
 me for colliding with the chmod sense. I will say code files to avoid
 ambiguity.

 I'm free to mmap() any file, and then execute from it.

 /lib/ld-linux.so /path/to/binary

 is known way to exec programs that do not have x bit set.

 So... why would I write to a code file at the same time as stepping
 through it with ptrace? Should I expect ptrace to work perfectly if
 I do that? What would work perfectly mean, if the code is changing
 at the same time as being traced?
 
 Do you have any imagination at all?

[Non-collegial rhetoric alert, it would be helpful to avoid that.]

 Reasons I should expect ptrace to work perfectly if I'm writing to
 file:
 
 1) it used to work before
 
 2) it used to work before
 
 3) it used to work before and regressions are not allowed

Are you sure that ptrace will work perfectly on a file that you are
writing to at the same time as tracing? If so, it has magic that I
do not understand. Could you please explain.

 4) some kind of just in time compiler

A JIT that can tolerate being written to by a task it knows nothing
about, at the same time as it is generating code in the file? I do
not know of any such JIT.

 5) some kind of malware, playing tricks so that you have trouble
 analyzing it

By writing to a code file? Then it already has write access to the
code file, so it has already gotten inside your security perimeter
without needing help from page fork. That said, we should be alert
for any new holes that page fork might open. But if there are any,
they should be actual holes, not theoretical ones.

 and of course,
 
 6) it used to work before.

I look forward to your explanation of how.

Regards,

Daniel




Re: [FYI] tux3: Core changes

2015-05-26 Thread Jan Kara
On Tue 26-05-15 01:08:56, Daniel Phillips wrote:
 On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
   E.g. video drivers (or infiniband or direct IO for that matter) which
 have buffers in user memory (may be mmapped file), grab references to pages
 and hand out PFNs of those pages to the hardware to store data in them...
 If you fork a page after the driver has handed PFNs to the hardware, you've
 just lost all the writes hardware will do.
 
 Hi Jan,
 
 The page forked because somebody wrote to it with write(2) or mmap write at
 the same time as a video driver (or infiniband or direct IO) was
 doing io to
 it. Isn't the application trying hard to lose data in that case? It
 would not need page fork to lose data that way.

So I can think of two valid uses:

1) You setup IO to part of a page and modify from userspace a different
   part of a page.

2) At least for video drivers there is one ioctl() which creates object
   with buffers in memory and another ioctl() to actually ship it to hardware
   (may be called repeatedly). So in theory app could validly dirty the pages
   before it ships them to hardware. If this happens repeatedly and interacts
   badly with background writeback, you will end up with a forked page in a
   buffer and from that point on things are broken.

So my opinion is: Don't fork the page if page_count is elevated. You can
just wait for the IO if you need stable pages in that case. It's slow but
it's safe and it should be pretty rare. Is there any problem with that?

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR



Re: [FYI] tux3: Core changes

2015-05-26 Thread Jan Kara
On Mon 25-05-15 23:11:11, Daniel Phillips wrote:
 On Monday, May 25, 2015 11:04:39 PM PDT, David Lang wrote:
 if the page gets modified again, will that cause any issues? what
 if the page gets modified before the copy gets written out, so
 that there are two dirty copies of the page in the process of
 being written?
 
 David Lang
 
 How is the page going to get modified again? A forked page isn't
 mapped by a pte, so userspace can't modify it by mmap. The forked
 page is not in the page cache, so userspace can't modify it by
 posix file ops. So the writer would have to be in kernel. Tux3
 knows what it is doing, so it won't modify the page. What kernel
 code besides Tux3 will modify the page?

E.g. video drivers (or infiniband or direct IO for that matter) which
have buffers in user memory (may be mmapped file), grab references to pages
and hand out PFNs of those pages to the hardware to store data in them...
If you fork a page after the driver has handed PFNs to the hardware, you've
just lost all the writes hardware will do.

Honza

-- 
Jan Kara j...@suse.cz
SUSE Labs, CR



Re: [FYI] tux3: Core changes

2015-05-26 Thread Sergey Senozhatsky
On (05/26/15 01:08), Daniel Phillips wrote:
 On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
   E.g. video drivers (or infiniband or direct IO for that matter) which
 have buffers in user memory (may be mmapped file), grab references to pages
 and hand out PFNs of those pages to the hardware to store data in them...
 If you fork a page after the driver has handed PFNs to the hardware, you've
 just lost all the writes hardware will do.
 
 Hi Jan,
 
 The page forked because somebody wrote to it with write(2) or mmap write at
 the same time as a video driver (or infiniband or direct IO) was doing io to
 it. Isn't the application trying hard to lose data in that case? It would
 not need page fork to lose data that way.
 

Hello,

is it possible to page-fork-bomb the system by some 'malicious' app?

-ss



Re: [FYI] tux3: Core changes

2015-05-26 Thread Daniel Phillips

On Tuesday, May 26, 2015 3:13:02 AM PDT, Pavel Machek wrote:

On Tue 2015-05-26 01:09:59, Daniel Phillips wrote:

On Monday, May 25, 2015 11:13:46 PM PDT, David Lang wrote:

I'm assuming that Rik is talking about whatever has the reference to the
page via one of the methods that he talked about.


This would be a good moment to provide specifics.


Hmm. This seems like a good moment for you to audit the whole kernel, to
make sure it does not do stuff you don't expect it to.

You are changing core semantics, stuff that was allowed before is not
allowed now, so it looks like you should do the auditing...


No, we are not changing core semantics. Core semantics remain
unchanged. All these questions are about Tux3 semantics. So I
believe that I am correct in asking for specifics, rather than
general theory. Auditing the whole kernel seems an unreasonable
barrier to merging a filesystem.


You may want to start with video4linux, as Jan pointed out.


I pointed out that the video case Jan mentioned is a race. Is
that right or wrong? And what responsibility does Tux3 have,
if the use case is a race? I would think: do not oops, do not
corrupt the filesystem. Anything else?

Regards,

Daniel




Re: [FYI] tux3: Core changes

2015-05-26 Thread Daniel Phillips
On 05/26/2015 02:00 AM, Jan Kara wrote:
 On Tue 26-05-15 01:08:56, Daniel Phillips wrote:
 On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
  E.g. video drivers (or infiniband or direct IO for that matter) which
 have buffers in user memory (may be mmapped file), grab references to pages
 and hand out PFNs of those pages to the hardware to store data in them...
 If you fork a page after the driver has handed PFNs to the hardware, you've
 just lost all the writes hardware will do.

 Hi Jan,

 The page forked because somebody wrote to it with write(2) or mmap write at
 the same time as a video driver (or infiniband or direct IO) was
 doing io to
 it. Isn't the application trying hard to lose data in that case? It
 would not need page fork to lose data that way.
 
 So I can think of two valid uses:
 
 1) You setup IO to part of a page and modify from userspace a different
part of a page.

Suppose the use case is reading textures from video memory into a mmapped
file, and at the same time, the application is allowed to update the
textures in the file via mmap or write(2). Fork happens at mkwrite time.
If the page is already dirty, we do not fork it. The video API must have
made the page writable and dirty, so I do not see an issue.

 2) At least for video drivers there is one ioctl() which creates object
with buffers in memory and another ioctl() to actually ship it to hardware
(may be called repeatedly). So in theory app could validly dirty the pages
before it ships them to hardware. If this happens repeatedly and interacts
badly with background writeback, you will end up with a forked page in a
buffer and from that point on things are broken.

Writeback does not fork pages. An app may dirty a page that is in the process
of being shipped to hardware (it must be a distinct part of the page, or it is
a race) and the data being sent to hardware will not be disturbed. If there
is an issue here, I do not see it.

 So my opinion is: Don't fork the page if page_count is elevated. You can
 just wait for the IO if you need stable pages in that case. It's slow but
 it's safe and it should be pretty rare. Is there any problem with that?

That would be our fallback if anybody discovers a specific case where page
fork breaks something, which so far has not been demonstrated.

With a known fallback, it is hard to see why we should delay merging over
that. Perfection has never been a requirement for merging filesystems. On
the contrary, imperfection is a reason for merging, so that the many
eyeballs effect may prove its value.

Regards,

Daniel



Re: [FYI] tux3: Core changes

2015-05-26 Thread Daniel Phillips
Hi Sergey,

On 05/26/2015 03:22 AM, Sergey Senozhatsky wrote:
 
 Hello,
 
 is it possible to page-fork-bomb the system by some 'malicious' app?

Not in any new way. A page fork can happen either in the front end,
where it has to wait for memory like any other normal memory user,
or in the backend, where Tux3 may have privileged access to low
memory reserves and therefore must place bounds on its memory use
like any other user of low memory reserves.

This is not specific to page fork. We must place such bounds for
any memory that the backend uses. Fortunately, the backend does not
allocate memory extravagantly, for fork or anything else, so when
this does get to the top of our to-do list it should not be too
hard to deal with. We plan to attack that after merge, as we have
never observed a problem in practice. Rather, Tux3 already seems
to survive low memory situations pretty well compared to some other
filesystems.


Regards,

Daniel



Re: [FYI] tux3: Core changes

2015-05-26 Thread Rik van Riel
On 05/26/2015 04:22 PM, Daniel Phillips wrote:
 On 05/26/2015 02:00 AM, Jan Kara wrote:

 So my opinion is: Don't fork the page if page_count is elevated. You can
 just wait for the IO if you need stable pages in that case. It's slow but
 it's safe and it should be pretty rare. Is there any problem with that?
 
 That would be our fallback if anybody discovers a specific case where page
 fork breaks something, which so far has not been demonstrated.
 
 With a known fallback, it is hard to see why we should delay merging over
 that. Perfection has never been a requirement for merging filesystems. On

However, avoiding data corruption by erring on the side of safety is
a pretty basic requirement.

 the contrary, imperfection is a reason for merging, so that the many
 eyeballs effect may prove its value.

If you skip the page fork when there is an elevated page count, tux3
should be safe (at least from that aspect). Only do the COW when there
is no strange use of the page going on.

-- 
All rights reversed



Re: [FYI] tux3: Core changes

2015-05-20 Thread Daniel Phillips


On 05/20/2015 07:44 AM, Jan Kara wrote:
 On Tue 19-05-15 13:33:31, David Lang wrote:
 On Tue, 19 May 2015, Daniel Phillips wrote:

 I understand that Tux3 may avoid these issues due to some other mechanisms
 it internally has but if page forking should get into mm subsystem, the
 above must work.

 It does work, and by example, it does not need a lot of code to make
 it work, but the changes are not trivial. Tux3's delta writeback model
 will not suit everyone, so you can't just lift our code and add it to
 Ext4. Using it in Ext4 would require a per-inode writeback model, which
 looks practical to me but far from a weekend project. Maybe something
 to consider for Ext5.

 It is the job of new designs like Tux3 to chase after that final drop
 of performance, not our trusty Ext4 workhorse. Though stranger things
 have happened - as I recall, Ext4 had O(n) directory operations at one
 time. Fixing that was not easy, but we did it because we had to. Fixing
 Ext4's write performance is not urgent by comparison, and the barrier
 is high, you would want jbd3 for one thing.

 I think the meta-question you are asking is, where is the second user
 for this new CoW functionality? With a possible implication that if
 there is no second user then Tux3 cannot be merged. Is that is the
 question?

 I don't think they are asking for a second user. What they are
 saying is that for this functionality to be accepted in the mm
 subsystem, these problem cases need to work reliably, not just work
 for Tux3 because of your implementation.

 So for things that you don't use, you need to make it an error if
 they get used on a page that's been forked (or not be an error and
 'do the right thing')

 For cases where it doesn't matter because Tux3 controls the
 writeback, and it's undefined in general what happens if writeback
 is triggered twice on the same page, you will need to figure out how
 to either prevent the second writeback from triggering if there's
 one in process, or define how the two writebacks are going to happen
 so that you can't end up with them re-ordered by some other
 filesystem.

 I think that that's what's meant by the top statement that I left in
 the quote. Even if your implementation details make it safe, these
 need to be safe even without your implementation details to be
 acceptable in the core kernel.

 Yeah, that's what I meant. If you create a function which manipulates
 page cache, you better make it work with other functions manipulating page
 cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting
 developer. Sure you can document all the conditions under which the
 function is safe to use but a function that has several paragraphs in front
 of it explaining when it is safe to use isn't very good API...

Violent agreement, of course. To put it in concrete terms, each of
the page fork support functions must be examined and determined
sane. They are:

 * cow_replace_page_cache
 * cow_delete_from_page_cache
 * cow_clone_page
 * page_cow_one
 * page_cow_file

Would it be useful to drill down into those, starting from the top
of the list?

Regards,

Daniel



Re: [FYI] tux3: Core changes

2015-05-20 Thread David Lang

On Wed, 20 May 2015, Daniel Phillips wrote:


On 05/20/2015 03:51 PM, Daniel Phillips wrote:

On 05/20/2015 12:53 PM, Rik van Riel wrote:

How does tux3 prevent a user of find_get_page() from reading from
or writing into the pre-COW page, instead of the current page?


Careful control of the dirty bits (we have two of them, one each
for front and back). That is what pagefork_for_blockdirty is about.


Ah, and of course it does not matter if a reader is on the
pre-cow page. It would be reading the earlier copy, which might
no longer be the current copy, but it raced with the write so
nobody should be surprised. That is a race even without page fork.


how do you prevent it from continuing to interact with the old version of the 
page, never seeing updates or having its changes reflected on the current page?


David Lang



Re: [FYI] tux3: Core changes

2015-05-19 Thread David Lang

On Tue, 19 May 2015, Daniel Phillips wrote:


I understand that Tux3 may avoid these issues due to some other mechanisms
it internally has but if page forking should get into mm subsystem, the
above must work.


It does work, and by example, it does not need a lot of code to make
it work, but the changes are not trivial. Tux3's delta writeback model
will not suit everyone, so you can't just lift our code and add it to
Ext4. Using it in Ext4 would require a per-inode writeback model, which
looks practical to me but far from a weekend project. Maybe something
to consider for Ext5.

It is the job of new designs like Tux3 to chase after that final drop
of performance, not our trusty Ext4 workhorse. Though stranger things
have happened - as I recall, Ext4 had O(n) directory operations at one
time. Fixing that was not easy, but we did it because we had to. Fixing
Ext4's write performance is not urgent by comparison, and the barrier
is high, you would want jbd3 for one thing.

I think the meta-question you are asking is, where is the second user
for this new CoW functionality? With a possible implication that if
there is no second user then Tux3 cannot be merged. Is that the
question?


I don't think they are asking for a second user. What they are saying is that 
for this functionality to be accepted in the mm subsystem, these problem cases 
need to work reliably, not just work for Tux3 because of your implementation.


So for things that you don't use, you need to make it an error if they get used 
on a page that's been forked (or not be an error and 'do the right thing')


For cases where it doesn't matter because Tux3 controls the writeback, and it's 
undefined in general what happens if writeback is triggered twice on the same 
page, you will need to figure out how to either prevent the second writeback 
from triggering if there's one in progress, or define how the two writebacks are 
going to happen so that you can't end up with them re-ordered by some other 
filesystem.


I think that that's what's meant by the top statement that I left in the quote. 
Even if your implementation details make it safe, these need to be safe even 
without your implementation details to be acceptable in the core kernel.


David Lang



Re: [FYI] tux3: Core changes

2015-05-19 Thread Jan Kara
On Thu 14-05-15 01:26:23, Daniel Phillips wrote:
 Hi Rik,
 
 Our linux-tux3 tree currently carries this 652 line diff
 against core, to make Tux3 work. This is mainly by Hirofumi, except
 the fs-writeback.c hook, which is by me. The main part you may be
 interested in is rmap.c, which addresses the issues raised at the
 2013 Linux Storage, Filesystem, and Memory Management Summit in San Francisco.[1]
 
LSFMM: Page forking
http://lwn.net/Articles/548091/
 
 This is just a FYI. An upcoming Tux3 report will be a tour of the page
 forking design and implementation. For now, this is just to give a
 general sense of what we have done. We heard there are concerns about
 how ptrace will work. I really am not familiar with the issue, could
 you please explain what you were thinking of there?
 So here are a few things I find problematic about page forking (besides
the cases with elevated page_count already discussed in this thread - there
I believe that anything more complex than "wait for the IO instead of
forking when the page has an elevated use count" isn't going to work. There are
too many users depending on too subtle details of the behavior...). Some
of them are actually mentioned in the above LWN article:

When you create a copy of a page and replace it in the radix tree, nobody
in mm subsystem is aware that oldpage may be under writeback. That causes
interesting issues:
* truncate_inode_pages() can finish before all IO for the file is finished.
  So far filesystems rely on the fact that once truncate_inode_pages()
  finishes, all running IO against the file is completed and new IO cannot be
  submitted.
* Writeback can come and try to write newpage while oldpage is still under
  IO. Then you'll have two IOs against one block which has undefined
  results.
* filemap_fdatawait() called from fsync() has additional problem that it is
  not aware of oldpage and thus may return although IO hasn't finished yet.

I understand that Tux3 may avoid these issues due to some other mechanisms
it internally has but if page forking should get into mm subsystem, the
above must work.

Honza

 diffstat tux3.core.patch
  fs/Makefile   |1 
  fs/fs-writeback.c |  100 +
  include/linux/fs.h|6 +
  include/linux/mm.h|5 +
  include/linux/pagemap.h   |2 
  include/linux/rmap.h  |   14 
  include/linux/writeback.h |   23 +++
  mm/filemap.c  |   82 +++
  mm/rmap.c |  139 ++
  mm/truncate.c |   98 
  10 files changed, 411 insertions(+), 59 deletions(-)
 
 diff --git a/fs/Makefile b/fs/Makefile
 index 91fcfa3..44d7192 100644
 --- a/fs/Makefile
 +++ b/fs/Makefile
 @@ -70,7 +70,6 @@ obj-$(CONFIG_EXT4_FS)   += ext4/
  obj-$(CONFIG_JBD)		+= jbd/
  obj-$(CONFIG_JBD2)		+= jbd2/
  obj-$(CONFIG_TUX3)		+= tux3/
 -obj-$(CONFIG_TUX3_MMAP)	+= tux3/
  obj-$(CONFIG_CRAMFS)		+= cramfs/
  obj-$(CONFIG_SQUASHFS)	+= squashfs/
  obj-y				+= ramfs/
 diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
 index 2d609a5..fcd1c61 100644
 --- a/fs/fs-writeback.c
 +++ b/fs/fs-writeback.c
 @@ -34,25 +34,6 @@
   */
  #define MIN_WRITEBACK_PAGES	(4096UL << (PAGE_CACHE_SHIFT - 10))
  
 -/*
 - * Passed into wb_writeback(), essentially a subset of writeback_control
 - */
 -struct wb_writeback_work {
 - long nr_pages;
 - struct super_block *sb;
 - unsigned long *older_than_this;
 - enum writeback_sync_modes sync_mode;
 - unsigned int tagged_writepages:1;
 - unsigned int for_kupdate:1;
 - unsigned int range_cyclic:1;
 - unsigned int for_background:1;
 - unsigned int for_sync:1;/* sync(2) WB_SYNC_ALL writeback */
 - enum wb_reason reason;  /* why was writeback initiated? */
 -
 - struct list_head list;  /* pending work list */
 - struct completion *done;/* set if the caller waits */
 -};
 -
  /**
   * writeback_in_progress - determine whether there is writeback in progress
   * @bdi: the device's backing_dev_info structure.
 @@ -192,6 +173,36 @@ void inode_wb_list_del(struct inode *inode)
  }
  
  /*
 + * Remove inode from writeback list if clean.
 + */
 +void inode_writeback_done(struct inode *inode)
 +{
 + struct backing_dev_info *bdi = inode_to_bdi(inode);
 +
 +	spin_lock(&bdi->wb.list_lock);
 +	spin_lock(&inode->i_lock);
 +	if (!(inode->i_state & I_DIRTY))
 +		list_del_init(&inode->i_wb_list);
 +	spin_unlock(&inode->i_lock);
 +	spin_unlock(&bdi->wb.list_lock);
 +}
 +EXPORT_SYMBOL_GPL(inode_writeback_done);
 +
 +/*
 + * Add inode to writeback dirty list with current time.
 + */
 +void inode_writeback_touch(struct inode *inode)
 +{
 + struct backing_dev_info *bdi = inode_to_bdi(inode);

Re: [FYI] tux3: Core changes

2015-05-18 Thread Mel Gorman
On Sat, May 16, 2015 at 03:38:04PM -0700, David Lang wrote:
 On Fri, 15 May 2015, Mel Gorman wrote:
 
 On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote:
 
 
 On 05/15/2015 01:09 AM, Mel Gorman wrote:
 On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
 On 05/14/2015 08:06 PM, Daniel Phillips wrote:
 The issue is that things like ptrace, AIO, infiniband
 RDMA, and other direct memory access subsystems can take
 a reference to page A, which Tux3 clones into a new page B
 when the process writes it.
 
 However, while the process now points at page B, ptrace,
 AIO, infiniband, etc will still be pointing at page A.
 
 This causes the process and the other subsystem to each
 look at a different page, instead of at shared state,
 causing ptrace to do nothing, AIO and RDMA data to be
 invisible (or corrupted), etc...
 
 Is this a bit like page migration?
 
 Yes. Page migration will fail if there is an extra
 reference to the page that is not accounted for by
 the migration code.
 
 When I said it's not like page migration, I was referring to the fact
 that a COW on a pinned page for RDMA is a different problem to page
 migration. The COW of a pinned page can lead to lost writes or
 corruption depending on the ordering of events.
 
 I see the lost writes case, but not the corruption case,
 
 Data corruption can occur depending on the ordering of events and the
 application's expectations. If a process starts IO, RDMA pins the page
 for read, and forks are combined with writes from another thread, then when
 the IO completes the reads may not be visible. The application may take
 improper action at that point.
 
 if tux3 forks the page and writes the copy while the original page
 is being modified by other things, this means that some of the
 changes won't be in the version written (and this could catch
 partial writes with 'interesting' results if the forking happens at
 the wrong time)
 

Potentially yes. There is likely to be some elevated memory usage but I
imagine that can be controlled.

 But if the original page gets re-marked as needing to be written out
 when it's changed by one of the other things that are accessing it,
 there shouldn't be any long-term corruption.
 
 As far as short-term corruption goes, any time you have a page
 mmapped it could get written out at any time, with only some of the
 application changes applied to it, so this sort of corruption could
 happen anyway couldn't it?
 

That becomes the responsibility of the application. It's up to it to sync
appropriately when it knows updates are complete.

 Users of RDMA are typically expected to use MADV_DONTFORK to avoid this
 class of problem.
 
 You can choose to not define this as data corruption because the kernel
 is not directly involved and that's your call.
 
 Do you
 mean corruption by changing a page already in writeout? If so,
 don't all filesystems have that problem?
 
 
 No, the problem is different. Backing devices requiring stable pages will
 block the write until the IO is complete. For those that do not require
 stable pages it's ok to allow the write as long as the page is dirtied so
 that it'll be written out again and no data is lost.
 
 so if tux3 is prevented from forking the page in cases where the
 write would be blocked, and will get forked again for follow-up
 writes if it's modified again otherwise, won't this be the same
 thing?
 

Functionally and from a correctness point of view, it *might* be
equivalent. It depends on the implementation and the page life cycle,
particularly the details of how the writeback and dirty state are coordinated
between the user-visible pages and the page being written back. I've read
none of the code or background so I cannot answer whether it's really
equivalent or not. Just be aware that it's not the same problem as page
migration and that it's not the same as how writeback and dirty state is
handled today.

-- 
Mel Gorman
SUSE Labs



Re: [FYI] tux3: Core changes

2015-05-18 Thread Boaz Harrosh
On 05/18/2015 05:20 AM, Rik van Riel wrote:
 On 05/17/2015 09:26 AM, Boaz Harrosh wrote:
 On 05/14/2015 03:59 PM, Rik van Riel wrote:
 On 05/14/2015 04:26 AM, Daniel Phillips wrote:
 Hi Rik,
 

 The issue is that things like ptrace, AIO, infiniband
 RDMA, and other direct memory access subsystems can take
 a reference to page A, which Tux3 clones into a new page B
 when the process writes it.

 However, while the process now points at page B, ptrace,
 AIO, infiniband, etc will still be pointing at page A.


 All these problems can also happen with truncate+new-extending-write

 It is the responsibility of the application to take file/range locks
 to prevent these page-pinned problems.
 
 It is unreasonable to expect a process that is being ptraced
 (potentially without its knowledge) to take special measures
 to protect the ptraced memory from disappearing.

If the memory disappears, that's a bug. No, the memory is still there;
it just does not reflect the latest content of the fs-file.

 
 It is impossible for the debugger to take those special measures
 for anonymous memory, or unlinked inodes.
 

Why? One line of added code after the open and before the mmap: do an flock.

 I don't think your requirement is workable or reasonable.
 

Therefore it is unreasonable to write/modify a ptraced process's
file.

Again, what I'm saying is that COWing a page on write has the same effect
as truncate+write. They are both allowed, and both might give you the same
stale effect. So the precedent is there. We are not introducing a new
anomaly, just introducing a new instance of it. I guess the question
is what applications/procedures are going to break. Need lots of testing
and real life installations to answer that, I guess.

Thanks
Boaz




Re: [FYI] tux3: Core changes

2015-05-18 Thread Boaz Harrosh
On 05/14/2015 03:59 PM, Rik van Riel wrote:
 On 05/14/2015 04:26 AM, Daniel Phillips wrote:
 Hi Rik,

 
 The issue is that things like ptrace, AIO, infiniband
 RDMA, and other direct memory access subsystems can take
 a reference to page A, which Tux3 clones into a new page B
 when the process writes it.
 
 However, while the process now points at page B, ptrace,
 AIO, infiniband, etc will still be pointing at page A.
 

All these problems can also happen with truncate+new-extending-write

It is the responsibility of the application to take file/range locks
to prevent these page-pinned problems.

 This causes the process and the other subsystem to each
 look at a different page, instead of at shared state,
 causing ptrace to do nothing, AIO and RDMA data to be
 invisible (or corrupted), etc...
 

Again these problems already exist. Consider each in-place write as
a truncate (punch hole) + new write; is that not the same?

Cheers
Boaz




Re: [FYI] tux3: Core changes

2015-05-18 Thread Daniel Phillips
On 05/17/2015 07:20 PM, Rik van Riel wrote:
 On 05/17/2015 09:26 AM, Boaz Harrosh wrote:
 On 05/14/2015 03:59 PM, Rik van Riel wrote:
 The issue is that things like ptrace, AIO, infiniband
 RDMA, and other direct memory access subsystems can take
 a reference to page A, which Tux3 clones into a new page B
 when the process writes it.

 However, while the process now points at page B, ptrace,
 AIO, infiniband, etc will still be pointing at page A.

 All these problems can also happen with truncate+new-extending-write

 It is the responsibility of the application to take file/range locks
 to prevent these page-pinned problems.
 
 It is unreasonable to expect a process that is being ptraced
 (potentially without its knowledge) to take special measures
 to protect the ptraced memory from disappearing.
 
 It is impossible for the debugger to take those special measures
 for anonymous memory, or unlinked inodes.
 
 I don't think your requirement is workable or reasonable.

Hi Rik,

You are quite right to poke at this aggressively. Whether or not
there is an issue needing fixing, we want to know the details. We
really need to do a deep dive in ptrace and know exactly what it
does, and whether Tux3 creates any new kind of hole. I really know
very little about ptrace at the moment, I only have heard that it
is a horrible hack we inherited from some place far away and a time
long ago.

A little guidance from you would help. Somewhere ptrace must modify
the executable page. Unlike uprobes, which makes sense to me, I did
not find where ptrace actually does that on a quick inspection.
Perhaps you could provide a pointer?

Regards,

Daniel



Re: [FYI] tux3: Core changes

2015-05-17 Thread Rik van Riel
On 05/17/2015 09:26 AM, Boaz Harrosh wrote:
 On 05/14/2015 03:59 PM, Rik van Riel wrote:
 On 05/14/2015 04:26 AM, Daniel Phillips wrote:
 Hi Rik,
 

 The issue is that things like ptrace, AIO, infiniband
 RDMA, and other direct memory access subsystems can take
 a reference to page A, which Tux3 clones into a new page B
 when the process writes it.

 However, while the process now points at page B, ptrace,
 AIO, infiniband, etc will still be pointing at page A.

 
 All these problems can also happen with truncate+new-extending-write
 
 It is the responsibility of the application to take file/range locks
 to prevent these page-pinned problems.

It is unreasonable to expect a process that is being ptraced
(potentially without its knowledge) to take special measures
to protect the ptraced memory from disappearing.

It is impossible for the debugger to take those special measures
for anonymous memory, or unlinked inodes.

I don't think your requirement is workable or reasonable.

-- 
All rights reversed



Re: [FYI] tux3: Core changes

2015-05-16 Thread David Lang

On Fri, 15 May 2015, Mel Gorman wrote:


On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote:



On 05/15/2015 01:09 AM, Mel Gorman wrote:

On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:

On 05/14/2015 08:06 PM, Daniel Phillips wrote:

The issue is that things like ptrace, AIO, infiniband
RDMA, and other direct memory access subsystems can take
a reference to page A, which Tux3 clones into a new page B
when the process writes it.

However, while the process now points at page B, ptrace,
AIO, infiniband, etc will still be pointing at page A.

This causes the process and the other subsystem to each
look at a different page, instead of at shared state,
causing ptrace to do nothing, AIO and RDMA data to be
invisible (or corrupted), etc...


Is this a bit like page migration?


Yes. Page migration will fail if there is an extra
reference to the page that is not accounted for by
the migration code.


When I said it's not like page migration, I was referring to the fact
that a COW on a pinned page for RDMA is a different problem to page
migration. The COW of a pinned page can lead to lost writes or
corruption depending on the ordering of events.


I see the lost writes case, but not the corruption case,


Data corruption can occur depending on the ordering of events and the
application's expectations. If a process starts IO, RDMA pins the page
for read, and forks are combined with writes from another thread, then when
the IO completes the reads may not be visible. The application may take
improper action at that point.


if tux3 forks the page and writes the copy while the original page is being 
modified by other things, this means that some of the changes won't be in the 
version written (and this could catch partial writes with 'interesting' results 
if the forking happens at the wrong time)


But if the original page gets re-marked as needing to be written out when it's 
changed by one of the other things that are accessing it, there shouldn't be any 
long-term corruption.


As far as short-term corruption goes, any time you have a page mmapped it could 
get written out at any time, with only some of the application changes applied 
to it, so this sort of corruption could happen anyway couldn't it?



Users of RDMA are typically expected to use MADV_DONTFORK to avoid this
class of problem.

You can choose to not define this as data corruption because the kernel
is not directly involved and that's your call.


Do you
mean corruption by changing a page already in writeout? If so,
don't all filesystems have that problem?



No, the problem is different. Backing devices requiring stable pages will
block the write until the IO is complete. For those that do not require
stable pages it's ok to allow the write as long as the page is dirtied so
that it'll be written out again and no data is lost.


so if tux3 is prevented from forking the page in cases where the write would be 
blocked, and will get forked again for follow-up writes if it's modified again 
otherwise, won't this be the same thing?


David Lang


If RDMA to a mmapped file races with write(2) to the same file,
maybe it is reasonable and expected to lose some data.



In the RDMA case, there is at least application awareness to work around
the problems. Normally it's ok to have both mapped and write() access
to data although userspace might need a lock to co-ordinate updates and
event ordering.






Re: [FYI] tux3: Core changes

2015-05-15 Thread Mel Gorman
On Thu, May 14, 2015 at 05:06:39PM -0700, Daniel Phillips wrote:
 Hi Rik,
 
 Added Mel, Andrea and Peterz to CC as interested parties. There are
 probably others, please just jump in.
 
 On 05/14/2015 05:59 AM, Rik van Riel wrote:
  On 05/14/2015 04:26 AM, Daniel Phillips wrote:
  Hi Rik,
 
   Our linux-tux3 tree currently carries this 652 line diff
  against core, to make Tux3 work. This is mainly by Hirofumi, except
  the fs-writeback.c hook, which is by me. The main part you may be
   interested in is rmap.c, which addresses the issues raised at the
   2013 Linux Storage, Filesystem, and Memory Management Summit in San Francisco.[1]
 
 LSFMM: Page forking
 http://lwn.net/Articles/548091/
 
  This is just a FYI. An upcoming Tux3 report will be a tour of the page
  forking design and implementation. For now, this is just to give a
  general sense of what we have done. We heard there are concerns about
  how ptrace will work. I really am not familiar with the issue, could
  you please explain what you were thinking of there?
  
  The issue is that things like ptrace, AIO, infiniband
  RDMA, and other direct memory access subsystems can take
  a reference to page A, which Tux3 clones into a new page B
  when the process writes it.
  
  However, while the process now points at page B, ptrace,
  AIO, infiniband, etc will still be pointing at page A.
  
  This causes the process and the other subsystem to each
  look at a different page, instead of at shared state,
  causing ptrace to do nothing, AIO and RDMA data to be
  invisible (or corrupted), etc...
 
 Is this a bit like page migration?
 

No, it's not.

-- 
Mel Gorman
SUSE Labs



Re: [FYI] tux3: Core changes

2015-05-15 Thread Daniel Phillips


On 05/15/2015 01:09 AM, Mel Gorman wrote:
 On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
 On 05/14/2015 08:06 PM, Daniel Phillips wrote:
 The issue is that things like ptrace, AIO, infiniband
 RDMA, and other direct memory access subsystems can take
 a reference to page A, which Tux3 clones into a new page B
 when the process writes it.

 However, while the process now points at page B, ptrace,
 AIO, infiniband, etc will still be pointing at page A.

 This causes the process and the other subsystem to each
 look at a different page, instead of at shared state,
 causing ptrace to do nothing, AIO and RDMA data to be
 invisible (or corrupted), etc...

 Is this a bit like page migration?

 Yes. Page migration will fail if there is an extra
 reference to the page that is not accounted for by
 the migration code.
 
 When I said it's not like page migration, I was referring to the fact
 that a COW on a pinned page for RDMA is a different problem to page
 migration. The COW of a pinned page can lead to lost writes or
 corruption depending on the ordering of events.

I see the lost writes case, but not the corruption case. Do you
mean corruption by changing a page already in writeout? If so,
don't all filesystems have that problem?

If RDMA to a mmapped file races with write(2) to the same file,
maybe it is reasonable and expected to lose some data.

 Page migration fails
 when there are unexpected problems to avoid this class of issue which is
 fine for page migration but may be a critical failure in a filesystem
 depending on exactly why the copy is required.

Regards,

Daniel



Re: [FYI] tux3: Core changes

2015-05-15 Thread Mel Gorman
On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote:
 
 
 On 05/15/2015 01:09 AM, Mel Gorman wrote:
  On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
  On 05/14/2015 08:06 PM, Daniel Phillips wrote:
  The issue is that things like ptrace, AIO, infiniband
  RDMA, and other direct memory access subsystems can take
  a reference to page A, which Tux3 clones into a new page B
  when the process writes it.
 
  However, while the process now points at page B, ptrace,
  AIO, infiniband, etc will still be pointing at page A.
 
  This causes the process and the other subsystem to each
  look at a different page, instead of at shared state,
  causing ptrace to do nothing, AIO and RDMA data to be
  invisible (or corrupted), etc...
 
  Is this a bit like page migration?
 
  Yes. Page migration will fail if there is an extra
  reference to the page that is not accounted for by
  the migration code.
  
  When I said it's not like page migration, I was referring to the fact
  that a COW on a pinned page for RDMA is a different problem to page
  migration. The COW of a pinned page can lead to lost writes or
  corruption depending on the ordering of events.
 
 I see the lost writes case, but not the corruption case,

Data corruption can occur depending on the ordering of events and the
application's expectations. If a process starts IO, RDMA pins the page
for read, and forks are combined with writes from another thread, then when
the IO completes the reads may not be visible. The application may take
improper action at that point.

Users of RDMA are typically expected to use MADV_DONTFORK to avoid this
class of problem.

You can choose to not define this as data corruption because the kernel
is not directly involved and that's your call.

 Do you
 mean corruption by changing a page already in writeout? If so,
 don't all filesystems have that problem?
 

No, the problem is different. Backing devices requiring stable pages will
block the write until the IO is complete. For those that do not require
stable pages it's ok to allow the write as long as the page is dirtied so
that it'll be written out again and no data is lost.

 If RDMA to a mmapped file races with write(2) to the same file,
 maybe it is reasonable and expected to lose some data.
 

In the RDMA case, there is at least application awareness to work around
the problems. Normally it's ok to have both mapped and write() access
to data although userspace might need a lock to co-ordinate updates and
event ordering.

-- 
Mel Gorman
SUSE Labs



Re: [FYI] tux3: Core changes

2015-05-14 Thread Rik van Riel
On 05/14/2015 04:26 AM, Daniel Phillips wrote:
 Hi Rik,
 
  Our linux-tux3 tree currently carries this 652 line diff
 against core, to make Tux3 work. This is mainly by Hirofumi, except
 the fs-writeback.c hook, which is by me. The main part you may be
  interested in is rmap.c, which addresses the issues raised at the
  2013 Linux Storage, Filesystem, and Memory Management Summit in San Francisco.[1]
 
LSFMM: Page forking
http://lwn.net/Articles/548091/
 
 This is just a FYI. An upcoming Tux3 report will be a tour of the page
 forking design and implementation. For now, this is just to give a
 general sense of what we have done. We heard there are concerns about
 how ptrace will work. I really am not familiar with the issue, could
 you please explain what you were thinking of there?

The issue is that things like ptrace, AIO, infiniband
RDMA, and other direct memory access subsystems can take
a reference to page A, which Tux3 clones into a new page B
when the process writes it.

However, while the process now points at page B, ptrace,
AIO, infiniband, etc will still be pointing at page A.

This causes the process and the other subsystem to each
look at a different page, instead of at shared state,
causing ptrace to do nothing, AIO and RDMA data to be
invisible (or corrupted), etc...

-- 
All rights reversed
