Re: [FYI] tux3: Core changes

2015-05-18 Thread Mel Gorman
On Sat, May 16, 2015 at 03:38:04PM -0700, David Lang wrote:
 On Fri, 15 May 2015, Mel Gorman wrote:
 
 On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote:
 
 
 On 05/15/2015 01:09 AM, Mel Gorman wrote:
 On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
 On 05/14/2015 08:06 PM, Daniel Phillips wrote:
 The issue is that things like ptrace, AIO, infiniband
 RDMA, and other direct memory access subsystems can take
 a reference to page A, which Tux3 clones into a new page B
 when the process writes it.
 
 However, while the process now points at page B, ptrace,
 AIO, infiniband, etc will still be pointing at page A.
 
 This causes the process and the other subsystem to each
 look at a different page, instead of at shared state,
 causing ptrace to do nothing, AIO and RDMA data to be
 invisible (or corrupted), etc...
 
 Is this a bit like page migration?
 
 Yes. Page migration will fail if there is an extra
 reference to the page that is not accounted for by
 the migration code.
 
 When I said it's not like page migration, I was referring to the fact
 that a COW on a pinned page for RDMA is a different problem to page
 migration. The COW of a pinned page can lead to lost writes or
 corruption depending on the ordering of events.
 
 I see the lost writes case, but not the corruption case,
 
 Data corruption can occur depending on the ordering of events and the
 application's expectations. If a process starts IO, RDMA pins the page
 for read, and forks are combined with writes from another thread, then when
 the IO completes the reads may not be visible. The application may take
 improper action at that point.
 
 if tux3 forks the page and writes the copy while the original page
 is being modified by other things, this means that some of the
 changes won't be in the version written (and this could catch
 partial writes with 'interesting' results if the forking happens at
 the wrong time)
 

Potentially yes. There is likely to be some elevated memory usage but I
imagine that can be controlled.

 But if the original page gets re-marked as needing to be written out
 when it's changed by one of the other things that are accessing it,
 there shouldn't be any long-term corruption.
 
 As far as short-term corruption goes, any time you have a page
 mmapped it could get written out at any time, with only some of the
 application changes applied to it, so this sort of corruption could
 happen anyway couldn't it?
 

That becomes the responsibility of the application. It's up to it to sync
appropriately when it knows updates are complete.

 Users of RDMA are typically expected to use MADV_DONTFORK to avoid this
 class of problem.
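
 For illustration, a minimal user-space sketch of that convention. The
 helper name alloc_dma_buffer() is hypothetical and the RDMA registration
 step itself is left out; only mmap(2) and madvise(2) are real calls here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * alloc_dma_buffer() is a hypothetical helper, not part of any RDMA library.
 * It allocates a page-aligned anonymous buffer and marks it MADV_DONTFORK so
 * that the range is not made available to a child created by a later fork().
 * Returns NULL on failure.
 */
void *alloc_dma_buffer(size_t len)
{
    assert(len > 0);

    /* Round the request up to a whole number of pages. */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t rounded = (len + page - 1) / page * page;

    void *buf = mmap(NULL, rounded, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    /* Exclude the range from fork() before it is handed to the device. */
    if (madvise(buf, rounded, MADV_DONTFORK) != 0) {
        munmap(buf, rounded);
        return NULL;
    }
    return buf;
}
```

 The buffer would be registered with the device only after the madvise()
 call has succeeded, so the pinned pages are never shared copy-on-write
 with a child.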
 
 You can choose to not define this as data corruption because the kernel
 is not directly involved and that's your call.
 
 Do you
 mean corruption by changing a page already in writeout? If so,
 don't all filesystems have that problem?
 
 
 No, the problem is different. Backing devices requiring stable pages will
 block the write until the IO is complete. For those that do not require
 stable pages it's ok to allow the write as long as the page is dirtied so
 that it'll be written out again and no data is lost.
 
 so if tux3 is prevented from forking the page in cases where the
 write would be blocked, and will get forked again for follow-up
 writes if it's modified again otherwise, won't this be the same
 thing?
 

Functionally and from a correctness point of view, it *might* be
equivalent. It depends on the implementation and the page life cycle,
particularly the details of how the writeback and dirty state are coordinated
between the user-visible pages and the page being written back. I've read
none of the code or background so I cannot answer whether it's really
equivalent or not. Just be aware that it's not the same problem as page
migration and that it's not the same as how writeback and dirty state are
handled today.

-- 
Mel Gorman
SUSE Labs

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: [FYI] tux3: Core changes

2015-05-18 Thread Boaz Harrosh
On 05/18/2015 05:20 AM, Rik van Riel wrote:
 On 05/17/2015 09:26 AM, Boaz Harrosh wrote:
 On 05/14/2015 03:59 PM, Rik van Riel wrote:
 On 05/14/2015 04:26 AM, Daniel Phillips wrote:
 Hi Rik,
 

 The issue is that things like ptrace, AIO, infiniband
 RDMA, and other direct memory access subsystems can take
 a reference to page A, which Tux3 clones into a new page B
 when the process writes it.

 However, while the process now points at page B, ptrace,
 AIO, infiniband, etc will still be pointing at page A.


 All these problems can also happen with truncate+new-extending-write

 It is the responsibility of the application to take file/range locks
 to prevent these page-pinned problems.
 
 It is unreasonable to expect a process that is being ptraced
 (potentially without its knowledge) to take special measures
 to protect the ptraced memory from disappearing.

If the memory disappears, that's a bug. But no, the memory is still there;
it just does not reflect the latest content of the fs-file.

 
 It is impossible for the debugger to take those special measures
 for anonymous memory, or unlinked inodes.
 

Why? It is one line of added code: after the open and before the mmap, do
an flock.
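
A minimal sketch of that ordering, assuming a regular file that the caller
may open read-write. The helper name map_locked() is hypothetical. Note
that flock(2) is advisory, so it only holds off writers that take the same
lock.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/file.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * map_locked() is a hypothetical helper. It opens `path`, takes an exclusive
 * advisory lock on it, and only then maps the first `len` bytes shared.
 * On success it returns the mapping and stores the descriptor in *fd_out;
 * the caller keeps that descriptor open while the mapping is in use.
 * Returns NULL on failure.
 */
void *map_locked(const char *path, size_t len, int *fd_out)
{
    assert(len > 0);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    /* The one added line: lock after the open and before the mmap. */
    if (flock(fd, LOCK_EX) != 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    *fd_out = fd;
    return p;
}
```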

 I don't think your requirement is workable or reasonable.
 

Therefore it is unreasonable to write/modify a ptraced process's
file.

Again, what I'm saying is that COWing a page on write has the same effect
as truncate+write. They are both allowed and both might give you the same
stale effect. So the precedent is there. We are not introducing a new
anomaly, just introducing a new instance of it. I guess the question
is what applications/procedures are going to break. It needs lots of testing
and real-life installations to answer that.

Thanks
Boaz




Re: [FYI] tux3: Core changes

2015-05-18 Thread Boaz Harrosh
On 05/14/2015 03:59 PM, Rik van Riel wrote:
 On 05/14/2015 04:26 AM, Daniel Phillips wrote:
 Hi Rik,

 
 The issue is that things like ptrace, AIO, infiniband
 RDMA, and other direct memory access subsystems can take
 a reference to page A, which Tux3 clones into a new page B
 when the process writes it.
 
 However, while the process now points at page B, ptrace,
 AIO, infiniband, etc will still be pointing at page A.
 

All these problems can also happen with truncate+new-extending-write

It is the responsibility of the application to take file/range locks
to prevent these page-pinned problems.

 This causes the process and the other subsystem to each
 look at a different page, instead of at shared state,
 causing ptrace to do nothing, AIO and RDMA data to be
 invisible (or corrupted), etc...
 

Again, these problems already exist. Consider each in-place write
as a truncate (punch hole) + new write: is that not the same?
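
For concreteness, the truncate + new-extending-write sequence can be
sketched from user space as below. The helper name replace_tail() is
hypothetical. The assumption behind the comparison is that truncation
drops the cached pages past the new end of file and the write that follows
populates fresh ones, so a reference pinned on one of the old pages would
no longer track the file.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * replace_tail() is a hypothetical helper. It rewrites everything from
 * offset `off` to the end of the file by truncating the file to `off` and
 * then extending it again with `len` bytes from `buf`.
 * Returns 0 on success, -1 on failure.
 */
int replace_tail(const char *path, off_t off, const void *buf, size_t len)
{
    assert(off >= 0 && len > 0);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    int rc = -1;
    /* Step 1: cut the file back, discarding the old tail. */
    if (ftruncate(fd, off) == 0) {
        /* Step 2: extend it again with the new contents. */
        ssize_t n = pwrite(fd, buf, len, off);
        if (n == (ssize_t)len)
            rc = 0;
    }

    close(fd);
    return rc;
}
```

The resulting file contents are what an in-place overwrite of the same
range would have produced; only the identity of the backing pages differs.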

Cheers
Boaz




Re: [FYI] tux3: Core changes

2015-05-18 Thread Daniel Phillips
On 05/17/2015 07:20 PM, Rik van Riel wrote:
 On 05/17/2015 09:26 AM, Boaz Harrosh wrote:
 On 05/14/2015 03:59 PM, Rik van Riel wrote:
 The issue is that things like ptrace, AIO, infiniband
 RDMA, and other direct memory access subsystems can take
 a reference to page A, which Tux3 clones into a new page B
 when the process writes it.

 However, while the process now points at page B, ptrace,
 AIO, infiniband, etc will still be pointing at page A.

 All these problems can also happen with truncate+new-extending-write

 It is the responsibility of the application to take file/range locks
 to prevent these page-pinned problems.
 
 It is unreasonable to expect a process that is being ptraced
 (potentially without its knowledge) to take special measures
 to protect the ptraced memory from disappearing.
 
 It is impossible for the debugger to take those special measures
 for anonymous memory, or unlinked inodes.
 
 I don't think your requirement is workable or reasonable.

Hi Rik,

You are quite right to poke at this aggressively. Whether or not
there is an issue needing fixing, we want to know the details. We
really need to do a deep dive in ptrace and know exactly what it
does, and whether Tux3 creates any new kind of hole. I really know
very little about ptrace at the moment; I have only heard that it
is a horrible hack we inherited from some place far away and a time
long ago.

A little guidance from you would help. Somewhere ptrace must modify
the executable page. Unlike uprobes, which makes sense to me, I did
not find where ptrace actually does that on a quick inspection.
Perhaps you could provide a pointer?
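
For reference, the user-space entry point for such a modification is the
PTRACE_POKETEXT request. The sketch below is an illustration under stated
assumptions, not a description of the kernel path: target() and
poke_child_text() are hypothetical names, and the tracee is simply a
forked child that asks to be traced.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical function; the tracer overwrites the first word of its code. */
static int target(void) { return 42; }

/*
 * poke_child_text() is a hypothetical demo. It forks a child that requests
 * tracing and stops, then reads one word of the child's text, writes a
 * modified word back, and reads it again to confirm the change.
 * Returns 0 if the child's text was modified, -1 otherwise.
 */
int poke_child_text(void)
{
    /* The word is passed through a pointer-sized ptrace argument. */
    assert(sizeof(long) == sizeof(void *));

    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) != 0)
            _exit(1);               /* tracing refused: report by exiting */
        raise(SIGSTOP);             /* stop so the parent can inspect us */
        _exit(0);
    }

    int status;
    if (waitpid(child, &status, 0) != child)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;                  /* child exited instead of stopping */

    int rc = -1;
    void *addr = (void *)target;    /* same address in the child after fork */

    errno = 0;
    long orig = ptrace(PTRACE_PEEKTEXT, child, addr, NULL);
    if (errno == 0) {
        long patched = orig ^ 0xff; /* flip the low byte of the word */
        if (ptrace(PTRACE_POKETEXT, child, addr, (void *)patched) == 0) {
            errno = 0;
            long seen = ptrace(PTRACE_PEEKTEXT, child, addr, NULL);
            if (errno == 0 && seen == patched)
                rc = 0;
        }
    }

    /* The child never runs the patched code; it is killed while stopped. */
    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    return rc;
}
```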

Regards,

Daniel
