Re: [FYI] tux3: Core changes
On 07/31/2015 01:27 PM, Daniel Phillips wrote: On Friday, July 31, 2015 8:37:35 AM PDT, Raymond Jennings wrote: Returning ENOSPC when you have free space you can't yet prove is safer than not returning it and risking a data loss when you get hit by a write/commit storm. :) Remember when delayed allocation was scary and unproven, because proving that ENOSPC will always be returned when needed is extremely difficult? But the performance advantage was compelling, so we just worked at it until it worked. There were times when it didn't work properly, but the code was in the tree so it got fixed. It's like that now with page forking - a new technique with compelling advantages, and some challenges. In the past, we (the Linux community) would rise to the challenge and err on the side of pushing optimizations in early. That was our mojo, and that is how Linux became the dominant operating system it is today. Do we, the Linux community, still have that mojo? Do you have the mojo to come up with a proposal on how to make things work, in a way that ensures data consistency for Linux users? Yes, we know page forking is not compatible with the way Linux currently uses refcounts. The question is, does anyone have an idea on how we could fix that? Not necessarily an implementation yet, just an idea might be enough to move forward at this stage. However, if nobody wants to work on even an idea, page forking may simply not be a safe thing to do. ___ Tux3 mailing list Tux3@phunq.net http://phunq.net/mailman/listinfo/tux3
Re: [FYI] tux3: Core changes
Jan Kara j...@suse.cz writes: On Sun 09-08-15 22:42:42, OGAWA Hirofumi wrote: Jan Kara j...@suse.cz writes: I'm not sure about which ENOSPC issue you are speaking BTW. Can you please elaborate?

1. GUP simulates a page fault, and prepares to modify
2. writeback clears dirty, and makes the PTE read-only
3. snapshot/reflink makes the block COW

I assume by point 3 you mean that the snapshot / reflink happens now and thus the page / block is marked as COW. Am I right? Right.

4. the driver that called GUP modifies the page, dirtying it without simulating a page fault

OK, but this doesn't hit ENOSPC because, as you correctly write in point 4, the page gets modified without triggering another page fault, so COW for the modified page isn't triggered. Modified page contents will be in both the original and the reflinked file, won't they? And the above can result in ENOSPC too, depending on the implementation and race conditions. Also, if the FS converts zeroed blocks to holes like hammerfs, ENOSPC simply happens. I.e., another process uses all the space, but then there is no ->page_mkwrite() callback to check for ENOSPC. And I agree that the fact that a snapshotted file's original contents can still get modified is a bug. One which is difficult to fix. Yes, that is why I think this logic is an issue, even before page forking. So it sounds like yet another stable page, i.e. unpredictable performance. (BTW, recalling stable pages, I noticed stable pages would not provide stable page data for that logic either.) Well, assuming elevated refcount == threshold + waitq/wakeup, IMO it is not attractive. Rather, it is the last option if there are no other design choices. I agree the performance will be less predictable and that is not good. But changing what is visible in the file when writeback races with GUP is a worse problem to me.
Maybe if GUP marked pages it got refs for, we could trigger the slow behavior only for them (Peter Zijlstra proposed in [1] an infrastructure so that pages pinned by get_user_pages() would be properly accounted, and then we could use PG_mlocked and an elevated refcount as a more reliable indication of pages that need special handling). I haven't read Peter's patchset fully yet, but it looks good, and it may be similar to the strategy currently in my mind. I'm also thinking of adding a callback for the FS at the start and end of GUP's pin window. (Just as an example, the callback could be used by the FS to stop writeback if the FS wants.) Thanks. -- OGAWA Hirofumi hirof...@mail.parknet.co.jp
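The four-step sequence above is easy to restate as a toy model. The following Python sketch is purely illustrative (every name is invented; nothing here is kernel code): it models a page as a shared mutable cell and shows why, once a driver holds a direct reference obtained via GUP, a reflink that merely marks the block COW cannot isolate the snapshot from the driver's later write.

```python
# Toy model of the GUP vs. writeback vs. reflink race (hypothetical names,
# not kernel code). A "page" is a one-element list so references are shared.

class File:
    def __init__(self, data):
        self.page = [data]          # page cache page (shared mutable cell)

def gup(f):
    """Simulate get_user_pages(): fault in and return a direct page ref."""
    return f.page                   # driver now holds the same object

def writeback(f):
    """Clean the page: clears dirty, makes the PTE read-only. Modeled as a
    no-op on the data itself -- the driver's direct reference is unaffected."""
    pass

def reflink(f):
    """Snapshot: share the block COW-style. The snapshot sees the *same*
    page object until a page fault triggers the copy -- which step 4,
    writing through the GUP reference, never does."""
    snap = File(None)
    snap.page = f.page              # block shared, marked COW in a real FS
    return snap

orig = File("old data")
ref = gup(orig)                     # 1. driver pins the page
writeback(orig)                     # 2. writeback cleans it, PTE read-only
snap = reflink(orig)                # 3. snapshot/reflink marks block COW
ref[0] = "driver scribble"          # 4. driver dirties page, no page fault

# The modification leaks into the snapshot, because no fault ran the COW:
print(orig.page[0], snap.page[0])   # driver scribble driver scribble
```

The point of the sketch is only the aliasing: both files end up showing the driver's write, exactly the "modified page contents will be in both the original and the reflinked file" outcome described above.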
Re: [FYI] tux3: Core changes
On Sun 09-08-15 22:42:42, OGAWA Hirofumi wrote: Jan Kara j...@suse.cz writes: I'm not sure about which ENOSPC issue you are speaking BTW. Can you please elaborate?

1. GUP simulates a page fault, and prepares to modify
2. writeback clears dirty, and makes the PTE read-only
3. snapshot/reflink makes the block COW

I assume by point 3 you mean that the snapshot / reflink happens now and thus the page / block is marked as COW. Am I right?

4. the driver that called GUP modifies the page, dirtying it without simulating a page fault

OK, but this doesn't hit ENOSPC because, as you correctly write in point 4, the page gets modified without triggering another page fault, so COW for the modified page isn't triggered. Modified page contents will be in both the original and the reflinked file, won't they? And I agree that the fact that a snapshotted file's original contents can still get modified is a bug. One which is difficult to fix. If you claim there is strange logic widely used already, and of course we can't simply break it because of compatibility, I would be able to agree. But your claim sounds like that logic is sane and well-designed behavior, so I disagree. To me the rule "do not detach a page from a radix tree if it has an elevated refcount unless explicitly requested by a syscall" looks like a sane one. Yes. And frankly I fail to see why you and Daniel care so much about this corner case, because from a performance POV it's IMHO a non-issue, and you bother with page forking because of performance, don't you? Penalizing the corner case path, instead of the normal path, is what we should try first. Penalizing the normal path to allow a corner case path is basically insane. Making the normal path faster and more reliable is what we are trying to do. Elevated refcount of a page is in my opinion a corner case path. That's why I think that penalizing that case by waiting for IO instead of forking is an acceptable cost for the improved compatibility and maintainability of the code. What is an elevated refcount?
What is the difference from a normal refcount? Are you saying refcount > specified threshold + waitq/wakeup or such? If so, it is not a path, it is a state. IOW, some groups may not hit it much, but some groups may hit it a lot, on the normal path. Yes, by elevated refcount I meant refcount > 2 (one for the pagecache, one for your code inspecting the page). So it sounds like yet another stable page, i.e. unpredictable performance. (BTW, recalling stable pages, I noticed stable pages would not provide stable page data for that logic either.) Well, assuming elevated refcount == threshold + waitq/wakeup, IMO it is not attractive. Rather, it is the last option if there are no other design choices. I agree the performance will be less predictable and that is not good. But changing what is visible in the file when writeback races with GUP is a worse problem to me. Maybe if GUP marked pages it got refs for, we could trigger the slow behavior only for them (Peter Zijlstra proposed in [1] an infrastructure so that pages pinned by get_user_pages() would be properly accounted, and then we could use PG_mlocked and an elevated refcount as a more reliable indication of pages that need special handling). Honza [1] http://thread.gmane.org/gmane.linux.kernel.mm/117679 -- Jan Kara j...@suse.com SUSE Labs, CR
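The trade-off Jan is proposing can be sketched as a simple policy function. This is a hypothetical illustration of the design choice under discussion, not an actual kernel interface; the constant and function names below are invented.

```python
# Hypothetical sketch of the policy discussed above (names invented): when
# writeback finds a page with an elevated refcount -- more references than
# the page cache plus the inspecting code account for -- it takes the slow,
# safe path and waits for the pin to drop instead of forking the page.

PAGECACHE_REFS = 2   # one ref held by the page cache, one by the inspector

def writeback_policy(refcount):
    """Return the action writeback should take for a page with this refcount."""
    if refcount > PAGECACHE_REFS:
        return "wait"   # pinned (e.g. by GUP): wait for IO, don't fork
    return "fork"       # common case: fork the page and write out the copy

print(writeback_policy(2))  # fork
print(writeback_policy(3))  # wait
```

Hirofumi's objection is visible in the sketch too: the branch taken depends on a runtime state, not a code path, so workloads that keep pages pinned would hit the "wait" branch on what is otherwise their normal path.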
Re: [FYI] tux3: Core changes
Jan Kara j...@suse.cz writes: I'm not sure about which ENOSPC issue you are speaking BTW. Can you please ellaborate? 1. GUP simulate page fault, and prepare to modify 2. writeback clear dirty, and make PTE read-only 3. snapshot/reflink make block cow 4. driver called GUP modifies page, and dirty page without simulate page fault If you claim, there is strange logic widely used already, and of course, we can't simply break it because of compatibility. I would be able to agree. But your claim sounds like that logic is sane and well designed behavior. So I disagree. To me the rule: Do not detach a page from a radix tree if it has an elevated refcount unless explicitely requested by a syscall looks like a sane one. Yes. And frankly I fail to see why you and Daniel care so much about this corner case because from performance POV it's IMHO a non-issue and you bother with page forking because of performance, don't you? Trying to penalize the corner case path, instead of normal path, should try at first. Penalizing normal path to allow corner case path is insane basically. Make normal path faster and more reliable is what we are trying. Elevated refcount of a page is in my opinion a corner case path. That's why I think that penalizing that case by waiting for IO instead of forking is acceptable cost for the improved compatibility maintainability of the code. What is elevated refcount? What is difference with normal refcount? Are you saying refcount = specified threshold + waitq/wakeup or such? If so, it is not the path. It is the state. IOW, some group may not hit much, but some group may hit much, on normal path. So it sounds like yet another stable page. I.e. unpredictable performance. (BTW, by recall of stable page, noticed stable page would not provide stabled page data for that logic too.) Well, assuming elevated refcount == threshold + waitq/wakeup, so IMO, it is not attractive. Rather the last option if there is no others as design choice. Thanks. 
-- OGAWA Hirofumi hirof...@mail.parknet.co.jp
Re: [FYI] tux3: Core changes
On Fri 31-07-15 17:16:45, Daniel Phillips wrote: On Friday, July 31, 2015 5:00:43 PM PDT, Daniel Phillips wrote: Note: Hirofumi's email is clear, logical and speaks to the question. This branch of the thread is largely pointless, though it essentially says the same thing in non-technical terms. Perhaps your next response should be to Hirofumi, and perhaps it should be technical. Now, let me try to lead the way, by being specific. RDMA was raised as a potential failure case for Tux3 page forking. But the RDMA API does not let you use memory mmapped by Tux3 as a source or destination of IO. Instead, it sets up its own pages and hands them out to the RDMA app from a pool. So no issue. One down, right? Can you please tell me how you arrived at that conclusion? As far as I can see from the code in drivers/infiniband/, I don't see anything there preventing userspace from passing in mmapped memory... Honza -- Jan Kara j...@suse.com SUSE Labs, CR
Re: [FYI] tux3: Core changes
My gosh. This is driving me crazy. Please let's make it crystal clear, in technical and precise terms, devoid of any ad hominem attacks or the like, what is preventing Tux3 from being merged. Maybe a list of issues in Github, so that each issue can be scrutinized more easily. Like this one: https://github.com/tux3fs/tux3-merging/issues/1 Daniel, thank you so much for spending so much energy to fight not only for tux3, but for logic, reason and rationality, and for a saner dev process in Linux. Elifarley Cruz - Do not believe anything because it is said by an authority, or if it is said to come from angels, or from Gods, or from an inspired source. Believe it only if you have explored it in your own heart and mind and body and found it to be true. Work out your own path, through diligence. - Gautama Buddha On Fri, Jul 31, 2015 at 7:27 PM, David Lang da...@lang.hm wrote: On Fri, 31 Jul 2015, Daniel Phillips wrote: On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote: We, the Linux Community have less tolerance for losing people's data and preventing them from operating than we used to when it was all tinkerers' personal data and secondary systems. So rather than pushing optimizations out to everyone and seeing what breaks, we now do more testing and checking for failures before pushing things out. By the way, I am curious about whose data you think will get lost as a result of pushing out Tux3 with a possible theoretical bug in a wildly improbable scenario that has not actually been described with sufficient specificity to falsify, let alone demonstrated. You weren't asking about any particular feature of Tux, you were asking if we were still willing to push out stuff that breaks for users and fix it later. Especially for filesystems that can lose the data of whoever is using it, the answer seems to be a clear no. There may be bugs in what's pushed out that we don't know about.
But we don't push out potential data corruption bugs that we do know about (or think we do) so if you think this should be pushed out with this known corner case that's not handled properly, you have to convince people that it's _so_ improbable that they shouldn't care about it. David Lang
Re: [FYI] tux3: Core changes
On Friday, July 31, 2015 5:00:43 PM PDT, Daniel Phillips wrote: Note: Hirofumi's email is clear, logical and speaks to the question. This branch of the thread is largely pointless, though it essentially says the same thing in non-technical terms. Perhaps your next response should be to Hirofumi, and perhaps it should be technical. Now, let me try to lead the way, by being specific. RDMA was raised as a potential failure case for Tux3 page forking. But the RDMA API does not let you use memory mmapped by Tux3 as a source or destination of IO. Instead, it sets up its own pages and hands them out to the RDMA app from a pool. So no issue. One down, right? Regards, Daniel
Re: [FYI] tux3: Core changes
Returning ENOSPC when you have free space you can't yet prove is safer than not returning it and risking a data loss when you get hit by a write/commit storm. :) On Thu, Jul 30, 2015 at 9:44 PM, OGAWA Hirofumi hirof...@mail.parknet.co.jp wrote: Jan Kara j...@suse.cz writes: Yes, if userspace truncates the file, the situation we end up with is basically the same. However for truncate to happen some malicious process has to come and truncate the file - a failure scenario that is acceptable for most use cases since it doesn't happen unless someone is actively trying to screw you. With page forking it is enough for the flusher thread to start writeback for that page to trigger the problem - an event that is basically bound to happen without any other userspace application interfering. Where did that "acceptable" conclusion come from? That pseudocode logic doesn't say anything about usage at all. And even if we assume it is acceptable, as far as I can see, for example /proc/sys/vm/drop_caches is enough to trigger it, or a page on a non-existent block (sparse file, i.e. the missing disk-space check in your logic). And if there is really no lock/check, there would be other races. So drop_caches won't cause any issues because it avoids mmapped pages. Also page reclaim or page migration don't cause any issues because they avoid pages with increased refcount (and the increased refcount would stop drop_caches from reclaiming the page as well, if it were not for the mmapped check before). Generally, an elevated page refcount currently guarantees the page isn't migrated, reclaimed, or otherwise detached from the mapping (except for truncate, where the (mapping, index) combination becomes invalid), and your page forking would change that assumption - which IMHO has a big potential for some breakage somewhere. Lifetime and visibility from the user are different topics. The issue here is visibility. Of course those are related more or less, but refcount doesn't stop the page from being dropped from the radix tree at all.
Well, anyway, your claim seems to assume the userspace app works around the issues. And it sounds like it still doesn't work around the ENOSPC issue (validated at page fault/GUP time), even assuming userspace behaves perfectly. Calling it a kernel assumption is strange. If you claim there is strange logic widely used already, and of course we can't simply break it because of compatibility, I would be able to agree. But your claim sounds like that logic is sane and well-designed behavior, so I disagree. And frankly I fail to see why you and Daniel care so much about this corner case, because from a performance POV it's IMHO a non-issue, and you bother with page forking because of performance, don't you? Penalizing the corner case path, instead of the normal path, is what we should try first. Penalizing the normal path to allow a corner case path is basically insane. Making the normal path faster and more reliable is what we are trying to do. So you can have a look for example at drivers/media/v4l2-core/videobuf2-dma-contig.c which implements setting up of a video device buffer at a virtual address specified by the user. Now I don't know whether there really is any userspace video program that sets up the video buffer in an mmapped file. I would agree with you that it would be a strange thing to do, but I've seen enough strange userspace code that I would not be too surprised. Another example of a similar kind is at drivers/infiniband/core/umem.c where we again set up a buffer for infiniband cards at a user-specified virtual address. And there are more drivers in the kernel like that. Unfortunately, I haven't looked at those yet. I guess they would be helpful for seeing the details. Thanks. -- OGAWA Hirofumi hirof...@mail.parknet.co.jp
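The guarantee Jan describes, that reclaim-like paths (page reclaim, migration, drop_caches) leave pages with extra references alone, can be illustrated with a toy model. All names here are invented for the sketch; this is not kernel code.

```python
# Toy model: reclaim skips any page whose refcount shows a holder beyond
# the mapping itself, so a pinned page is never detached from its mapping.
# This is the invariant that page forking would relax.

class Page:
    def __init__(self, name, refcount):
        self.name, self.refcount = name, refcount

def try_reclaim(mapping, expected_refs=1):
    """Keep only pages with extra holders; drop pages owned solely by the
    mapping. Returns the mapping as it looks after reclaim."""
    kept = {}
    for idx, page in mapping.items():
        if page.refcount > expected_refs:
            kept[idx] = page            # pinned (GUP, mmap): left in place
    return kept

mapping = {0: Page("clean", 1), 1: Page("pinned-by-gup", 2)}
survivors = try_reclaim(mapping)
print(sorted(p.name for p in survivors.values()))  # ['pinned-by-gup']
```

In this model a pinned page always survives reclaim, which is why a driver holding a reference can trust that the page it pinned is still the page in the mapping; forking the page behind its back breaks exactly that expectation.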
Re: [FYI] tux3: Core changes
On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote: If you define this as losing our mojo, then yes we have. A pity. There remains so much to do that simply will not get done in the absence of mojo. Regards, Daniel
Re: [FYI] tux3: Core changes
On Fri, 31 Jul 2015, Daniel Phillips wrote: Subject: Re: [FYI] tux3: Core changes On Friday, July 31, 2015 8:37:35 AM PDT, Raymond Jennings wrote: Returning ENOSPC when you have free space you can't yet prove is safer than not returning it and risking a data loss when you get hit by a write/commit storm. :) Remember when delayed allocation was scary and unproven, because proving that ENOSPC will always be returned when needed is extremely difficult? But the performance advantage was compelling, so we just worked at it until it worked. There were times when it didn't work properly, but the code was in the tree so it got fixed. It's like that now with page forking - a new technique with compelling advantages, and some challenges. In the past, we (the Linux community) would rise to the challenge and err on the side of pushing optimizations in early. That was our mojo, and that is how Linux became the dominant operating system it is today. Do we, the Linux community, still have that mojo? We, the Linux Community, have less tolerance for losing people's data and preventing them from operating than we used to when it was all tinkerers' personal data and secondary systems. So rather than pushing optimizations out to everyone and seeing what breaks, we now do more testing and checking for failures before pushing things out. This means that when something new is introduced, we default to the safe, slightly slower way initially (there will be enough other bugs to deal with in any case), and then as we gain experience from the tinkerers enabling the performance optimizations, we make those optimizations reliable and only then push them out to all users. If you define this as losing our mojo, then yes we have. But most people see the pace of development as still being high, just with more testing and polishing before it gets out to users. David Lang
Re: [FYI] tux3: Core changes
On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote: We, the Linux Community have less tolerance for losing people's data and preventing them from operating than we used to when it was all tinkerer's personal data and secondary systems. So rather than pushing optimizations out to everyone and seeing what breaks, we now do more testing and checking for failures before pushing things out. By the way, I am curious about whose data you think will get lost as a result of pushing out Tux3 with a possible theoretical bug in a wildly improbable scenario that has not actually been described with sufficient specificity to falsify, let alone demonstrated. Regards, Daniel
Re: [FYI] tux3: Core changes
On Friday, July 31, 2015 3:27:12 PM PDT, David Lang wrote: On Fri, 31 Jul 2015, Daniel Phillips wrote: On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote: ... you weren't asking about any particular feature of Tux, you were asking if we were still willing to push out stuff that breaks for users and fix it later. I think you left a key word out of my ask: theoretical. Especially for filesystems that can lose the data of whoever is using it, the answer seems to be a clear no. There may be bugs in what's pushed out that we don't know about. But we don't push out potential data corruption bugs that we do know about (or think we do), so if you think this should be pushed out with this known corner case that's not handled properly, you have to convince people that it's _so_ improbable that they shouldn't care about it. There should also be an onus on the person posing the worry to prove their case beyond a reasonable doubt, which has not been done in the case we are discussing here. Note: that is a technical assessment to which a technical response is appropriate. I do think that we should put a cap on this fencing and make a real effort to get Tux3 into mainline. We should at least set a ground rule that a problem should be proved real before it becomes a reason to derail a project in the way that our project has been derailed. Otherwise, it's hard to see what interest is served. OK, let's get back to the program. I accept your assertion that we should convince people that the issue is improbable. To do that, I need a specific issue to address. So far, no such issue has been provided with specificity. Do you see why this is frustrating? Please, community. Give us specific issues to address, or give us some way out of this eternal limbo. Or better, let's go back to the old way of doing things in Linux, which is what got us where we are today. Not this. Note: Hirofumi's email is clear, logical and speaks to the question.
This branch of the thread is largely pointless, though it essentially says the same thing in non-technical terms. Perhaps your next response should be to Hirofumi, and perhaps it should be technical. Regards, Daniel
Re: [FYI] tux3: Core changes
On Fri, 31 Jul 2015, Daniel Phillips wrote: On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote: We, the Linux Community have less tolerance for losing people's data and preventing them from operating than we used to when it was all tinkerers' personal data and secondary systems. So rather than pushing optimizations out to everyone and seeing what breaks, we now do more testing and checking for failures before pushing things out. By the way, I am curious about whose data you think will get lost as a result of pushing out Tux3 with a possible theoretical bug in a wildly improbable scenario that has not actually been described with sufficient specificity to falsify, let alone demonstrated. You weren't asking about any particular feature of Tux, you were asking if we were still willing to push out stuff that breaks for users and fix it later. Especially for filesystems that can lose the data of whoever is using it, the answer seems to be a clear no. There may be bugs in what's pushed out that we don't know about. But we don't push out potential data corruption bugs that we do know about (or think we do), so if you think this should be pushed out with this known corner case that's not handled properly, you have to convince people that it's _so_ improbable that they shouldn't care about it. David Lang
Re: [FYI] tux3: Core changes
Jan Kara j...@suse.cz writes: Yes, if userspace truncates the file, the situation we end up with is basically the same. However for truncate to happen some malicious process has to come and truncate the file - a failure scenario that is acceptable for most use cases since it doesn't happen unless someone is actively trying to screw you. With page forking it is enough for the flusher thread to start writeback for that page to trigger the problem - an event that is basically bound to happen without any other userspace application interfering. Where did that "acceptable" conclusion come from? That pseudocode logic doesn't say anything about usage at all. And even if we assume it is acceptable, as far as I can see, for example /proc/sys/vm/drop_caches is enough to trigger it, or a page on a non-existent block (sparse file, i.e. the missing disk-space check in your logic). And if there is really no lock/check, there would be other races. So drop_caches won't cause any issues because it avoids mmapped pages. Also page reclaim or page migration don't cause any issues because they avoid pages with increased refcount (and the increased refcount would stop drop_caches from reclaiming the page as well, if it were not for the mmapped check before). Generally, an elevated page refcount currently guarantees the page isn't migrated, reclaimed, or otherwise detached from the mapping (except for truncate, where the (mapping, index) combination becomes invalid), and your page forking would change that assumption - which IMHO has a big potential for some breakage somewhere. Lifetime and visibility from the user are different topics. The issue here is visibility. Of course those are related more or less, but refcount doesn't stop the page from being dropped from the radix tree at all. Well, anyway, your claim seems to assume the userspace app works around the issues. And it sounds like it still doesn't work around the ENOSPC issue (validated at page fault/GUP time), even assuming userspace behaves perfectly. Calling it a kernel assumption is strange.
If you claim there is strange logic widely used already, and of course we can't simply break it because of compatibility, I would be able to agree. But your claim sounds like that logic is sane and well-designed behavior, so I disagree. And frankly I fail to see why you and Daniel care so much about this corner case, because from a performance POV it's IMHO a non-issue, and you bother with page forking because of performance, don't you? Penalizing the corner case path, instead of the normal path, is what we should try first. Penalizing the normal path to allow a corner case path is basically insane. Making the normal path faster and more reliable is what we are trying to do. So you can have a look for example at drivers/media/v4l2-core/videobuf2-dma-contig.c which implements setting up of a video device buffer at a virtual address specified by the user. Now I don't know whether there really is any userspace video program that sets up the video buffer in an mmapped file. I would agree with you that it would be a strange thing to do, but I've seen enough strange userspace code that I would not be too surprised. Another example of a similar kind is at drivers/infiniband/core/umem.c where we again set up a buffer for infiniband cards at a user-specified virtual address. And there are more drivers in the kernel like that. Unfortunately, I haven't looked at those yet. I guess they would be helpful for seeing the details. Thanks. -- OGAWA Hirofumi hirof...@mail.parknet.co.jp
Re: [FYI] tux3: Core changes
On Sun 05-07-15 21:54:45, OGAWA Hirofumi wrote: Jan Kara j...@suse.cz writes: I'm not sure I'm understanding your pseudocode logic correctly though. This logic doesn't seem to be a page-forking-specific issue. And this pseudocode logic seems to be missing the locking and revalidation of the page. If you can show more details, it would be helpful to see more and discuss the issue of page forking, or we can think about how to handle the corner cases. Well, before that, why do we need more details? For example, replace the page fork at (4) with truncate, punch hole, or invalidate page. Those operations remove the old page from the radix tree, so the userspace's write creates a new page, and HW still references the old page. (I.e. the situation should be the same as with page forking, in my understanding of this pseudocode logic.) Yes, if userspace truncates the file, the situation we end up with is basically the same. However for truncate to happen some malicious process has to come and truncate the file - a failure scenario that is acceptable for most use cases since it doesn't happen unless someone is actively trying to screw you. With page forking it is enough for the flusher thread to start writeback for that page to trigger the problem - an event that is basically bound to happen without any other userspace application interfering. Where did that "acceptable" conclusion come from? That pseudocode logic doesn't say anything about usage at all. And even if we assume it is acceptable, as far as I can see, for example /proc/sys/vm/drop_caches is enough to trigger it, or a page on a non-existent block (sparse file, i.e. the missing disk-space check in your logic). And if there is really no lock/check, there would be other races. So drop_caches won't cause any issues because it avoids mmapped pages. Also page reclaim or page migration don't cause any issues because they avoid pages with increased refcount (and the increased refcount would stop drop_caches from reclaiming the page as well, if it were not for the mmapped check before).
Generally, an elevated page refcount currently guarantees the page isn't migrated, reclaimed, or otherwise detached from the mapping (except for truncate, where the (mapping, index) combination becomes invalid), and your page forking would change that assumption - which IMHO has a big potential for some breakage somewhere. And frankly I fail to see why you and Daniel care so much about this corner case, because from a performance POV it's IMHO a non-issue, and you bother with page forking because of performance, don't you? IOW, this pseudocode logic seems to be broken without page forking if there is no lock and revalidate. Usually, we prevent unpleasant I/O by lock_page or PG_writeback, and an obsolete page is revalidated under lock_page. Well, good luck with converting all the get_user_pages() users in the kernel to use lock_page() or PG_writeback checks to avoid issues with page forking. I don't think that's really feasible. What does converting all get_user_pages() users mean? Well, maybe right more or less; I also think there is an issue in/around get_user_pages() that we have to tackle. IMO, if there is code that actually does that pseudocode logic, it is the breakage. And calling it acceptable, a limitation, and giving up on fixing it is not, I think, the right way to go. If there is really code broken like your logic, I think we should fix it. Could you point out which code is using your logic? Since that seems so racy, I can't yet believe such racy code actually exists. So you can have a look for example at drivers/media/v4l2-core/videobuf2-dma-contig.c which implements setting up of a video device buffer at a virtual address specified by the user. Now I don't know whether there really is any userspace video program that sets up the video buffer in an mmapped file. I would agree with you that it would be a strange thing to do, but I've seen enough strange userspace code that I would not be too surprised.
Another example of a similar kind is at drivers/infiniband/core/umem.c, where we again set up a buffer for infiniband cards at a user-specified virtual address. And there are more drivers in the kernel like that.

Honza -- Jan Kara j...@suse.cz SUSE Labs, CR ___ Tux3 mailing list Tux3@phunq.net http://phunq.net/mailman/listinfo/tux3
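The pattern in those drivers looks roughly like this (a sketch of the get_user_pages() idiom of that kernel era, with details elided; this is not code from either driver, and `uaddr`, `npages`, `pages` are illustrative names):

```c
/* Sketch: a driver pins user pages and keeps struct page pointers
 * long after the system call returns. Nothing notifies it if the
 * page cache entry is later forked (or truncated). */
down_read(&current->mm->mmap_sem);
ret = get_user_pages(current, current->mm, uaddr, npages,
		     1 /* write */, 0 /* force */, pages, NULL);
up_read(&current->mm->mmap_sem);
/* ... later, DMA is set up against page_to_pfn(pages[i]); if the
 * mapping now points at a forked copy, the DMA hits the old page. */
```

The point of the example is that the pin is held across an arbitrary window, so no lock_page()-style revalidation in the filesystem can reach back into the driver.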
Re: [FYI] tux3: Core changes
Jan Kara j...@suse.cz writes: I'm not sure I'm understanding your pseudocode logic correctly though. This logic doesn't seem to be a page-forking-specific issue. And this pseudocode logic seems to be missing the locking and revalidation of the page. If you can show more details, it would be helpful to see more, and discuss the issue of page forking, or we can think about how to handle the corner cases.

Well, before that, why are more details needed? For example, replace the page fork at (4) with truncate, punch hole, or invalidate page. Those operations remove the old page from the radix tree, so the userspace write creates a new page while the HW still references the old page. (I.e. the situation should be the same as with page forking, in my understanding of this pseudocode logic.)

Yes, if userspace truncates the file, the situation we end up with is basically the same. However, for truncate to happen some malicious process has to come and truncate the file - a failure scenario that is acceptable for most use cases since it doesn't happen unless someone is actively trying to screw you. With page forking it is enough for the flusher thread to start writeback for that page to trigger the problem - an event that is basically bound to happen without any other userspace application interfering.

Where does the "acceptable" conclusion come from? That pseudocode logic doesn't say anything about usage at all. And even if we assume it is acceptable, as far as I can see, /proc/sys/vm/drop_caches for example is enough to trigger it, or a page over a non-existent block (a sparse file, i.e. the missing disk space check in your logic). And if there is really no lock/check, there would be other races.

IOW, this pseudocode logic seems to be broken even without page forking if there is no lock and revalidate. Usually, we prevent unpleasant I/O by lock_page or PG_writeback, and an obsoleted page is revalidated under lock_page.
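The lock-and-revalidate discipline Hirofumi refers to is the standard page-cache idiom, which can be sketched like this (in the style of mm/filemap.c of that era; a fragment, not runnable on its own):

```c
/* Sketch: after locking, recheck that the page is still the one the
 * mapping knows about; truncate/invalidate (and, analogously, a page
 * fork) change page->mapping, forcing the lookup to be retried. */
repeat:
	page = find_get_page(mapping, index);
	if (page) {
		lock_page(page);
		if (unlikely(page->mapping != mapping)) {
			/* raced with truncate/invalidate: stale page */
			unlock_page(page);
			page_cache_release(page);
			goto repeat;
		}
	}
```

Code that follows this pattern sees the new page after a fork; the argument in the thread is about code (like GUP users) that does not.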
Well, good luck with converting all the get_user_pages() users in the kernel to use lock_page() or PG_writeback checks to avoid issues with page forking. I don't think that's really feasible.

What does converting "all get_user_pages() users" mean? Well, more or less right, maybe - I also think there is an issue in/around get_user_pages() that we have to tackle. IMO, if there is code that actually follows that pseudocode logic, that code is the breakage. Calling it an acceptable limitation and giving up on fixing it is not, I think, the right way to go. If there really is code broken like your logic, I think we should fix it. Could you point out which code is using your logic? Since it seems so racy, I can't yet believe such racy code actually exists.

For page forking, we may also be able to prevent a similar situation by locking, flags, and revalidation. But those details might be different from the current code, because the page states are different.

Sorry, I don't understand what you mean in this paragraph. Can you explain it a bit more?

This just means that a forked page (the old page) and a truncated page have different sets of flags and state, so we may have to adjust the revalidation. Thanks. -- OGAWA Hirofumi hirof...@mail.parknet.co.jp
Re: [FYI] tux3: Core changes
On Mon 22-06-15 00:36:00, OGAWA Hirofumi wrote: Jan Kara j...@suse.cz writes: So there are a few things to keep in mind:

1) There is no such thing as a writeable page. A page is always writeable (at least on the x86 architecture). When a page is mapped into some virtual address space (or several of them), this *mapping* can be either writeable or read-only. mkwrite changes the mapping from read-only to writeable, but kernel / hardware is free to write to the page regardless of the mapping.

2) When kernel / hardware writes to the page, it first modifies the page and then marks it dirty.

So what can happen in this scenario is:

1) You hand the kernel a part of a page as a buffer. page_mkwrite() happens, the page is dirtied, and the kernel notes the PFN of the page somewhere internally.
2) Writeback comes and starts writeback for the page.
3) The kernel ships the PFN to the hardware.
4) Userspace comes and wants to write to the page (a different part than the HW is instructed to use). page_mkwrite is called, the page is forked. Userspace writes to the forked page.
5) The HW stores its data in the original page. Userspace never sees the data from the HW! Data is corrupted, where without page forking everything would work just fine.

I'm not sure I'm understanding your pseudocode logic correctly though. This logic doesn't seem to be a page-forking-specific issue. And this pseudocode logic seems to be missing the locking and revalidation of the page. If you can show more details, it would be helpful to see more, and discuss the issue of page forking, or we can think about how to handle the corner cases.

Well, before that, why are more details needed? For example, replace the page fork at (4) with truncate, punch hole, or invalidate page. Those operations remove the old page from the radix tree, so the userspace write creates a new page while the HW still references the old page. (I.e. the situation should be the same as with page forking, in my understanding of this pseudocode logic.)
Yes, if userspace truncates the file, the situation we end up with is basically the same. However, for truncate to happen some malicious process has to come and truncate the file - a failure scenario that is acceptable for most use cases since it doesn't happen unless someone is actively trying to screw you. With page forking it is enough for the flusher thread to start writeback for that page to trigger the problem - an event that is basically bound to happen without any other userspace application interfering.

IOW, this pseudocode logic seems to be broken even without page forking if there is no lock and revalidate. Usually, we prevent unpleasant I/O by lock_page or PG_writeback, and an obsoleted page is revalidated under lock_page.

Well, good luck with converting all the get_user_pages() users in the kernel to use lock_page() or PG_writeback checks to avoid issues with page forking. I don't think that's really feasible.

For page forking, we may also be able to prevent a similar situation by locking, flags, and revalidation. But those details might be different from the current code, because the page states are different.

Sorry, I don't understand what you mean in this paragraph. Can you explain it a bit more?

Another possible scenario:

1) A userspace app tells the kernel to set up a HW buffer in a page.
2) The app fills the page with data - page_mkwrite is called, the page is dirtied.
3) The app tells the kernel to ship the buffer to the video HW.
4) Writeback comes and starts writeback for the page.
5) The video HW is done with the page. The app fills a new set of data into the page - page_mkwrite is called, the page is forked.
6) The app tells the kernel to ship the buffer to the video HW. But the HW gets the old data from the original page. Again a data corruption issue where previously things were working fine.

This logic seems to be the same as above. Replace the page fork at (5). With no revalidation of the page, (6) will use the old page.

Yes, the same arguments as above apply...
Honza -- Jan Kara j...@suse.cz SUSE Labs, CR
Re: [FYI] tux3: Core changes
Jan Kara j...@suse.cz writes: Hi, So there are a few things to keep in mind:

1) There is no such thing as a writeable page. A page is always writeable (at least on the x86 architecture). When a page is mapped into some virtual address space (or several of them), this *mapping* can be either writeable or read-only. mkwrite changes the mapping from read-only to writeable, but kernel / hardware is free to write to the page regardless of the mapping.

2) When kernel / hardware writes to the page, it first modifies the page and then marks it dirty.

So what can happen in this scenario is:

1) You hand the kernel a part of a page as a buffer. page_mkwrite() happens, the page is dirtied, and the kernel notes the PFN of the page somewhere internally.
2) Writeback comes and starts writeback for the page.
3) The kernel ships the PFN to the hardware.
4) Userspace comes and wants to write to the page (a different part than the HW is instructed to use). page_mkwrite is called, the page is forked. Userspace writes to the forked page.
5) The HW stores its data in the original page. Userspace never sees the data from the HW! Data is corrupted, where without page forking everything would work just fine.

I'm not sure I'm understanding your pseudocode logic correctly though. This logic doesn't seem to be a page-forking-specific issue. And this pseudocode logic seems to be missing the locking and revalidation of the page. If you can show more details, it would be helpful to see more, and discuss the issue of page forking, or we can think about how to handle the corner cases.

Well, before that, why are more details needed? For example, replace the page fork at (4) with truncate, punch hole, or invalidate page. Those operations remove the old page from the radix tree, so the userspace write creates a new page while the HW still references the old page. (I.e. the situation should be the same as with page forking, in my understanding of this pseudocode logic.)

IOW, this pseudocode logic seems to be broken even without page forking if there is no lock and revalidate.
Usually, we prevent unpleasant I/O by lock_page or PG_writeback, and an obsoleted page is revalidated under lock_page. For page forking, we may also be able to prevent a similar situation by locking, flags, and revalidation. But those details might be different from the current code, because the page states are different.

Another possible scenario:

1) A userspace app tells the kernel to set up a HW buffer in a page.
2) The app fills the page with data - page_mkwrite is called, the page is dirtied.
3) The app tells the kernel to ship the buffer to the video HW.
4) Writeback comes and starts writeback for the page.
5) The video HW is done with the page. The app fills a new set of data into the page - page_mkwrite is called, the page is forked.
6) The app tells the kernel to ship the buffer to the video HW. But the HW gets the old data from the original page. Again a data corruption issue where previously things were working fine.

This logic seems to be the same as above. Replace the page fork at (5). With no revalidation of the page, (6) will use the old page. Thanks. -- OGAWA Hirofumi hirof...@mail.parknet.co.jp
Re: [FYI] tux3: Core changes
On Wed 2015-05-27 11:09:25, Daniel Phillips wrote: On Wednesday, May 27, 2015 12:41:37 AM PDT, Pavel Machek wrote: On Fri 2015-05-15 02:38:33, Daniel Phillips wrote: On 05/14/2015 08:06 PM, Rik van Riel wrote: ...

Umm. Why do you think it is only an issue for executable files?

I meant: files with code in them that will be executed. Please excuse me for colliding with the chmod sense. I will say "code files" to avoid ambiguity.

I'm free to mmap() any file and then execute from it. /lib/ld-linux.so /path/to/binary is a known way to exec programs that do not have the x bit set.

So... why would I write to a code file at the same time as stepping through it with ptrace? Should I expect ptrace to work perfectly if I do that? What would "work perfectly" mean, if the code is changing at the same time as it is being traced? Do you have any imagination at all?

Reasons I should expect ptrace to work perfectly if I'm writing to the file: 1) it used to work before 2) it used to work before 3) it used to work before, and regressions are not allowed 4) some kind of just-in-time compiler 5) some kind of malware, playing tricks so that you have trouble analyzing it, and of course, 6) it used to work before.

Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [FYI] tux3: Core changes
On 05/27/2015 02:37 PM, Pavel Machek wrote: On Wed 2015-05-27 11:09:25, Daniel Phillips wrote: On Wednesday, May 27, 2015 12:41:37 AM PDT, Pavel Machek wrote: On Fri 2015-05-15 02:38:33, Daniel Phillips wrote: On 05/14/2015 08:06 PM, Rik van Riel wrote: ...

Umm. Why do you think it is only an issue for executable files?

I meant: files with code in them that will be executed. Please excuse me for colliding with the chmod sense. I will say "code files" to avoid ambiguity.

I'm free to mmap() any file and then execute from it. /lib/ld-linux.so /path/to/binary is a known way to exec programs that do not have the x bit set.

So... why would I write to a code file at the same time as stepping through it with ptrace? Should I expect ptrace to work perfectly if I do that? What would "work perfectly" mean, if the code is changing at the same time as it is being traced? Do you have any imagination at all? [Non-collegial rhetoric alert, it would be helpful to avoid that.]

Reasons I should expect ptrace to work perfectly if I'm writing to the file: 1) it used to work before 2) it used to work before 3) it used to work before, and regressions are not allowed

Are you sure that ptrace will work perfectly on a file that you are writing to at the same time as tracing it? If so, it has magic that I do not understand. Could you please explain.

4) some kind of just-in-time compiler

A JIT that can tolerate being written to by a task it knows nothing about, at the same time as it is generating code in the file? I do not know of any such JIT.

5) some kind of malware, playing tricks so that you have trouble analyzing it

By writing to a code file? Then it already has write access to the code file, so it has already gotten inside your security perimeter without needing help from page fork. That said, we should be alert for any new holes that page fork might open. But if there are any, they should be actual holes, not theoretical ones.

and of course, 6) it used to work before.
I look forward to your explanation of how. Regards, Daniel
Re: [FYI] tux3: Core changes
On Tue 26-05-15 01:08:56, Daniel Phillips wrote: On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote: E.g. video drivers (or infiniband or direct IO, for that matter) which have buffers in user memory (maybe an mmapped file), grab references to pages and hand out PFNs of those pages to the hardware to store data in them... If you fork a page after the driver has handed PFNs to the hardware, you've just lost all the writes the hardware will do.

Hi Jan, The page forked because somebody wrote to it with write(2) or an mmap write at the same time as a video driver (or infiniband or direct IO) was doing IO to it. Isn't the application trying hard to lose data in that case? It would not need page fork to lose data that way.

So I can think of two valid uses: 1) You set up IO to part of a page and modify a different part of the page from userspace. 2) At least for video drivers, there is one ioctl() which creates an object with buffers in memory and another ioctl() to actually ship it to hardware (which may be called repeatedly). So in theory an app could validly dirty the pages before it ships them to hardware. If this happens repeatedly and interacts badly with background writeback, you will end up with a forked page in a buffer, and from that point on things are broken.

So my opinion is: don't fork the page if page_count is elevated. You can just wait for the IO if you need stable pages in that case. It's slow, but it's safe, and it should be pretty rare. Is there any problem with that?

Honza -- Jan Kara j...@suse.cz SUSE Labs, CR
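Jan's suggestion maps onto something like the following in the fork path (a hedged sketch, not actual Tux3 code: `fork_or_wait()`, `tux3_fork_page()`, and the expected-count arithmetic are all illustrative, and a real version would have to be much more careful about which references it can account for):

```c
/* Sketch: refuse to fork when an unaccounted reference exists, and
 * fall back to ordinary stable-page behaviour instead. */
static struct page *fork_or_wait(struct page *page)
{
	/* references we can explain: page cache + ourselves + mapped ptes */
	int expected = 2 + page_mapcount(page);

	if (page_count(page) > expected) {
		/* Someone (get_user_pages(), a driver) holds a pin:
		 * wait for writeback rather than COWing the page. */
		wait_on_page_writeback(page);
		return page;
	}
	return tux3_fork_page(page);	/* hypothetical fork helper */
}
```

The design trade-off debated downthread is exactly this branch: the wait makes latency unpredictable, but it never strands a pinned page outside the page cache.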
Re: [FYI] tux3: Core changes
On Mon 25-05-15 23:11:11, Daniel Phillips wrote: On Monday, May 25, 2015 11:04:39 PM PDT, David Lang wrote: if the page gets modified again, will that cause any issues? what if the page gets modified before the copy gets written out, so that there are two dirty copies of the page in the process of being written? David Lang

How is the page going to get modified again? A forked page isn't mapped by a pte, so userspace can't modify it by mmap. The forked page is not in the page cache, so userspace can't modify it by POSIX file ops. So the writer would have to be in the kernel. Tux3 knows what it is doing, so it won't modify the page. What kernel code besides Tux3 will modify the page?

E.g. video drivers (or infiniband or direct IO, for that matter) which have buffers in user memory (maybe an mmapped file), grab references to pages and hand out PFNs of those pages to the hardware to store data in them... If you fork a page after the driver has handed PFNs to the hardware, you've just lost all the writes the hardware will do.

Honza -- Jan Kara j...@suse.cz SUSE Labs, CR
Re: [FYI] tux3: Core changes
On (05/26/15 01:08), Daniel Phillips wrote: On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote: E.g. video drivers (or infiniband or direct IO, for that matter) which have buffers in user memory (maybe an mmapped file), grab references to pages and hand out PFNs of those pages to the hardware to store data in them... If you fork a page after the driver has handed PFNs to the hardware, you've just lost all the writes the hardware will do.

Hi Jan, The page forked because somebody wrote to it with write(2) or an mmap write at the same time as a video driver (or infiniband or direct IO) was doing IO to it. Isn't the application trying hard to lose data in that case? It would not need page fork to lose data that way.

Hello, is it possible to page-fork-bomb the system with some 'malicious' app? -ss
Re: [FYI] tux3: Core changes
On Tuesday, May 26, 2015 3:13:02 AM PDT, Pavel Machek wrote: On Tue 2015-05-26 01:09:59, Daniel Phillips wrote: On Monday, May 25, 2015 11:13:46 PM PDT, David Lang wrote:

I'm assuming that Rik is talking about whatever has the reference to the page via one of the methods that he talked about.

This would be a good moment to provide specifics.

Hmm. This seems like a good moment for you to audit the whole kernel, to make sure it does not do stuff you don't expect it to. You are changing core semantics; stuff that was allowed before is not allowed now, so it looks like you should do the auditing...

No, we are not changing core semantics. Core semantics remain unchanged. All these questions are about Tux3 semantics. So I believe that I am correct in asking for specifics, rather than general theory. Auditing the whole kernel seems an unreasonable barrier to merging a filesystem.

You may want to start with video4linux, as Jan pointed out.

I pointed out that the video case Jan mentioned is a race. Is that right or wrong? And what responsibility does Tux3 have, if the use case is a race? I would think: do not oops, do not corrupt the filesystem. Anything else? Regards, Daniel
Re: [FYI] tux3: Core changes
On 05/26/2015 02:00 AM, Jan Kara wrote: On Tue 26-05-15 01:08:56, Daniel Phillips wrote: On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote: E.g. video drivers (or infiniband or direct IO for that matter) which have buffers in user memory (may be mmapped file), grab references to pages and hand out PFNs of those pages to the hardware to store data in them... If you fork a page after the driver has handed PFNs to the hardware, you've just lost all the writes hardware will do. Hi Jan, The page forked because somebody wrote to it with write(2) or mmap write at the same time as a video driver (or infiniband or direct IO) was doing io to it. Isn't the application trying hard to lose data in that case? It would not need page fork to lose data that way. So I can think of two valid uses: 1) You setup IO to part of a page and modify from userspace a different part of a page. Suppose the use case is reading textures from video memory into a mmapped file, and at the same time, the application is allowed to update the textures in the file via mmap or write(2). Fork happens at mkwrite time. If the page is already dirty, we do not fork it. The video API must have made the page writable and dirty, so I do not see an issue. 2) At least for video drivers there is one ioctl() which creates object with buffers in memory and another ioctl() to actually ship it to hardware (may be called repeatedly). So in theory app could validly dirty the pages before it ships them to hardware. If this happens repeatedly and interacts badly with background writeback, you will end up with a forked page in a buffer and from that point on things are broken. Writeback does not fork pages. An app may dirty a page that is in process of being shipped to hardware (must be a distinct part of the page, or it is a race) and the data being sent to hardware will not be disturbed. If there is an issue here, I do not see it. So my opinion is: Don't fork the page if page_count is elevated. 
You can just wait for the IO if you need stable pages in that case. It's slow, but it's safe, and it should be pretty rare. Is there any problem with that?

That would be our fallback if anybody discovers a specific case where page fork breaks something, which so far has not been demonstrated. With a known fallback, it is hard to see why we should delay merging over that. Perfection has never been a requirement for merging filesystems. On the contrary, imperfection is a reason for merging, so that the many-eyeballs effect may prove its value. Regards, Daniel
Re: [FYI] tux3: Core changes
Hi Sergey, On 05/26/2015 03:22 AM, Sergey Senozhatsky wrote: Hello, is it possible to page-fork-bomb the system with some 'malicious' app?

Not in any new way. A page fork can happen either in the front end, where it has to wait for memory like any other normal memory user, or in the backend, where Tux3 may have privileged access to low memory reserves and therefore must place bounds on its memory use like any other user of low memory reserves. This is not specific to page fork: we must place such bounds for any memory the backend uses. Fortunately, the backend does not allocate memory extravagantly, for fork or anything else, so when this does get to the top of our to-do list it should not be too hard to deal with. We plan to attack that after merge, as we have never observed a problem in practice. Rather, Tux3 already seems to survive low memory situations pretty well compared to some other filesystems. Regards, Daniel
Re: [FYI] tux3: Core changes
On 05/26/2015 04:22 PM, Daniel Phillips wrote: On 05/26/2015 02:00 AM, Jan Kara wrote: So my opinion is: Don't fork the page if page_count is elevated. You can just wait for the IO if you need stable pages in that case. It's slow, but it's safe, and it should be pretty rare. Is there any problem with that?

That would be our fallback if anybody discovers a specific case where page fork breaks something, which so far has not been demonstrated. With a known fallback, it is hard to see why we should delay merging over that. Perfection has never been a requirement for merging filesystems. On the contrary, imperfection is a reason for merging, so that the many-eyeballs effect may prove its value.

However, avoiding data corruption by erring on the side of safety is a pretty basic requirement. If you skip the page fork when there is an elevated page count, tux3 should be safe (at least from that aspect). Only do the COW when there is no strange use of the page going on. -- All rights reversed
Re: [FYI] tux3: Core changes
On 05/20/2015 07:44 AM, Jan Kara wrote: On Tue 19-05-15 13:33:31, David Lang wrote: On Tue, 19 May 2015, Daniel Phillips wrote: I understand that Tux3 may avoid these issues due to some other mechanisms it internally has but if page forking should get into mm subsystem, the above must work. It does work, and by example, it does not need a lot of code to make it work, but the changes are not trivial. Tux3's delta writeback model will not suit everyone, so you can't just lift our code and add it to Ext4. Using it in Ext4 would require a per-inode writeback model, which looks practical to me but far from a weekend project. Maybe something to consider for Ext5. It is the job of new designs like Tux3 to chase after that final drop of performance, not our trusty Ext4 workhorse. Though stranger things have happened - as I recall, Ext4 had O(n) directory operations at one time. Fixing that was not easy, but we did it because we had to. Fixing Ext4's write performance is not urgent by comparison, and the barrier is high, you would want jbd3 for one thing. I think the meta-question you are asking is, where is the second user for this new CoW functionality? With a possible implication that if there is no second user then Tux3 cannot be merged. Is that is the question? I don't think they are asking for a second user. What they are saying is that for this functionality to be accepted in the mm subsystem, these problem cases need to work reliably, not just work for Tux3 because of your implementation. 
So for things that you don't use, you need to make it an error if they get used on a page that's been forked (or not an error, and 'do the right thing'). For cases where it doesn't matter because Tux3 controls the writeback, and it's undefined in general what happens if writeback is triggered twice on the same page, you will need to figure out how to either prevent the second writeback from triggering if there's one in process, or define how the two writebacks are going to happen so that you can't end up with them re-ordered by some other filesystem. I think that that's what's meant by the top statement that I left in the quote. Even if your implementation details make it safe, these need to be safe even without your implementation details to be acceptable in the core kernel.

Yeah, that's what I meant. If you create a function which manipulates the page cache, you'd better make it work with the other functions manipulating the page cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting developer. Sure, you can document all the conditions under which the function is safe to use, but a function that needs several paragraphs in front of it explaining when it is safe to use isn't a very good API...

Violent agreement, of course. To put it in concrete terms, each of the page fork support functions must be examined and determined sane. They are:

* cow_replace_page_cache
* cow_delete_from_page_cache
* cow_clone_page
* page_cow_one
* page_cow_file

Would it be useful to drill down into those, starting from the top of the list? Regards, Daniel
Re: [FYI] tux3: Core changes
On Wed, 20 May 2015, Daniel Phillips wrote: On 05/20/2015 03:51 PM, Daniel Phillips wrote: On 05/20/2015 12:53 PM, Rik van Riel wrote: How does tux3 prevent a user of find_get_page() from reading from or writing into the pre-COW page, instead of the current page?

Careful control of the dirty bits (we have two of them, one each for front and back). That is what pagefork_for_blockdirty is about. Ah, and of course it does not matter if a reader is on the pre-COW page. It would be reading the earlier copy, which might no longer be the current copy, but it raced with the write, so nobody should be surprised. That is a race even without page fork.

How do you prevent it from continuing to interact with the old version of the page, and never seeing updates or having its changes reflected on the current page? David Lang
Re: [FYI] tux3: Core changes
On Tue, 19 May 2015, Daniel Phillips wrote: I understand that Tux3 may avoid these issues due to some other mechanisms it internally has but if page forking should get into mm subsystem, the above must work. It does work, and by example, it does not need a lot of code to make it work, but the changes are not trivial. Tux3's delta writeback model will not suit everyone, so you can't just lift our code and add it to Ext4. Using it in Ext4 would require a per-inode writeback model, which looks practical to me but far from a weekend project. Maybe something to consider for Ext5. It is the job of new designs like Tux3 to chase after that final drop of performance, not our trusty Ext4 workhorse. Though stranger things have happened - as I recall, Ext4 had O(n) directory operations at one time. Fixing that was not easy, but we did it because we had to. Fixing Ext4's write performance is not urgent by comparison, and the barrier is high, you would want jbd3 for one thing. I think the meta-question you are asking is, where is the second user for this new CoW functionality? With a possible implication that if there is no second user then Tux3 cannot be merged. Is that is the question? I don't think they are asking for a second user. What they are saying is that for this functionality to be accepted in the mm subsystem, these problem cases need to work reliably, not just work for Tux3 because of your implementation. So for things that you don't use, you need to make it an error if they get used on a page that's been forked (or not be an error and 'do the right thing') For cases where it doesn't matter because Tux3 controls the writeback, and it's undefined in general what happens if writeback is triggered twice on the same page, you will need to figure out how to either prevent the second writeback from triggering if there's one in process, or define how the two writebacks are going to happen so that you can't end up with them re-ordered by some other filesystem. 
I think that that's what's meant by the top statement that I left in the quote. Even if your implementation details make it safe, these need to be safe even without your implementation details to be acceptable in the core kernel. David Lang
Re: [FYI] tux3: Core changes
On Thu 14-05-15 01:26:23, Daniel Phillips wrote: Hi Rik, Our linux-tux3 tree currently carries this 652-line diff against core to make Tux3 work. This is mainly by Hirofumi, except the fs-writeback.c hook, which is by me. The main part you may be interested in is rmap.c, which addresses the issues raised at the 2013 Linux Storage, Filesystem and MM Summit in San Francisco.[1]

LSFMM: Page forking http://lwn.net/Articles/548091/

This is just an FYI. An upcoming Tux3 report will be a tour of the page forking design and implementation. For now, this is just to give a general sense of what we have done. We heard there are concerns about how ptrace will work. I really am not familiar with the issue; could you please explain what you were thinking of there?

So here are a few things I find problematic about page forking (besides the cases with an elevated page_count already discussed in this thread - there I believe that anything more complex than "wait for the IO instead of forking when the page has an elevated use count" isn't going to work; there are too many users depending on too subtle details of the behavior...). Some of them are actually mentioned in the above LWN article:

When you create a copy of a page and replace it in the radix tree, nobody in the mm subsystem is aware that oldpage may be under writeback. That causes interesting issues:

* truncate_inode_pages() can finish before all IO for the file is finished. So far filesystems rely on the fact that once truncate_inode_pages() finishes, all running IO against the file is completed and new IO cannot be submitted.
* Writeback can come and try to write newpage while oldpage is still under IO. Then you'll have two IOs against one block, which has undefined results.
* filemap_fdatawait() called from fsync() has the additional problem that it is not aware of oldpage and thus may return although the IO hasn't finished yet.
I understand that Tux3 may avoid these issues due to some other mechanisms it internally has, but if page forking should get into the mm subsystem, the above must work. Honza

diffstat tux3.core.patch
 fs/Makefile               |    1
 fs/fs-writeback.c         |  100 +
 include/linux/fs.h        |    6 +
 include/linux/mm.h        |    5 +
 include/linux/pagemap.h   |    2
 include/linux/rmap.h      |   14
 include/linux/writeback.h |   23 +++
 mm/filemap.c              |   82 +++
 mm/rmap.c                 |  139 ++
 mm/truncate.c             |   98
 10 files changed, 411 insertions(+), 59 deletions(-)

diff --git a/fs/Makefile b/fs/Makefile
index 91fcfa3..44d7192 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -70,7 +70,6 @@ obj-$(CONFIG_EXT4_FS) += ext4/
 obj-$(CONFIG_JBD)	+= jbd/
 obj-$(CONFIG_JBD2)	+= jbd2/
 obj-$(CONFIG_TUX3)	+= tux3/
-obj-$(CONFIG_TUX3_MMAP)	+= tux3/
 obj-$(CONFIG_CRAMFS)	+= cramfs/
 obj-$(CONFIG_SQUASHFS)	+= squashfs/
 obj-y	+= ramfs/
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2d609a5..fcd1c61 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -34,25 +34,6 @@
  */
 #define MIN_WRITEBACK_PAGES	(4096UL >> (PAGE_CACHE_SHIFT - 10))
 
-/*
- * Passed into wb_writeback(), essentially a subset of writeback_control
- */
-struct wb_writeback_work {
-	long nr_pages;
-	struct super_block *sb;
-	unsigned long *older_than_this;
-	enum writeback_sync_modes sync_mode;
-	unsigned int tagged_writepages:1;
-	unsigned int for_kupdate:1;
-	unsigned int range_cyclic:1;
-	unsigned int for_background:1;
-	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
-	enum wb_reason reason;		/* why was writeback initiated? */
-
-	struct list_head list;		/* pending work list */
-	struct completion *done;	/* set if the caller waits */
-};
-
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -192,6 +173,36 @@ void inode_wb_list_del(struct inode *inode)
 }
 
 /*
+ * Remove inode from writeback list if clean.
+ */
+void inode_writeback_done(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	spin_lock(&bdi->wb.list_lock);
+	spin_lock(&inode->i_lock);
+	if (!(inode->i_state & I_DIRTY))
+		list_del_init(&inode->i_wb_list);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_done);
+
+/*
+ * Add inode to writeback dirty list with current time.
+ */
+void inode_writeback_touch(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
Re: [FYI] tux3: Core changes
On Sat, May 16, 2015 at 03:38:04PM -0700, David Lang wrote: On Fri, 15 May 2015, Mel Gorman wrote: On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote: On 05/15/2015 01:09 AM, Mel Gorman wrote: On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote: On 05/14/2015 08:06 PM, Daniel Phillips wrote: The issue is that things like ptrace, AIO, infiniband RDMA, and other direct memory access subsystems can take a reference to page A, which Tux3 clones into a new page B when the process writes it. However, while the process now points at page B, ptrace, AIO, infiniband, etc will still be pointing at page A. This causes the process and the other subsystem to each look at a different page, instead of at shared state, causing ptrace to do nothing, AIO and RDMA data to be invisible (or corrupted), etc... Is this a bit like page migration? Yes. Page migration will fail if there is an extra reference to the page that is not accounted for by the migration code. When I said it's not like page migration, I was referring to the fact that a COW on a pinned page for RDMA is a different problem to page migration. The COW of a pinned page can lead to lost writes or corruption depending on the ordering of events. I see the lost writes case, but not the corruption case. Data corruption can occur depending on the ordering of events and the application's expectations. If a process starts IO, RDMA pins the page for read and forks are combined with writes from another thread, then when the IO completes the reads may not be visible. The application may take improper action at that point. if tux3 forks the page and writes the copy while the original page is being modified by other things, this means that some of the changes won't be in the version written (and this could catch partial writes with 'interesting' results if the forking happens at the wrong time) Potentially yes. There is likely to be some elevated memory usage but I imagine that can be controlled.
But if the original page gets re-marked as needing to be written out when it's changed by one of the other things that are accessing it, there shouldn't be any long-term corruption. As far as short-term corruption goes, any time you have a page mmapped it could get written out at any time, with only some of the application changes applied to it, so this sort of corruption could happen anyway, couldn't it? That becomes the responsibility of the application. It's up to it to sync appropriately when it knows updates are complete. Users of RDMA are typically expected to use MADV_DONTFORK to avoid this class of problem. You can choose to not define this as data corruption because the kernel is not directly involved and that's your call. Do you mean corruption by changing a page already in writeout? If so, don't all filesystems have that problem? No, the problem is different. Backing devices requiring stable pages will block the write until the IO is complete. For those that do not require stable pages it's ok to allow the write as long as the page is dirtied so that it'll be written out again and no data is lost. so if tux3 is prevented from forking the page in cases where the write would be blocked, and will get forked again for follow-up writes if it's modified again otherwise, won't this be the same thing? Functionally and from a correctness point of view, it *might* be equivalent. It depends on the implementation and the page life cycle, particularly the details of how the writeback and dirty state are coordinated between the user-visible pages and the page being written back. I've read none of the code or background so I cannot answer whether it's really equivalent or not. Just be aware that it's not the same problem as page migration and that it's not the same as how writeback and dirty state is handled today. -- Mel Gorman SUSE Labs
Re: [FYI] tux3: Core changes
On 05/18/2015 05:20 AM, Rik van Riel wrote: On 05/17/2015 09:26 AM, Boaz Harrosh wrote: On 05/14/2015 03:59 PM, Rik van Riel wrote: On 05/14/2015 04:26 AM, Daniel Phillips wrote: Hi Rik, The issue is that things like ptrace, AIO, infiniband RDMA, and other direct memory access subsystems can take a reference to page A, which Tux3 clones into a new page B when the process writes it. However, while the process now points at page B, ptrace, AIO, infiniband, etc will still be pointing at page A. All these problems can also happen with truncate+new-extending-write. It is the responsibility of the application to take file/range locks to prevent these page-pinned problems. It is unreasonable to expect a process that is being ptraced (potentially without its knowledge) to take special measures to protect the ptraced memory from disappearing. If the memory disappears that's a bug. No, the memory is just there; it is just not reflecting the latest content of the fs-file. It is impossible for the debugger to take those special measures for anonymous memory, or unlinked inodes. Why? One line of added code after the open and before the mmap: do an flock. I don't think your requirement is workable or reasonable. Therefore it is unreasonable to write/modify a ptraced process's file. Again what I'm saying is COWing a page on write has the same effect as truncate+write. They are both allowed and both might give you the same stale effect. So the precedence is there. We are not introducing a new anomaly, just introducing a new instance of it. I guess the question is what applications/procedures are going to break. Need lots of testing and real life installations to answer that, I guess. Thanks Boaz
Re: [FYI] tux3: Core changes
On 05/14/2015 03:59 PM, Rik van Riel wrote: On 05/14/2015 04:26 AM, Daniel Phillips wrote: Hi Rik, The issue is that things like ptrace, AIO, infiniband RDMA, and other direct memory access subsystems can take a reference to page A, which Tux3 clones into a new page B when the process writes it. However, while the process now points at page B, ptrace, AIO, infiniband, etc will still be pointing at page A. All these problems can also happen with truncate+new-extending-write. It is the responsibility of the application to take file/range locks to prevent these page-pinned problems. This causes the process and the other subsystem to each look at a different page, instead of at shared state, causing ptrace to do nothing, AIO and RDMA data to be invisible (or corrupted), etc... Again these problems already exist. Consider each in-place-write being a truncate (punch hole) + new-write; is that not the same? Cheers Boaz
Re: [FYI] tux3: Core changes
On 05/17/2015 07:20 PM, Rik van Riel wrote: On 05/17/2015 09:26 AM, Boaz Harrosh wrote: On 05/14/2015 03:59 PM, Rik van Riel wrote: The issue is that things like ptrace, AIO, infiniband RDMA, and other direct memory access subsystems can take a reference to page A, which Tux3 clones into a new page B when the process writes it. However, while the process now points at page B, ptrace, AIO, infiniband, etc will still be pointing at page A. All these problems can also happen with truncate+new-extending-write. It is the responsibility of the application to take file/range locks to prevent these page-pinned problems. It is unreasonable to expect a process that is being ptraced (potentially without its knowledge) to take special measures to protect the ptraced memory from disappearing. It is impossible for the debugger to take those special measures for anonymous memory, or unlinked inodes. I don't think your requirement is workable or reasonable. Hi Rik, You are quite right to poke at this aggressively. Whether or not there is an issue needing fixing, we want to know the details. We really need to do a deep dive in ptrace and know exactly what it does, and whether Tux3 creates any new kind of hole. I really know very little about ptrace at the moment; I have only heard that it is a horrible hack we inherited from some place far away and a time long ago. A little guidance from you would help. Somewhere ptrace must modify the executable page. Unlike uprobes, which makes sense to me, I did not find where ptrace actually does that on a quick inspection. Perhaps you could provide a pointer? Regards, Daniel
Re: [FYI] tux3: Core changes
On 05/17/2015 09:26 AM, Boaz Harrosh wrote: On 05/14/2015 03:59 PM, Rik van Riel wrote: On 05/14/2015 04:26 AM, Daniel Phillips wrote: Hi Rik, The issue is that things like ptrace, AIO, infiniband RDMA, and other direct memory access subsystems can take a reference to page A, which Tux3 clones into a new page B when the process writes it. However, while the process now points at page B, ptrace, AIO, infiniband, etc will still be pointing at page A. All these problems can also happen with truncate+new-extending-write. It is the responsibility of the application to take file/range locks to prevent these page-pinned problems. It is unreasonable to expect a process that is being ptraced (potentially without its knowledge) to take special measures to protect the ptraced memory from disappearing. It is impossible for the debugger to take those special measures for anonymous memory, or unlinked inodes. I don't think your requirement is workable or reasonable. -- All rights reversed
Re: [FYI] tux3: Core changes
On Fri, 15 May 2015, Mel Gorman wrote: On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote: On 05/15/2015 01:09 AM, Mel Gorman wrote: On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote: On 05/14/2015 08:06 PM, Daniel Phillips wrote: The issue is that things like ptrace, AIO, infiniband RDMA, and other direct memory access subsystems can take a reference to page A, which Tux3 clones into a new page B when the process writes it. However, while the process now points at page B, ptrace, AIO, infiniband, etc will still be pointing at page A. This causes the process and the other subsystem to each look at a different page, instead of at shared state, causing ptrace to do nothing, AIO and RDMA data to be invisible (or corrupted), etc... Is this a bit like page migration? Yes. Page migration will fail if there is an extra reference to the page that is not accounted for by the migration code. When I said it's not like page migration, I was referring to the fact that a COW on a pinned page for RDMA is a different problem to page migration. The COW of a pinned page can lead to lost writes or corruption depending on the ordering of events. I see the lost writes case, but not the corruption case. Data corruption can occur depending on the ordering of events and the application's expectations. If a process starts IO, RDMA pins the page for read and forks are combined with writes from another thread, then when the IO completes the reads may not be visible. The application may take improper action at that point. if tux3 forks the page and writes the copy while the original page is being modified by other things, this means that some of the changes won't be in the version written (and this could catch partial writes with 'interesting' results if the forking happens at the wrong time) But if the original page gets re-marked as needing to be written out when it's changed by one of the other things that are accessing it, there shouldn't be any long-term corruption.
As far as short-term corruption goes, any time you have a page mmapped it could get written out at any time, with only some of the application changes applied to it, so this sort of corruption could happen anyway, couldn't it? Users of RDMA are typically expected to use MADV_DONTFORK to avoid this class of problem. You can choose to not define this as data corruption because the kernel is not directly involved and that's your call. Do you mean corruption by changing a page already in writeout? If so, don't all filesystems have that problem? No, the problem is different. Backing devices requiring stable pages will block the write until the IO is complete. For those that do not require stable pages it's ok to allow the write as long as the page is dirtied so that it'll be written out again and no data is lost. so if tux3 is prevented from forking the page in cases where the write would be blocked, and will get forked again for follow-up writes if it's modified again otherwise, won't this be the same thing? David Lang If RDMA to a mmapped file races with write(2) to the same file, maybe it is reasonable and expected to lose some data. In the RDMA case, there is at least application awareness to work around the problems. Normally it's ok to have both mapped and write() access to data although userspace might need a lock to co-ordinate updates and event ordering.
Re: [FYI] tux3: Core changes
On Thu, May 14, 2015 at 05:06:39PM -0700, Daniel Phillips wrote: Hi Rik, Added Mel, Andrea and Peterz to CC as interested parties. There are probably others, please just jump in. On 05/14/2015 05:59 AM, Rik van Riel wrote: On 05/14/2015 04:26 AM, Daniel Phillips wrote: Hi Rik, Our linux-tux3 tree currently carries this 652 line diff against core, to make Tux3 work. This is mainly by Hirofumi, except the fs-writeback.c hook, which is by me. The main part you may be interested in is rmap.c, which addresses the issues raised at the 2013 Linux Storage Filesystem and MM Summit in San Francisco.[1] LSFMM: Page forking http://lwn.net/Articles/548091/ This is just a FYI. An upcoming Tux3 report will be a tour of the page forking design and implementation. For now, this is just to give a general sense of what we have done. We heard there are concerns about how ptrace will work. I really am not familiar with the issue, could you please explain what you were thinking of there? The issue is that things like ptrace, AIO, infiniband RDMA, and other direct memory access subsystems can take a reference to page A, which Tux3 clones into a new page B when the process writes it. However, while the process now points at page B, ptrace, AIO, infiniband, etc will still be pointing at page A. This causes the process and the other subsystem to each look at a different page, instead of at shared state, causing ptrace to do nothing, AIO and RDMA data to be invisible (or corrupted), etc... Is this a bit like page migration? No, it's not. -- Mel Gorman SUSE Labs
Re: [FYI] tux3: Core changes
On 05/15/2015 01:09 AM, Mel Gorman wrote: On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote: On 05/14/2015 08:06 PM, Daniel Phillips wrote: The issue is that things like ptrace, AIO, infiniband RDMA, and other direct memory access subsystems can take a reference to page A, which Tux3 clones into a new page B when the process writes it. However, while the process now points at page B, ptrace, AIO, infiniband, etc will still be pointing at page A. This causes the process and the other subsystem to each look at a different page, instead of at shared state, causing ptrace to do nothing, AIO and RDMA data to be invisible (or corrupted), etc... Is this a bit like page migration? Yes. Page migration will fail if there is an extra reference to the page that is not accounted for by the migration code. When I said it's not like page migration, I was referring to the fact that a COW on a pinned page for RDMA is a different problem to page migration. The COW of a pinned page can lead to lost writes or corruption depending on the ordering of events. I see the lost writes case, but not the corruption case. Do you mean corruption by changing a page already in writeout? If so, don't all filesystems have that problem? If RDMA to a mmapped file races with write(2) to the same file, maybe it is reasonable and expected to lose some data. Page migration fails when there are unexpected problems to avoid this class of issue, which is fine for page migration but may be a critical failure in a filesystem depending on exactly why the copy is required. Regards, Daniel
Re: [FYI] tux3: Core changes
On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote: On 05/15/2015 01:09 AM, Mel Gorman wrote: On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote: On 05/14/2015 08:06 PM, Daniel Phillips wrote: The issue is that things like ptrace, AIO, infiniband RDMA, and other direct memory access subsystems can take a reference to page A, which Tux3 clones into a new page B when the process writes it. However, while the process now points at page B, ptrace, AIO, infiniband, etc will still be pointing at page A. This causes the process and the other subsystem to each look at a different page, instead of at shared state, causing ptrace to do nothing, AIO and RDMA data to be invisible (or corrupted), etc... Is this a bit like page migration? Yes. Page migration will fail if there is an extra reference to the page that is not accounted for by the migration code. When I said it's not like page migration, I was referring to the fact that a COW on a pinned page for RDMA is a different problem to page migration. The COW of a pinned page can lead to lost writes or corruption depending on the ordering of events. I see the lost writes case, but not the corruption case. Data corruption can occur depending on the ordering of events and the application's expectations. If a process starts IO, RDMA pins the page for read and forks are combined with writes from another thread, then when the IO completes the reads may not be visible. The application may take improper action at that point. Users of RDMA are typically expected to use MADV_DONTFORK to avoid this class of problem. You can choose to not define this as data corruption because the kernel is not directly involved and that's your call. Do you mean corruption by changing a page already in writeout? If so, don't all filesystems have that problem? No, the problem is different. Backing devices requiring stable pages will block the write until the IO is complete.
For those that do not require stable pages it's ok to allow the write as long as the page is dirtied so that it'll be written out again and no data is lost. If RDMA to a mmapped file races with write(2) to the same file, maybe it is reasonable and expected to lose some data. In the RDMA case, there is at least application awareness to work around the problems. Normally it's ok to have both mapped and write() access to data although userspace might need a lock to co-ordinate updates and event ordering. -- Mel Gorman SUSE Labs
Re: [FYI] tux3: Core changes
On 05/14/2015 04:26 AM, Daniel Phillips wrote: Hi Rik, Our linux-tux3 tree currently carries this 652 line diff against core, to make Tux3 work. This is mainly by Hirofumi, except the fs-writeback.c hook, which is by me. The main part you may be interested in is rmap.c, which addresses the issues raised at the 2013 Linux Storage Filesystem and MM Summit in San Francisco.[1] LSFMM: Page forking http://lwn.net/Articles/548091/ This is just a FYI. An upcoming Tux3 report will be a tour of the page forking design and implementation. For now, this is just to give a general sense of what we have done. We heard there are concerns about how ptrace will work. I really am not familiar with the issue, could you please explain what you were thinking of there? The issue is that things like ptrace, AIO, infiniband RDMA, and other direct memory access subsystems can take a reference to page A, which Tux3 clones into a new page B when the process writes it. However, while the process now points at page B, ptrace, AIO, infiniband, etc will still be pointing at page A. This causes the process and the other subsystem to each look at a different page, instead of at shared state, causing ptrace to do nothing, AIO and RDMA data to be invisible (or corrupted), etc... -- All rights reversed