Re: kdbus: credential faking
Casey Schaufler wrote:
> On 7/10/2015 7:57 AM, Alex Elsayed wrote:
>> Stephen Smalley wrote:
>>
>>> On 07/10/2015 09:43 AM, David Herrmann wrote:
>>>> Hi
>>>>
>>>> On Fri, Jul 10, 2015 at 3:25 PM, Stephen Smalley
>>>> wrote:
>>>>> On 07/09/2015 06:22 PM, David Herrmann wrote:
>>>>>> To be clear, faking metadata has one use-case, and one use-case only:
>>>>>> dbus1 compatibility
>>>>>>
>>>>>> In dbus1, clients connect to a unix-socket placed in the file-system
>>>>>> hierarchy. To avoid breaking ABI for old clients, we support a
>>>>>> unix-kdbus proxy. This proxy is called systemd-bus-proxyd. It is
>>>>>> spawned once for each bus we proxy and simply remarshals messages
>>>>>> from the client to kdbus and vice versa.
>>>>> Is this truly necessary? Can't the distributions just update the
>>>>> client side libraries to use kdbus if enabled and be done with it?
>>>>> Doesn't this proxy undo many of the benefits of using kdbus in the
>>>>> first place?
>>>> We need binary compatibility to dbus1. There're millions of
>>>> applications and language bindings with dbus1 compiled in, which we
>>>> cannot suddenly break.
>>> So, are you saying that there are many applications that statically link
>>> the dbus1 library implementation (thus the distributions can't just push
>>> an updated shared library that switches from using the socket to using
>>> kdbus), and that many of these applications are third party applications
>>> not packaged by the distributions (thus the distributions cannot just do
>>> a mass rebuild to update these applications too)? Otherwise, I would
>>> think that the use of a socket would just be an implementation detail
>>> and you would be free to change it without affecting dbus1 library ABI
>>> compatibility.
>> Honestly? Yes. To bring up two examples off the bat, IIRC both Haskell
>> and Java have independent *implementations* of the dbus1 protocol, not
>> reusing the reference library at all - Haskell isn't technically
>> statically linked, but its ABI hashing stuff means it's the next best
>> thing, and both it and Java are often managed outside the PM because for
>> various reasons (in the case of Haskell, lots of tiny packages with lots
>> of frequent releases make packagers cry until they find a way of
>> automating it).
>
> There is absolutely no reason to expect that these two examples don't have
> native kdbus implementations in the works already.

The Haskell one, at least, does not. I checked.

> That's the risk you take when you eschew the "standard" libraries.
> Further, the primary reason that developers deviate from the norm is (you
> guessed it!) performance.

Or, you know, avoiding the hassle of building and/or linking to code in
another language via FFI. That's my recall of the primary reason for the
Haskell one - and I don't think it's any coincidence that the two pure
reimplementations are in managed-but-compiled languages.

> The proxy is going to kill (or at least be assumed to kill) that
> advantage, putting even more pressure on these deviant applications to
> provide native kdbus versions.

...sure, if performance were the object. But it went through the old D-Bus
daemon either way, so I'm rather dubious of your assertion - whether due to
being in userspace or just poor implementation, it's no speed daemon so to
speak.

> Backward compatibility shims/libraries/proxies only work when it's the
> rare and unimportant case requiring it. If it's the common case, it won't
> work. If it's the important case, it won't work. If kdbus is worth the
> effort, make the effort.

They also work if they require no configuration or effort from the legacy
side, allowing those who need the (possibly rare *but also* important)
benefits of the new system to benefit without causing harm to others.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
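The remarshaling role described above for systemd-bus-proxyd - sitting between a legacy dbus1 client on a unix socket and the new bus, forwarding traffic in both directions - can be sketched as a generic bidirectional relay. This is a minimal illustrative model only (plain byte forwarding between two sockets, with invented names), not the actual proxy, which additionally translates message framing and credential metadata between the two transports:

```python
import selectors
import socket

def relay(client: socket.socket, bus: socket.socket) -> None:
    """Forward raw bytes in both directions until both sides close.

    A stand-in for the remarshaling loop: a real proxy would decode each
    dbus1 message here and re-encode it for the other transport, rather
    than copying bytes verbatim.
    """
    sel = selectors.DefaultSelector()
    peer = {client: bus, bus: client}
    sel.register(client, selectors.EVENT_READ)
    sel.register(bus, selectors.EVENT_READ)
    open_sides = 2
    while open_sides:
        for key, _ in sel.select():
            data = key.fileobj.recv(4096)
            if not data:
                # This side hit EOF: stop watching it and propagate the
                # half-close to the other side.
                sel.unregister(key.fileobj)
                peer[key.fileobj].shutdown(socket.SHUT_WR)
                open_sides -= 1
            else:
                peer[key.fileobj].sendall(data)
```

One proxy instance per proxied bus, as described, would amount to running one such loop per connection it accepts.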
Re: kdbus: credential faking
Stephen Smalley wrote:
> On 07/10/2015 09:43 AM, David Herrmann wrote:
>> Hi
>>
>> On Fri, Jul 10, 2015 at 3:25 PM, Stephen Smalley
>> wrote:
>>> On 07/09/2015 06:22 PM, David Herrmann wrote:
>>>> To be clear, faking metadata has one use-case, and one use-case only:
>>>> dbus1 compatibility
>>>>
>>>> In dbus1, clients connect to a unix-socket placed in the file-system
>>>> hierarchy. To avoid breaking ABI for old clients, we support a
>>>> unix-kdbus proxy. This proxy is called systemd-bus-proxyd. It is
>>>> spawned once for each bus we proxy and simply remarshals messages
>>>> from the client to kdbus and vice versa.
>>>
>>> Is this truly necessary? Can't the distributions just update the client
>>> side libraries to use kdbus if enabled and be done with it? Doesn't
>>> this proxy undo many of the benefits of using kdbus in the first place?
>>
>> We need binary compatibility to dbus1. There're millions of
>> applications and language bindings with dbus1 compiled in, which we
>> cannot suddenly break.
>
> So, are you saying that there are many applications that statically link
> the dbus1 library implementation (thus the distributions can't just push
> an updated shared library that switches from using the socket to using
> kdbus), and that many of these applications are third party applications
> not packaged by the distributions (thus the distributions cannot just do
> a mass rebuild to update these applications too)? Otherwise, I would
> think that the use of a socket would just be an implementation detail
> and you would be free to change it without affecting dbus1 library ABI
> compatibility.

Honestly? Yes. To bring up two examples off the bat, IIRC both Haskell and
Java have independent *implementations* of the dbus1 protocol, not reusing
the reference library at all - Haskell isn't technically statically linked,
but its ABI hashing stuff means it's the next best thing, and both it and
Java are often managed outside the PM for various reasons (in the case of
Haskell, lots of tiny packages with lots of frequent releases make
packagers cry until they find a way of automating it).
Re: [GIT PULL] kdbus for 4.1-rc1
Havoc Pennington wrote:
> Hi,
>
> On Fri, Apr 17, 2015 at 3:27 PM, James Bottomley
> wrote:
>>
>> This is why I think kdbus is a bad idea: it solidifies as a linux kernel
>> API something which runs counter to granular OS virtualization (and
>> something which caused Windows to fall behind Linux in the container
>> space). Splitting out the acceleration problem and leaving the rest to
>> user space currently looks fine because the ideas Al and Andy are
>> kicking around don't cause problems with OS virtualization.
>
> I'm interested in understanding this problem (if only for my own
> curiosity) but I'm not confident I understand what you're saying
> correctly.
>
> Can I try to explain back / ask questions and see what I have right?
>
> I think you are saying that if an application relies on a system
> service (= any other process that runs on the system bus) then to
> virtualize that app by itself in a dedicated container, the system bus
> and the system service need to also be in the container. So the
> container ends up with a bunch of stuff in it beyond only the
> application. Right / wrong / confused?
>
> I also think you're saying that userspace dbus has the same issue
> (this isn't a userspace vs. kernel thing per se), the objection to
> kdbus is that it makes this issue more solidified / harder to fix?
>
> Do you have ideas on how to go about fixing it, whether in userspace
> or kernel dbus?
>
> Havoc

So far as I understand (and this may be wrong), this is the use case of
kdbus "endpoints" - you'd create a (constrained) kdbus endpoint on the
host, and then expose it to the application, such that the application
uses it as if it were the system bus.
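As I understand it, the point of a constrained endpoint is that it carries a policy: the container's client talks to what it believes is the system bus, but only whitelisted peers are visible or reachable through it. Purely as a conceptual model - the class, names, and policy shape below are invented for illustration and are not the kdbus ioctl API - the filtering an endpoint applies could be pictured like this:

```python
# Toy model of a policy-restricted bus endpoint: the containerized client
# treats the endpoint as "the" system bus, but only whitelisted well-known
# names are discoverable or addressable through it. All names here are
# illustrative; real kdbus endpoints are created and configured via ioctls.
class Endpoint:
    def __init__(self, bus_names, allowed):
        self.bus_names = set(bus_names)  # names registered on the real bus
        self.allowed = set(allowed)      # policy attached to this endpoint

    def visible_names(self):
        """A client on this endpoint can only discover allowed names."""
        return sorted(self.bus_names & self.allowed)

    def send(self, destination, message):
        """Unicast is refused unless the destination is whitelisted."""
        if destination not in self.allowed:
            raise PermissionError("endpoint policy forbids " + destination)
        return (destination, message)
```

The host would create one such endpoint per container and bind-mount its node into the container's namespace, so nothing beyond the application itself needs to live inside the container.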
Re: [PATCH 0/3] dm-crypt: Adds support for wiping key when doing suspend/hibernation
Mike Snitzer wrote:

On Thu, Apr 16 2015 at 5:23am -0400, Alex Elsayed eternal...@gmail.com wrote:

Mike Snitzer wrote:

On Thu, Apr 09 2015 at 9:28am -0400, Pali Rohár pali.ro...@gmail.com wrote:

On Thursday 09 April 2015 09:12:08 Mike Snitzer wrote:

On Mon, Apr 06 2015 at 9:29am -0400, Pali Rohár pali.ro...@gmail.com wrote:

On Monday 06 April 2015 15:00:46 Mike Snitzer wrote:

On Sun, Apr 05 2015 at 1:20pm -0400, Pali Rohár pali.ro...@gmail.com wrote:

This patch series increase security of suspend and hibernate actions. It
allows user to safely wipe crypto keys before suspend and hibernate
actions starts without race conditions on userspace process with heavy
I/O.

To automatically wipe cryto key for device before hibernate action call:
$ dmsetup message device 0 key wipe_on_hibernation 1

To automatically wipe cryto key for device before suspend action call:
$ dmsetup message device 0 key wipe_on_suspend 1

(Value 0 after wipe_* string reverts original behaviour - to not wipe key)

Can you elaborate on the attack vector your changes are meant to protect
against? The user already authorized access, why is it inherently
dangerous to _not_ wipe the associated key across these events?

Hi, yes, I will try to explain current problems with cryptsetup
luksSuspend command and hibernation.

First, sometimes it is needed to put machine into other hands. You can
still watch other person what is doing with machine, but once if you let
machine unlocked (e.g opened luks disk), she/he can access encrypted data.

If you turn off machine, it could be safe, because luks disk devices are
locked. But if you enter machine into suspend or hibernate state luks
devices are still open. And my patches try to achieve similar security as
when machine is off (= no crypto keys in RAM or on swap).

When doing hibernate on unencrypted swap it is to prevent leaking crypto
keys to hibernate image (which is stored in swap).

When doing suspend action it is again to prevent leaking crypto keys. E.g
when you suspend laptop and put it off (somebody can remove RAMs and do
some cold boot attack).

The most common situation is: You have mounted partition from dm-crypt
device (e.g. /home/), some userspace processes access it (e.g opened
firefox which still reads/writes to cache ~/.firefox/) and you want to
drop crypto keys from kernel for some time.

For that operation there is command cryptsetup luksSuspend, which suspend
dm device and then tell kernel to wipe crypto keys. All I/O operations are
then stopped and userspace processes which want to do some those I/O
operations are stopped too (until you call cryptsetup luksResume and enter
correct key).

Now if you want to suspend/hiberate your machine (when some of dm devices
are suspeneded and some processes are stopped due to pending I/O) it is
not possible. Kernel freeze_processes function will fail because userspace
processes are still stopped inside some I/O syscall (read/write, etc,...).
My patches fixes this problem and do those operations (suspend dm device,
wipe crypto keys, enter suspend/hiberate) in correct order and without
race condition. dm device is suspended *after* userspace processes are
freezed and after that are crypto keys wiped. And then computer/laptop
enters into suspend/hibernate state.

Wouldn't it be better to fix freeze_processes() to be tolerant of
processes that are hung as a side-effect of their backing storage being
suspended? A hibernate shouldn't fail simply because a user chose to
suspend a DM device. Then this entire problem goes away and the key can be
wiped from userspace (like you said above).

Still there will be race condition. Before hibernation (and device
poweroff) we should have synced disks and filesystems to prevent data lose
(or other damage) as more as we can. And if there will be some application
which using lot of I/O (e.g normal firefox) then there always will be race
condtion.

The DM suspend will take care of flushing any pending I/O. So I don't see
where the supposed race is... Anything else that is trapped in userspace
memory will be there when the machine resumes.

So proper way is to wipe luks crypto keys *after* userspace processes are
freezed.

I know you believe that. I'm just not accepting that at face value.

Um, pardon me if I'm being naive, but what about the case of hibernation
where the swapdev and the root device are both LVs on the same dm_crypt
device? The kernel is writing to swap _after_ userspace processes are all
frozen; that seems to me like an ordering dependency entirely incompatible
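The disagreement above is purely about ordering: wiping the key from userspace before the freeze leaves tasks blocked mid-syscall on the suspended device, so the kernel itself has to sequence freeze, DM suspend, key wipe, and sleep. A toy model of that claim - every step name below is invented for illustration; the real work happens in dm-crypt's suspend and hibernation hooks - makes the two orderings explicit:

```python
# Toy model of the ordering the wipe_on_hibernation patches enforce.
# Step names are illustrative; only the relative order matters: userspace
# is frozen before the dm device is suspended and the key is wiped, so no
# task is left blocked inside a read/write syscall on the locked device.
def hibernate_with_wipe(log):
    log.append("freeze_processes")  # quiesce userspace first
    log.append("dm_suspend")        # dm-crypt flushes pending I/O
    log.append("wipe_key")          # key leaves RAM only after the flush
    log.append("enter_hibernate")   # image is written with no key present
    return log

def racy_userspace_wipe(log):
    # The problematic order: luksSuspend from userspace before the freeze
    # leaves processes stopped in I/O syscalls, so the freeze fails.
    log.append("dm_suspend")
    log.append("wipe_key")
    log.append("freeze_processes_FAILS")
    return log
```

Alex's closing objection fits the same frame: if swap itself sits on the dm-crypt device, the "wipe_key" step cannot precede the kernel's own writes to swap during image writeout.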
Re: Linux XIA - merge proposal
Michel Machado wrote:
> Hi there,
>
> We have been developing Linux XIA, a new network stack that
> emphasizes evolvability and interoperability, for a couple of years, and
> it has now reached a degree of maturity that allows others to experiment
> with it.

From looking at your wiki, "network stack" may have been a poor choice of
term - it looks like rather than being a new network stack (which in Linux
is commonly used to refer to the software stack that lives between the
APIs and the hardware), this is a new protocol (and framework _for_
protocols) operating at the same level of the network as IP, with ideas
extending upwards through TCP.

Now, that's a rather different proposal - witness that RDS, TIPC, etc. all
made it into the kernel relatively easily, especially when compared to
netmap, or any other system that tried to replace the Linux networking
infrastructure.
Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)
Al Viro wrote:
> On Tue, Feb 03, 2015 at 07:01:50PM +0100, Alexander Holler wrote:
>
>> Yeah, as I've already admitted in the bug, I never should have use
>> the word secure, because everyone nowadays seems to end up in panic
>> when reading that word.
>>
>> So, if I would be able to use sed on my mails, I would replace
>> unlinkat_s() with unlinkat_w() (for wipe) or would say that _s does
>> stand for 'shred' in the means of shred(1).
>
> TBH, I suspect that the saner API would be something like
> EXT2_IOC_[SG]ETFLAGS, allowing to set and query that along with other
> flags (append-only, etc.).
>
> Forget about unlink; first of all, whatever API you use should only _mark_
> the inode as "zero freed blocks" (or trim, for that matter). You can't
> force freeing of an inode, so either you make sure that subsequent freeing
> of inode, whenever it happens, will do that work, or your API is
> hopelessly racy. Moreover, when link has been removed it's too late to
> report that fs has no way to e.g. trim those blocks, so you really want
> to have it done _before_ the actual link removal. And if the file
> contents is that sensitive, you'd better extend the same protection to
> all operations that free its blocks, including truncate(), fallocate()
> hole-punching, whatever. What's more, if you divorce that from link
> removal, you probably don't want it as in-core-only flag - have it stored
> in inode, if fs supports that.
>
> Alternatively, you might want to represent it as xattr - as much as I hate
> those, it might turn out to be the best fit in this case, if we end up
> with several variants for freed blocks disposal. Not sure...
>
> But whichever way we represent that state, IMO
> a) operation should be similar to chmod/chattr/setfattr - modifying
> inode metadata.
> b) it should affect _all_ operations freeing blocks of that file
> from that point on
> c) it should be able to fail, telling you that you can't do that for
> this backing store.

Well, chattr already has +s which means exactly this. It's just not
respected by... anything. The 0/5 mentioned it, albeit briefly.
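For reference, the `chattr +s` bit sits in the inode flags word that the EXT2_IOC_[SG]ETFLAGS (a.k.a. FS_IOC_GETFLAGS/FS_IOC_SETFLAGS) ioctls already expose, as FS_SECRM_FL alongside append-only and immutable. A small decoder for a few of those bits - the constants below match linux/fs.h, but treat this as an illustrative sketch rather than an exhaustive list:

```python
# A few inode flag bits from linux/fs.h, as returned by the
# FS_IOC_GETFLAGS / EXT2_IOC_GETFLAGS ioctl. Not an exhaustive list.
FS_SECRM_FL     = 0x00000001  # secure deletion -- chattr +s
FS_UNRM_FL      = 0x00000002  # undelete -- chattr +u
FS_IMMUTABLE_FL = 0x00000010  # chattr +i
FS_APPEND_FL    = 0x00000020  # append-only -- chattr +a

_FLAG_NAMES = {
    FS_SECRM_FL: "secure-deletion",
    FS_UNRM_FL: "undelete",
    FS_IMMUTABLE_FL: "immutable",
    FS_APPEND_FL: "append-only",
}

def decode_inode_flags(flags: int) -> set:
    """Name the known bits in an inode flags word."""
    return {name for bit, name in _FLAG_NAMES.items() if flags & bit}
```

The complaint in the thread is precisely that FS_SECRM_FL can be set and queried this way today, yet no mainline filesystem acts on it when the file's blocks are actually freed.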
Re: [PATCH 00/12] Add kdbus implementation
Andy Lutomirski wrote:
> There should be a number measured in, say, nanoseconds in here
> somewhere. The actual extent of the speedup is unmeasurable here.
> Also, it's worth reading at least one of Linus' many rants about
> zero-copy. It's not an automatic win.

It's well-understood that it's not an automatic win; significant testing
on multiple architectures indicated that 512K is a surprisingly universal
crossover point. The userspace code, therefore, switches from copying
(normal kdbus parameters) to zero-copy (memfds) right around there.
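The strategy switch described above amounts to a size test on the send path. A hedged sketch with invented names - the real decision lives in the kdbus userspace library, and the 512K figure is the empirical crossover quoted above, not a universal constant:

```python
# Illustration of the copy vs. zero-copy decision described above.
# Function and constant names are invented; the actual logic lives in
# the kdbus userspace code.
CROSSOVER = 512 * 1024  # empirically reported copy/zero-copy crossover, bytes

def pick_transfer(payload_size: int) -> str:
    """Below the crossover, copying the payload into the message is
    cheaper than the page-table and sealing work zero-copy costs;
    at or above it, passing a sealed memfd wins."""
    return "inline-copy" if payload_size < CROSSOVER else "memfd-zero-copy"
```

This is why zero-copy is "not an automatic win": for small payloads the fixed cost of setting up the shared mapping dominates, and a plain copy is faster.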
Re: [PATCH 4/4] target: Add a user-passthrough backstore
Nicholas A. Bellinger wrote:
> On Fri, 2014-09-19 at 14:43 -0700, Alex Elsayed wrote:
>> Nicholas A. Bellinger wrote:
>>
>> > So the idea of allowing the in-kernel CDB emulation to run after
>> > user-space has returned unsupported opcode is problematic for a couple
>> > of different reasons.
>> >
>> > First, if the correct feature bits in standard INQUIRY + EVPD INQUIRY,
>> > etc. are not populated by user-space to match what in-kernel CDB
>> > emulation actually supports, this can result in potentially undefined
>> > results initiator side.
>> >
>> > Also for example, if user-space decided to emulate only a subset of PR
>> > operations, and leaves the rest of it up to the in-kernel emulation,
>> > there's not really a way to sync current state between the two, which
>> > also can end up with undefined results.
>> >
>> > So that said, I think a saner approach would be to define two modes of
>> > operation for TCMU:
>> >
>> >  *) Passthrough Mode: All CDBs are passed to user-space, and no
>> >     in-kernel emulation is done in the event of an unsupported
>> >     opcode response.
>> >
>> >  *) I/O Mode: I/O related CDBs are passed into user-space, but
>> >     all control CDBs continue to be processed by in-kernel emulation.
>> >     This effectively limits operation to TYPE_DISK, but with this
>> >     mode it's probably OK to assume this.
>> >
>> > This seems like the best trade-off between flexibility when everything
>> > should be handled by user-space, vs. functionality when only block
>> > remapping of I/O is occurring in user-space code.
>>
>> The problem there is that the first case has all the issues of pscsi and
>> simply becomes a performance optimization over tgt+iscsi client+pscsi,
>> and the latter case excludes the main use cases I'm interested in -
>> OSDs, media changers, optical discs (the biggest thing for me), and so
>> forth.
>>
>> One of the main things I want to do with this is hook up a plugin that
>> uses libmirage to handle various optical disc image formats.
> Not sure I follow.. How does the proposed passthrough mode prevent
> someone from emulating OSDs, media changers, optical disks or anything
> else in userspace with TCMU..?
>
> The main thing that the above comments highlight is why attempting to
> combine the existing in-kernel emulation with a userspace backend
> providing its own emulation can open up a number of problems with
> mismatched state between the two.

It doesn't prevent it, but it _does_ put it in the exact same place as PSCSI regarding the warnings on the wiki. It means that if someone wants to implement (say) the optical disc or OSD CDBs, they then lose out on ALUA unless they implement it themselves - which seems unnecessary and painful, since those should really be disjoint. In particular, an OSD backed by RADOS objects could be a very nice thing indeed, _and_ could really benefit from ALUA.
Re: [PATCH v3 RESEND] zram: auto add new devices on demand
Minchan Kim wrote:
> On Wed, Jul 30, 2014 at 10:58:31PM +0900, Sergey Senozhatsky wrote:
>> Hello,
>>
>> On (07/29/14 12:00), Minchan Kim wrote:
>> > Hello Timofey,
>> >
>> > Why do you add new device unconditionally?
>> > Maybe we need new knob on sysfs or ioctl for adding new device?
>> > Any thought, guys?
>>
>> speaking of the patch, frankly, I (almost) see no gain comparing to the
>> existing functionality.
>>
>> speaking of the idea. well, I'm not 100% convinced yet. the use cases I
>> see around do not imply dynamic creation/resizing/etc. that said, I
>> need to think about it.
>
> It didn't persuade me, either.
>
> Normally, distros have some config file for adding params at module
> load time, like /etc/modules. So, I think it should be done in there if
> someone wants to increase the number of zram devices.

The problem here is that this requires (at least) unloading the module, and if it was built in, requires a reboot (and futzing with the kernel command line, rather than /etc/modules.d).

If someone's distro already loaded the module with nr_devices=1 (the default, I remind you), and is using it as swap, then it may well not be a feasible option for them to swapoff the (potentially large) swap device and do the modprobe dance. If they're running off a livecd that's using it in combination with, say, LVM thin provisioning in order to have a writeable system, then they are _completely_ screwed, because you can't swapoff your rootfs. If they're using it as a backing store for ephemeral containers or VMs, then they may hit _any_ static limit when they just want to start one more without having to stop the existing bunch.

The swap case might be argued as "deal with it" (despite the fact that swapoff is not something fun to do on a system under any real-world load, _especially_ if what you're trying to force off of the swap device won't fit in RAM and has to get pushed down to a different swap device).
But any case where people put filesystems on the device should make the issues of only supporting the module parameter pretty apparent.

Finally, there's the issue of clutter - I may need 4 zram devices when I'm experimenting with something, but only the one swap device for daily use. Having the other three just sitting around permanently is at most an annoyance, a 'papercut' - but papercuts add up.

>> if we end up adding this functionality I tend to vote for sysfs knob,
>> just because it seems to be more user friendly than writing some magic
>> INTs to an ioctl-d fd.

This I agree with wholeheartedly.
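The sysfs knob being argued for could look something like the sketch below. To be clear, the control-node layout (a zram-control directory where reading hot_add allocates a device and returns its id, and writing an id to hot_remove tears it down) is purely illustrative of the proposal, not an existing kernel interface:

```python
from pathlib import Path

# Illustrative layout only -- one possible shape for the sysfs knob
# discussed above, not something the kernel currently provides.
ZRAM_CTRL = Path("/sys/class/zram-control")


def hot_add(ctrl: Path = ZRAM_CTRL) -> int:
    """Reading hot_add would allocate a new zram device and return its
    id, so /dev/zramN appears without touching module parameters."""
    return int((ctrl / "hot_add").read_text())


def hot_remove(dev_id: int, ctrl: Path = ZRAM_CTRL) -> None:
    """Writing an id to hot_remove would free that device again."""
    (ctrl / "hot_remove").write_text(str(dev_id))
```

The point of the read-side allocation is that it's race-free: two concurrent readers each get a distinct id back, with no separate "pick a free number" step in userspace.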
Re: Reading large amounts from /dev/urandom broken
Hannes Frederic Sowa wrote:
> On Mi, 2014-07-23 at 11:14 -0400, Theodore Ts'o wrote:
>> On Wed, Jul 23, 2014 at 04:52:21PM +0300, Andrey Utkin wrote:
>> > Dear developers, please check bugzilla ticket
>> > https://bugzilla.kernel.org/show_bug.cgi?id=80981 (not the initial
>> > issue, but starting with comment #3).
>> >
>> > Reading from /dev/urandom gives EOF after 33554431 bytes. I believe
>> > it was introduced by commit 79a8468747c5f95ed3d5ce8376a3e82e0c5857fc,
>> > with the chunk
>> >
>> > nbytes = min_t(size_t, nbytes, INT_MAX >> (ENTROPY_SHIFT + 3));
>> >
>> > which is described in the commit message as an "additional paranoia
>> > check to prevent overly large count values to be passed into
>> > urandom_read()".
>> >
>> > I don't know why people pull such large amounts of data from urandom,
>> > but given that today there are two bug reports regarding problems
>> > doing that, I consider that this is practiced.
>>
>> I've inquired on the bugzilla why the reporter is abusing urandom in
>> this way. The other commenter on the bug replicated the problem, but
>> that's not a "second bug report" in my book.
>>
>> At the very least, this will probably cause me to insert a warning
>> printk: "insane user of /dev/urandom: [current->comm] requested %d
>> bytes" whenever someone tries to request more than 4k.
>
> Ok, I would be fine with that.
>
> The dd if=/dev/urandom of=random_file.dat case seems reasonable to me
> to try to not break. But, of course, there are other possibilities.

Personally, I'd say that _is_ insane - reading from urandom still consumes entropy (causing readers of /dev/random to block more often); when alternatives (such as dd'ing to dm-crypt) both avoid the issue _and_ are faster, then it should very well be considered pathological.
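Incidentally, for anyone who genuinely needs bulk data from urandom, looping over bounded reads sidesteps both the 33554431-byte cap and ordinary short reads; a quick sketch:

```python
def read_urandom(n: int, chunk: int = 1 << 20) -> bytes:
    """Read n bytes from /dev/urandom in bounded chunks, retrying on
    short reads, rather than issuing one huge read() that can hit the
    kernel's per-call cap."""
    out = bytearray()
    # buffering=0 so each f.read() maps to one read(2) syscall
    with open("/dev/urandom", "rb", buffering=0) as f:
        while len(out) < n:
            r = f.read(min(chunk, n - len(out)))
            if not r:
                raise IOError("unexpected EOF from /dev/urandom")
            out += r
    return bytes(out)
```

This is also what dd with a sane bs= effectively does; only a single gigantic read() ever sees the cap.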
Re: [RFC 1/2] target: Add documentation on the target userspace pass-through driver
Reply inline, with a good bit of snipping done (posting via gmane, so quote/content ratio is an issue).

Andy Grover wrote:
> +These backstores cover the most common use cases, but not all. One new
> +use case that other non-kernel target solutions, such as tgt, are able
> +to support is using Gluster's GLFS or Ceph's RBD as a backstore. The
> +target then serves as a translator, allowing initiators to store data
> +in these non-traditional networked storage systems, while still only
> +using standard protocols themselves.

Another use case is in supporting various image formats, like (say) qcow2, and then handing those off to vhost_scsi.

> +Benefits:
> +
> +In addition to allowing relatively easy support for RBD and GLFS, TCMU
> +will also allow easier development of new backstores. TCMU combines
> +with the LIO loopback fabric to become something similar to FUSE
> +(Filesystem in Userspace), but at the SCSI layer instead of the
> +filesystem layer. A SUSE, if you will.

As long as people don't start calling it L[UNs in ]USER[space] :P

Between that and ABUSE (A Block device in USErspace), this domain has some real naming potential...

> +Device Discovery:
> +
> +Other devices may be using UIO besides TCMU. Unrelated user processes
> +may also be handling different sets of TCMU devices. TCMU userspace
> +processes must find their devices by scanning sysfs
> +class/uio/uio*/name. For TCMU devices, these names will be of the
> +format:
> +
> +tcm-user/<subtype>/<path>
> +
> +where "tcm-user" is common for all TCMU-backed UIO devices. <subtype>
> +will be a userspace-process-unique string to identify the TCMU device
> +as expecting to be backed by a certain handler, and <path> will be an
> +additional handler-specific string for the user process to configure
> +the device, if needed. Neither <subtype> nor <path> can contain ':',
> +due to LIO limitations.

It might be good to change this somewhat; in the vast majority of cases it'd be saner for userspace programs to figure this information out via udev etc.
rather than parsing sysfs themselves. This information is still worth documenting, but saying things like "must find their devices by scanning sysfs" is likely to lead to users of this interface making suboptimal choices.

> +Device Events:
> +
> +If a new device is added or removed, user processes will receive a HUP
> +signal, and should re-scan sysfs. File descriptors for devices no
> +longer in sysfs should be closed, and new devices should be opened and
> +handled.

Is there a cleaner way to do this? In particular, re-scanning sysfs may cause race conditions (device removed, one of the same name re-added but a different UIO device node; probably more to be found). Perhaps recommend netlink uevents, so that remove+add is noticeable?

Also, is the SIGHUP itself the best option? Could we simply require the user process to listen for add/remove uevents to get such change notifications, and thus enforce good behavior?

> +Writing a user backstore handler:
> +
> +Variable emulation with pass_level:
> +
> +TCMU supports a "pass_level" option with valid values of 1, 2, or
> +3. This controls how many different SCSI commands are passed up,
> +versus being emulated by LIO. The purpose of this is to give the user
> +handler author a choice of how much of the full SCSI command set they
> +care to support.
> +
> +At level 1, only READ and WRITE commands will be seen. At level 2,
> +additional commands defined in the SBC SCSI specification such as
> +WRITE SAME, SYNCHRONIZE CACHE, and UNMAP will be passed up. Finally,
> +at level 3, almost all commands defined in the SPC SCSI specification
> +will also be passed up for processing by the user handler.

One use case I'm actually interested in is having userspace provide something other than just SPC - for instance, tgt can provide a virtual tape library or an OSD, and CDemu can provide emulated optical discs from various image formats.
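(To make the pass_level split quoted above concrete, here is a sketch of the dispatch it implies. The opcode values are real SCSI opcodes from SBC/SPC; the helper and the per-level sets are my own framing, not TCMU code.)

```python
# Real SCSI opcodes (from the SBC/SPC command sets).
READ_10, WRITE_10 = 0x28, 0x2A
SYNCHRONIZE_CACHE_10, WRITE_SAME_16, UNMAP = 0x35, 0x93, 0x42
INQUIRY, TEST_UNIT_READY = 0x12, 0x00

# Which opcodes reach the user handler at each pass_level; level 3 is
# "almost everything" and is handled as a catch-all below.
LEVEL_OPCODES = {
    1: {READ_10, WRITE_10},
    2: {READ_10, WRITE_10, SYNCHRONIZE_CACHE_10, WRITE_SAME_16, UNMAP},
}


def passed_to_userspace(opcode: int, pass_level: int) -> bool:
    """True if this CDB would reach the user handler; otherwise LIO's
    in-kernel emulation answers it."""
    if pass_level >= 3:
        return True  # nearly all SPC control CDBs pass up too
    return opcode in LEVEL_OPCODES.get(pass_level, set())
```

A hypothetical "level 0/4" bitmap mode, as suggested below, would just replace the fixed sets with a handler-supplied one.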
Currently, CDemu uses its own out-of-tree driver called VHBA (Virtual Host Bus Adapter) to do pretty much exactly what TCMU+Loopback would accomplish... and in the process misses out on all of the other fabrics, unless you're willing to _re-import_ those devices using PSCSI, which has its own quirks.

Perhaps there could be a level 0 (or 4, or whatever) which means "explicitly enabled list of commands" - maybe as a bitmap that could be passed to the kernel somehow? Hopefully, that could also avoid some of the quirks of PSCSI regarding ALUA and such - if it's not implemented, leave the relevant bits at zero, and LIO handles it.

This does look really nice, thanks for writing it!
Re: Thoughts on credential switching
Jeff Layton wrote: > On Wed, 26 Mar 2014 20:25:35 -0700 > Jeff Layton wrote: > >> On Wed, 26 Mar 2014 20:05:16 -0700 >> Andy Lutomirski wrote: >> >> > On Wed, Mar 26, 2014 at 7:48 PM, Jeff Layton >> > wrote: >> > > On Wed, 26 Mar 2014 17:23:24 -0700 >> > > Andy Lutomirski wrote: >> > > >> > >> Hi various people who care about user-space NFS servers and/or >> > >> security-relevant APIs. >> > >> >> > >> I propose the following set of new syscalls: >> > >> >> > >> int credfd_create(unsigned int flags): returns a new credfd that >> > >> corresponds to current's creds. >> > >> >> > >> int credfd_activate(int fd, unsigned int flags): Change current's >> > >> creds to match the creds stored in fd. To be clear, this changes >> > >> both the "subjective" and "objective" (aka real_cred and cred) >> > >> because there aren't any real semantics for what happens when >> > >> userspace code runs with real_cred != cred. >> > >> >> > >> Rules: >> > >> >> > >> - credfd_activate fails (-EINVAL) if fd is not a credfd. >> > >> - credfd_activate fails (-EPERM) if the fd's userns doesn't >> > >> match current's userns. credfd_activate is not intended to be a >> > >> substitute for setns. >> > >> - credfd_activate will fail (-EPERM) if LSM does not allow the >> > >> switch. This probably needs to be a new selinux action -- >> > >> dyntransition is too restrictive. >> > >> >> > >> >> > >> Optional: >> > >> - credfd_create always sets cloexec, because the alternative is >> > >> silly. >> > >> - credfd_activate fails (-EINVAL) if dumpable. This is because >> > >> we don't want a privileged daemon to be ptraced while >> > >> impersonating someone else. >> > >> - optional: both credfd_create and credfd_activate fail if >> > >> !ns_capable(CAP_SYS_ADMIN) or perhaps !capable(CAP_SETUID). >> > >> >> > >> The first question: does this solve Ganesha's problem? >> > >> >> > >> The second question: is this safe? I can see two major concerns. 
>> > >> The bigger concern is that having these syscalls available will >> > >> allow users to exploit things that were previously secure. For >> > >> example, maybe some configuration assumes that a task running as >> > >> uid==1 can't switch to uid==2, even with uid 2's consent. >> > >> Similar issues happen with capabilities. If CAP_SYS_ADMIN is not >> > >> required, then this is no longer really true. >> > >> >> > >> Alternatively, something running as uid == 0 with heavy >> > >> capability restrictions in a mount namespace (but not a uid >> > >> namespace) could pass a credfd out of the namespace. This could >> > >> break things like Docker pretty badly. CAP_SYS_ADMIN guards >> > >> against this to some extent. But I think that Docker is already >> > >> totally screwed if a Docker root task can receive an O_DIRECTORY >> > >> or O_PATH fd out of the container, so it's not entirely clear >> > >> that the situation is any worse, even without requiring >> > >> CAP_SYS_ADMIN. >> > >> >> > >> The second concern is that it may be difficult to use this >> > >> correctly. There's a reason that real_cred and cred exist, but >> > >> it's not really well set up for being used. >> > >> >> > >> As a simple way to stay safe, Ganesha could only use credfds that >> > >> have real_uid == 0. >> > >> >> > >> --Andy >> > > >> > > >> > > I still don't quite grok why having this special credfd_create >> > > call buys you anything over simply doing what Al had originally >> > > suggested -- switch creds using all of the different syscalls and >> > > then simply caching that in a "normal" fd: >> > > >> > > fd = open("/dev/null", O_PATH...); >> > > >> > > ...it seems to me that the credfd_activate call will still need to >> > > do the same permission checking that all of the individual >> > > set*id() calls require (and all of the other stuff like changing >> > > selinux contexts, etc). 
>> > > >> > > IOW, this fd is just a "handle" for passing around a struct cred, >> > > but I don't see why having access to that handle would allow you >> > > to do something you couldn't already do anyway. >> > > >> > > Am I missing something obvious here? >> > >> > Not really. I think I didn't adequately explain a piece of this. >> > >> > I think that what you're suggesting is for an fd to encode a set of >> > credentials but not to grant permission to use those credentials. >> > So switch_creds(fd) is more or less the same thing as >> > switch_creds(ruid, euid, suid, rgid, egid, sgid, groups, mac >> > label, ...). switch_creds needs to verify that the caller can >> > dyntransition to the label, set all the ids, etc., but it avoids >> > allocating anything and running RCU callbacks. >> > >> > The trouble with this is that the verification needed is complicated >> > and expensive. And I think that my proposal is potentially more >> > useful. >> > >> >> Is it really though? My understanding of the problem was that it was >> the syscall (context switching) overhead + having to do a bunch of RCU >> critical
Re: Thoughts on credential switching
Jeff Layton wrote: On Wed, 26 Mar 2014 20:25:35 -0700 Jeff Layton jlay...@redhat.com wrote: On Wed, 26 Mar 2014 20:05:16 -0700 Andy Lutomirski l...@amacapital.net wrote: On Wed, Mar 26, 2014 at 7:48 PM, Jeff Layton jlay...@redhat.com wrote: On Wed, 26 Mar 2014 17:23:24 -0700 Andy Lutomirski l...@amacapital.net wrote: Hi various people who care about user-space NFS servers and/or security-relevant APIs. I propose the following set of new syscalls: int credfd_create(unsigned int flags): returns a new credfd that corresponds to current's creds. int credfd_activate(int fd, unsigned int flags): Change current's creds to match the creds stored in fd. To be clear, this changes both the subjective and objective (aka real_cred and cred) because there aren't any real semantics for what happens when userspace code runs with real_cred != cred. Rules: - credfd_activate fails (-EINVAL) if fd is not a credfd. - credfd_activate fails (-EPERM) if the fd's userns doesn't match current's userns. credfd_activate is not intended to be a substitute for setns. - credfd_activate will fail (-EPERM) if LSM does not allow the switch. This probably needs to be a new selinux action -- dyntransition is too restrictive. Optional: - credfd_create always sets cloexec, because the alternative is silly. - credfd_activate fails (-EINVAL) if dumpable. This is because we don't want a privileged daemon to be ptraced while impersonating someone else. - optional: both credfd_create and credfd_activate fail if !ns_capable(CAP_SYS_ADMIN) or perhaps !capable(CAP_SETUID). The first question: does this solve Ganesha's problem? The second question: is this safe? I can see two major concerns. The bigger concern is that having these syscalls available will allow users to exploit things that were previously secure. For example, maybe some configuration assumes that a task running as uid==1 can't switch to uid==2, even with uid 2's consent. Similar issues happen with capabilities. 
>>>> If CAP_SYS_ADMIN is not required, then this is no longer really true. Alternatively, something running as uid == 0 with heavy capability restrictions in a mount namespace (but not a uid namespace) could pass a credfd out of the namespace. This could break things like Docker pretty badly. CAP_SYS_ADMIN guards against this to some extent. But I think that Docker is already totally screwed if a Docker root task can receive an O_DIRECTORY or O_PATH fd out of the container, so it's not entirely clear that the situation is any worse, even without requiring CAP_SYS_ADMIN.
>>>>
>>>> The second concern is that it may be difficult to use this correctly. There's a reason that real_cred and cred exist, but it's not really well set up for being used. As a simple way to stay safe, Ganesha could only use credfds that have real_uid == 0.
>>>>
>>>> --Andy
>>>
>>> I still don't quite grok why having this special credfd_create call buys you anything over simply doing what Al had originally suggested -- switch creds using all of the different syscalls and then simply caching that in a normal fd:
>>>
>>> fd = open(/dev/null, O_PATH...);
>>>
>>> ...it seems to me that the credfd_activate call will still need to do the same permission checking that all of the individual set*id() calls require (and all of the other stuff like changing selinux contexts, etc). IOW, this fd is just a handle for passing around a struct cred, but I don't see why having access to that handle would allow you to do something you couldn't already do anyway. Am I missing something obvious here?
>>
>> Not really. I think I didn't adequately explain a piece of this. I think that what you're suggesting is for an fd to encode a set of credentials but not to grant permission to use those credentials. So switch_creds(fd) is more or less the same thing as switch_creds(ruid, euid, suid, rgid, egid, sgid, groups, mac label, ...).
>> switch_creds needs to verify that the caller can dyntransition to the label, set all the ids, etc., but it avoids allocating anything and running RCU callbacks. The trouble with this is that the verification needed is complicated and expensive. And I think that my proposal is potentially more useful.
>
> Is it really though? My understanding was that the syscall (context switching) overhead + having to do a bunch of RCU critical stuff was the problem. If we can do all of this in the context of a single RCU critical section, isn't that still a win?
>
> As to the complicated part... maybe, but it doesn't seem like it would have to be. We could simply return -EINVAL or something if the old struct cred doesn't have fields that match the ones we're replacing and that we don't expect to see changed.

A credfd is like a
Re: [QUERY] lguest64
Ramkumar Ramachandra wrote:
> H. Peter Anvin wrote:
>> UML, lguest and Xen were done before the x86 architecture supported hardware virtualization.
>
> [...]
>
>> but on KVM-enabled hardware KVM seems like the better option (and is indeed what libguestfs uses.)
>
> While we're still on the topic, I'd like a few clarifications. From your reply, I got the impression that KVM is the only mechanism for non-pvops virtualization. This seems quite contrary to what I read on LWN about ARM virtualization [1]. In short, ARM provides a "hypervisor mode", and the article says
>
> "the virtualization model provided by ARM fits the Xen hypervisor-based virtualization better than KVM's kernel-based model"
>
> The Xen people call this "ARM PVH" (as opposed to ARM PV, which does not utilize hardware extensions) [2]. Although I wasn't able to find much information about the hardware aspect, what ARM provides seems to be quite different from VT-x and AMD-V. I'm also confused about what virt/kvm/arm is.
>
> Thanks.
>
> [1]: http://lwn.net/Articles/513940/
> [2]: http://www.xenproject.org/developers/teams/arm-hypervisor.html

ARM's virtualization extensions may be a more *natural* match to Xen's semantics and architecture, but that doesn't mean that KVM can't use them. LWN explains the details far better than I can: https://lwn.net/Articles/557132/

virt/kvm/arm is an implementation of KVM (the API) that takes advantage of ARM's virtualization extensions.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [ANNOUNCE] target-pending tree updated to v3.6-rc2
On Thursday, August 16, 2012 11:49:44 PM Nicholas A. Bellinger wrote:
> On Thu, 2012-08-16 at 19:43 -0700, Nicholas A. Bellinger wrote:
>> Hi all,
>>
>> With the release of v3.6-rc2 this afternoon, the target-pending.git tree has now been updated using the freshly cut -rc2 as its new HEAD. Patches destined for 3.7 are now being added into for-next for linux-next build testing.
>>
>> Also, thanks go out to MST, Stefan, and Paolo, who managed to get tcm_vhost fabric code merged in mainline for -rc2 under the special post-merge-window exception rule for new kernel code, which in my experience tends to happen about as often as Halley's comet comes around..
>>
>> Here is the current breakdown of pending patches by branch:
>>
>> master: (patches headed for 3.6-rc-fixes)
>>
>> 5b7517f8 tcm_vhost: Change vhost_scsi_target->vhost_wwpn to char *
>> d0e27c88 target: fix NULL pointer dereference bug alloc_page() fails to get memory
>> 1fa8f450 tcm_fc: Avoid debug overhead when not debugging
>> 101998f6 tcm_vhost: Post-merge review changes requested by MST
>> f0e0e9bb tcm_vhost: Fix incorrect IS_ERR() usage in vhost_scsi_map_iov_to_sgl
>
> Whoops.. Missed one extra target/pscsi regression bug-fix reported recently by Alex Elsayed (CC'ed) that has just been pushed into master here:
>
> target/pscsi: Fix bug with REPORT_LUNs handling for SCSI passthrough
> http://git.kernel.org/?p=linux/kernel/git/nab/target-pending.git;a=commitdiff;h=1d2a2cd95ee0137a2353d1b5635739c281f27cd4
>
> Alex-E, you were able to get TYPE_ROM passthrough w/ pSCSI export working on your setup with this patch, yes..?
>
> Thanks!
>
> --nab

The patch got it to where XP and Windows 7 would recognize it and mount it, but there's still some weirdness - for instance, opening it in the file browser hangs for an extended period of time before it starts working. No noise in dmesg when that happens, and I haven't had time to look deeper.