Re: kdbus: credential faking

2015-07-10 Thread Alex Elsayed
Casey Schaufler wrote:

> On 7/10/2015 7:57 AM, Alex Elsayed wrote:
>> Stephen Smalley wrote:
>>
>>> On 07/10/2015 09:43 AM, David Herrmann wrote:
>>>> Hi
>>>>
>>>> On Fri, Jul 10, 2015 at 3:25 PM, Stephen Smalley 
>>>> wrote:
>>>>> On 07/09/2015 06:22 PM, David Herrmann wrote:
>>>>>> To be clear, faking metadata has one use-case, and one use-case only:
>>>>>> dbus1 compatibility
>>>>>>
>>>>>> In dbus1, clients connect to a unix-socket placed in the file-system
>>>>>> hierarchy. To avoid breaking ABI for old clients, we support a
>>>>>> unix-kdbus proxy. This proxy is called systemd-bus-proxyd. It is
>>>>>> spawned once for each bus we proxy and simply remarshals messages
>>>>>> from the client to kdbus and vice versa.
>>>>> Is this truly necessary?  Can't the distributions just update the
>>>>> client
>>>>> side libraries to use kdbus if enabled and be done with it?  Doesn't
>>>>> this proxy undo many of the benefits of using kdbus in the first
>>>>> place?
>>>> We need binary compatibility to dbus1. There're millions of
>>>> applications and language bindings with dbus1 compiled in, which we
>>>> cannot suddenly break.
>>> So, are you saying that there are many applications that statically link
>>> the dbus1 library implementation (thus the distributions can't just push
>>> an updated shared library that switches from using the socket to using
>>> kdbus), and that many of these applications are third party applications
>>> not packaged by the distributions (thus the distributions cannot just do
>>> a mass rebuild to update these applications too)?  Otherwise, I would
>>> think that the use of a socket would just be an implementation detail
>>> and you would be free to change it without affecting dbus1 library ABI
>>> compatibility.
>> Honestly? Yes. To bring up two examples off the bat, IIRC both Haskell
>> and Java have independent *implementations* of the dbus1 protocol, not
>> reusing the reference library at all - Haskell isn't technically
>> statically linked, but its ABI hashing stuff means it's the next best
>> thing, and both it and Java are often managed outside the PM for
>> various reasons (in the case of Haskell, lots of tiny packages with
>> lots of frequent releases make packagers cry until they find a way of
>> automating it).
> 
> There is absolutely no reason to expect that these two examples don't have
> native kdbus implementations in the works already.

The Haskell one, at least, does not. I checked.

> That's the risk you take when you eschew the "standard" libraries.
> Further, the primary reason that developers deviate from the norm is (you
> guessed it!) performance.

Or, you know, avoiding the hassle of building and/or linking to code in 
another language via FFI. That's my recall of the primary reason for the 
Haskell one - and I don't think it's any coincidence that the two pure 
reimplementations are in managed-but-compiled languages.

> The proxy is going to kill (or at least be assumed to kill) that
> advantage, putting even more pressure on these deviant applications to
> provide native kdbus versions.

...sure, if performance was the object. But it went through the old D-Bus 
daemon either way, so I'm rather dubious of your assertion - whether due to 
being in userspace or just poor implementation, it's no speed daemon so to 
speak.

> Backward compatibility shims/libraries/proxies only work when it's the
> rare and unimportant case requiring it. If it's the common case, it won't
> work. If it's the important case, it won't work. If kdbus is worth the
> effort, make the effort.

They also work if they require no configuration or effort from the legacy 
side, allowing those who need the (possibly rare *but also* important) 
benefits of the new system to benefit without causing harm to others.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kdbus: credential faking

2015-07-10 Thread Alex Elsayed
Stephen Smalley wrote:

> On 07/10/2015 09:43 AM, David Herrmann wrote:
>> Hi
>> 
>> On Fri, Jul 10, 2015 at 3:25 PM, Stephen Smalley 
>> wrote:
>>> On 07/09/2015 06:22 PM, David Herrmann wrote:
 To be clear, faking metadata has one use-case, and one use-case only:
 dbus1 compatibility

 In dbus1, clients connect to a unix-socket placed in the file-system
 hierarchy. To avoid breaking ABI for old clients, we support a
 unix-kdbus proxy. This proxy is called systemd-bus-proxyd. It is
 spawned once for each bus we proxy and simply remarshals messages from
 the client to kdbus and vice versa.
>>>
>>> Is this truly necessary?  Can't the distributions just update the client
>>> side libraries to use kdbus if enabled and be done with it?  Doesn't
>>> this proxy undo many of the benefits of using kdbus in the first place?
>> 
>> We need binary compatibility to dbus1. There're millions of
>> applications and language bindings with dbus1 compiled in, which we
>> cannot suddenly break.
> 
> So, are you saying that there are many applications that statically link
> the dbus1 library implementation (thus the distributions can't just push
> an updated shared library that switches from using the socket to using
> kdbus), and that many of these applications are third party applications
> not packaged by the distributions (thus the distributions cannot just do
> a mass rebuild to update these applications too)?  Otherwise, I would
> think that the use of a socket would just be an implementation detail
> and you would be free to change it without affecting dbus1 library ABI
> compatibility.

Honestly? Yes. To bring up two examples off the bat, IIRC both Haskell and 
Java have independent *implementations* of the dbus1 protocol, not reusing 
the reference library at all - Haskell isn't technically statically linked, 
but its ABI hashing stuff means it's the next best thing, and both it and 
Java are often managed outside the PM for various reasons (in the case of 
Haskell, lots of tiny packages with lots of frequent releases make 
packagers cry until they find a way of automating it).






Re: [GIT PULL] kdbus for 4.1-rc1

2015-04-17 Thread Alex Elsayed
Havoc Pennington wrote:

> Hi,
> 
> On Fri, Apr 17, 2015 at 3:27 PM, James Bottomley
>  wrote:
>>
>> This is why I think kdbus is a bad idea: it solidifies as a linux kernel
>> API something which runs counter to granular OS virtualization (and
>> something which caused Windows to fall behind Linux in the container
>> space).  Splitting out the acceleration problem and leaving the rest to
>> user space currently looks fine because the ideas Al and Andy are
>> kicking around don't cause problems with OS virtualization.
>>
> 
> I'm interested in understanding this problem (if only for my own
> curiosity) but I'm not confident I understand what you're saying
> correctly.
> 
> Can I try to explain back / ask questions and see what I have right?
> 
> I think you are saying that if an application relies on a system
> service (= any other process that runs on the system bus) then to
> virtualize that app by itself in a dedicated container, the system bus
> and the system service need to also be in the container. So the
> container ends up with a bunch of stuff in it beyond only the
> application.  Right / wrong / confused?
> 
> I also think you're saying that userspace dbus has the same issue
> (this isn't a userspace vs. kernel thing per se), the objection to
> kdbus is that it makes this issue more solidified / harder to fix?
> 
> Do you have ideas on how to go about fixing it, whether in userspace
> or kernel dbus?
> 
> Havoc

So far as I understand (and this may be wrong), this is the use case of 
kdbus "endpoints" - you'd create a (constrained) kdbus endpoint on the host, 
and then expose it to the application, such that the application uses it as 
if it were the system bus.



Re: [PATCH 0/3] dm-crypt: Adds support for wiping key when doing suspend/hibernation

2015-04-17 Thread Alex Elsayed
Mike Snitzer wrote:

> On Thu, Apr 16 2015 at  5:23am -0400,
> Alex Elsayed eternal...@gmail.com wrote:
>
>> Mike Snitzer wrote:
>>
>>> On Thu, Apr 09 2015 at  9:28am -0400,
>>> Pali Rohár pali.ro...@gmail.com wrote:
>>>
>>>> On Thursday 09 April 2015 09:12:08 Mike Snitzer wrote:
>>>>> On Mon, Apr 06 2015 at  9:29am -0400,
>>>>> Pali Rohár pali.ro...@gmail.com wrote:
>>>>>
>>>>>> On Monday 06 April 2015 15:00:46 Mike Snitzer wrote:
>>>>>>> On Sun, Apr 05 2015 at  1:20pm -0400,
>>>>>>>
>>>>>>> Pali Rohár pali.ro...@gmail.com wrote:
>>>>>>>> This patch series increase security of suspend and hibernate
>>>>>>>> actions. It allows user to safely wipe crypto keys before
>>>>>>>> suspend and hibernate actions starts without race
>>>>>>>> conditions on userspace process with heavy I/O.
>>>>>>>>
>>>>>>>> To automatically wipe cryto key for device before
>>>>>>>> hibernate action call: $ dmsetup message device 0 key
>>>>>>>> wipe_on_hibernation 1
>>>>>>>>
>>>>>>>> To automatically wipe cryto key for device before suspend
>>>>>>>> action call: $ dmsetup message device 0 key
>>>>>>>> wipe_on_suspend 1
>>>>>>>>
>>>>>>>> (Value 0 after wipe_* string reverts original behaviour - to
>>>>>>>> not wipe key)
>>>>>>>
>>>>>>> Can you elaborate on the attack vector your changes are meant
>>>>>>> to protect against?  The user already authorized access, why
>>>>>>> is it inherently dangerous to _not_ wipe the associated key
>>>>>>> across these events?
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> yes, I will try to explain current problems with cryptsetup
>>>>>> luksSuspend command and hibernation.
>>>>>>
>>>>>> First, sometimes it is needed to put machine into other hands.
>>>>>> You can still watch other person what is doing with machine, but
>>>>>> once if you let machine unlocked (e.g opened luks disk), she/he
>>>>>> can access encrypted data.
>>>>>>
>>>>>> If you turn off machine, it could be safe, because luks disk
>>>>>> devices are locked. But if you enter machine into suspend or
>>>>>> hibernate state luks devices are still open. And my patches try
>>>>>> to achieve similar security as when machine is off (= no crypto
>>>>>> keys in RAM or on swap).
>>>>>>
>>>>>> When doing hibernate on unencrypted swap it is to prevent leaking
>>>>>> crypto keys to hibernate image (which is stored in swap).
>>>>>>
>>>>>> When doing suspend action it is again to prevent leaking crypto
>>>>>> keys. E.g when you suspend laptop and put it off (somebody can
>>>>>> remove RAMs and do some cold boot attack).
>>>>>>
>>>>>> The most common situation is:
>>>>>> You have mounted partition from dm-crypt device (e.g. /home/),
>>>>>> some userspace processes access it (e.g opened firefox which
>>>>>> still reads/writes to cache ~/.firefox/) and you want to drop
>>>>>> crypto keys from kernel for some time.
>>>>>>
>>>>>> For that operation there is command cryptsetup luksSuspend, which
>>>>>> suspend dm device and then tell kernel to wipe crypto keys. All
>>>>>> I/O operations are then stopped and userspace processes which
>>>>>> want to do some those I/O operations are stopped too (until you
>>>>>> call cryptsetup luksResume and enter correct key).
>>>>>>
>>>>>> Now if you want to suspend/hiberate your machine (when some of dm
>>>>>> devices are suspeneded and some processes are stopped due to
>>>>>> pending I/O) it is not possible. Kernel freeze_processes function
>>>>>> will fail because userspace processes are still stopped inside
>>>>>> some I/O syscall (read/write, etc,...).
>>>>>>
>>>>>> My patches fixes this problem and do those operations (suspend dm
>>>>>> device, wipe crypto keys, enter suspend/hiberate) in correct
>>>>>> order and without race condition.
>>>>>>
>>>>>> dm device is suspended *after* userspace processes are freezed
>>>>>> and after that are crypto keys wiped. And then computer/laptop
>>>>>> enters into suspend/hibernate state.
>>>>>
>>>>> Wouldn't it be better to fix freeze_processes() to be tolerant of
>>>>> processes that are hung as a side-effect of their backing storage
>>>>> being suspended?  A hibernate shouldn't fail simply because a user
>>>>> chose to suspend a DM device.
>>>>>
>>>>> Then this entire problem goes away and the key can be wiped from
>>>>> userspace (like you said above).
>>>>
>>>> Still there will be race condition. Before hibernation (and device
>>>> poweroff) we should have synced disks and filesystems to prevent data
>>>> lose (or other damage) as more as we can. And if there will be some
>>>> application which using lot of I/O (e.g normal firefox) then there
>>>> always will be race condtion.
>>>
>>> The DM suspend will take care of flushing any pending I/O.  So I don't
>>> see where the supposed race is...
>>>
>>> Anything else that is trapped in userspace memory will be there when
>>> the machine resumes.
>>>
>>>> So proper way is to wipe luks crypto keys *after* userspace processes
>>>> are freezed.
>>>
>>> I know you believe that I'm just not accepting that at face value.
>>
>> Um, pardon me if I'm being naive, but what about the case of hibernation
>> where the swapdev and the root device are both LVs on the same dm_crypt
>> device?
>>
>> The kernel is writing to swap _after_ userspace processes are all frozen;
>> that seems to me like an ordering dependency entirely incompatible
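For reference, the workflow under discussion can be sketched as a shell session. The device name `cr_home` is hypothetical, and the `wipe_on_*` messages exist only with the proposed patches applied; the `cryptsetup` commands are the existing manual path.

```shell
# Manual key wipe today: suspend the DM device and drop the key, blocking
# all I/O until the passphrase is re-entered (requires root):
cryptsetup luksSuspend cr_home   # device suspended, key wiped from kernel
cryptsetup luksResume cr_home    # prompts for passphrase, I/O resumes

# With the proposed patches: ask the kernel to wipe the key by itself,
# *after* userspace is frozen, on the corresponding power event:
dmsetup message cr_home 0 key wipe_on_hibernation 1
dmsetup message cr_home 0 key wipe_on_suspend 1
dmsetup message cr_home 0 key wipe_on_suspend 0   # value 0 reverts to default
```

The point of the ordering is visible here: the manual path stops processes before the freezer runs, while the patched path lets the freezer run first and only then suspends the device and wipes the key.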



Re: Linux XIA - merge proposal

2015-03-05 Thread Alex Elsayed
Michel Machado wrote:

> Hi there,
> 
> We have been developing Linux XIA, a new network stack that
> emphasizes evolvability and interoperability, for a couple of years, and
> it has now reached a degree of maturity that allows others to experiment
> with it.

From looking at your wiki, "network stack" may have been a poor choice of 
term - it looks like rather than being a new network stack (which in Linux, 
is commonly used to refer to the software stack that lives between the APIs 
and the hardware), this is a new protocol (and framework _for_ protocols) 
operating at the same level of the network as IP, with ideas extending 
upwards through TCP.

Now, that's a rather different proposal - witness that RDS, TIPC, etc all 
made it into the kernel relatively easily, especially when compared to 
netmap, or any other system that tried to replace the Linux networking 
infrastructure.





Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)

2015-02-03 Thread Alex Elsayed
Al Viro wrote:

> On Tue, Feb 03, 2015 at 07:01:50PM +0100, Alexander Holler wrote:
> 
>> Yeah, as I've already admitted in the bug, I never should have use
>> the word secure, because everyone nowadays seems to end up in panic
>> when reading that word.
>> 
>> So, if I would be able to use sed on my mails, I would replace
>> unlinkat_s() with unlinkat_w() (for wipe) or would say that _s does
>> stand for 'shred' in the means of shred(1).
> 
> TBH, I suspect that the saner API would be something like
> EXT2_IOC_[SG]ETFLAGS, allowing to set and query that along with other
> flags (append-only, etc.).
> 
> Forget about unlink; first of all, whatever API you use should only _mark_
> the inode as "zero freed blocks" (or trim, for that matter).  You can't
> force freeing of an inode, so either you make sure that subsequent freeing
> of inode, whenever it happens, will do that work, or your API is
> hopelessly
> racy.  Moreover, when link has been removed it's too late to report that
> fs has no way to e.g. trim those blocks, so you really want to have it
> done
> _before_ the actual link removal.  And if the file contents is that
> sensitive, you'd better extend the same protection to all operations that
> free its
> blocks, including truncate(), fallocate() hole-punching, whatever.  What's
> more, if you divorce that from link removal, you probably don't want it as
> in-core-only flag - have it stored in inode, if fs supports that.
> 
> Alternatively, you might want to represent it as xattr - as much as I hate
> those, it might turn out to be the best fit in this case, if we end up
> with several variants for freed blocks disposal.  Not sure...
> 
> But whichever way we represent that state, IMO
> a) operation should be similar to chmod/chattr/setfattr - modifying
> inode metadata.
> b) it should affect _all_ operations freeing blocks of that file
> from that point on
> c) it should be able to fail, telling you that you can't do that for
> this backing store.

Well, chattr already has +s which means exactly this. It's just not 
respected by... anything. The 0/5 mentioned it, albeit briefly.
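For context, that existing-but-ignored attribute can be set and inspected today (the file name is illustrative; whether the flag ever has an effect depends entirely on the filesystem, and on mainline none act on it when blocks are freed):

```shell
touch secret.dat
chattr +s secret.dat   # request "secure deletion" when the file's blocks are freed
lsattr secret.dat      # the 's' flag shows up in the attribute column
chattr -s secret.dat   # clear the flag again
rm secret.dat
```

Note that `chattr` will fail outright on filesystems that don't support extended attributes of this kind, which is itself a form of the "able to fail" property asked for above.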





Re: [PATCH 00/12] Add kdbus implementation

2014-10-30 Thread Alex Elsayed
Andy Lutomirski wrote:

[snip]
> There should be a number measured in, say, nanoseconds in here
> somewhere.  The actual extent of the speedup is unmeasurable here.
> Also, it's worth reading at least one of Linus' many rants about
> zero-copy.  It's not an automatic win.

It's well-understood that it's not an automatic win; significant testing on 
multiple architectures indicated that 512K is a surprisingly universal 
crossover point. The userspace code, therefore, switches from copying 
(normal kdbus parameters) to zero-copy (memfds) right around there.
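As an illustration, the decision described above can be sketched as a per-message strategy pick; the 512K threshold is the empirical figure from this thread, while the function and strategy names are hypothetical, not the actual kdbus userspace API:

```python
# Hypothetical sketch of the copy vs. zero-copy decision described above.
# The 512K crossover point is the empirical figure from the thread; the
# names here are illustrative only.
ZERO_COPY_THRESHOLD = 512 * 1024  # bytes; empirical crossover point

def choose_strategy(payload_len: int) -> str:
    """Small payloads are copied inline; large ones go via memfd (zero-copy)."""
    return "memfd" if payload_len >= ZERO_COPY_THRESHOLD else "inline-copy"

print(choose_strategy(4 * 1024))         # -> inline-copy (typical method call)
print(choose_strategy(8 * 1024 * 1024))  # -> memfd (bulk payload)
```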



Re: [PATCH 4/4] target: Add a user-passthrough backstore

2014-09-19 Thread Alex Elsayed
Nicholas A. Bellinger wrote:

> On Fri, 2014-09-19 at 14:43 -0700, Alex Elsayed wrote:
>> Nicholas A. Bellinger wrote:
>> 
>> [snip]
>> > So the idea of allowing the in-kernel CDB emulation to run after
>> > user-space has returned unsupported opcode is problematic for a couple
>> > of different reasons.
>> > 
>> > First, if the correct feature bits in standard INQUIRY + EVPD INQUIRY,
>> > etc are not populated by user-space to match what in-kernel CDB
>> > emulation actually supports, this can result in potentially undefined
>> > results initiator side.
>> > 
>> > Also for example, if user-space decided to emulate only a subset of PR
>> > operations, and leaves the rest of it up to the in-kernel emulation,
>> > there's not really a way to sync current state between the two, which
>> > also can end up with undefined results.
>> > 
>> > So that said, I think a saner approach would be to define two modes of
>> > operation for TCMU:
>> > 
>> >*) Passthrough Mode: All CDBs are passed to user-space, and no
>> >   in-kernel emulation is done in the event of an unsupported
>> >   opcode response.
>> > 
>> >*) I/O Mode: I/O related CDBs are passed into user-space, but
>> >   all control CDBs continue to be processed by in-kernel emulation.
>> >   This effectively limits operation to TYPE_DISK, but with this
>> >   mode it's probably OK to assume this.
>> > 
>> > This seems like the best trade-off between flexibility when everything
>> > should be handled by user-space, vs. functionality when only block
>> > remapping of I/O is occurring in user-space code.
>> 
>> The problem there is that the first case has all the issues of pscsi and
>> simply becomes a performance optimization over tgt+iscsi client+pscsi and
>> the latter case excludes the main use cases I'm interested in - OSDs,
>> media changers, optical discs (the biggest thing for me), and so forth.
>> 
>> One of the main things I want to do with this is hook up a plugin that
>> uses libmirage to handle various optical disc image formats.
>> 
> 
> Not sure I follow..  How does the proposed passthrough mode prevent
> someone from emulating OSDs, media changers, optical disks or anything
> else in userspace with TCMU..?
> 
> The main thing that the above comments highlight is why attempting to
> combine the existing in-kernel emulation with a userspace backend
> providing its own emulation can open up a number of problems with
> mismatched state between the two.

It doesn't prevent it, but it _does_ put it in the exact same place as PSCSI 
regarding the warnings on the wiki. It means that if someone wants to 
implement (say) the optical disc or OSD CDBs, they then lose out on ALUA  
unless they implement it themselves - which seems unnecessary and painful, 
since those should really be disjoint. In particular, an OSD backed by RADOS 
objects could be a very nice thing indeed, _and_ could really benefit from 
ALUA.



Re: [PATCH v3 RESEND] zram: auto add new devices on demand

2014-08-08 Thread Alex Elsayed
Minchan Kim wrote:

> On Wed, Jul 30, 2014 at 10:58:31PM +0900, Sergey Senozhatsky wrote:
>> Hello,
>> 
>> On (07/29/14 12:00), Minchan Kim wrote:
>> > Hello Timofey,
>> > 
[snip]

>> > Why do you add new device unconditionally?
>> > Maybe we need new knob on sysfs or ioctl for adding new device?
>> > Any thought, guys?
>> 
>> 
>> speaking of the patch, frankly, I (almost) see no gain comparing to the
>> existing functionality.
>> 
>> speaking of the idea. well, I'm not 100% convinced yet. the use cases I
>> see around do not imply dynamic creation/resizing/etc. that said, I need
>> to think about it.
> 
> It didn't persuade me, either.
> 
> Normally, distro have some config file for adding param at module loading
> like /etc/modules. So, I think it should be done in there if someone want
> to increase the number of zram devices.

The problem here is that this requires (at least) unloading the module and, 
if it was built in, a reboot (plus futzing with the kernel command line 
rather than /etc/modules.d).

If someone's distro already loaded the module with nr_devices=1 (the 
default, I remind you), and is using it as swap, then it may well not be a 
feasible option for them to swapoff the (potentially large) swap device and 
do the modprobe dance.

If they're running off a livecd that's using it in combination with, say, 
LVM thin provisioning in order to have a writeable system, then they are 
_completely_ screwed because you can't swapoff your rootfs.

If they're using it as a backing store for ephemeral containers or VMs, then 
they may hit _any_ static limit, when they just want to start one more 
without having to stop the existing bunch.

The swap case might be argued as "deal with it" (even though swapoff is not 
something fun to do on a system under any real-world load, _especially_ if 
what you're trying to force off of the swap device won't fit in ram and has 
to get pushed down to a different swap device).

But any case where people put filesystems on the device should make the 
issues of only supporting the module parameter pretty apparent.

Finally, there's the issue of clutter - I may need 4 zram devices when I'm 
experimenting with something, but only the one swap device for daily use. 
Having the other three just sitting around permanently is at most an 
annoyance, a 'papercut' - but papercuts add up.

>> 
>> if we end up adding this functionality I tend to vote for sysfs knob,
>> just because it seems to be more user friendly than writing some magic
>> INTs to ioctl-d fd.

This I agree with wholeheartedly.
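For what it's worth, such a sysfs knob could be consumed from userspace with something as small as the sketch below. The control-file path and the read-to-allocate semantics are assumptions for illustration only (nothing like this existed at the time of this thread), which is why the path is a parameter:

```python
# Hedged sketch: consuming a hypothetical "add a zram device" sysfs knob.
# Reading the control file would make the kernel allocate a new device and
# return its id. The default path below is an assumption, hence overridable.
def zram_hot_add(control_path="/sys/class/zram-control/hot_add"):
    with open(control_path) as f:
        return int(f.read().strip())  # id of the newly created /dev/zramN
```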






Re: Reading large amounts from /dev/urandom broken

2014-07-24 Thread Alex Elsayed
Hannes Frederic Sowa wrote:

> On Mi, 2014-07-23 at 11:14 -0400, Theodore Ts'o wrote:
>> On Wed, Jul 23, 2014 at 04:52:21PM +0300, Andrey Utkin wrote:
>> > Dear developers, please check bugzilla ticket
>> > https://bugzilla.kernel.org/show_bug.cgi?id=80981 (not the initial
>> > issue, but starting with comment#3).
>> > 
>> > Reading from /dev/urandom gives EOF after 33554431 bytes.  I believe
>> > it is introduced by commit 79a8468747c5f95ed3d5ce8376a3e82e0c5857fc,
>> > with the chunk
>> > 
>> > nbytes = min_t(size_t, nbytes, INT_MAX >> (ENTROPY_SHIFT + 3));
>> > 
>> > which is described in commit message as "additional paranoia check to
>> > prevent overly large count values to be passed into urandom_read()".
>> > 
>> > I don't know why people pull such large amounts of data from urandom,
>> > but given today there are two bugreports regarding problems doing
>> > that, i consider that this is practiced.
>> 
>> I've inquired on the bugzilla why the reporter is abusing urandom in
>> this way.  The other commenter on the bug replicated the problem, but
>> that's not a "second bug report" in my book.
>> 
>> At the very least, this will probably cause me to insert a warning
>> printk: "insane user of /dev/urandom: [current->comm] requested %d
>> bytes" whenever someone tries to request more than 4k.
> 
> Ok, I would be fine with that.
> 
> The dd if=/dev/urandom of=random_file.dat seems reasonable to me to try
> to not break it. But, of course, there are other possibilities.

Personally, I'd say that _is_ insane - reading from urandom still consumes 
entropy (causing readers of /dev/random to block more often); when 
alternatives (such as dd'ing to dm-crypt) both avoid the issue _and_ are 
faster, it should very well be considered pathological.
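As a side note, the 33554431-byte EOF reported in the quoted bug falls directly out of the quoted clamp, assuming ENTROPY_SHIFT is 3 as in drivers/char/random.c of that era:

```python
# Sanity check: the clamp nbytes = min(nbytes, INT_MAX >> (ENTROPY_SHIFT + 3))
# caps a single urandom read at exactly the offset the reporter hit.
INT_MAX = 0x7FFFFFFF
ENTROPY_SHIFT = 3  # assumption: value from drivers/char/random.c at the time

limit = INT_MAX >> (ENTROPY_SHIFT + 3)
print(limit)  # -> 33554431, the observed EOF offset
```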



Re: [RFC 1/2] target: Add documentation on the target userspace pass-through driver

2014-07-05 Thread Alex Elsayed
Reply inline, with a good bit of snipping done (posting via gmane, so 
quote/content ratio is an issue).

Andy Grover wrote:

> +These backstores cover the most common use cases, but not all. One new
> +use case that other non-kernel target solutions, such as tgt, are able
> +to support is using Gluster's GLFS or Ceph's RBD as a backstore. The
> +target then serves as a translator, allowing initiators to store data
> +in these non-traditional networked storage systems, while still only
> +using standard protocols themselves.

Another use case is in supporting various image formats, like (say) qcow2, 
and then handing those off to vhost_scsi.

> +Benefits:
> +
> +In addition to allowing relatively easy support for RBD and GLFS, TCMU
> +will also allow easier development of new backstores. TCMU combines
> +with the LIO loopback fabric to become something similar to FUSE
> +(Filesystem in Userspace), but at the SCSI layer instead of the
> +filesystem layer. A SUSE, if you will.

As long as people don't start calling it L[UNs in ]USER[space] :P

Between that and ABUSE (A Block device in USErspace), this domain has some 
real naming potential...

> +Device Discovery:
> +
> +Other devices may be using UIO besides TCMU. Unrelated user processes
> +may also be handling different sets of TCMU devices. TCMU userspace
> +processes must find their devices by scanning sysfs
> +class/uio/uio*/name. For TCMU devices, these names will be of the
> +format:
> +
> +tcm-user/<subtype>/<path>
> +
> +where "tcm-user" is common for all TCMU-backed UIO devices. <subtype>
> +will be a userspace-process-unique string to identify the TCMU device
> +as expecting to be backed by a certain handler, and <path> will be an
> +additional handler-specific string for the user process to configure
> +the device, if needed. Neither <subtype> or <path> can contain ':',
> +due to LIO limitations.

It might be good to change this somewhat; in the vast majority of cases it'd 
be saner for userspace programs to figure this information out via udev etc. 
rather than parsing sysfs themselves. This information is still worth 
documenting, but saying things like "must find their devices by scanning 
sysfs" is likely to lead to users of this interface making suboptimal 
choices.
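Whichever discovery mechanism ends up recommended, the name format quoted above splits the same way; a minimal sketch (the helper name is mine, and the format is taken on faith from the RFC text):

```python
# Sketch: split a TCMU UIO name of the documented form
# "tcm-user/<subtype>/<path>" into its components. Returns None for
# non-TCMU UIO devices. The helper name is hypothetical.
def parse_tcmu_name(name: str):
    prefix, _, rest = name.partition("/")
    if prefix != "tcm-user" or not rest:
        return None
    subtype, _, path = rest.partition("/")
    return subtype, path

print(parse_tcmu_name("tcm-user/qcow2/disk1.qcow2"))  # -> ('qcow2', 'disk1.qcow2')
print(parse_tcmu_name("uio_pdrv_genirq"))             # -> None
```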

> +Device Events:
> +
> +If a new device is added or removed, user processes will receive a HUP
> +signal, and should re-scan sysfs. File descriptors for devices no
> +longer in sysfs should be closed, and new devices should be opened and
> +handled.

Is there a cleaner way to do this? In particular, re-scanning sysfs may 
cause race conditions (device removed, one of the same name re-added but a 
different UIO device node; probably more to be found). Perhaps recommend 
netlink uevents, so that remove+add is noticeable? Also, is the SIGHUP 
itself the best option? Could we simply require the user process to listen 
for add/remove uevents to get such change notifications, and thus enforce 
good behavior?

> +Writing a user backstore handler:
> +
> +Variable emulation with pass_level:
> +
> +TCMU supports a "pass_level" option with valid values of 1, 2, or
> +3. This controls how many different SCSI commands are passed up,
> +versus being emulated by LIO. The purpose of this is to give the user
> +handler author a choice of how much of the full SCSI command set they
> +care to support.
> +
> +At level 1, only READ and WRITE commands will be seen. At level 2,
> +additional commands defined in the SBC SCSI specification such as
> +WRITE SAME, SYNCHRONIZE CACHE, and UNMAP will be passed up. Finally, at
> +level 3, almost all commands defined in the SPC SCSI specification
> +will also be passed up for processing by the user handler.

One use case I'm actually interested in is having userspace provide 
something other than just SPC - for instance, tgt can provide a virtual tape 
library or an OSD, and CDemu can provide emulated optical discs from various 
image formats.

Currently, CDemu uses its own out-of-tree driver called VHBA (Virtual Host 
Bus Adapter) to do pretty much exactly what TCMU+Loopback would 
accomplish... and in the process misses out on all of the other fabrics, 
unless you're willing to _re-import_ those devices using PSCSI, which has 
its own quirks.

Perhaps there could be a level 0 (or 4, or whatever) which means "explicitly 
enabled list of commands" - maybe as a bitmap that could be passed to the 
kernel somehow? Hopefully, that could also avoid some of the quirks of PSCSI 
regarding ALUA and such - if it's not implemented, leave the relevant bits 
at zero, and LIO handles it.
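To make the bitmap idea concrete, here is one possible shape for it - purely a sketch, since no such interface exists in the RFC; the opcode values are the standard SCSI ones, everything else is assumed:

```python
# Sketch of an "explicitly enabled list of commands" bitmap: one bit per
# SCSI opcode (256 opcodes -> 32 bytes). Userspace would set bits for the
# CDBs it handles; anything else stays with in-kernel emulation. The
# layout is hypothetical; only the opcode values are standard SCSI.
READ_10, WRITE_10, UNMAP = 0x28, 0x2A, 0x42

def make_bitmap(opcodes):
    bm = bytearray(32)
    for op in opcodes:
        bm[op >> 3] |= 1 << (op & 7)
    return bytes(bm)

def wants_opcode(bm, op):
    return bool(bm[op >> 3] & (1 << (op & 7)))

bm = make_bitmap([READ_10, WRITE_10])
print(wants_opcode(bm, READ_10), wants_opcode(bm, UNMAP))  # -> True False
```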

This does look really nice, thanks for writing it!



Re: Thoughts on credential switching

2014-03-29 Thread Alex Elsayed
Jeff Layton wrote:

> On Wed, 26 Mar 2014 20:25:35 -0700
> Jeff Layton  wrote:
> 
>> On Wed, 26 Mar 2014 20:05:16 -0700
>> Andy Lutomirski  wrote:
>> 
>> > On Wed, Mar 26, 2014 at 7:48 PM, Jeff Layton 
>> > wrote:
>> > > On Wed, 26 Mar 2014 17:23:24 -0700
>> > > Andy Lutomirski  wrote:
>> > >
>> > >> Hi various people who care about user-space NFS servers and/or
>> > >> security-relevant APIs.
>> > >>
>> > >> I propose the following set of new syscalls:
>> > >>
>> > >> int credfd_create(unsigned int flags): returns a new credfd that
>> > >> corresponds to current's creds.
>> > >>
>> > >> int credfd_activate(int fd, unsigned int flags): Change current's
>> > >> creds to match the creds stored in fd.  To be clear, this changes
>> > >> both the "subjective" and "objective" (aka real_cred and cred)
>> > >> because there aren't any real semantics for what happens when
>> > >> userspace code runs with real_cred != cred.
>> > >>
>> > >> Rules:
>> > >>
>> > >>  - credfd_activate fails (-EINVAL) if fd is not a credfd.
>> > >>  - credfd_activate fails (-EPERM) if the fd's userns doesn't
>> > >> match current's userns.  credfd_activate is not intended to be a
>> > >> substitute for setns.
>> > >>  - credfd_activate will fail (-EPERM) if LSM does not allow the
>> > >> switch.  This probably needs to be a new selinux action --
>> > >> dyntransition is too restrictive.
>> > >>
>> > >>
>> > >> Optional:
>> > >>  - credfd_create always sets cloexec, because the alternative is
>> > >> silly.
>> > >>  - credfd_activate fails (-EINVAL) if dumpable.  This is because
>> > >> we don't want a privileged daemon to be ptraced while
>> > >> impersonating someone else.
>> > >>  - optional: both credfd_create and credfd_activate fail if
>> > >> !ns_capable(CAP_SYS_ADMIN) or perhaps !capable(CAP_SETUID).
>> > >>
>> > >> The first question: does this solve Ganesha's problem?
>> > >>
>> > >> The second question: is this safe?  I can see two major concerns.
>> > >> The bigger concern is that having these syscalls available will
>> > >> allow users to exploit things that were previously secure.  For
>> > >> example, maybe some configuration assumes that a task running as
>> > >> uid==1 can't switch to uid==2, even with uid 2's consent.
>> > >> Similar issues happen with capabilities.  If CAP_SYS_ADMIN is not
>> > >> required, then this is no longer really true.
>> > >>
>> > >> Alternatively, something running as uid == 0 with heavy
>> > >> capability restrictions in a mount namespace (but not a uid
>> > >> namespace) could pass a credfd out of the namespace.  This could
>> > >> break things like Docker pretty badly.  CAP_SYS_ADMIN guards
>> > >> against this to some extent.  But I think that Docker is already
>> > >> totally screwed if a Docker root task can receive an O_DIRECTORY
>> > >> or O_PATH fd out of the container, so it's not entirely clear
>> > >> that the situation is any worse, even without requiring
>> > >> CAP_SYS_ADMIN.
>> > >>
>> > >> The second concern is that it may be difficult to use this
>> > >> correctly. There's a reason that real_cred and cred exist, but
>> > >> it's not really well set up for being used.
>> > >>
>> > >> As a simple way to stay safe, Ganesha could only use credfds that
>> > >> have real_uid == 0.
>> > >>
>> > >> --Andy
>> > >
>> > >
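[For illustration, a compact sketch of how a server like Ganesha might have used the proposed pair of syscalls. This is pseudocode only: credfd_create()/credfd_activate() were a proposal and do not exist in any kernel, and the error handling shown is paraphrased from the rules above.]

```c
/* PSEUDOCODE -- the credfd_* syscalls proposed above were never merged;
 * the calls and flags here are hypothetical. */

/* At connection setup, while impersonating the client (via the usual
 * set*id()/LSM transitions), snapshot the credentials once: */
int cred_handle = credfd_create(0);      /* cloexec always set, per the proposal */

/* Later, on a worker thread servicing that client, switch both
 * real_cred and cred to the snapshot without re-running the
 * expensive set*id()/LSM checks: */
if (credfd_activate(cred_handle, 0) < 0) {
    /* -EINVAL: fd is not a credfd, or the task is dumpable;
     * -EPERM:  userns mismatch, or the LSM denied the switch. */
    handle_error();
}

/* ... perform filesystem I/O as the client ... */

/* Restore the daemon's own identity from a previously saved handle: */
credfd_activate(daemon_cred_handle, 0);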
>> > > I still don't quite grok why having this special credfd_create
>> > > call buys you anything over simply doing what Al had originally
>> > > suggested -- switch creds using all of the different syscalls and
>> > > then simply caching that in a "normal" fd:
>> > >
>> > > fd = open("/dev/null", O_PATH...);
>> > >
>> > > ...it seems to me that the credfd_activate call will still need to
>> > > do the same permission checking that all of the individual
>> > > set*id() calls require (and all of the other stuff like changing
>> > > selinux contexts, etc).
>> > >
>> > > IOW, this fd is just a "handle" for passing around a struct cred,
>> > > but I don't see why having access to that handle would allow you
>> > > to do something you couldn't already do anyway.
>> > >
>> > > Am I missing something obvious here?
>> > 
>> > Not really.  I think I didn't adequately explain a piece of this.
>> > 
>> > I think that what you're suggesting is for an fd to encode a set of
>> > credentials but not to grant permission to use those credentials.
>> > So switch_creds(fd) is more or less the same thing as
>> > switch_creds(ruid, euid, suid, rgid, egid, sgid, groups, mac
>> > label, ...).  switch_creds needs to verify that the caller can
>> > dyntransition to the label, set all the ids, etc., but it avoids
>> > allocating anything and running RCU callbacks.
>> > 
>> > The trouble with this is that the verification needed is complicated
>> > and expensive.  And I think that my proposal is potentially more
>> > useful.
>> > 
>> 
>> Is it really though? My understanding of the problem was that it was
>> the syscall (context switching) overhead + having to do a bunch of RCU
>> critical stuff that was the problem. If we can do all of this in the
>> context of a single RCU critical section, isn't that still a win?
>> 
>> As to the complicated part...maybe but it doesn't seem like it would
>> have to be. We could simply return -EINVAL or something if the old
>> struct cred doesn't have fields that match the ones we're replacing
>> and that we don't expect to see changed.


Re: [QUERY] lguest64

2013-08-01 Thread Alex Elsayed
Ramkumar Ramachandra wrote:

> H. Peter Anvin wrote:
>> UML, lguest and Xen were done before the x86 architecture supported
>> hardware virtualization.
> 
> [...]
> 
>> but on KVM-enabled hardware KVM seems
>> like the better option (and is indeed what libguestfs uses.)
> 
> While we're still on the topic, I'd like a few clarifications. From
> your reply, I got the impression that KVM is the only mechanism for
> non-pvops virtualization.  This seems quite contrary to what I read on
> lwn about ARM virtualization [1]. In short, ARM provides a "hypervisor
> mode", and the article says
> 
>   "the virtualization model provided by ARM fits the Xen
> hypervisor-based virtualization better than KVM's kernel-based model"
> 
> The Xen people call this "ARM PVH" (as opposed to ARM PV, which does
> not utilize hardware extensions) [2]. Although I wasn't able to find
> much information about the hardware aspect, what ARM provides seems to
> be quite different from VT-x and AMD-V. I'm also confused about what
> virt/kvm/arm is.
> 
> Thanks.
> 
> [1]: http://lwn.net/Articles/513940/
> [2]: http://www.xenproject.org/developers/teams/arm-hypervisor.html

ARM's virtualization extensions may be a more *natural* match to Xen's 
semantics and architecture, but that doesn't mean that KVM can't use it. LWN 
explains the details far better than I can: https://lwn.net/Articles/557132/

virt/kvm/arm is an implementation of KVM (the API) that takes advantage of 
ARM's virtualization extensions.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [ANNOUNCE] target-pending tree updated to v3.6-rc2

2012-08-17 Thread Alex Elsayed
On Thursday, August 16, 2012 11:49:44 PM Nicholas A. Bellinger wrote:
> On Thu, 2012-08-16 at 19:43 -0700, Nicholas A. Bellinger wrote:
> > Hi all,
> > 
> > With the release of v3.6-rc2 this afternoon, the target-pending.git tree
> > now has been updated using the freshly cut -rc2 as its new HEAD.
> > Patches destined into for-3.7 code are now being added into for-next for
> > linux-next build testing.
> > 
> > Also, thanks go out to MST, Stefan, and Paolo who managed to get
> > tcm_vhost fabric code merged in mainline for -rc2  under the special
> > post merge window exception rule for new kernel code, which in my
> > experience tends to happen about as often as Halley's Comet comes
> > around..
> > 
> > Here is the current breakdown of pending patches by branch:
> > 
> > master: (patches headed for 3.6-rc-fixes)
> > 
> > 5b7517f8 tcm_vhost: Change vhost_scsi_target->vhost_wwpn to char *
> > d0e27c88 target: fix NULL pointer dereference bug alloc_page() fails to get memory
> > 1fa8f450 tcm_fc: Avoid debug overhead when not debugging
> > 101998f6 tcm_vhost: Post-merge review changes requested by MST
> > f0e0e9bb tcm_vhost: Fix incorrect IS_ERR() usage in vhost_scsi_map_iov_to_sgl
> Whoops..  Missed one extra target/pscsi regression bug-fix reported
> recently by Alex Elsayed (CC'ed) that has just been pushed into master
> here:
> 
> target/pscsi: Fix bug with REPORT_LUNs handling for SCSI passthrough
> http://git.kernel.org/?p=linux/kernel/git/nab/target-pending.git;a=commitdiff;h=1d2a2cd95ee0137a2353d1b5635739c281f27cd4
> 
> Alex-E, you were able to get TYPE_ROM passthrough w/ pSCSI export
> working on your setup with this patch, yes..?
> 
> Thanks!
> 
> --nab

The patch got it to where XP and Windows 7 would recognize it and mount it, 
but there's still some weirdness - for instance, opening it in the file browser 
hangs for an extended period of time before it starts working. No noise in 
dmesg when that happens, and I haven't had time to look deeper.

