Re: INFO: task hung in nbd_ioctl
On Thu, Oct 17, 2019 at 10:47:59AM -0500, Mike Christie wrote:
> On 10/17/2019 09:03 AM, Richard W.M. Jones wrote:
> > On Tue, Oct 01, 2019 at 04:19:25PM -0500, Mike Christie wrote:
> >> Hey Josef and nbd list,
> >>
> >> I had a question about if there are any socket family restrictions for nbd?
> >
> > In normal circumstances, in userspace, the NBD protocol would only be
> > used over AF_UNIX or AF_INET/AF_INET6.
> >
> > There's a bit of confusion because netlink is used by nbd-client to
> > configure the NBD device, setting things like block size and timeouts
> > (instead of ioctl which is deprecated).  I think you don't mean this
> > use of netlink?
>
> I didn't.  It looks like it is just a bad test.
>
> For the automated test in this thread, the test created an AF_NETLINK
> socket and passed it into the NBD_SET_SOCK ioctl.  That is what got used
> for the NBD_DO_IT ioctl.
>
> I was not sure if the test creator picked any old socket and it just
> happened to pick one nbd never supported, or it was trying to simulate
> sockets that did not support the shutdown method.
>
> I attached the automated test that got run (test.c).

I'd say it sounds like a bad test, but I'm not familiar with syzkaller
nor how / from where it generates these tests.  Did someone report a
bug and then syzkaller wrote this test?

Rich.

> >> The bug here is that some socket families do not support the
> >> sock->ops->shutdown callout, and when nbd calls kernel_sock_shutdown
> >> their callout returns -EOPNOTSUPP.  That then leaves recv_work stuck in
> >> nbd_read_stat -> sock_xmit -> sock_recvmsg.  My patch added a
> >> flush_workqueue call, so for socket families like AF_NETLINK in this bug
> >> we hang like we see below.
> >>
> >> I can just remove the flush_workqueue call in that code path since it's
> >> not needed there, but it leaves the original bug my patch was hitting
> >> where we leave the recv_work running, which can then result in leaked
> >> resources or possible use-after-free crashes, and you still get the hang
> >> if you remove the module.
> >>
> >> It looks like we have used kernel_sock_shutdown for a while, so I thought
> >> we might never have supported sockets that did not support the callout.
> >> Is that correct?  If so then I can just add a check for this in
> >> nbd_add_socket and fix that bug too.
> >
> > Rich.
> >
> >> On 09/30/2019 05:39 PM, syzbot wrote:
> >>> Hello,
> >>>
> >>> syzbot found the following crash on:
> >>>
> >>> HEAD commit:    bb2aee77 Add linux-next specific files for 20190926
> >>> git tree:       linux-next
> >>> console output: https://syzkaller.appspot.com/x/log.txt?x=13385ca360
> >>> kernel config:  https://syzkaller.appspot.com/x/.config?x=e60af4ac5a01e964
> >>> dashboard link: https://syzkaller.appspot.com/bug?extid=24c12fa8d218ed26011a
> >>> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> >>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=12abc2a360
> >>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11712c0560
> >>>
> >>> The bug was bisected to:
> >>>
> >>> commit e9e006f5fcf2bab59149cb38a48a4817c1b538b4
> >>> Author: Mike Christie
> >>> Date:   Sun Aug 4 19:10:06 2019 +
> >>>
> >>>     nbd: fix max number of supported devs
> >>>
> >>> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1226f3c560
> >>> final crash:    https://syzkaller.appspot.com/x/report.txt?x=1126f3c560
> >>> console output: https://syzkaller.appspot.com/x/log.txt?x=1626f3c560
> >>>
> >>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> >>> Reported-by: syzbot+24c12fa8d218ed260...@syzkaller.appspotmail.com
> >>> Fixes: e9e006f5fcf2 ("nbd: fix max number of supported devs")
> >>>
> >>> INFO: task syz-executor390:8778 can't die for more than 143 seconds.
> >>> syz-executor390 D27432  8778   8777 0x4004
> >>> Call Trace:
> >>>  context_switch kernel/sched/core.c:3384 [inline]
> >>>  __schedule+0x828/0x1c20 kernel/sched/core.c:4065
> >>>  schedule+0xd9/0x260 kernel/sched/core.c:4132
> >>>  schedule_timeout+0x717/0xc50 kernel/time/timer.c:1871
> >>>  do_wait_for_common kernel/sched/completio
Re: nbd, nbdkit, loopback mounts and memory management
On Mon, Feb 18, 2019 at 12:15:14AM +0100, Pavel Machek wrote:
> But Shaun reported it happened somehow often for them, so he might
> have a practical test case... better than my theories :-).

Yes, certainly not saying this isn't a problem.  I think the good news
is the fix seems quite easy, ie. to add mlockall and adjust the OOM
killer score, as is done currently in the client:

https://github.com/NetworkBlockDevice/nbd/blob/3969c3f81a11a483f267a55ed6665d260dc9e1d2/nbd-client.c#L867-L885
https://github.com/NetworkBlockDevice/nbd/blob/3969c3f81a11a483f267a55ed6665d260dc9e1d2/nbd-client.c#L1219

For now I have added a note in the TODO file to follow up in case we
get a test case or reports of a problem:

https://github.com/libguestfs/nbdkit/commit/72e0afe2e280d895f68941677fafa559ddc3bb0d

Thanks, Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
Re: nbd, nbdkit, loopback mounts and memory management
So not to dispute that this could be a bug, but I couldn't cause a
deadlock.  I wonder if you can see something wrong with my method?

*** Set up ***

- kernel 5.0.0-0.rc3.git0.1.fc30.x86_64
- nbd-client 3.19-1.fc30
- nbdkit 1.11.5 (git commit ef9d1978ce28)

Baremetal machine was booted with mem=2G to artificially limit the
RAM.  The machine has 64G of swap.

# free -m
              total        used        free      shared  buff/cache   available
Mem:           1806         329        1383           0          93        1357
Swap:         65535         179       65356

*** Method ***

I started nbdkit as a 4G RAM disk:

  ./nbdkit memory size=4G

This is implemented as a sparse array with a 2 level page table, and
should allocate (using malloc) every time a new area of the disk is
written to.  Exact implementation is here:
https://github.com/libguestfs/nbdkit/tree/master/common/sparse

I started nbd-client using the -swap option which uses
mlockall(MCL_FUTURE) to lock the client into RAM:

  nbd-client -b 512 -swap localhost /dev/nbd0

I then created a filesystem on the RAM disk, mounted it, and copied a
3G file into it.  I tried this various ways, but the variation I was
eventually happy with was:

  mke2fs /dev/nbd0
  mount /dev/nbd0 /tmp/mnt
  dd if=/dev/zero of=/tmp/big bs=1M count=3000
  cp /tmp/big /tmp/mnt/big

I couldn't get any kind of deadlock or failure in this test.

(Note that if you repeat the test several times, in theory you could
delete the file and fstrim the filesystem, but when I was testing it,
to be sure I unmounted everything and killed and restarted nbdkit
between each test.)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
Re: nbd, nbdkit, loopback mounts and memory management
On Fri, Feb 15, 2019 at 10:41:26PM +, Richard W.M. Jones wrote:
> On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> > Hi!
> >
> > I watched fosdem talk about
> > nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> > of warning: I'm not sure using it read-write on localhost is safe.
> >
> > In particular, user application could create a lot of dirty data
> > quickly. If there's not enough memory for nbdkit (or nbd-client or
> > nbd-server), you might get a deadlock.
>
> Thanks for the kind words about the talk.  I've added Wouter Verhelst
> & the NBD mailing list to CC.  Although I did the talk because the
> subject is interesting, how I actually use nbdkit / NBD is to talk to
> qemu and that's where I have most experience and where we (Red Hat)
> use it in production systems.
>
> However in January I spent a lot of time exercising the NBD loop-mount
> + nbdkit case using fio in order to find contention / bottlenecks in
> our use of threads and locks.  I didn't notice any particular problems
> then, but it's possible my testing wasn't thorough enough.  Or that
> fio only creates small numbers of dirty pages (because of locality in
> its access patterns I guess?)
>
> When you say it's not safe, what could happen?  What would we observe
> if it was going wrong?

Reading more carefully I see you said we'd observe a deadlock.  I
didn't see that, but again my testing of this wouldn't have been very
thorough.  When I have some time I'll try creating / spooling huge
files into an NBD loop mount to see if I can cause a deadlock.

Thanks, Rich.

> > Also note that nbd.txt in Documentation/blockdev/ points to
> > sourceforge; it should probably point to
> > https://github.com/NetworkBlockDevice/nbd ?
>
> Wouter should be able to say what the correct link should be.
>
> Rich.
>
> -- 
> Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
> Read my programming and virtualization blog: http://rwmj.wordpress.com
> libguestfs lets you edit virtual machines.  Supports shell scripting,
> bindings from many languages.  http://libguestfs.org

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
Re: nbd, nbdkit, loopback mounts and memory management
On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> Hi!
>
> I watched fosdem talk about
> nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> of warning: I'm not sure using it read-write on localhost is safe.
>
> In particular, user application could create a lot of dirty data
> quickly. If there's not enough memory for nbdkit (or nbd-client or
> nbd-server), you might get a deadlock.

Thanks for the kind words about the talk.  I've added Wouter Verhelst
& the NBD mailing list to CC.  Although I did the talk because the
subject is interesting, how I actually use nbdkit / NBD is to talk to
qemu and that's where I have most experience and where we (Red Hat)
use it in production systems.

However in January I spent a lot of time exercising the NBD loop-mount
+ nbdkit case using fio in order to find contention / bottlenecks in
our use of threads and locks.  I didn't notice any particular problems
then, but it's possible my testing wasn't thorough enough.  Or that
fio only creates small numbers of dirty pages (because of locality in
its access patterns I guess?)

When you say it's not safe, what could happen?  What would we observe
if it was going wrong?

> Also note that nbd.txt in Documentation/blockdev/ points to
> sourceforge; it should probably point to
> https://github.com/NetworkBlockDevice/nbd ?

Wouter should be able to say what the correct link should be.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
Re: [PATCH v2] libata: Drop SanDisk SD7UB3Q*G1001 NOLPM quirk
On Thu, May 31, 2018 at 01:21:07PM +0200, Hans de Goede wrote:
> Commit 184add2ca23c ("libata: Apply NOLPM quirk for SanDisk
> SD7UB3Q*G1001 SSDs") disabled LPM for SanDisk SD7UB3Q*G1001 SSDs.
>
> This has led to several reports from users of that SSD where LPM
> was working fine and who now have a significantly increased idle
> power consumption on their laptops.
>
> Likely there is another problem on the T450s from the original
> reporter which gets exposed by the uncore reaching deeper sleep
> states (higher PC-states) due to LPM being enabled.  The problem as
> reported, a hard freeze about once a day, already did not sound like
> it would be caused by LPM, and the reports of the SSD working fine
> confirm this.  The original reporter is ok with dropping the quirk.
>
> An X250 user has reported the same hard freeze problem, and for him
> the problem went away after unrelated updates; I suspect some GPU
> driver stack changes fixed things.
>
> TL;DR: The original reporter's problems were triggered by LPM but were
> not an LPM issue, so drop the quirk for the SSD in question.

As the reporter of the original issue, I agree with Hans's analysis
above, so ACK from me.

Rich.

> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1583207
> Cc: sta...@vger.kernel.org
> Cc: Richard W.M. Jones
> Cc: Lorenzo Dalrio
> Reported-by: Lorenzo Dalrio
> Signed-off-by: Hans de Goede
> ---
> Changes in v2:
> -Rebase on 4.17-rc7
> ---
>  drivers/ata/libata-core.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index 346b163f6e89..9bfd2f7e4542 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -4557,9 +4557,6 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
>  	{ "SAMSUNG MZMPC128HBFU-000MV", "CXM14M1Q", ATA_HORKAGE_NOLPM, },
>  	{ "SAMSUNG SSD PM830 mSATA *",  "CXM13D1Q", ATA_HORKAGE_NOLPM, },
>  
> -	/* Sandisk devices which are known to not handle LPM well */
> -	{ "SanDisk SD7UB3Q*G1001",	NULL,		ATA_HORKAGE_NOLPM, },
> -
>  	/* devices that don't properly handle queued TRIM commands */
>  	{ "Micron_M500IT_*",		"MU01",	ATA_HORKAGE_NO_NCQ_TRIM |
>  						ATA_HORKAGE_ZERO_AFTER_TRIM, },
> -- 
> 2.17.0

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
Re: [PATCH 1/2] virtio: Reduce BUG if total_sg > virtqueue size to WARN.
On Fri, Aug 11, 2017 at 12:41:47AM +0300, Michael S. Tsirkin wrote:
> On Thu, Aug 10, 2017 at 10:35:11PM +0100, Richard W.M. Jones wrote:
> > On Fri, Aug 11, 2017 at 12:31:44AM +0300, Michael S. Tsirkin wrote:
> > > Then we probably should fail probe if vq size is too small.
> >
> > What does this mean?
> >
> > Rich.
>
> We must prevent driver from submitting s/g lists > vq size to device.

Fair point.  I would have thought (not knowing anything about how this
works in the kernel) that the driver should be able to split up
scatter-gather lists if they are larger than what the driver supports?

Anyway the commit message is derived from what Paolo told me on IRC,
so let's see what he says.

Rich.

> Either tell linux to avoid s/g lists that are too long, or
> simply fail request if this happens, or refuse to attach driver to device.
>
> Later option would look something like this within probe:
>
> 	for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++)
> 		if (vqs[i]->num < MAX_SG_USED_BY_LINUX)
> 			goto err;
>
> I don't know what's MAX_SG_USED_BY_LINUX though.
>
> > -- 
> > Richard Jones, Virtualization Group, Red Hat
> > http://people.redhat.com/~rjones
> > Read my programming and virtualization blog: http://rwmj.wordpress.com
> > virt-builder quickly builds VMs from scratch
> > http://libguestfs.org/virt-builder.1.html

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
Re: [PATCH 1/2] virtio: Reduce BUG if total_sg > virtqueue size to WARN.
On Fri, Aug 11, 2017 at 12:31:44AM +0300, Michael S. Tsirkin wrote:
> Then we probably should fail probe if vq size is too small.

What does this mean?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
Re: [PATCH 1/2] virtio: Reduce BUG if total_sg > virtqueue size to WARN.
On Fri, Aug 11, 2017 at 12:21:16AM +0300, Michael S. Tsirkin wrote:
> On Thu, Aug 10, 2017 at 05:40:34PM +0100, Richard W.M. Jones wrote:
> > If using indirect descriptors, you can make the total_sg as large as
> > you want.
>
> That would be a spec violation though, even if it happens to
> work on current QEMU.
>
> The spec says:
> A driver MUST NOT create a descriptor chain longer than the Queue Size
> of the device.
>
> What prompted this patch?
> Do we ever encounter this situation?

This patch is needed because the following (2/2) patch will trigger
that BUG_ON if I set virtqueue_size=64 or any smaller value.  The
precise backtrace is below.

Rich.

[4.029510] [ cut here ]
[4.030127] kernel BUG at drivers/virtio/virtio_ring.c:299!
[4.030834] invalid opcode: [#1] SMP
[4.031340] Modules linked in: libcrc32c crc8 crc7 crc4 crc_itu_t virtio_pci virtio_mmio virtio_input virtio_balloon virtio_scsi nd_pmem nd_btt virtio_net virtio_crypto crypto_engine virtio_console virtio_rng virtio_blk virtio_ring virtio nfit crc32_generic crct10dif_pclmul crc32c_intel crc32_pclmul
[4.034606] CPU: 0 PID: 1 Comm: init Not tainted 4.13.0-rc4+ #100
[4.035354] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[4.036770] task: 9a859e243300 task.stack: a731c00cc000
[4.037506] RIP: 0010:virtqueue_add_sgs+0x23d/0x460 [virtio_ring]
[4.038250] RSP: :a731c00cf6e0 EFLAGS: 00010097
[4.038898] RAX: RBX: 0011 RCX: dd0680646c40
[4.039762] RDX: RSI: a731c00cf7b8 RDI: 9a85945c4d68
[4.040634] RBP: a731c00cf748 R08: 9a85945c4a78 R09: 01080020
[4.041508] R10: a731c00cf788 R11: 9a859b3d3120 R12: a731c00cf7d0
[4.042382] R13: a731c00cf7d0 R14: 0001 R15: 9a859b3f8200
[4.043248] FS: () GS:9a859e60() knlGS:
[4.044232] CS: 0010 DS: ES: CR0: 80050033
[4.044942] CR2: 7fcff02e931c CR3: 1d23b000 CR4: 003406f0
[4.045815] DR0: DR1: DR2:
[4.046684] DR3: DR6: fffe0ff0 DR7: 0400
[4.047559] Call Trace:
[4.047876]  virtscsi_add_cmd+0x1c9/0x280 [virtio_scsi]
[4.048528]  virtscsi_kick_cmd+0x38/0x90 [virtio_scsi]
[4.049161]  virtscsi_queuecommand+0x104/0x280 [virtio_scsi]
[4.049875]  virtscsi_queuecommand_single+0x38/0x40 [virtio_scsi]
[4.050628]  scsi_dispatch_cmd+0xf9/0x390
[4.051128]  scsi_queue_rq+0x5e5/0x6f0
[4.051602]  blk_mq_dispatch_rq_list+0x1ff/0x3c0
[4.052175]  blk_mq_sched_dispatch_requests+0x199/0x1f0
[4.052824]  __blk_mq_run_hw_queue+0x11d/0x1b0
[4.053382]  __blk_mq_delay_run_hw_queue+0x8d/0xa0
[4.053972]  blk_mq_run_hw_queue+0x14/0x20
[4.054485]  blk_mq_sched_insert_requests+0x96/0x120
[4.055095]  blk_mq_flush_plug_list+0x19d/0x410
[4.055661]  blk_flush_plug_list+0xf9/0x2b0
[4.056182]  blk_finish_plug+0x2c/0x40
[4.056655]  __do_page_cache_readahead+0x32e/0x400
[4.057261]  filemap_fault+0x2fb/0x890
[4.057731]  ? filemap_fault+0x2fb/0x890
[4.058220]  ? find_held_lock+0x3c/0xb0
[4.058714]  ext4_filemap_fault+0x34/0x50
[4.059212]  __do_fault+0x1e/0x110
[4.059644]  __handle_mm_fault+0x6b2/0x1080
[4.060167]  handle_mm_fault+0x178/0x350
[4.060662]  __do_page_fault+0x26e/0x510
[4.061152]  trace_do_page_fault+0x9d/0x290
[4.061677]  do_async_page_fault+0x51/0xa0
[4.062189]  async_page_fault+0x28/0x30
[4.062667] RIP: 0033:0x7fcff030a24f
[4.063113] RSP: 002b:7ffefc2ad078 EFLAGS: 00010206
[4.063760] RAX: 7fcff02e931c RBX: 7fcff050f660 RCX: 7fcff02e935c
[4.064648] RDX: 0664 RSI: RDI: 7fcff02e931c
[4.065519] RBP: 7ffefc2ad320 R08: 7fcff02e931c R09: 00027000
[4.066392] R10: 7fcff02e9980 R11: 0206 R12: 7ffefc2ad0b0
[4.067263] R13: 7ffefc2ad408 R14: 0002 R15: 87f0
[4.068135] Code: af 01 c7 45 c8 01 00 00 00 45 31 ed e9 9b fe ff ff 31 db 48 83 7d d0 00 0f 85 3f fe ff ff 0f 0b 48 8b 7d b8 e8 e5 fd 22 de eb 8b <0f> 0b 0f 0b 44 89 6d a8 45 89 f5 48 8b 45 a0 44 8d 70 01 48 83
[4.070506] RIP: virtqueue_add_sgs+0x23d/0x460 [virtio_ring] RSP: a731c00cf6e0
[4.071434] ---[ end trace 02532659840e2a64 ]---

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
[PATCH v2 1/2] virtio: Reduce BUG if total_sg > virtqueue size to WARN.
If using indirect descriptors, you can make the total_sg as large as
you want.  If not, BUG is too serious because the function later
returns -ENOSPC.

Thanks Paolo Bonzini, Christoph Hellwig.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
Reviewed-by: Paolo Bonzini <pbonz...@redhat.com>
---
 drivers/virtio/virtio_ring.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5e1b548828e6..27cbc1eab868 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -296,7 +296,6 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	}
 #endif
 
-	BUG_ON(total_sg > vq->vring.num);
 	BUG_ON(total_sg == 0);
 
 	head = vq->free_head;
@@ -305,8 +304,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	 * buffers, then go indirect. FIXME: tune this threshold */
 	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
 		desc = alloc_indirect(_vq, total_sg, gfp);
-	else
+	else {
 		desc = NULL;
+		WARN_ON_ONCE(total_sg > vq->vring.num && !vq->indirect);
+	}
 
 	if (desc) {
 		/* Use a single buffer which doesn't continue */
-- 
2.13.1
[PATCH v2 2/2] virtio: virtio_scsi: Set can_queue to the length of the virtqueue.
Since switching to blk-mq as the default in commit 5c279bd9e406
("scsi: default to scsi-mq"), virtio-scsi LUNs consume about 10x as
much kernel memory.

qemu currently allocates a fixed 128 entry virtqueue.  can_queue
currently is set to 1024.  But with indirect descriptors, each command
in the queue takes 1 virtqueue entry, so the number of commands which
can be queued is equal to the length of the virtqueue.

Note I intend to send a patch to qemu to allow the virtqueue size to
be configured from the qemu command line.

Thanks Paolo Bonzini, Christoph Hellwig.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 drivers/scsi/virtio_scsi.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 9be211d68b15..7c28e8d4955a 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -818,7 +818,6 @@ static struct scsi_host_template virtscsi_host_template_single = {
 	.eh_timed_out = virtscsi_eh_timed_out,
 	.slave_alloc = virtscsi_device_alloc,
-	.can_queue = 1024,
 	.dma_boundary = UINT_MAX,
 	.use_clustering = ENABLE_CLUSTERING,
 	.target_alloc = virtscsi_target_alloc,
@@ -839,7 +838,6 @@ static struct scsi_host_template virtscsi_host_template_multi = {
 	.eh_timed_out = virtscsi_eh_timed_out,
 	.slave_alloc = virtscsi_device_alloc,
-	.can_queue = 1024,
 	.dma_boundary = UINT_MAX,
 	.use_clustering = ENABLE_CLUSTERING,
 	.target_alloc = virtscsi_target_alloc,
@@ -972,6 +970,8 @@ static int virtscsi_probe(struct virtio_device *vdev)
 	if (err)
 		goto virtscsi_init_failed;
 
+	shost->can_queue = virtqueue_get_vring_size(vscsi->req_vqs[0].vq);
+
 	cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
 	shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
 	shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0x;
-- 
2.13.1
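For context on the "patch to qemu to allow the virtqueue size to be
configured" mentioned in the commit message above: with this kernel
patch applied, shrinking the device-side virtqueue directly shrinks
can_queue in the guest.  A hedged sketch of what such an invocation
could look like — the `virtqueue_size` property name is an assumption
here (the qemu patch had not been merged at the time of this thread),
so check your qemu version before relying on it:

```shell
# Hypothetical: cap the virtio-scsi request virtqueue at 64 entries,
# so the guest driver above would set shost->can_queue = 64.
# The virtqueue_size property may not exist in your qemu build.
qemu-system-x86_64 \
  -device virtio-scsi-pci,id=scsi0,virtqueue_size=64 \
  -drive if=none,id=hd0,file=disk.img,format=raw \
  -device scsi-hd,drive=hd0,bus=scsi0.0
```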
[PATCH v2 0/2] virtio_scsi: Set can_queue based on size of virtqueue.
v1 was here: https://lkml.org/lkml/2017/8/10/689 v1 -> v2: Remove .can_queue field from the templates. Rich.
[PATCH 2/2] virtio: virtio_scsi: Set can_queue to the length of the virtqueue.
Since switching to blk-mq as the default in commit 5c279bd9e406 ("scsi: default to scsi-mq"), virtio-scsi LUNs consume about 10x as much kernel memory. qemu currently allocates a fixed 128-entry virtqueue. can_queue currently is set to 1024. But with indirect descriptors, each command in the queue takes 1 virtqueue entry, so the number of commands which can be queued is equal to the length of the virtqueue.

Note I intend to send a patch to qemu to allow the virtqueue size to be configured from the qemu command line.

Thanks Paolo Bonzini, Christoph Hellwig.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 drivers/scsi/virtio_scsi.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 9be211d68b15..d6b4ff634c0d 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -972,6 +972,8 @@ static int virtscsi_probe(struct virtio_device *vdev)
 	if (err)
 		goto virtscsi_init_failed;
 
+	shost->can_queue = virtqueue_get_vring_size(vscsi->req_vqs[0].vq);
+
 	cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
 	shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
 	shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0xFFFF;
-- 
2.13.1
[PATCH 0/2] virtio_scsi: Set can_queue based on size of virtqueue.
Earlier discussion: https://lkml.org/lkml/2017/8/4/601 "Increased memory usage with scsi-mq" Downstream bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1478201
[PATCH 1/2] virtio: Reduce BUG if total_sg > virtqueue size to WARN.
If using indirect descriptors, you can make the total_sg as large as you want. If not, BUG is too serious because the function later returns -ENOSPC. Thanks Paolo Bonzini, Christoph Hellwig.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 drivers/virtio/virtio_ring.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5e1b548828e6..27cbc1eab868 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -296,7 +296,6 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	}
 #endif
 
-	BUG_ON(total_sg > vq->vring.num);
 	BUG_ON(total_sg == 0);
 
 	head = vq->free_head;
@@ -305,8 +304,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	 * buffers, then go indirect. FIXME: tune this threshold */
 	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
 		desc = alloc_indirect(_vq, total_sg, gfp);
-	else
+	else {
 		desc = NULL;
+		WARN_ON_ONCE(total_sg > vq->vring.num && !vq->indirect);
+	}
 
 	if (desc) {
 		/* Use a single buffer which doesn't continue */
-- 
2.13.1
Re: Increased memory usage with scsi-mq
OK this is looking a bit better now.

  With scsi-mq enabled:    175 disks
    virtqueue_size=64:     318 disks *
    virtqueue_size=16:     775 disks *
  With scsi-mq disabled:  1755 disks

  * = new results

I also ran the whole libguestfs test suite with virtqueue_size=16 (with no failures shown). As this tests many different disk I/O operations, it gives me some confidence that things generally work.

Do you have any other comments about the patches? I'm not sure I know enough to write an intelligent commit message for the kernel patch.

Rich.

--- kernel patch ---

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 9be211d68b15..d6b4ff634c0d 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -972,6 +972,8 @@ static int virtscsi_probe(struct virtio_device *vdev)
 	if (err)
 		goto virtscsi_init_failed;
 
+	shost->can_queue = virtqueue_get_vring_size(vscsi->req_vqs[0].vq);
+
 	cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
 	shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
 	shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0xFFFF;
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5e1b548828e6..2d7509da9f39 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -296,7 +296,6 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	}
 #endif
 
-	BUG_ON(total_sg > vq->vring.num);
 	BUG_ON(total_sg == 0);
 
 	head = vq->free_head;
@@ -305,8 +304,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	 * buffers, then go indirect. FIXME: tune this threshold */
 	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
 		desc = alloc_indirect(_vq, total_sg, gfp);
-	else
+	else {
 		desc = NULL;
+		WARN_ON_ONCE(total_sg > vq->vring.num && !vq->indirect);
+	}

--- qemu patch ---

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index eb639442d1..aadd99aad1 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -867,10 +867,10 @@ void virtio_scsi_common_realize(DeviceState *dev,
     s->sense_size = VIRTIO_SCSI_SENSE_DEFAULT_SIZE;
     s->cdb_size = VIRTIO_SCSI_CDB_DEFAULT_SIZE;
 
-    s->ctrl_vq = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, ctrl);
-    s->event_vq = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, evt);
+    s->ctrl_vq = virtio_add_queue(vdev, s->conf.virtqueue_size, ctrl);
+    s->event_vq = virtio_add_queue(vdev, s->conf.virtqueue_size, evt);
     for (i = 0; i < s->conf.num_queues; i++) {
-        s->cmd_vqs[i] = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, cmd);
+        s->cmd_vqs[i] = virtio_add_queue(vdev, s->conf.virtqueue_size, cmd);
     }
 }
 
@@ -917,6 +917,7 @@ static void virtio_scsi_device_unrealize(DeviceState *dev, Error **errp)
 static Property virtio_scsi_properties[] = {
     DEFINE_PROP_UINT32("num_queues", VirtIOSCSI, parent_obj.conf.num_queues, 1),
+    DEFINE_PROP_UINT32("virtqueue_size", VirtIOSCSI, parent_obj.conf.virtqueue_size, 128),
     DEFINE_PROP_UINT32("max_sectors", VirtIOSCSI, parent_obj.conf.max_sectors, 0xFFFF),
     DEFINE_PROP_UINT32("cmd_per_lun", VirtIOSCSI, parent_obj.conf.cmd_per_lun,
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index de6ae5a9f6..e30a92d3e7 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -48,6 +48,7 @@ typedef struct virtio_scsi_config VirtIOSCSIConfig;
 struct VirtIOSCSIConf {
     uint32_t num_queues;
+    uint32_t virtqueue_size;
     uint32_t max_sectors;
     uint32_t cmd_per_lun;
 #ifdef CONFIG_VHOST_SCSI

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
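With the qemu patch above applied, the tested configurations can be reproduced from the qemu command line. A hedged sketch: the `virtqueue_size` property name and its default of 128 come from the patch itself (which is not upstream), while the disk image path and device IDs here are placeholders.

```shell
# Illustrative invocation only -- requires the (not yet upstream) qemu
# patch adding the virtqueue_size property; disk.img is a placeholder.
qemu-system-x86_64 \
    -device virtio-scsi-pci,id=scsi0,virtqueue_size=16 \
    -drive if=none,id=d0,file=disk.img,format=raw \
    -device scsi-hd,drive=d0,bus=scsi0.0
```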
Re: Increased memory usage with scsi-mq
OK this is looking a bit better now. With scsi-mq enabled: 175 disks virtqueue_size=64: 318 disks * virtqueue_size=16: 775 disks * With scsi-mq disabled: 1755 disks * = new results I also ran the whole libguestfs test suite with virtqueue_size=16 (with no failures shown). As this tests many different disk I/O operations, it gives me some confidence that things generally work. Do you have any other comments about the patches? I'm not sure I know enough to write an intelligent commit message for the kernel patch. Rich. --- kernel patch --- diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c index 9be211d68b15..d6b4ff634c0d 100644 --- a/drivers/scsi/virtio_scsi.c +++ b/drivers/scsi/virtio_scsi.c @@ -972,6 +972,8 @@ static int virtscsi_probe(struct virtio_device *vdev) if (err) goto virtscsi_init_failed; + shost->can_queue = virtqueue_get_vring_size(vscsi->req_vqs[0].vq); + cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1; shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue); shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0x; diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 5e1b548828e6..2d7509da9f39 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -296,7 +296,6 @@ static inline int virtqueue_add(struct virtqueue *_vq, } #endif - BUG_ON(total_sg > vq->vring.num); BUG_ON(total_sg == 0); head = vq->free_head; @@ -305,8 +304,10 @@ static inline int virtqueue_add(struct virtqueue *_vq, * buffers, then go indirect. 
FIXME: tune this threshold */ if (vq->indirect && total_sg > 1 && vq->vq.num_free) desc = alloc_indirect(_vq, total_sg, gfp); - else + else { desc = NULL; + WARN_ON_ONCE(total_sg > vq->vring.num && !vq->indirect); +} if (desc) { /* Use a single buffer which doesn't continue */ --- qemu patch --- diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c index eb639442d1..aadd99aad1 100644 --- a/hw/scsi/virtio-scsi.c +++ b/hw/scsi/virtio-scsi.c @@ -867,10 +867,10 @@ void virtio_scsi_common_realize(DeviceState *dev, s->sense_size = VIRTIO_SCSI_SENSE_DEFAULT_SIZE; s->cdb_size = VIRTIO_SCSI_CDB_DEFAULT_SIZE; -s->ctrl_vq = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, ctrl); -s->event_vq = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, evt); +s->ctrl_vq = virtio_add_queue(vdev, s->conf.virtqueue_size, ctrl); +s->event_vq = virtio_add_queue(vdev, s->conf.virtqueue_size, evt); for (i = 0; i < s->conf.num_queues; i++) { -s->cmd_vqs[i] = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, cmd); +s->cmd_vqs[i] = virtio_add_queue(vdev, s->conf.virtqueue_size, cmd); } } @@ -917,6 +917,7 @@ static void virtio_scsi_device_unrealize(DeviceState *dev, Error **errp) static Property virtio_scsi_properties[] = { DEFINE_PROP_UINT32("num_queues", VirtIOSCSI, parent_obj.conf.num_queues, 1), +DEFINE_PROP_UINT32("virtqueue_size", VirtIOSCSI, parent_obj.conf.virtqueue_size, 128), DEFINE_PROP_UINT32("max_sectors", VirtIOSCSI, parent_obj.conf.max_sectors, 0x), DEFINE_PROP_UINT32("cmd_per_lun", VirtIOSCSI, parent_obj.conf.cmd_per_lun, diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h index de6ae5a9f6..e30a92d3e7 100644 --- a/include/hw/virtio/virtio-scsi.h +++ b/include/hw/virtio/virtio-scsi.h @@ -48,6 +48,7 @@ typedef struct virtio_scsi_config VirtIOSCSIConfig; struct VirtIOSCSIConf { uint32_t num_queues; +uint32_t virtqueue_size; uint32_t max_sectors; uint32_t cmd_per_lun; #ifdef CONFIG_VHOST_SCSI -- Richard Jones, Virtualization Group, Red Hat 
http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top
Re: Increased memory usage with scsi-mq
On Thu, Aug 10, 2017 at 02:53:58PM +0200, Paolo Bonzini wrote:
> can_queue and cmd_per_lun are different.  can_queue should be set to the
> value of vq->vring.num where vq is the command virtqueue (the first one
> is okay if there's >1).
>
> If you want to change it, you'll have to do so in QEMU.

Here are a couple more patches I came up with, the first to Linux, the second to qemu. There are a few problems ...

(1) Setting virtqueue_size to < 128 causes a BUG_ON failure in virtio_ring.c:virtqueue_add at:

      BUG_ON(total_sg > vq->vring.num);

    I initially thought that I should also set cmd_per_lun to the same value, but that doesn't help. Is there some relationship between virtqueue_size and other settings?

(2) Can/should the ctrl, event and cmd queue sizes be set to the same values? Or should ctrl & event be left at 128?

(3) It seems like it might be a problem that virtqueue_size is not passed through the virtio_scsi_conf struct (which is ABI between the kernel and qemu AFAICT and so not easily modifiable). However I think this is not a problem because the virtqueue size is stored in the Virtqueue Descriptor table, which is how the kernel gets the value. Is that right?

Rich.
--- kernel patch ---

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 9be211d68b15..d6b4ff634c0d 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -972,6 +972,8 @@ static int virtscsi_probe(struct virtio_device *vdev)
 	if (err)
 		goto virtscsi_init_failed;
 
+	shost->can_queue = virtqueue_get_vring_size(vscsi->req_vqs[0].vq);
+
 	cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
 	shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
 	shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0xFFFF;

--- qemu patch ---

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index eb639442d1..aadd99aad1 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -867,10 +867,10 @@ void virtio_scsi_common_realize(DeviceState *dev,
     s->sense_size = VIRTIO_SCSI_SENSE_DEFAULT_SIZE;
     s->cdb_size = VIRTIO_SCSI_CDB_DEFAULT_SIZE;
 
-    s->ctrl_vq = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, ctrl);
-    s->event_vq = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, evt);
+    s->ctrl_vq = virtio_add_queue(vdev, s->conf.virtqueue_size, ctrl);
+    s->event_vq = virtio_add_queue(vdev, s->conf.virtqueue_size, evt);
     for (i = 0; i < s->conf.num_queues; i++) {
-        s->cmd_vqs[i] = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, cmd);
+        s->cmd_vqs[i] = virtio_add_queue(vdev, s->conf.virtqueue_size, cmd);
     }
 }
 
@@ -917,6 +917,7 @@ static void virtio_scsi_device_unrealize(DeviceState *dev, Error **errp)
 static Property virtio_scsi_properties[] = {
     DEFINE_PROP_UINT32("num_queues", VirtIOSCSI, parent_obj.conf.num_queues, 1),
+    DEFINE_PROP_UINT32("virtqueue_size", VirtIOSCSI, parent_obj.conf.virtqueue_size, 128),
     DEFINE_PROP_UINT32("max_sectors", VirtIOSCSI, parent_obj.conf.max_sectors, 0xFFFF),
     DEFINE_PROP_UINT32("cmd_per_lun", VirtIOSCSI, parent_obj.conf.cmd_per_lun,
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index de6ae5a9f6..e30a92d3e7 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -48,6 +48,7 @@ typedef struct virtio_scsi_config VirtIOSCSIConfig;
 struct VirtIOSCSIConf {
     uint32_t num_queues;
+    uint32_t virtqueue_size;
     uint32_t max_sectors;
     uint32_t cmd_per_lun;
 #ifdef CONFIG_VHOST_SCSI

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
Re: Increased memory usage with scsi-mq
On Wed, Aug 09, 2017 at 06:50:10PM +0200, Paolo Bonzini wrote:
> On 09/08/2017 18:01, Christoph Hellwig wrote:
> > On Mon, Aug 07, 2017 at 03:07:48PM +0200, Paolo Bonzini wrote:
> >> can_queue should depend on the virtqueue size, which unfortunately can
> >> vary for each virtio-scsi device in theory.  The virtqueue size is
> >> retrieved by drivers/virtio prior to calling vring_create_virtqueue, and
> >> in QEMU it is the second argument to virtio_add_queue.
> >
> > Why is that unfortunate?  We don't even have to set can_queue in
> > the host template, we can dynamically set it on per-host basis.
>
> Ah, cool, I thought allocations based on can_queue happened already in
> scsi_host_alloc, but they happen at scsi_add_host time.

I think I've decoded all that information into the patch below. I tested it, and it appears to work: when I set cmd_per_lun on the qemu command line, I see that the guest can add more disks:

  With scsi-mq enabled:    175 disks
    cmd_per_lun not set:   177 disks *
    cmd_per_lun=16:        776 disks *
    cmd_per_lun=4:        1160 disks *
  With scsi-mq disabled:  1755 disks

  * = new result

From my point of view, this is a good result, but you should be warned that I don't fully understand what's going on here and I may have made obvious or not-so-obvious mistakes.

I tested the performance impact and it's not noticeable in the libguestfs case even with very small cmd_per_lun settings, but libguestfs is largely serial and so this result won't be applicable to guests in general.

Also, should the guest kernel validate cmd_per_lun to make sure it's not too small or large? And if so, what would the limits be?

Rich.

From e923e49846189b2f55f3f02b70a290d4be237ed5 Mon Sep 17 00:00:00 2001
From: "Richard W.M. Jones" <rjo...@redhat.com>
Date: Thu, 10 Aug 2017 12:21:47 +0100
Subject: [PATCH] scsi: virtio_scsi: Set can_queue based on cmd_per_lun passed
 by hypervisor.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 drivers/scsi/virtio_scsi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 9be211d68b15..b22591e9b16b 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -973,7 +973,7 @@ static int virtscsi_probe(struct virtio_device *vdev)
 		goto virtscsi_init_failed;
 
 	cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
-	shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
+	shost->cmd_per_lun = shost->can_queue = cmd_per_lun;
 	shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0xFFFF;
 
 	/* LUNs > 256 are reported with format 1, so they go in the range
-- 
2.13.1

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW
Re: Increased memory usage with scsi-mq
On Mon, Aug 07, 2017 at 02:11:39PM +0200, Paolo Bonzini wrote:
> You could also add a module parameter to the driver, and set it to 64 on
> the kernel command line (there is an example in
> drivers/scsi/vmw_pvscsi.c of how to do it).

[Proviso: I've not tested the performance of different values, nor do I have any particular knowledge in this area]

can_queue is documented as:

 * This determines if we will use a non-interrupt driven
 * or an interrupt driven scheme.  It is set to the maximum number
 * of simultaneous commands a given host adapter will accept.

Wouldn't it be better to make it default to k * number of CPUs for some small integer k? I looked at the other scsi drivers and they set it to all kinds of values: 1, small integers, large integers, configurable values.

Also I noticed this code in virtio_scsi.c:

	cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
	shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);

but setting cmd_per_lun (as a qemu virtio-scsi-pci parameter) didn't seem to make any difference to memory usage.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
Re: Increased memory usage with scsi-mq
On Sat, Aug 05, 2017 at 03:39:54PM +0200, Christoph Hellwig wrote: > For now can you apply this testing patch to the guest kernel? > > diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c > index 9be211d68b15..0cbe2c882e1c 100644 > --- a/drivers/scsi/virtio_scsi.c > +++ b/drivers/scsi/virtio_scsi.c > @@ -818,7 +818,7 @@ static struct scsi_host_template > virtscsi_host_template_single = { > .eh_timed_out = virtscsi_eh_timed_out, > .slave_alloc = virtscsi_device_alloc, > > - .can_queue = 1024, > + .can_queue = 64, > .dma_boundary = UINT_MAX, > .use_clustering = ENABLE_CLUSTERING, > .target_alloc = virtscsi_target_alloc, > @@ -839,7 +839,7 @@ static struct scsi_host_template > virtscsi_host_template_multi = { > .eh_timed_out = virtscsi_eh_timed_out, > .slave_alloc = virtscsi_device_alloc, > > - .can_queue = 1024, > + .can_queue = 64, > .dma_boundary = UINT_MAX, > .use_clustering = ENABLE_CLUSTERING, > .target_alloc = virtscsi_target_alloc, > @@ -983,7 +983,7 @@ static int virtscsi_probe(struct virtio_device *vdev) > shost->max_id = num_targets; > shost->max_channel = 0; > shost->max_cmd_len = VIRTIO_SCSI_CDB_SIZE; > - shost->nr_hw_queues = num_queues; > + shost->nr_hw_queues = 1; > > #ifdef CONFIG_BLK_DEV_INTEGRITY > if (virtio_has_feature(vdev, VIRTIO_SCSI_F_T10_PI)) { Yes, that's an improvement, although it's still a little way off the density possible the old way: With scsi-mq enabled: 175 disks * With this patch:319 disks * With scsi-mq disabled: 1755 disks Also only the first two hunks are necessary. The kernel behaves exactly the same way with or without the third hunk (ie. num_queues must already be 1). Can I infer from this that qemu needs a way to specify the can_queue setting to the virtio-scsi driver in the guest kernel? Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com libguestfs lets you edit virtual machines. 
Supports shell scripting, bindings from many languages. http://libguestfs.org
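The scale of the saving from lowering can_queue can be sketched with back-of-envelope shell arithmetic. The per-request size and the simple "can_queue requests preallocated per host" model below are illustrative assumptions for the estimate, not exact kernel figures:

```shell
# Rough estimate of blk-mq request preallocation across many virtio-scsi disks.
# Assumed ~384 bytes per preallocated request (illustrative only).
req_bytes=384
disks=256
old_kib=$(( 1024 * req_bytes * disks / 1024 ))   # can_queue = 1024
new_kib=$(( 64 * req_bytes * disks / 1024 ))     # can_queue = 64
echo "can_queue=1024: ${old_kib} KiB across ${disks} disks"
echo "can_queue=64:   ${new_kib} KiB across ${disks} disks"
```

Under these assumptions the preallocation drops from roughly 96 MiB to 6 MiB for 256 disks, which is consistent in direction (though not magnitude) with the measured 175 vs. 319 disk density above.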
Re: Increased memory usage with scsi-mq
On Sat, Aug 05, 2017 at 10:44:36AM +0200, Christoph Hellwig wrote: > On Fri, Aug 04, 2017 at 10:00:47PM +0100, Richard W.M. Jones wrote: > > I read your slides about scsi-mq and it seems like a significant > > benefit to large machines, but could the out of the box defaults be > > made more friendly for small memory machines? > > The default number of queues and queue depth and thus memory usage is > set by the LLDD. > > Try to reduce the can_queue value in virtio_scsi and/or make sure > you use the single queue variant in your VM (which should be tunable > in qemu). Thanks, this is interesting. Virtio-scsi seems to have a few settable parameters that might be related to this: DEFINE_PROP_UINT32("num_queues", VirtIOSCSI, parent_obj.conf.num_queues, 1), DEFINE_PROP_UINT32("max_sectors", VirtIOSCSI, parent_obj.conf.max_sectors, 0x), DEFINE_PROP_UINT32("cmd_per_lun", VirtIOSCSI, parent_obj.conf.cmd_per_lun, 128), Unfortunately (assuming I'm setting them right - see below), none of them have any effect on the number of disks that I can add to the VM. I am testing them by placing them in the ‘-device virtio-scsi-pci’ parameter, ie. as a property of the controller, not a property of the LUN, eg: -device virtio-scsi-pci,cmd_per_lun=32,id=scsi \ -drive file=/home/rjones/d/libguestfs/tmp/libguestfshXImTv/scratch.1,cache=unsafe,format=raw,id=hd0,if=none \ -device scsi-hd,drive=hd0 \ The debugging output is a bit too large to attach to this email, but I have placed it at the link below. It contains (if you scroll down a bit) the full qemu command line and full kernel output. http://oirase.annexia.org/tmp/bz1478201-log.txt I can add some extra debugging into the kernel if you like. Just point me to the right place. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-builder quickly builds VMs from scratch http://libguestfs.org/virt-builder.1.html
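As a quick sanity check while experimenting with these qemu properties, the queue depth the guest driver actually ended up with can be read back from sysfs. The host number and path below are assumptions for a typical single-controller guest; the snippet falls back to virtio_scsi's compiled-in 1024 default when the file is absent:

```shell
# Read the effective queue depth for the first SCSI host, if present.
sysfs=/sys/class/scsi_host/host0/can_queue
if [ -r "$sysfs" ]; then
    can_queue=$(cat "$sysfs")
else
    can_queue=1024   # virtio_scsi's default, used as a fallback for illustration
fi
echo "can_queue=$can_queue"
```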
Increased memory usage with scsi-mq
https://bugzilla.redhat.com/show_bug.cgi?id=1478201 We have a libguestfs test which adds 256 virtio-scsi disks to a qemu virtual machine. The VM has 500 MB of RAM, 1 vCPU and no swap. This test has been failing for a little while. It runs out of memory during SCSI enumeration in early boot. Tonight I bisected the cause to: 5c279bd9e40624f4ab6e688671026d6005b066fa is the first bad commit commit 5c279bd9e40624f4ab6e688671026d6005b066fa Author: Christoph Hellwig Date: Fri Jun 16 10:27:55 2017 +0200 scsi: default to scsi-mq Remove the SCSI_MQ_DEFAULT config option and default to the blk-mq I/O path now that we had plenty of testing, and have I/O schedulers for blk-mq. The module option to disable the blk-mq path is kept around for now. Signed-off-by: Christoph Hellwig Signed-off-by: Martin K. Petersen :04 04 57ec7d5d2ba76592a695f533a69f747700c31966 c79f6ecb070acc4fadf6fc05ca9ba32bc9c0c665 M drivers I also wrote a small test to see the maximum number of virtio-scsi disks I could add to the above VM. The results were very surprising (to me anyhow): With scsi-mq enabled: 175 disks With scsi-mq disabled: 1755 disks I don't know why the ratio is almost exactly 10 times. I read your slides about scsi-mq and it seems like a significant benefit to large machines, but could the out of the box defaults be made more friendly for small memory machines? Thanks, Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com libguestfs lets you edit virtual machines. Supports shell scripting, bindings from many languages. http://libguestfs.org
Re: [REGRESSION] 07ec51480b5e ("virtio_pci: use shared interrupts for virtqueues") causes crashes in guest
On Thu, Mar 23, 2017 at 01:13:50PM +0800, Jason Wang wrote: > >From 312859b596e83a2164a8430343d31fce2a5ad808 Mon Sep 17 00:00:00 2001 > From: Jason Wang <jasow...@redhat.com> > Date: Thu, 23 Mar 2017 13:07:16 +0800 > Subject: [PATCH] virtio_pci: fix out of bound access for msix_names > > Signed-off-by: Jason Wang <jasow...@redhat.com> I tested this, and it does appear to fix the crashes in vp_modern_find_vqs. Therefore: Tested-by: Richard W.M. Jones <rjo...@redhat.com> Thanks, Rich. > drivers/virtio/virtio_pci_common.c | 9 + > 1 file changed, 5 insertions(+), 4 deletions(-) > > diff --git a/drivers/virtio/virtio_pci_common.c > b/drivers/virtio/virtio_pci_common.c > index df548a6..5905349 100644 > --- a/drivers/virtio/virtio_pci_common.c > +++ b/drivers/virtio/virtio_pci_common.c > @@ -147,7 +147,7 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, > unsigned nvqs, > { > struct virtio_pci_device *vp_dev = to_vp_device(vdev); > const char *name = dev_name(_dev->vdev.dev); > - int i, err = -ENOMEM, allocated_vectors, nvectors; > + int i, j, err = -ENOMEM, allocated_vectors, nvectors; > unsigned flags = PCI_IRQ_MSIX; > bool shared = false; > u16 msix_vec; > @@ -212,7 +212,7 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, > unsigned nvqs, > if (!vp_dev->msix_vector_map) > goto out_disable_config_irq; > > - allocated_vectors = 1; /* vector 0 is the config interrupt */ > + allocated_vectors = j = 1; /* vector 0 is the config interrupt */ > for (i = 0; i < nvqs; ++i) { > if (!names[i]) { > vqs[i] = NULL; > @@ -236,18 +236,19 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, > unsigned nvqs, > continue; > } > > - snprintf(vp_dev->msix_names[i + 1], > + snprintf(vp_dev->msix_names[j], >sizeof(*vp_dev->msix_names), "%s-%s", >dev_name(_dev->vdev.dev), names[i]); > err = request_irq(pci_irq_vector(vp_dev->pci_dev, msix_vec), > vring_interrupt, IRQF_SHARED, > - vp_dev->msix_names[i + 1], vqs[i]); > + vp_dev->msix_names[j], vqs[i]); > if (err) { > /* 
don't free this irq on error */ > vp_dev->msix_vector_map[i] = VIRTIO_MSI_NO_VECTOR; > goto out_remove_vqs; > } > vp_dev->msix_vector_map[i] = msix_vec; > + j++; > > /* >* Use a different vector for each queue if they are available, > -- > 2.7.4 > -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-builder quickly builds VMs from scratch http://libguestfs.org/virt-builder.1.html
Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")
On Thu, Mar 23, 2017 at 03:56:22PM +0100, Christoph Hellwig wrote: > Does the patch from Jason in the > > "[REGRESSION] 07ec51480b5e ("virtio_pci: use shared interrupts for > virtqueues") causes crashes in guest" > > thread fix the issue for you? In brief, yes it does. I followed up on that thread. Thanks, Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com Fedora Windows cross-compiler. Compile Windows programs, test, and build Windows installers. Over 100 libraries supported. http://fedoraproject.org/wiki/MinGW
Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")
On Thu, Mar 23, 2017 at 03:56:22PM +0100, Christoph Hellwig wrote: > Does the patch from Jason in the > > "[REGRESSION] 07ec51480b5e ("virtio_pci: use shared interrupts for > virtqueues") causes crashes in guest" > > thread fix the issue for you? I didn't see this thread before. I'll check that out for you now. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-p2v converts physical machines to virtual machines. Boot with a live CD or over the network (PXE) and turn machines into KVM guests. http://libguestfs.org/virt-v2v
Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")
On Thu, Mar 23, 2017 at 03:51:25PM +0100, Thorsten Leemhuis wrote: > Hi Christoph! Hi Michael! > > (Mail roughly based on text from > https://bugzilla.kernel.org/show_bug.cgi?id=194911 ) > > I'm seeing random crashes during boot every few boot attempts when > running Linux 4.11-rc/mainline in a Fedora 26 guest under a CentOS7 host > (CPU: Intel(R) Pentium(R) CPU G3220) using KVM. Sometimes when the guest > actually booted the network did not work. To get some impressions of the > crashes I got see this gallery: > https://plus.google.com/+ThorstenLeemhuis/posts/FjyyGjNtrrG > > Richard W.M. Jones and Adam Williamson see the same problems. See above > bug for details. It seems they ran into the problem in the past few > days, so I assume it's still present in mainline (I'm travelling > currently and haven't had time for proper tests since last last Friday > (pre-rc3); but I thought it's time to get the problem to the lists). > > Long story short: Richard and I did bisections and we both found that > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=07ec51480b5e > ("virtio_pci: use shared interrupts for virtqueues") is the first bad > commit. Any idea what might be wrong? Do you need more details from us > to fix this? Laura Abbott posted a kernel RPM which works for me. She has had to revert quite a number of commits, which are detailed in this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1430297#c7 Her reverting patch is also attached. Rich. 
-- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-builder quickly builds VMs from scratch http://libguestfs.org/virt-builder.1.html >From 4d3cba0be27b20516eb765c2913bce93e73fe30e Mon Sep 17 00:00:00 2001 From: Laura Abbott <labb...@redhat.com> Date: Wed, 22 Mar 2017 15:41:27 -0700 Subject: [PATCH] Revert a bunch of virtio commits 07ec51480b5e ("virtio_pci: use shared interrupts for virtqueues") is linked to a bunch of issues. Unfortunately we can't just revert it by itself. Revert it and dependency patches as well. Revert "virtio: provide a method to get the IRQ affinity mask for a virtqueue" This reverts commit bbaba479563910aaa51e59bb9027a09e396d3a3c. Revert "virtio-console: avoid DMA from stack" This reverts commit c4baad50297d84bde1a7ad45e50c73adae4a2192. Revert "vhost: introduce O(1) vq metadata cache" This reverts commit f889491380582b4ba2981cf0b0d7d6a40fb30ab7. Conflicts: drivers/vhost/vhost.c Revert "virtio_scsi: use virtio IRQ affinity" This reverts commit 0d9f0a52c8b9f7a003fe1650b7d5fb8518efabe0. Revert "virtio_blk: use virtio IRQ affinity" This reverts commit ad71473d9c43725c917fc5a86d54ceb7001ee28c. Revert "blk-mq: provide a default queue mapping for virtio device" This reverts commit 73473427bb551686e4b68ecd99bfd27e6635286a. Revert "virtio: allow drivers to request IRQ affinity when creating VQs" This reverts commit fb5e31d970ce8b4941f03ed765d7dbefc39f22d9. Revert "virtio_pci: simplify MSI-X setup" This reverts commit 52a61516125fa9a21b3bdf4f90928308e2e5573f. Revert "virtio_pci: don't duplicate the msix_enable flag in struct pci_dev" This reverts commit 53a020c661741f3b87ad3ac6fa545088aaebac9b. Revert "virtio_pci: use shared interrupts for virtqueues" This reverts commit 07ec51480b5eb1233f8c1b0f5d7a7c8d1247c507. 
--- block/Kconfig | 5 - block/Makefile | 1 - block/blk-mq-virtio.c | 54 -- drivers/block/virtio_blk.c | 14 +- drivers/char/virtio_console.c | 14 +- drivers/crypto/virtio/virtio_crypto_core.c | 2 +- drivers/gpu/drm/virtio/virtgpu_kms.c | 2 +- drivers/misc/mic/vop/vop_main.c| 2 +- drivers/net/caif/caif_virtio.c | 3 +- drivers/net/virtio_net.c | 2 +- drivers/remoteproc/remoteproc_virtio.c | 3 +- drivers/rpmsg/virtio_rpmsg_bus.c | 2 +- drivers/s390/virtio/kvm_virtio.c | 3 +- drivers/s390/virtio/virtio_ccw.c | 3 +- drivers/scsi/virtio_scsi.c | 127 +++-- drivers/vhost/vhost.c | 136 +++--- drivers/vhost/vhost.h | 8 - drivers/virtio/virtio_balloon.c| 3 +- drivers/virtio/virtio_input.c | 3 +- drivers/virtio/virtio_mmio.c | 3 +- drivers/virtio/virtio_pci_common.c | 287 +++-- drivers/virtio/virtio_pci_common.h | 25 ++- drivers/virtio/virtio_pci_legacy.c | 3 +- drivers/virtio/virtio_pci_modern.c | 11 +- include/linux/blk-mq-virtio.h | 10 - include/
Re: [PATCH RESEND] hwrng: core - don't pass stack allocated buffer to rng->read()
On Fri, Oct 21, 2016 at 02:04:27PM -0700, Andy Lutomirski wrote: > https://git.kernel.org/cgit/linux/kernel/git/herbert/cryptodev-2.6.git/commit/?id=6d4952d9d9d4dc2bb9c0255d95a09405a1e958f7 I have tested this one, and it also fixes the bug I was seeing. Thanks Laszlo as well for his fix, and sorry for not finding the patch above first. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com Fedora Windows cross-compiler. Compile Windows programs, test, and build Windows installers. Over 100 libraries supported. http://fedoraproject.org/wiki/MinGW
Re: Warning from free_init_pages with large initrd
On Tue, Sep 27, 2016 at 11:20:06AM +0100, Sitsofe Wheeler wrote: > (See http://www.gossamer-threads.com/lists/linux/kernel/2534175 for > the history of this thread ) > > On 26 September 2016 at 20:00, Randy Dunlap wrote: > > > > but the warning in free_init_pages() is about alignment, not size... > > Maybe the concatenation is bad? > > What would l have to pull apart to be able to tell? > > Having said that I've just noticed that newer versions of the script > concatenate an ISO to the initrd (rather than another cpio) - > https://github.com/rhinstaller/livecd-tools/commit/8067be50907da9461e442c11a664c89e066ccac6#diff-88c69e43bb69726c532af3a136cc50e8 > . Unfortunately after rebuilding the initrd with the > livecd-iso-to-pxeboot version from Fedora 24 the warning persists. > CC'ing a few of the tool's authors on this email. I would just comment that I wrote livecd-iso-to-pxeboot for a specific purpose -- virt-p2v -- where the size of the ISO rarely exceeded around 100-200 MB, which is large for an initramfs, but about an order of magnitude smaller than what you are trying. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top
[PATCH v3] crypto: Add a flag allowing the self-tests to be disabled at runtime.
Running self-tests for a short-lived KVM VM takes 28ms on my laptop. This commit adds a flag 'cryptomgr.notests' which allows them to be disabled. However if fips=1 as well, we ignore this flag as FIPS mode mandates that the self-tests are run. Signed-off-by: Richard W.M. Jones <rjo...@redhat.com> --- Documentation/kernel-parameters.txt | 3 +++ crypto/testmgr.c| 9 + 2 files changed, 12 insertions(+) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 0b3de80..d4d5fb7 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -826,6 +826,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted. It will be ignored when crashkernel=X,high is not used or memory reserved is below 4G. + cryptomgr.notests +[KNL] Disable crypto self-tests + cs89x0_dma= [HW,NET] Format: diff --git a/crypto/testmgr.c b/crypto/testmgr.c index b86883a..fcd89fe 100644 --- a/crypto/testmgr.c +++ b/crypto/testmgr.c @@ -35,6 +35,10 @@ #include "internal.h" +static bool notests; +module_param(notests, bool, 0644); +MODULE_PARM_DESC(notests, "disable crypto self-tests"); + #ifdef CONFIG_CRYPTO_MANAGER_DISABLE_TESTS /* a perfect nop */ @@ -3868,6 +3872,11 @@ int alg_test(const char *driver, const char *alg, u32 type, u32 mask) int j; int rc; + if (!fips_enabled && notests) { + printk_once(KERN_INFO "alg: self-tests disabled\n"); + return 0; + } + alg_test_descs_check_order(); if ((type & CRYPTO_ALG_TYPE_MASK) == CRYPTO_ALG_TYPE_CIPHER) { -- 2.7.4
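For completeness, a hypothetical guest invocation using the new parameter might look like the following. The qemu options, memory size, and file names are placeholders for illustration, not part of the patch:

```shell
# Boot a short-lived guest with crypto self-tests skipped.
# (The flag is ignored if the kernel is also booted with fips=1.)
qemu-kvm -m 512 -kernel vmlinuz -initrd initrd.img \
    -append "console=ttyS0 cryptomgr.notests" \
    -drive file=disk.img,if=virtio
```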
[PATCH v3] crypto: Add a flag allowing the self-tests to be disabled at runtime.
v2 -> v3:

 - Ignore the flag if FIPS mode is enabled.

v1 -> v2:

 - Use printk_once.  Because the serial console is so slow, printing
   the message multiple times actually consumed about 6ms extra later
   on during the boot.

- - -

I'm trying to reduce the time taken in the kernel in initcalls, with
my aim being to reduce the current ~700ms spent in initcalls before
userspace, down to something like 100ms.  All times on my Broadwell-U
laptop, under virtualization.  The purpose of this is to be able to
launch VMs around containers with minimal overhead, like Intel Clear
Containers, but using standard distro kernels and qemu.

Currently the kernel spends 28ms (on my laptop) running crypto
algorithm self-tests.  Although it's possible to disable these at
compile time, Fedora kernel maintainers want to maintain a single
kernel image for all uses.  So this commit adds a runtime flag which
callers can set to skip the self-tests in the fast
container/virtualization case.

Rich.
Re: [PATCH] 8250: Hypervisors always export working 16550A UARTs.
On Fri, Apr 29, 2016 at 08:54:13AM -0700, Greg KH wrote:
> You are trying to take a generalized kernel and somehow "know" about the
> hardware ahead of time it is going to run on.  That seems like two
> conflicting requirements, don't you agree?

We would have the 8250 serial port in any kernel.  Even if Fedora
kernel maintainers allowed us to have specialized kernels for each
purpose, I would use the simple ISA serial port here because it allows
us to capture debug messages very early in the boot.  Alternatives
like virtio-console don't allow that.

The kernel does know what hardware it's running on - via the CPUID
hypervisor leaf.  It's also possible for us to tell the kernel about
the hardware using the command line, ACPI[*], DT, etc.  I'd really
like to tell the kernel this is a 16550A, not broken, you don't need
to spend time testing that.  There is prior art here: no_timer_check &
lpj=..

Rich.

[*] Although ACPI is really slow, adding another 190ms, and for this
reason I have disabled it for now, but not investigated why it's so
slow.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
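The CPUID hypervisor bit referred to above is also visible to userspace: on x86 Linux the kernel mirrors it as the "hypervisor" flag in /proc/cpuinfo.  A quick sketch of the same check from a shell (x86 Linux only; on other architectures the flag is simply absent):

```shell
# Detect the CPUID "hypervisor present" bit from userspace via the
# "hypervisor" flag in /proc/cpuinfo.  Absent on bare metal or non-x86.
if grep -qw hypervisor /proc/cpuinfo 2>/dev/null; then
    echo "running under a hypervisor"
else
    echo "no hypervisor detected (or non-x86)"
fi
```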
Re: [PATCH] 8250: Hypervisors always export working 16550A UARTs.
On Fri, Apr 29, 2016 at 08:16:35AM -0700, Greg KH wrote:
> On Fri, Apr 29, 2016 at 09:10:06AM +0100, Richard W.M. Jones wrote:
> > On Thu, Apr 28, 2016 at 03:56:33PM -0700, Greg KH wrote:
> > > On Thu, Apr 28, 2016 at 11:18:33PM +0100, Richard W.M. Jones wrote:
> > > > Currently autoconf spends 25ms (on my laptop) testing if the UART
> > > > exported to it by KVM is an 8250 without FIFO and/or with strange
> > > > quirks, which it obviously isn't.  Assume it is exported to us by a
> > > > hypervisor, it's a normal, working 16550A.
> > > >
> > > > Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
> > > > ---
> > > >  drivers/tty/serial/8250/8250_port.c | 7 +++++++
> > > >  1 file changed, 7 insertions(+)
> > > >
> > > > diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
> > > > index 00ad2637..de19924 100644
> > > > --- a/drivers/tty/serial/8250/8250_port.c
> > > > +++ b/drivers/tty/serial/8250/8250_port.c
> > > > @@ -1171,6 +1171,13 @@ static void autoconfig(struct uart_8250_port *up)
> > > >  	if (!port->iobase && !port->mapbase && !port->membase)
> > > >  		return;
> > > >
> > > > +	/* Hypervisors always export working 16550A devices. */
> > > > +	if (cpu_has_hypervisor) {
> > > > +		up->port.type = PORT_16550A;
> > > > +		up->capabilities |= UART_CAP_FIFO;
> > > > +		return;
> > > > +	}
> > >
> > > Have you audited vmware, virtualbox, and everyone else that provides a
> > > virtual uart device that it will work properly here?
> > >
> > > qemu isn't all the world :)
> >
> > Attached below is a slightly different approach.  If the user passes a
> > special flag on the kernel command line then we force 16550A and avoid
> > the 25ms delay.  Since the user chooses the flag, any concerns about
> > the behaviour of the hypervisor or use of VFIO should be moot.
>
> No, no more module parameters, that's crazy, what happens when you have
> 64 serial ports in a system, which one is this option for?

In this (very special) case, the domain is running under qemu and I
know it only has one serial port that doesn't need probing.

What's the right way to avoid this 25ms delay?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
Re: [PATCH] crypto: Add a flag allowing the self-tests to be disabled at runtime.
On Fri, Apr 29, 2016 at 12:59:57PM +0200, Stephan Mueller wrote:
> Am Freitag, 29. April 2016, 11:07:43 schrieb Richard W.M. Jones:
>
> Hi Richard,

[...]

> > +	if (notests) {
>
> What about if (!fips_enabled && notests) ?
>
> I am not sure whether the kernel should prevent mistakes in user space.  A
> mistake would be when setting fips=1 and notests=1 as the FIPS mode mandates
> the self tests.

(Sorry, I just posted v2 before I saw this message.)

I saw the FIPS stuff and thought about that.  Should we prevent
mistakes like that?  I really don't know.

Rich.

> > +		pr_info("alg: self-tests disabled\n");
> > +		return 0;
> > +	}
> > +
> > 	alg_test_descs_check_order();
> >
> > 	if ((type & CRYPTO_ALG_TYPE_MASK) == CRYPTO_ALG_TYPE_CIPHER) {
>
> Ciao
> Stephan
> --
> | Nimm das Recht weg - |
> | was ist dann der Staat noch anderes als eine große Räuberbande? |

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
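Stephan's suggested `!fips_enabled && notests` check (which the v3 patch in this thread adopts) gives fips=1 priority over notests=1.  A minimal shell sketch of that precedence, with illustrative variable names standing in for the kernel's flags:

```shell
# Sketch of the !fips_enabled && notests precedence: notests only takes
# effect when FIPS mode is off; fips=1 always forces the self-tests.
decide() {
    fips_enabled=$1
    notests=$2
    if [ "$fips_enabled" -eq 0 ] && [ "$notests" -eq 1 ]; then
        echo "self-tests skipped"
    else
        echo "self-tests run"
    fi
}
decide 0 1   # notests honoured
decide 1 1   # FIPS mode overrides notests
```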
[PATCH v2] crypto: Add a flag allowing the self-tests to be disabled at runtime.
Running self-tests for a short-lived KVM VM takes 28ms on my laptop.
This commit adds a flag 'cryptomgr.notests' which allows them to be
disabled.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 Documentation/kernel-parameters.txt | 3 +++
 crypto/testmgr.c                    | 9 +++++++++
 2 files changed, 12 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 0b3de80..d4d5fb7 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -826,6 +826,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			It will be ignored when crashkernel=X,high is not used
 			or memory reserved is below 4G.
 
+	cryptomgr.notests
+			[KNL] Disable crypto self-tests
+
 	cs89x0_dma=	[HW,NET]
 			Format:

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index b86883a..5c0664d 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -35,6 +35,10 @@
 
 #include "internal.h"
 
+static bool notests;
+module_param(notests, bool, 0644);
+MODULE_PARM_DESC(notests, "disable crypto self-tests");
+
 #ifdef CONFIG_CRYPTO_MANAGER_DISABLE_TESTS
 
 /* a perfect nop */
@@ -3868,6 +3872,11 @@ int alg_test(const char *driver, const char *alg, u32 type, u32 mask)
 	int j;
 	int rc;
 
+	if (notests) {
+		printk_once(KERN_INFO "alg: self-tests disabled\n");
+		return 0;
+	}
+
 	alg_test_descs_check_order();
 
 	if ((type & CRYPTO_ALG_TYPE_MASK) == CRYPTO_ALG_TYPE_CIPHER) {
-- 
2.7.4
[PATCH v2] crypto: Add a flag allowing the self-tests to be disabled at runtime.
v1 -> v2:

 - Use printk_once.  Because the serial console is so slow, printing
   the message multiple times consumed about 6ms extra later on in the
   boot.  Printing it only once is both neater and avoids this extra
   overhead.

Rich.

- - -

I'm trying to reduce the time taken in the kernel in initcalls, with
my aim being to reduce the current ~700ms spent in initcalls before
userspace, down to something like 100ms.  All times on my Broadwell-U
laptop, under virtualization.  The purpose of this is to be able to
launch VMs around containers with minimal overhead, like Intel Clear
Containers, but using standard distro kernels and qemu.

Currently the kernel spends 28ms (on my laptop) running crypto
algorithm self-tests.  Although it's possible to disable these at
compile time, Fedora kernel maintainers want to maintain a single
kernel image for all uses.  So this commit adds a runtime flag which
callers can set to skip the self-tests in the fast
container/virtualization case.
[PATCH] crypto: Add a flag allowing the self-tests to be disabled at runtime.
Running self-tests for a short-lived KVM VM takes 28ms on my laptop.
This commit adds a flag 'cryptomgr.notests' which allows them to be
disabled.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 Documentation/kernel-parameters.txt | 3 +++
 crypto/testmgr.c                    | 9 +++++++++
 2 files changed, 12 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 0b3de80..d4d5fb7 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -826,6 +826,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			It will be ignored when crashkernel=X,high is not used
 			or memory reserved is below 4G.
 
+	cryptomgr.notests
+			[KNL] Disable crypto self-tests
+
 	cs89x0_dma=	[HW,NET]
 			Format:

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index b86883a..dc613f2 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -35,6 +35,10 @@
 
 #include "internal.h"
 
+static bool notests;
+module_param(notests, bool, 0644);
+MODULE_PARM_DESC(notests, "disable crypto self-tests");
+
 #ifdef CONFIG_CRYPTO_MANAGER_DISABLE_TESTS
 
 /* a perfect nop */
@@ -3868,6 +3872,11 @@ int alg_test(const char *driver, const char *alg, u32 type, u32 mask)
 	int j;
 	int rc;
 
+	if (notests) {
+		pr_info("alg: self-tests disabled\n");
+		return 0;
+	}
+
 	alg_test_descs_check_order();
 
 	if ((type & CRYPTO_ALG_TYPE_MASK) == CRYPTO_ALG_TYPE_CIPHER) {
-- 
2.7.4
[PATCH] crypto: Add a flag allowing the self-tests to be disabled at runtime.
I'm trying to reduce the time taken in the kernel in initcalls, with
my aim being to reduce the current ~700ms spent in initcalls before
userspace, down to something like 100ms.  All times on my Broadwell-U
laptop, under virtualization.  The purpose of this is to be able to
launch VMs around containers with minimal overhead, like Intel Clear
Containers, but using standard distro kernels and qemu.

Currently the kernel spends 28ms (on my laptop) running crypto
algorithm self-tests.  Although it's possible to disable these at
compile time, Fedora kernel maintainers want to maintain a single
kernel image for all uses.  So this commit adds a runtime flag which
callers can set to skip the self-tests in the fast
container/virtualization case.

Rich.
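Per-initcall timings of the sort being trimmed here can be measured by booting with the standard `initcall_debug` kernel parameter and summing the `usecs` figures from the resulting log.  A minimal sketch of that summing step (the two printf'd sample lines stand in for real dmesg output, so the total is illustrative):

```shell
# Sum per-initcall latencies from an initcall_debug boot log.
# Real input would come from dmesg on a kernel booted with
# initcall_debug; the printf below substitutes two representative lines.
printf '%s\n' \
    'initcall foo_init+0x0/0x10 returned 0 after 1200 usecs' \
    'initcall bar_init+0x0/0x20 returned 0 after 800 usecs' |
awk '/^initcall .* returned/ {
    for (i = 1; i <= NF; i++)
        if ($i == "after") total += $(i + 1)
}
END { printf "total initcall time: %d usecs\n", total }'
```

For a live system, replace the printf with `dmesg | grep '^initcall '`.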
Re: [PATCH] 8250: Hypervisors always export working 16550A UARTs.
On Thu, Apr 28, 2016 at 03:56:33PM -0700, Greg KH wrote:
> On Thu, Apr 28, 2016 at 11:18:33PM +0100, Richard W.M. Jones wrote:
> > Currently autoconf spends 25ms (on my laptop) testing if the UART
> > exported to it by KVM is an 8250 without FIFO and/or with strange
> > quirks, which it obviously isn't.  Assume it is exported to us by a
> > hypervisor, it's a normal, working 16550A.
> >
> > Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
> > ---
> >  drivers/tty/serial/8250/8250_port.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
> > index 00ad2637..de19924 100644
> > --- a/drivers/tty/serial/8250/8250_port.c
> > +++ b/drivers/tty/serial/8250/8250_port.c
> > @@ -1171,6 +1171,13 @@ static void autoconfig(struct uart_8250_port *up)
> >  	if (!port->iobase && !port->mapbase && !port->membase)
> >  		return;
> >
> > +	/* Hypervisors always export working 16550A devices. */
> > +	if (cpu_has_hypervisor) {
> > +		up->port.type = PORT_16550A;
> > +		up->capabilities |= UART_CAP_FIFO;
> > +		return;
> > +	}
>
> Have you audited vmware, virtualbox, and everyone else that provides a
> virtual uart device that it will work properly here?
>
> qemu isn't all the world :)

Attached below is a slightly different approach.  If the user passes a
special flag on the kernel command line then we force 16550A and avoid
the 25ms delay.  Since the user chooses the flag, any concerns about
the behaviour of the hypervisor or use of VFIO should be moot.

Rich.

>From 5694addf0e05de9c842274be8392ebce5dc24280 Mon Sep 17 00:00:00 2001
From: "Richard W.M. Jones" <rjo...@redhat.com>
Date: Thu, 28 Apr 2016 23:08:54 +0100
Subject: [PATCH] 8250: Allow user to force 16550A UART and avoid probing.

Currently autoconf spends 25ms (on my laptop) testing the UART.  Allow
the user to avoid this delay if they know that all serial ports (eg on
a virtual machine) are ordinary 16550A.

The user does this by passing '8250_base.really_16550a' on the command
line.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 drivers/tty/serial/8250/8250_port.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index 00ad2637..ac92f55 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -53,6 +53,10 @@
 #define DEBUG_AUTOCONF(fmt...)	do { } while (0)
 #endif
 
+static bool really_16550a;
+module_param(really_16550a, bool, 0644);
+MODULE_PARM_DESC(really_16550a, "Don't probe, assume 16550A");
+
 #define BOTH_EMPTY	(UART_LSR_TEMT | UART_LSR_THRE)
 
 /*
@@ -1171,6 +1175,12 @@ static void autoconfig(struct uart_8250_port *up)
 	if (!port->iobase && !port->mapbase && !port->membase)
 		return;
 
+	if (really_16550a) {
+		up->port.type = PORT_16550A;
+		up->capabilities |= UART_CAP_FIFO;
+		return;
+	}
+
 	DEBUG_AUTOCONF("ttyS%d: autoconf (0x%04lx, 0x%p): ",
 		       serial_index(port), port->iobase, port->membase);
-- 
2.7.4

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
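With a patch like the one above applied, skipping the probe would be a matter of adding `8250_base.really_16550a=1` to the kernel command line.  Whether probing happened at all shows up in the kernel log, where detected ports are reported with their type; a sketch of checking that (dmesg may be restricted inside containers, hence the fallback message):

```shell
# Look for the 8250 driver's port-detection lines in the kernel log,
# e.g. "ttyS0 ... is a 16550A".  dmesg can fail without privileges, so
# fall back to a message rather than producing no output.
dmesg 2>/dev/null | grep -E 'ttyS[0-9]+.* is a ' \
    || echo "no serial probe messages visible"
```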
Re: [PATCH] 8250: Hypervisors always export working 16550A UARTs.
On Fri, Apr 29, 2016 at 10:01:08AM +0300, Matwey V. Kornilov wrote:
> 2016-04-29 1:18 GMT+03:00 Richard W.M. Jones <rjo...@redhat.com>:
> > [This is an opinionated patch, mainly for discussion.]
> >
> > I'm trying to reduce the time taken in the kernel in initcalls, with
> > my aim being to reduce the current ~700ms spent in initcalls before
> > userspace, down to something like 100ms.  All times on my Broadwell-U
> > laptop, under virtualization.  The purpose of this is to be able to
> > launch VMs around containers with minimal overhead, like Intel Clear
> > Containers, but using standard distro kernels and qemu.
> >
> > Currently the kernel spends 25ms inspecting the UART that we passed to
> > it from qemu to find out whether it's an 8250/16550/16550A perhaps
> > with a non-working FIFO or other quirks.  Well, it isn't -- it's a
> > working emulated 16550A, with a FIFO and no quirks, and if it isn't,
> > we should fix qemu.
> >
> > So the patch detects if we're running virtualized (perhaps it should
> > only check for qemu/KVM?) and if so, shortcuts the tests.
>
> Does anybody know, whether it is possible to pass through real
> hardware serial port to a guest? It seems to be as simple as to pass
> through an interrupt and memory IO ports.

In theory it seems like something VFIO could do.  Passing something as
low performance as a serial port through would seem to make little
sense though.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
[PATCH] 8250: Hypervisors always export working 16550A UARTs.
[This is an opinionated patch, mainly for discussion.] I'm trying to reduce the time taken in the kernel in initcalls, with my aim being to reduce the current ~700ms spent in initcalls before userspace, down to something like 100ms. All times on my Broadwell-U laptop, under virtualization. The purpose of this is to be able to launch VMs around containers with minimal overhead, like Intel Clear Containers, but using standard distro kernels and qemu. Currently the kernel spends 25ms inspecting the UART that we passed to it from qemu to find out whether it's an 8250/16550/16550A perhaps with a non-working FIFO or other quirks. Well, it isn't -- it's a working emulated 16550A, with a FIFO and no quirks, and if it isn't, we should fix qemu. So the patch detects if we're running virtualized (perhaps it should only check for qemu/KVM?) and if so, shortcuts the tests. Rich.
[PATCH] 8250: Hypervisors always export working 16550A UARTs.
[This is an opinionated patch, mainly for discussion.] I'm trying to reduce the time taken in the kernel in initcalls, with my aim being to reduce the current ~700ms spent in initcalls before userspace, down to something like 100ms. All times on my Broadwell-U laptop, under virtualization. The purpose of this is to be able to launch VMs around containers with minimal overhead, like Intel Clear Containers, but using standard distro kernels and qemu. Currently the kernel spends 25ms inspecting the UART that we passed to it from qemu to find out whether it's an 8250/16550/16550A perhaps with a non-working FIFO or other quirks. Well, it isn't -- it's a working emulated 16550A, with a FIFO and no quirks, and if it isn't, we should fix qemu. So the patch detects if we're running virtualized (perhaps it should only check for qemu/KVM?) and if so, shortcuts the tests. Rich.
[PATCH] 8250: Hypervisors always export working 16550A UARTs.
Currently autoconf spends 25ms (on my laptop) testing if the UART
exported to it by KVM is an 8250 without FIFO and/or with strange
quirks, which it obviously isn't.

Assume that if it is exported to us by a hypervisor, it's a normal,
working 16550A.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 drivers/tty/serial/8250/8250_port.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index 00ad2637..de19924 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -1171,6 +1171,13 @@ static void autoconfig(struct uart_8250_port *up)
         if (!port->iobase && !port->mapbase && !port->membase)
                 return;
 
+        /* Hypervisors always export working 16550A devices. */
+        if (cpu_has_hypervisor) {
+                up->port.type = PORT_16550A;
+                up->capabilities |= UART_CAP_FIFO;
+                return;
+        }
+
         DEBUG_AUTOCONF("ttyS%d: autoconf (0x%04lx, 0x%p): ",
                        serial_index(port), port->iobase, port->membase);
-- 
2.7.4
Re: [PATCH v2 0/2] vfs: Define new syscall getumask.
On Sun, Apr 17, 2016 at 06:57:36PM -0700, Josh Triplett wrote: > O_NOUMASK seems potentially useful to support implementation of umask > entirely in userspace, which also addresses thread-safety. A program > could read its process umask out at startup, handle umask entirely in > userspace (including for threads), and only interact with the system > umask after fork and before exec. I had a look at O_NOUMASK and there are a few problems: It's relatively easy to implement for open(2). A few filesystems implement their own open so I had to go into those filesystems and modify how they handle current_umask too. And FUSE support is tricky so I passed on that. The real problem is that mkdir/mkdirat/mknod/mknodat are affected by umask, but there is no convenient flags parameter to pass the O_NOUMASK flag. So I think the patch only half-solves the problem. I have a patch which needs a bit more testing, which I can post if you think that's helpful, but I don't think it would be acceptable in its current state. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com libguestfs lets you edit virtual machines. Supports shell scripting, bindings from many languages. http://libguestfs.org
Re: [PATCH v2] procfs: expose umask in /proc/<pid>/status
On Fri, Apr 15, 2016 at 03:13:10PM +0200, Michal Hocko wrote:
> On Thu 14-04-16 12:08:15, Richard W.M. Jones wrote:
> > It's not possible to read the process umask without also modifying it,
> > which is what umask(2) does.  A library cannot read umask safely,
> > especially if the main program might be multithreaded.
> 
> It would be helpful to describe the usecase a bit better. Who needs to
> read the umask and what for (e.g. what if the umask changes right after
> it has been read by the 3rd party?).
> 
> I am not opposing the patch as is but this exports a new information to
> the userspace and we will have to maintain it for ever, so we should
> better describe the usecase.

The use case is that we have endless trouble with people setting weird
umask() values (usually on the grounds of "security"), and then
everything breaking.  I'm on the hook to fix these.  We'd like to add
debugging to our program so we can dump out the umask in debug
reports.

Previous versions of the patch used a syscall so you could only read
your own umask.  That's all I need.  However there was quite a lot of
push-back from those, so this new version exports it in /proc.  See:

https://lkml.org/lkml/2016/4/13/704 [umask2]
https://lkml.org/lkml/2016/4/13/487 [getumask]

Rich.

> > Add a new status line ("Umask") in /proc/<pid>/status.  It contains
> > the file mode creation mask (umask) in octal.  It is only shown for
> > tasks which have task->fs.
> > 
> > This patch is adapted from one originally written by Pierre Carrier.
> > 
> > Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
> > ---
> >  Documentation/filesystems/proc.txt |  1 +
> >  fs/proc/array.c                    | 20 +++++++++++++++++++-
> >  2 files changed, 20 insertions(+), 1 deletion(-)
> > 
> > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> > index 7f5607a..e8d0075 100644
> > --- a/Documentation/filesystems/proc.txt
> > +++ b/Documentation/filesystems/proc.txt
> > @@ -225,6 +225,7 @@ Table 1-2: Contents of the status files (as of 4.1)
> >   TracerPid          PID of process tracing this process (0 if not)
> >   Uid                Real, effective, saved set, and file system UIDs
> >   Gid                Real, effective, saved set, and file system GIDs
> > + Umask              file mode creation mask
> >   FDSize             number of file descriptor slots currently allocated
> >   Groups             supplementary group list
> >   NStgid             descendant namespace thread group ID hierarchy
> > diff --git a/fs/proc/array.c b/fs/proc/array.c
> > index b6c00ce..88c7de1 100644
> > --- a/fs/proc/array.c
> > +++ b/fs/proc/array.c
> > @@ -83,6 +83,7 @@
> >  #include
> >  #include
> >  #include
> > +#include <linux/fs_struct.h>
> > 
> >  #include
> >  #include
> > @@ -139,12 +140,25 @@ static inline const char *get_task_state(struct task_struct *tsk)
> >          return task_state_array[fls(state)];
> >  }
> > 
> > +static inline int get_task_umask(struct task_struct *tsk)
> > +{
> > +        struct fs_struct *fs;
> > +        int umask = -ENOENT;
> > +
> > +        task_lock(tsk);
> > +        fs = tsk->fs;
> > +        if (fs)
> > +                umask = fs->umask;
> > +        task_unlock(tsk);
> > +        return umask;
> > +}
> > +
> >  static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
> >                                struct pid *pid, struct task_struct *p)
> >  {
> >          struct user_namespace *user_ns = seq_user_ns(m);
> >          struct group_info *group_info;
> > -        int g;
> > +        int g, umask;
> >          struct task_struct *tracer;
> >          const struct cred *cred;
> >          pid_t ppid, tpid = 0, tgid, ngid;
> > @@ -162,6 +176,10 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
> >          ngid = task_numa_group_id(p);
> >          cred = get_task_cred(p);
> > 
> > +        umask = get_task_umask(p);
> > +        if (umask >= 0)
> > +                seq_printf(m, "Umask:\t%#04o\n", umask);
> > +
> >          task_lock(p);
> >          if (p->files)
> >                  max_fds = files_fdtable(p->files)->max_fds;
> > -- 
> > 2.7.4
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW
[PATCH v2] procfs: expose umask in /proc/<pid>/status
v1 -> v2:
 - Change printf format to %#04o.
 - Retest and update examples accordingly.

--

It's not possible to read the process umask without also modifying it,
which is what umask(2) does.  A library cannot read umask safely,
especially if the main program might be multithreaded.

Add a new status line ("Umask") in /proc/<pid>/status.  It contains
the file mode creation mask (umask) in octal.  It is only shown for
tasks which have task->fs.

For the library this allows me to read the umask from
/proc/self/status.

This patch is adapted from one originally written by Pierre Carrier:
https://lkml.org/lkml/2012/5/4/451

Example usage:

$ grep Umask /proc/1/status
Umask:  0022
$ grep Umask /proc/2/status
Umask:  0022
$ grep Umask /proc/self/status
Umask:  0022
$ umask 002
$ grep Umask /proc/self/status
Umask:  0002

Rich.
[PATCH v2] procfs: expose umask in /proc/<pid>/status
It's not possible to read the process umask without also modifying it,
which is what umask(2) does.  A library cannot read umask safely,
especially if the main program might be multithreaded.

Add a new status line ("Umask") in /proc/<pid>/status.  It contains
the file mode creation mask (umask) in octal.  It is only shown for
tasks which have task->fs.

This patch is adapted from one originally written by Pierre Carrier.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 Documentation/filesystems/proc.txt |  1 +
 fs/proc/array.c                    | 20 +++++++++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 7f5607a..e8d0075 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -225,6 +225,7 @@ Table 1-2: Contents of the status files (as of 4.1)
  TracerPid          PID of process tracing this process (0 if not)
  Uid                Real, effective, saved set, and file system UIDs
  Gid                Real, effective, saved set, and file system GIDs
+ Umask              file mode creation mask
  FDSize             number of file descriptor slots currently allocated
  Groups             supplementary group list
  NStgid             descendant namespace thread group ID hierarchy
diff --git a/fs/proc/array.c b/fs/proc/array.c
index b6c00ce..88c7de1 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -83,6 +83,7 @@
 #include
 #include
 #include
+#include <linux/fs_struct.h>
 
 #include
 #include
@@ -139,12 +140,25 @@ static inline const char *get_task_state(struct task_struct *tsk)
         return task_state_array[fls(state)];
 }
 
+static inline int get_task_umask(struct task_struct *tsk)
+{
+        struct fs_struct *fs;
+        int umask = -ENOENT;
+
+        task_lock(tsk);
+        fs = tsk->fs;
+        if (fs)
+                umask = fs->umask;
+        task_unlock(tsk);
+        return umask;
+}
+
 static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
                               struct pid *pid, struct task_struct *p)
 {
         struct user_namespace *user_ns = seq_user_ns(m);
         struct group_info *group_info;
-        int g;
+        int g, umask;
         struct task_struct *tracer;
         const struct cred *cred;
         pid_t ppid, tpid = 0, tgid, ngid;
@@ -162,6 +176,10 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
         ngid = task_numa_group_id(p);
         cred = get_task_cred(p);
 
+        umask = get_task_umask(p);
+        if (umask >= 0)
+                seq_printf(m, "Umask:\t%#04o\n", umask);
+
         task_lock(p);
         if (p->files)
                 max_fds = files_fdtable(p->files)->max_fds;
-- 
2.7.4
Re: [PATCH] procfs: expose umask in /proc/<pid>/status (formerly umask2, formerly getumask)
On Thu, Apr 14, 2016 at 10:34:48AM +0100, Richard W.M. Jones wrote:
> It's not possible to read the process umask without also modifying it,
> which is what umask(2) does.  A library cannot read umask safely,
> especially if the main program might be multithreaded.
>
> Add a new status line ("Umask") in /proc/<pid>/status.  It contains
> the file mode creation mask (umask) in octal.  It is only shown for
> tasks which have task->fs.
>
> For the library this allows me to read the umask from
> /proc/self/status.
>
> This patch is adapted from one originally written by Pierre Carrier:
> https://lkml.org/lkml/2012/5/4/451

Sorry, I meant to add an example of what this looks like:

$ grep Umask /proc/1/status
Umask:  022
$ grep Umask /proc/2/status
Umask:  022
$ grep Umask /proc/self/status
Umask:  022
$ umask 002
$ grep Umask /proc/self/status
Umask:  02

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
[PATCH] procfs: expose umask in /proc/<pid>/status
It's not possible to read the process umask without also modifying it,
which is what umask(2) does.  A library cannot read umask safely,
especially if the main program might be multithreaded.

Add a new status line ("Umask") in /proc/<pid>/status.  It contains
the file mode creation mask (umask) in octal.  It is only shown for
tasks which have task->fs.

This patch is adapted from one originally written by Pierre Carrier.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 Documentation/filesystems/proc.txt |  1 +
 fs/proc/array.c                    | 20 +++++++++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 7f5607a..e8d0075 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -225,6 +225,7 @@ Table 1-2: Contents of the status files (as of 4.1)
  TracerPid          PID of process tracing this process (0 if not)
  Uid                Real, effective, saved set, and file system UIDs
  Gid                Real, effective, saved set, and file system GIDs
+ Umask              file mode creation mask
  FDSize             number of file descriptor slots currently allocated
  Groups             supplementary group list
  NStgid             descendant namespace thread group ID hierarchy
diff --git a/fs/proc/array.c b/fs/proc/array.c
index b6c00ce..03e8d3f 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -83,6 +83,7 @@
 #include
 #include
 #include
+#include <linux/fs_struct.h>
 
 #include
 #include
@@ -139,12 +140,25 @@ static inline const char *get_task_state(struct task_struct *tsk)
         return task_state_array[fls(state)];
 }
 
+static inline int get_task_umask(struct task_struct *tsk)
+{
+        struct fs_struct *fs;
+        int umask = -ENOENT;
+
+        task_lock(tsk);
+        fs = tsk->fs;
+        if (fs)
+                umask = fs->umask;
+        task_unlock(tsk);
+        return umask;
+}
+
 static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
                               struct pid *pid, struct task_struct *p)
 {
         struct user_namespace *user_ns = seq_user_ns(m);
         struct group_info *group_info;
-        int g;
+        int g, umask;
         struct task_struct *tracer;
         const struct cred *cred;
         pid_t ppid, tpid = 0, tgid, ngid;
@@ -162,6 +176,10 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
         ngid = task_numa_group_id(p);
         cred = get_task_cred(p);
 
+        umask = get_task_umask(p);
+        if (umask >= 0)
+                seq_printf(m, "Umask:\t0%o\n", umask);
+
         task_lock(p);
         if (p->files)
                 max_fds = files_fdtable(p->files)->max_fds;
-- 
2.7.4
[PATCH] procfs: expose umask in /proc//status
It's not possible to read the process umask without also modifying it, which is what umask(2) does. A library cannot read umask safely, especially if the main program might be multithreaded. Add a new status line ("Umask") in /proc//status. It contains the file mode creation mask (umask) in octal. It is only shown for tasks which have task->fs. This patch is adapted from one originally written by Pierre Carrier. Signed-off-by: Richard W.M. Jones --- Documentation/filesystems/proc.txt | 1 + fs/proc/array.c| 20 +++- 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 7f5607a..e8d0075 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -225,6 +225,7 @@ Table 1-2: Contents of the status files (as of 4.1) TracerPid PID of process tracing this process (0 if not) Uid Real, effective, saved set, and file system UIDs Gid Real, effective, saved set, and file system GIDs + Umask file mode creation mask FDSize number of file descriptor slots currently allocated Groups supplementary group list NStgid descendant namespace thread group ID hierarchy diff --git a/fs/proc/array.c b/fs/proc/array.c index b6c00ce..03e8d3f 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -83,6 +83,7 @@ #include #include #include +#include #include #include @@ -139,12 +140,25 @@ static inline const char *get_task_state(struct task_struct *tsk) return task_state_array[fls(state)]; } +static inline int get_task_umask(struct task_struct *tsk) +{ + struct fs_struct *fs; + int umask = -ENOENT; + + task_lock(tsk); + fs = tsk->fs; + if (fs) + umask = fs->umask; + task_unlock(tsk); + return umask; +} + static inline void task_state(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *p) { struct user_namespace *user_ns = seq_user_ns(m); struct group_info *group_info; - int g; + int g, umask; struct task_struct *tracer; const struct cred *cred; pid_t 
ppid, tpid = 0, tgid, ngid; @@ -162,6 +176,10 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns, ngid = task_numa_group_id(p); cred = get_task_cred(p); + umask = get_task_umask(p); + if (umask >= 0) + seq_printf(m, "Umask:\t0%o\n", umask); + task_lock(p); if (p->files) max_fds = files_fdtable(p->files)->max_fds; -- 2.7.4
[PATCH] procfs: expose umask in /proc/<pid>/status (formerly umask2, formerly getumask)
It's not possible to read the process umask without also modifying it,
which is what umask(2) does.  A library cannot read umask safely,
especially if the main program might be multithreaded.

Add a new status line ("Umask") in /proc/<pid>/status.  It contains
the file mode creation mask (umask) in octal.  It is only shown for
tasks which have task->fs.

For the library this allows me to read the umask from
/proc/self/status.

This patch is adapted from one originally written by Pierre Carrier:
https://lkml.org/lkml/2012/5/4/451

Rich.
Re: [PATCH v4 0/3] vfs: Define new syscall umask2 [formerly getumask]
On Thu, Apr 14, 2016 at 09:09:38AM +1000, Stephen Rothwell wrote: > Hi Richard, > > On Wed, 13 Apr 2016 20:05:33 +0100 "Richard W.M. Jones" <rjo...@redhat.com> > wrote: > > > > It's not possible to read the process umask without also modifying it, > > which is what umask(2) does. A library cannot read umask safely, > > especially if the main program might be multithreaded. > > I was wondering if you really need to read the umask, or would just a > "ignore umask" flag to open{,at} do what you want? This would be very useful, although I think being able to read umask is also useful. --- FWIW I am currently developing a patch to add umask to /proc/PID/status. Will post it shortly once I've tested it properly. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top
Re: [PATCH v4 0/3] vfs: Define new syscall umask2 [formerly getumask]
On Wed, Apr 13, 2016 at 10:45:05PM +0200, Florian Weimer wrote:
> * H. Peter Anvin:
>
> > I have to say I'm skeptic to the need for umask2() as opposed to
> > getumask().
>
> I find the extension with a set-the-thread umask somewhat unlikely.
> How would a potential per-thread umask interact with CLONE_FS?
> Have a per-thread umask that, when active, overrides the global one,
> similar to what uselocale provides?  That seems rather messy, and
> I'm not aware of any precedent.

The flags parameter is for extensions we can't envisage now ...

... although since umask has been around since Unix V6 in 1978, we
might be waiting a long time.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
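For context, the only way to read the mask with the existing interface is the double call, which is exactly the race getumask()/umask2() is meant to close (a minimal illustration; `racy_getumask` is a hypothetical name, and this is only safe in single-threaded programs):

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Read the current umask using only umask(2).  Between the two calls
 * the process umask is temporarily 0, so a file created by another
 * thread at that instant gets no mask applied -- the race that makes
 * this approach unsafe in multithreaded programs and libraries. */
static mode_t racy_getumask(void)
{
	mode_t mask = umask(0);  /* returns the old mask, but sets it to 0 */

	umask(mask);             /* restore the original value */
	return mask;
}
```

The umask is process-global (shared across threads unless CLONE_FS is unshared), so there is no window-free way to do this from userspace today.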
umask2 man page (was: Re: [PATCH v4 0/3] vfs: Define new syscall umask2 [formerly getumask])
UMASK(2)                   Linux Programmer's Manual                  UMASK(2)

NAME
       umask, umask2 - get and set file mode creation mask

SYNOPSIS
       #include <sys/types.h>
       #include <sys/stat.h>

       mode_t umask(mode_t mask);

       #define _GNU_SOURCE
       #include
       #include
       #include

       mode_t umask2(mode_t mask, int flags);

DESCRIPTION
       umask() sets the calling process's file mode creation mask (umask)
       to mask & 0777 (i.e., only the file permission bits of mask are
       used), and returns the previous value of the mask.

       If flags is 0, then umask2() is the same as umask().  If flags is
       UMASK_GET_MASK then umask2() ignores the mask parameter and returns
       the process's current umask.  The process's current mask is not
       modified in this case.

       The umask is used by open(2), mkdir(2), and other system calls that
       create files to modify the permissions placed on newly created
       files or directories.  Specifically, permissions in the umask are
       turned off from the mode argument to open(2) and mkdir(2).

       Alternatively, if the parent directory has a default ACL (see
       acl(5)), the umask is ignored, the default ACL is inherited, the
       permission bits are set based on the inherited ACL, and permission
       bits absent in the mode argument are turned off.  For example, the
       following default ACL is equivalent to a umask of 022:

           u::rwx,g::r-x,o::r-x

       Combining the effect of this default ACL with a mode argument of
       0666 (rw-rw-rw-), the resulting file permissions would be 0644
       (rw-r--r--).

       The constants that should be used to specify mask are described
       under stat(2).

       The typical default value for the process umask is S_IWGRP |
       S_IWOTH (octal 022).  In the usual case where the mode argument to
       open(2) is specified as:

           S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH

       (octal 0666) when creating a new file, the permissions on the
       resulting file will be:

           S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH

       (because 0666 & ~022 = 0644; i.e., rw-r--r--).

RETURN VALUE
       The umask() system call always succeeds and the previous value of
       the mask is returned.

       The umask2() system call returns the process's current umask on
       success.  On error it returns -1 and sets errno appropriately.

CONFORMING TO
       SVr4, 4.3BSD, POSIX.1-2001.

NOTES
       A child process created via fork(2) inherits its parent's umask.
       The umask is left unchanged by execve(2).

       The umask setting also affects the permissions assigned to POSIX
       IPC objects (mq_open(3), sem_open(3), shm_open(3)), FIFOs
       (mkfifo(3)), and UNIX domain sockets (unix(7)) created by the
       process.  The umask does not affect the permissions assigned to
       System V IPC objects created by the process (using msgget(2),
       semget(2), shmget(2)).

SEE ALSO
       chmod(2), mkdir(2), open(2), stat(2), acl(5)

COLOPHON
       This page is part of release 4.00 of the Linux man-pages project.
       A description of the project, information about reporting bugs,
       and the latest version of this page, can be found at
       http://www.kernel.org/doc/man-pages/.

Linux                             2016-04-13                          UMASK(2)
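The mode arithmetic in DESCRIPTION can be checked directly (a trivial sketch to verify the worked example; `effective_mode` is an illustrative name, not part of any proposed interface):

```c
#include <sys/types.h>

/* Effective permissions of a newly created file: the bits set in the
 * umask are cleared from the mode argument given to open(2)/mkdir(2). */
static mode_t effective_mode(mode_t requested, mode_t mask)
{
	return requested & ~mask;
}
```

With the man page's example, `effective_mode(0666, 022)` yields 0644 (rw-r--r--), matching the text.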
Re: [PATCH v3 0/2] vfs: Define new syscall getumask.
On Wed, Apr 13, 2016 at 11:52:53AM -0700, Davidlohr Bueso wrote:
> ENOMANPAGE

Where do man pages go?  In the man-pages project?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
[PATCH v4 3/3] vfs: selftests: Add test for umask2 system call.
Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 tools/testing/selftests/Makefile          |  1 +
 tools/testing/selftests/fs/.gitignore     |  1 +
 tools/testing/selftests/fs/Makefile       |  9
 tools/testing/selftests/fs/umask2-tests.c | 77 +++
 4 files changed, 88 insertions(+)
 create mode 100644 tools/testing/selftests/fs/.gitignore
 create mode 100644 tools/testing/selftests/fs/Makefile
 create mode 100644 tools/testing/selftests/fs/umask2-tests.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index b04afc3..9e2eb24 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -4,6 +4,7 @@ TARGETS += cpu-hotplug
 TARGETS += efivarfs
 TARGETS += exec
 TARGETS += firmware
+TARGETS += fs
 TARGETS += ftrace
 TARGETS += futex
 TARGETS += ipc
diff --git a/tools/testing/selftests/fs/.gitignore b/tools/testing/selftests/fs/.gitignore
new file mode 100644
index 000..057dced
--- /dev/null
+++ b/tools/testing/selftests/fs/.gitignore
@@ -0,0 +1 @@
+umask2-tests
diff --git a/tools/testing/selftests/fs/Makefile b/tools/testing/selftests/fs/Makefile
new file mode 100644
index 000..6f231d7
--- /dev/null
+++ b/tools/testing/selftests/fs/Makefile
@@ -0,0 +1,9 @@
+BINARIES := umask2-tests
+
+all: $(BINARIES)
+
+clean:
+	$(RM) $(BINARIES)
+
+include ../lib.mk
+
diff --git a/tools/testing/selftests/fs/umask2-tests.c b/tools/testing/selftests/fs/umask2-tests.c
new file mode 100644
index 000..3e01575
--- /dev/null
+++ b/tools/testing/selftests/fs/umask2-tests.c
@@ -0,0 +1,77 @@
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#ifndef UMASK_GET_MASK
+#define UMASK_GET_MASK 1
+#endif
+
+static int umask2_(int mask, int flags)
+{
+#ifdef __NR_umask2
+	return syscall(__NR_umask2, mask, flags);
+#else
+	errno = ENOSYS;
+	return -1;
+#endif
+}
+
+int main(int argc, char **argv)
+{
+	int r;
+
+	/* umask2 available in current kernel?
+	 */
+	r = umask2_(0, UMASK_GET_MASK);
+	if (r == -1 && errno == ENOSYS) {
+		fprintf(stderr,
+			"umask2 not available in current kernel or headers, skipping test\n");
+		exit(0);
+	}
+
+	/* Check that old umask still works. */
+	r = umask(022);
+	if (r == -1) {
+		perror("umask");
+		exit(1);
+	}
+
+	/* Call umask2 to emulate old umask. */
+	r = umask2_(023, 0);
+	if (r == -1) {
+		perror("umask2");
+		exit(1);
+	}
+	if (r != 022) {
+		fprintf(stderr, "umask2: expected %o, got %o\n", 022, r);
+		exit(1);
+	}
+
+	/* Call umask2 to read current umask without modifying it. */
+	r = umask2_(0777, UMASK_GET_MASK);
+	if (r == -1) {
+		perror("umask2");
+		exit(1);
+	}
+	if (r != 023) {
+		fprintf(stderr, "umask2: expected %o, got %o\n", 023, r);
+		exit(1);
+	}
+
+	/* Call it again to make sure we didn't modify umask. */
+	r = umask2_(0777, UMASK_GET_MASK);
+	if (r == -1) {
+		perror("umask2");
+		exit(1);
+	}
+	if (r != 023) {
+		fprintf(stderr, "umask2: expected %o, got %o\n", 023, r);
+		exit(1);
+	}
+
+	exit(0);
+}
-- 
2.7.4