Re: INFO: task hung in nbd_ioctl
On Thu, Oct 17, 2019 at 10:47:59AM -0500, Mike Christie wrote:
> On 10/17/2019 09:03 AM, Richard W.M. Jones wrote:
> > On Tue, Oct 01, 2019 at 04:19:25PM -0500, Mike Christie wrote:
> >> Hey Josef and nbd list,
> >>
> >> I had a question about if there are any socket family restrictions for nbd?
> >
> > In normal circumstances, in userspace, the NBD protocol would only be
> > used over AF_UNIX or AF_INET/AF_INET6.
> >
> > There's a bit of confusion because netlink is used by nbd-client to
> > configure the NBD device, setting things like block size and timeouts
> > (instead of ioctl which is deprecated).  I think you don't mean this
> > use of netlink?
>
> I didn't.  It looks like it is just a bad test.
>
> For the automated test in this thread, the test created an AF_NETLINK
> socket and passed it into the NBD_SET_SOCK ioctl.  That is what got used
> for the NBD_DO_IT ioctl.
>
> I was not sure if the test creator picked any old socket and it just
> happened to pick one nbd never supported, or it was trying to simulate
> sockets that did not support the shutdown method.
>
> I attached the automated test that got run (test.c).

I'd say it sounds like a bad test, but I'm not familiar with syzkaller
nor how / from where it generates these tests.  Did someone report a
bug and then syzkaller wrote this test?

Rich.

> >> The bug here is that some socket families do not support the
> >> sock->ops->shutdown callout, and when nbd calls kernel_sock_shutdown
> >> their callout returns -EOPNOTSUPP.  That then leaves recv_work stuck in
> >> nbd_read_stat -> sock_xmit -> sock_recvmsg.  My patch added a
> >> flush_workqueue call, so for socket families like AF_NETLINK in this bug
> >> we hang like we see below.
> >>
> >> I can just remove the flush_workqueue call in that code path since it's
> >> not needed there, but it leaves the original bug my patch was hitting
> >> where we leave the recv_work running, which can then result in leaked
> >> resources or possible use-after-free crashes, and you still get the hang
> >> if you remove the module.
> >>
> >> It looks like we have used kernel_sock_shutdown for a while, so I thought
> >> we might never have supported sockets that did not support the callout.
> >> Is that correct?  If so then I can just add a check for this in
> >> nbd_add_socket and fix that bug too.
> >
> > Rich.
> >
> >> On 09/30/2019 05:39 PM, syzbot wrote:
> >>> Hello,
> >>>
> >>> syzbot found the following crash on:
> >>>
> >>> HEAD commit:    bb2aee77 Add linux-next specific files for 20190926
> >>> git tree:       linux-next
> >>> console output: https://syzkaller.appspot.com/x/log.txt?x=13385ca360
> >>> kernel config:  https://syzkaller.appspot.com/x/.config?x=e60af4ac5a01e964
> >>> dashboard link: https://syzkaller.appspot.com/bug?extid=24c12fa8d218ed26011a
> >>> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> >>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=12abc2a360
> >>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11712c0560
> >>>
> >>> The bug was bisected to:
> >>>
> >>> commit e9e006f5fcf2bab59149cb38a48a4817c1b538b4
> >>> Author: Mike Christie
> >>> Date:   Sun Aug 4 19:10:06 2019 +
> >>>
> >>>     nbd: fix max number of supported devs
> >>>
> >>> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1226f3c560
> >>> final crash:    https://syzkaller.appspot.com/x/report.txt?x=1126f3c560
> >>> console output: https://syzkaller.appspot.com/x/log.txt?x=1626f3c560
> >>>
> >>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> >>> Reported-by: syzbot+24c12fa8d218ed260...@syzkaller.appspotmail.com
> >>> Fixes: e9e006f5fcf2 ("nbd: fix max number of supported devs")
> >>>
> >>> INFO: task syz-executor390:8778 can't die for more than 143 seconds.
> >>> syz-executor390 D27432  8778   8777 0x4004
> >>> Call Trace:
> >>>  context_switch kernel/sched/core.c:3384 [inline]
> >>>  __schedule+0x828/0x1c20 kernel/sched/core.c:4065
> >>>  schedule+0xd9/0x260 kernel/sched/core.c:4132
> >>>  schedule_timeout+0x717/0xc50 kernel/time/timer.c:1871
> >>>  do_wait_for_common kernel/sched/completio
Re: nbd, nbdkit, loopback mounts and memory management
On Mon, Feb 18, 2019 at 12:15:14AM +0100, Pavel Machek wrote:
> But Shaun reported it happened somehow often for them, so he might
> have a practical test case... better than my theories :-).

Yes, certainly not saying this isn't a problem.  I think the good news
is the fix seems quite easy, ie. to add mlockall and adjust the OOM
killer score, as is done currently in the client:

https://github.com/NetworkBlockDevice/nbd/blob/3969c3f81a11a483f267a55ed6665d260dc9e1d2/nbd-client.c#L867-L885
https://github.com/NetworkBlockDevice/nbd/blob/3969c3f81a11a483f267a55ed6665d260dc9e1d2/nbd-client.c#L1219

For now I have added a note in the TODO file to follow up in case we
get a test case or reports of a problem:

https://github.com/libguestfs/nbdkit/commit/72e0afe2e280d895f68941677fafa559ddc3bb0d

Thanks, Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
Re: nbd, nbdkit, loopback mounts and memory management
So not to dispute that this could be a bug, but I couldn't cause a
deadlock.  I wonder if you can see something wrong with my method?

*** Set up ***

- kernel 5.0.0-0.rc3.git0.1.fc30.x86_64
- nbd-client 3.19-1.fc30
- nbdkit 1.11.5 (git commit ef9d1978ce28)

Baremetal machine was booted with mem=2G to artificially limit the
RAM.  The machine has 64G of swap.

# free -m
              total        used        free      shared  buff/cache   available
Mem:           1806         329        1383           0          93        1357
Swap:         65535         179       65356

*** Method ***

I started nbdkit as a 4G RAM disk:

  ./nbdkit memory size=4G

This is implemented as a sparse array with a 2 level page table, and
should allocate (using malloc) every time a new area of the disk is
written to.  Exact implementation is here:
https://github.com/libguestfs/nbdkit/tree/master/common/sparse

I started nbd-client using the -swap option which uses
mlockall(MCL_FUTURE) to lock the client into RAM:

  nbd-client -b 512 -swap localhost /dev/nbd0

I then created a filesystem on the RAM disk, mounted it, and copied a
3G file into it.  I tried this various ways, but the variation I was
eventually happy with was:

  mke2fs /dev/nbd0
  mount /dev/nbd0 /tmp/mnt
  dd if=/dev/zero of=/tmp/big bs=1M count=3000
  cp /tmp/big /tmp/mnt/big

I couldn't get any kind of deadlock or failure in this test.

(Note that if you repeat the test several times, in theory you could
delete the file and fstrim the filesystem, but when I was testing it,
to be sure I unmounted everything and killed and restarted nbdkit
between each test.)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
Re: nbd, nbdkit, loopback mounts and memory management
On Fri, Feb 15, 2019 at 10:41:26PM +, Richard W.M. Jones wrote:
> On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> > Hi!
> >
> > I watched fosdem talk about
> > nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> > of warning: I'm not sure using it read-write on localhost is safe.
> >
> > In particular, user application could create a lot of dirty data
> > quickly. If there's not enough memory for nbdkit (or nbd-client or
> > nbd-server), you might get a deadlock.
>
> Thanks for the kind words about the talk.  I've added Wouter Verhelst
> & the NBD mailing list to CC.  Although I did the talk because the
> subject is interesting, how I actually use nbdkit / NBD is to talk to
> qemu and that's where I have most experience and where we (Red Hat)
> use it in production systems.
>
> However in January I spent a lot of time exercising the NBD loop-mount
> + nbdkit case using fio in order to find contention / bottlenecks in
> our use of threads and locks.  I didn't notice any particular problems
> then, but it's possible my testing wasn't thorough enough.  Or that
> fio only creates small numbers of dirty pages (because of locality in
> its access patterns I guess?)
>
> When you say it's not safe, what could happen?  What would we observe
> if it was going wrong?

Reading more carefully I see you said we'd observe a deadlock.  I
didn't see that, but again my testing of this wouldn't have been very
thorough.  When I have some time I'll try creating / spooling huge
files into an NBD loop mount to see if I can cause a deadlock.

Thanks, Rich.

> > Also note that nbd.txt in Documentation/blockdev/ points to
> > sourceforge; it should probably point to
> > https://github.com/NetworkBlockDevice/nbd ?
>
> Wouter should be able to say what the correct link should be.
>
> Rich.
>
> -- 
> Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
> Read my programming and virtualization blog: http://rwmj.wordpress.com
> libguestfs lets you edit virtual machines.  Supports shell scripting,
> bindings from many languages.  http://libguestfs.org

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
Re: nbd, nbdkit, loopback mounts and memory management
On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> Hi!
>
> I watched fosdem talk about
> nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> of warning: I'm not sure using it read-write on localhost is safe.
>
> In particular, user application could create a lot of dirty data
> quickly. If there's not enough memory for nbdkit (or nbd-client or
> nbd-server), you might get a deadlock.

Thanks for the kind words about the talk.  I've added Wouter Verhelst
& the NBD mailing list to CC.  Although I did the talk because the
subject is interesting, how I actually use nbdkit / NBD is to talk to
qemu and that's where I have most experience and where we (Red Hat)
use it in production systems.

However in January I spent a lot of time exercising the NBD loop-mount
+ nbdkit case using fio in order to find contention / bottlenecks in
our use of threads and locks.  I didn't notice any particular problems
then, but it's possible my testing wasn't thorough enough.  Or that
fio only creates small numbers of dirty pages (because of locality in
its access patterns I guess?)

When you say it's not safe, what could happen?  What would we observe
if it was going wrong?

> Also note that nbd.txt in Documentation/blockdev/ points to
> sourceforge; it should probably point to
> https://github.com/NetworkBlockDevice/nbd ?

Wouter should be able to say what the correct link should be.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
Re: [PATCH v2] libata: Drop SanDisk SD7UB3Q*G1001 NOLPM quirk
On Thu, May 31, 2018 at 01:21:07PM +0200, Hans de Goede wrote:
> Commit 184add2ca23c ("libata: Apply NOLPM quirk for SanDisk
> SD7UB3Q*G1001 SSDs") disabled LPM for SanDisk SD7UB3Q*G1001 SSDs.
>
> This has led to several reports from users of that SSD where LPM
> was working fine and who now have a significantly increased idle
> power consumption on their laptops.
>
> Likely there is another problem on the T450s from the original
> reporter which gets exposed by the uncore reaching deeper sleep
> states (higher PC-states) due to LPM being enabled.  The problem as
> reported, a hard freeze about once a day, already did not sound like
> it would be caused by LPM, and the reports of the SSD working fine
> confirm this.  The original reporter is ok with dropping the quirk.
>
> An X250 user has reported the same hard freeze problem, and for him
> the problem went away after unrelated updates; I suspect some GPU
> driver stack changes fixed things.
>
> TL;DR: The original reporter's problems were triggered by LPM but were
> not an LPM issue, so drop the quirk for the SSD in question.

As the reporter of the original issue, I agree with Hans's analysis
above, so ACK from me.

Rich.

> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1583207
> Cc: sta...@vger.kernel.org
> Cc: Richard W.M. Jones
> Cc: Lorenzo Dalrio
> Reported-by: Lorenzo Dalrio
> Signed-off-by: Hans de Goede
> ---
> Changes in v2:
> -Rebase on 4.17-rc7
> ---
>  drivers/ata/libata-core.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index 346b163f6e89..9bfd2f7e4542 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -4557,9 +4557,6 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
>  	{ "SAMSUNG MZMPC128HBFU-000MV", "CXM14M1Q", ATA_HORKAGE_NOLPM, },
>  	{ "SAMSUNG SSD PM830 mSATA *",  "CXM13D1Q", ATA_HORKAGE_NOLPM, },
>  
> -	/* Sandisk devices which are known to not handle LPM well */
> -	{ "SanDisk SD7UB3Q*G1001",	NULL,		ATA_HORKAGE_NOLPM, },
> -
>  	/* devices that don't properly handle queued TRIM commands */
>  	{ "Micron_M500IT_*",		"MU01",	ATA_HORKAGE_NO_NCQ_TRIM |
>  						ATA_HORKAGE_ZERO_AFTER_TRIM, },
> -- 
> 2.17.0

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
Re: [PATCH 1/2] virtio: Reduce BUG if total_sg > virtqueue size to WARN.
On Fri, Aug 11, 2017 at 12:41:47AM +0300, Michael S. Tsirkin wrote:
> On Thu, Aug 10, 2017 at 10:35:11PM +0100, Richard W.M. Jones wrote:
> > On Fri, Aug 11, 2017 at 12:31:44AM +0300, Michael S. Tsirkin wrote:
> > > Then we probably should fail probe if vq size is too small.
> >
> > What does this mean?
> >
> > Rich.
>
> We must prevent driver from submitting s/g lists > vq size to device.

Fair point.  I would have thought (not knowing anything about how this
works in the kernel) that the driver should be able to split up
scatter-gather lists if they are larger than what the driver supports?

Anyway the commit message is derived from what Paolo told me on IRC,
so let's see what he says.

Rich.

> Either tell linux to avoid s/g lists that are too long, or
> simply fail request if this happens, or refuse to attach driver to device.
>
> Later option would look something like this within probe:
>
> 	for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++)
> 		if (vqs[i]->num < MAX_SG_USED_BY_LINUX)
> 			goto err;
>
> I don't know what's MAX_SG_USED_BY_LINUX though.
>
> > -- 
> > Richard Jones, Virtualization Group, Red Hat
> > http://people.redhat.com/~rjones
> > Read my programming and virtualization blog: http://rwmj.wordpress.com
> > virt-builder quickly builds VMs from scratch
> > http://libguestfs.org/virt-builder.1.html

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
Re: [PATCH 1/2] virtio: Reduce BUG if total_sg > virtqueue size to WARN.
On Fri, Aug 11, 2017 at 12:31:44AM +0300, Michael S. Tsirkin wrote:
> Then we probably should fail probe if vq size is too small.

What does this mean?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
Re: [PATCH 1/2] virtio: Reduce BUG if total_sg > virtqueue size to WARN.
On Fri, Aug 11, 2017 at 12:21:16AM +0300, Michael S. Tsirkin wrote:
> On Thu, Aug 10, 2017 at 05:40:34PM +0100, Richard W.M. Jones wrote:
> > If using indirect descriptors, you can make the total_sg as large as
> > you want.
>
> That would be a spec violation though, even if it happens to
> work on current QEMU.
>
> The spec says:
> A driver MUST NOT create a descriptor chain longer than the Queue Size
> of the device.
>
> What prompted this patch?
> Do we ever encounter this situation?

This patch is needed because the following (2/2) patch will trigger
that BUG_ON if I set virtqueue_size=64 or any smaller value.  The
precise backtrace is below.

Rich.

[4.029510] [ cut here ]
[4.030127] kernel BUG at drivers/virtio/virtio_ring.c:299!
[4.030834] invalid opcode: [#1] SMP
[4.031340] Modules linked in: libcrc32c crc8 crc7 crc4 crc_itu_t virtio_pci virtio_mmio virtio_input virtio_balloon virtio_scsi nd_pmem nd_btt virtio_net virtio_crypto crypto_engine virtio_console virtio_rng virtio_blk virtio_ring virtio nfit crc32_generic crct10dif_pclmul crc32c_intel crc32_pclmul
[4.034606] CPU: 0 PID: 1 Comm: init Not tainted 4.13.0-rc4+ #100
[4.035354] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[4.036770] task: 9a859e243300 task.stack: a731c00cc000
[4.037506] RIP: 0010:virtqueue_add_sgs+0x23d/0x460 [virtio_ring]
[4.038250] RSP: :a731c00cf6e0 EFLAGS: 00010097
[4.038898] RAX: RBX: 0011 RCX: dd0680646c40
[4.039762] RDX: RSI: a731c00cf7b8 RDI: 9a85945c4d68
[4.040634] RBP: a731c00cf748 R08: 9a85945c4a78 R09: 01080020
[4.041508] R10: a731c00cf788 R11: 9a859b3d3120 R12: a731c00cf7d0
[4.042382] R13: a731c00cf7d0 R14: 0001 R15: 9a859b3f8200
[4.043248] FS: () GS:9a859e60() knlGS:
[4.044232] CS: 0010 DS: ES: CR0: 80050033
[4.044942] CR2: 7fcff02e931c CR3: 1d23b000 CR4: 003406f0
[4.045815] DR0: DR1: DR2:
[4.046684] DR3: DR6: fffe0ff0 DR7: 0400
[4.047559] Call Trace:
[4.047876]  virtscsi_add_cmd+0x1c9/0x280 [virtio_scsi]
[4.048528]  virtscsi_kick_cmd+0x38/0x90 [virtio_scsi]
[4.049161]  virtscsi_queuecommand+0x104/0x280 [virtio_scsi]
[4.049875]  virtscsi_queuecommand_single+0x38/0x40 [virtio_scsi]
[4.050628]  scsi_dispatch_cmd+0xf9/0x390
[4.051128]  scsi_queue_rq+0x5e5/0x6f0
[4.051602]  blk_mq_dispatch_rq_list+0x1ff/0x3c0
[4.052175]  blk_mq_sched_dispatch_requests+0x199/0x1f0
[4.052824]  __blk_mq_run_hw_queue+0x11d/0x1b0
[4.053382]  __blk_mq_delay_run_hw_queue+0x8d/0xa0
[4.053972]  blk_mq_run_hw_queue+0x14/0x20
[4.054485]  blk_mq_sched_insert_requests+0x96/0x120
[4.055095]  blk_mq_flush_plug_list+0x19d/0x410
[4.055661]  blk_flush_plug_list+0xf9/0x2b0
[4.056182]  blk_finish_plug+0x2c/0x40
[4.056655]  __do_page_cache_readahead+0x32e/0x400
[4.057261]  filemap_fault+0x2fb/0x890
[4.057731]  ? filemap_fault+0x2fb/0x890
[4.058220]  ? find_held_lock+0x3c/0xb0
[4.058714]  ext4_filemap_fault+0x34/0x50
[4.059212]  __do_fault+0x1e/0x110
[4.059644]  __handle_mm_fault+0x6b2/0x1080
[4.060167]  handle_mm_fault+0x178/0x350
[4.060662]  __do_page_fault+0x26e/0x510
[4.061152]  trace_do_page_fault+0x9d/0x290
[4.061677]  do_async_page_fault+0x51/0xa0
[4.062189]  async_page_fault+0x28/0x30
[4.062667] RIP: 0033:0x7fcff030a24f
[4.063113] RSP: 002b:7ffefc2ad078 EFLAGS: 00010206
[4.063760] RAX: 7fcff02e931c RBX: 7fcff050f660 RCX: 7fcff02e935c
[4.064648] RDX: 0664 RSI: RDI: 7fcff02e931c
[4.065519] RBP: 7ffefc2ad320 R08: 7fcff02e931c R09: 00027000
[4.066392] R10: 7fcff02e9980 R11: 0206 R12: 7ffefc2ad0b0
[4.067263] R13: 7ffefc2ad408 R14: 0002 R15: 87f0
[4.068135] Code: af 01 c7 45 c8 01 00 00 00 45 31 ed e9 9b fe ff ff 31 db 48 83 7d d0 00 0f 85 3f fe ff ff 0f 0b 48 8b 7d b8 e8 e5 fd 22 de eb 8b <0f> 0b 0f 0b 44 89 6d a8 45 89 f5 48 8b 45 a0 44 8d 70 01 48 83
[4.070506] RIP: virtqueue_add_sgs+0x23d/0x460 [virtio_ring] RSP: a731c00cf6e0
[4.071434] ---[ end trace 02532659840e2a64 ]---

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
[PATCH v2 1/2] virtio: Reduce BUG if total_sg > virtqueue size to WARN.
If using indirect descriptors, you can make the total_sg as large as
you want.  If not, BUG is too serious because the function later
returns -ENOSPC.

Thanks Paolo Bonzini, Christoph Hellwig.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
Reviewed-by: Paolo Bonzini <pbonz...@redhat.com>
---
 drivers/virtio/virtio_ring.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5e1b548828e6..27cbc1eab868 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -296,7 +296,6 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	}
 #endif
 
-	BUG_ON(total_sg > vq->vring.num);
 	BUG_ON(total_sg == 0);
 
 	head = vq->free_head;
@@ -305,8 +304,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	 * buffers, then go indirect. FIXME: tune this threshold */
 	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
 		desc = alloc_indirect(_vq, total_sg, gfp);
-	else
+	else {
 		desc = NULL;
+		WARN_ON_ONCE(total_sg > vq->vring.num && !vq->indirect);
+	}
 
 	if (desc) {
 		/* Use a single buffer which doesn't continue */
-- 
2.13.1
[PATCH v2 2/2] virtio: virtio_scsi: Set can_queue to the length of the virtqueue.
Since switching to blk-mq as the default in commit 5c279bd9e406
("scsi: default to scsi-mq"), virtio-scsi LUNs consume about 10x as
much kernel memory.

qemu currently allocates a fixed 128 entry virtqueue.  can_queue
currently is set to 1024.  But with indirect descriptors, each command
in the queue takes 1 virtqueue entry, so the number of commands which
can be queued is equal to the length of the virtqueue.

Note I intend to send a patch to qemu to allow the virtqueue size to
be configured from the qemu command line.

Thanks Paolo Bonzini, Christoph Hellwig.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 drivers/scsi/virtio_scsi.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 9be211d68b15..7c28e8d4955a 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -818,7 +818,6 @@ static struct scsi_host_template virtscsi_host_template_single = {
 	.eh_timed_out = virtscsi_eh_timed_out,
 	.slave_alloc = virtscsi_device_alloc,
-	.can_queue = 1024,
 	.dma_boundary = UINT_MAX,
 	.use_clustering = ENABLE_CLUSTERING,
 	.target_alloc = virtscsi_target_alloc,
@@ -839,7 +838,6 @@ static struct scsi_host_template virtscsi_host_template_multi = {
 	.eh_timed_out = virtscsi_eh_timed_out,
 	.slave_alloc = virtscsi_device_alloc,
-	.can_queue = 1024,
 	.dma_boundary = UINT_MAX,
 	.use_clustering = ENABLE_CLUSTERING,
 	.target_alloc = virtscsi_target_alloc,
@@ -972,6 +970,8 @@ static int virtscsi_probe(struct virtio_device *vdev)
 	if (err)
 		goto virtscsi_init_failed;
 
+	shost->can_queue = virtqueue_get_vring_size(vscsi->req_vqs[0].vq);
+
 	cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
 	shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
 	shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0x;
-- 
2.13.1
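For context on the "patch to qemu to allow the virtqueue size to be
configured" mentioned in the commit message above: with this kernel
patch applied, shrinking the device-side virtqueue directly shrinks
can_queue in the guest.  A hedged sketch of what such an invocation
could look like — the `virtqueue_size` property name is an assumption
here (the qemu patch had not been merged at the time of this thread),
so check your qemu version before relying on it:

```shell
# Hypothetical: cap the virtio-scsi request virtqueue at 64 entries,
# so the guest driver above would set shost->can_queue = 64.
# The virtqueue_size property may not exist in your qemu build.
qemu-system-x86_64 \
  -device virtio-scsi-pci,id=scsi0,virtqueue_size=64 \
  -drive if=none,id=hd0,file=disk.img,format=raw \
  -device scsi-hd,drive=hd0,bus=scsi0.0
```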
[PATCH v2 0/2] virtio_scsi: Set can_queue based on size of virtqueue.
v1 was here: https://lkml.org/lkml/2017/8/10/689 v1 -> v2: Remove .can_queue field from the templates. Rich.
[PATCH 2/2] virtio: virtio_scsi: Set can_queue to the length of the virtqueue.
Since switching to blk-mq as the default in commit 5c279bd9e406 ("scsi: default to scsi-mq"), virtio-scsi LUNs consume about 10x as much kernel memory. qemu currently allocates a fixed 128-entry virtqueue. can_queue currently is set to 1024. But with indirect descriptors, each command in the queue takes 1 virtqueue entry, so the number of commands which can be queued is equal to the length of the virtqueue.

Note I intend to send a patch to qemu to allow the virtqueue size to be configured from the qemu command line.

Thanks Paolo Bonzini, Christoph Hellwig.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 drivers/scsi/virtio_scsi.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 9be211d68b15..d6b4ff634c0d 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -972,6 +972,8 @@ static int virtscsi_probe(struct virtio_device *vdev)
 	if (err)
 		goto virtscsi_init_failed;
 
+	shost->can_queue = virtqueue_get_vring_size(vscsi->req_vqs[0].vq);
+
 	cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
 	shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
 	shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0xFFFF;
-- 
2.13.1
[PATCH 0/2] virtio_scsi: Set can_queue based on size of virtqueue.
Earlier discussion: https://lkml.org/lkml/2017/8/4/601 "Increased memory usage with scsi-mq" Downstream bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1478201
[PATCH 1/2] virtio: Reduce BUG if total_sg > virtqueue size to WARN.
If using indirect descriptors, you can make the total_sg as large as you want. If not, BUG is too serious because the function later returns -ENOSPC. Thanks Paolo Bonzini, Christoph Hellwig.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 drivers/virtio/virtio_ring.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5e1b548828e6..27cbc1eab868 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -296,7 +296,6 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	}
 #endif
 
-	BUG_ON(total_sg > vq->vring.num);
 	BUG_ON(total_sg == 0);
 
 	head = vq->free_head;
@@ -305,8 +304,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	 * buffers, then go indirect. FIXME: tune this threshold */
 	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
 		desc = alloc_indirect(_vq, total_sg, gfp);
-	else
+	else {
 		desc = NULL;
+		WARN_ON_ONCE(total_sg > vq->vring.num && !vq->indirect);
+	}
 
 	if (desc) {
 		/* Use a single buffer which doesn't continue */
-- 
2.13.1
Re: Increased memory usage with scsi-mq
OK this is looking a bit better now.

  With scsi-mq enabled:    175 disks
    virtqueue_size=64:     318 disks *
    virtqueue_size=16:     775 disks *
  With scsi-mq disabled:  1755 disks

  * = new results

I also ran the whole libguestfs test suite with virtqueue_size=16 (with no failures shown). As this tests many different disk I/O operations, it gives me some confidence that things generally work.

Do you have any other comments about the patches? I'm not sure I know enough to write an intelligent commit message for the kernel patch.

Rich.

--- kernel patch ---

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 9be211d68b15..d6b4ff634c0d 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -972,6 +972,8 @@ static int virtscsi_probe(struct virtio_device *vdev)
 	if (err)
 		goto virtscsi_init_failed;
 
+	shost->can_queue = virtqueue_get_vring_size(vscsi->req_vqs[0].vq);
+
 	cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
 	shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
 	shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0xFFFF;
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5e1b548828e6..2d7509da9f39 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -296,7 +296,6 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	}
 #endif
 
-	BUG_ON(total_sg > vq->vring.num);
 	BUG_ON(total_sg == 0);
 
 	head = vq->free_head;
@@ -305,8 +304,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	 * buffers, then go indirect. FIXME: tune this threshold */
 	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
 		desc = alloc_indirect(_vq, total_sg, gfp);
-	else
+	else {
 		desc = NULL;
+		WARN_ON_ONCE(total_sg > vq->vring.num && !vq->indirect);
+	}

--- qemu patch ---

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index eb639442d1..aadd99aad1 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -867,10 +867,10 @@ void virtio_scsi_common_realize(DeviceState *dev,
     s->sense_size = VIRTIO_SCSI_SENSE_DEFAULT_SIZE;
     s->cdb_size = VIRTIO_SCSI_CDB_DEFAULT_SIZE;
 
-    s->ctrl_vq = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, ctrl);
-    s->event_vq = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, evt);
+    s->ctrl_vq = virtio_add_queue(vdev, s->conf.virtqueue_size, ctrl);
+    s->event_vq = virtio_add_queue(vdev, s->conf.virtqueue_size, evt);
     for (i = 0; i < s->conf.num_queues; i++) {
-        s->cmd_vqs[i] = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, cmd);
+        s->cmd_vqs[i] = virtio_add_queue(vdev, s->conf.virtqueue_size, cmd);
     }
 }
 
@@ -917,6 +917,7 @@ static void virtio_scsi_device_unrealize(DeviceState *dev, Error **errp)
 static Property virtio_scsi_properties[] = {
     DEFINE_PROP_UINT32("num_queues", VirtIOSCSI, parent_obj.conf.num_queues, 1),
+    DEFINE_PROP_UINT32("virtqueue_size", VirtIOSCSI, parent_obj.conf.virtqueue_size, 128),
     DEFINE_PROP_UINT32("max_sectors", VirtIOSCSI, parent_obj.conf.max_sectors, 0xFFFF),
     DEFINE_PROP_UINT32("cmd_per_lun", VirtIOSCSI, parent_obj.conf.cmd_per_lun,
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index de6ae5a9f6..e30a92d3e7 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -48,6 +48,7 @@ typedef struct virtio_scsi_config VirtIOSCSIConfig;
 struct VirtIOSCSIConf {
     uint32_t num_queues;
+    uint32_t virtqueue_size;
     uint32_t max_sectors;
     uint32_t cmd_per_lun;
 #ifdef CONFIG_VHOST_SCSI

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
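With the qemu patch above applied, the tested configurations can be reproduced from the qemu command line. A hedged sketch: the `virtqueue_size` property name and its default of 128 come from the patch itself (which is not upstream), while the disk image path and device IDs here are placeholders.

```shell
# Illustrative invocation only -- requires the (not yet upstream) qemu
# patch adding the virtqueue_size property; disk.img is a placeholder.
qemu-system-x86_64 \
    -device virtio-scsi-pci,id=scsi0,virtqueue_size=16 \
    -drive if=none,id=d0,file=disk.img,format=raw \
    -device scsi-hd,drive=d0,bus=scsi0.0
```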
Re: Increased memory usage with scsi-mq
OK this is looking a bit better now. With scsi-mq enabled: 175 disks virtqueue_size=64: 318 disks * virtqueue_size=16: 775 disks * With scsi-mq disabled: 1755 disks * = new results I also ran the whole libguestfs test suite with virtqueue_size=16 (with no failures shown). As this tests many different disk I/O operations, it gives me some confidence that things generally work. Do you have any other comments about the patches? I'm not sure I know enough to write an intelligent commit message for the kernel patch. Rich. --- kernel patch --- diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c index 9be211d68b15..d6b4ff634c0d 100644 --- a/drivers/scsi/virtio_scsi.c +++ b/drivers/scsi/virtio_scsi.c @@ -972,6 +972,8 @@ static int virtscsi_probe(struct virtio_device *vdev) if (err) goto virtscsi_init_failed; + shost->can_queue = virtqueue_get_vring_size(vscsi->req_vqs[0].vq); + cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1; shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue); shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0x; diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 5e1b548828e6..2d7509da9f39 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -296,7 +296,6 @@ static inline int virtqueue_add(struct virtqueue *_vq, } #endif - BUG_ON(total_sg > vq->vring.num); BUG_ON(total_sg == 0); head = vq->free_head; @@ -305,8 +304,10 @@ static inline int virtqueue_add(struct virtqueue *_vq, * buffers, then go indirect. 
FIXME: tune this threshold */ if (vq->indirect && total_sg > 1 && vq->vq.num_free) desc = alloc_indirect(_vq, total_sg, gfp); - else + else { desc = NULL; + WARN_ON_ONCE(total_sg > vq->vring.num && !vq->indirect); +} if (desc) { /* Use a single buffer which doesn't continue */ --- qemu patch --- diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c index eb639442d1..aadd99aad1 100644 --- a/hw/scsi/virtio-scsi.c +++ b/hw/scsi/virtio-scsi.c @@ -867,10 +867,10 @@ void virtio_scsi_common_realize(DeviceState *dev, s->sense_size = VIRTIO_SCSI_SENSE_DEFAULT_SIZE; s->cdb_size = VIRTIO_SCSI_CDB_DEFAULT_SIZE; -s->ctrl_vq = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, ctrl); -s->event_vq = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, evt); +s->ctrl_vq = virtio_add_queue(vdev, s->conf.virtqueue_size, ctrl); +s->event_vq = virtio_add_queue(vdev, s->conf.virtqueue_size, evt); for (i = 0; i < s->conf.num_queues; i++) { -s->cmd_vqs[i] = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, cmd); +s->cmd_vqs[i] = virtio_add_queue(vdev, s->conf.virtqueue_size, cmd); } } @@ -917,6 +917,7 @@ static void virtio_scsi_device_unrealize(DeviceState *dev, Error **errp) static Property virtio_scsi_properties[] = { DEFINE_PROP_UINT32("num_queues", VirtIOSCSI, parent_obj.conf.num_queues, 1), +DEFINE_PROP_UINT32("virtqueue_size", VirtIOSCSI, parent_obj.conf.virtqueue_size, 128), DEFINE_PROP_UINT32("max_sectors", VirtIOSCSI, parent_obj.conf.max_sectors, 0x), DEFINE_PROP_UINT32("cmd_per_lun", VirtIOSCSI, parent_obj.conf.cmd_per_lun, diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h index de6ae5a9f6..e30a92d3e7 100644 --- a/include/hw/virtio/virtio-scsi.h +++ b/include/hw/virtio/virtio-scsi.h @@ -48,6 +48,7 @@ typedef struct virtio_scsi_config VirtIOSCSIConfig; struct VirtIOSCSIConf { uint32_t num_queues; +uint32_t virtqueue_size; uint32_t max_sectors; uint32_t cmd_per_lun; #ifdef CONFIG_VHOST_SCSI -- Richard Jones, Virtualization Group, Red Hat 
http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top
Re: Increased memory usage with scsi-mq
On Thu, Aug 10, 2017 at 02:53:58PM +0200, Paolo Bonzini wrote:
> can_queue and cmd_per_lun are different.  can_queue should be set to the
> value of vq->vring.num where vq is the command virtqueue (the first one
> is okay if there's >1).
>
> If you want to change it, you'll have to do so in QEMU.

Here are a couple more patches I came up with, the first to Linux, the second to qemu. There are a few problems ...

(1) Setting virtqueue_size to < 128 causes a BUG_ON failure in virtio_ring.c:virtqueue_add at:

      BUG_ON(total_sg > vq->vring.num);

    I initially thought that I should also set cmd_per_lun to the same value, but that doesn't help. Is there some relationship between virtqueue_size and other settings?

(2) Can/should the ctrl, event and cmd queue sizes be set to the same values? Or should ctrl & event be left at 128?

(3) It seems like it might be a problem that virtqueue_size is not passed through the virtio_scsi_conf struct (which is ABI between the kernel and qemu AFAICT and so not easily modifiable). However I think this is not a problem because the virtqueue size is stored in the Virtqueue Descriptor table, which is how the kernel gets the value. Is that right?

Rich.
--- kernel patch ---

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 9be211d68b15..d6b4ff634c0d 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -972,6 +972,8 @@ static int virtscsi_probe(struct virtio_device *vdev)
 	if (err)
 		goto virtscsi_init_failed;
 
+	shost->can_queue = virtqueue_get_vring_size(vscsi->req_vqs[0].vq);
+
 	cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
 	shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
 	shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0xFFFF;

--- qemu patch ---

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index eb639442d1..aadd99aad1 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -867,10 +867,10 @@ void virtio_scsi_common_realize(DeviceState *dev,
     s->sense_size = VIRTIO_SCSI_SENSE_DEFAULT_SIZE;
     s->cdb_size = VIRTIO_SCSI_CDB_DEFAULT_SIZE;
 
-    s->ctrl_vq = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, ctrl);
-    s->event_vq = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, evt);
+    s->ctrl_vq = virtio_add_queue(vdev, s->conf.virtqueue_size, ctrl);
+    s->event_vq = virtio_add_queue(vdev, s->conf.virtqueue_size, evt);
     for (i = 0; i < s->conf.num_queues; i++) {
-        s->cmd_vqs[i] = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, cmd);
+        s->cmd_vqs[i] = virtio_add_queue(vdev, s->conf.virtqueue_size, cmd);
     }
 }
 
@@ -917,6 +917,7 @@ static void virtio_scsi_device_unrealize(DeviceState *dev, Error **errp)
 static Property virtio_scsi_properties[] = {
     DEFINE_PROP_UINT32("num_queues", VirtIOSCSI, parent_obj.conf.num_queues, 1),
+    DEFINE_PROP_UINT32("virtqueue_size", VirtIOSCSI, parent_obj.conf.virtqueue_size, 128),
     DEFINE_PROP_UINT32("max_sectors", VirtIOSCSI, parent_obj.conf.max_sectors, 0xFFFF),
     DEFINE_PROP_UINT32("cmd_per_lun", VirtIOSCSI, parent_obj.conf.cmd_per_lun,
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index de6ae5a9f6..e30a92d3e7 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -48,6 +48,7 @@ typedef struct virtio_scsi_config VirtIOSCSIConfig;
 struct VirtIOSCSIConf {
     uint32_t num_queues;
+    uint32_t virtqueue_size;
     uint32_t max_sectors;
     uint32_t cmd_per_lun;
 #ifdef CONFIG_VHOST_SCSI

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
Re: Increased memory usage with scsi-mq
On Wed, Aug 09, 2017 at 06:50:10PM +0200, Paolo Bonzini wrote:
> On 09/08/2017 18:01, Christoph Hellwig wrote:
> > On Mon, Aug 07, 2017 at 03:07:48PM +0200, Paolo Bonzini wrote:
> >> can_queue should depend on the virtqueue size, which unfortunately can
> >> vary for each virtio-scsi device in theory.  The virtqueue size is
> >> retrieved by drivers/virtio prior to calling vring_create_virtqueue, and
> >> in QEMU it is the second argument to virtio_add_queue.
> >
> > Why is that unfortunate?  We don't even have to set can_queue in
> > the host template, we can dynamically set it on per-host basis.
>
> Ah, cool, I thought allocations based on can_queue happened already in
> scsi_host_alloc, but they happen at scsi_add_host time.

I think I've decoded all that information into the patch below. I tested it, and it appears to work: when I set cmd_per_lun on the qemu command line, I see that the guest can add more disks:

  With scsi-mq enabled:    175 disks
    cmd_per_lun not set:   177 disks *
    cmd_per_lun=16:        776 disks *
    cmd_per_lun=4:        1160 disks *
  With scsi-mq disabled:  1755 disks

  * = new result

From my point of view, this is a good result, but you should be warned that I don't fully understand what's going on here and I may have made obvious or not-so-obvious mistakes.

I tested the performance impact and it's not noticeable in the libguestfs case even with very small cmd_per_lun settings, but libguestfs is largely serial and so this result won't be applicable to guests in general.

Also, should the guest kernel validate cmd_per_lun to make sure it's not too small or large? And if so, what would the limits be?

Rich.

From e923e49846189b2f55f3f02b70a290d4be237ed5 Mon Sep 17 00:00:00 2001
From: "Richard W.M. Jones" <rjo...@redhat.com>
Date: Thu, 10 Aug 2017 12:21:47 +0100
Subject: [PATCH] scsi: virtio_scsi: Set can_queue based on cmd_per_lun passed
 by hypervisor.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 drivers/scsi/virtio_scsi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 9be211d68b15..b22591e9b16b 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -973,7 +973,7 @@ static int virtscsi_probe(struct virtio_device *vdev)
 		goto virtscsi_init_failed;
 
 	cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
-	shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
+	shost->cmd_per_lun = shost->can_queue = cmd_per_lun;
 	shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0xFFFF;
 
 	/* LUNs > 256 are reported with format 1, so they go in the range
-- 
2.13.1

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW
Re: Increased memory usage with scsi-mq
On Mon, Aug 07, 2017 at 02:11:39PM +0200, Paolo Bonzini wrote:
> You could also add a module parameter to the driver, and set it to 64 on
> the kernel command line (there is an example in
> drivers/scsi/vmw_pvscsi.c of how to do it).

[Proviso: I've not tested the performance of different values, nor do I have any particular knowledge in this area]

can_queue is documented as:

 * This determines if we will use a non-interrupt driven
 * or an interrupt driven scheme.  It is set to the maximum number
 * of simultaneous commands a given host adapter will accept.

Wouldn't it be better to make it default to k * number of CPUs for some small integer k? I looked at the other scsi drivers and they set it to all kinds of values: 1, small integers, large integers, configurable values.

Also I noticed this code in virtio_scsi.c:

	cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
	shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);

but setting cmd_per_lun (as a qemu virtio-scsi-pci parameter) didn't seem to make any difference to memory usage.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
Re: Increased memory usage with scsi-mq
On Sat, Aug 05, 2017 at 03:39:54PM +0200, Christoph Hellwig wrote: > For now can you apply this testing patch to the guest kernel? > > diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c > index 9be211d68b15..0cbe2c882e1c 100644 > --- a/drivers/scsi/virtio_scsi.c > +++ b/drivers/scsi/virtio_scsi.c > @@ -818,7 +818,7 @@ static struct scsi_host_template > virtscsi_host_template_single = { > .eh_timed_out = virtscsi_eh_timed_out, > .slave_alloc = virtscsi_device_alloc, > > - .can_queue = 1024, > + .can_queue = 64, > .dma_boundary = UINT_MAX, > .use_clustering = ENABLE_CLUSTERING, > .target_alloc = virtscsi_target_alloc, > @@ -839,7 +839,7 @@ static struct scsi_host_template > virtscsi_host_template_multi = { > .eh_timed_out = virtscsi_eh_timed_out, > .slave_alloc = virtscsi_device_alloc, > > - .can_queue = 1024, > + .can_queue = 64, > .dma_boundary = UINT_MAX, > .use_clustering = ENABLE_CLUSTERING, > .target_alloc = virtscsi_target_alloc, > @@ -983,7 +983,7 @@ static int virtscsi_probe(struct virtio_device *vdev) > shost->max_id = num_targets; > shost->max_channel = 0; > shost->max_cmd_len = VIRTIO_SCSI_CDB_SIZE; > - shost->nr_hw_queues = num_queues; > + shost->nr_hw_queues = 1; > > #ifdef CONFIG_BLK_DEV_INTEGRITY > if (virtio_has_feature(vdev, VIRTIO_SCSI_F_T10_PI)) { Yes, that's an improvement, although it's still a little way off the density possible the old way: With scsi-mq enabled: 175 disks * With this patch:319 disks * With scsi-mq disabled: 1755 disks Also only the first two hunks are necessary. The kernel behaves exactly the same way with or without the third hunk (ie. num_queues must already be 1). Can I infer from this that qemu needs a way to specify the can_queue setting to the virtio-scsi driver in the guest kernel? Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com libguestfs lets you edit virtual machines. 
Supports shell scripting, bindings from many languages. http://libguestfs.org
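The scale of the saving from lowering can_queue can be sketched with back-of-envelope shell arithmetic. The per-request size and the simple "can_queue requests preallocated per host" model below are illustrative assumptions for the estimate, not exact kernel figures:

```shell
# Rough estimate of blk-mq request preallocation across many virtio-scsi disks.
# Assumed ~384 bytes per preallocated request (illustrative only).
req_bytes=384
disks=256
old_kib=$(( 1024 * req_bytes * disks / 1024 ))   # can_queue = 1024
new_kib=$(( 64 * req_bytes * disks / 1024 ))     # can_queue = 64
echo "can_queue=1024: ${old_kib} KiB across ${disks} disks"
echo "can_queue=64:   ${new_kib} KiB across ${disks} disks"
```

Under these assumptions the preallocation drops from roughly 96 MiB to 6 MiB for 256 disks, which is consistent in direction (though not magnitude) with the measured 175 vs. 319 disk density above.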
Re: Increased memory usage with scsi-mq
On Sat, Aug 05, 2017 at 10:44:36AM +0200, Christoph Hellwig wrote: > On Fri, Aug 04, 2017 at 10:00:47PM +0100, Richard W.M. Jones wrote: > > I read your slides about scsi-mq and it seems like a significant > > benefit to large machines, but could the out of the box defaults be > > made more friendly for small memory machines? > > The default number of queues and queue depth and thus memory usage is > set by the LLDD. > > Try to reduce the can_queue value in virtio_scsi and/or make sure > you use the single queue variant in your VM (which should be tunable > in qemu). Thanks, this is interesting. Virtio-scsi seems to have a few settable parameters that might be related to this: DEFINE_PROP_UINT32("num_queues", VirtIOSCSI, parent_obj.conf.num_queues, 1), DEFINE_PROP_UINT32("max_sectors", VirtIOSCSI, parent_obj.conf.max_sectors, 0x), DEFINE_PROP_UINT32("cmd_per_lun", VirtIOSCSI, parent_obj.conf.cmd_per_lun, 128), Unfortunately (assuming I'm setting them right - see below), none of them have any effect on the number of disks that I can add to the VM. I am testing them by placing them in the ‘-device virtio-scsi-pci’ parameter, ie. as a property of the controller, not a property of the LUN, eg: -device virtio-scsi-pci,cmd_per_lun=32,id=scsi \ -drive file=/home/rjones/d/libguestfs/tmp/libguestfshXImTv/scratch.1,cache=unsafe,format=raw,id=hd0,if=none \ -device scsi-hd,drive=hd0 \ The debugging output is a bit too large to attach to this email, but I have placed it at the link below. It contains (if you scroll down a bit) the full qemu command line and full kernel output. http://oirase.annexia.org/tmp/bz1478201-log.txt I can add some extra debugging into the kernel if you like. Just point me to the right place. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-builder quickly builds VMs from scratch http://libguestfs.org/virt-builder.1.html
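As a quick sanity check while experimenting with these qemu properties, the queue depth the guest driver actually ended up with can be read back from sysfs. The host number and path below are assumptions for a typical single-controller guest; the snippet falls back to virtio_scsi's compiled-in 1024 default when the file is absent:

```shell
# Read the effective queue depth for the first SCSI host, if present.
sysfs=/sys/class/scsi_host/host0/can_queue
if [ -r "$sysfs" ]; then
    can_queue=$(cat "$sysfs")
else
    can_queue=1024   # virtio_scsi's default, used as a fallback for illustration
fi
echo "can_queue=$can_queue"
```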
Increased memory usage with scsi-mq
https://bugzilla.redhat.com/show_bug.cgi?id=1478201 We have a libguestfs test which adds 256 virtio-scsi disks to a qemu virtual machine. The VM has 500 MB of RAM, 1 vCPU and no swap. This test has been failing for a little while. It runs out of memory during SCSI enumeration in early boot. Tonight I bisected the cause to: 5c279bd9e40624f4ab6e688671026d6005b066fa is the first bad commit commit 5c279bd9e40624f4ab6e688671026d6005b066fa Author: Christoph Hellwig Date: Fri Jun 16 10:27:55 2017 +0200 scsi: default to scsi-mq Remove the SCSI_MQ_DEFAULT config option and default to the blk-mq I/O path now that we had plenty of testing, and have I/O schedulers for blk-mq. The module option to disable the blk-mq path is kept around for now. Signed-off-by: Christoph Hellwig Signed-off-by: Martin K. Petersen :04 04 57ec7d5d2ba76592a695f533a69f747700c31966 c79f6ecb070acc4fadf6fc05ca9ba32bc9c0c665 M drivers I also wrote a small test to see the maximum number of virtio-scsi disks I could add to the above VM. The results were very surprising (to me anyhow): With scsi-mq enabled: 175 disks With scsi-mq disabled: 1755 disks I don't know why the ratio is almost exactly 10 times. I read your slides about scsi-mq and it seems like a significant benefit to large machines, but could the out of the box defaults be made more friendly for small memory machines? Thanks, Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com libguestfs lets you edit virtual machines. Supports shell scripting, bindings from many languages. http://libguestfs.org
Re: [REGRESSION] 07ec51480b5e ("virtio_pci: use shared interrupts for virtqueues") causes crashes in guest
On Thu, Mar 23, 2017 at 01:13:50PM +0800, Jason Wang wrote: > >From 312859b596e83a2164a8430343d31fce2a5ad808 Mon Sep 17 00:00:00 2001 > From: Jason Wang <jasow...@redhat.com> > Date: Thu, 23 Mar 2017 13:07:16 +0800 > Subject: [PATCH] virtio_pci: fix out of bound access for msix_names > > Signed-off-by: Jason Wang <jasow...@redhat.com> I tested this, and it does appear to fix the crashes in vp_modern_find_vqs. Therefore: Tested-by: Richard W.M. Jones <rjo...@redhat.com> Thanks, Rich. > drivers/virtio/virtio_pci_common.c | 9 + > 1 file changed, 5 insertions(+), 4 deletions(-) > > diff --git a/drivers/virtio/virtio_pci_common.c > b/drivers/virtio/virtio_pci_common.c > index df548a6..5905349 100644 > --- a/drivers/virtio/virtio_pci_common.c > +++ b/drivers/virtio/virtio_pci_common.c > @@ -147,7 +147,7 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, > unsigned nvqs, > { > struct virtio_pci_device *vp_dev = to_vp_device(vdev); > const char *name = dev_name(_dev->vdev.dev); > - int i, err = -ENOMEM, allocated_vectors, nvectors; > + int i, j, err = -ENOMEM, allocated_vectors, nvectors; > unsigned flags = PCI_IRQ_MSIX; > bool shared = false; > u16 msix_vec; > @@ -212,7 +212,7 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, > unsigned nvqs, > if (!vp_dev->msix_vector_map) > goto out_disable_config_irq; > > - allocated_vectors = 1; /* vector 0 is the config interrupt */ > + allocated_vectors = j = 1; /* vector 0 is the config interrupt */ > for (i = 0; i < nvqs; ++i) { > if (!names[i]) { > vqs[i] = NULL; > @@ -236,18 +236,19 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, > unsigned nvqs, > continue; > } > > - snprintf(vp_dev->msix_names[i + 1], > + snprintf(vp_dev->msix_names[j], >sizeof(*vp_dev->msix_names), "%s-%s", >dev_name(_dev->vdev.dev), names[i]); > err = request_irq(pci_irq_vector(vp_dev->pci_dev, msix_vec), > vring_interrupt, IRQF_SHARED, > - vp_dev->msix_names[i + 1], vqs[i]); > + vp_dev->msix_names[j], vqs[i]); > if (err) { > /* 
don't free this irq on error */ > vp_dev->msix_vector_map[i] = VIRTIO_MSI_NO_VECTOR; > goto out_remove_vqs; > } > vp_dev->msix_vector_map[i] = msix_vec; > + j++; > > /* >* Use a different vector for each queue if they are available, > -- > 2.7.4 > -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-builder quickly builds VMs from scratch http://libguestfs.org/virt-builder.1.html
Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")
On Thu, Mar 23, 2017 at 03:56:22PM +0100, Christoph Hellwig wrote: > Does the patch from Jason in the > > "[REGRESSION] 07ec51480b5e ("virtio_pci: use shared interrupts for > virtqueues") causes crashes in guest" > > thread fix the issue for you? In brief, yes it does. I followed up on that thread. Thanks, Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com Fedora Windows cross-compiler. Compile Windows programs, test, and build Windows installers. Over 100 libraries supported. http://fedoraproject.org/wiki/MinGW
Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")
On Thu, Mar 23, 2017 at 03:56:22PM +0100, Christoph Hellwig wrote: > Does the patch from Jason in the > > "[REGRESSION] 07ec51480b5e ("virtio_pci: use shared interrupts for > virtqueues") causes crashes in guest" > > thread fix the issue for you? I didn't see this thread before. I'll check that out for you now. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-p2v converts physical machines to virtual machines. Boot with a live CD or over the network (PXE) and turn machines into KVM guests. http://libguestfs.org/virt-v2v
Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")
On Thu, Mar 23, 2017 at 03:51:25PM +0100, Thorsten Leemhuis wrote: > Hi Christoph! Hi Michael! > > (Mail roughly based on text from > https://bugzilla.kernel.org/show_bug.cgi?id=194911 ) > > I'm seeing random crashes during boot every few boot attempts when > running Linux 4.11-rc/mainline in a Fedora 26 guest under a CentOS7 host > (CPU: Intel(R) Pentium(R) CPU G3220) using KVM. Sometimes when the guest > actually booted the network did not work. To get some impressions of the > crashes I got see this gallery: > https://plus.google.com/+ThorstenLeemhuis/posts/FjyyGjNtrrG > > Richard W.M. Jones and Adam Williamson see the same problems. See above > bug for details. It seems they ran into the problem in the past few > days, so I assume it's still present in mainline (I'm travelling > currently and haven't had time for proper tests since last last Friday > (pre-rc3); but I thought it's time to get the problem to the lists). > > Long story short: Richard and I did bisections and we both found that > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=07ec51480b5e > ("virtio_pci: use shared interrupts for virtqueues") is the first bad > commit. Any idea what might be wrong? Do you need more details from us > to fix this? Laura Abbott posted a kernel RPM which works for me. She has had to revert quite a number of commits, which are detailed in this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1430297#c7 Her reverting patch is also attached. Rich. 
-- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-builder quickly builds VMs from scratch http://libguestfs.org/virt-builder.1.html >From 4d3cba0be27b20516eb765c2913bce93e73fe30e Mon Sep 17 00:00:00 2001 From: Laura Abbott <labb...@redhat.com> Date: Wed, 22 Mar 2017 15:41:27 -0700 Subject: [PATCH] Revert a bunch of virtio commits 07ec51480b5e ("virtio_pci: use shared interrupts for virtqueues") is linked to a bunch of issues. Unfortunately we can't just revert it by itself. Revert it and dependency patches as well. Revert "virtio: provide a method to get the IRQ affinity mask for a virtqueue" This reverts commit bbaba479563910aaa51e59bb9027a09e396d3a3c. Revert "virtio-console: avoid DMA from stack" This reverts commit c4baad50297d84bde1a7ad45e50c73adae4a2192. Revert "vhost: introduce O(1) vq metadata cache" This reverts commit f889491380582b4ba2981cf0b0d7d6a40fb30ab7. Conflicts: drivers/vhost/vhost.c Revert "virtio_scsi: use virtio IRQ affinity" This reverts commit 0d9f0a52c8b9f7a003fe1650b7d5fb8518efabe0. Revert "virtio_blk: use virtio IRQ affinity" This reverts commit ad71473d9c43725c917fc5a86d54ceb7001ee28c. Revert "blk-mq: provide a default queue mapping for virtio device" This reverts commit 73473427bb551686e4b68ecd99bfd27e6635286a. Revert "virtio: allow drivers to request IRQ affinity when creating VQs" This reverts commit fb5e31d970ce8b4941f03ed765d7dbefc39f22d9. Revert "virtio_pci: simplify MSI-X setup" This reverts commit 52a61516125fa9a21b3bdf4f90928308e2e5573f. Revert "virtio_pci: don't duplicate the msix_enable flag in struct pci_dev" This reverts commit 53a020c661741f3b87ad3ac6fa545088aaebac9b. Revert "virtio_pci: use shared interrupts for virtqueues" This reverts commit 07ec51480b5eb1233f8c1b0f5d7a7c8d1247c507. 
--- block/Kconfig | 5 - block/Makefile | 1 - block/blk-mq-virtio.c | 54 -- drivers/block/virtio_blk.c | 14 +- drivers/char/virtio_console.c | 14 +- drivers/crypto/virtio/virtio_crypto_core.c | 2 +- drivers/gpu/drm/virtio/virtgpu_kms.c | 2 +- drivers/misc/mic/vop/vop_main.c| 2 +- drivers/net/caif/caif_virtio.c | 3 +- drivers/net/virtio_net.c | 2 +- drivers/remoteproc/remoteproc_virtio.c | 3 +- drivers/rpmsg/virtio_rpmsg_bus.c | 2 +- drivers/s390/virtio/kvm_virtio.c | 3 +- drivers/s390/virtio/virtio_ccw.c | 3 +- drivers/scsi/virtio_scsi.c | 127 +++-- drivers/vhost/vhost.c | 136 +++--- drivers/vhost/vhost.h | 8 - drivers/virtio/virtio_balloon.c| 3 +- drivers/virtio/virtio_input.c | 3 +- drivers/virtio/virtio_mmio.c | 3 +- drivers/virtio/virtio_pci_common.c | 287 +++-- drivers/virtio/virtio_pci_common.h | 25 ++- drivers/virtio/virtio_pci_legacy.c | 3 +- drivers/virtio/virtio_pci_modern.c | 11 +- include/linux/blk-mq-virtio.h | 10 - include/
Re: [PATCH RESEND] hwrng: core - don't pass stack allocated buffer to rng->read()
On Fri, Oct 21, 2016 at 02:04:27PM -0700, Andy Lutomirski wrote: > https://git.kernel.org/cgit/linux/kernel/git/herbert/cryptodev-2.6.git/commit/?id=6d4952d9d9d4dc2bb9c0255d95a09405a1e958f7 I have tested this one, and it also fixes the bug I was seeing. Thanks Laszlo as well for his fix, and sorry for not finding the patch above first. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com Fedora Windows cross-compiler. Compile Windows programs, test, and build Windows installers. Over 100 libraries supported. http://fedoraproject.org/wiki/MinGW
Re: Warning from free_init_pages with large initrd
On Tue, Sep 27, 2016 at 11:20:06AM +0100, Sitsofe Wheeler wrote: > (See http://www.gossamer-threads.com/lists/linux/kernel/2534175 for > the history of this thread ) > > On 26 September 2016 at 20:00, Randy Dunlap wrote: > > > > but the warning in free_init_pages() is about alignment, not size... > > Maybe the concatenation is bad? > > What would l have to pull apart to be able to tell? > > Having said that I've just noticed that newer versions of the script > concatenate an ISO to the initrd (rather than another cpio) - > https://github.com/rhinstaller/livecd-tools/commit/8067be50907da9461e442c11a664c89e066ccac6#diff-88c69e43bb69726c532af3a136cc50e8 > . Unfortunately after rebuilding the initrd with the > livecd-iso-to-pxeboot version from Fedora 24 the warning persists. > CC'ing a few of the tool's authors on this email. I would just comment that I wrote livecd-iso-to-pxeboot for a specific purpose -- virt-p2v -- where the size of the ISO rarely exceeded around 100-200 MB, which is large for an initramfs, but about an order of magnitude smaller than what you are trying. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top
[PATCH v3] crypto: Add a flag allowing the self-tests to be disabled at runtime.
Running self-tests for a short-lived KVM VM takes 28ms on my laptop. This commit adds a flag 'cryptomgr.notests' which allows them to be disabled. However if fips=1 as well, we ignore this flag as FIPS mode mandates that the self-tests are run. Signed-off-by: Richard W.M. Jones <rjo...@redhat.com> --- Documentation/kernel-parameters.txt | 3 +++ crypto/testmgr.c| 9 + 2 files changed, 12 insertions(+) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 0b3de80..d4d5fb7 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -826,6 +826,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted. It will be ignored when crashkernel=X,high is not used or memory reserved is below 4G. + cryptomgr.notests +[KNL] Disable crypto self-tests + cs89x0_dma= [HW,NET] Format: diff --git a/crypto/testmgr.c b/crypto/testmgr.c index b86883a..fcd89fe 100644 --- a/crypto/testmgr.c +++ b/crypto/testmgr.c @@ -35,6 +35,10 @@ #include "internal.h" +static bool notests; +module_param(notests, bool, 0644); +MODULE_PARM_DESC(notests, "disable crypto self-tests"); + #ifdef CONFIG_CRYPTO_MANAGER_DISABLE_TESTS /* a perfect nop */ @@ -3868,6 +3872,11 @@ int alg_test(const char *driver, const char *alg, u32 type, u32 mask) int j; int rc; + if (!fips_enabled && notests) { + printk_once(KERN_INFO "alg: self-tests disabled\n"); + return 0; + } + alg_test_descs_check_order(); if ((type & CRYPTO_ALG_TYPE_MASK) == CRYPTO_ALG_TYPE_CIPHER) { -- 2.7.4
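For completeness, a hypothetical guest invocation using the new parameter might look like the following. The qemu options, memory size, and file names are placeholders for illustration, not part of the patch:

```shell
# Boot a short-lived guest with crypto self-tests skipped.
# (The flag is ignored if the kernel is also booted with fips=1.)
qemu-kvm -m 512 -kernel vmlinuz -initrd initrd.img \
    -append "console=ttyS0 cryptomgr.notests" \
    -drive file=disk.img,if=virtio
```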
[PATCH v3] crypto: Add a flag allowing the self-tests to be disabled at runtime.
v2 -> v3:

 - Ignore the flag if FIPS mode is enabled.

v1 -> v2:

 - Use printk_once.  Because the serial console is so slow, printing
   the message multiple times actually consumed about 6ms extra later
   on during the boot.

- - -

I'm trying to reduce the time taken in the kernel in initcalls, with
my aim being to reduce the current ~700ms spent in initcalls before
userspace, down to something like 100ms.  All times on my Broadwell-U
laptop, under virtualization.  The purpose of this is to be able to
launch VMs around containers with minimal overhead, like Intel Clear
Containers, but using standard distro kernels and qemu.

Currently the kernel spends 28ms (on my laptop) running crypto
algorithm self-tests.  Although it's possible to disable these at
compile time, Fedora kernel maintainers want to maintain a single
kernel image for all uses.  So this commit adds a runtime flag which
callers can set to skip the self-tests in the fast
container/virtualization case.

Rich.
Re: [PATCH] 8250: Hypervisors always export working 16550A UARTs.
On Fri, Apr 29, 2016 at 08:54:13AM -0700, Greg KH wrote:
> You are trying to take a generalized kernel and somehow "know" about the
> hardware ahead of time it is going to run on.  That seems like two
> conflicting requirements, don't you agree?

We would have the 8250 serial port in any kernel.  Even if Fedora
kernel maintainers allowed us to have specialized kernels for each
purpose, I would use the simple ISA serial port here because it allows
us to capture debug messages very early in the boot.  Alternatives
like virtio-console don't allow that.

The kernel does know what hardware it's running on - via the CPUID
hypervisor leaf.  It's also possible for us to tell the kernel about
the hardware using the command line, ACPI[*], DT, etc.  I'd really
like to tell the kernel this is a 16550A, not broken, you don't need
to spend time testing that.  There is prior art here: no_timer_check &
lpj=..

Rich.

[*] Although ACPI is really slow, adding another 190ms, and for this
reason I have disabled it for now, but not investigated why it's so
slow.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
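The CPUID hypervisor bit referred to above is also visible to userspace: on x86 Linux the kernel mirrors it as the "hypervisor" flag in /proc/cpuinfo.  A quick sketch of the same check from a shell (x86 Linux only; on other architectures the flag is simply absent):

```shell
# Detect the CPUID "hypervisor present" bit from userspace via the
# "hypervisor" flag in /proc/cpuinfo.  Absent on bare metal or non-x86.
if grep -qw hypervisor /proc/cpuinfo 2>/dev/null; then
    echo "running under a hypervisor"
else
    echo "no hypervisor detected (or non-x86)"
fi
```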
Re: [PATCH] 8250: Hypervisors always export working 16550A UARTs.
On Fri, Apr 29, 2016 at 08:16:35AM -0700, Greg KH wrote:
> On Fri, Apr 29, 2016 at 09:10:06AM +0100, Richard W.M. Jones wrote:
> > On Thu, Apr 28, 2016 at 03:56:33PM -0700, Greg KH wrote:
> > > On Thu, Apr 28, 2016 at 11:18:33PM +0100, Richard W.M. Jones wrote:
> > > > Currently autoconf spends 25ms (on my laptop) testing if the UART
> > > > exported to it by KVM is an 8250 without FIFO and/or with strange
> > > > quirks, which it obviously isn't.  Assume it is exported to us by a
> > > > hypervisor, it's a normal, working 16550A.
> > > >
> > > > Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
> > > > ---
> > > >  drivers/tty/serial/8250/8250_port.c | 7 +++++++
> > > >  1 file changed, 7 insertions(+)
> > > >
> > > > diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
> > > > index 00ad2637..de19924 100644
> > > > --- a/drivers/tty/serial/8250/8250_port.c
> > > > +++ b/drivers/tty/serial/8250/8250_port.c
> > > > @@ -1171,6 +1171,13 @@ static void autoconfig(struct uart_8250_port *up)
> > > >  	if (!port->iobase && !port->mapbase && !port->membase)
> > > >  		return;
> > > >
> > > > +	/* Hypervisors always export working 16550A devices. */
> > > > +	if (cpu_has_hypervisor) {
> > > > +		up->port.type = PORT_16550A;
> > > > +		up->capabilities |= UART_CAP_FIFO;
> > > > +		return;
> > > > +	}
> > >
> > > Have you audited vmware, virtualbox, and everyone else that provides a
> > > virtual uart device that it will work properly here?
> > >
> > > qemu isn't all the world :)
> >
> > Attached below is a slightly different approach.  If the user passes a
> > special flag on the kernel command line then we force 16550A and avoid
> > the 25ms delay.  Since the user chooses the flag, any concerns about
> > the behaviour of the hypervisor or use of VFIO should be moot.
>
> No, no more module parameters, that's crazy, what happens when you have
> 64 serial ports in a system, which one is this option for?

In this (very special) case, the domain is running under qemu and I
know it only has one serial port that doesn't need probing.

What's the right way to avoid this 25ms delay?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
Re: [PATCH] crypto: Add a flag allowing the self-tests to be disabled at runtime.
On Fri, Apr 29, 2016 at 12:59:57PM +0200, Stephan Mueller wrote:
> Am Freitag, 29. April 2016, 11:07:43 schrieb Richard W.M. Jones:
>
> Hi Richard,

[...]

> > +	if (notests) {
>
> What about if (!fips_enabled && notests) ?
>
> I am not sure whether the kernel should prevent mistakes in user space.  A
> mistake would be when setting fips=1 and notests=1 as the FIPS mode mandates
> the self tests.

(Sorry, I just posted v2 before I saw this message.)

I saw the FIPS stuff and thought about that.  Should we prevent
mistakes like that?  I really don't know.

Rich.

> > +		pr_info("alg: self-tests disabled\n");
> > +		return 0;
> > +	}
> > +
> > 	alg_test_descs_check_order();
> >
> > 	if ((type & CRYPTO_ALG_TYPE_MASK) == CRYPTO_ALG_TYPE_CIPHER) {
>
> Ciao
> Stephan
> --
> | Nimm das Recht weg - |
> | was ist dann der Staat noch anderes als eine große Räuberbande? |

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
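Stephan's suggested `!fips_enabled && notests` check (which the v3 patch in this thread adopts) gives fips=1 priority over notests=1.  A minimal shell sketch of that precedence, with illustrative variable names standing in for the kernel's flags:

```shell
# Sketch of the !fips_enabled && notests precedence: notests only takes
# effect when FIPS mode is off; fips=1 always forces the self-tests.
decide() {
    fips_enabled=$1
    notests=$2
    if [ "$fips_enabled" -eq 0 ] && [ "$notests" -eq 1 ]; then
        echo "self-tests skipped"
    else
        echo "self-tests run"
    fi
}
decide 0 1   # notests honoured
decide 1 1   # FIPS mode overrides notests
```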
[PATCH v2] crypto: Add a flag allowing the self-tests to be disabled at runtime.
Running self-tests for a short-lived KVM VM takes 28ms on my laptop.
This commit adds a flag 'cryptomgr.notests' which allows them to be
disabled.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 Documentation/kernel-parameters.txt | 3 +++
 crypto/testmgr.c                    | 9 +++++++++
 2 files changed, 12 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 0b3de80..d4d5fb7 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -826,6 +826,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			It will be ignored when crashkernel=X,high is not used
 			or memory reserved is below 4G.
 
+	cryptomgr.notests
+			[KNL] Disable crypto self-tests
+
 	cs89x0_dma=	[HW,NET]
 			Format:

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index b86883a..5c0664d 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -35,6 +35,10 @@
 
 #include "internal.h"
 
+static bool notests;
+module_param(notests, bool, 0644);
+MODULE_PARM_DESC(notests, "disable crypto self-tests");
+
 #ifdef CONFIG_CRYPTO_MANAGER_DISABLE_TESTS
 
 /* a perfect nop */
@@ -3868,6 +3872,11 @@ int alg_test(const char *driver, const char *alg, u32 type, u32 mask)
 	int j;
 	int rc;
 
+	if (notests) {
+		printk_once(KERN_INFO "alg: self-tests disabled\n");
+		return 0;
+	}
+
 	alg_test_descs_check_order();
 
 	if ((type & CRYPTO_ALG_TYPE_MASK) == CRYPTO_ALG_TYPE_CIPHER) {
-- 
2.7.4
[PATCH v2] crypto: Add a flag allowing the self-tests to be disabled at runtime.
v1 -> v2:

 - Use printk_once.  Because the serial console is so slow, printing
   the message multiple times consumed about 6ms extra later on in the
   boot.  Printing it only once is both neater and avoids this extra
   overhead.

Rich.

- - -

I'm trying to reduce the time taken in the kernel in initcalls, with
my aim being to reduce the current ~700ms spent in initcalls before
userspace, down to something like 100ms.  All times on my Broadwell-U
laptop, under virtualization.  The purpose of this is to be able to
launch VMs around containers with minimal overhead, like Intel Clear
Containers, but using standard distro kernels and qemu.

Currently the kernel spends 28ms (on my laptop) running crypto
algorithm self-tests.  Although it's possible to disable these at
compile time, Fedora kernel maintainers want to maintain a single
kernel image for all uses.  So this commit adds a runtime flag which
callers can set to skip the self-tests in the fast
container/virtualization case.
[PATCH] crypto: Add a flag allowing the self-tests to be disabled at runtime.
Running self-tests for a short-lived KVM VM takes 28ms on my laptop.
This commit adds a flag 'cryptomgr.notests' which allows them to be
disabled.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 Documentation/kernel-parameters.txt | 3 +++
 crypto/testmgr.c                    | 9 +++++++++
 2 files changed, 12 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 0b3de80..d4d5fb7 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -826,6 +826,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			It will be ignored when crashkernel=X,high is not used
 			or memory reserved is below 4G.
 
+	cryptomgr.notests
+			[KNL] Disable crypto self-tests
+
 	cs89x0_dma=	[HW,NET]
 			Format:

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index b86883a..dc613f2 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -35,6 +35,10 @@
 
 #include "internal.h"
 
+static bool notests;
+module_param(notests, bool, 0644);
+MODULE_PARM_DESC(notests, "disable crypto self-tests");
+
 #ifdef CONFIG_CRYPTO_MANAGER_DISABLE_TESTS
 
 /* a perfect nop */
@@ -3868,6 +3872,11 @@ int alg_test(const char *driver, const char *alg, u32 type, u32 mask)
 	int j;
 	int rc;
 
+	if (notests) {
+		pr_info("alg: self-tests disabled\n");
+		return 0;
+	}
+
 	alg_test_descs_check_order();
 
 	if ((type & CRYPTO_ALG_TYPE_MASK) == CRYPTO_ALG_TYPE_CIPHER) {
-- 
2.7.4
[PATCH] crypto: Add a flag allowing the self-tests to be disabled at runtime.
I'm trying to reduce the time taken in the kernel in initcalls, with
my aim being to reduce the current ~700ms spent in initcalls before
userspace, down to something like 100ms.  All times on my Broadwell-U
laptop, under virtualization.  The purpose of this is to be able to
launch VMs around containers with minimal overhead, like Intel Clear
Containers, but using standard distro kernels and qemu.

Currently the kernel spends 28ms (on my laptop) running crypto
algorithm self-tests.  Although it's possible to disable these at
compile time, Fedora kernel maintainers want to maintain a single
kernel image for all uses.  So this commit adds a runtime flag which
callers can set to skip the self-tests in the fast
container/virtualization case.

Rich.
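Per-initcall timings of the sort being trimmed here can be measured by booting with the standard `initcall_debug` kernel parameter and summing the `usecs` figures from the resulting log.  A minimal sketch of that summing step (the two printf'd sample lines stand in for real dmesg output, so the total is illustrative):

```shell
# Sum per-initcall latencies from an initcall_debug boot log.
# Real input would come from dmesg on a kernel booted with
# initcall_debug; the printf below substitutes two representative lines.
printf '%s\n' \
    'initcall foo_init+0x0/0x10 returned 0 after 1200 usecs' \
    'initcall bar_init+0x0/0x20 returned 0 after 800 usecs' |
awk '/^initcall .* returned/ {
    for (i = 1; i <= NF; i++)
        if ($i == "after") total += $(i + 1)
}
END { printf "total initcall time: %d usecs\n", total }'
```

For a live system, replace the printf with `dmesg | grep '^initcall '`.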
Re: [PATCH] 8250: Hypervisors always export working 16550A UARTs.
On Thu, Apr 28, 2016 at 03:56:33PM -0700, Greg KH wrote:
> On Thu, Apr 28, 2016 at 11:18:33PM +0100, Richard W.M. Jones wrote:
> > Currently autoconf spends 25ms (on my laptop) testing if the UART
> > exported to it by KVM is an 8250 without FIFO and/or with strange
> > quirks, which it obviously isn't.  Assume it is exported to us by a
> > hypervisor, it's a normal, working 16550A.
> >
> > Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
> > ---
> >  drivers/tty/serial/8250/8250_port.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
> > index 00ad2637..de19924 100644
> > --- a/drivers/tty/serial/8250/8250_port.c
> > +++ b/drivers/tty/serial/8250/8250_port.c
> > @@ -1171,6 +1171,13 @@ static void autoconfig(struct uart_8250_port *up)
> >  	if (!port->iobase && !port->mapbase && !port->membase)
> >  		return;
> >
> > +	/* Hypervisors always export working 16550A devices. */
> > +	if (cpu_has_hypervisor) {
> > +		up->port.type = PORT_16550A;
> > +		up->capabilities |= UART_CAP_FIFO;
> > +		return;
> > +	}
>
> Have you audited vmware, virtualbox, and everyone else that provides a
> virtual uart device that it will work properly here?
>
> qemu isn't all the world :)

Attached below is a slightly different approach.  If the user passes a
special flag on the kernel command line then we force 16550A and avoid
the 25ms delay.  Since the user chooses the flag, any concerns about
the behaviour of the hypervisor or use of VFIO should be moot.

Rich.

>From 5694addf0e05de9c842274be8392ebce5dc24280 Mon Sep 17 00:00:00 2001
From: "Richard W.M. Jones" <rjo...@redhat.com>
Date: Thu, 28 Apr 2016 23:08:54 +0100
Subject: [PATCH] 8250: Allow user to force 16550A UART and avoid probing.

Currently autoconf spends 25ms (on my laptop) testing the UART.  Allow
the user to avoid this delay if they know that all serial ports (eg on
a virtual machine) are ordinary 16550A.

The user does this by passing '8250_base.really_16550a' on the command
line.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 drivers/tty/serial/8250/8250_port.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index 00ad2637..ac92f55 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -53,6 +53,10 @@
 #define DEBUG_AUTOCONF(fmt...)	do { } while (0)
 #endif
 
+static bool really_16550a;
+module_param(really_16550a, bool, 0644);
+MODULE_PARM_DESC(really_16550a, "Don't probe, assume 16550A");
+
 #define BOTH_EMPTY	(UART_LSR_TEMT | UART_LSR_THRE)
 
 /*
@@ -1171,6 +1175,12 @@ static void autoconfig(struct uart_8250_port *up)
 	if (!port->iobase && !port->mapbase && !port->membase)
 		return;
 
+	if (really_16550a) {
+		up->port.type = PORT_16550A;
+		up->capabilities |= UART_CAP_FIFO;
+		return;
+	}
+
 	DEBUG_AUTOCONF("ttyS%d: autoconf (0x%04lx, 0x%p): ",
 		       serial_index(port), port->iobase, port->membase);
-- 
2.7.4

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
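With a patch like the one above applied, skipping the probe would be a matter of adding `8250_base.really_16550a=1` to the kernel command line.  Whether probing happened at all shows up in the kernel log, where detected ports are reported with their type; a sketch of checking that (dmesg may be restricted inside containers, hence the fallback message):

```shell
# Look for the 8250 driver's port-detection lines in the kernel log,
# e.g. "ttyS0 ... is a 16550A".  dmesg can fail without privileges, so
# fall back to a message rather than producing no output.
dmesg 2>/dev/null | grep -E 'ttyS[0-9]+.* is a ' \
    || echo "no serial probe messages visible"
```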
Re: [PATCH] 8250: Hypervisors always export working 16550A UARTs.
On Fri, Apr 29, 2016 at 10:01:08AM +0300, Matwey V. Kornilov wrote:
> 2016-04-29 1:18 GMT+03:00 Richard W.M. Jones <rjo...@redhat.com>:
> > [This is an opinionated patch, mainly for discussion.]
> >
> > I'm trying to reduce the time taken in the kernel in initcalls, with
> > my aim being to reduce the current ~700ms spent in initcalls before
> > userspace, down to something like 100ms.  All times on my Broadwell-U
> > laptop, under virtualization.  The purpose of this is to be able to
> > launch VMs around containers with minimal overhead, like Intel Clear
> > Containers, but using standard distro kernels and qemu.
> >
> > Currently the kernel spends 25ms inspecting the UART that we passed to
> > it from qemu to find out whether it's an 8250/16550/16550A perhaps
> > with a non-working FIFO or other quirks.  Well, it isn't -- it's a
> > working emulated 16550A, with a FIFO and no quirks, and if it isn't,
> > we should fix qemu.
> >
> > So the patch detects if we're running virtualized (perhaps it should
> > only check for qemu/KVM?) and if so, shortcuts the tests.
>
> Does anybody know, whether it is possible to pass through real
> hardware serial port to a guest? It seems to be as simple as to pass
> through an interrupt and memory IO ports.

In theory it seems like something VFIO could do.  Passing something as
low performance as a serial port through would seem to make little
sense though.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
[PATCH] 8250: Hypervisors always export working 16550A UARTs.
[This is an opinionated patch, mainly for discussion.] I'm trying to reduce the time taken in the kernel in initcalls, with my aim being to reduce the current ~700ms spent in initcalls before userspace, down to something like 100ms. All times on my Broadwell-U laptop, under virtualization. The purpose of this is to be able to launch VMs around containers with minimal overhead, like Intel Clear Containers, but using standard distro kernels and qemu. Currently the kernel spends 25ms inspecting the UART that we passed to it from qemu to find out whether it's an 8250/16550/16550A perhaps with a non-working FIFO or other quirks. Well, it isn't -- it's a working emulated 16550A, with a FIFO and no quirks, and if it isn't, we should fix qemu. So the patch detects if we're running virtualized (perhaps it should only check for qemu/KVM?) and if so, shortcuts the tests. Rich.
[PATCH] 8250: Hypervisors always export working 16550A UARTs.
[This is an opinionated patch, mainly for discussion.] I'm trying to reduce the time taken in the kernel in initcalls, with my aim being to reduce the current ~700ms spent in initcalls before userspace, down to something like 100ms. All times on my Broadwell-U laptop, under virtualization. The purpose of this is to be able to launch VMs around containers with minimal overhead, like Intel Clear Containers, but using standard distro kernels and qemu. Currently the kernel spends 25ms inspecting the UART that we passed to it from qemu to find out whether it's an 8250/16550/16550A perhaps with a non-working FIFO or other quirks. Well, it isn't -- it's a working emulated 16550A, with a FIFO and no quirks, and if it isn't, we should fix qemu. So the patch detects if we're running virtualized (perhaps it should only check for qemu/KVM?) and if so, shortcuts the tests. Rich.
[PATCH] 8250: Hypervisors always export working 16550A UARTs.
Currently autoconf spends 25ms (on my laptop) testing if the UART
exported to it by KVM is an 8250 without FIFO and/or with strange
quirks, which it obviously isn't.

Assume that if it is exported to us by a hypervisor, it's a normal,
working 16550A.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 drivers/tty/serial/8250/8250_port.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index 00ad2637..de19924 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -1171,6 +1171,13 @@ static void autoconfig(struct uart_8250_port *up)
         if (!port->iobase && !port->mapbase && !port->membase)
                 return;
 
+        /* Hypervisors always export working 16550A devices. */
+        if (cpu_has_hypervisor) {
+                up->port.type = PORT_16550A;
+                up->capabilities |= UART_CAP_FIFO;
+                return;
+        }
+
         DEBUG_AUTOCONF("ttyS%d: autoconf (0x%04lx, 0x%p): ",
                        serial_index(port), port->iobase, port->membase);
-- 
2.7.4
Re: [PATCH v2 0/2] vfs: Define new syscall getumask.
On Sun, Apr 17, 2016 at 06:57:36PM -0700, Josh Triplett wrote: > O_NOUMASK seems potentially useful to support implementation of umask > entirely in userspace, which also addresses thread-safety. A program > could read its process umask out at startup, handle umask entirely in > userspace (including for threads), and only interact with the system > umask after fork and before exec. I had a look at O_NOUMASK and there are a few problems: It's relatively easy to implement for open(2). A few filesystems implement their own open so I had to go into those filesystems and modify how they handle current_umask too. And FUSE support is tricky so I passed on that. The real problem is that mkdir/mkdirat/mknod/mknodat are affected by umask, but there is no convenient flags parameter to pass the O_NOUMASK flag. So I think the patch only half-solves the problem. I have a patch which needs a bit more testing, which I can post if you think that's helpful, but I don't think it would be acceptable in its current state. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com libguestfs lets you edit virtual machines. Supports shell scripting, bindings from many languages. http://libguestfs.org
Re: [PATCH v2] procfs: expose umask in /proc/<pid>/status
On Fri, Apr 15, 2016 at 03:13:10PM +0200, Michal Hocko wrote:
> On Thu 14-04-16 12:08:15, Richard W.M. Jones wrote:
> > It's not possible to read the process umask without also modifying it,
> > which is what umask(2) does.  A library cannot read umask safely,
> > especially if the main program might be multithreaded.
> 
> It would be helpful to describe the usecase a bit better. Who needs to
> read the umask and what for (e.g. what if the umask changes right after
> it has been read by the 3rd party?).
> 
> I am not opposing the patch as is but this exports a new information to
> the userspace and we will have to maintain it for ever, so we should
> better describe the usecase.

The use case is that we have endless trouble with people setting weird
umask() values (usually on the grounds of "security"), and then
everything breaking.  I'm on the hook to fix these.  We'd like to add
debugging to our program so we can dump out the umask in debug
reports.

Previous versions of the patch used a syscall so you could only read
your own umask.  That's all I need.  However there was quite a lot of
push-back from those, so this new version exports it in /proc.  See:

https://lkml.org/lkml/2016/4/13/704 [umask2]
https://lkml.org/lkml/2016/4/13/487 [getumask]

Rich.

> > Add a new status line ("Umask") in /proc/<pid>/status.  It contains
> > the file mode creation mask (umask) in octal.  It is only shown for
> > tasks which have task->fs.
> > 
> > This patch is adapted from one originally written by Pierre Carrier.
> > 
> > Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
> > ---
> >  Documentation/filesystems/proc.txt |  1 +
> >  fs/proc/array.c                    | 20 +++++++++++++++++++-
> >  2 files changed, 20 insertions(+), 1 deletion(-)
> > 
> > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> > index 7f5607a..e8d0075 100644
> > --- a/Documentation/filesystems/proc.txt
> > +++ b/Documentation/filesystems/proc.txt
> > @@ -225,6 +225,7 @@ Table 1-2: Contents of the status files (as of 4.1)
> >   TracerPid          PID of process tracing this process (0 if not)
> >   Uid                Real, effective, saved set, and file system UIDs
> >   Gid                Real, effective, saved set, and file system GIDs
> > + Umask              file mode creation mask
> >   FDSize             number of file descriptor slots currently allocated
> >   Groups             supplementary group list
> >   NStgid             descendant namespace thread group ID hierarchy
> > diff --git a/fs/proc/array.c b/fs/proc/array.c
> > index b6c00ce..88c7de1 100644
> > --- a/fs/proc/array.c
> > +++ b/fs/proc/array.c
> > @@ -83,6 +83,7 @@
> >  #include
> >  #include
> >  #include
> > +#include <linux/fs_struct.h>
> > 
> >  #include
> >  #include
> > @@ -139,12 +140,25 @@ static inline const char *get_task_state(struct task_struct *tsk)
> >          return task_state_array[fls(state)];
> >  }
> > 
> > +static inline int get_task_umask(struct task_struct *tsk)
> > +{
> > +        struct fs_struct *fs;
> > +        int umask = -ENOENT;
> > +
> > +        task_lock(tsk);
> > +        fs = tsk->fs;
> > +        if (fs)
> > +                umask = fs->umask;
> > +        task_unlock(tsk);
> > +        return umask;
> > +}
> > +
> >  static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
> >                                struct pid *pid, struct task_struct *p)
> >  {
> >          struct user_namespace *user_ns = seq_user_ns(m);
> >          struct group_info *group_info;
> > -        int g;
> > +        int g, umask;
> >          struct task_struct *tracer;
> >          const struct cred *cred;
> >          pid_t ppid, tpid = 0, tgid, ngid;
> > @@ -162,6 +176,10 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
> >          ngid = task_numa_group_id(p);
> >          cred = get_task_cred(p);
> > 
> > +        umask = get_task_umask(p);
> > +        if (umask >= 0)
> > +                seq_printf(m, "Umask:\t%#04o\n", umask);
> > +
> >          task_lock(p);
> >          if (p->files)
> >                  max_fds = files_fdtable(p->files)->max_fds;
> > -- 
> > 2.7.4
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW
[PATCH v2] procfs: expose umask in /proc/<pid>/status
v1 -> v2:
 - Change printf format to %#04o.
 - Retest and update examples accordingly.

--

It's not possible to read the process umask without also modifying it,
which is what umask(2) does.  A library cannot read umask safely,
especially if the main program might be multithreaded.

Add a new status line ("Umask") in /proc/<pid>/status.  It contains
the file mode creation mask (umask) in octal.  It is only shown for
tasks which have task->fs.

For the library this allows me to read the umask from
/proc/self/status.

This patch is adapted from one originally written by Pierre Carrier:
https://lkml.org/lkml/2012/5/4/451

Example usage:

$ grep Umask /proc/1/status
Umask:  0022
$ grep Umask /proc/2/status
Umask:  0022
$ grep Umask /proc/self/status
Umask:  0022
$ umask 002
$ grep Umask /proc/self/status
Umask:  0002

Rich.
[PATCH v2] procfs: expose umask in /proc/<pid>/status
It's not possible to read the process umask without also modifying it,
which is what umask(2) does.  A library cannot read umask safely,
especially if the main program might be multithreaded.

Add a new status line ("Umask") in /proc/<pid>/status.  It contains
the file mode creation mask (umask) in octal.  It is only shown for
tasks which have task->fs.

This patch is adapted from one originally written by Pierre Carrier.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 Documentation/filesystems/proc.txt |  1 +
 fs/proc/array.c                    | 20 +++++++++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 7f5607a..e8d0075 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -225,6 +225,7 @@ Table 1-2: Contents of the status files (as of 4.1)
  TracerPid          PID of process tracing this process (0 if not)
  Uid                Real, effective, saved set, and file system UIDs
  Gid                Real, effective, saved set, and file system GIDs
+ Umask              file mode creation mask
  FDSize             number of file descriptor slots currently allocated
  Groups             supplementary group list
  NStgid             descendant namespace thread group ID hierarchy
diff --git a/fs/proc/array.c b/fs/proc/array.c
index b6c00ce..88c7de1 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -83,6 +83,7 @@
 #include
 #include
 #include
+#include <linux/fs_struct.h>
 
 #include
 #include
@@ -139,12 +140,25 @@ static inline const char *get_task_state(struct task_struct *tsk)
         return task_state_array[fls(state)];
 }
 
+static inline int get_task_umask(struct task_struct *tsk)
+{
+        struct fs_struct *fs;
+        int umask = -ENOENT;
+
+        task_lock(tsk);
+        fs = tsk->fs;
+        if (fs)
+                umask = fs->umask;
+        task_unlock(tsk);
+        return umask;
+}
+
 static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
                               struct pid *pid, struct task_struct *p)
 {
         struct user_namespace *user_ns = seq_user_ns(m);
         struct group_info *group_info;
-        int g;
+        int g, umask;
         struct task_struct *tracer;
         const struct cred *cred;
         pid_t ppid, tpid = 0, tgid, ngid;
@@ -162,6 +176,10 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
         ngid = task_numa_group_id(p);
         cred = get_task_cred(p);
 
+        umask = get_task_umask(p);
+        if (umask >= 0)
+                seq_printf(m, "Umask:\t%#04o\n", umask);
+
         task_lock(p);
         if (p->files)
                 max_fds = files_fdtable(p->files)->max_fds;
-- 
2.7.4
Re: [PATCH] procfs: expose umask in /proc/<pid>/status (formerly umask2, formerly getumask)
On Thu, Apr 14, 2016 at 10:34:48AM +0100, Richard W.M. Jones wrote:
> It's not possible to read the process umask without also modifying it,
> which is what umask(2) does.  A library cannot read umask safely,
> especially if the main program might be multithreaded.
>
> Add a new status line ("Umask") in /proc/<pid>/status.  It contains
> the file mode creation mask (umask) in octal.  It is only shown for
> tasks which have task->fs.
>
> For the library this allows me to read the umask from
> /proc/self/status.
>
> This patch is adapted from one originally written by Pierre Carrier:
> https://lkml.org/lkml/2012/5/4/451

Sorry, I meant to add an example of what this looks like:

$ grep Umask /proc/1/status
Umask:  022
$ grep Umask /proc/2/status
Umask:  022
$ grep Umask /proc/self/status
Umask:  022
$ umask 002
$ grep Umask /proc/self/status
Umask:  02

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
[PATCH] procfs: expose umask in /proc/<pid>/status
It's not possible to read the process umask without also modifying it,
which is what umask(2) does.  A library cannot read umask safely,
especially if the main program might be multithreaded.

Add a new status line ("Umask") in /proc/<pid>/status.  It contains
the file mode creation mask (umask) in octal.  It is only shown for
tasks which have task->fs.

This patch is adapted from one originally written by Pierre Carrier.

Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 Documentation/filesystems/proc.txt |  1 +
 fs/proc/array.c                    | 20 +++++++++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 7f5607a..e8d0075 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -225,6 +225,7 @@ Table 1-2: Contents of the status files (as of 4.1)
  TracerPid          PID of process tracing this process (0 if not)
  Uid                Real, effective, saved set, and file system UIDs
  Gid                Real, effective, saved set, and file system GIDs
+ Umask              file mode creation mask
  FDSize             number of file descriptor slots currently allocated
  Groups             supplementary group list
  NStgid             descendant namespace thread group ID hierarchy
diff --git a/fs/proc/array.c b/fs/proc/array.c
index b6c00ce..03e8d3f 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -83,6 +83,7 @@
 #include
 #include
 #include
+#include <linux/fs_struct.h>
 
 #include
 #include
@@ -139,12 +140,25 @@ static inline const char *get_task_state(struct task_struct *tsk)
         return task_state_array[fls(state)];
 }
 
+static inline int get_task_umask(struct task_struct *tsk)
+{
+        struct fs_struct *fs;
+        int umask = -ENOENT;
+
+        task_lock(tsk);
+        fs = tsk->fs;
+        if (fs)
+                umask = fs->umask;
+        task_unlock(tsk);
+        return umask;
+}
+
 static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
                               struct pid *pid, struct task_struct *p)
 {
         struct user_namespace *user_ns = seq_user_ns(m);
         struct group_info *group_info;
-        int g;
+        int g, umask;
         struct task_struct *tracer;
         const struct cred *cred;
         pid_t ppid, tpid = 0, tgid, ngid;
@@ -162,6 +176,10 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
         ngid = task_numa_group_id(p);
         cred = get_task_cred(p);
 
+        umask = get_task_umask(p);
+        if (umask >= 0)
+                seq_printf(m, "Umask:\t0%o\n", umask);
+
         task_lock(p);
         if (p->files)
                 max_fds = files_fdtable(p->files)->max_fds;
-- 
2.7.4
[PATCH] procfs: expose umask in /proc//status
It's not possible to read the process umask without also modifying it, which is what umask(2) does. A library cannot read umask safely, especially if the main program might be multithreaded. Add a new status line ("Umask") in /proc//status. It contains the file mode creation mask (umask) in octal. It is only shown for tasks which have task->fs. This patch is adapted from one originally written by Pierre Carrier. Signed-off-by: Richard W.M. Jones --- Documentation/filesystems/proc.txt | 1 + fs/proc/array.c| 20 +++- 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 7f5607a..e8d0075 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -225,6 +225,7 @@ Table 1-2: Contents of the status files (as of 4.1) TracerPid PID of process tracing this process (0 if not) Uid Real, effective, saved set, and file system UIDs Gid Real, effective, saved set, and file system GIDs + Umask file mode creation mask FDSize number of file descriptor slots currently allocated Groups supplementary group list NStgid descendant namespace thread group ID hierarchy diff --git a/fs/proc/array.c b/fs/proc/array.c index b6c00ce..03e8d3f 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -83,6 +83,7 @@ #include #include #include +#include #include #include @@ -139,12 +140,25 @@ static inline const char *get_task_state(struct task_struct *tsk) return task_state_array[fls(state)]; } +static inline int get_task_umask(struct task_struct *tsk) +{ + struct fs_struct *fs; + int umask = -ENOENT; + + task_lock(tsk); + fs = tsk->fs; + if (fs) + umask = fs->umask; + task_unlock(tsk); + return umask; +} + static inline void task_state(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *p) { struct user_namespace *user_ns = seq_user_ns(m); struct group_info *group_info; - int g; + int g, umask; struct task_struct *tracer; const struct cred *cred; pid_t 
ppid, tpid = 0, tgid, ngid; @@ -162,6 +176,10 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns, ngid = task_numa_group_id(p); cred = get_task_cred(p); + umask = get_task_umask(p); + if (umask >= 0) + seq_printf(m, "Umask:\t0%o\n", umask); + task_lock(p); if (p->files) max_fds = files_fdtable(p->files)->max_fds; -- 2.7.4
[PATCH] procfs: expose umask in /proc/<pid>/status (formerly umask2, formerly getumask)
It's not possible to read the process umask without also modifying it,
which is what umask(2) does.  A library cannot read umask safely,
especially if the main program might be multithreaded.

Add a new status line ("Umask") in /proc/<pid>/status.  It contains
the file mode creation mask (umask) in octal.  It is only shown for
tasks which have task->fs.

For the library this allows me to read the umask from
/proc/self/status.

This patch is adapted from one originally written by Pierre Carrier:
https://lkml.org/lkml/2012/5/4/451

Rich.
Re: [PATCH v4 0/3] vfs: Define new syscall umask2 [formerly getumask]
On Thu, Apr 14, 2016 at 09:09:38AM +1000, Stephen Rothwell wrote: > Hi Richard, > > On Wed, 13 Apr 2016 20:05:33 +0100 "Richard W.M. Jones" <rjo...@redhat.com> > wrote: > > > > It's not possible to read the process umask without also modifying it, > > which is what umask(2) does. A library cannot read umask safely, > > especially if the main program might be multithreaded. > > I was wondering if you really need to read the umask, or would just a > "ignore umask" flag to open{,at} do what you want? This would be very useful, although I think being able to read umask is also useful. --- FWIW I am currently developing a patch to add umask to /proc/PID/status. Will post it shortly once I've tested it properly. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top
Re: [PATCH v4 0/3] vfs: Define new syscall umask2 [formerly getumask]
On Wed, Apr 13, 2016 at 10:45:05PM +0200, Florian Weimer wrote:
> * H. Peter Anvin:
>
> > I have to say I'm skeptic to the need for umask2() as opposed to
> > getumask().
>
> I find the extension with a set-the-thread umask somewhat unlikely.
> How would a potential per-thread umask interact with CLONE_FS?
> Have a per-thread umask that, when active, overrides the global one,
> similar to what uselocale provides?  That seems rather messy, and
> I'm not aware of any precedent.

The flags parameter is for extensions we can't envisage now ...

... although since umask has been around since Unix V6 in 1978, we
might be waiting a long time.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
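For context, the only way to read the mask with the existing interface is the double call, which is exactly the race getumask()/umask2() is meant to close (a minimal illustration; `racy_getumask` is a hypothetical name, and this is only safe in single-threaded programs):

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Read the current umask using only umask(2).  Between the two calls
 * the process umask is temporarily 0, so a file created by another
 * thread at that instant gets no mask applied -- the race that makes
 * this approach unsafe in multithreaded programs and libraries. */
static mode_t racy_getumask(void)
{
	mode_t mask = umask(0);  /* returns the old mask, but sets it to 0 */

	umask(mask);             /* restore the original value */
	return mask;
}
```

The umask is process-global (shared across threads unless CLONE_FS is unshared), so there is no window-free way to do this from userspace today.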
umask2 man page (was: Re: [PATCH v4 0/3] vfs: Define new syscall umask2 [formerly getumask])
UMASK(2)                   Linux Programmer's Manual                  UMASK(2)

NAME
       umask, umask2 - get and set file mode creation mask

SYNOPSIS
       #include <sys/types.h>
       #include <sys/stat.h>

       mode_t umask(mode_t mask);

       #define _GNU_SOURCE
       #include
       #include
       #include

       mode_t umask2(mode_t mask, int flags);

DESCRIPTION
       umask() sets the calling process's file mode creation mask (umask)
       to mask & 0777 (i.e., only the file permission bits of mask are
       used), and returns the previous value of the mask.

       If flags is 0, then umask2() is the same as umask().  If flags is
       UMASK_GET_MASK then umask2() ignores the mask parameter and returns
       the process's current umask.  The process's current mask is not
       modified in this case.

       The umask is used by open(2), mkdir(2), and other system calls that
       create files to modify the permissions placed on newly created
       files or directories.  Specifically, permissions in the umask are
       turned off from the mode argument to open(2) and mkdir(2).

       Alternatively, if the parent directory has a default ACL (see
       acl(5)), the umask is ignored, the default ACL is inherited, the
       permission bits are set based on the inherited ACL, and permission
       bits absent in the mode argument are turned off.  For example, the
       following default ACL is equivalent to a umask of 022:

           u::rwx,g::r-x,o::r-x

       Combining the effect of this default ACL with a mode argument of
       0666 (rw-rw-rw-), the resulting file permissions would be 0644
       (rw-r--r--).

       The constants that should be used to specify mask are described
       under stat(2).

       The typical default value for the process umask is S_IWGRP |
       S_IWOTH (octal 022).  In the usual case where the mode argument to
       open(2) is specified as:

           S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH

       (octal 0666) when creating a new file, the permissions on the
       resulting file will be:

           S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH

       (because 0666 & ~022 = 0644; i.e., rw-r--r--).

RETURN VALUE
       The umask() system call always succeeds and the previous value of
       the mask is returned.

       The umask2() system call returns the process's current umask on
       success.  On error it returns -1 and sets errno appropriately.

CONFORMING TO
       SVr4, 4.3BSD, POSIX.1-2001.

NOTES
       A child process created via fork(2) inherits its parent's umask.
       The umask is left unchanged by execve(2).

       The umask setting also affects the permissions assigned to POSIX
       IPC objects (mq_open(3), sem_open(3), shm_open(3)), FIFOs
       (mkfifo(3)), and UNIX domain sockets (unix(7)) created by the
       process.  The umask does not affect the permissions assigned to
       System V IPC objects created by the process (using msgget(2),
       semget(2), shmget(2)).

SEE ALSO
       chmod(2), mkdir(2), open(2), stat(2), acl(5)

COLOPHON
       This page is part of release 4.00 of the Linux man-pages project.
       A description of the project, information about reporting bugs,
       and the latest version of this page, can be found at
       http://www.kernel.org/doc/man-pages/.

Linux                             2016-04-13                          UMASK(2)
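The mode arithmetic in DESCRIPTION can be checked directly (a trivial sketch to verify the worked example; `effective_mode` is an illustrative name, not part of any proposed interface):

```c
#include <sys/types.h>

/* Effective permissions of a newly created file: the bits set in the
 * umask are cleared from the mode argument given to open(2)/mkdir(2). */
static mode_t effective_mode(mode_t requested, mode_t mask)
{
	return requested & ~mask;
}
```

With the man page's example, `effective_mode(0666, 022)` yields 0644 (rw-r--r--), matching the text.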
Re: [PATCH v3 0/2] vfs: Define new syscall getumask.
On Wed, Apr 13, 2016 at 11:52:53AM -0700, Davidlohr Bueso wrote:
> ENOMANPAGE

Where do man pages go?  In the man-pages project?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
[PATCH v4 3/3] vfs: selftests: Add test for umask2 system call.
Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>
---
 tools/testing/selftests/Makefile          |  1 +
 tools/testing/selftests/fs/.gitignore     |  1 +
 tools/testing/selftests/fs/Makefile       |  9
 tools/testing/selftests/fs/umask2-tests.c | 77 +++
 4 files changed, 88 insertions(+)
 create mode 100644 tools/testing/selftests/fs/.gitignore
 create mode 100644 tools/testing/selftests/fs/Makefile
 create mode 100644 tools/testing/selftests/fs/umask2-tests.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index b04afc3..9e2eb24 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -4,6 +4,7 @@ TARGETS += cpu-hotplug
 TARGETS += efivarfs
 TARGETS += exec
 TARGETS += firmware
+TARGETS += fs
 TARGETS += ftrace
 TARGETS += futex
 TARGETS += ipc
diff --git a/tools/testing/selftests/fs/.gitignore b/tools/testing/selftests/fs/.gitignore
new file mode 100644
index 000..057dced
--- /dev/null
+++ b/tools/testing/selftests/fs/.gitignore
@@ -0,0 +1 @@
+umask2-tests
diff --git a/tools/testing/selftests/fs/Makefile b/tools/testing/selftests/fs/Makefile
new file mode 100644
index 000..6f231d7
--- /dev/null
+++ b/tools/testing/selftests/fs/Makefile
@@ -0,0 +1,9 @@
+BINARIES := umask2-tests
+
+all: $(BINARIES)
+
+clean:
+	$(RM) $(BINARIES)
+
+include ../lib.mk
+
diff --git a/tools/testing/selftests/fs/umask2-tests.c b/tools/testing/selftests/fs/umask2-tests.c
new file mode 100644
index 000..3e01575
--- /dev/null
+++ b/tools/testing/selftests/fs/umask2-tests.c
@@ -0,0 +1,77 @@
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#ifndef UMASK_GET_MASK
+#define UMASK_GET_MASK 1
+#endif
+
+static int umask2_(int mask, int flags)
+{
+#ifdef __NR_umask2
+	return syscall(__NR_umask2, mask, flags);
+#else
+	errno = ENOSYS;
+	return -1;
+#endif
+}
+
+int main(int argc, char **argv)
+{
+	int r;
+
+	/* umask2 available in current kernel?
+	 */
+	r = umask2_(0, UMASK_GET_MASK);
+	if (r == -1 && errno == ENOSYS) {
+		fprintf(stderr,
+			"umask2 not available in current kernel or headers, skipping test\n");
+		exit(0);
+	}
+
+	/* Check that old umask still works. */
+	r = umask(022);
+	if (r == -1) {
+		perror("umask");
+		exit(1);
+	}
+
+	/* Call umask2 to emulate old umask. */
+	r = umask2_(023, 0);
+	if (r == -1) {
+		perror("umask2");
+		exit(1);
+	}
+	if (r != 022) {
+		fprintf(stderr, "umask2: expected %o, got %o\n", 022, r);
+		exit(1);
+	}
+
+	/* Call umask2 to read current umask without modifying it. */
+	r = umask2_(0777, UMASK_GET_MASK);
+	if (r == -1) {
+		perror("umask2");
+		exit(1);
+	}
+	if (r != 023) {
+		fprintf(stderr, "umask2: expected %o, got %o\n", 023, r);
+		exit(1);
+	}
+
+	/* Call it again to make sure we didn't modify umask. */
+	r = umask2_(0777, UMASK_GET_MASK);
+	if (r == -1) {
+		perror("umask2");
+		exit(1);
+	}
+	if (r != 023) {
+		fprintf(stderr, "umask2: expected %o, got %o\n", 023, r);
+		exit(1);
+	}
+
+	exit(0);
+}
-- 
2.7.4