Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
On Thu, Jun 28, 2018 at 08:38:34AM +0800, Ye Xiaolong wrote:
> Update the result:
>
> testcase/path_params/tbox_group/run: will-it-scale/poll2-performance/lkp-sb03

So this looks like a huge improvement in the per-process ops, though not
as large as the original regression, and no change in the per-thread
ops.  But the baseline already looks much lower: e.g. this shows an
improvement from 404611 to 424608 for the per-process ops, while the
original report showed a regression from 501456 to 457120.  Are we
measuring on different hardware?  Did we gain new Spectre mitigations
elsewhere?

Either way I'm going to send these patches out for review, but I'd like
to understand the numbers a bit more.

> 894b8c000ae6c106            8fbedc19c94fd25a2b9b327015
> ----------------            --------------------------
>          %stddev    change           %stddev
>              \         |                 \
>     404611 ±  4%      5%       424608        will-it-scale.per_process_ops
>       1489 ± 21%     28%         1899 ± 18%  will-it-scale.time.voluntary_context_switches
>   45828560                   46155690        will-it-scale.workload
>       2337                       2342        will-it-scale.time.system_time
>        806                        806        will-it-scale.time.percent_of_cpu_this_job_got
>        310                        310        will-it-scale.time.elapsed_time
>        310                        310        will-it-scale.time.elapsed_time.max
>       4096                       4096        will-it-scale.time.page_size
>     233917                     233862        will-it-scale.per_thread_ops
>      17196                      17179        will-it-scale.time.minor_page_faults
>       9901                       9862        will-it-scale.time.maximum_resident_set_size
>      14705 ±  3%                14397 ±  4%  will-it-scale.time.involuntary_context_switches
>        167                        163        will-it-scale.time.user_time
>       0.66 ± 25%    -17%         0.54        will-it-scale.scalability
>     120508 ± 15%     -7%       112098 ±  5%  interrupts.CAL:Function_call_interrupts
>       1670 ±  3%     10%         1845 ±  3%  vmstat.system.cs
>      32707                      32635        vmstat.system.in
>        121                        122        turbostat.CorWatt
>        149                        150        turbostat.PkgWatt
>       1573                       1573        turbostat.Avg_MHz
>      17.54 ± 19%                17.77 ± 19%  boot-time.kernel_boot
>        824 ± 12%                  834 ± 12%  boot-time.idle
>      27.45 ± 12%                27.69 ± 12%  boot-time.boot
>      16.96 ± 21%                16.93 ± 21%  boot-time.dhcp
>       1489 ± 21%     28%         1899 ± 18%  time.voluntary_context_switches
>       2337                       2342        time.system_time
>        806                        806        time.percent_of_cpu_this_job_got
>        310                        310        time.elapsed_time
>        310                        310        time.elapsed_time.max
>       4096                       4096        time.page_size
>      17196                      17179        time.minor_page_faults
>       9901                       9862        time.maximum_resident_set_size
>      14705 ±  3%                14397 ±  4%  time.involuntary_context_switches
>        167                        163        time.user_time
>      18320              6%       19506 ±  8%  proc-vmstat.nr_slab_unreclaimable
>       1518 ±  7%                 1558 ± 10%  proc-vmstat.numa_hint_faults
>       1387 ±  8%                 1421 ±  9%  proc-vmstat.numa_hint_faults_local
>       1873 ±  5%                 1917 ±  8%  proc-vmstat.numa_pte_updates
>      19987                      20005        proc-vmstat.nr_anon_pages
>       8464                       8471        proc-vmstat.nr_kernel_stack
>     309815                     310062        proc-vmstat.nr_file_pages
>      50828                      50828        proc-vmstat.nr_free_cma
>   16065590                   16064831        proc-vmstat.nr_free_pages
>    3194669                    3194517        proc-vmstat.nr_dirty_threshold
>    1595384                    1595308        proc-vmstat.nr_dirty_background_threshold
>     798886                     797937        proc-vmstat.pgfault
>       6510                       6499        proc-vmstat.nr_mapped
>     659089                     657491        proc-vmstat.numa_local
>     665458                     663786        proc-vmstat.numa_hit
>       1037                       1033        proc-vmstat.nr_page_table_pages
>     669923                     665906        proc-vmstat.pgfree
>     676982                     672385        proc-vmstat.pgalloc_normal
>       6368                       6294        proc-vmstat.numa_other
>      13013             -7%       12152 ± 11%  proc-vmstat.nr_slab_reclaimable
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
On 06/27, Christoph Hellwig wrote:
>On Tue, Jun 26, 2018 at 02:03:38PM +0800, Ye Xiaolong wrote:
>> Hi,
>>
>> On 06/22, Christoph Hellwig wrote:
>> >Hi Xiaolong,
>> >
>> >can you retest this workload on the following branch:
>> >
>> >git://git.infradead.org/users/hch/vfs.git remove-get-poll-head
>> >
>> >Gitweb:
>> >
>> >    http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/remove-get-poll-head
>>
>> Here is the comparison for commit 3deb642f0d and commit 8fbedc1 ("fs:
>> replace f_ops->get_poll_head with a static ->f_poll_head pointer") in
>> remove-get-poll-head branch.
>
>Especially the boot time ones and others look like they have additional
>changes.
>
>Can you compare the baseline of my tree, which is
>894b8c00 ("Merge tag 'for_v4.18-rc2' of
>git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs") against
>8fbedc1 ("fs: replace f_ops->get_poll_head with a static ->f_poll_head
>pointer") ?

Update the result:

testcase/path_params/tbox_group/run: will-it-scale/poll2-performance/lkp-sb03

894b8c000ae6c106            8fbedc19c94fd25a2b9b327015
----------------            --------------------------
         %stddev    change           %stddev
             \         |                 \
    404611 ±  4%      5%       424608        will-it-scale.per_process_ops
      1489 ± 21%     28%         1899 ± 18%  will-it-scale.time.voluntary_context_switches
  45828560                   46155690        will-it-scale.workload
      2337                       2342        will-it-scale.time.system_time
       806                        806        will-it-scale.time.percent_of_cpu_this_job_got
       310                        310        will-it-scale.time.elapsed_time
       310                        310        will-it-scale.time.elapsed_time.max
      4096                       4096        will-it-scale.time.page_size
    233917                     233862        will-it-scale.per_thread_ops
     17196                      17179        will-it-scale.time.minor_page_faults
      9901                       9862        will-it-scale.time.maximum_resident_set_size
     14705 ±  3%                14397 ±  4%  will-it-scale.time.involuntary_context_switches
       167                        163        will-it-scale.time.user_time
      0.66 ± 25%    -17%         0.54        will-it-scale.scalability
    120508 ± 15%     -7%       112098 ±  5%  interrupts.CAL:Function_call_interrupts
      1670 ±  3%     10%         1845 ±  3%  vmstat.system.cs
     32707                      32635        vmstat.system.in
       121                        122        turbostat.CorWatt
       149                        150        turbostat.PkgWatt
      1573                       1573        turbostat.Avg_MHz
     17.54 ± 19%                17.77 ± 19%  boot-time.kernel_boot
       824 ± 12%                  834 ± 12%  boot-time.idle
     27.45 ± 12%                27.69 ± 12%  boot-time.boot
     16.96 ± 21%                16.93 ± 21%  boot-time.dhcp
      1489 ± 21%     28%         1899 ± 18%  time.voluntary_context_switches
      2337                       2342        time.system_time
       806                        806        time.percent_of_cpu_this_job_got
       310                        310        time.elapsed_time
       310                        310        time.elapsed_time.max
      4096                       4096        time.page_size
     17196                      17179        time.minor_page_faults
      9901                       9862        time.maximum_resident_set_size
     14705 ±  3%                14397 ±  4%  time.involuntary_context_switches
       167                        163        time.user_time
     18320              6%       19506 ±  8%  proc-vmstat.nr_slab_unreclaimable
      1518 ±  7%                 1558 ± 10%  proc-vmstat.numa_hint_faults
      1387 ±  8%                 1421 ±  9%  proc-vmstat.numa_hint_faults_local
      1873 ±  5%                 1917 ±  8%  proc-vmstat.numa_pte_updates
     19987                      20005        proc-vmstat.nr_anon_pages
      8464                       8471        proc-vmstat.nr_kernel_stack
    309815                     310062        proc-vmstat.nr_file_pages
     50828                      50828        proc-vmstat.nr_free_cma
  16065590                   16064831        proc-vmstat.nr_free_pages
   3194669                    3194517        proc-vmstat.nr_dirty_threshold
   1595384                    1595308        proc-vmstat.nr_dirty_background_threshold
    798886                     797937        proc-vmstat.pgfault
      6510                       6499        proc-vmstat.nr_mapped
    659089                     657491        proc-vmstat.numa_local
    665458                     663786        proc-vmstat.numa_hit
      1037                       1033        proc-vmstat.nr_page_table_pages
    669923                     665906        proc-vmstat.pgfree
    676982
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
On Tue, Jun 26, 2018 at 02:03:38PM +0800, Ye Xiaolong wrote:
> Hi,
>
> On 06/22, Christoph Hellwig wrote:
> >Hi Xiaolong,
> >
> >can you retest this workload on the following branch:
> >
> >git://git.infradead.org/users/hch/vfs.git remove-get-poll-head
> >
> >Gitweb:
> >
> >    http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/remove-get-poll-head
>
> Here is the comparison for commit 3deb642f0d and commit 8fbedc1 ("fs: replace
> f_ops->get_poll_head with a static ->f_poll_head pointer") in
> remove-get-poll-head branch.

Especially the boot time ones and others look like they have additional
changes.

Can you compare the baseline of my tree, which is 894b8c00 ("Merge tag
'for_v4.18-rc2' of
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs"), against
8fbedc1 ("fs: replace f_ops->get_poll_head with a static ->f_poll_head
pointer") ?
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
Hi,

On 06/22, Christoph Hellwig wrote:
>Hi Xiaolong,
>
>can you retest this workload on the following branch:
>
>git://git.infradead.org/users/hch/vfs.git remove-get-poll-head
>
>Gitweb:
>
>    http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/remove-get-poll-head

Here is the comparison for commit 3deb642f0d and commit 8fbedc1 ("fs: replace
f_ops->get_poll_head with a static ->f_poll_head pointer") in
remove-get-poll-head branch.

3deb642f0de4c14f            8fbedc19c94fd25a2b9b327015
----------------            --------------------------
         %stddev    change           %stddev
             \         |                 \
    457120            -7%      424608        will-it-scale.per_process_ops
    238978                     233862        will-it-scale.per_thread_ops
      1755 ± 13%      8%         1899 ± 18%  will-it-scale.time.voluntary_context_switches
      2332                       2342        will-it-scale.time.system_time
       310                        310        will-it-scale.time.elapsed_time
       310                        310        will-it-scale.time.elapsed_time.max
      4096                       4096        will-it-scale.time.page_size
      0.54                       0.54        will-it-scale.scalability
       807                        806        will-it-scale.time.percent_of_cpu_this_job_got
     17218                      17179        will-it-scale.time.minor_page_faults
      9931                       9862        will-it-scale.time.maximum_resident_set_size
       173            -6%         163        will-it-scale.time.user_time
  49024375            -6%    46155690        will-it-scale.workload
     17818 ± 10%    -19%        14397 ±  4%  will-it-scale.time.involuntary_context_switches
    116842 ± 12%     -4%       112098 ±  5%  interrupts.CAL:Function_call_interrupts
     32735                      32635        vmstat.system.in
      2112 ±  7%    -13%         1845 ±  3%  vmstat.system.cs
       150                        150        turbostat.PkgWatt
       123                        122        turbostat.CorWatt
      1573                       1573        turbostat.Avg_MHz
     15.73            13%       17.77 ± 19%  boot-time.kernel_boot
     15.07            12%       16.93 ± 21%  boot-time.dhcp
       771             8%         834 ± 12%  boot-time.idle
     25.69             8%       27.69 ± 12%  boot-time.boot
      1755 ± 13%       8%        1899 ± 18%  time.voluntary_context_switches
      2332                       2342        time.system_time
       310                        310        time.elapsed_time
       310                        310        time.elapsed_time.max
      4096                       4096        time.page_size
       807                        806        time.percent_of_cpu_this_job_got
     17218                      17179        time.minor_page_faults
      9931                       9862        time.maximum_resident_set_size
       173            -6%         163        time.user_time
     17818 ± 10%    -19%        14397 ±  4%  time.involuntary_context_switches
    428813 ±  9%     57%       672385        proc-vmstat.pgalloc_normal
     41736 ± 15%     22%        50828        proc-vmstat.nr_free_cma
     18116             8%       19506 ±  8%  proc-vmstat.nr_slab_unreclaimable
      1029                       1033        proc-vmstat.nr_page_table_pages
      8453                       8471        proc-vmstat.nr_kernel_stack
      6486                       6499        proc-vmstat.nr_mapped
   3193607                    3194517        proc-vmstat.nr_dirty_threshold
   1594853                    1595308        proc-vmstat.nr_dirty_background_threshold
  16061877                   16064831        proc-vmstat.nr_free_pages
     20009                      20005        proc-vmstat.nr_anon_pages
      6303                       6294        proc-vmstat.numa_other
    799772                     797937        proc-vmstat.pgfault
    667803                     665906        proc-vmstat.pgfree
    666440                     663786        proc-vmstat.numa_hit
    660136                     657491        proc-vmstat.numa_local
    313125                     310062        proc-vmstat.nr_file_pages
      1941 ±  5%                 1917 ±  8%  proc-vmstat.numa_pte_updates
      1448 ±  7%                 1421 ±  9%  proc-vmstat.numa_hint_faults_local
      1596 ±  6%                 1558 ± 10%  proc-vmstat.numa_hint_faults
     12893            -6%       12152 ± 11%  proc-vmstat.nr_slab_reclaimable
     22885          -100%           0        proc-vmstat.nr_indirectly_reclaimable
    245443 ± 16%   -100%            0        proc-vmstat.pgalloc_movable
  19861107 ± 14%     34%     26619357 ± 35%  perf-stat.node-load-misses
  51734389 ±  5%     22%     63014695 ± 25%  perf-stat.node-loads
 1.924e+09 ±  3%     21%     2.32e+09 ±  5%  perf-stat.iTLB-load-misses
 2.342e+09 ±  8%
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
On Fri, Jun 22, 2018 at 09:02:55PM +0100, Al Viro wrote:
> > While at the same time correct poll code already checks net_busy_loop_on
> > to set POLL_BUSY_LOOP.  So except for sockets where people set the
> > timeout to 0 the code already does the right thing as-is.  IMHO not
> > really worth wasting a FMODE_* flag for it, but if you insist I'll add
> > it.
>
> It's not just that - there's also an issue of extra indirect call on the
> fast path for sockets.  You get this method of yours + ->poll_mask(),
> which hits another indirect to per-family ->poll_mask().  It might be
> better to have these combined, sparing us an extra indirect call.
>
> Just give it the same calling conventions as ->poll_mask() have...

The problem is that for the busy poll we want the actual busy poll +
__pollwait + ->poll_mask.  Which is going to make that new poll_busy_loop
with a return value look exactly like ->poll.  So for now I'm tempted to
just do this:

---
From 4abf23f6565ff2a74f1859758f9c894abe476a00 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig
Date: Sat, 23 Jun 2018 09:02:59 +0200
Subject: FOLD: remove ->poll_busy_loop again

Busy looping always comes in from poll(2) or select(2).  So instead of
adding a separate method we can just do it at the beginning of ->poll
for now.

Signed-off-by: Christoph Hellwig
---
 fs/select.c        |  8 --------
 include/linux/fs.h |  1 -
 net/socket.c       | 20 ++------------
 3 files changed, 6 insertions(+), 23 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 25327efca2f9..c68f7cdc777a 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -38,14 +38,6 @@ __poll_t vfs_poll(struct file *file, struct poll_table_struct *pt)
 {
 	unsigned int events = poll_requested_events(pt);
 
-	/*
-	 * XXX: might be worth adding a f_mode flag to see if busy looping is
-	 * supported.  Although callers probably only keep setting it when
-	 * supported, that's why POLL_BUSY_LOOP is reported in the output.
-	 */
-	if ((events & POLL_BUSY_LOOP) && file->f_op->poll_busy_loop)
-		file->f_op->poll_busy_loop(file, events);
-
 	if (file->f_op->poll) {
 		return file->f_op->poll(file, pt);
 	} else if (file->f_poll_head) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 82133bd1a047..bfaebdc03878 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1722,7 +1722,6 @@ struct file_operations {
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
 	__poll_t (*poll_mask) (struct file *, __poll_t);
-	void (*poll_busy_loop)(struct file *file, __poll_t events);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
diff --git a/net/socket.c b/net/socket.c
index b52e5b900e09..0aaa49190b30 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -131,19 +131,6 @@ static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
 				struct pipe_inode_info *pipe, size_t len,
 				unsigned int flags);
 
-#ifdef CONFIG_NET_RX_BUSY_POLL
-static void sock_poll_busy_loop(struct file *file, __poll_t events)
-{
-	struct socket *sock = file->private_data;
-
-	/* once, only if requested by syscall */
-	if (sk_can_busy_loop(sock->sk))
-		sk_busy_loop(sock->sk, 1);
-}
-#else
-#define sock_poll_busy_loop	NULL
-#endif
-
 /*
  *	Socket files have a set of 'special' operations as well as the
  *	generic file ones. These don't appear in the operation structures
  *	but are done directly via the socketcall() multiplexor.
@@ -155,7 +142,6 @@ static const struct file_operations socket_file_ops = {
 	.read_iter =	sock_read_iter,
 	.write_iter =	sock_write_iter,
 	.poll_mask =	sock_poll_mask,
-	.poll_busy_loop = sock_poll_busy_loop,
 	.poll =		sock_poll,
 	.unlocked_ioctl = sock_ioctl,
 #ifdef CONFIG_COMPAT
@@ -1163,6 +1149,12 @@ static __poll_t sock_poll(struct file *file, poll_table *wait)
 	struct socket *sock = file->private_data;
 	__poll_t events = poll_requested_events(wait), mask = 0;
 
+	/*
+	 * Poll once, if requested by syscall.
+	 */
+	if ((events & POLL_BUSY_LOOP) && sk_can_busy_loop(sock->sk))
+		sk_busy_loop(sock->sk, 1);
+
 	if (sock->ops->poll) {
 		mask = sock->ops->poll(file, sock, wait);
 	} else if (sock->ops->poll_mask) {
-- 
2.17.1
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
On Fri, Jun 22, 2018 at 06:18:02PM +0200, Christoph Hellwig wrote:
> On Fri, Jun 22, 2018 at 05:28:50PM +0200, Christoph Hellwig wrote:
> > On Fri, Jun 22, 2018 at 04:14:09PM +0100, Al Viro wrote:
> > > >
> > > >    http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/remove-get-poll-head
> > >
> > > See objections upthread re "fs,net: move poll busy loop handling into a
> > > separate method"; as for the next one... I'd like an ACK from networking
> > > folks.  The rest of queue makes sense.
> >
> > I want to see basic results first before micro-optimizing.  After that
> > I'll send it out to the net folks for feedback.
>
> I looked into this a bit, in the end sk_can_busy_loop does this:
>
> 	return sk->sk_ll_usec && !signal_pending(current);
>
> where sk_ll_usec defaults based on a sysctl that needs to be
> turned on, but can be overridden per socket.
>
> While at the same time correct poll code already checks net_busy_loop_on
> to set POLL_BUSY_LOOP.  So except for sockets where people set the
> timeout to 0 the code already does the right thing as-is.  IMHO not
> really worth wasting a FMODE_* flag for it, but if you insist I'll add
> it.

It's not just that - there's also an issue of extra indirect call on the
fast path for sockets.  You get this method of yours + ->poll_mask(),
which hits another indirect to per-family ->poll_mask().  It might be
better to have these combined, sparing us an extra indirect call.

Just give it the same calling conventions as ->poll_mask() have...
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
On Fri, Jun 22, 2018 at 7:01 AM Al Viro wrote:
>
> On Fri, Jun 22, 2018 at 12:00:14PM +0200, Christoph Hellwig wrote:
> > And a version with select() also covered:
>
> For fuck sake, if you want vfs_poll() inlined, *make* *it* *inlined*.
> Is there any reason for not doing that other than EXPORT_SYMBOL_GPL
> fetish? Because if there isn't, I would like to draw your attention
> to the fact that _this_ pwecious inchewlekshul pwopewty can be trivially
> open-coded by out-of-tree shite even if it happens to be a non-GPL one.

Was this suggestion so bad that you have to insult not only the author,
but also people with speech impediments?

Sean

> > > mask = vfs_poll(f.file, wait);
> > > + if (f.file->f_op->poll) {
>
> ... not to mention that here you forgot to remove the call itself while
> expanding it.
>
> Said that, you are not attacking the worst part of it - it's a static
> branch, not the considerably more costly indirect ones. Remember when
> I asked you about the price of those? Method calls are costly.
>
> Another problem with the ->get_poll_head() calling conventions is
> that originally you wanted to return ERR_PTR(-mask) as a way to report
> not needing to call ->poll_mask(); that got shot down since quite
> a few of those don't fit into the 12 bits that ERR_PTR() gives us.
>
> IIRC, the real reason for a non-constant ->get_poll_head() was the
> sockets, with
>
> static struct wait_queue_head *sock_get_poll_head(struct file *file,
>                 __poll_t events)
> {
>         struct socket *sock = file->private_data;
>
>         if (!sock->ops->poll_mask)
>                 return NULL;
>         sock_poll_busy_loop(sock, events);
>         return sk_sleep(sock->sk);
> }
>
> The first part isn't a problem (it is constant). The second is
>
> static inline void sock_poll_busy_loop(struct socket *sock, __poll_t events)
> {
>         if (sk_can_busy_loop(sock->sk) &&
>             events && (events & POLL_BUSY_LOOP)) {
>                 /* once, only if requested by syscall */
>                 sk_busy_loop(sock->sk, 1);
>         }
> }
>
> and the third -
>
> static inline wait_queue_head_t *sk_sleep(struct sock *sk)
> {
>         BUILD_BUG_ON(offsetof(struct socket_wq, wait) != 0);
>         return &rcu_dereference_raw(sk->sk_wq)->wait;
> }
>
> Now, ->sk_wq is modified only in sock_init_data() and sock_graft();
> the latter, IIRC, is an ->accept() helper. Do we ever call either of
> those on a sock of an already opened file? IOW, is there any real
> reason for the socket ->get_poll_head() not to be constant, other
> than wanting to keep POLL_BUSY_LOOP handling out of ->poll_mask()?
> I agree that POLL_BUSY_LOOP is ugly as hell, but you *still* have
> sock_poll_mask() not free from it...
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
On Fri, Jun 22, 2018 at 05:28:50PM +0200, Christoph Hellwig wrote:
> On Fri, Jun 22, 2018 at 04:14:09PM +0100, Al Viro wrote:
> > > > http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/remove-get-poll-head
> >
> > See objections upthread re "fs,net: move poll busy loop handling into a
> > separate method"; as for the next one... I'd like an ACK from networking
> > folks. The rest of the queue makes sense.
>
> I want to see basic results first before micro-optimizing. After that
> I'll send it out to the net folks for feedback.

I looked into this a bit; in the end sk_can_busy_loop does this:

    return sk->sk_ll_usec && !signal_pending(current);

where sk_ll_usec defaults based on a sysctl that needs to be
turned on, but can be overridden per socket.

At the same time, the core poll code already checks net_busy_loop_on
to set POLL_BUSY_LOOP. So except for sockets where people set the
timeout to 0, the code already does the right thing as-is. IMHO it's not
really worth wasting a FMODE_* flag for it, but if you insist I'll add
it.
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
On Fri, Jun 22, 2018 at 04:14:09PM +0100, Al Viro wrote:
> > > > http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/remove-get-poll-head
>
> See objections upthread re "fs,net: move poll busy loop handling into a
> separate method"; as for the next one... I'd like an ACK from networking
> folks. The rest of the queue makes sense.

I want to see basic results first before micro-optimizing. After that
I'll send it out to the net folks for feedback.
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
On Fri, Jun 22, 2018 at 05:02:51PM +0200, Christoph Hellwig wrote:
> Hi Xiaolong,
>
> can you retest this workload on the following branch:
>
>     git://git.infradead.org/users/hch/vfs.git remove-get-poll-head
>
> Gitweb:
>
>     http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/remove-get-poll-head

See objections upthread re "fs,net: move poll busy loop handling into a
separate method"; as for the next one... I'd like an ACK from networking
folks. The rest of the queue makes sense.
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
Hi Xiaolong,

can you retest this workload on the following branch:

    git://git.infradead.org/users/hch/vfs.git remove-get-poll-head

Gitweb:

    http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/remove-get-poll-head
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
On Fri, Jun 22, 2018 at 02:33:07PM +0200, Christoph Hellwig wrote:
> On Fri, Jun 22, 2018 at 01:17:22PM +0100, Al Viro wrote:
> > > The problem is that call to sk_busy_loop(), which is going to be
> > > indirect no matter what.
> >
> > if ->f_poll_head is NULL {
> >         use ->poll
> > } else {
> >         if can ll_poll (checked in ->f_mode)
> >                 call ->ll_poll(), if it returns what we want - we are done
> >         add to ->f_poll_head
> >         call ->poll_mask()
>
> What I have for now is slightly different:
>
>         if ((events & POLL_BUSY_LOOP) && file->f_op->poll_busy_loop)
>                 file->f_op->poll_busy_loop(file, events);
>
>         if (file->f_op->poll) {
>                 return file->f_op->poll(file, pt);
>         } else if (file_has_poll_mask(file)) {
>                 ...
>         }
>
> The "returns whatever we want" part is something I want to look into
> once the basics are done, as it is probably not entirely trivial due to
> the structure of polling in the low-level network protocols.

First of all, you'll get the same ->f_op for *all* sockets. So you'll be
hitting that path regardless of sk_can_busy_loop(sock->sk). What's more,
that way you get (on the fast path) even more indirect calls, AFAICS.
And I don't see any point in a separate file_has_poll_mask() - just
check ->f_poll_head and that's it.
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
On Fri, Jun 22, 2018 at 01:17:22PM +0100, Al Viro wrote:
> > The problem is that call to sk_busy_loop(), which is going to be
> > indirect no matter what.
>
> if ->f_poll_head is NULL {
>         use ->poll
> } else {
>         if can ll_poll (checked in ->f_mode)
>                 call ->ll_poll(), if it returns what we want - we are done
>         add to ->f_poll_head
>         call ->poll_mask()

What I have for now is slightly different:

        if ((events & POLL_BUSY_LOOP) && file->f_op->poll_busy_loop)
                file->f_op->poll_busy_loop(file, events);

        if (file->f_op->poll) {
                return file->f_op->poll(file, pt);
        } else if (file_has_poll_mask(file)) {
                ...
        }

The "returns whatever we want" part is something I want to look into
once the basics are done, as it is probably not entirely trivial due to
the structure of polling in the low-level network protocols.
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
On Fri, Jun 22, 2018 at 02:07:39PM +0200, Christoph Hellwig wrote:
> On Fri, Jun 22, 2018 at 12:56:13PM +0100, Al Viro wrote:
> > So mark that in ->f_mode - I strongly suspect that
> > sk_can_busy_loop(sock->sk) can't change while an opened file is there.
> > And lift that (conditional on a new FMODE_BUSY_LOOP) into do_poll()
> > and do_select() - we *already* have bits and pieces of that logic in
> > there, and that way they'd at least be gathered in one place.
>
> The problem is that call to sk_busy_loop(), which is going to be
> indirect no matter what.

if ->f_poll_head is NULL {
        use ->poll
} else {
        if can ll_poll (checked in ->f_mode)
                call ->ll_poll(), if it returns what we want - we are done
        add to ->f_poll_head
        call ->poll_mask()
}
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
On Fri, Jun 22, 2018 at 12:56:13PM +0100, Al Viro wrote:
> So mark that in ->f_mode - I strongly suspect that
> sk_can_busy_loop(sock->sk) can't change while an opened file is there.
> And lift that (conditional on a new FMODE_BUSY_LOOP) into do_poll()
> and do_select() - we *already* have bits and pieces of that logic in
> there, and that way they'd at least be gathered in one place.

The problem is that call to sk_busy_loop(), which is going to be
indirect no matter what.

> Then replace ->get_poll_head() with file->f_poll_head and
> see what it gives.

Working on it right now. It works so far except for the busy loop case.
I'm looking into a separate method just for that as a first idea.
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
On Fri, Jun 22, 2018 at 01:53:00PM +0200, Christoph Hellwig wrote:
> > Now, ->sk_wq is modified only in sock_init_data() and sock_graft();
> > the latter, IIRC, is an ->accept() helper. Do we ever call either of
> > those on a sock of an already opened file? IOW, is there any real
> > reason for the socket ->get_poll_head() not to be constant, other
> > than wanting to keep POLL_BUSY_LOOP handling out of ->poll_mask()?
> > I agree that POLL_BUSY_LOOP is ugly as hell, but you *still* have
> > sock_poll_mask() not free from it...
>
> I'd have to defer to the networking folks on whether busy looping after
> pollwait is what they want, but I suspect the answer is no - by the time
> we are already waiting on the queue, busy waiting seems pointless.

So mark that in ->f_mode - I strongly suspect that
sk_can_busy_loop(sock->sk) can't change while an opened file is there.
And lift that (conditional on a new FMODE_BUSY_LOOP) into do_poll()
and do_select() - we *already* have bits and pieces of that logic in
there, and that way they'd at least be gathered in one place.

Then replace ->get_poll_head() with file->f_poll_head and
see what it gives.
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
On Fri, Jun 22, 2018 at 12:01:17PM +0100, Al Viro wrote:
> For fuck sake, if you want vfs_poll() inlined, *make* *it* *inlined*.

That is not going to help with de-virtualizing _qproc, which was the
whole idea of that change. At least not without a compiler way smarter
than gcc. But if you want it inline that is fine with me; it just seems
a little large for inlining. Note that I plan to actually remove all
callers of vfs_poll except for poll and select in a pending series, at
which point it would become static anyway.

> Said that, you are not attacking the worst part of it - it's a static
> branch, not the considerably more costly indirect ones. Remember when
> I asked you about the price of those? Method calls are costly.

And back then it did not show up even in poll-heavy workloads. But since
then something new happened - spectre mitigations, which make indirect
calls exorbitantly more expensive.

> Now, ->sk_wq is modified only in sock_init_data() and sock_graft();
> the latter, IIRC, is an ->accept() helper. Do we ever call either of
> those on a sock of an already opened file? IOW, is there any real
> reason for the socket ->get_poll_head() not to be constant, other
> than wanting to keep POLL_BUSY_LOOP handling out of ->poll_mask()?
> I agree that POLL_BUSY_LOOP is ugly as hell, but you *still* have
> sock_poll_mask() not free from it...

I'd have to defer to the networking folks on whether busy looping after
pollwait is what they want, but I suspect the answer is no - by the time
we are already waiting on the queue, busy waiting seems pointless.
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
On Fri, Jun 22, 2018 at 12:00:14PM +0200, Christoph Hellwig wrote:
> And a version with select() also covered:

For fuck sake, if you want vfs_poll() inlined, *make* *it* *inlined*.
Is there any reason for not doing that other than EXPORT_SYMBOL_GPL
fetish? Because if there isn't, I would like to draw your attention
to the fact that _this_ pwecious inchewlekshul pwopewty can be trivially
open-coded by out-of-tree shite even if it happens to be a non-GPL one.

> > mask = vfs_poll(f.file, wait);
> > + if (f.file->f_op->poll) {

... not to mention that here you forgot to remove the call itself while
expanding it.

Said that, you are not attacking the worst part of it - it's a static
branch, not the considerably more costly indirect ones. Remember when
I asked you about the price of those? Method calls are costly.

Another problem with the ->get_poll_head() calling conventions is
that originally you wanted to return ERR_PTR(-mask) as a way to report
not needing to call ->poll_mask(); that got shot down since quite
a few of those don't fit into the 12 bits that ERR_PTR() gives us.

IIRC, the real reason for a non-constant ->get_poll_head() was the
sockets, with

static struct wait_queue_head *sock_get_poll_head(struct file *file,
                __poll_t events)
{
        struct socket *sock = file->private_data;

        if (!sock->ops->poll_mask)
                return NULL;
        sock_poll_busy_loop(sock, events);
        return sk_sleep(sock->sk);
}

The first part isn't a problem (it is constant). The second is

static inline void sock_poll_busy_loop(struct socket *sock, __poll_t events)
{
        if (sk_can_busy_loop(sock->sk) &&
            events && (events & POLL_BUSY_LOOP)) {
                /* once, only if requested by syscall */
                sk_busy_loop(sock->sk, 1);
        }
}

and the third -

static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
        BUILD_BUG_ON(offsetof(struct socket_wq, wait) != 0);
        return &rcu_dereference_raw(sk->sk_wq)->wait;
}

Now, ->sk_wq is modified only in sock_init_data() and sock_graft();
the latter, IIRC, is an ->accept() helper. Do we ever call either of
those on a sock of an already opened file? IOW, is there any real
reason for the socket ->get_poll_head() not to be constant, other
than wanting to keep POLL_BUSY_LOOP handling out of ->poll_mask()?
I agree that POLL_BUSY_LOOP is ugly as hell, but you *still* have
sock_poll_mask() not free from it...
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
On Fri, Jun 22, 2018 at 7:02 PM Linus Torvalds wrote:
>
> Get your act together. Don't uglify and slow down everything else just
> because you're concentrating only on aio.

.. and seriously, poll and select are timing-critical. There are many
real loads where they show up as *the* thing in kernel profiles. aio is
a distant, distant second cousin.

You need to make sure poll and select are prioritized, and aio should be
a "if we can do this cleanly" case.

Linus
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
On Fri, Jun 22, 2018 at 6:46 PM Christoph Hellwig wrote:
>
> > The disadvantages are obvious: every poll event now causes *two*
> > indirect branches to the low-level filesystem or driver - one to get
> > the poll head, and one to get the mask. Add to that all the new "do we
> > have the new-style or old sane poll interface" tests, and poll is
> > obviously more complicated.
>
> It already caused two

No it didn't. If the data was ready, all that got short-circuited, and
we just had that ->poll() call. Only if you *waited* did you get the
second one to check the result, and the whole poll_wait(). You could
just say "I already have the data, there's no wait-queue to add, I'll
just return immediately".

You're making this all unconditionally pessimal. Admit it. The new
interface is inferior.

> In the meantime below is an ugly patch that removes the _qproc
> indirect for ->poll only (a similar patch is possible for select,
> assuming the code uses select). And for the next merge window I plan
> to kill it off entirely.

You're just making the code even worse here. Seriously, this patch is
beyond ugly. The new interface is nasty shit.

Get your act together. Don't uglify and slow down everything else just
because you're concentrating only on aio.

Your reply so far just makes me more convinced this was not thought
through properly.

Linus
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
And a version with select() also covered:

---
From 317159003ae28113cf759c632b161fb39192fe3c Mon Sep 17 00:00:00 2001
From: Christoph Hellwig
Date: Fri, 22 Jun 2018 11:36:26 +0200
Subject: fs: optimize away ->_qproc indirection for poll_mask based polling

Signed-off-by: Christoph Hellwig
---
 fs/select.c | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/fs/select.c b/fs/select.c
index bc3cc0f98896..2c9d81892509 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -528,6 +528,24 @@ static int do_select(int n, fd_set_bits *fds, struct timespec64 *end_time)
 					wait_key_set(wait, in, out, bit, busy_flag);
 					mask = vfs_poll(f.file, wait);
+					if (f.file->f_op->poll) {
+						mask = f.file->f_op->poll(f.file, wait);
+					} else if (file_has_poll_mask(f.file)) {
+						struct wait_queue_head *head;
+
+						head = f.file->f_op->get_poll_head(f.file, wait->_key);
+						if (!head) {
+							mask = DEFAULT_POLLMASK;
+						} else if (IS_ERR(head)) {
+							mask = EPOLLERR;
+						} else {
+							if (wait->_qproc)
+								__pollwait(f.file, head, wait);
+							mask = f.file->f_op->poll_mask(f.file, wait->_key);
+						}
+					} else {
+						mask = DEFAULT_POLLMASK;
+					}
 					fdput(f);
 					if ((mask & POLLIN_SET) && (in & bit)) {
@@ -845,7 +863,25 @@ static inline __poll_t do_pollfd(struct pollfd *pollfd, poll_table *pwait,
 	/* userland u16 ->events contains POLL... bitmap */
 	filter = demangle_poll(pollfd->events) | EPOLLERR | EPOLLHUP;
 	pwait->_key = filter | busy_flag;
-	mask = vfs_poll(f.file, pwait);
+	if (f.file->f_op->poll) {
+		mask = f.file->f_op->poll(f.file, pwait);
+	} else if (file_has_poll_mask(f.file)) {
+		struct wait_queue_head *head;
+
+		head = f.file->f_op->get_poll_head(f.file, pwait->_key);
+		if (!head) {
+			mask = DEFAULT_POLLMASK;
+		} else if (IS_ERR(head)) {
+			mask = EPOLLERR;
+		} else {
+			if (pwait->_qproc)
+				__pollwait(f.file, head, pwait);
+			mask = f.file->f_op->poll_mask(f.file, pwait->_key);
+		}
+	} else {
+		mask = DEFAULT_POLLMASK;
+	}
+
 	if (mask & busy_flag)
 		*can_busy_poll = true;
 	mask &= filter;		/* Mask out unneeded events. */
--
2.17.1
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
On Fri, Jun 22, 2018 at 06:25:45PM +0900, Linus Torvalds wrote:
> What was the alleged advantage of the new poll methods again? Because
> it sure isn't obvious - not from the numbers, and not from the commit
> messages.

The primary goal is that we can implement a race-free aio poll; the primary benefit is that we can get rid of the currently racy and bug-prone way we do in-kernel poll-like calls for things like eventfd. The first clearly is in 4.18-rc and provides massive performance advantages if used; the second is not there yet, more on that below.

> I was assuming there was a good reason for it, but looking closer I
> see absolutely nothing but negatives. The argument that keyed wake-ups
> somehow make multiple wake-queues irrelevant doesn't hold water when
> the code is more complex and apparently slower. It's not like anybody
> ever *had* to use multiple wait-queues, but the old code was both
> simpler and cleaner and *allowed* you to use multiple queues if you
> wanted to.

It wasn't cleaner at all if you aren't poll or select, and even for those it isn't exactly clean; see the whole mess around ->qproc.

> The disadvantages are obvious: every poll event now causes *two*
> indirect branches to the low-level filesystem or driver - one to get
> the poll head, and one to get the mask. Add to that all the new "do we
> have the new-style or old sane poll interface" tests, and poll is
> obviously more complicated.

It already caused two, and now we have three thanks to ->qproc. One of the advantages of the new code is that we can eventually get rid of ->qproc once all users of a non-default qproc are switched away from vfs_poll. That requires a little more work, but I have the patches for that to be posted soon.

> If we could get the poll head by just having a direct pointer in the
> 'struct file', maybe that would be one thing. As it is, this all
> literally just adds overhead for no obvious reason. It replaced one
> simple direct call with two dependent but separate ones.

People are doing weird things with their poll heads, so we can't do that unconditionally. We could however offer a waitqueue pointer in struct file, and most users would be very happy with that.

In the meantime below is an ugly patch that removes the _qproc indirect for ->poll only (a similar patch is possible for select, assuming the code uses select). And for the next merge window I plan to kill it off entirely.

How can we get this thrown into the will-it-scale run?

---
From 50ca47fdcfec0a1af56aac6db8a168bb678308a5 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig
Date: Fri, 22 Jun 2018 11:36:26 +0200
Subject: fs: optimize away ->_qproc indirection for poll_mask based polling

Signed-off-by: Christoph Hellwig
---
 fs/select.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/select.c b/fs/select.c
index bc3cc0f98896..54406e0ad23e 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -845,7 +845,25 @@ static inline __poll_t do_pollfd(struct pollfd *pollfd, poll_table *pwait,
 	/* userland u16 ->events contains POLL... bitmap */
 	filter = demangle_poll(pollfd->events) | EPOLLERR | EPOLLHUP;
 	pwait->_key = filter | busy_flag;
-	mask = vfs_poll(f.file, pwait);
+	if (f.file->f_op->poll) {
+		mask = f.file->f_op->poll(f.file, pwait);
+	} else if (file_has_poll_mask(f.file)) {
+		struct wait_queue_head *head;
+
+		head = f.file->f_op->get_poll_head(f.file, pwait->_key);
+		if (!head) {
+			mask = DEFAULT_POLLMASK;
+		} else if (IS_ERR(head)) {
+			mask = EPOLLERR;
+		} else {
+			if (pwait->_qproc)
+				__pollwait(f.file, head, pwait);
+			mask = f.file->f_op->poll_mask(f.file, pwait->_key);
+		}
+	} else {
+		mask = DEFAULT_POLLMASK;
+	}
+
 	if (mask & busy_flag)
 		*can_busy_poll = true;
 	mask &= filter;		/* Mask out unneeded events. */
--
2.17.1
Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression
[ Added Al, since this all came in through his trees. The guilty authors were already added by the robot ]

On Fri, Jun 22, 2018 at 5:31 PM kernel test robot wrote:
>
> FYI, we noticed a -8.8% regression of will-it-scale.per_process_ops due to
> commit:

Guys, this seems pretty big.

What was the alleged advantage of the new poll methods again? Because it sure isn't obvious - not from the numbers, and not from the commit messages.

The code seems to be garbage. It replaces our nice generic "you can do anything you damn well like in your poll function" with two limited fixed-function "just give me the poll head and the mask" interfaces.

I was assuming there was a good reason for it, but looking closer I see absolutely nothing but negatives. The argument that keyed wake-ups somehow make multiple wake-queues irrelevant doesn't hold water when the code is more complex and apparently slower. It's not like anybody ever *had* to use multiple wait-queues, but the old code was both simpler and cleaner and *allowed* you to use multiple queues if you wanted to.

So the old code is simpler, cleaner, and more flexible. And according to the test robot, it also performs better.

So I'm inclined to just revert the whole mess unless I get some serious explanations for what the supposed advantages are.

The disadvantages are obvious: every poll event now causes *two* indirect branches to the low-level filesystem or driver - one to get the poll head, and one to get the mask. Add to that all the new "do we have the new-style or old sane poll interface" tests, and poll is obviously more complicated.

Quite frankly, when I look at it, I just go "that's _stupid_". I'm entirely missing the point of the conversion, and it's not explained in the messages either.

If we could get the poll head by just having a direct pointer in the 'struct file', maybe that would be one thing. As it is, this all literally just adds overhead for no obvious reason. It replaced one simple direct call with two dependent but separate ones.

Linus