Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-28 Thread Christoph Hellwig
On Thu, Jun 28, 2018 at 08:38:34AM +0800, Ye Xiaolong wrote:
> Update the result:
> 
> testcase/path_params/tbox_group/run: will-it-scale/poll2-performance/lkp-sb03

So this looks like a huge improvement in the per process ops, but not
as large as the original regression, and no change in the per-thread
ops.

But the baseline already looks much lower, e.g. this shows an improvement
from 404611 to 424608 for the per-process ops, while the original showed
a regression from 501456 to 457120.  Are we measuring on different
hardware?  Did we gain new Spectre mitigations elsewhere?  Either way
I'm going to send these patches out for review, but I'd like to understand
the numbers a bit more.

> 
> 894b8c000ae6c106  8fbedc19c94fd25a2b9b327015
> ----------------  --------------------------
>      %stddev      change      %stddev
>          \           |            \
>     404611 ±  4%      5%     424608        will-it-scale.per_process_ops
>       1489 ± 21%     28%       1899 ± 18%  will-it-scale.time.voluntary_context_switches
>   45828560                 46155690        will-it-scale.workload
>       2337                     2342        will-it-scale.time.system_time
>        806                      806        will-it-scale.time.percent_of_cpu_this_job_got
>        310                      310        will-it-scale.time.elapsed_time
>        310                      310        will-it-scale.time.elapsed_time.max
>       4096                     4096        will-it-scale.time.page_size
>     233917                   233862        will-it-scale.per_thread_ops
>      17196                    17179        will-it-scale.time.minor_page_faults
>       9901                     9862        will-it-scale.time.maximum_resident_set_size
>      14705 ±  3%              14397 ±  4%  will-it-scale.time.involuntary_context_switches
>        167                      163        will-it-scale.time.user_time
>       0.66 ± 25%    -17%       0.54        will-it-scale.scalability
>     120508 ± 15%     -7%     112098 ±  5%  interrupts.CAL:Function_call_interrupts
>       1670 ±  3%     10%       1845 ±  3%  vmstat.system.cs
>      32707                    32635        vmstat.system.in
>        121                      122        turbostat.CorWatt
>        149                      150        turbostat.PkgWatt
>       1573                     1573        turbostat.Avg_MHz
>      17.54 ± 19%              17.77 ± 19%  boot-time.kernel_boot
>        824 ± 12%                834 ± 12%  boot-time.idle
>      27.45 ± 12%              27.69 ± 12%  boot-time.boot
>      16.96 ± 21%              16.93 ± 21%  boot-time.dhcp
>       1489 ± 21%     28%       1899 ± 18%  time.voluntary_context_switches
>       2337                     2342        time.system_time
>        806                      806        time.percent_of_cpu_this_job_got
>        310                      310        time.elapsed_time
>        310                      310        time.elapsed_time.max
>       4096                     4096        time.page_size
>      17196                    17179        time.minor_page_faults
>       9901                     9862        time.maximum_resident_set_size
>      14705 ±  3%              14397 ±  4%  time.involuntary_context_switches
>        167                      163        time.user_time
>      18320            6%      19506 ±  8%  proc-vmstat.nr_slab_unreclaimable
>       1518 ±  7%               1558 ± 10%  proc-vmstat.numa_hint_faults
>       1387 ±  8%               1421 ±  9%  proc-vmstat.numa_hint_faults_local
>       1873 ±  5%               1917 ±  8%  proc-vmstat.numa_pte_updates
>      19987                    20005        proc-vmstat.nr_anon_pages
>       8464                     8471        proc-vmstat.nr_kernel_stack
>     309815                   310062        proc-vmstat.nr_file_pages
>      50828                    50828        proc-vmstat.nr_free_cma
>   16065590                 16064831        proc-vmstat.nr_free_pages
>    3194669                  3194517        proc-vmstat.nr_dirty_threshold
>    1595384                  1595308        proc-vmstat.nr_dirty_background_threshold
>     798886                   797937        proc-vmstat.pgfault
>       6510                     6499        proc-vmstat.nr_mapped
>     659089                   657491        proc-vmstat.numa_local
>     665458                   663786        proc-vmstat.numa_hit
>       1037                     1033        proc-vmstat.nr_page_table_pages
>     669923                   665906        proc-vmstat.pgfree
>     676982                   672385        proc-vmstat.pgalloc_normal
>       6368                     6294        proc-vmstat.numa_other
>      13013           -7%      12152 ± 11%  proc-vmstat.nr_slab_reclaimable


Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-27 Thread Ye Xiaolong
On 06/27, Christoph Hellwig wrote:
>On Tue, Jun 26, 2018 at 02:03:38PM +0800, Ye Xiaolong wrote:
>> Hi,
>> 
>> On 06/22, Christoph Hellwig wrote:
>> >Hi Xiaolong,
>> >
>> >can you retest this workload on the following branch:
>> >
>> >git://git.infradead.org/users/hch/vfs.git remove-get-poll-head
>> >
>> >Gitweb:
>> >
>> >
>> > http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/remove-get-poll-head
>> 
>> Here is the comparison for commit 3deb642f0d and commit 8fbedc1 ("fs: 
>> replace f_ops->get_poll_head with a static ->f_poll_head pointer") in 
>> the remove-get-poll-head branch.
>
>Especially the boot time ones and others look like they have additional
>changes.
>
>Can you compare the baseline of my tree, which is
>894b8c00 ("Merge tag 'for_v4.18-rc2' of
>git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs") against 8fbedc1
>(("fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer") ?

Update the result:

testcase/path_params/tbox_group/run: will-it-scale/poll2-performance/lkp-sb03

894b8c000ae6c106  8fbedc19c94fd25a2b9b327015
----------------  --------------------------
     %stddev      change      %stddev
         \           |            \
    404611 ±  4%      5%     424608        will-it-scale.per_process_ops
      1489 ± 21%     28%       1899 ± 18%  will-it-scale.time.voluntary_context_switches
  45828560                 46155690        will-it-scale.workload
      2337                     2342        will-it-scale.time.system_time
       806                      806        will-it-scale.time.percent_of_cpu_this_job_got
       310                      310        will-it-scale.time.elapsed_time
       310                      310        will-it-scale.time.elapsed_time.max
      4096                     4096        will-it-scale.time.page_size
    233917                   233862        will-it-scale.per_thread_ops
     17196                    17179        will-it-scale.time.minor_page_faults
      9901                     9862        will-it-scale.time.maximum_resident_set_size
     14705 ±  3%              14397 ±  4%  will-it-scale.time.involuntary_context_switches
       167                      163        will-it-scale.time.user_time
      0.66 ± 25%    -17%       0.54        will-it-scale.scalability
    120508 ± 15%     -7%     112098 ±  5%  interrupts.CAL:Function_call_interrupts
      1670 ±  3%     10%       1845 ±  3%  vmstat.system.cs
     32707                    32635        vmstat.system.in
       121                      122        turbostat.CorWatt
       149                      150        turbostat.PkgWatt
      1573                     1573        turbostat.Avg_MHz
     17.54 ± 19%              17.77 ± 19%  boot-time.kernel_boot
       824 ± 12%                834 ± 12%  boot-time.idle
     27.45 ± 12%              27.69 ± 12%  boot-time.boot
     16.96 ± 21%              16.93 ± 21%  boot-time.dhcp
      1489 ± 21%     28%       1899 ± 18%  time.voluntary_context_switches
      2337                     2342        time.system_time
       806                      806        time.percent_of_cpu_this_job_got
       310                      310        time.elapsed_time
       310                      310        time.elapsed_time.max
      4096                     4096        time.page_size
     17196                    17179        time.minor_page_faults
      9901                     9862        time.maximum_resident_set_size
     14705 ±  3%              14397 ±  4%  time.involuntary_context_switches
       167                      163        time.user_time
     18320            6%      19506 ±  8%  proc-vmstat.nr_slab_unreclaimable
      1518 ±  7%               1558 ± 10%  proc-vmstat.numa_hint_faults
      1387 ±  8%               1421 ±  9%  proc-vmstat.numa_hint_faults_local
      1873 ±  5%               1917 ±  8%  proc-vmstat.numa_pte_updates
     19987                    20005        proc-vmstat.nr_anon_pages
      8464                     8471        proc-vmstat.nr_kernel_stack
    309815                   310062        proc-vmstat.nr_file_pages
     50828                    50828        proc-vmstat.nr_free_cma
  16065590                 16064831        proc-vmstat.nr_free_pages
   3194669                  3194517        proc-vmstat.nr_dirty_threshold
   1595384                  1595308        proc-vmstat.nr_dirty_background_threshold
    798886                   797937        proc-vmstat.pgfault
      6510                     6499        proc-vmstat.nr_mapped
    659089                   657491        proc-vmstat.numa_local
    665458                   663786        proc-vmstat.numa_hit
      1037                     1033        proc-vmstat.nr_page_table_pages
    669923                   665906        proc-vmstat.pgfree
    676982                   672385        proc-vmstat.pgalloc_normal
      6368                     6294        proc-vmstat.numa_other
     13013           -7%      12152 ± 11%  proc-vmstat.nr_slab_reclaimable


Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-27 Thread Christoph Hellwig
On Tue, Jun 26, 2018 at 02:03:38PM +0800, Ye Xiaolong wrote:
> Hi,
> 
> On 06/22, Christoph Hellwig wrote:
> >Hi Xiaolong,
> >
> >can you retest this workload on the following branch:
> >
> >git://git.infradead.org/users/hch/vfs.git remove-get-poll-head
> >
> >Gitweb:
> >
> >
> > http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/remove-get-poll-head
> 
> Here is the comparison for commit 3deb642f0d and commit 8fbedc1 ("fs: replace 
> f_ops->get_poll_head with a static ->f_poll_head pointer") in 
> the remove-get-poll-head branch.

Especially the boot time ones and others look like they have additional
changes.

Can you compare the baseline of my tree, which is
894b8c00 ("Merge tag 'for_v4.18-rc2' of
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs") against 8fbedc1
(("fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer") ?



Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-26 Thread Ye Xiaolong
Hi,

On 06/22, Christoph Hellwig wrote:
>Hi Xiaolong,
>
>can you retest this workload on the following branch:
>
>git://git.infradead.org/users/hch/vfs.git remove-get-poll-head
>
>Gitweb:
>
>
> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/remove-get-poll-head

Here is the comparison for commit 3deb642f0d and commit 8fbedc1 ("fs: replace 
f_ops->get_poll_head with a static ->f_poll_head pointer") in 
the remove-get-poll-head branch.

3deb642f0de4c14f  8fbedc19c94fd25a2b9b327015
----------------  --------------------------
     %stddev      change      %stddev
         \           |            \
    457120          -7%      424608        will-it-scale.per_process_ops
    238978                   233862        will-it-scale.per_thread_ops
      1755 ± 13%      8%       1899 ± 18%  will-it-scale.time.voluntary_context_switches
      2332                     2342        will-it-scale.time.system_time
       310                      310        will-it-scale.time.elapsed_time
       310                      310        will-it-scale.time.elapsed_time.max
      4096                     4096        will-it-scale.time.page_size
      0.54                     0.54        will-it-scale.scalability
       807                      806        will-it-scale.time.percent_of_cpu_this_job_got
     17218                    17179        will-it-scale.time.minor_page_faults
      9931                     9862        will-it-scale.time.maximum_resident_set_size
       173          -6%         163        will-it-scale.time.user_time
  49024375          -6%    46155690        will-it-scale.workload
     17818 ± 10%   -19%       14397 ±  4%  will-it-scale.time.involuntary_context_switches
    116842 ± 12%    -4%      112098 ±  5%  interrupts.CAL:Function_call_interrupts
     32735                    32635        vmstat.system.in
      2112 ±  7%   -13%        1845 ±  3%  vmstat.system.cs
       150                      150        turbostat.PkgWatt
       123                      122        turbostat.CorWatt
      1573                     1573        turbostat.Avg_MHz
     15.73          13%       17.77 ± 19%  boot-time.kernel_boot
     15.07          12%       16.93 ± 21%  boot-time.dhcp
       771           8%         834 ± 12%  boot-time.idle
     25.69           8%       27.69 ± 12%  boot-time.boot
      1755 ± 13%      8%       1899 ± 18%  time.voluntary_context_switches
      2332                     2342        time.system_time
       310                      310        time.elapsed_time
       310                      310        time.elapsed_time.max
      4096                     4096        time.page_size
       807                      806        time.percent_of_cpu_this_job_got
     17218                    17179        time.minor_page_faults
      9931                     9862        time.maximum_resident_set_size
       173          -6%         163        time.user_time
     17818 ± 10%   -19%       14397 ±  4%  time.involuntary_context_switches
    428813 ±  9%    57%      672385        proc-vmstat.pgalloc_normal
     41736 ± 15%    22%       50828        proc-vmstat.nr_free_cma
     18116           8%       19506 ±  8%  proc-vmstat.nr_slab_unreclaimable
      1029                     1033        proc-vmstat.nr_page_table_pages
      8453                     8471        proc-vmstat.nr_kernel_stack
      6486                     6499        proc-vmstat.nr_mapped
   3193607                  3194517        proc-vmstat.nr_dirty_threshold
   1594853                  1595308        proc-vmstat.nr_dirty_background_threshold
  16061877                 16064831        proc-vmstat.nr_free_pages
     20009                    20005        proc-vmstat.nr_anon_pages
      6303                     6294        proc-vmstat.numa_other
    799772                   797937        proc-vmstat.pgfault
    667803                   665906        proc-vmstat.pgfree
    666440                   663786        proc-vmstat.numa_hit
    660136                   657491        proc-vmstat.numa_local
    313125                   310062        proc-vmstat.nr_file_pages
      1941 ±  5%               1917 ±  8%  proc-vmstat.numa_pte_updates
      1448 ±  7%               1421 ±  9%  proc-vmstat.numa_hint_faults_local
      1596 ±  6%               1558 ± 10%  proc-vmstat.numa_hint_faults
     12893          -6%       12152 ± 11%  proc-vmstat.nr_slab_reclaimable
     22885        -100%           0        proc-vmstat.nr_indirectly_reclaimable
    245443 ± 16% -100%            0        proc-vmstat.pgalloc_movable
  19861107 ± 14%    34%    26619357 ± 35%  perf-stat.node-load-misses
  51734389 ±  5%    22%    63014695 ± 25%  perf-stat.node-loads
 1.924e+09 ±  3%    21%    2.32e+09 ±  5%  perf-stat.iTLB-load-misses
 2.342e+09 ±  8%


Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-23 Thread Christoph Hellwig
On Fri, Jun 22, 2018 at 09:02:55PM +0100, Al Viro wrote:
> > While at the same time the core poll code already checks net_busy_loop_on
> > to set POLL_BUSY_LOOP.  So except for sockets where people set the
> > timeout to 0 the code already does the right thing as-is.  IMHO not
> > really worth wasting a FMODE_* flag for it, but if you insist I'll add
> > it.
> 
> It's not just that - there's also an issue of extra indirect call on the
> fast path for sockets.  You get this method of yours + ->poll_mask(),
> which hits another indirect to per-family ->poll_mask().  It might be
> better to have these combined, sparing us an extra indirect call.
> 
> Just give it the same calling conventions as ->poll_mask() have...

The problem is that for the busy poll we want the actual busy poll +
__pollwait + ->poll_mask.  Which is going to make that new poll_busy_loop
with a return value look exactly like ->poll.

So for now I'm tempted to just do this:

---
From 4abf23f6565ff2a74f1859758f9c894abe476a00 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Sat, 23 Jun 2018 09:02:59 +0200
Subject: FOLD: remove ->poll_busy_loop again

Busy looping always comes in from poll(2) or select(2).  So instead of
adding a separate method we can just do it at the beginning of ->poll
for now.

Signed-off-by: Christoph Hellwig 
---
 fs/select.c        |  8 --------
 include/linux/fs.h |  1 -
 net/socket.c       | 20 ++++++--------------
 3 files changed, 6 insertions(+), 23 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 25327efca2f9..c68f7cdc777a 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -38,14 +38,6 @@ __poll_t vfs_poll(struct file *file, struct poll_table_struct *pt)
 {
 	unsigned int events = poll_requested_events(pt);
 
-	/*
-	 * XXX: might be worth adding a f_mode flag to see if busy looping is
-	 * supported.  Although callers probably only keep setting it when
-	 * supported, that's why POLL_BUSY_LOOP is reported in the output.
-	 */
-	if ((events & POLL_BUSY_LOOP) && file->f_op->poll_busy_loop)
-		file->f_op->poll_busy_loop(file, events);
-
 	if (file->f_op->poll) {
 		return file->f_op->poll(file, pt);
 	} else if (file->f_poll_head) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 82133bd1a047..bfaebdc03878 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1722,7 +1722,6 @@ struct file_operations {
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
 	__poll_t (*poll_mask) (struct file *, __poll_t);
-	void (*poll_busy_loop)(struct file *file, __poll_t events);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
diff --git a/net/socket.c b/net/socket.c
index b52e5b900e09..0aaa49190b30 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -131,19 +131,6 @@ static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
 				struct pipe_inode_info *pipe, size_t len,
 				unsigned int flags);
 
-#ifdef CONFIG_NET_RX_BUSY_POLL
-static void sock_poll_busy_loop(struct file *file, __poll_t events)
-{
-	struct socket *sock = file->private_data;
-
-	/* once, only if requested by syscall */
-	if (sk_can_busy_loop(sock->sk))
-		sk_busy_loop(sock->sk, 1);
-}
-#else
-#define sock_poll_busy_loop	NULL
-#endif
-
 /*
  * Socket files have a set of 'special' operations as well as the generic
  * file ones. These don't appear in the operation structures but are done
  * directly via the socketcall() multiplexor.
  */
@@ -155,7 +142,6 @@ static const struct file_operations socket_file_ops = {
 	.read_iter =	sock_read_iter,
 	.write_iter =	sock_write_iter,
 	.poll_mask =	sock_poll_mask,
-	.poll_busy_loop = sock_poll_busy_loop,
 	.poll =		sock_poll,
 	.unlocked_ioctl = sock_ioctl,
 #ifdef CONFIG_COMPAT
@@ -1163,6 +1149,12 @@ static __poll_t sock_poll(struct file *file, poll_table *wait)
 	struct socket *sock = file->private_data;
 	__poll_t events = poll_requested_events(wait), mask = 0;
 
+	/*
+	 * Poll once, if requested by syscall.
+	 */
+	if ((events & POLL_BUSY_LOOP) && sk_can_busy_loop(sock->sk))
+		sk_busy_loop(sock->sk, 1);
+
 	if (sock->ops->poll) {
 		mask = sock->ops->poll(file, sock, wait);
 	} else if (sock->ops->poll_mask) {
-- 
2.17.1




Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-22 Thread Al Viro
On Fri, Jun 22, 2018 at 06:18:02PM +0200, Christoph Hellwig wrote:
> On Fri, Jun 22, 2018 at 05:28:50PM +0200, Christoph Hellwig wrote:
> > On Fri, Jun 22, 2018 at 04:14:09PM +0100, Al Viro wrote:
> > > > 
> > > > http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/remove-get-poll-head
> > > 
> > > See objections upthread re "fs,net: move poll busy loop handling into a
> > > separate method"; as for the next one... I'd like an ACK from networking
> > > folks.  The rest of queue makes sense.
> > 
> > I want to see basic results first before micro-optimizing.  After that
> > I'll send it out to the net folks for feedback.
> 
> I looked into this a bit, in the end sk_can_busy_loop does this:
> 
>   return sk->sk_ll_usec && !signal_pending(current);
> 
> where sk_ll_usec defaults based on a sysctl that needs to be
> turned on, but can be overridden per socket.
> 
> While at the same time the core poll code already checks net_busy_loop_on
> to set POLL_BUSY_LOOP.  So except for sockets where people set the
> timeout to 0 the code already does the right thing as-is.  IMHO not
> really worth wasting a FMODE_* flag for it, but if you insist I'll add
> it.

It's not just that - there's also an issue of extra indirect call on the
fast path for sockets.  You get this method of yours + ->poll_mask(),
which hits another indirect to per-family ->poll_mask().  It might be
better to have these combined, sparing us an extra indirect call.

Just give it the same calling conventions as ->poll_mask() have...
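
For illustration, a minimal sketch of what such a combined method could
look like - hypothetical, not taken from any patch in this thread - with
the busy loop folded into the file-level ->poll_mask() so it shares its
calling conventions instead of costing a separate indirect method call:

	/* sketch only: busy loop folded into the file-level ->poll_mask() */
	static __poll_t sock_poll_mask(struct file *file, __poll_t events)
	{
		struct socket *sock = file->private_data;

		/* busy-wait once, only if the syscall requested it */
		if ((events & POLL_BUSY_LOOP) && sk_can_busy_loop(sock->sk))
			sk_busy_loop(sock->sk, 1);

		/* one remaining indirect call to the per-family method */
		return sock->ops->poll_mask(sock, events);
	}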



Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-22 Thread Sean Paul
On Fri, Jun 22, 2018 at 7:01 AM Al Viro  wrote:
>
> On Fri, Jun 22, 2018 at 12:00:14PM +0200, Christoph Hellwig wrote:
> > And a version with select() also covered:
>
> For fuck sake, if you want vfs_poll() inlined, *make* *it* *inlined*.
> Is there any reason for not doing that other than EXPORT_SYMBOL_GPL
> fetish?  Because if there isn't, I would like to draw your attention
> to the fact that _this_ pwecious inchewlekshul pwopewty can be trivially
> open-coded by out-of-tree shite even if it happens to be non-GPL one.
>

Was this suggestion so bad that you have to insult not only the
author, but also people with speech impediments?

Sean


> >   mask = vfs_poll(f.file, wait);
> > + if (f.file->f_op->poll) {
>
> ... not to mention that here you forgot to remove the call itself while
> expanding it.
>
> Said that, you are not attacking the worst part of it - it's a static
> branch, not the considerably more costly indirect ones.  Remember when
> I asked you about the price of those?  Method calls are costly.
>
> Another problem with ->get_poll_head() calling conventions is
> that originally you wanted to return ERR_PTR(-mask) as a way to report
> not needing to call ->poll_mask(); that got shot down since quite
> a few of those don't fit into 12 bits that ERR_PTR() gives us.
>
> IIRC, the real reason for non-constant ->get_poll_head() was the sockets,
> with
>
> static struct wait_queue_head *sock_get_poll_head(struct file *file,
> __poll_t events)
> {
> struct socket *sock = file->private_data;
>
> if (!sock->ops->poll_mask)
> return NULL;
> sock_poll_busy_loop(sock, events);
> return sk_sleep(sock->sk);
> }
>
> The first part isn't a problem (it is constant).  The second is
> static inline void sock_poll_busy_loop(struct socket *sock, __poll_t events)
> {
> if (sk_can_busy_loop(sock->sk) &&
> events && (events & POLL_BUSY_LOOP)) {
> /* once, only if requested by syscall */
> sk_busy_loop(sock->sk, 1);
> }
> }
>
> and the third -
>
> static inline wait_queue_head_t *sk_sleep(struct sock *sk)
> {
> BUILD_BUG_ON(offsetof(struct socket_wq, wait) != 0);
> return &rcu_dereference_raw(sk->sk_wq)->wait;
> }
>
> Now, ->sk_wq is modified only in sock_init_data() and sock_graft();
> the latter, IIRC, is ->accept() helper.  Do we ever call either of
> those on a sock of already opened file?  IOW, is there any real
> reason for socket ->get_poll_head() not to be constant, other
> than wanting to keep POLL_BUSY_LOOP handling out of ->poll_mask()?
> I agree that POLL_BUSY_LOOP is ugly as hell, but you *still* have
> sock_poll_mask() not free from it...



Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-22 Thread Christoph Hellwig
On Fri, Jun 22, 2018 at 05:28:50PM +0200, Christoph Hellwig wrote:
> On Fri, Jun 22, 2018 at 04:14:09PM +0100, Al Viro wrote:
> > > 
> > > http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/remove-get-poll-head
> > 
> > See objections upthread re "fs,net: move poll busy loop handling into a
> > separate method"; as for the next one... I'd like an ACK from networking
> > folks.  The rest of queue makes sense.
> 
> I want to see basic results first before micro-optimizing.  After that
> I'll send it out to the net folks for feedback.

I looked into this a bit, in the end sk_can_busy_loop does this:

return sk->sk_ll_usec && !signal_pending(current);

where sk_ll_usec defaults based on a sysctl that needs to be
turned on, but can be overridden per socket.

While at the same time the core poll code already checks net_busy_loop_on
to set POLL_BUSY_LOOP.  So except for sockets where people set the
timeout to 0 the code already does the right thing as-is.  IMHO not
really worth wasting a FMODE_* flag for it, but if you insist I'll add
it.
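
For context, the per-socket override mentioned above is the SO_BUSY_POLL
socket option.  A minimal userspace sketch, assuming a kernel built with
busy polling; the 50 microsecond value is arbitrary:

	#include <sys/socket.h>
	#include <stdio.h>

	/* Request ~50us of busy polling on one socket.  This sets
	 * sk->sk_ll_usec, the field sk_can_busy_loop() tests; raising
	 * the value typically requires CAP_NET_ADMIN. */
	static int enable_busy_poll(int fd)
	{
		unsigned int usec = 50;

		if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
			       &usec, sizeof(usec)) < 0) {
			perror("setsockopt(SO_BUSY_POLL)");
			return -1;
		}
		return 0;
	}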



Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-22 Thread Christoph Hellwig
On Fri, Jun 22, 2018 at 04:14:09PM +0100, Al Viro wrote:
> > 
> > http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/remove-get-poll-head
> 
> See objections upthread re "fs,net: move poll busy loop handling into a
> separate method"; as for the next one... I'd like an ACK from networking
> folks.  The rest of queue makes sense.

I want to see basic results first before micro-optimizing.  After that
I'll send it out to the net folks for feedback.



Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-22 Thread Al Viro
On Fri, Jun 22, 2018 at 05:02:51PM +0200, Christoph Hellwig wrote:
> Hi Xiaolong,
> 
> can you retest this workload on the following branch:
> 
> git://git.infradead.org/users/hch/vfs.git remove-get-poll-head
> 
> Gitweb:
> 
> 
> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/remove-get-poll-head

See objections upthread re "fs,net: move poll busy loop handling into a
separate method"; as for the next one... I'd like an ACK from networking
folks.  The rest of queue makes sense.



Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-22 Thread Christoph Hellwig
Hi Xiaolong,

can you retest this workload on the following branch:

git://git.infradead.org/users/hch/vfs.git remove-get-poll-head

Gitweb:


http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/remove-get-poll-head



Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-22 Thread Al Viro
On Fri, Jun 22, 2018 at 02:33:07PM +0200, Christoph Hellwig wrote:
> On Fri, Jun 22, 2018 at 01:17:22PM +0100, Al Viro wrote:
> > > The problem is that call to sk_busy_loop(), which is going to be indirect
> > > no matter what.
> > 
> > 	if ->f_poll_head is NULL {
> > 		use ->poll
> > 	} else {
> > 		if can ll_poll (checked in ->f_mode)
> > 			call ->ll_poll(), if it returns what we want - we are done
> > 		add to ->f_poll_head
> > 		call ->poll_mask()
> 
> What I have for now is slightly different:
> 
> 	if ((events & POLL_BUSY_LOOP) && file->f_op->poll_busy_loop)
> 		file->f_op->poll_busy_loop(file, events);
> 
> 	if (file->f_op->poll) {
> 		return file->f_op->poll(file, pt);
> 	} else if (file_has_poll_mask(file)) {
> 		...
> 	}
> 
> The "return whatever we want" part is something I want to look into
> once the basics are done, as it is probably not entirely trivial due to
> the structure of polling in the low-level network protocol.

First of all, you'll get the same ->f_op for *all* sockets.  So you'll be
hitting that path regardless of sk_can_busy_loop(sock->sk).  What's more,
that way you get (on fast path) even more indirect calls, AFAICS.

And I don't see any point in separate file_has_poll_mask() - just check
->f_poll_head and that's it.



Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-22 Thread Christoph Hellwig
On Fri, Jun 22, 2018 at 01:17:22PM +0100, Al Viro wrote:
> > The problem is that call to sk_busy_loop(), which is going to be indirect
> > no matter what.
> 
> 	if ->f_poll_head is NULL {
> 		use ->poll
> 	} else {
> 		if can ll_poll (checked in ->f_mode)
> 			call ->ll_poll(), if it returns what we want - we are done
> 		add to ->f_poll_head
> 		call ->poll_mask()

What I have for now is slightly different:

	if ((events & POLL_BUSY_LOOP) && file->f_op->poll_busy_loop)
		file->f_op->poll_busy_loop(file, events);

	if (file->f_op->poll) {
		return file->f_op->poll(file, pt);
	} else if (file_has_poll_mask(file)) {
		...
	}

The "return whatever we want" part is something I want to look into
once the basics are done, as it is probably not entirely trivial due to
the structure of polling in the low-level network protocol.



Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-22 Thread Al Viro
On Fri, Jun 22, 2018 at 02:07:39PM +0200, Christoph Hellwig wrote:
> On Fri, Jun 22, 2018 at 12:56:13PM +0100, Al Viro wrote:
> > So mark that in ->f_mode - I strongly suspect that
> > sk_can_busy_loop(sock->sk) can't change while an opened file is there.
> > And lift that (conditional on new FMODE_BUSY_LOOP) into do_poll()
> > and do_select() - we *already* have bits and pieces of that logic in
> > there and that way they'd at least be gathered in one place.
> 
> The problem is that call to sk_busy_loop(), which is going to be indirect
> no matter what.

	if ->f_poll_head is NULL {
		use ->poll
	} else {
		if can ll_poll (checked in ->f_mode)
			call ->ll_poll(), if it returns what we want - we are done
		add to ->f_poll_head
		call ->poll_mask()
	}
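
A rough C rendering of that dispatch, as a sketch only - ->ll_poll and
FMODE_BUSY_LOOP are names from this discussion, not merged interfaces:

	static __poll_t vfs_poll(struct file *file, struct poll_table_struct *pt)
	{
		__poll_t events = poll_requested_events(pt);
		__poll_t mask;

		if (!file->f_poll_head)
			return file->f_op->poll(file, pt);

		if (file->f_mode & FMODE_BUSY_LOOP) {
			/* low-latency poll first; done if it already
			 * reports one of the requested events */
			mask = file->f_op->ll_poll(file, events);
			if (mask & events)
				return mask;
		}

		poll_wait(file, file->f_poll_head, pt);
		return file->f_op->poll_mask(file, events);
	}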



Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-22 Thread Christoph Hellwig
On Fri, Jun 22, 2018 at 12:56:13PM +0100, Al Viro wrote:
>   So mark that in ->f_mode - I strongly suspect that
> sk_can_busy_loop(sock->sk) can't change while an opened file is there.
> And lift that (conditional on new FMODE_BUSY_LOOP) into do_poll()
> and do_select() - we *already* have bits and pieces of that logic in
> there and that way they'd at least be gathered in one place.

The problem is that call to sk_busy_loop(), which is going to be indirect
no matter what.

> 
>   Then replace ->get_poll_head() with file->f_poll_head and
> see what it gives.

Working on it right now.  Works so far except for the busy loop case.
I'm looking into a separate method just for that as a first idea.



Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-22 Thread Al Viro
On Fri, Jun 22, 2018 at 01:53:00PM +0200, Christoph Hellwig wrote:

> > Now, ->sk_wq is modified only in sock_init_data() and sock_graft();
> > the latter, IIRC, is ->accept() helper.  Do we ever call either of
> > those on a sock of already opened file?  IOW, is there any real
> > reason for socket ->get_poll_head() not to be constant, other
> > than wanting to keep POLL_BUSY_LOOP handling out of ->poll_mask()?
> > I agree that POLL_BUSY_LOOP is ugly as hell, but you *still* have
> > sock_poll_mask() not free from it...
> 
> I'd have to defer to networking folks if busy looping after pollwait
> is what they want, but I suspect the answer is no; by the time
> we are already waiting on the queue, busy waiting seems pointless.

So mark that in ->f_mode - I strongly suspect that
sk_can_busy_loop(sock->sk) can't change while an opened file is there.
And lift that (conditional on new FMODE_BUSY_LOOP) into do_poll()
and do_select() - we *already* have bits and pieces of that logic in
there and that way they'd at least be gathered in one place.

Then replace ->get_poll_head() with file->f_poll_head and
see what it gives.
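
A minimal sketch of that gating, assuming a hypothetical FMODE_BUSY_LOOP
flag set once when the socket file is created or accepted; the busy wait
itself would still sit behind an indirect call, which is Christoph's
objection in his reply:

	/* cheap per-file test usable directly in do_poll()/do_select() */
	static inline bool file_wants_busy_poll(struct file *file, __poll_t events)
	{
		return (events & POLL_BUSY_LOOP) &&
			(file->f_mode & FMODE_BUSY_LOOP);
	}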



Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-22 Thread Christoph Hellwig
On Fri, Jun 22, 2018 at 12:01:17PM +0100, Al Viro wrote:
> For fuck sake, if you want vfs_poll() inlined, *make* *it* *inlined*.

That is not going to help with de-virtualizing _qproc, which was
the whole idea of that change.  At least not without a compiler
way smarter than gcc.

But if you want it inline that is fine with me, it just seems a little
large for inlining.

Note that I plan to actually remove all calls to vfs_poll except for poll
and select in a pending series, at which point it would become static
anyway.
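
For reference, a header-inlined form - essentially what vfs_poll() later
became in include/linux/poll.h after the ->get_poll_head()/->poll_mask()
experiment was dropped:

	static inline __poll_t vfs_poll(struct file *file,
					struct poll_table_struct *pt)
	{
		if (unlikely(!file->f_op->poll))
			return DEFAULT_POLLMASK;
		return file->f_op->poll(file, pt);
	}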

> Said that, you are not attacking the worst part of it - it's a static
> branch, not the considerably more costly indirect ones.  Remember when
> I asked you about the price of those?  Method calls are costly.

And back then it did not show up even in poll-heavy workloads.  But
since then something new happened - Spectre mitigations, which make
indirect calls exorbitantly more expensive.

> Now, ->sk_wq is modified only in sock_init_data() and sock_graft();
> the latter, IIRC, is ->accept() helper.  Do we ever call either of
> those on a sock of already opened file?  IOW, is there any real
> reason for socket ->get_poll_head() not to be constant, other
> than wanting to keep POLL_BUSY_LOOP handling out of ->poll_mask()?
> I agree that POLL_BUSY_LOOP is ugly as hell, but you *still* have
> sock_poll_mask() not free from it...

I'd have to defer to the networking folks on whether busy looping after
poll_wait is what they want, but I suspect the answer is no - by the time
we are already waiting on the queue, busy waiting seems pointless.


Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-22 Thread Al Viro
On Fri, Jun 22, 2018 at 12:00:14PM +0200, Christoph Hellwig wrote:
> And a version with select() also covered:

For fuck sake, if you want vfs_poll() inlined, *make* *it* *inlined*.
Is there any reason for not doing that other than EXPORT_SYMBOL_GPL
fetish?  Because if there isn't, I would like to draw your attention
to the fact that _this_ pwecious inchewlekshul pwopewty can be trivially
open-coded by out-of-tree shite even if it happens to be non-GPL one.

>   mask = vfs_poll(f.file, wait);
> + if (f.file->f_op->poll) {

... not to mention that here you forgot to remove the call itself while
expanding it.

Said that, you are not attacking the worst part of it - it's a static
branch, not the considerably more costly indirect ones.  Remember when
I asked you about the price of those?  Method calls are costly.

Another problem with the ->get_poll_head() calling conventions is
that originally you wanted to return ERR_PTR(-mask) as a way to report
not needing to call ->poll_mask(); that got shot down since quite
a few of those don't fit into the 12 bits that ERR_PTR() gives us.
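
The 12-bit limit comes from IS_ERR() only treating the last MAX_ERRNO
(4095) values of the address space as error pointers, while several
__poll_t bits already sit above bit 12, for instance:

/* from the uapi headers: this bit alone needs bit 13 */
#define EPOLLRDHUP	(__force __poll_t)0x00002000
/* while ERR_PTR() can only round-trip values up to MAX_ERRNO == 4095 */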

IIRC, the real reason for non-constant ->get_poll_head() was the sockets,
with

static struct wait_queue_head *sock_get_poll_head(struct file *file,
		__poll_t events)
{
	struct socket *sock = file->private_data;

	if (!sock->ops->poll_mask)
		return NULL;
	sock_poll_busy_loop(sock, events);
	return sk_sleep(sock->sk);
}

The first part isn't a problem (it is constant).  The second is

static inline void sock_poll_busy_loop(struct socket *sock, __poll_t events)
{
	if (sk_can_busy_loop(sock->sk) &&
	    events && (events & POLL_BUSY_LOOP)) {
		/* once, only if requested by syscall */
		sk_busy_loop(sock->sk, 1);
	}
}

and the third -

static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
	BUILD_BUG_ON(offsetof(struct socket_wq, wait) != 0);
	return &rcu_dereference_raw(sk->sk_wq)->wait;
}

Now, ->sk_wq is modified only in sock_init_data() and sock_graft();
the latter, IIRC, is ->accept() helper.  Do we ever call either of
those on a sock of already opened file?  IOW, is there any real
reason for socket ->get_poll_head() not to be constant, other
than wanting to keep POLL_BUSY_LOOP handling out of ->poll_mask()?
I agree that POLL_BUSY_LOOP is ugly as hell, but you *still* have
sock_poll_mask() not free from it...
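
If sk_sleep() really is stable for the lifetime of an opened file, the
busy-loop side effect is the only thing keeping ->get_poll_head() from
being constant; stripped of it, the helper would collapse to something
like this sketch:

static struct wait_queue_head *sock_get_poll_head(struct file *file,
		__poll_t events)
{
	struct socket *sock = file->private_data;

	if (!sock->ops->poll_mask)
		return NULL;
	return sk_sleep(sock->sk);	/* constant while the file stays open */
}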


Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-22 Thread Linus Torvalds
On Fri, Jun 22, 2018 at 7:02 PM Linus Torvalds
 wrote:
>
> Get your act together. Don't uglify and slow down everything else just
> because you're concentrating only on aio.

.. and seriously, poll and select are timing-critical. There are many
real loads where they show up as *the* thing in kernel profiles.

aio is a distant distant second cousin. You need to make sure poll and
select are prioritized, and aio should be a "if we can do this
cleanly" case.

Linus


Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-22 Thread Linus Torvalds
On Fri, Jun 22, 2018 at 6:46 PM Christoph Hellwig  wrote:
>
> > The disadvantages are obvious: every poll event now causes *two*
> > indirect branches to the low-level filesystem or driver - one to get
> > the poll head, and one to get the mask. Add to that all the new "do we
> > have the new-style or old sane poll interface" tests, and poll is
> > obviously more complicated.
>
> It already caused two

No it didn't. If the data was ready, all that got short-circuited, and
we just had that ->poll() call.

Only if you *waited* did you get the second one to check the result,
and the whole poll_wait(). You could just say "I already have the
data, there's no wait-queue to add, I'll just return immediately".
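
For reference, the classic single-method pattern being defended here both
registers and reports in one call; the driver names below are illustrative:

static __poll_t foo_poll(struct file *file, poll_table *wait)
{
	struct foo_dev *dev = file->private_data;	/* illustrative */
	__poll_t mask = 0;

	/* poll_wait() is a no-op when ->_qproc is NULL, i.e. when the
	 * caller already knows it is not going to sleep */
	poll_wait(file, &dev->waitq, wait);
	if (foo_data_ready(dev))			/* illustrative */
		mask |= EPOLLIN | EPOLLRDNORM;
	return mask;
}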

You're making this all unconditionally pessimal.

Admit it. The new interface is inferior.

> In the meantime below is an ugly patch that removes the _qproc
> indirect for ->poll only (similar patch is possible for select
> assuming the code uses select).  And for next merge window I plan
> to kill it off entirely.

You're just making the code even worse here.

Seriously, this patch is beyond ugly. The new interface is nasty shit.

Get your act together. Don't uglify and slow down everything else just
because you're concentrating only on aio.

Your reply so far just makes me more convinced this was not thought
through properly.

Linus


Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-22 Thread Christoph Hellwig
And a version with select() also covered:

---
From 317159003ae28113cf759c632b161fb39192fe3c Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Fri, 22 Jun 2018 11:36:26 +0200
Subject: fs: optimize away ->_qproc indirection for poll_mask based polling

Signed-off-by: Christoph Hellwig 
---
 fs/select.c | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/fs/select.c b/fs/select.c
index bc3cc0f98896..2c9d81892509 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -528,6 +528,24 @@ static int do_select(int n, fd_set_bits *fds, struct timespec64 *end_time)
 					wait_key_set(wait, in, out, bit,
 						     busy_flag);
 					mask = vfs_poll(f.file, wait);
+					if (f.file->f_op->poll) {
+						mask = f.file->f_op->poll(f.file, wait);
+					} else if (file_has_poll_mask(f.file)) {
+						struct wait_queue_head *head;
+
+						head = f.file->f_op->get_poll_head(f.file, wait->_key);
+						if (!head) {
+							mask = DEFAULT_POLLMASK;
+						} else if (IS_ERR(head)) {
+							mask = EPOLLERR;
+						} else {
+							if (wait->_qproc)
+								__pollwait(f.file, head, wait);
+							mask = f.file->f_op->poll_mask(f.file, wait->_key);
+						}
+					} else {
+						mask = DEFAULT_POLLMASK;
+					}
 
 					fdput(f);
 					if ((mask & POLLIN_SET) && (in & bit)) {
@@ -845,7 +863,25 @@ static inline __poll_t do_pollfd(struct pollfd *pollfd, poll_table *pwait,
 	/* userland u16 ->events contains POLL... bitmap */
 	filter = demangle_poll(pollfd->events) | EPOLLERR | EPOLLHUP;
 	pwait->_key = filter | busy_flag;
-	mask = vfs_poll(f.file, pwait);
+	if (f.file->f_op->poll) {
+		mask = f.file->f_op->poll(f.file, pwait);
+	} else if (file_has_poll_mask(f.file)) {
+		struct wait_queue_head *head;
+
+		head = f.file->f_op->get_poll_head(f.file, pwait->_key);
+		if (!head) {
+			mask = DEFAULT_POLLMASK;
+		} else if (IS_ERR(head)) {
+			mask = EPOLLERR;
+		} else {
+			if (pwait->_qproc)
+				__pollwait(f.file, head, pwait);
+			mask = f.file->f_op->poll_mask(f.file, pwait->_key);
+		}
+	} else {
+		mask = DEFAULT_POLLMASK;
+	}
+
 	if (mask & busy_flag)
 		*can_busy_poll = true;
 	mask &= filter;		/* Mask out unneeded events. */
-- 
2.17.1



Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-22 Thread Christoph Hellwig
On Fri, Jun 22, 2018 at 06:25:45PM +0900, Linus Torvalds wrote:
> What was the alleged advantage of the new poll methods again? Because
> it sure isn't obvious - not from the numbers, and not from the commit
> messages.

The primary goal is that we can implement a race-free aio poll,
the primary benefit is that we can get rid of the currently racy
and bug prone way we do in-kernel poll-like calls for things like
eventfd.  The first clearly is in 4.18-rc and provides massive
performance advantages if used; the second is not there yet,
more on that below.

> I was assuming there  was a good reason for it, but looking closer I
> see absolutely nothing but negatives. The argument that keyed wake-ups
> somehow make multiple wake-queues irrelevant doesn't hold water when
> the code is more complex and apparently slower. It's not like anybody
> ever *had* to use multiple wait-queues, but the old code was both
> simpler and cleaner and *allowed* you to use multiple queues if you
> wanted to.

It wasn't cleaner at all for anything that isn't poll or select, and even
for those it isn't exactly clean - see the whole mess around ->qproc.
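
The ->qproc in question is the callback that poll_wait() bounces through;
its definition is essentially:

static inline void poll_wait(struct file *filp,
		wait_queue_head_t *wait_address, poll_table *p)
{
	if (p && p->_qproc && wait_address)
		p->_qproc(filp, wait_address, p);	/* the third indirect call */
}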

> The disadvantages are obvious: every poll event now causes *two*
> indirect branches to the low-level filesystem or driver - one to get
> the poll head, and one to get the mask. Add to that all the new "do we
> have the new-style or old sane poll interface" tests, and poll is
> obviously more complicated.

It already caused two, and now we have three thanks to ->qproc.  One
of the advantages of the new code is that we can eventually get rid
of ->qproc once all users of a non-default qproc are switched away
from vfs_poll.  Which requires a little more work, but I have the
patches for that to be posted soon.

> If we could get the poll head by just having a direct pointer in the
> 'struct file', maybe that would be one thing. As it is, this all
> literally just adds overhead for no obvious reason. It replaced one
> simple direct call with two dependent but separate ones.

People are doing weird things with their poll heads, so we can't do
that unconditionally.  We could however offer a waitqueue pointer
in struct file and most users would be very happy with that.
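
Such a pointer would allow a fast path along these lines; f_poll_head is a
hypothetical field and the fallback keeps today's behaviour:

static inline __poll_t vfs_poll_fast(struct file *file, poll_table *pt)
{
	__poll_t events = poll_requested_events(pt);

	if (!file->f_poll_head)			/* hypothetical field */
		return vfs_poll(file, pt);	/* slow path as today */
	if (pt && pt->_qproc)
		pt->_qproc(file, file->f_poll_head, pt);
	return file->f_op->poll_mask(file, events);
}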

In the meantime below is an ugly patch that removes the _qproc
indirect for ->poll only (similar patch is possible for select
assuming the code uses select).  And for next merge window I plan
to kill it off entirely.

How can we get this thrown into the will-it-scale run?

---
From 50ca47fdcfec0a1af56aac6db8a168bb678308a5 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Fri, 22 Jun 2018 11:36:26 +0200
Subject: fs: optimize away ->_qproc indirection for poll_mask based polling

Signed-off-by: Christoph Hellwig 
---
 fs/select.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/select.c b/fs/select.c
index bc3cc0f98896..54406e0ad23e 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -845,7 +845,25 @@ static inline __poll_t do_pollfd(struct pollfd *pollfd, poll_table *pwait,
 	/* userland u16 ->events contains POLL... bitmap */
 	filter = demangle_poll(pollfd->events) | EPOLLERR | EPOLLHUP;
 	pwait->_key = filter | busy_flag;
-	mask = vfs_poll(f.file, pwait);
+	if (f.file->f_op->poll) {
+		mask = f.file->f_op->poll(f.file, pwait);
+	} else if (file_has_poll_mask(f.file)) {
+		struct wait_queue_head *head;
+
+		head = f.file->f_op->get_poll_head(f.file, pwait->_key);
+		if (!head) {
+			mask = DEFAULT_POLLMASK;
+		} else if (IS_ERR(head)) {
+			mask = EPOLLERR;
+		} else {
+			if (pwait->_qproc)
+				__pollwait(f.file, head, pwait);
+			mask = f.file->f_op->poll_mask(f.file, pwait->_key);
+		}
+	} else {
+		mask = DEFAULT_POLLMASK;
+	}
+
 	if (mask & busy_flag)
 		*can_busy_poll = true;
 	mask &= filter;		/* Mask out unneeded events. */
-- 
2.17.1



Re: [lkp-robot] [fs] 3deb642f0d: will-it-scale.per_process_ops -8.8% regression

2018-06-22 Thread Linus Torvalds
[ Added Al, since this all came in through his trees. The guilty
authors were already added by the robot ]

On Fri, Jun 22, 2018 at 5:31 PM kernel test robot  wrote:
>
> FYI, we noticed a -8.8% regression of will-it-scale.per_process_ops due to 
> commit:

Guys, this seems pretty big.

What was the alleged advantage of the new poll methods again? Because
it sure isn't obvious - not from the numbers, and not from the commit
messages.

The code seems to be garbage. It replaces our nice generic "you can do
anything you damn well like in your poll function" with two limited
fixed-function "just give me the poll head and the mask".

I was assuming there was a good reason for it, but looking closer I
see absolutely nothing but negatives. The argument that keyed wake-ups
somehow make multiple wake-queues irrelevant doesn't hold water when
the code is more complex and apparently slower. It's not like anybody
ever *had* to use multiple wait-queues, but the old code was both
simpler and cleaner and *allowed* you to use multiple queues if you
wanted to.

So the old code is simpler, cleaner, and more flexible. And according
to the test robot, it also performs better.

So I'm inclined to just revert the whole mess unless I get some
serious explanations for what the supposed advantages are.

The disadvantages are obvious: every poll event now causes *two*
indirect branches to the low-level filesystem or driver - one to get
the poll head, and one to get the mask. Add to that all the new "do we
have the new-style or old sane poll interface" tests, and poll is
obviously more complicated.

Quite frankly, when I look at it, I just go "that's _stupid_". I'm
entirely missing the point of the conversion, and it's not explained
in the messages either.

If we could get the poll head by just having a direct pointer in the
'struct file', maybe that would be one thing. As it is, this all
literally just adds overhead for no obvious reason. It replaced one
simple direct call with two dependent but separate ones.

Linus

