Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?

2019-09-28 Thread Mark Millard via freebsd-amd64



On 2019-Sep-27, at 15:22, Mark Millard  wrote:

> On 2019-Sep-27, at 13:52, Mark Millard  wrote:
> 
>> On 2019-Sep-27, at 12:24, Mark Johnston  wrote:
>> 
>>> On Thu, Sep 26, 2019 at 08:37:39PM -0700, Mark Millard wrote:
 
 
 On 2019-Sep-26, at 17:05, Mark Millard  wrote:
 
> On 2019-Sep-26, at 13:29, Mark Johnston  wrote:
>> One possibility is that these are kernel memory allocations occurring in
>> the context of the benchmark threads.  Such allocations may not respect
>> the configured policy since they are not private to the allocating
>> thread.  For instance, upon opening a file, the kernel may allocate a
>> vnode structure for that file.  That vnode may be accessed by threads
>> from many processes over its lifetime, and may be recycled many times
>> before its memory is released back to the allocator.
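
To watch such shared kernel allocations, the per-domain one-liner quoted
in full later in the thread can be pointed at a metadata-heavy command.
The tree walk here is an arbitrary stand-in chosen for illustration, not
something from the thread:

# dtrace -n 'fbt::vm_page_alloc_domain_after:entry
    /progenyof($target)/ { @[args[2]] = count(); }' \
    -c "cpuset -n prefer:1 find /usr/src -type f"

If vnodes and similar shared structures ignore the prefer:1 policy, the
output would show counts for domain 0 as well as domain 1.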
> 
> For -l0-15 -n prefer:1 :
> 
> Looks like this reports sys_thr_new activity, sys_cpuset
> activity, and 0x80bc09bd activity (whatever that
> is). Mostly sys_thr_new activity, over 1300 of them . . .
> 
> dtrace: pid 13553 has exited
> 
> 
> kernel`uma_small_alloc+0x61
> kernel`keg_alloc_slab+0x10b
> kernel`zone_import+0x1d2
> kernel`uma_zalloc_arg+0x62b
> kernel`thread_init+0x22
> kernel`keg_alloc_slab+0x259
> kernel`zone_import+0x1d2
> kernel`uma_zalloc_arg+0x62b
> kernel`thread_alloc+0x23
> kernel`thread_create+0x13a
> kernel`sys_thr_new+0xd2
> kernel`amd64_syscall+0x3ae
> kernel`0x811b7600
>   2
> 
> kernel`uma_small_alloc+0x61
> kernel`keg_alloc_slab+0x10b
> kernel`zone_import+0x1d2
> kernel`uma_zalloc_arg+0x62b
> kernel`cpuset_setproc+0x65
> kernel`sys_cpuset+0x123
> kernel`amd64_syscall+0x3ae
> kernel`0x811b7600
>   2
> 
> kernel`uma_small_alloc+0x61
> kernel`keg_alloc_slab+0x10b
> kernel`zone_import+0x1d2
> kernel`uma_zalloc_arg+0x62b
> kernel`uma_zfree_arg+0x36a
> kernel`thread_reap+0x106
> kernel`thread_alloc+0xf
> kernel`thread_create+0x13a
> kernel`sys_thr_new+0xd2
> kernel`amd64_syscall+0x3ae
> kernel`0x811b7600
>   6
> 
> kernel`uma_small_alloc+0x61
> kernel`keg_alloc_slab+0x10b
> kernel`zone_import+0x1d2
> kernel`uma_zalloc_arg+0x62b
> kernel`uma_zfree_arg+0x36a
> kernel`vm_map_process_deferred+0x8c
> kernel`vm_map_remove+0x11d
> kernel`vmspace_exit+0xd3
> kernel`exit1+0x5a9
> kernel`0x80bc09bd
> kernel`amd64_syscall+0x3ae
> kernel`0x811b7600
>   6
> 
> kernel`uma_small_alloc+0x61
> kernel`keg_alloc_slab+0x10b
> kernel`zone_import+0x1d2
> kernel`uma_zalloc_arg+0x62b
> kernel`thread_alloc+0x23
> kernel`thread_create+0x13a
> kernel`sys_thr_new+0xd2
> kernel`amd64_syscall+0x3ae
> kernel`0x811b7600
>  22
> 
> kernel`vm_page_grab_pages+0x1b4
> kernel`vm_thread_stack_create+0xc0
> kernel`kstack_import+0x52
> kernel`uma_zalloc_arg+0x62b
> kernel`vm_thread_new+0x4d
> kernel`thread_alloc+0x31
> kernel`thread_create+0x13a
> kernel`sys_thr_new+0xd2
> kernel`amd64_syscall+0x3ae
> kernel`0x811b7600
>1324
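
The archive does not show the script that produced the per-stack counts
above; a D script of roughly this shape is an assumed reconstruction,
with the predicate guessing at a filter on the non-preferred domain and
./bench standing in for the benchmark binary:

/* alloc_stacks.d -- assumed reconstruction, not shown in the thread */
fbt::vm_page_alloc_domain_after:entry
/progenyof($target) && args[2] == 0/  /* pages requested from domain 0 */
{
        @[stack()] = count();
}

# dtrace -s alloc_stacks.d -c "cpuset -l 0-15 -n prefer:1 ./bench"

With fbt, stack() reports the probe function's callers, which matches
the uma_small_alloc and vm_page_grab_pages top frames shown above.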
 
With sys_thr_new not respecting -n prefer:1 for
-l0-15 (especially for the thread stacks), I
looked at the generated integration-kernel code,
and it makes significant use of %rsp-based
memory accesses (reads and writes).

That would get both memory controllers going in
parallel (the kernel-vector accesses go to the
preferred memory domain), so things would not slow
down as expected.

If round-robin is not respected for thread stacks,
and if threads at times migrate between CPUs in
different memory domains, there could be
considerable variability for that context as well.
(This may not be the only source of different/extra
variability for this context.)

Overall: I'd be surprised if this were not
contributing to what I thought was odd about
the benchmark results.

Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?

2019-09-27 Thread Mark Millard via freebsd-amd64



On 2019-Sep-27, at 13:52, Mark Millard  wrote:

> On 2019-Sep-27, at 12:24, Mark Johnston  wrote:
> 
>> . . . (much omitted material) . . .

Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?

2019-09-27 Thread Mark Millard via freebsd-amd64



On 2019-Sep-27, at 12:24, Mark Johnston  wrote:

> . . . (much omitted material) . . .
> 
> Your tracing refers to kernel thread stacks though, not the stacks used
> by threads when executing in user mode.  My understanding is that a HINT
> implementation would spend virtually all of its time in user mode, so it
> shouldn't matter much or at all if kernel thread stacks are backed by
> memory from the "wrong" domain.

Looks 

Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?

2019-09-27 Thread Mark Johnston
On Thu, Sep 26, 2019 at 08:37:39PM -0700, Mark Millard wrote:
> 
> . . . (much omitted material) . . .

Your tracing refers to kernel thread stacks though, not the stacks used
by threads when executing in user mode.  My understanding is that a HINT
implementation would spend virtually all of its time in user mode, so it
shouldn't matter much or at all if kernel thread stacks are backed by
memory from the "wrong" domain.

This also doesn't really 
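
A variant of the one-liner shown later in the thread can tie each kernel
page allocation to the user code path that triggered it, which makes it
easier to see whether the benchmark's own memory, rather than kernel
bookkeeping, lands in the wrong domain. This is a sketch, not something
from the thread, and ./bench stands in for the benchmark binary:

# dtrace -n '
    fbt::vm_page_alloc_domain_after:entry
    /progenyof($target)/
    {
            /* key on the requested domain and the user stack (sketch) */
            @[args[2], ustack(5)] = count();
    }' -c "cpuset -l 0-15 -n prefer:1 ./bench"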

Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?

2019-09-26 Thread Mark Millard via freebsd-amd64



On 2019-Sep-26, at 17:05, Mark Millard  wrote:

> On 2019-Sep-26, at 13:29, Mark Johnston  wrote:
> 
>> On Wed, Sep 25, 2019 at 10:03:14PM -0700, Mark Millard wrote:
>>> 
>>> 
>>> On 2019-Sep-25, at 20:27, Mark Millard  wrote:
>>> 
 On 2019-Sep-25, at 19:26, Mark Millard  wrote:
 
> On 2019-Sep-25, at 10:02, Mark Johnston  wrote:
> 
>> On Mon, Sep 23, 2019 at 01:28:15PM -0700, Mark Millard via freebsd-amd64 
>> wrote:
>>> Note: I have access to only one FreeBSD amd64 context, and
>>> it is also my only access to a NUMA context: 2 memory
>>> domains. A Threadripper 1950X context. Also: I have only
>>> a head FreeBSD context on any architecture, not 12.x or
>>> before. So I have limited compare/contrast material.
>>> 
>>> I present the below basically to ask if the NUMA handling
>>> has been validated, or if it is going to be, at least for
>>> contexts that might apply to ThreadRipper 1950X and
>>> analogous contexts. My results suggest it has not been (or
>>> libc++'s now() times get messed up such that it looks like
>>> NUMA mishandling, since this is based on odd benchmark
>>> results that involve the mean time for laps, using a median
>>> of such across multiple trials).
>>> 
>>> I ran a benchmark on both Fedora 30 and FreeBSD 13 on this
>>> 1950X and got expected results on Fedora but odd ones on
>>> FreeBSD. The benchmark is a variation on the old HINT
>>> benchmark, spanning the old multi-threading variation. I
>>> later tried Fedora because the FreeBSD results looked odd.
>>> The other architectures I tried FreeBSD benchmarking with
>>> did not look odd like this. (powerpc64 on an old PowerMac with
>>> 2 sockets and 2 cores per socket, aarch64 Cortex-A57 Overdrive
>>> 1000, Cortex-A53 Pine64+ 2GB, armv7 Cortex-A7 Orange Pi+ 2nd
>>> Ed. For these I used 4 threads, not more.)
>>> 
>>> I tend to write in terms of plots made from the data instead
>>> of the raw benchmark data.
>>> 
>>> FreeBSD testing based on:
>>> cpuset -l0-15  -n prefer:1
>>> cpuset -l16-31 -n prefer:1
>>> 
>>> Fedora 30 testing based on:
>>> numactl --preferred 1 --cpunodebind 0
>>> numactl --preferred 1 --cpunodebind 1
>>> 
>>> While I have more results, I reference primarily the cases of
>>> DSIZE and ISIZE both being unsigned long long, and of both being
>>> unsigned long, as examples. Variations in results are not
>>> from the type differences for any LP64 architectures.
>>> (But they give an idea of benchmark variability in the
>>> test context.)
>>> 
>>> The Fedora results solidly show the bandwidth limitation
>>> of using one memory controller. They also show the latency
>>> consequences for the remote memory domain case vs. the
>>> local memory domain case. There is not a lot of
>>> variability between the examples of the 2 type-pairs used
>>> for Fedora.
>>> 
>>> Not true for FreeBSD on the 1950X:
>>> 
>>> A) The latency-constrained part of the graph looks to
>>> normally be using the local memory domain when
>>> -l0-15 is in use for 8 threads.
>>> 
>>> B) Both the -l0-15 and the -l16-31 parts of the
>>> graph for 8 threads that should be bandwidth
>>> limited show mostly examples that would have to
>>> involve both memory controllers for the bandwidth
>>> to get the results shown as far as I can tell.
>>> There is also wide variability ranging between the
>>> expected 1 controller result and, say, what a 2
>>> controller round-robin would be expected to produce.
>>> 
>>> C) Even the single threaded result shows a higher
>>> result for larger total bytes for the kernel
>>> vectors. Fedora does not.
>>> 
>>> I think that (B) is the most solid evidence for
>>> something being odd.
>> 
>> The implication seems to be that your benchmark program is using pages
>> from both domains despite a policy which preferentially allocates pages
>> from domain 1, so you would first want to determine if this is actually
>> what's happening.  As far as I know we currently don't have a good way
>> of characterizing per-domain memory usage within a process.
>> 
>> If your benchmark uses a large fraction of the system's memory, you
>> could use the vm.phys_free sysctl to get a sense of how much memory from
>> each domain is free.
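
A minimal way to use that check, assuming a run lasts long enough to
sample it mid-run (the file names here are arbitrary): snapshot the
per-domain free counts before starting the benchmark and again while it
runs, then compare; the domain whose free counts drop is the one
actually supplying pages.

# sysctl vm.phys_free > phys_free.before
# sysctl vm.phys_free > phys_free.during
# diff phys_free.before phys_free.during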
> 
> The ThreadRipper 1950X has 96 GiBytes of ECC RAM, so 48 GiBytes per memory
> domain. I've never configured the benchmark such that it even reaches
> 10 GiBytes on this hardware. (It stops for a time constraint first,
> based on the values in use for the "adjustable" items.)
> 
> . . . (much omitted material) . . .
 
> 
>> Another possibility is to use DTrace to trace the
>> requested domain in vm_page_alloc_domain_after().  For example, the
>> following 

Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?

2019-09-26 Thread Mark Johnston
On Wed, Sep 25, 2019 at 10:03:14PM -0700, Mark Millard wrote:
> 
> . . . (much omitted material) . . .

Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?

2019-09-25 Thread Mark Millard via freebsd-amd64



On 2019-Sep-25, at 20:27, Mark Millard  wrote:

> On 2019-Sep-25, at 19:26, Mark Millard  wrote:
> 
>> . . . (much omitted material) . . .

Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?

2019-09-25 Thread Mark Millard via freebsd-amd64



On 2019-Sep-25, at 19:26, Mark Millard  wrote:

> On 2019-Sep-25, at 10:02, Mark Johnston  wrote:
> 
>> . . . (much omitted material) . . .
> 
> I'll think about this, 

Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?

2019-09-25 Thread Mark Millard via freebsd-amd64



On 2019-Sep-25, at 10:02, Mark Johnston  wrote:

> On Mon, Sep 23, 2019 at 01:28:15PM -0700, Mark Millard via freebsd-amd64 
> wrote:
> 
>> . . . (much omitted material) . . .
> 
> The implication seems to be that your benchmark program is using pages
> from both domains despite a policy which preferentially allocates pages
> from domain 1, so you would first want to determine if this is actually
> what's happening.  As far as I know we currently don't have a good way
> of characterizing per-domain memory usage within a process.
> 
> If your benchmark uses a large fraction of the system's memory, you
> could use the vm.phys_free sysctl to get a sense of how much memory from
> each domain is free.

The ThreadRipper 1950X has 96 GiBytes of ECC RAM, so 48 GiBytes per memory
domain. I've never configured the benchmark such that it even reaches
10 GiBytes on this hardware. (It stops for a time constraint first,
based on the values in use for the "adjustable" items.)

The benchmark runs the Hierarchical INTegration kernel for a sequence
of larger and larger numbers of cells in the grid that it uses. Each
size is run in isolation before the next is tried, and each gets its
own timings. Each size gets its own kernel-vector allocations (and
deallocations), with the trials, and the laps within a trial, reusing
the same memory. Each lap in each trial gets its own thread creations
(and completions). The main thread combines the results when there are
multiple threads involved. (So I'm not sure of the main thread's
behavior relative to the cpuset commands.)

Thus, there are lots of thread creations overall, as well as
lots of allocations of vectors for use in the integration
kernel code.

What it looks like to me is that std::async's internal thread
creations are not respecting the cpuset command settings: in a
sense, the cpuset information is not being inherited correctly (or
it is being ignored).
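
That hypothesis looks testable with the one-liner quoted below, pointed
at the benchmark instead of ls (./bench is a stand-in name for the
benchmark binary): if the std::async threads inherited the prefer:1
policy, essentially all counts should land in domain 1, while a large
domain-0 count under -l0-15 would support the theory.

# dtrace -n 'fbt::vm_page_alloc_domain_after:entry
    /progenyof($target)/ { @[args[2]] = count(); }' \
    -c "cpuset -l 0-15 -n prefer:1 ./bench"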

For reference, 

Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?

2019-09-25 Thread Mark Johnston
On Mon, Sep 23, 2019 at 01:28:15PM -0700, Mark Millard via freebsd-amd64 wrote:
> . . . (much omitted material) . . .

The implication seems to be that your benchmark program is using pages
from both domains despite a policy which preferentially allocates pages
from domain 1, so you would first want to determine if this is actually
what's happening.  As far as I know we currently don't have a good way
of characterizing per-domain memory usage within a process.

If your benchmark uses a large fraction of the system's memory, you
could use the vm.phys_free sysctl to get a sense of how much memory from
each domain is free.  Another possibility is to use DTrace to trace the
requested domain in vm_page_alloc_domain_after().  For example, the
following DTrace one-liner counts the number of pages allocated per
domain by ls(1):

# dtrace -n 'fbt::vm_page_alloc_domain_after:entry 
/progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n rr ls"
...
0   71
1   72
# dtrace -n 'fbt::vm_page_alloc_domain_after:entry 
/progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:1 ls"
...
1  143
# dtrace -n 'fbt::vm_page_alloc_domain_after:entry 
/progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:0 ls"
...
0  143

This approach might not work for various reasons depending on how
exactly your benchmark program works.
___
freebsd-amd64@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-amd64
To unsubscribe, send any mail to "freebsd-amd64-unsubscr...@freebsd.org"