Re: Unusual threading behavior on single processes

2020-03-28 Thread Stefmorino
Thank you, your information was very helpful. I compiled and ran
malloc_duel and it's working as intended. I wasn't aware of the -H flag
for top, and I can see programs are threading as you say, though the
cause of my poor performance is still a mystery.

I took some screen captures so you can see what I'm seeing:
Xonotic:
https://0x0.st/iBCD.png
OpenMW:
https://0x0.st/iBC5.png
Terraria: (fnaify if curious, thanks thfr :>)
https://0x0.st/iMrr.png

In the case of OpenMW, the bottleneck actually seems pretty obvious with
what top -H reports. I don't really know what to say about the other
examples.

I would break out a profiling tool at this stage, but the results of
testing with top -H have left me with no idea where the bottleneck is
(except openmw where it might actually be CPU); digging through systat
hasn't really given me any revelations either. :/

If anyone has a hunch where I should check, or if you need me to test
different software, I'd be more than happy to.

Regards,
Stefmorino


On Sat, Mar 28, 2020, at 09:00:21AM +, Otto Moerbeek wrote:

> On Fri, Mar 27, 2020 at 09:03:40PM +, Stefmorino wrote:

>> I have a question about a performance quirk on OpenBSD, but I'm not really sure
>> how to address it, or what the root cause even is: namely, how multithreaded
>> applications (libpthread?) behave (notably, games).
>>
>> I have tested many applications and the behavior is the same in all of them, but
>> I'll talk about OpenMW (an open-source game engine for Morrowind) since I have
>> the most useful information about how this program is threaded. By default,
>> OpenMW uses 4 threads (cited here:
>> https://openmw.readthedocs.io/en/stable/reference/modding/settings/cells.html),
>> one for main/generic processing, one for graphics, one for audio, and one for
>> preloading terrain. You can see this if you look at the thread usage under top
>> while running the game; however, this is exactly where my question comes into
>> play. Instead of each thread processing the game independently with its own
>> limit, each thread is "capped" to the total limit of one thread (i.e. instead
>> of openmw's process using 100% on each of 4 threads, or 400% CPU in top, the
>> process uses 25% on each of 4 threads, or 100% CPU in top). I tested this using
>> GENERIC instead of GENERIC.MP as well, and got identical performance on the one
>> thread; it's almost like pthreads is acting as a placeholder of sorts and not
>> actually improving performance where it should.
>>
>> Is it a lock (spin is at 0)? A placeholder? A limitation of how Ryzen SMP is
>> implemented?
>
> Hard to tell, no idea what that game engine does.  But this is not a
> general problem; see e.g. the malloc_duel regress test
> (/usr/src/regress/lib/libpthread/malloc_duel). I see > 100% as well
> with other multi-threaded programs.
>
> 32013 otto  600 6020K 1552K onproc/3  - 1:07 228.81% malloc_due
>
> Wild guess: it could be that your program actually does not do real
> threading, but userland threading. Check with top -H whether it really
> creates threads.  You should see multiple threads having the same PID.
> Or all threads are using a resource that cannot be shared.
>
>  -Otto
>>
>> I'd be happy to do any additional testing, I have a fresh -current source
>> tree ready
>>

RE: Unusual threading behavior on single processes

2020-03-28 Thread zeurkous
Haai,

Just to make a more-or-less general point (or two)...

"Otto Moerbeek"  wrote:
> On Fri, Mar 27, 2020 at 09:03:40PM +, Stefmorino wrote:
>
>> I have tested many applications and the behavior is the same in all of them, but
>> I'll talk about OpenMW (an open-source game engine for Morrowind) since I have
>> the most useful information about how this program is threaded. By default,
>> OpenMW uses 4 threads (cited here:
>> https://openmw.readthedocs.io/en/stable/reference/modding/settings/cells.html),
>> one for main/generic processing, one for graphics, one for audio, and one for
>> preloading terrain.
>>[snip]
>>
>> Is it a lock (spin is at 0)? A placeholder? A limitation of how Ryzen SMP is
>> implemented?
>[snip]
>
> Wild guess: it could be that your program actually does not do real
> threading, but userland threading.

"Fibering", in other words.

> Check with top -H whether it really
> creates threads. You should see multiple threads having the same PID.
> Or all threads are using a resource that cannot be shared.

Likely the latter. It's always funny, isn't it... A coder thinks "hey,
I want multi-threading 'cause it's 1337, I'll just neatly run these
subsystems within separate threads and I'm done!".

The fact that such is frequently a naive proposition should be clear
to the more clueful reader. Games tend to be heavy on global state, and
are more likely to benefit from a multi-process model w/ carefully
thought-out boundaries, than from a shared-everything thread model.
While that need not be the case here, mestrongly suspects it is. Take
heed, and measure. Always measure.
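
One cheap way to measure, before reaching for a profiler: compare the
CPU time a process consumes with the wall-clock time that passes. The
ratio is the average number of cores it kept busy. The Python sketch
below is only an illustration (the helper names are invented here, and
the game itself would have to be observed from outside, e.g. with
top -H); time.process_time() counts user plus system CPU time for the
whole calling process.

```python
import time


def effective_parallelism(cpu_seconds, wall_seconds):
    """Average number of cores kept busy: CPU time divided by wall time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return cpu_seconds / wall_seconds


def measure(func):
    """Run func() once and return (cpu_seconds, wall_seconds) for this process."""
    cpu_start = time.process_time()
    wall_start = time.perf_counter()
    func()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return cpu, wall


if __name__ == "__main__":
    # Hypothetical single-threaded workload, used only to exercise the helpers.
    cpu, wall = measure(lambda: sum(i * i for i in range(2_000_000)))
    print(f"cores kept busy: {effective_parallelism(cpu, wall):.2f}")
```

A ratio near 1.0 from a program with several busy-looking threads
suggests they are taking turns; a ratio near the thread count suggests
they really do run in parallel.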

Take care,

 --zeurkous.

> -Otto

-- 
Friggin' Machines!



Re: Unusual threading behavior on single processes

2020-03-28 Thread Otto Moerbeek
On Fri, Mar 27, 2020 at 09:03:40PM +, Stefmorino wrote:

> I have a question about a performance quirk on OpenBSD, but I'm not really sure
> how to address it, or what the root cause even is: namely, how multithreaded
> applications (libpthread?) behave (notably, games).
> 
> I have tested many applications and the behavior is the same in all of them, but
> I'll talk about OpenMW (an open-source game engine for Morrowind) since I have
> the most useful information about how this program is threaded. By default,
> OpenMW uses 4 threads (cited here:
> https://openmw.readthedocs.io/en/stable/reference/modding/settings/cells.html),
> one for main/generic processing, one for graphics, one for audio, and one for
> preloading terrain. You can see this if you look at the thread usage under top
> while running the game; however, this is exactly where my question comes into
> play. Instead of each thread processing the game independently with its own
> limit, each thread is "capped" to the total limit of one thread (i.e. instead
> of openmw's process using 100% on each of 4 threads, or 400% CPU in top, the
> process uses 25% on each of 4 threads, or 100% CPU in top). I tested this using
> GENERIC instead of GENERIC.MP as well, and got identical performance on the one
> thread; it's almost like pthreads is acting as a placeholder of sorts and not
> actually improving performance where it should.
> 
> Is it a lock (spin is at 0)? A placeholder? A limitation of how Ryzen SMP is
> implemented?

Hard to tell, no idea what that game engine does.  But this is not a
general problem; see e.g. the malloc_duel regress test
(/usr/src/regress/lib/libpthread/malloc_duel). I see > 100% as well
with other multi-threaded programs.

32013 otto  600 6020K 1552K onproc/3  - 1:07 228.81% malloc_due

Wild guess: it could be that your program actually does not do real
threading, but userland threading. Check with top -H whether it really
creates threads.  You should see multiple threads having the same PID.
Or all threads are using a resource that cannot be shared.
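
To put a number on that last possibility: if a fraction s of each
thread's work has to go through a resource only one thread can hold at
a time, Amdahl's law bounds the speedup from n threads at
1 / (s + (1 - s) / n). A minimal Python sketch follows; the function
names are made up for illustration, and it assumes waiting threads
sleep rather than spin, so that the CPU figure top shows for the whole
process is roughly the speedup times 100%.

```python
def amdahl_speedup(serial_fraction, threads):
    """Upper bound on speedup when serial_fraction of the work cannot overlap."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)


def expected_process_cpu_percent(serial_fraction, threads):
    """Rough whole-process CPU% in top, assuming blocked threads sleep."""
    return 100.0 * amdahl_speedup(serial_fraction, threads)


if __name__ == "__main__":
    # Hypothetical serial fractions, from fully parallel to fully serialized.
    for s in (0.0, 0.5, 1.0):
        percent = expected_process_cpu_percent(s, 4)
        print(f"serial={s:.1f}: about {percent:.0f}% with 4 threads")
```

With everything serialized (s = 1.0), four threads add up to about 100%
in total, which is the 4 x 25% pattern reported in this thread. That
match is only consistent with a shared resource; it does not prove one.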

-Otto
> 
> I'd be happy to do any additional testing, I have a fresh -current source tree
> ready
> 

Unusual threading behavior on single processes

2020-03-27 Thread Stefmorino
I have a question about a performance quirk on OpenBSD, but I'm not really sure
how to address it, or what the root cause even is: namely, how multithreaded
applications (libpthread?) behave (notably, games).

I have tested many applications and the behavior is the same in all of them, but
I'll talk about OpenMW (an open-source game engine for Morrowind) since I have
the most useful information about how this program is threaded. By default,
OpenMW uses 4 threads (cited here:
https://openmw.readthedocs.io/en/stable/reference/modding/settings/cells.html),
one for main/generic processing, one for graphics, one for audio, and one for
preloading terrain. You can see this if you look at the thread usage under top
while running the game; however, this is exactly where my question comes into
play. Instead of each thread processing the game independently with its own
limit, each thread is "capped" to the total limit of one thread (i.e. instead
of openmw's process using 100% on each of 4 threads, or 400% CPU in top, the
process uses 25% on each of 4 threads, or 100% CPU in top). I tested this using
GENERIC instead of GENERIC.MP as well, and got identical performance on the one
thread; it's almost like pthreads is acting as a placeholder of sorts and not
actually improving performance where it should.
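
The arithmetic in that comparison, spelled out as a small Python sketch
(the function name is hypothetical, and the per-thread figures are the
round numbers from the paragraph above, not measured output):

```python
def summarize_threads(per_thread_cpu_percent):
    """Return (total CPU%, share of the capacity of one full core per thread)."""
    if not per_thread_cpu_percent:
        raise ValueError("need at least one thread")
    total = sum(per_thread_cpu_percent)
    capacity = 100.0 * len(per_thread_cpu_percent)
    return total, total / capacity


if __name__ == "__main__":
    expected = summarize_threads([100.0] * 4)  # four threads, each on its own core
    observed = summarize_threads([25.0] * 4)   # four threads sharing one core's worth
    print("expected:", expected)
    print("observed:", observed)
```

The expected case totals 400% and the observed case 100%, i.e. a quarter
of what four independent threads could use.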

Is it a lock (spin is at 0)? A placeholder? A limitation of how Ryzen SMP is
implemented?

I'd be happy to do any additional testing, I have a fresh -current source tree
ready

dmesg
OpenBSD 6.6-current (GENERIC.MP) #75: Tue Mar 24 12:56:37 MDT 2020
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 16603250688 (15834MB)
avail mem = 16087437312 (15342MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 3.1 @ 0x986ec000 (62 entries)
bios0: vendor LENOVO version "R0UET76W (1.56 )" date 11/05/2019
bios0: LENOVO 20KVCTO1WW
acpi0 at bios0: ACPI 5.0
acpi0: sleep states S0 S3 S4 S5
acpi0: tables DSDT FACP SSDT SSDT CRAT CDIT UEFI MSDM BATB HPET APIC MCFG SBST WSMT IVRS FPDT SSDT SSDT SSDT UEFI SSDT
acpi0: wakeup devices GPP0(S3) GPP1(S3) GPP2(S3) GPP3(S3) GPP4(S3) GPP5(S3) GPP6(S3) GP17(S3) XHC0(S3) XHC1(S3) GP18(S3) LID_(S3) SLPB(S3)
acpitimer0 at acpi0: 3579545 Hz, 32 bits
acpihpet0 at acpi0: 14318180 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx, 1996.61 MHz, 17-11-00
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,MWAIT,SSSE3,FMA3,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,SKINIT,TCE,TOPEXT,CPCTR,DBKP,PCTRL3,MWAITX,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA,IBPB,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu0: 64KB 64b/line 4-way I-cache, 32KB 64b/line 8-way D-cache, 512KB 64b/line 8-way L2 cache, 4MB 64b/line 16-way L3 cache
cpu0: ITLB 64 4KB entries fully associative, 64 4MB entries fully associative
cpu0: DTLB 64 4KB entries fully associative, 64 4MB entries fully associative
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
cpu0: apic clock running at 24MHz
cpu0: mwait min=64, max=64, C-substates=1.1, IBE
cpu1 at mainbus0: apid 1 (application processor)
cpu1: AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx, 1996.23 MHz, 17-11-00
cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,MWAIT,SSSE3,FMA3,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,SKINIT,TCE,TOPEXT,CPCTR,DBKP,PCTRL3,MWAITX,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA,IBPB,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu1: 64KB 64b/line 4-way I-cache, 32KB 64b/line 8-way D-cache, 512KB 64b/line 8-way L2 cache, 4MB 64b/line 16-way L3 cache
cpu1: ITLB 64 4KB entries fully associative, 64 4MB entries fully associative
cpu1: DTLB 64 4KB entries fully associative, 64 4MB entries fully associative
cpu1: smt 1, core 0, package 0
cpu2 at mainbus0: apid 2 (application processor)
cpu2: AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx, 1996.23 MHz, 17-11-00
cpu2: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,MWAIT,SSSE3,FMA3,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,SKINIT,TCE,TOPEXT,CPCTR,DBKP,PCTRL3,MWAITX,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA,IBPB,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu2: 64KB 64b/line 4-way I-cache, 32KB 64b/line 8-way D-cache, 512KB 64b/line 8-way L2 cache, 4MB 64b/line 16-way L3 cache
cpu2: ITLB 64 4KB entries fully associative, 64 4MB entries fully associative
cpu2: DTLB