Re: i386 4/4 change

2018-04-09 Thread Konstantin Belousov
On Mon, Apr 09, 2018 at 08:22:13AM -0400, Yoshihiro Ota wrote:
> What is the current status of this?
> 
> Based on SVN history, it doesn't look like https://reviews.freebsd.org/D14633
> has been merged/committed yet.
I fixed bugs reported by Bruce.

Right now the patch is waiting for some other testing to finish, before
the final retest.

> 
> I can try after I recover from disk crashes.
> I expect I need a few more days to restore.
> 
> Will this retire the PAE option?
The patch is orthogonal to PAE.
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: i386 4/4 change

2018-04-09 Thread Yoshihiro Ota
What is the current status of this?

Based on SVN history, it doesn't look like https://reviews.freebsd.org/D14633
has been merged/committed yet.

I can try after I recover from disk crashes.
I expect I need a few more days to restore.

Will this retire the PAE option?

Thanks,
Hiro

On Sun, 1 Apr 2018 17:05:03 +1000 (EST)
Bruce Evans  wrote:

> 
> On Sun, 1 Apr 2018, Dimitry Andric wrote:
> 
> > On 31 Mar 2018, at 17:57, Bruce Evans  wrote:
> >>
> >> On Sat, 31 Mar 2018, Konstantin Belousov wrote:
> >>
> >>> the change to provide full 4G of address space for both kernel and
> >>> user on i386 is ready to land.  The motivation for the work was both to
> >>> mitigate Meltdown on i386 and to give more breathing space to the still
> >>> used 32-bit architecture.  The patch was tested by Peter Holm, and I am
> >>> satisfied with the code.
> >>>
> >>> If you use i386 with HEAD, I recommend that you apply the patch from
> >>> https://reviews.freebsd.org/D14633
> >>> and report any regressions before the commit, not after.  Unless
> >>> a significant issue is reported, I plan to commit the change sometime
> >>> on Wed/Thu next week.
> >>>
> >>> Also I welcome patch comments and reviews.
> >>
> >> It crashes at boot time in getmemsize() unless booted with loader which
> >> I don't want to use.
> 
> > For me, it at least compiles and boots OK, but I'm one of those crazy
> > people who use the default boot loader. ;)
> 
> I found a quick fix and sent it to kib.  (2 crashes in vm86 code for memory
> sizing.  This is not called if loader is used && the system has smap.  Old
> systems don't have smap, so they crash even if loader is used.)
> 
> > I haven't yet run any performance tests, I'll try building world and a
> > few large ports tomorrow.  General operation from the command line does
> > not feel "sluggish" in any way, however.
> 
> Further performance tests:
> - reading /dev/zero using tinygrams is 6 times slower
> - read/write of a pipe using tinygrams is 25 times slower.  It also gives
>   unexpected values in wait statuses on exit, hopefully just because a
>   bug in the test program is exposed by the changed timing (but later
>   it also gave SIGBUS errors).  This does a context switch or 2 for every
>   read/write.  It now runs 7 times slower using 2 4 GHz CPUs than in
>   FreeBSD-5 using 1 2.0 GHz CPU.  The faster CPUs and 2 of them used to
>   make it run 4 times faster.  It shows another slowdown since FreeBSD-5,
>   and much larger slowdowns since FreeBSD-1:
> 
>   1996 FreeBSD on P1  133MHz:   72k/s
>   1997 FreeBSD on P1  133MHz:   44k/s (after dyson's opts for large sizes)
>   1997 Linux   on P1  133MHz:   93k/s (simpler is faster for small sizes)
>   1999 FreeBSD on K6  266MHz:  129k/s
>   2018 FBSD-~5 on AthXP 2GHz:  696k/s
>   2018 FreeBSD on i7  2x4GHz: 2900k/s
>   2018 FBSD4+4 on i7  2x4GHz:  113k/s (faster than Linux on a P1 133MHz!!)
> 
> Netblast to localhost has much the same 6 times slowness as reading
> /dev/zero using tinygrams.  This is the slowdown for syscalls.
> Tinygrams are hard to avoid for UDP.  Even 1500 bytes is a tinygram
> for /dev/zero.  Without 4+4, localhost is very slow because it does
> a context switch or 2 for every packet (even with 2 CPUs when there is
> no need to switch).  Without 4+4 this used to cost much the same as the
> context switches for the pipe benchmark.  Now it costs relatively much
> less since (for netblast to localhost) all of the context switches are
> between kernel threads.
> 
> The pipe benchmark uses select() to avoid busy-waiting.  That was good
> for UP.  But for SMP with just 2 CPUs, it is better to busy-wait and
> poll in the reader and writer.
> 
> netblast already uses busy-waiting.  It used to be a bug that select()
> doesn't work on sockets, at least for UDP, so blasting using busy-waiting
> is the only possible method (timeouts are usually too coarse-grained to
> go as fast as blasting, and if they are fine-grained enough to go fast
> then they are not much better than busy-waiting with time wasted for
> setting up timeouts).  SMP makes this a feature.  It forces use of busy-
> waiting, which is best if you have a CPU free to run it and this method
> doesn't take too much power.
> 
> Context switches to task queues give similar slowness.  This won't be
> affected by 4+4 since task queues are in the kernel.  I don't like
> networking in userland since it has large syscall and context switch
> costs.  Increasing these by factors of 6 and 25 doesn't help.  It
> can only be better by combining i/o in a way that the kernel neglects
> to do or which is imposed by per-packet APIs.  Slowdown factors of 6
> or 25 require the combined i/o to be 6 or 25 larger to amortise the costs.
> 
> Bruce

Re: i386 4/4 change

2018-04-01 Thread Bruce Evans


On Sun, 1 Apr 2018, Dimitry Andric wrote:


> On 31 Mar 2018, at 17:57, Bruce Evans  wrote:
>>
>> On Sat, 31 Mar 2018, Konstantin Belousov wrote:
>>
>>> the change to provide full 4G of address space for both kernel and
>>> user on i386 is ready to land.  The motivation for the work was both to
>>> mitigate Meltdown on i386 and to give more breathing space to the still
>>> used 32-bit architecture.  The patch was tested by Peter Holm, and I am
>>> satisfied with the code.
>>>
>>> If you use i386 with HEAD, I recommend that you apply the patch from
>>> https://reviews.freebsd.org/D14633
>>> and report any regressions before the commit, not after.  Unless
>>> a significant issue is reported, I plan to commit the change sometime
>>> on Wed/Thu next week.
>>>
>>> Also I welcome patch comments and reviews.
>>
>> It crashes at boot time in getmemsize() unless booted with loader which
>> I don't want to use.

> For me, it at least compiles and boots OK, but I'm one of those crazy
> people who use the default boot loader. ;)

I found a quick fix and sent it to kib.  (2 crashes in vm86 code for memory
sizing.  This is not called if loader is used && the system has smap.  Old
systems don't have smap, so they crash even if loader is used.)

> I haven't yet run any performance tests, I'll try building world and a
> few large ports tomorrow.  General operation from the command line does
> not feel "sluggish" in any way, however.


Further performance tests:
- reading /dev/zero using tinygrams is 6 times slower
- read/write of a pipe using tinygrams is 25 times slower.  It also gives
  unexpected values in wait statuses on exit, hopefully just because a
  bug in the test program is exposed by the changed timing (but later
  it also gave SIGBUS errors).  This does a context switch or 2 for every
  read/write.  It now runs 7 times slower using 2 4 GHz CPUs than in
  FreeBSD-5 using 1 2.0 GHz CPU.  The faster CPUs and 2 of them used to
  make it run 4 times faster.  It shows another slowdown since FreeBSD-5,
  and much larger slowdowns since FreeBSD-1:

  1996 FreeBSD on P1  133MHz:   72k/s
  1997 FreeBSD on P1  133MHz:   44k/s (after dyson's opts for large sizes)
  1997 Linux   on P1  133MHz:   93k/s (simpler is faster for small sizes)
  1999 FreeBSD on K6  266MHz:  129k/s
  2018 FBSD-~5 on AthXP 2GHz:  696k/s
  2018 FreeBSD on i7  2x4GHz: 2900k/s
  2018 FBSD4+4 on i7  2x4GHz:  113k/s (faster than Linux on a P1 133MHz!!)
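
A minimal sketch of the kind of tinygram read test described above, for
anyone who wants to reproduce the comparison on their own machine.  It is
not the test program used for the numbers in this message; the function
name and the batch size of 1000 are assumptions made for illustration, and
interpreter overhead means the absolute rates will be lower than for an
equivalent C program.  Only the ratio between kernels is meaningful.

```python
import os
import time


def tinygram_read_rate(path="/dev/zero", size=1, duration=1.0):
    """Return read() calls per second for `size`-byte reads from `path`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        calls = 0
        start = time.monotonic()
        deadline = start + duration
        while True:
            # Issue a batch of reads between clock checks so that the
            # timing calls do not dominate the measurement.
            for _ in range(1000):
                os.read(fd, size)
            calls += 1000
            now = time.monotonic()
            if now >= deadline:
                return calls / (now - start)
    finally:
        os.close(fd)


if __name__ == "__main__":
    for size in (1, 64, 1500, 65536):
        rate = tinygram_read_rate(size=size, duration=1.0)
        print(f"{size:6d}-byte reads: {rate / 1000:8.0f} k/s")
```

Running it once on a kernel without the patch and once with it gives a
per-size slowdown factor for the syscall path.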

Netblast to localhost has much the same 6 times slowness as reading
/dev/zero using tinygrams.  This is the slowdown for syscalls.
Tinygrams are hard to avoid for UDP.  Even 1500 bytes is a tinygram
for /dev/zero.  Without 4+4, localhost is very slow because it does
a context switch or 2 for every packet (even with 2 CPUs when there is
no need to switch).  Without 4+4 this used to cost much the same as the
context switches for the pipe benchmark.  Now it costs relatively much
less since (for netblast to localhost) all of the context switches are
between kernel threads.

The pipe benchmark uses select() to avoid busy-waiting.  That was good
for UP.  But for SMP with just 2 CPUs, it is better to busy-wait and
poll in the reader and writer.
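
The two reader strategies can be sketched as follows.  This is an
illustration only, not the pipe benchmark itself: the function names are
hypothetical, the data flows in one direction, and it does not force a
context switch per read/write the way the original test does.

```python
import os
import select
import time


def pipe_reader(rfd, total, busy_wait):
    """Read `total` bytes from rfd one byte at a time."""
    os.set_blocking(rfd, False)
    got = 0
    while got < total:
        if not busy_wait:
            # Sleep in the kernel until the pipe becomes readable.
            select.select([rfd], [], [])
        try:
            data = os.read(rfd, 1)
        except BlockingIOError:
            continue  # busy-wait: nothing there yet, poll again
        if not data:
            break  # writer closed its end of the pipe
        got += len(data)
    return got


def run_pipe_benchmark(count, busy_wait):
    """Fork a writer that sends `count` 1-byte tinygrams; return (bytes, seconds)."""
    rfd, wfd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: the writer
        try:
            os.close(rfd)
            for _ in range(count):
                os.write(wfd, b"x")
        finally:
            os._exit(0)
    os.close(wfd)
    start = time.monotonic()
    got = pipe_reader(rfd, count, busy_wait)
    elapsed = time.monotonic() - start
    os.close(rfd)
    os.waitpid(pid, 0)
    return got, elapsed


if __name__ == "__main__":
    for busy in (False, True):
        got, secs = run_pipe_benchmark(100000, busy)
        print(f"busy_wait={busy}: {got / secs / 1000:.0f} k/s")
```

With busy_wait=True the reader burns a CPU but never sleeps in select(),
which is the trade-off described above for SMP.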

netblast already uses busy-waiting.  It used to be a bug that select()
doesn't work on sockets, at least for UDP, so blasting using busy-waiting
is the only possible method (timeouts are usually too coarse-grained to
go as fast as blasting, and if they are fine-grained enough to go fast
then they are not much better than busy-waiting with time wasted for
setting up timeouts).  SMP makes this a feature.  It forces use of busy-
waiting, which is best if you have a CPU free to run it and this method
doesn't take too much power.

Context switches to task queues give similar slowness.  This won't be
affected by 4+4 since task queues are in the kernel.  I don't like
networking in userland since it has large syscall and context switch
costs.  Increasing these by factors of 6 and 25 doesn't help.  It
can only be better by combining i/o in a way that the kernel neglects
to do or which is imposed by per-packet APIs.  Slowdown factors of 6
or 25 require the combined i/o to be 6 or 25 larger to amortise the costs.
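
The amortisation argument can be written out as a small calculation.  This
is a sketch under a simple assumed cost model (a fixed overhead per call
spread over the bytes moved by that call); the function names and the
sample numbers are illustrative, not measurements.

```python
def overhead_per_byte(per_call_overhead, batch_bytes):
    """Fixed per-call cost spread over the bytes moved by one call."""
    return per_call_overhead / batch_bytes


def batch_needed(old_batch_bytes, slowdown):
    """Batch size that keeps the per-byte overhead unchanged when the
    per-call overhead grows by a factor of `slowdown`."""
    return old_batch_bytes * slowdown


if __name__ == "__main__":
    old_overhead = 1.0  # arbitrary units per call
    old_batch = 5       # bytes per call
    for slowdown in (6, 25):
        new_batch = batch_needed(old_batch, slowdown)
        before = overhead_per_byte(old_overhead, old_batch)
        after = overhead_per_byte(old_overhead * slowdown, new_batch)
        print(f"x{slowdown}: batch {old_batch} -> {new_batch} bytes, "
              f"per-byte overhead {before:.3f} -> {after:.3f}")
```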

Bruce


Re: i386 4/4 change

2018-03-31 Thread Konstantin Belousov
On Sun, Apr 01, 2018 at 01:05:57AM +0200, Dimitry Andric wrote:
> I haven't yet run any performance tests, I'll try building world and a
> few large ports tomorrow.  General operation from the command line does
> not feel "sluggish" in any way, however.

I just updated the review with some changes which should have an effect
on copyout performance.



Re: i386 4/4 change

2018-03-31 Thread Dimitry Andric
On 31 Mar 2018, at 17:57, Bruce Evans  wrote:
> 
> On Sat, 31 Mar 2018, Konstantin Belousov wrote:
> 
>> the change to provide full 4G of address space for both kernel and
>> user on i386 is ready to land.  The motivation for the work was both to
>> mitigate Meltdown on i386 and to give more breathing space to the still
>> used 32-bit architecture.  The patch was tested by Peter Holm, and I am
>> satisfied with the code.
>> 
>> If you use i386 with HEAD, I recommend that you apply the patch from
>> https://reviews.freebsd.org/D14633
>> and report any regressions before the commit, not after.  Unless
>> a significant issue is reported, I plan to commit the change sometime
>> on Wed/Thu next week.
>> 
>> Also I welcome patch comments and reviews.
> 
> It crashes at boot time in getmemsize() unless booted with loader which
> I don't want to use.

For me, it at least compiles and boots OK, but I'm one of those crazy
people who use the default boot loader. ;)

I haven't yet run any performance tests, I'll try building world and a
few large ports tomorrow.  General operation from the command line does
not feel "sluggish" in any way, however.

-Dimitry





Re: i386 4/4 change

2018-03-31 Thread Bruce Evans

On Sat, 31 Mar 2018, Konstantin Belousov wrote:


> the change to provide full 4G of address space for both kernel and
> user on i386 is ready to land.  The motivation for the work was both to
> mitigate Meltdown on i386 and to give more breathing space to the still
> used 32-bit architecture.  The patch was tested by Peter Holm, and I am
> satisfied with the code.
>
> If you use i386 with HEAD, I recommend that you apply the patch from
> https://reviews.freebsd.org/D14633
> and report any regressions before the commit, not after.  Unless
> a significant issue is reported, I plan to commit the change sometime
> on Wed/Thu next week.
>
> Also I welcome patch comments and reviews.


It crashes at boot time in getmemsize() unless booted with loader which
I don't want to use.

It is much slower, and I couldn't find an option to turn it off.

For makeworld, the system time is slightly more than doubled, the user
time is increased by 16%, and the real time is increased by 21%.

On amd64, turning off pti and not having ibrs gives almost no increase
in makeworld times relative to old versions, and pti only costs about
5% IIRC.

Makeworld is not very syscall-intensive.  netblast is very syscall-intensive,
and its throughput is down by a factor of 5 (660/136 = 4.9, 1331/242 = 5.5).

netblast 127.0.0.1 5001 5 10 (localhost, port 5001, 5-byte tinygrams for 10 s):
537 kpps sent, 0 kpps dropped # before this patch (CPU use 1.3)
136 kpps sent, 0 kpps dropped # after (CPU use 2.1)

(Pure software overheads.  It uses 1.6 times as much CPU to go 4 times
slower).
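
The arithmetic behind that remark, using only the localhost figures listed
above, can be checked with a short sketch.  The variable names are
illustrative; "CPU use" is taken as printed, in units of CPUs busy.

```python
# Localhost netblast figures from the two runs above.
before_kpps, after_kpps = 537, 136
before_cpu, after_cpu = 1.3, 2.1

throughput_slowdown = before_kpps / after_kpps          # packets/s ratio
cpu_increase = after_cpu / before_cpu                   # CPU use ratio
cpu_per_packet_increase = throughput_slowdown * cpu_increase

if __name__ == "__main__":
    print(f"throughput slowdown:     {throughput_slowdown:.2f}x")
    print(f"CPU use increase:        {cpu_increase:.2f}x")
    print(f"CPU per packet increase: {cpu_per_packet_increase:.2f}x")
```

The first two ratios come out near 4 and 1.6, and their product is the
increase in CPU spent per packet sent.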

netblast 192.168.2.8 (low end PCI33 lem on low latency 1 Gbps LAN)
275 kpps sent, 1045 kpps dropped # before (CPU use 1.3)
245 kpps sent, 0 kpps dropped # after (CPU use 1.3)

(The hardware can't do anywhere near line rate of ~1500 kpps, so this
becomes a benchmark of syscalls and dropping packets.  The change makes
FreeBSD so slow that 8 CPUs at 4.08 GHz can't saturate a low end PCI33 NIC
(the hardware saturates at about 282 kpps for tx and about 400 kpps for
rx)).
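
For reference, the ~1500 kpps line rate follows from standard Ethernet
framing: a 5-byte UDP payload is padded up to the 64-byte minimum frame,
and each frame also occupies 8 bytes of preamble and 12 bytes of
inter-frame gap on the wire.  A small sketch of the calculation (the
function name is illustrative):

```python
def ethernet_line_rate_pps(link_bps=1_000_000_000, frame_bytes=64):
    """Maximum frames per second on an Ethernet link.

    frame_bytes includes the Ethernet header and FCS; the preamble/SFD
    (8 bytes) and the inter-frame gap (12 bytes) are added here.
    """
    wire_bytes = frame_bytes + 8 + 12
    return link_bps / (wire_bytes * 8)


if __name__ == "__main__":
    pps = ethernet_line_rate_pps()
    print(f"1 Gbps, minimum-size frames: {pps / 1000:.0f} kpps")
```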

netblast 192.168.2.8 (low end PCIe em on low latency 1 Gbps LAN)
1316 kpps sent, 3 kpps dropped # before (CPU use 1.6)
243 kpps sent, 0 kpps dropped # after (CPU use 1.2)

This is seriously slower for the most useful case.  It reduces a system
that could almost reach line rate using about 2 of 8 CPUs at 4 GHz to
one that is slower than with 1 CPU at 2 GHz (the latter saturates
in software at about 640 kpps in old versions of FreeBSD and at about
400 kpps in -current).

Initial debugging of the crash: it crashes on the first pmap_kenter()
in getmemsize().  I configure debug.late_console to 0.  That works,
and without it getmemsize() can't even be debugged since it is after
console initialization and ddb entry with -d.

In getmemsize(), of course all the preload calls return 0 and smapbase is
NULL.  Then vm86 bios calls work and give basemem = 0x276.  Then
basemem_setup() is called and it returns. Then pmap_kenter() is called
and it crashes:

Stopped at  getmemsize+0xb3:    pushl   $0x1000
Stopped at  getmemsize+0xb8:    pushl   $0x1000
Stopped at  getmemsize+0xbd:    call    pmap_kenter
Stopped at  pmap_kenter:        pushl   %ebp
Stopped at  pmap_kenter+0x1:    movl    %esp,%ebp
Stopped at  pmap_kenter+0x3:    movl    0x8(%ebp),%eax
Stopped at  pmap_kenter+0x6:    shrl    $0xc,%eax
Stopped at  pmap_kenter+0x9:    movl    0xc(%ebp),%edx
Stopped at  pmap_kenter+0xc:    orl     $0x3,%edx
Stopped at  pmap_kenter+0xf:    movl    %edx,PTmap(,%eax,4)

The last instruction crashes because PTmap is not mapped at this point:

db> p/x $edx
1003
db> p/x PTmap
ff80
db> p/x $eax
1
db> x/x PTmap
PTmap:  KDB: reentering
KDB: stack backtrace:
db_trace_self_wrapper(cec5cb,1420a04,c6de83,1420978,1,...) at db_trace_self_wrapper+0x24/frame 0x142095c
kdb_reenter(1420978,1,ff80003a,1420998,8f1419,...) at kdb_reenter+0x24/frame 0x1420968
trap(1420a10) at trap+0xa0/frame 0x1420a04
calltrap() at calltrap+0x8/frame 0x1420a04
--- trap 0xc, eip = 0xc5c394, esp = 0x1420a50, ebp = 0x1420a88 ---
db_read_bytes(ff81,3,1420aa0) at db_read_bytes+0x29/frame 0x1420a88
db_get_value(ff80,4,0,0,d2d304,...) at db_get_value+0x20/frame 0x1420ab4
db_examine(ff80,1,,1420b00) at db_examine+0x144/frame 0x1420ae4
db_command(cb1d99,1420be4,8f0f01,d1d28a,0,...) at db_command+0x20a/frame 0x1420b90
db_command_loop(d1d28a,0,1420bac,1420b9c,1420be4,...) at db_command_loop+0x55/frame 0x1420b9c
db_trap(a,4ff0,1,1,80046,...) at db_trap+0xe1/frame 0x1420be4
kdb_trap(a,4ff0,1420cc4) at kdb_trap+0xb1/frame 0x1420c10
trap(1420cc4) at trap+0x523/frame 0x1420cb8
calltrap() at calltrap+0x8/frame 0x1420cb8
--- trap 0xa, eip = 0xc65a4a, esp = 0x1420d04, ebp = 0x1420d04 ---
pmap_kenter(1000,1000,1429000,8efe13,0,...) at pmap_kenter+0xf/frame 0x1420d04
getmemsize(1,5a8807ff,ee,59a80097,ee,...) at getmemsize+0xc2/frame 0x1420fc4
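
The faulting store can be modelled in a few lines, as a reading of the
disassembly above rather than of the pmap source.  The sketch assumes 4K
pages, 4-byte page table entries, and that the 0x3 in the orl is the usual
i386 PG_V | PG_RW pair; ptmap_base is a hypothetical parameter standing in
for the address of PTmap.

```python
PAGE_SHIFT = 12   # shrl $0xc,%eax
PG_V = 0x001      # valid
PG_RW = 0x002     # read/write
PTE_SIZE = 4      # scale factor in PTmap(,%eax,4)


def kenter_store(va, pa, ptmap_base):
    """Return (index, pte, address) for the store done by pmap_kenter(va, pa)."""
    index = va >> PAGE_SHIFT                  # page number of va, in %eax
    pte = pa | PG_RW | PG_V                   # orl $0x3,%edx
    address = ptmap_base + index * PTE_SIZE   # movl %edx,PTmap(,%eax,4)
    return index, pte, address


if __name__ == "__main__":
    # Arguments as pushed in getmemsize(): va = pa = 0x1000.
    index, pte, address = kenter_store(0x1000, 0x1000, ptmap_base=0)
    print(f"eax = {index:x}, edx = {pte:x}, offset into PTmap = {address:#x}")
```

For va = pa = 0x1000 this reproduces the register values printed by ddb
(eax = 1, edx = 1003).  The store lands 4 bytes into PTmap, so it faults
whenever PTmap itself is not yet mapped.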

i386 4/4 change

2018-03-31 Thread Konstantin Belousov
Hi,
the change to provide full 4G of address space for both kernel and
user on i386 is ready to land.  The motivation for the work was both to
mitigate Meltdown on i386 and to give more breathing space to the still
used 32-bit architecture.  The patch was tested by Peter Holm, and I am
satisfied with the code.

If you use i386 with HEAD, I recommend that you apply the patch from
https://reviews.freebsd.org/D14633
and report any regressions before the commit, not after.  Unless
a significant issue is reported, I plan to commit the change sometime
on Wed/Thu next week.

Also I welcome patch comments and reviews.