[PATCH] Implement switch to separate SYSCALL call stack

2018-02-04 Thread Waldemar Kozaczuk
This patch implements separate syscall call stack needed
when runtimes like Golang use SYSCALL instruction to make 
system call.

More specifically each application thread pre-alllocates 
"tiny" (1024 bytes deep) syscall call stack. When SYSCALL 
instruction is called the call stack is switched to the 
tiny stack and the "large" stack is allocated if has not 
been alllocated yet (first call on given thread). In the end 
SYSCALL ends up being exexuted on the "large" stack.

This patch is based on the original patch provided by @Hawxchen. 

Fixes #808

Signed-off-by: Waldemar Kozaczuk 
---
 arch/x64/arch-switch.hh | 61 +
 arch/x64/arch-tls.hh|  9 ++
 arch/x64/entry.S| 81 +
 include/osv/sched.hh|  7 +
 linux.cc|  5 +++
 5 files changed, 137 insertions(+), 26 deletions(-)

diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh
index d1a039a..f291a26 100644
--- a/arch/x64/arch-switch.hh
+++ b/arch/x64/arch-switch.hh
@@ -11,6 +11,7 @@
 #include "msr.hh"
 #include 
 #include 
+#include 
 
 extern "C" {
 void thread_main(void);
@@ -114,6 +115,7 @@ void thread::init_stack()
 _state.rip = reinterpret_cast(thread_main);
 _state.rsp = stacktop;
 _state.exception_stack = _arch.exception_stack + 
sizeof(_arch.exception_stack);
+debug("***-> %d, Stack top: %#x\n", this->_id, stacktop);
 }
 
 void thread::setup_tcb()
@@ -146,6 +148,50 @@ void thread::setup_tcb()
 _tcb = static_cast(p + tls.size + user_tls_size);
 _tcb->self = _tcb;
 _tcb->tls_base = p + user_tls_size;
+
+if(is_app()) {
+//
+// Allocate tiny syscall call stack
+auto& tiny_stack = _attr._tiny_syscall_stack;
+
+if(!tiny_stack.begin) {
+tiny_stack.size = 1024; //512 seems to be too small sometimes
+tiny_stack.begin = malloc(tiny_stack.size);
+tiny_stack.deleter = tiny_stack.default_deleter;
+}
+
+_tcb->tiny_syscall_stack_addr = tiny_stack.begin + tiny_stack.size;
+_tcb->large_syscall_stack_addr = nullptr;
+debug("***-> %d, Tiny syscall stack top: %#x with size: %d bytes\n", 
this->_id,
+  _tcb->tiny_syscall_stack_addr, tiny_stack.size);
+}
+}
+
+void thread::setup_large_syscall_stack()
+{
+auto& large_stack = _attr._large_syscall_stack;
+if(large_stack.begin) {
+return;
+}
+
+large_stack.size = 16 * PAGE_SIZE;
+
+/*
+ * Mmapping (without mprotect()) seems to work but occasionally leads to 
crashes (
+large_stack.begin = mmu::map_anon(nullptr, large_stack.size, 
mmu::mmap_populate, mmu::perm_rw);
+//mmu::mprotect(large_stack.begin, PAGE_SIZE, 0); //THIS DOES NOT WORK at 
all -> results in crash
+large_stack.deleter = free_large_syscall_stack;
+*/
+
+large_stack.begin = malloc(large_stack.size);
+large_stack.deleter = large_stack.default_deleter;
+
+_tcb->large_syscall_stack_addr = large_stack.begin + large_stack.size;
+}
+
+void thread::free_large_syscall_stack(sched::thread::stack_info si)
+{
+mmu::munmap(si.begin, si.size);
 }
 
 void thread::free_tcb()
@@ -156,6 +202,21 @@ void thread::free_tcb()
 } else {
 free(_tcb->tls_base);
 }
+
+if(is_app()) {
+auto& tiny_stack = _attr._tiny_syscall_stack;
+
+assert(tiny_stack.begin);
+
+if(tiny_stack.deleter) {
+tiny_stack.deleter(tiny_stack);
+}
+
+auto& large_stack = _attr._large_syscall_stack;
+if(large_stack.begin && large_stack.deleter) {
+large_stack.deleter(large_stack);
+}
+}
 }
 
 void thread_main_c(thread* t)
diff --git a/arch/x64/arch-tls.hh b/arch/x64/arch-tls.hh
index 1bf86fd..64a1788 100644
--- a/arch/x64/arch-tls.hh
+++ b/arch/x64/arch-tls.hh
@@ -8,9 +8,18 @@
 #ifndef ARCH_TLS_HH
 #define ARCH_TLS_HH
 
+// Don't change the declaration sequence of all existing members'.
+// Please add new members from the last.
 struct thread_control_block {
 thread_control_block* self;
 void* tls_base;
+// These fields, a per-thread stack for SYSCALL instruction,
+// are used in arch/x64/entry.S for %fs's offset.
+// We currently keep these fields in the TCB to make it easier to access 
in assembly code
+// through a known offset into %fs. But with more effort, we could have 
used an
+// ordinary thread-local variable instead and avoided extending the TCB 
here.
+void* tiny_syscall_stack_addr;
+void* large_syscall_stack_addr;
 };
 
 #endif /* ARCH_TLS_HH */
diff --git a/arch/x64/entry.S b/arch/x64/entry.S
index 04d809d..87f37b8 100644
--- a/arch/x64/entry.S
+++ b/arch/x64/entry.S
@@ -170,25 +170,57 @@ syscall_entry:
 .cfi_register rflags, r11 # r11 took previous rflags value
 # There is no ring transition and rflags are left unchanged.
 
-# Skip the "red zone" allowed by the AMD64 

Re: [PATCH] Implement switch to separate SYSCALL call stack

2018-02-04 Thread Waldek Kozaczuk
I am expecting this patch will need some tweaking but I think it 
fundamentally implements SYSCALL call stack switch. I have tested in using 
my golang working branch.

First of all I am assembly novice so I am not sure if I have not messed 
some stuff in entry.S. For example I am not sure I have not messed the CFI 
- on other hand given by previous GDB research it might be mute point given 
gdb cannot handle backtracing non-contiguous stacks.

Secondly I am not sure if tiny stack is enough (1024). I am afraid there 
maybe some scenarios where thread which calls setup_large_syscall_stack() 
may be interrupted (or something else) and bigger stack is needed.

For example this is a crash I got only once when the tiny stack was 512 
bytes deep (not sure if increasing to 1024 has fully addressed it as I have 
not seen it anymore):
[backtrace]
0x0044d85c 
0x0044d88a 
0x0032b3eb 
0x0038b4e8 
0x0038a356 
0x003e5381  
>)+561>
0x003e583b 
0x003e5f01 
0x003c5957 
0x003d4885 
0x003d4a5a 
0x815d203f 
0x003e5381  
>)+561>
0x003e583b 
0x003e5f01 
0x003c5957 
0x003d4885 
0x003d4a5a 
0x003d4d46 
0x003e2780 
0x0038b324 
0x1113cd91 
0x11125e23 
0x1112c35f 
0x1112c2e5 
0x11306bf2 
0x20ddff3f 
qemu failed.

I also tried to use mmap() and mprotect() (instead of malloc) to allocate 
memory for large stack and it did not quite work (please see commented out 
code in the patch). Everytime I ran it would crash like this (not event 
same thread when mprotect() was called):

0  0x00225930 in abort (fmt=fmt@entry=0x71d020 "exception nested 
too deeply") at runtime.cc:130
#1  0x00388804 in sched::arch_cpu::enter_exception (this=) at arch/x64/arch-cpu.cc:19
#2  sched::exception_guard::exception_guard (this=) at 
arch/x64/arch-cpu.cc:37
#3  0x0038c6da in general_protection (ef=0x80f2f048) at 
arch/x64/exceptions.cc:313
#4  
#5  0x0038890e in safe_load (data=, 
potentially_bad_pointer=0x56415741e5894855) at ar
ch/x64/safe-ptr.hh:33
#6  backtrace_safe (pc=pc@entry=0x80f2d330, nr=nr@entry=128) at 
arch/x64/backtrace.cc:25
#7  0x0020203c in print_backtrace () at runtime.cc:79
#8  0x0022591c in abort (fmt=fmt@entry=0x71d3a0 "general protection 
fault\n") at runtime.cc:121
#9  0x0038c73a in general_protection (ef=0x80f2e048) at 
arch/x64/exceptions.cc:322
#10 
#11 std::unique_ptr::get (this=0x5bd
8894808c483d0)
at /usr/include/c++/5/bits/unique_ptr.h:305
#12 
sched::thread::do_wake_with(waiter::wake()::{lambda()#1},
 
unsigned int) (allowed_initi
al_states_mask=24, 
action=..., this=0x5bd8894808c48348) at include/osv/sched.hh:1261
#13 
sched::thread::wake_with_from_mutex(waiter::wake()::{lambda()#1})
 
(action=..., this=0x
5bd8894808c48348)
at include/osv/sched.hh:1287
#14 waiter::wake (this=0x33185e 

)
at include/osv/wait_record.hh:38
#15 lockfree::mutex::lock (this=0xa2166648, 
this@entry=0x82953040) at core/lfmutex.cc:74
#16 lockfree_mutex_lock (m=m@entry=0xa2166648) at 
core/lfmutex.cc:282
#17 0x002e213b in mutex_lock (m=0xa2166648) at 
include/osv/mutex.h:59
#18 sa_bulk_lookup (hdl=0xa2166648, 
attrs=attrs@entry=0x829579c0, count=3)
at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c:1851
#19 0x00316cd4 in zfs_getattr (vp=0xa221e780, 
vap=0x82957aa0)
at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:2055
#20 0x0042cf2c in vn_stat (vp=vp@entry=0xa221e780, 
st=st@entry=0x82957b60) at fs/vfs/vfs_vnode.c
c:325
#21 0x0043029a in vfs_file::stat (this=, 
st=0x82957b60) at fs/vfs/vfs_fops.cc:118
#22 0x0041d48c in size (f=...) at fs/fs.cc:16
#23 0x003300ed in mmu::file_vma::fault (this=0xa221ed00, 
addr=17592205996032, ef=0x82958098) at 
core/mmu.cc:1689
#24 0x0032b484 in mmu::vm_fault (addr=, 
addr@entry=17592205997277,