Re: [PATCH] Implement switch to separate SYSCALL call stack
On Wed, Mar 7, 2018 at 6:01 AM, Waldek Kozaczuk wrote: > Nadav, > gic with malloc()? > >> >> Hmm, I just now realized you were using mprotect() to guard one page on >> the end of the stack (i.e., its beginning) and not the entire thing :-) >> Makes more sense now... >> I can't say I understand why mmap() should not work here. Sounds like it >> should have... We're already using mmap() for stacks, successfully. >> To implement a canary with malloc() what you can do is to put a fixed >> value, say 0xdeadbeef, in the beginning of the stack buffer, and every time >> you exit from a system call, you check that it still has this value. >> This doesn't catch stack overflows immediately, but at least it will not >> allow some other thread who's data we just ruined to run. It's not >> foolproof though - it's possible that the data we overflowed is actually >> OSv's own data and ruining it messed with the function of the system call >> itself. >> > > I think we can implement canary logic later. > Right. Maybe a canary would have been useful for you for debugging which stack sizes are enough, but if you already got things working, we can postpone this. > So meanwhile I will stick to regular malloc(). Also isn't my problems with > mmap()/mprotectd() related to this issue - https://github.com/cloudius- > systems/osv/issues/143 ? > I don't see how it's related - issue 143 is about using mmap() *without* the populate option, so that the page is only populated when first used (lazy allocation). But you're giving the populate option, right? -- You received this message because you are subscribed to the Google Groups "OSv Development" group. To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: [PATCH] Implement switch to separate SYSCALL call stack
Nadav, Thanks for the review and answers to my questions. I will try to prepare next (hopefully final version of the patch). Please see some of my comments below, On Tuesday, March 6, 2018 at 6:25:47 PM UTC-5, Nadav Har'El wrote: > > > On Mon, Feb 5, 2018 at 7:58 AM, Waldek Kozaczuk > wrote: > >> I am expecting this patch will need some tweaking but I think it >> fundamentally implements SYSCALL call stack switch. I have tested in using >> my golang working branch. >> > > Excellent. > > >> >> First of all I am assembly novice so I am not sure if I have not messed >> some stuff in entry.S. For example I am not sure I have not messed the CFI >> - on other hand given by previous GDB research it might be mute point given >> gdb cannot handle backtracing non-contiguous stacks. >> > > I think not doing the CFI business around the new function call is fine > (we won't debug this function...) but there's possibly one error in the CFI > thing which I pointed out in my review. > Anyway, this CFI stuff is hard to get right without actually testing gdb... > > >> >> Secondly I am not sure if tiny stack is enough (1024). I am afraid there >> maybe some scenarios where thread which calls setup_large_syscall_stack() >> may be interrupted (or something else) and bigger stack is needed. >> >> For example this is a crash I got only once when the tiny stack was 512 >> bytes deep (not sure if increasing to 1024 has fully addressed it as I have >> not seen it anymore): >> > > I'm not sure what exactly I'm seeing in this stack trace, it seems > corrupt, it is conceivable that 512 bytes is not enough, but I don't really > know why - which of the functions on the malloc() path use a lot of stack > space? > 512 is not a lot - just our "push all the registers on the stack" shtick > consumes 128 bytes (16*8) of that. > > > >> [backtrace] >> 0x0044d85c >> 0x0044d88a > exception_frame*)+26> >> 0x0032b3eb >> 0x0038b4e8 >> 0x0038a356 >> 0x003e5381 > chrono::duration >> >)+561> >> 0x003e583b >> 0x003e5f01 >> 0x003c5957 >> 0x003d4885 >> 0x003d4a5a >> 0x815d203f >> 0x003e5381 > chrono::duration >> >)+561> >> 0x003e583b >> 0x003e5f01 >> 0x003c5957 >> 0x003d4885 >> 0x003d4a5a >> 0x003d4d46 >> 0x003e2780 >> 0x0038b324 >> 0x1113cd91 >> 0x11125e23 >> 0x1112c35f >> 0x1112c2e5 >> 0x11306bf2 >> 0x20ddff3f >> qemu failed. >> >> I also tried to use mmap() and mprotect() (instead of malloc) to allocate >> memory for large stack and it did not quite work (please see commented out >> code in the patch). Everytime I ran it would crash like this (not event >> same thread when mprotect() was called): >> >> 0 0x00225930 in abort (fmt=fmt@entry=0x71d020 "exception nested >> too deeply") at runtime.cc:130 >> #1 0x00388804 in sched::arch_cpu::enter_exception >> (this=) at arch/x64/arch-cpu.cc:19 >> #2 sched::exception_guard::exception_guard (this=) at >> arch/x64/arch-cpu.cc:37 >> #3 0x0038c6da in general_protection (ef=0x80f2f048) at >> arch/x64/exceptions.cc:313 >> #4 >> #5 0x0038890e in safe_load (data=, >> potentially_bad_pointer=0x56415741e5894855) at ar >> ch/x64/safe-ptr.hh:33 >> #6 backtrace_safe (pc=pc@entry=0x80f2d330, nr=nr@entry=128) at >> arch/x64/backtrace.cc:25 >> #7 0x0020203c in print_backtrace () at runtime.cc:79 >> #8 0x0022591c in abort (fmt=fmt@entry=0x71d3a0 "general >> protection fault\n") at runtime.cc:121 >> #9 0x0038c73a in general_protection (ef=0x80f2e048) at >> arch/x64/exceptions.cc:322 >> #10 >> #11 std::unique_ptr> std::default_delete >::get (this=0x5bd >> 8894808c483d0) >> at /usr/include/c++/5/bits/unique_ptr.h:305 >> #12 >> sched::thread::do_wake_with(waiter::wake()::{lambda()#1}, >> >> unsigned int) (allowed_initi >> al_states_mask=24, >> action=..., this=0x5bd8894808c48348) at include/osv/sched.hh:1261 >> #13 >> sched::thread::wake_with_from_mutex(waiter::wake()::{lambda()#1}) >> >> (action=..., this=0x >> 5bd8894808c48348) >> at include/osv/sched.hh:1287 >> #14 waiter::wake (this=0x33185e >> > unsigned long)+30> >> ) >> at include/osv/wait_record.hh:38 >> #15 lockfree::mutex::lock (this=0xa2166648, >> this@entry=0x82953040) at core/lfmutex.cc:74 >> #16 lockfree_mutex_lock (m=m@entry=0xa2166648) at >> core/lfmutex.cc:282 >> #17 0x002e213b in mutex_lock (m=0xa2166648) at >> include/osv/mutex.h:59 >> #18 sa_bulk_lookup (hdl=0xa2166648, >> attrs=attrs@entry=0x829579c0, count=3) >> at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c:1851 >> #19 0x00316cd4 in zfs_getattr (vp=0xa221e780, >> vap=0x82957aa0) >> at bsd/sys/cddl/contrib/opensolar
Re: [PATCH] Implement switch to separate SYSCALL call stack
On Mon, Feb 5, 2018 at 7:58 AM, Waldek Kozaczuk wrote: > I am expecting this patch will need some tweaking but I think it > fundamentally implements SYSCALL call stack switch. I have tested in using > my golang working branch. > Excellent. > > First of all I am assembly novice so I am not sure if I have not messed > some stuff in entry.S. For example I am not sure I have not messed the CFI > - on other hand given by previous GDB research it might be mute point given > gdb cannot handle backtracing non-contiguous stacks. > I think not doing the CFI business around the new function call is fine (we won't debug this function...) but there's possibly one error in the CFI thing which I pointed out in my review. Anyway, this CFI stuff is hard to get right without actually testing gdb... > > Secondly I am not sure if tiny stack is enough (1024). I am afraid there > maybe some scenarios where thread which calls setup_large_syscall_stack() > may be interrupted (or something else) and bigger stack is needed. > > For example this is a crash I got only once when the tiny stack was 512 > bytes deep (not sure if increasing to 1024 has fully addressed it as I have > not seen it anymore): > I'm not sure what exactly I'm seeing in this stack trace, it seems corrupt, it is conceivable that 512 bytes is not enough, but I don't really know why - which of the functions on the malloc() path use a lot of stack space? 512 is not a lot - just our "push all the registers on the stack" shtick consumes 128 bytes (16*8) of that. > [backtrace] > 0x0044d85c > 0x0044d88a exception_frame*)+26> > 0x0032b3eb > 0x0038b4e8 > 0x0038a356 > 0x003e5381 chrono::duration > >)+561> > 0x003e583b > 0x003e5f01 > 0x003c5957 > 0x003d4885 > 0x003d4a5a > 0x815d203f > 0x003e5381 chrono::duration > >)+561> > 0x003e583b > 0x003e5f01 > 0x003c5957 > 0x003d4885 > 0x003d4a5a > 0x003d4d46 > 0x003e2780 > 0x0038b324 > 0x1113cd91 > 0x11125e23 > 0x1112c35f > 0x1112c2e5 > 0x11306bf2 > 0x20ddff3f > qemu failed. > > I also tried to use mmap() and mprotect() (instead of malloc) to allocate > memory for large stack and it did not quite work (please see commented out > code in the patch). Everytime I ran it would crash like this (not event > same thread when mprotect() was called): > > 0 0x00225930 in abort (fmt=fmt@entry=0x71d020 "exception nested > too deeply") at runtime.cc:130 > #1 0x00388804 in sched::arch_cpu::enter_exception > (this=) at arch/x64/arch-cpu.cc:19 > #2 sched::exception_guard::exception_guard (this=) at > arch/x64/arch-cpu.cc:37 > #3 0x0038c6da in general_protection (ef=0x80f2f048) at > arch/x64/exceptions.cc:313 > #4 > #5 0x0038890e in safe_load (data=, > potentially_bad_pointer=0x56415741e5894855) at ar > ch/x64/safe-ptr.hh:33 > #6 backtrace_safe (pc=pc@entry=0x80f2d330, nr=nr@entry=128) at > arch/x64/backtrace.cc:25 > #7 0x0020203c in print_backtrace () at runtime.cc:79 > #8 0x0022591c in abort (fmt=fmt@entry=0x71d3a0 "general > protection fault\n") at runtime.cc:121 > #9 0x0038c73a in general_protection (ef=0x80f2e048) at > arch/x64/exceptions.cc:322 > #10 > #11 std::unique_ptr std::default_delete >::get (this=0x5bd > 8894808c483d0) > at /usr/include/c++/5/bits/unique_ptr.h:305 > #12 > sched::thread::do_wake_with(waiter::wake()::{lambda()#1}, > unsigned int) (allowed_initi > al_states_mask=24, > action=..., this=0x5bd8894808c48348) at include/osv/sched.hh:1261 > #13 sched::thread::wake_with_from_mutex )#1}>(waiter::wake()::{lambda()#1}) (action=..., this=0x > 5bd8894808c48348) > at include/osv/sched.hh:1287 > #14 waiter::wake (this=0x33185e > unsigned long, unsigned long)+30> > ) > at include/osv/wait_record.hh:38 > #15 lockfree::mutex::lock (this=0xa2166648, > this@entry=0x82953040) > at core/lfmutex.cc:74 > #16 lockfree_mutex_lock (m=m@entry=0xa2166648) at > core/lfmutex.cc:282 > #17 0x002e213b in mutex_lock (m=0xa2166648) at > include/osv/mutex.h:59 > #18 sa_bulk_lookup (hdl=0xa2166648, > attrs=attrs@entry=0x829579c0, > count=3) > at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c:1851 > #19 0x00316cd4 in zfs_getattr (vp=0xa221e780, > vap=0x82957aa0) > at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:2055 > #20 0x0042cf2c in vn_stat (vp=vp@entry=0xa221e780, > st=st@entry=0x82957b60) at fs/vfs/vfs_vnode.c > c:325 > #21 0x0043029a in vfs_file::stat (this=, > st=0x82957b60) at fs/vfs/vfs_fops.cc:118 > #22 0x0041d48c in size (f=...) at fs/fs.cc:16 > #23 0x003300ed in mmu::file_vma::fault (this=0xa221ed00, > addr=1759
Re: [PATCH] Implement switch to separate SYSCALL call stack
On Mon, Feb 5, 2018 at 7:28 AM, Waldemar Kozaczuk wrote: > This patch implements separate syscall call stack needed > when runtimes like Golang use SYSCALL instruction to make > system call. > > More specifically each application thread pre-alllocates > "tiny" (1024 bytes deep) syscall call stack. When SYSCALL > instruction is called the call stack is switched to the > tiny stack and the "large" stack is allocated if has not > been alllocated yet (first call on given thread). In the end > SYSCALL ends up being exexuted on the "large" stack. > > This patch is based on the original patch provided by @Hawxchen. > > Fixes #808 > > Signed-off-by: Waldemar Kozaczuk > --- > arch/x64/arch-switch.hh | 61 + > arch/x64/arch-tls.hh| 9 ++ > arch/x64/entry.S| 81 ++ > +++ > include/osv/sched.hh| 7 + > linux.cc| 5 +++ > 5 files changed, 137 insertions(+), 26 deletions(-) > > diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh > index d1a039a..f291a26 100644 > --- a/arch/x64/arch-switch.hh > +++ b/arch/x64/arch-switch.hh > @@ -11,6 +11,7 @@ > #include "msr.hh" > #include > #include > +#include > Maybe this is not needed any more? > > extern "C" { > void thread_main(void); > @@ -114,6 +115,7 @@ void thread::init_stack() > _state.rip = reinterpret_cast(thread_main); > _state.rsp = stacktop; > _state.exception_stack = _arch.exception_stack + > sizeof(_arch.exception_stack); > +debug("***-> %d, Stack top: %#x\n", this->_id, stacktop); > Please remove this from the final version. > } > > void thread::setup_tcb() > @@ -146,6 +148,50 @@ void thread::setup_tcb() > _tcb = static_cast(p + tls.size + > user_tls_size); > _tcb->self = _tcb; > _tcb->tls_base = p + user_tls_size; > + > +if(is_app()) { > +// > +// Allocate tiny syscall call stack > +auto& tiny_stack = _attr._tiny_syscall_stack; > I doubt we'll ever want to configure tiny_stack's size or location, so I don't think it needs to be configurable in _attr. Do you think it should? > + > +if(!tiny_stack.begin) { > +tiny_stack.size = 1024; //512 seems to be too small sometimes > +tiny_stack.begin = malloc(tiny_stack.size); > +tiny_stack.deleter = tiny_stack.default_deleter; > +} > + > +_tcb->tiny_syscall_stack_addr = tiny_stack.begin + > tiny_stack.size; > +_tcb->large_syscall_stack_addr = nullptr; > I think we don't really need two entries in the TCB here - we can have just one, which starts with the pointer to the tiny buffer and then replaced by the pointer to the bigger one after that is allocated. > +debug("***-> %d, Tiny syscall stack top: %#x with size: %d > bytes\n", this->_id, > + _tcb->tiny_syscall_stack_addr, tiny_stack.size); > Another debug to remove. > +} > +} > + > +void thread::setup_large_syscall_stack() > +{ > +auto& large_stack = _attr._large_syscall_stack; > +if(large_stack.begin) { > +return; > +} > Again, not sure we need this in _attr, would anybody need to configure it? (it's supposed to depend on OSv's needs, not the application's needs). > + > +large_stack.size = 16 * PAGE_SIZE; > + > +/* > + * Mmapping (without mprotect()) seems to work but occasionally leads > to crashes ( > +large_stack.begin = mmu::map_anon(nullptr, large_stack.size, > mmu::mmap_populate, mmu::perm_rw); > +//mmu::mprotect(large_stack.begin, PAGE_SIZE, 0); //THIS DOES NOT > WORK at all -> results in crash > Why would protect work? Doesn't it mean the page is not at all readable or writable, making it crash when used? I wonder why mmap didn't work. Should have... Unless someone tries to execute the stack? Or the mmap() call itself simply needs a bigger stack than 1024 bytes? > +large_stack.deleter = free_large_syscall_stack; > +*/ > + > +large_stack.begin = malloc(large_stack.size); > +large_stack.deleter = large_stack.default_deleter; > + > +_tcb->large_syscall_stack_addr = large_stack.begin + large_stack.size; > Theoretically, we could free the old stack after setting up a new one. In practice, it will be ugly to do (how can a thread free the stack it is currently using?) so maybe not worth it. > +} > + > +void thread::free_large_syscall_stack(sched::thread::stack_info si) > +{ > +mmu::munmap(si.begin, si.size); > } > This function is not used (because the mmap() version is commented) and should be removed, right? > > void thread::free_tcb() > @@ -156,6 +202,21 @@ void thread::free_tcb() > } else { > free(_tcb->tls_base); > } > + > +if(is_app()) { > +auto& tiny_stack = _attr._tiny_syscall_stack; > + > +assert(tiny_stack.begin); > + > +if(tiny_stack.deleter) { > +tiny_stack.deleter(tiny_stack); > +} > + > +auto& large
Re: [PATCH] Implement switch to separate SYSCALL call stack
Bump. Sent from my iPhone > On Feb 5, 2018, at 00:58, Waldek Kozaczuk wrote: > > I am expecting this patch will need some tweaking but I think it > fundamentally implements SYSCALL call stack switch. I have tested in using my > golang working branch. > > First of all I am assembly novice so I am not sure if I have not messed some > stuff in entry.S. For example I am not sure I have not messed the CFI - on > other hand given by previous GDB research it might be mute point given gdb > cannot handle backtracing non-contiguous stacks. > > Secondly I am not sure if tiny stack is enough (1024). I am afraid there > maybe some scenarios where thread which calls setup_large_syscall_stack() may > be interrupted (or something else) and bigger stack is needed. > > For example this is a crash I got only once when the tiny stack was 512 bytes > deep (not sure if increasing to 1024 has fully addressed it as I have not > seen it anymore): > [backtrace] > 0x0044d85c > 0x0044d88a exception_frame*)+26> > 0x0032b3eb > 0x0038b4e8 > 0x0038a356 > 0x003e5381 std::chrono::duration > >)+561> > 0x003e583b > 0x003e5f01 > 0x003c5957 > 0x003d4885 > 0x003d4a5a > 0x815d203f > 0x003e5381 std::chrono::duration > >)+561> > 0x003e583b > 0x003e5f01 > 0x003c5957 > 0x003d4885 > 0x003d4a5a > 0x003d4d46 > 0x003e2780 > 0x0038b324 > 0x1113cd91 > 0x11125e23 > 0x1112c35f > 0x1112c2e5 > 0x11306bf2 > 0x20ddff3f > qemu failed. > > I also tried to use mmap() and mprotect() (instead of malloc) to allocate > memory for large stack and it did not quite work (please see commented out > code in the patch). Everytime I ran it would crash like this (not event same > thread when mprotect() was called): > > 0 0x00225930 in abort (fmt=fmt@entry=0x71d020 "exception nested too > deeply") at runtime.cc:130 > #1 0x00388804 in sched::arch_cpu::enter_exception (this= out>) at arch/x64/arch-cpu.cc:19 > #2 sched::exception_guard::exception_guard (this=) at > arch/x64/arch-cpu.cc:37 > #3 0x0038c6da in general_protection (ef=0x80f2f048) at > arch/x64/exceptions.cc:313 > #4 > #5 0x0038890e in safe_load (data=, > potentially_bad_pointer=0x56415741e5894855) at ar > ch/x64/safe-ptr.hh:33 > #6 backtrace_safe (pc=pc@entry=0x80f2d330, nr=nr@entry=128) at > arch/x64/backtrace.cc:25 > #7 0x0020203c in print_backtrace () at runtime.cc:79 > #8 0x0022591c in abort (fmt=fmt@entry=0x71d3a0 "general protection > fault\n") at runtime.cc:121 > #9 0x0038c73a in general_protection (ef=0x80f2e048) at > arch/x64/exceptions.cc:322 > #10 > #11 std::unique_ptr std::default_delete >::get (this=0x5bd > 8894808c483d0) > at /usr/include/c++/5/bits/unique_ptr.h:305 > #12 > sched::thread::do_wake_with(waiter::wake()::{lambda()#1}, > unsigned int) (allowed_initi > al_states_mask=24, > action=..., this=0x5bd8894808c48348) at include/osv/sched.hh:1261 > #13 > sched::thread::wake_with_from_mutex(waiter::wake()::{lambda()#1}) > (action=..., this=0x > 5bd8894808c48348) > at include/osv/sched.hh:1287 > #14 waiter::wake (this=0x33185e > unsigned long)+30> > ) > at include/osv/wait_record.hh:38 > #15 lockfree::mutex::lock (this=0xa2166648, > this@entry=0x82953040) at core/lfmutex.cc:74 > #16 lockfree_mutex_lock (m=m@entry=0xa2166648) at core/lfmutex.cc:282 > #17 0x002e213b in mutex_lock (m=0xa2166648) at > include/osv/mutex.h:59 > #18 sa_bulk_lookup (hdl=0xa2166648, > attrs=attrs@entry=0x829579c0, count=3) > at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c:1851 > #19 0x00316cd4 in zfs_getattr (vp=0xa221e780, > vap=0x82957aa0) > at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:2055 > #20 0x0042cf2c in vn_stat (vp=vp@entry=0xa221e780, > st=st@entry=0x82957b60) at fs/vfs/vfs_vnode.c > c:325 > #21 0x0043029a in vfs_file::stat (this=, > st=0x82957b60) at fs/vfs/vfs_fops.cc:118 > #22 0x0041d48c in size (f=...) at fs/fs.cc:16 > #23 0x003300ed in mmu::file_vma::fault (this=0xa221ed00, > addr=17592205996032, ef=0x82958098) at > core/mmu.cc:1689 > #24 0x0032b484 in mmu::vm_fault (addr=, > addr@entry=17592205997277, ef=ef@entry=0x82958098 > ) at core/mmu.cc:1342 > #25 0x0038b529 in page_fault (ef=0x82958098) at > arch/x64/mmu.cc:38 > #26 > #27 runtime.memmove () at > /home/wkozaczuk/tools/go/src/runtime/memmove_amd64.s:161 > #28 0x1112864f in runtime.recordForPanic (b=...) at > /home/wkozaczuk/tools/go/src/runtime/print.go:45 > > When I commented mprotect() (still mmap()) the test app would work but > oc
Re: [PATCH] Implement switch to separate SYSCALL call stack
I am expecting this patch will need some tweaking but I think it fundamentally implements SYSCALL call stack switch. I have tested in using my golang working branch. First of all I am assembly novice so I am not sure if I have not messed some stuff in entry.S. For example I am not sure I have not messed the CFI - on other hand given by previous GDB research it might be mute point given gdb cannot handle backtracing non-contiguous stacks. Secondly I am not sure if tiny stack is enough (1024). I am afraid there maybe some scenarios where thread which calls setup_large_syscall_stack() may be interrupted (or something else) and bigger stack is needed. For example this is a crash I got only once when the tiny stack was 512 bytes deep (not sure if increasing to 1024 has fully addressed it as I have not seen it anymore): [backtrace] 0x0044d85c 0x0044d88a 0x0032b3eb 0x0038b4e8 0x0038a356 0x003e5381 >)+561> 0x003e583b 0x003e5f01 0x003c5957 0x003d4885 0x003d4a5a 0x815d203f 0x003e5381 >)+561> 0x003e583b 0x003e5f01 0x003c5957 0x003d4885 0x003d4a5a 0x003d4d46 0x003e2780 0x0038b324 0x1113cd91 0x11125e23 0x1112c35f 0x1112c2e5 0x11306bf2 0x20ddff3f qemu failed. I also tried to use mmap() and mprotect() (instead of malloc) to allocate memory for large stack and it did not quite work (please see commented out code in the patch). Everytime I ran it would crash like this (not event same thread when mprotect() was called): 0 0x00225930 in abort (fmt=fmt@entry=0x71d020 "exception nested too deeply") at runtime.cc:130 #1 0x00388804 in sched::arch_cpu::enter_exception (this=) at arch/x64/arch-cpu.cc:19 #2 sched::exception_guard::exception_guard (this=) at arch/x64/arch-cpu.cc:37 #3 0x0038c6da in general_protection (ef=0x80f2f048) at arch/x64/exceptions.cc:313 #4 #5 0x0038890e in safe_load (data=, potentially_bad_pointer=0x56415741e5894855) at ar ch/x64/safe-ptr.hh:33 #6 backtrace_safe (pc=pc@entry=0x80f2d330, nr=nr@entry=128) at arch/x64/backtrace.cc:25 #7 0x0020203c in print_backtrace () at runtime.cc:79 #8 0x0022591c in abort (fmt=fmt@entry=0x71d3a0 "general protection fault\n") at runtime.cc:121 #9 0x0038c73a in general_protection (ef=0x80f2e048) at arch/x64/exceptions.cc:322 #10 #11 std::unique_ptr >::get (this=0x5bd 8894808c483d0) at /usr/include/c++/5/bits/unique_ptr.h:305 #12 sched::thread::do_wake_with(waiter::wake()::{lambda()#1}, unsigned int) (allowed_initi al_states_mask=24, action=..., this=0x5bd8894808c48348) at include/osv/sched.hh:1261 #13 sched::thread::wake_with_from_mutex(waiter::wake()::{lambda()#1}) (action=..., this=0x 5bd8894808c48348) at include/osv/sched.hh:1287 #14 waiter::wake (this=0x33185e ) at include/osv/wait_record.hh:38 #15 lockfree::mutex::lock (this=0xa2166648, this@entry=0x82953040) at core/lfmutex.cc:74 #16 lockfree_mutex_lock (m=m@entry=0xa2166648) at core/lfmutex.cc:282 #17 0x002e213b in mutex_lock (m=0xa2166648) at include/osv/mutex.h:59 #18 sa_bulk_lookup (hdl=0xa2166648, attrs=attrs@entry=0x829579c0, count=3) at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c:1851 #19 0x00316cd4 in zfs_getattr (vp=0xa221e780, vap=0x82957aa0) at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:2055 #20 0x0042cf2c in vn_stat (vp=vp@entry=0xa221e780, st=st@entry=0x82957b60) at fs/vfs/vfs_vnode.c c:325 #21 0x0043029a in vfs_file::stat (this=, st=0x82957b60) at fs/vfs/vfs_fops.cc:118 #22 0x0041d48c in size (f=...) at fs/fs.cc:16 #23 0x003300ed in mmu::file_vma::fault (this=0xa221ed00, addr=17592205996032, ef=0x82958098) at core/mmu.cc:1689 #24 0x0032b484 in mmu::vm_fault (addr=, addr@entry=17592205997277, ef=ef@entry=0x82958098 ) at core/mmu.cc:1342 #25 0x0038b529 in page_fault (ef=0x82958098) at arch/x64/mmu.cc:38 #26 #27 runtime.memmove () at /home/wkozaczuk/tools/go/src/runtime/memmove_amd64.s:161 #28 0x1112864f in runtime.recordForPanic (b=...) at /home/wkozaczuk/tools/go/src/runtime/print.go:45 When I commented mprotect() (still mmap()) the test app would work but occasionally crash (I do not have an example). SO I wonder if there is some fundamental constraint which is violated when calling mmap()/mprotect() when setting up large stack (called from assembly in tiny stack) comparing to regular malloc(). Is it because mprotect() requires updating memory page tables and it triggers some logic that messes things up? If I cannot use mmap() how do I implement canary type of logic with malloc()
[PATCH] Implement switch to separate SYSCALL call stack
This patch implements separate syscall call stack needed when runtimes like Golang use SYSCALL instruction to make system call. More specifically each application thread pre-alllocates "tiny" (1024 bytes deep) syscall call stack. When SYSCALL instruction is called the call stack is switched to the tiny stack and the "large" stack is allocated if has not been alllocated yet (first call on given thread). In the end SYSCALL ends up being exexuted on the "large" stack. This patch is based on the original patch provided by @Hawxchen. Fixes #808 Signed-off-by: Waldemar Kozaczuk --- arch/x64/arch-switch.hh | 61 + arch/x64/arch-tls.hh| 9 ++ arch/x64/entry.S| 81 + include/osv/sched.hh| 7 + linux.cc| 5 +++ 5 files changed, 137 insertions(+), 26 deletions(-) diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh index d1a039a..f291a26 100644 --- a/arch/x64/arch-switch.hh +++ b/arch/x64/arch-switch.hh @@ -11,6 +11,7 @@ #include "msr.hh" #include #include +#include extern "C" { void thread_main(void); @@ -114,6 +115,7 @@ void thread::init_stack() _state.rip = reinterpret_cast(thread_main); _state.rsp = stacktop; _state.exception_stack = _arch.exception_stack + sizeof(_arch.exception_stack); +debug("***-> %d, Stack top: %#x\n", this->_id, stacktop); } void thread::setup_tcb() @@ -146,6 +148,50 @@ void thread::setup_tcb() _tcb = static_cast(p + tls.size + user_tls_size); _tcb->self = _tcb; _tcb->tls_base = p + user_tls_size; + +if(is_app()) { +// +// Allocate tiny syscall call stack +auto& tiny_stack = _attr._tiny_syscall_stack; + +if(!tiny_stack.begin) { +tiny_stack.size = 1024; //512 seems to be too small sometimes +tiny_stack.begin = malloc(tiny_stack.size); +tiny_stack.deleter = tiny_stack.default_deleter; +} + +_tcb->tiny_syscall_stack_addr = tiny_stack.begin + tiny_stack.size; +_tcb->large_syscall_stack_addr = nullptr; +debug("***-> %d, Tiny syscall stack top: %#x with size: %d bytes\n", this->_id, + _tcb->tiny_syscall_stack_addr, tiny_stack.size); +} +} + +void thread::setup_large_syscall_stack() +{ +auto& large_stack = _attr._large_syscall_stack; +if(large_stack.begin) { +return; +} + +large_stack.size = 16 * PAGE_SIZE; + +/* + * Mmapping (without mprotect()) seems to work but occasionally leads to crashes ( +large_stack.begin = mmu::map_anon(nullptr, large_stack.size, mmu::mmap_populate, mmu::perm_rw); +//mmu::mprotect(large_stack.begin, PAGE_SIZE, 0); //THIS DOES NOT WORK at all -> results in crash +large_stack.deleter = free_large_syscall_stack; +*/ + +large_stack.begin = malloc(large_stack.size); +large_stack.deleter = large_stack.default_deleter; + +_tcb->large_syscall_stack_addr = large_stack.begin + large_stack.size; +} + +void thread::free_large_syscall_stack(sched::thread::stack_info si) +{ +mmu::munmap(si.begin, si.size); } void thread::free_tcb() @@ -156,6 +202,21 @@ void thread::free_tcb() } else { free(_tcb->tls_base); } + +if(is_app()) { +auto& tiny_stack = _attr._tiny_syscall_stack; + +assert(tiny_stack.begin); + +if(tiny_stack.deleter) { +tiny_stack.deleter(tiny_stack); +} + +auto& large_stack = _attr._large_syscall_stack; +if(large_stack.begin && large_stack.deleter) { +large_stack.deleter(large_stack); +} +} } void thread_main_c(thread* t) diff --git a/arch/x64/arch-tls.hh b/arch/x64/arch-tls.hh index 1bf86fd..64a1788 100644 --- a/arch/x64/arch-tls.hh +++ b/arch/x64/arch-tls.hh @@ -8,9 +8,18 @@ #ifndef ARCH_TLS_HH #define ARCH_TLS_HH +// Don't change the declaration sequence of all existing members'. +// Please add new members from the last. struct thread_control_block { thread_control_block* self; void* tls_base; +// These fields, a per-thread stack for SYSCALL instruction, +// are used in arch/x64/entry.S for %fs's offset. +// We currently keep these fields in the TCB to make it easier to access in assembly code +// through a known offset into %fs. But with more effort, we could have used an +// ordinary thread-local variable instead and avoided extending the TCB here. +void* tiny_syscall_stack_addr; +void* large_syscall_stack_addr; }; #endif /* ARCH_TLS_HH */ diff --git a/arch/x64/entry.S b/arch/x64/entry.S index 04d809d..87f37b8 100644 --- a/arch/x64/entry.S +++ b/arch/x64/entry.S @@ -170,25 +170,57 @@ syscall_entry: .cfi_register rflags, r11 # r11 took previous rflags value # There is no ring transition and rflags are left unchanged. -# Skip the "red zone" allowed by the AMD64 ABI (the caller used a -# SYSCALL instruction an