Re: [PATCH] Implement switch to separate SYSCALL call stack

2018-03-07 Thread Nadav Har'El
On Wed, Mar 7, 2018 at 6:01 AM, Waldek Kozaczuk 
wrote:

> Nadav,
> gic with malloc()?
>
>>
>> Hmm, I just now realized you were using mprotect() to guard one page on
>> the end of the stack (i.e., its beginning) and not the entire thing :-)
>> Makes more sense now...
>> I can't say I understand why mmap() should not work here. Sounds like it
>> should have... We're already using mmap() for stacks, successfully.
>> To implement a canary with malloc() what you can do is to put a fixed
>> value, say 0xdeadbeef, in the beginning of the stack buffer, and every time
>> you exit from a system call, you check that it still has this value.
>> This doesn't catch stack overflows immediately, but at least it will not
>> allow some other thread who's data we just ruined to run. It's not
>> foolproof though - it's possible that the data we overflowed is actually
>> OSv's own data and ruining it messed with the function of the system call
>> itself.
>>
>
> I think we can implement canary logic later.
>

Right. Maybe a canary would have been useful for you for debugging which
stack sizes are enough, but if you already got things working, we can
postpone this.


> So meanwhile I will stick to regular malloc(). Also isn't my problems with
> mmap()/mprotectd() related to this issue - https://github.com/cloudius-
> systems/osv/issues/143  ?
>

I don't see how it's related - issue 143 is about using mmap() *without*
the populate option, so that the page is only populated when first
used (lazy allocation). But you're giving the populate option, right?

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [PATCH] Implement switch to separate SYSCALL call stack

2018-03-06 Thread Waldek Kozaczuk
Nadav,

Thanks for the review and answers to my questions. I will try to prepare 
next (hopefully final version of the patch).

Please see some of my comments below,

On Tuesday, March 6, 2018 at 6:25:47 PM UTC-5, Nadav Har'El wrote:
>
>
> On Mon, Feb 5, 2018 at 7:58 AM, Waldek Kozaczuk  > wrote:
>
>> I am expecting this patch will need some tweaking but I think it 
>> fundamentally implements SYSCALL call stack switch. I have tested in using 
>> my golang working branch.
>>
>
> Excellent.
>  
>
>>
>> First of all I am assembly novice so I am not sure if I have not messed 
>> some stuff in entry.S. For example I am not sure I have not messed the CFI 
>> - on other hand given by previous GDB research it might be mute point given 
>> gdb cannot handle backtracing non-contiguous stacks.
>>
>
> I think not doing the CFI business around the new function call is fine 
> (we won't debug this function...) but there's possibly one error in the CFI 
> thing which I pointed out in my review.
> Anyway, this CFI stuff is hard to get right without actually testing gdb...
>  
>
>>
>> Secondly I am not sure if tiny stack is enough (1024). I am afraid there 
>> maybe some scenarios where thread which calls setup_large_syscall_stack() 
>> may be interrupted (or something else) and bigger stack is needed.
>>
>> For example this is a crash I got only once when the tiny stack was 512 
>> bytes deep (not sure if increasing to 1024 has fully addressed it as I have 
>> not seen it anymore):
>>
>
> I'm not sure what exactly I'm seeing in this stack trace, it seems 
> corrupt, it is conceivable that 512 bytes is not enough, but I don't really 
> know why - which of the functions on the malloc() path use a lot of stack 
> space?
> 512 is not a lot - just our "push all the registers on the stack" shtick 
> consumes 128 bytes (16*8) of that.
>  
>  
>
>> [backtrace]
>> 0x0044d85c 
>> 0x0044d88a > exception_frame*)+26>
>> 0x0032b3eb 
>> 0x0038b4e8 
>> 0x0038a356 
>> 0x003e5381 > chrono::duration 
>> >)+561>
>> 0x003e583b 
>> 0x003e5f01 
>> 0x003c5957 
>> 0x003d4885 
>> 0x003d4a5a 
>> 0x815d203f 
>> 0x003e5381 > chrono::duration 
>> >)+561>
>> 0x003e583b 
>> 0x003e5f01 
>> 0x003c5957 
>> 0x003d4885 
>> 0x003d4a5a 
>> 0x003d4d46 
>> 0x003e2780 
>> 0x0038b324 
>> 0x1113cd91 
>> 0x11125e23 
>> 0x1112c35f 
>> 0x1112c2e5 
>> 0x11306bf2 
>> 0x20ddff3f 
>> qemu failed.
>>
>> I also tried to use mmap() and mprotect() (instead of malloc) to allocate 
>> memory for large stack and it did not quite work (please see commented out 
>> code in the patch). Everytime I ran it would crash like this (not event 
>> same thread when mprotect() was called):
>>
>> 0  0x00225930 in abort (fmt=fmt@entry=0x71d020 "exception nested 
>> too deeply") at runtime.cc:130
>> #1  0x00388804 in sched::arch_cpu::enter_exception 
>> (this=) at arch/x64/arch-cpu.cc:19
>> #2  sched::exception_guard::exception_guard (this=) at 
>> arch/x64/arch-cpu.cc:37
>> #3  0x0038c6da in general_protection (ef=0x80f2f048) at 
>> arch/x64/exceptions.cc:313
>> #4  
>> #5  0x0038890e in safe_load (data=, 
>> potentially_bad_pointer=0x56415741e5894855) at ar
>> ch/x64/safe-ptr.hh:33
>> #6  backtrace_safe (pc=pc@entry=0x80f2d330, nr=nr@entry=128) at 
>> arch/x64/backtrace.cc:25
>> #7  0x0020203c in print_backtrace () at runtime.cc:79
>> #8  0x0022591c in abort (fmt=fmt@entry=0x71d3a0 "general 
>> protection fault\n") at runtime.cc:121
>> #9  0x0038c73a in general_protection (ef=0x80f2e048) at 
>> arch/x64/exceptions.cc:322
>> #10 
>> #11 std::unique_ptr> std::default_delete >::get (this=0x5bd
>> 8894808c483d0)
>> at /usr/include/c++/5/bits/unique_ptr.h:305
>> #12 
>> sched::thread::do_wake_with(waiter::wake()::{lambda()#1},
>>  
>> unsigned int) (allowed_initi
>> al_states_mask=24, 
>> action=..., this=0x5bd8894808c48348) at include/osv/sched.hh:1261
>> #13 
>> sched::thread::wake_with_from_mutex(waiter::wake()::{lambda()#1})
>>  
>> (action=..., this=0x
>> 5bd8894808c48348)
>> at include/osv/sched.hh:1287
>> #14 waiter::wake (this=0x33185e 
>> > unsigned long)+30>
>> )
>> at include/osv/wait_record.hh:38
>> #15 lockfree::mutex::lock (this=0xa2166648, 
>> this@entry=0x82953040) at core/lfmutex.cc:74
>> #16 lockfree_mutex_lock (m=m@entry=0xa2166648) at 
>> core/lfmutex.cc:282
>> #17 0x002e213b in mutex_lock (m=0xa2166648) at 
>> include/osv/mutex.h:59
>> #18 sa_bulk_lookup (hdl=0xa2166648, 
>> attrs=attrs@entry=0x829579c0, count=3)
>> at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c:1851
>> #19 0x00316cd4 in zfs_getattr (vp=0xa221e780, 
>> vap=0x82957aa0)
>> at bsd/sys/cddl/contrib/opensolar

Re: [PATCH] Implement switch to separate SYSCALL call stack

2018-03-06 Thread Nadav Har'El
On Mon, Feb 5, 2018 at 7:58 AM, Waldek Kozaczuk 
wrote:

> I am expecting this patch will need some tweaking but I think it
> fundamentally implements SYSCALL call stack switch. I have tested in using
> my golang working branch.
>

Excellent.


>
> First of all I am assembly novice so I am not sure if I have not messed
> some stuff in entry.S. For example I am not sure I have not messed the CFI
> - on other hand given by previous GDB research it might be mute point given
> gdb cannot handle backtracing non-contiguous stacks.
>

I think not doing the CFI business around the new function call is fine (we
won't debug this function...) but there's possibly one error in the CFI
thing which I pointed out in my review.
Anyway, this CFI stuff is hard to get right without actually testing gdb...


>
> Secondly I am not sure if tiny stack is enough (1024). I am afraid there
> maybe some scenarios where thread which calls setup_large_syscall_stack()
> may be interrupted (or something else) and bigger stack is needed.
>
> For example this is a crash I got only once when the tiny stack was 512
> bytes deep (not sure if increasing to 1024 has fully addressed it as I have
> not seen it anymore):
>

I'm not sure what exactly I'm seeing in this stack trace, it seems corrupt,
it is conceivable that 512 bytes is not enough, but I don't really know why
- which of the functions on the malloc() path use a lot of stack space?
512 is not a lot - just our "push all the registers on the stack" shtick
consumes 128 bytes (16*8) of that.



> [backtrace]
> 0x0044d85c 
> 0x0044d88a  exception_frame*)+26>
> 0x0032b3eb 
> 0x0038b4e8 
> 0x0038a356 
> 0x003e5381  chrono::duration
> >)+561>
> 0x003e583b 
> 0x003e5f01 
> 0x003c5957 
> 0x003d4885 
> 0x003d4a5a 
> 0x815d203f 
> 0x003e5381  chrono::duration
> >)+561>
> 0x003e583b 
> 0x003e5f01 
> 0x003c5957 
> 0x003d4885 
> 0x003d4a5a 
> 0x003d4d46 
> 0x003e2780 
> 0x0038b324 
> 0x1113cd91 
> 0x11125e23 
> 0x1112c35f 
> 0x1112c2e5 
> 0x11306bf2 
> 0x20ddff3f 
> qemu failed.
>
> I also tried to use mmap() and mprotect() (instead of malloc) to allocate
> memory for large stack and it did not quite work (please see commented out
> code in the patch). Everytime I ran it would crash like this (not event
> same thread when mprotect() was called):
>
> 0  0x00225930 in abort (fmt=fmt@entry=0x71d020 "exception nested
> too deeply") at runtime.cc:130
> #1  0x00388804 in sched::arch_cpu::enter_exception
> (this=) at arch/x64/arch-cpu.cc:19
> #2  sched::exception_guard::exception_guard (this=) at
> arch/x64/arch-cpu.cc:37
> #3  0x0038c6da in general_protection (ef=0x80f2f048) at
> arch/x64/exceptions.cc:313
> #4  
> #5  0x0038890e in safe_load (data=,
> potentially_bad_pointer=0x56415741e5894855) at ar
> ch/x64/safe-ptr.hh:33
> #6  backtrace_safe (pc=pc@entry=0x80f2d330, nr=nr@entry=128) at
> arch/x64/backtrace.cc:25
> #7  0x0020203c in print_backtrace () at runtime.cc:79
> #8  0x0022591c in abort (fmt=fmt@entry=0x71d3a0 "general
> protection fault\n") at runtime.cc:121
> #9  0x0038c73a in general_protection (ef=0x80f2e048) at
> arch/x64/exceptions.cc:322
> #10 
> #11 std::unique_ptr std::default_delete >::get (this=0x5bd
> 8894808c483d0)
> at /usr/include/c++/5/bits/unique_ptr.h:305
> #12 
> sched::thread::do_wake_with(waiter::wake()::{lambda()#1},
> unsigned int) (allowed_initi
> al_states_mask=24,
> action=..., this=0x5bd8894808c48348) at include/osv/sched.hh:1261
> #13 sched::thread::wake_with_from_mutex )#1}>(waiter::wake()::{lambda()#1}) (action=..., this=0x
> 5bd8894808c48348)
> at include/osv/sched.hh:1287
> #14 waiter::wake (this=0x33185e 
>  unsigned long, unsigned long)+30>
> )
> at include/osv/wait_record.hh:38
> #15 lockfree::mutex::lock (this=0xa2166648, 
> this@entry=0x82953040)
> at core/lfmutex.cc:74
> #16 lockfree_mutex_lock (m=m@entry=0xa2166648) at
> core/lfmutex.cc:282
> #17 0x002e213b in mutex_lock (m=0xa2166648) at
> include/osv/mutex.h:59
> #18 sa_bulk_lookup (hdl=0xa2166648, 
> attrs=attrs@entry=0x829579c0,
> count=3)
> at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c:1851
> #19 0x00316cd4 in zfs_getattr (vp=0xa221e780,
> vap=0x82957aa0)
> at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:2055
> #20 0x0042cf2c in vn_stat (vp=vp@entry=0xa221e780,
> st=st@entry=0x82957b60) at fs/vfs/vfs_vnode.c
> c:325
> #21 0x0043029a in vfs_file::stat (this=,
> st=0x82957b60) at fs/vfs/vfs_fops.cc:118
> #22 0x0041d48c in size (f=...) at fs/fs.cc:16
> #23 0x003300ed in mmu::file_vma::fault (this=0xa221ed00,
> addr=1759

Re: [PATCH] Implement switch to separate SYSCALL call stack

2018-03-06 Thread Nadav Har'El
On Mon, Feb 5, 2018 at 7:28 AM, Waldemar Kozaczuk 
wrote:

> This patch implements separate syscall call stack needed
> when runtimes like Golang use SYSCALL instruction to make
> system call.
>
> More specifically each application thread pre-alllocates
> "tiny" (1024 bytes deep) syscall call stack. When SYSCALL
> instruction is called the call stack is switched to the
> tiny stack and the "large" stack is allocated if has not
> been alllocated yet (first call on given thread). In the end
> SYSCALL ends up being exexuted on the "large" stack.
>
> This patch is based on the original patch provided by @Hawxchen.
>
> Fixes #808
>
> Signed-off-by: Waldemar Kozaczuk 
> ---
>  arch/x64/arch-switch.hh | 61 +
>  arch/x64/arch-tls.hh|  9 ++
>  arch/x64/entry.S| 81 ++
> +++
>  include/osv/sched.hh|  7 +
>  linux.cc|  5 +++
>  5 files changed, 137 insertions(+), 26 deletions(-)
>
> diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh
> index d1a039a..f291a26 100644
> --- a/arch/x64/arch-switch.hh
> +++ b/arch/x64/arch-switch.hh
> @@ -11,6 +11,7 @@
>  #include "msr.hh"
>  #include 
>  #include 
> +#include 
>

Maybe this is not needed any more?


>
>  extern "C" {
>  void thread_main(void);
> @@ -114,6 +115,7 @@ void thread::init_stack()
>  _state.rip = reinterpret_cast(thread_main);
>  _state.rsp = stacktop;
>  _state.exception_stack = _arch.exception_stack +
> sizeof(_arch.exception_stack);
> +debug("***-> %d, Stack top: %#x\n", this->_id, stacktop);
>

Please remove this from the final version.


>  }
>
>  void thread::setup_tcb()
> @@ -146,6 +148,50 @@ void thread::setup_tcb()
>  _tcb = static_cast(p + tls.size +
> user_tls_size);
>  _tcb->self = _tcb;
>  _tcb->tls_base = p + user_tls_size;
> +
> +if(is_app()) {
> +//
> +// Allocate tiny syscall call stack
> +auto& tiny_stack = _attr._tiny_syscall_stack;
>

I doubt we'll ever want to configure tiny_stack's size or location, so I
don't
think it needs to be configurable in _attr. Do you think it should?


> +
> +if(!tiny_stack.begin) {
> +tiny_stack.size = 1024; //512 seems to be too small sometimes
> +tiny_stack.begin = malloc(tiny_stack.size);
> +tiny_stack.deleter = tiny_stack.default_deleter;
> +}
> +
> +_tcb->tiny_syscall_stack_addr = tiny_stack.begin +
> tiny_stack.size;
> +_tcb->large_syscall_stack_addr = nullptr;
>

I think we don't really need two entries in the TCB here - we can have just
one,
which starts with the pointer to the tiny buffer and then replaced by the
pointer to the bigger
one after that is allocated.


> +debug("***-> %d, Tiny syscall stack top: %#x with size: %d
> bytes\n", this->_id,
> +  _tcb->tiny_syscall_stack_addr, tiny_stack.size);
>

Another debug to remove.


> +}
> +}
> +
> +void thread::setup_large_syscall_stack()
> +{
> +auto& large_stack = _attr._large_syscall_stack;
>
+if(large_stack.begin) {
> +return;
> +}
>

Again, not sure we need this in _attr, would anybody need to configure it?
(it's supposed to depend
on OSv's needs, not the application's needs).


> +
> +large_stack.size = 16 * PAGE_SIZE;
> +
> +/*
> + * Mmapping (without mprotect()) seems to work but occasionally leads
> to crashes (
> +large_stack.begin = mmu::map_anon(nullptr, large_stack.size,
> mmu::mmap_populate, mmu::perm_rw);
> +//mmu::mprotect(large_stack.begin, PAGE_SIZE, 0); //THIS DOES NOT
> WORK at all -> results in crash
>

Why would protect work? Doesn't it mean the page is not at all readable or
writable, making it crash when used?

I wonder why mmap didn't work. Should have... Unless someone tries to
execute the stack? Or the mmap() call itself simply needs a bigger stack
than 1024 bytes?


> +large_stack.deleter = free_large_syscall_stack;
> +*/
> +
> +large_stack.begin = malloc(large_stack.size);
> +large_stack.deleter = large_stack.default_deleter;
> +
> +_tcb->large_syscall_stack_addr = large_stack.begin + large_stack.size;
>

Theoretically, we could free the old stack after setting up a new one.
In practice, it will be ugly to do (how can a thread free the stack it is
currently using?) so maybe not worth it.



> +}
> +
> +void thread::free_large_syscall_stack(sched::thread::stack_info si)
> +{
> +mmu::munmap(si.begin, si.size);
>  }
>

This function is not used (because the mmap() version is commented) and
should be removed, right?

>
>  void thread::free_tcb()
> @@ -156,6 +202,21 @@ void thread::free_tcb()
>  } else {
>  free(_tcb->tls_base);
>  }
> +
> +if(is_app()) {
> +auto& tiny_stack = _attr._tiny_syscall_stack;
> +
> +assert(tiny_stack.begin);
> +
> +if(tiny_stack.deleter) {
> +tiny_stack.deleter(tiny_stack);
> +}
> +
> +auto& large

Re: [PATCH] Implement switch to separate SYSCALL call stack

2018-02-25 Thread Waldek Kozaczuk
Bump. 

Sent from my iPhone

> On Feb 5, 2018, at 00:58, Waldek Kozaczuk  wrote:
> 
> I am expecting this patch will need some tweaking but I think it 
> fundamentally implements SYSCALL call stack switch. I have tested in using my 
> golang working branch.
> 
> First of all I am assembly novice so I am not sure if I have not messed some 
> stuff in entry.S. For example I am not sure I have not messed the CFI - on 
> other hand given by previous GDB research it might be mute point given gdb 
> cannot handle backtracing non-contiguous stacks.
> 
> Secondly I am not sure if tiny stack is enough (1024). I am afraid there 
> maybe some scenarios where thread which calls setup_large_syscall_stack() may 
> be interrupted (or something else) and bigger stack is needed.
> 
> For example this is a crash I got only once when the tiny stack was 512 bytes 
> deep (not sure if increasing to 1024 has fully addressed it as I have not 
> seen it anymore):
> [backtrace]
> 0x0044d85c 
> 0x0044d88a  exception_frame*)+26>
> 0x0032b3eb 
> 0x0038b4e8 
> 0x0038a356 
> 0x003e5381  std::chrono::duration 
> >)+561>
> 0x003e583b 
> 0x003e5f01 
> 0x003c5957 
> 0x003d4885 
> 0x003d4a5a 
> 0x815d203f 
> 0x003e5381  std::chrono::duration 
> >)+561>
> 0x003e583b 
> 0x003e5f01 
> 0x003c5957 
> 0x003d4885 
> 0x003d4a5a 
> 0x003d4d46 
> 0x003e2780 
> 0x0038b324 
> 0x1113cd91 
> 0x11125e23 
> 0x1112c35f 
> 0x1112c2e5 
> 0x11306bf2 
> 0x20ddff3f 
> qemu failed.
> 
> I also tried to use mmap() and mprotect() (instead of malloc) to allocate 
> memory for large stack and it did not quite work (please see commented out 
> code in the patch). Everytime I ran it would crash like this (not event same 
> thread when mprotect() was called):
> 
> 0  0x00225930 in abort (fmt=fmt@entry=0x71d020 "exception nested too 
> deeply") at runtime.cc:130
> #1  0x00388804 in sched::arch_cpu::enter_exception (this= out>) at arch/x64/arch-cpu.cc:19
> #2  sched::exception_guard::exception_guard (this=) at 
> arch/x64/arch-cpu.cc:37
> #3  0x0038c6da in general_protection (ef=0x80f2f048) at 
> arch/x64/exceptions.cc:313
> #4  
> #5  0x0038890e in safe_load (data=, 
> potentially_bad_pointer=0x56415741e5894855) at ar
> ch/x64/safe-ptr.hh:33
> #6  backtrace_safe (pc=pc@entry=0x80f2d330, nr=nr@entry=128) at 
> arch/x64/backtrace.cc:25
> #7  0x0020203c in print_backtrace () at runtime.cc:79
> #8  0x0022591c in abort (fmt=fmt@entry=0x71d3a0 "general protection 
> fault\n") at runtime.cc:121
> #9  0x0038c73a in general_protection (ef=0x80f2e048) at 
> arch/x64/exceptions.cc:322
> #10 
> #11 std::unique_ptr std::default_delete >::get (this=0x5bd
> 8894808c483d0)
> at /usr/include/c++/5/bits/unique_ptr.h:305
> #12 
> sched::thread::do_wake_with(waiter::wake()::{lambda()#1},
>  unsigned int) (allowed_initi
> al_states_mask=24, 
> action=..., this=0x5bd8894808c48348) at include/osv/sched.hh:1261
> #13 
> sched::thread::wake_with_from_mutex(waiter::wake()::{lambda()#1})
>  (action=..., this=0x
> 5bd8894808c48348)
> at include/osv/sched.hh:1287
> #14 waiter::wake (this=0x33185e 
>  unsigned long)+30>
> )
> at include/osv/wait_record.hh:38
> #15 lockfree::mutex::lock (this=0xa2166648, 
> this@entry=0x82953040) at core/lfmutex.cc:74
> #16 lockfree_mutex_lock (m=m@entry=0xa2166648) at core/lfmutex.cc:282
> #17 0x002e213b in mutex_lock (m=0xa2166648) at 
> include/osv/mutex.h:59
> #18 sa_bulk_lookup (hdl=0xa2166648, 
> attrs=attrs@entry=0x829579c0, count=3)
> at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c:1851
> #19 0x00316cd4 in zfs_getattr (vp=0xa221e780, 
> vap=0x82957aa0)
> at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:2055
> #20 0x0042cf2c in vn_stat (vp=vp@entry=0xa221e780, 
> st=st@entry=0x82957b60) at fs/vfs/vfs_vnode.c
> c:325
> #21 0x0043029a in vfs_file::stat (this=, 
> st=0x82957b60) at fs/vfs/vfs_fops.cc:118
> #22 0x0041d48c in size (f=...) at fs/fs.cc:16
> #23 0x003300ed in mmu::file_vma::fault (this=0xa221ed00, 
> addr=17592205996032, ef=0x82958098) at 
> core/mmu.cc:1689
> #24 0x0032b484 in mmu::vm_fault (addr=, 
> addr@entry=17592205997277, ef=ef@entry=0x82958098
> ) at core/mmu.cc:1342
> #25 0x0038b529 in page_fault (ef=0x82958098) at 
> arch/x64/mmu.cc:38
> #26 
> #27 runtime.memmove () at 
> /home/wkozaczuk/tools/go/src/runtime/memmove_amd64.s:161
> #28 0x1112864f in runtime.recordForPanic (b=...) at 
> /home/wkozaczuk/tools/go/src/runtime/print.go:45
> 
> When I commented mprotect() (still mmap()) the test app would work but 
> oc

Re: [PATCH] Implement switch to separate SYSCALL call stack

2018-02-04 Thread Waldek Kozaczuk
I am expecting this patch will need some tweaking but I think it 
fundamentally implements SYSCALL call stack switch. I have tested in using 
my golang working branch.

First of all I am assembly novice so I am not sure if I have not messed 
some stuff in entry.S. For example I am not sure I have not messed the CFI 
- on other hand given by previous GDB research it might be mute point given 
gdb cannot handle backtracing non-contiguous stacks.

Secondly I am not sure if tiny stack is enough (1024). I am afraid there 
maybe some scenarios where thread which calls setup_large_syscall_stack() 
may be interrupted (or something else) and bigger stack is needed.

For example this is a crash I got only once when the tiny stack was 512 
bytes deep (not sure if increasing to 1024 has fully addressed it as I have 
not seen it anymore):
[backtrace]
0x0044d85c 
0x0044d88a 
0x0032b3eb 
0x0038b4e8 
0x0038a356 
0x003e5381  
>)+561>
0x003e583b 
0x003e5f01 
0x003c5957 
0x003d4885 
0x003d4a5a 
0x815d203f 
0x003e5381  
>)+561>
0x003e583b 
0x003e5f01 
0x003c5957 
0x003d4885 
0x003d4a5a 
0x003d4d46 
0x003e2780 
0x0038b324 
0x1113cd91 
0x11125e23 
0x1112c35f 
0x1112c2e5 
0x11306bf2 
0x20ddff3f 
qemu failed.

I also tried to use mmap() and mprotect() (instead of malloc) to allocate 
memory for large stack and it did not quite work (please see commented out 
code in the patch). Everytime I ran it would crash like this (not event 
same thread when mprotect() was called):

0  0x00225930 in abort (fmt=fmt@entry=0x71d020 "exception nested 
too deeply") at runtime.cc:130
#1  0x00388804 in sched::arch_cpu::enter_exception (this=) at arch/x64/arch-cpu.cc:19
#2  sched::exception_guard::exception_guard (this=) at 
arch/x64/arch-cpu.cc:37
#3  0x0038c6da in general_protection (ef=0x80f2f048) at 
arch/x64/exceptions.cc:313
#4  
#5  0x0038890e in safe_load (data=, 
potentially_bad_pointer=0x56415741e5894855) at ar
ch/x64/safe-ptr.hh:33
#6  backtrace_safe (pc=pc@entry=0x80f2d330, nr=nr@entry=128) at 
arch/x64/backtrace.cc:25
#7  0x0020203c in print_backtrace () at runtime.cc:79
#8  0x0022591c in abort (fmt=fmt@entry=0x71d3a0 "general protection 
fault\n") at runtime.cc:121
#9  0x0038c73a in general_protection (ef=0x80f2e048) at 
arch/x64/exceptions.cc:322
#10 
#11 std::unique_ptr >::get (this=0x5bd
8894808c483d0)
at /usr/include/c++/5/bits/unique_ptr.h:305
#12 
sched::thread::do_wake_with(waiter::wake()::{lambda()#1},
 
unsigned int) (allowed_initi
al_states_mask=24, 
action=..., this=0x5bd8894808c48348) at include/osv/sched.hh:1261
#13 
sched::thread::wake_with_from_mutex(waiter::wake()::{lambda()#1})
 
(action=..., this=0x
5bd8894808c48348)
at include/osv/sched.hh:1287
#14 waiter::wake (this=0x33185e 

)
at include/osv/wait_record.hh:38
#15 lockfree::mutex::lock (this=0xa2166648, 
this@entry=0x82953040) at core/lfmutex.cc:74
#16 lockfree_mutex_lock (m=m@entry=0xa2166648) at 
core/lfmutex.cc:282
#17 0x002e213b in mutex_lock (m=0xa2166648) at 
include/osv/mutex.h:59
#18 sa_bulk_lookup (hdl=0xa2166648, 
attrs=attrs@entry=0x829579c0, count=3)
at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c:1851
#19 0x00316cd4 in zfs_getattr (vp=0xa221e780, 
vap=0x82957aa0)
at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:2055
#20 0x0042cf2c in vn_stat (vp=vp@entry=0xa221e780, 
st=st@entry=0x82957b60) at fs/vfs/vfs_vnode.c
c:325
#21 0x0043029a in vfs_file::stat (this=, 
st=0x82957b60) at fs/vfs/vfs_fops.cc:118
#22 0x0041d48c in size (f=...) at fs/fs.cc:16
#23 0x003300ed in mmu::file_vma::fault (this=0xa221ed00, 
addr=17592205996032, ef=0x82958098) at 
core/mmu.cc:1689
#24 0x0032b484 in mmu::vm_fault (addr=, 
addr@entry=17592205997277, ef=ef@entry=0x82958098
) at core/mmu.cc:1342
#25 0x0038b529 in page_fault (ef=0x82958098) at 
arch/x64/mmu.cc:38
#26 
#27 runtime.memmove () at 
/home/wkozaczuk/tools/go/src/runtime/memmove_amd64.s:161
#28 0x1112864f in runtime.recordForPanic (b=...) at 
/home/wkozaczuk/tools/go/src/runtime/print.go:45

When I commented mprotect() (still mmap()) the test app would work but 
occasionally crash (I do not have an example). SO I wonder if there is some 
fundamental constraint which is violated when calling mmap()/mprotect() 
when setting up large stack (called from assembly in tiny stack) comparing 
to regular malloc(). Is it because mprotect() requires updating memory page 
tables and it triggers some logic that messes things up? If I cannot use 
mmap() how do I implement canary type of logic with malloc()

[PATCH] Implement switch to separate SYSCALL call stack

2018-02-04 Thread Waldemar Kozaczuk
This patch implements separate syscall call stack needed
when runtimes like Golang use SYSCALL instruction to make 
system call.

More specifically each application thread pre-alllocates 
"tiny" (1024 bytes deep) syscall call stack. When SYSCALL 
instruction is called the call stack is switched to the 
tiny stack and the "large" stack is allocated if has not 
been alllocated yet (first call on given thread). In the end 
SYSCALL ends up being exexuted on the "large" stack.

This patch is based on the original patch provided by @Hawxchen. 

Fixes #808

Signed-off-by: Waldemar Kozaczuk 
---
 arch/x64/arch-switch.hh | 61 +
 arch/x64/arch-tls.hh|  9 ++
 arch/x64/entry.S| 81 +
 include/osv/sched.hh|  7 +
 linux.cc|  5 +++
 5 files changed, 137 insertions(+), 26 deletions(-)

diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh
index d1a039a..f291a26 100644
--- a/arch/x64/arch-switch.hh
+++ b/arch/x64/arch-switch.hh
@@ -11,6 +11,7 @@
 #include "msr.hh"
 #include 
 #include 
+#include 
 
 extern "C" {
 void thread_main(void);
@@ -114,6 +115,7 @@ void thread::init_stack()
 _state.rip = reinterpret_cast(thread_main);
 _state.rsp = stacktop;
 _state.exception_stack = _arch.exception_stack + 
sizeof(_arch.exception_stack);
+debug("***-> %d, Stack top: %#x\n", this->_id, stacktop);
 }
 
 void thread::setup_tcb()
@@ -146,6 +148,50 @@ void thread::setup_tcb()
 _tcb = static_cast(p + tls.size + user_tls_size);
 _tcb->self = _tcb;
 _tcb->tls_base = p + user_tls_size;
+
+if(is_app()) {
+//
+// Allocate tiny syscall call stack
+auto& tiny_stack = _attr._tiny_syscall_stack;
+
+if(!tiny_stack.begin) {
+tiny_stack.size = 1024; //512 seems to be too small sometimes
+tiny_stack.begin = malloc(tiny_stack.size);
+tiny_stack.deleter = tiny_stack.default_deleter;
+}
+
+_tcb->tiny_syscall_stack_addr = tiny_stack.begin + tiny_stack.size;
+_tcb->large_syscall_stack_addr = nullptr;
+debug("***-> %d, Tiny syscall stack top: %#x with size: %d bytes\n", 
this->_id,
+  _tcb->tiny_syscall_stack_addr, tiny_stack.size);
+}
+}
+
+void thread::setup_large_syscall_stack()
+{
+auto& large_stack = _attr._large_syscall_stack;
+if(large_stack.begin) {
+return;
+}
+
+large_stack.size = 16 * PAGE_SIZE;
+
+/*
+ * Mmapping (without mprotect()) seems to work but occasionally leads to 
crashes (
+large_stack.begin = mmu::map_anon(nullptr, large_stack.size, 
mmu::mmap_populate, mmu::perm_rw);
+//mmu::mprotect(large_stack.begin, PAGE_SIZE, 0); //THIS DOES NOT WORK at 
all -> results in crash
+large_stack.deleter = free_large_syscall_stack;
+*/
+
+large_stack.begin = malloc(large_stack.size);
+large_stack.deleter = large_stack.default_deleter;
+
+_tcb->large_syscall_stack_addr = large_stack.begin + large_stack.size;
+}
+
+void thread::free_large_syscall_stack(sched::thread::stack_info si)
+{
+mmu::munmap(si.begin, si.size);
 }
 
 void thread::free_tcb()
@@ -156,6 +202,21 @@ void thread::free_tcb()
 } else {
 free(_tcb->tls_base);
 }
+
+if(is_app()) {
+auto& tiny_stack = _attr._tiny_syscall_stack;
+
+assert(tiny_stack.begin);
+
+if(tiny_stack.deleter) {
+tiny_stack.deleter(tiny_stack);
+}
+
+auto& large_stack = _attr._large_syscall_stack;
+if(large_stack.begin && large_stack.deleter) {
+large_stack.deleter(large_stack);
+}
+}
 }
 
 void thread_main_c(thread* t)
diff --git a/arch/x64/arch-tls.hh b/arch/x64/arch-tls.hh
index 1bf86fd..64a1788 100644
--- a/arch/x64/arch-tls.hh
+++ b/arch/x64/arch-tls.hh
@@ -8,9 +8,18 @@
 #ifndef ARCH_TLS_HH
 #define ARCH_TLS_HH
 
+// Don't change the declaration sequence of all existing members'.
+// Please add new members from the last.
 struct thread_control_block {
 thread_control_block* self;
 void* tls_base;
+// These fields, a per-thread stack for SYSCALL instruction,
+// are used in arch/x64/entry.S for %fs's offset.
+// We currently keep these fields in the TCB to make it easier to access 
in assembly code
+// through a known offset into %fs. But with more effort, we could have 
used an
+// ordinary thread-local variable instead and avoided extending the TCB 
here.
+void* tiny_syscall_stack_addr;
+void* large_syscall_stack_addr;
 };
 
 #endif /* ARCH_TLS_HH */
diff --git a/arch/x64/entry.S b/arch/x64/entry.S
index 04d809d..87f37b8 100644
--- a/arch/x64/entry.S
+++ b/arch/x64/entry.S
@@ -170,25 +170,57 @@ syscall_entry:
 .cfi_register rflags, r11 # r11 took previous rflags value
 # There is no ring transition and rflags are left unchanged.
 
-# Skip the "red zone" allowed by the AMD64 ABI (the caller used a
-# SYSCALL instruction an