20-some years ago I was very much involved in the integration of the
stack-protector into OpenBSD.  This subsystem was developed as a gcc
patch by Hiroaki Etoh.  Many years later it was adopted and
substantially rewritten for incorporation into mainline gcc.  Thus, for
a few years OpenBSD was the first & only system with the stack protector.

Miod Vallat, Dale Rahn, (some forgotten), and I incorporated the code
into OpenBSD, fixed many problems with Etoh, and made some decisions
along the way.

One of these decisions concerned which functions to protect.  The
original stack protector protected all functions, but this was too
expensive.  gcc at that time did not know what the types of the various
local variables were (to, for instance, look for character arrays).  It
only knew the total size.  So it was I who chose the value of 16, which
has infected the ecosystem for 2 decades.  Only functions with 16 or
more bytes of local storage are protected.

Another decision was where the stack protector check function was
located.  We placed it into libc.  And then we re-wrote it very
carefully to be reentrant and safe in all calling conditions; the
original proposal was not clean.

Originally there was only one cookie per program, but Matthew Dempsky
made changes so that every DSO (shared library object) had a different
cookie.  So a program like ssh would have 6 cookies, and far more than
that in a dynamically linked browser.  If you crash in one DSO and get
visibility of your own cookie, that doesn't help you do function calls
in another DSO (for example libc.so).

Via Matthew Dempsky and others, I provided ideas for a better heuristic
to select functions to protect rather than the 16-byte one, and
eventually some people at google wrote the -fstack-protector-strong
diffs for gcc and clang, because modern compilers keep better track of
the types and format of their local variables.

Years later, Todd Mortimer and I developed RETGUARD.  At the start of
that initiative he proposed we protect all functions, to try to guard
all the RET instructions, and therefore achieve a state we call
"ROP-free".  I felt this was impossible, but after a couple of hurdles
the RETGUARD performance was vastly better than the stack protector and
we were able to protect all functions and get to ROP-free (on
fixed-sized instruction architectures).  Performance was acceptable to
trade against
improved security.

On variable-sized instruction architectures, polymorphic RET and other
control flow instructions can and will surface, but the available RET
gadgets are seriously reduced and exploitation may not be possible.
Other methods attempt to reduce the impact of the poly-gadgets.  Still,
the effort has value on all architectures, though amd64 isn't as good as
arm64, riscv64, mips64, powerpc, or powerpc64.

RETGUARD provides per-function cookies, up to 4096 per DSO; the count is
limited to avoid excessive bloat.  It is difficult to do on
architectures with very few registers.  The code was only written for
clang; there is no gcc codebase doing it.  clang code for some
architectures (riscv64) was never written.

I hope that sets the stage for what is coming next.

We were able to enable RETGUARD on all functions because it was fast.

Why is RETGUARD fast, and the stack protector slow?  In pseudo-code, the
OpenBSD stack-protector model creates function prologue and epilogue which
look like this:

        local local_saved_cookie = per_DSO_cookie;

        <body of function>

        if (per_DSO_cookie != local_saved_cookie)
                __stack_smash_handler(function_name)
        <return from function>

This issues a warning to stdout, then raises a manual SIGABRT, and you
can use the debugger on the core file in the wrong frame (you are inside
__stack_smash_handler, not at the moment of the fault).

RETGUARD made a choice to not use a smash-reporting function, and
instead generates:

        <retguard_setup>

        <body of function>

        if (retguard_matches)
                <return from function>
        <illegal instruction>

So a detected corruption causes an immediate crash on the illegal
instruction; you get no detailed report.  But you can use the debugger
on the core file in exactly the correct frame (an improvement).

At first glance RETGUARD is faster because it has fewer instructions.

But remember we are now living in a speculative-execution universe.  A
few years ago some speculation researchers I talked to pointed out that
the instructions the stack-protector generates to call
__stack_smash_handler(), and even many instructions inside that function
itself, are fetched, decoded, and issued, and then their results are
discarded.  That's a waste of cpu resources.  It might be a slowdown
because those execution slots are not used exclusively for straight-line
speculation following the RET.  Modern cpus also have complicated
branch-target caches which may not be heuristically tuned to the stack
protector approach.

On the other hand the RETGUARD approach uses an illegal instruction (of
some sort), which is a speculation barrier. That prevents the cpu from
heading off into an alternative set of weeds.  It will go decode more
instructions along the post-RET execution path.

I filed that idea as interesting but did nothing with it.  Until now.

Here is a diff which changes clang (on a few architectures) to not issue
a call to __stack_smash_handler(), but use the RETGUARD approach.

Index: gnu/llvm/llvm/lib/CodeGen/StackProtector.cpp
RCS file: /cvs/src/gnu/llvm/llvm/lib/CodeGen/StackProtector.cpp,v
retrieving revision
diff -u -p -u -r1.1.1.3 StackProtector.cpp
--- gnu/llvm/llvm/lib/CodeGen/StackProtector.cpp        17 Dec 2021 12:23:27 
+++ gnu/llvm/llvm/lib/CodeGen/StackProtector.cpp        24 Sep 2023 15:38:23 
@@ -576,11 +576,27 @@ BasicBlock *StackProtector::CreateFailBB
         DILocation::get(Context, 0, 0, F->getSubprogram()));
   if (Trip.isOSOpenBSD()) {
-    FunctionCallee StackChkFail = M->getOrInsertFunction(
-        "__stack_smash_handler", Type::getVoidTy(Context),
-        Type::getInt8PtrTy(Context));
+    // ::ppc, ::ppc64, ::aarch64, and ::mips64 default to using
+    // RETGUARD, but if user forces stack-protector, we should figure out
+    // which of ::trap or ::debugtrap is suitable for each architecture
+    switch (Trip.getArch()) {
+      case llvm::Triple::x86:
+      case llvm::Triple::x86_64:
+      case llvm::Triple::riscv64:
+      case llvm::Triple::sparcv9:
+        B.CreateCall(Intrinsic::getDeclaration(M, Intrinsic::debugtrap));
+       break;
+      case llvm::Triple::arm:
+        B.CreateCall(Intrinsic::getDeclaration(M, Intrinsic::trap));
+       break;
+      default:
+        FunctionCallee StackChkFail = M->getOrInsertFunction(
+          "__stack_smash_handler", Type::getVoidTy(Context),
+          Type::getInt8PtrTy(Context));
-    B.CreateCall(StackChkFail, B.CreateGlobalStringPtr(F->getName(), "SSH"));
+        B.CreateCall(StackChkFail,
+          B.CreateGlobalStringPtr(F->getName(), "SSH"));
+       break;
+    }
   } else {
     FunctionCallee StackChkFail =
         M->getOrInsertFunction("__stack_chk_fail", Type::getVoidTy(Context));

In testing, it is quite a bit faster.  The binaries are also a bit smaller.

The ::debugtrap versus ::trap thing is an ugly wart inside clang.  I'm
reusing an instruction issuer intended for other purposes, rather than
writing a new ::ssptrap issuer for every relevant architecture.

Here is a similar diff which does this in gcc4 on the alpha:

Index: gnu/gcc/gcc/function.c
RCS file: /cvs/src/gnu/gcc/gcc/function.c,v
retrieving revision
diff -u -p -u -r1.1.1.1 function.c
--- gnu/gcc/gcc/function.c      15 Oct 2009 17:11:28 -0000
+++ gnu/gcc/gcc/function.c      24 Sep 2023 01:09:06 -0000
@@ -4031,7 +4031,11 @@ stack_protect_epilogue (void)
   if (JUMP_P (tmp))
     predict_insn_def (tmp, PRED_NORETURN, TAKEN);
+#if defined(__alpha__)
+  emit_insn (gen_illop ());
+#else
   expand_expr_stmt (targetm.stack_protect_fail ());
+#endif
   emit_label (label);
Index: gnu/gcc/gcc/config/alpha/alpha.md
RCS file: /cvs/src/gnu/gcc/gcc/config/alpha/alpha.md,v
retrieving revision 1.3
diff -u -p -u -r1.3 alpha.md
--- gnu/gcc/gcc/config/alpha/alpha.md   10 Dec 2012 18:06:12 -0000      1.3
+++ gnu/gcc/gcc/config/alpha/alpha.md   23 Sep 2023 02:14:29 -0000
@@ -7181,6 +7181,12 @@
   "ldq_u $31,0($30)")
+(define_insn "illop"
+  [(const_int 3)]
+  ""
+  "halt"
+  [(set_attr "type" "ilog")])
 ;; On Unicos/Mk we use a macro for aligning code.
 (define_insn "realign"

Alpha is also faster.

The same approach should work on sparc64, but there is an internal gcc
bug to chase first, and it may need a different approach for emitting
the illegal instruction in the right place.  clang on sparc64 actually
generates correct binaries with the diff above, but gcc is still our
default compiler.

But I'd rather have much-more-secure machines than slightly-faster machines.

So the question is, with these new less-expensive epilogues can we
enable -fstack-protector-all by default as we did with RETGUARD, and
achieve acceptable performance?

This diff turns on -fstack-protector-all for i386 and riscv64.
(I am still testing 32-bit arm).

Index: gnu/llvm/clang/lib/Driver/ToolChains/OpenBSD.h
RCS file: /cvs/src/gnu/llvm/clang/lib/Driver/ToolChains/OpenBSD.h,v
retrieving revision 1.3
diff -u -p -u -r1.3 OpenBSD.h
--- gnu/llvm/clang/lib/Driver/ToolChains/OpenBSD.h      9 Mar 2022 00:09:45 -0000       1.3
+++ gnu/llvm/clang/lib/Driver/ToolChains/OpenBSD.h      23 Sep 2023 21:36:18 
@@ -91,7 +91,13 @@ public:
   GetDefaultStackProtectorLevel(bool KernelOrKext) const override {
-    return LangOptions::SSPStrong;
+    switch (getArch()) {
+      case llvm::Triple::x86:
+      case llvm::Triple::riscv64:
+        return LangOptions::SSPReq;
+      default:
+        return LangOptions::SSPStrong;
+    }
  }

  unsigned GetDefaultDwarfVersion() const override { return 2; }

I don't feel a performance loss from this.  The first round of testing
sees a 1% slowdown or something?  Anyways I'm not seeing anything
catastrophic.

Oh, another necessary step: We have traditionally disabled RETGUARD in
the compiler tools because there is no escalation risk and in that
binary the cost is too high, so we do the same thing for these machines,
pushing it back to stack-protector-strong.  It may seem like a cheat,
but it is not a new cheat, we did this with RETGUARD years ago.

Index: gnu/usr.bin/clang/Makefile.inc
RCS file: /cvs/src/gnu/usr.bin/clang/Makefile.inc,v
retrieving revision 1.26
diff -u -p -u -r1.26 Makefile.inc
--- gnu/usr.bin/clang/Makefile.inc      12 May 2022 15:51:23 -0000      1.26
+++ gnu/usr.bin/clang/Makefile.inc      22 Sep 2023 15:19:29 -0000
@@ -32,6 +32,11 @@ CPPFLAGS+=   -DNDEBUG
     ${MACHINE_ARCH} == "mips64el" || ${MACHINE_ARCH} == "powerpc"
 CXXFLAGS+=     -fno-ret-protector
+.if ${MACHINE_ARCH} == "i386" || ${MACHINE_ARCH} == "riscv64" || \
+    ${MACHINE_ARCH} == "arm"
+CXXFLAGS+=     -fstack-protector-strong
+.endif
 .if ${MACHINE_ARCH} == "amd64" || ${MACHINE_ARCH} == "i386"
 CXXFLAGS+=     -mno-retpoline

In gcc4, the alpha can go to stack-protector-all, to achieve ROP-free.

Index: gnu/gcc/gcc/common.opt
RCS file: /cvs/src/gnu/gcc/gcc/common.opt,v
retrieving revision 1.16
diff -u -p -u -r1.16 common.opt
--- gnu/gcc/gcc/common.opt      10 Jan 2023 12:06:18 -0000      1.16
+++ gnu/gcc/gcc/common.opt      24 Sep 2023 05:01:55 -0000
@@ -871,7 +871,7 @@ Common Report Var(flag_stack_protect, 3)
 Use propolice as a stack protection method
-Common Report RejectNegative Var(flag_stack_protect, 2) VarExists
+Common Report Var(flag_stack_protect, 2) VarExists
 Use a stack protection method for every function
Index: gnu/gcc/gcc/toplev.c
RCS file: /cvs/src/gnu/gcc/gcc/toplev.c,v
retrieving revision 1.7
diff -u -p -u -r1.7 toplev.c
--- gnu/gcc/gcc/toplev.c        8 May 2017 20:58:40 -0000       1.7
+++ gnu/gcc/gcc/toplev.c        24 Sep 2023 16:06:17 -0000
@@ -1829,6 +1829,13 @@ process_options (void)
   if (flag_cx_limited_range)
     flag_complex_method = 0;
+  /* On OpenBSD/alpha, stack-protector-all tries to give us ROP-free */
+  if (flag_stack_protect == -1) {
+#ifdef __alpha__
+      flag_stack_protect = FRAME_GROWS_DOWNWARD ? 2 : 0;
+#endif
+  }
   /* Targets must be able to place spill slots at lower addresses.  If the
      target already uses a soft frame pointer, the transition is trivial.  */
   if (flag_stack_protect == -1)

Again, I don't sense a serious performance loss on alpha.

Unfinished work:

1. More detailed performance analysis.  We can give up a small amount of
   performance for this security.

2. To actually get to ROP-free, there are ASM functions which must be
   manually modified to add stack-protector prologue/epilogue, otherwise
   they expose naked RET-gadgets.

3. Continue the sparc64 effort.

4. Expose the Linux people to this surprising change.  They would
   probably appreciate the performance increase, and perhaps try the
   ROP-free changes.
