20-some years ago I was very much involved in the integration of the
stack-protector into OpenBSD.  This subsystem was developed as a gcc
patch by Hiroaki Etoh.  Many years later it was adopted and
substantially rewritten for incorporation into mainline gcc.  Thus, for
a few years OpenBSD was the first & only system with the stack protector.

Miod Vallat, Dale Rahn, (some forgotten), and I incorporated the code
into OpenBSD, fixed many problems with Etoh, and made some decisions
along the way.

One of these decisions concerned which functions to protect.  The
original stack protector protected all functions, but this was too
expensive.  gcc at that time did not know what the types of the various
local variables were (to, for instance, look for character arrays).  It
only knew the total size.  So it was I who chose the value of 16, which
has infected the ecosystem for 2 decades.  Only functions with 16 or
more bytes of local storage are protected.

Another decision was where the stack protector check function was
located.  We placed it into libc.  And then we re-wrote it very
carefully to be reentrant and safe in all calling conditions; the
original proposal was not clean.

Originally there was only one cookie per program, but Matthew Dempsky
made changes so that every DSO (shared library object) had a different
cookie.  So a program like ssh would have 6 cookies, and far more than
that in a dynamically linked browser.  If you crash in one DSO and get
visibility of your own cookie, that doesn't help you do function calls
in another DSO (for example libc.so).

Via Matthew Dempsky and others, I provided ideas for a better heuristic
to select functions to protect rather than the 16-byte one, and
eventually some people at google wrote the -fstack-protector-strong
diffs for gcc and clang, because modern compilers keep better track of
the types and format of their local variables.

Years later, Todd Mortimer and I developed RETGUARD.  At the start of
that initiative he proposed we protect all functions, to try to guard
all the RET instructions, and therefore achieve a state we call
"ROP-free".  I felt this was impossible, but after a couple of hurdles
the RETGUARD performance was vastly better than the stack protector and
we were able to protect all functions and get to ROP-free (on
fixed-sized instruction architectures).  Performance was acceptable to
trade against
improved security.

On variable-sized instruction architectures, polymorphic RET and other
control flow instructions can and will surface, but the available RET
gadgets are seriously reduced and exploitation may not be possible.
Other methods attempt to reduce the impact of the poly-gadgets.  Still,
the effort has value on all architectures, though amd64 isn't as good as
arm64, riscv64, mips64, powerpc, or powerpc64.

RETGUARD provides per-function cookies, up to 4096 per DSO; the count is
limited to avoid excessive bloat.  It is difficult to do on
architectures with very few registers.  The code was only written for
clang; there is no gcc codebase doing it.  clang code for some
architectures (riscv64) was never written.

I hope that sets the stage for what is coming next.

We were able to enable RETGUARD on all functions because it was fast.

Why is RETGUARD fast, and the stack protector slow?  In pseudo-code, the
OpenBSD stack-protector model creates function prologue and epilogue which
look like this:

        local local_saved_cookie = per_DSO_cookie;

        <body of function>

        if (per_DSO_cookie != local_saved_cookie)
                __stack_smash_handler(function_name)
        <return from function>

This issues a warning to stdout, then raises a manual SIGABRT, and you
can use the debugger on the core file in the wrong frame (you are inside
__stack_smash_handler, not at the moment of the fault).

RETGUARD made a choice to not use a smash-reporting function, and
instead generates:

        <retguard_setup>

        <body of function>

        if (retguard_matches)
                <return from function>
        <illegal instruction>

So a detected corruption causes an immediate crash on the illegal
instruction; you get no detailed report.  But you can use the debugger
on the core file in exactly the correct frame (an improvement).

At first glance RETGUARD is faster because it has fewer instructions.

But remember we are now living in a speculative-execution universe.  A
few years ago some speculation researchers I talked to pointed out that
the instructions the stack-protector generates to call
__stack_smash_handler(), and even many instructions inside that function
itself, are fetched, decoded, and issued, and then their results are
discarded.  That's a waste of cpu resources.  It might be a slowdown
because those execution slots are not used exclusively for straight-line
speculation following the RET.  Modern cpus also have complicated
branch-target caches which may not be heuristically tuned to the stack
protector approach.

On the other hand the RETGUARD approach uses an illegal instruction (of
some sort), which is a speculation barrier. That prevents the cpu from
heading off into an alternative set of weeds.  It will go decode more
instructions along the post-RET execution path.

I filed that idea as interesting but did nothing with it.  Until now.

Here is a diff which changes clang (on a few architectures) to not issue
a call to __stack_smash_handler(), but use the RETGUARD approach.

Index: gnu/llvm/llvm/lib/CodeGen/StackProtector.cpp
RCS file: /cvs/src/gnu/llvm/llvm/lib/CodeGen/StackProtector.cpp,v
retrieving revision
diff -u -p -u -r1.1.1.3 StackProtector.cpp
--- gnu/llvm/llvm/lib/CodeGen/StackProtector.cpp        17 Dec 2021 12:23:27 
+++ gnu/llvm/llvm/lib/CodeGen/StackProtector.cpp        24 Sep 2023 15:38:23 
@@ -576,11 +576,27 @@ BasicBlock *StackProtector::CreateFailBB
         DILocation::get(Context, 0, 0, F->getSubprogram()));
   if (Trip.isOSOpenBSD()) {
-    FunctionCallee StackChkFail = M->getOrInsertFunction(
-        "__stack_smash_handler", Type::getVoidTy(Context),
-        Type::getInt8PtrTy(Context));
+    // ::ppc, ::ppc64, ::aarch64, and ::mips64 default to using
+    // RETGUARD, but if user forces stack-protector, we should figure out
+    // which of ::trap or ::debugtrap is suitable for each architecture
+    switch (Trip.getArch()) {
+      case llvm::Triple::x86:
+      case llvm::Triple::x86_64:
+      case llvm::Triple::riscv64:
+      case llvm::Triple::sparcv9:
+        B.CreateCall(Intrinsic::getDeclaration(M, Intrinsic::debugtrap));
+       break;
+      case llvm::Triple::arm:
+        B.CreateCall(Intrinsic::getDeclaration(M, Intrinsic::trap));
+       break;
+      default:
+        FunctionCallee StackChkFail = M->getOrInsertFunction(
+          "__stack_smash_handler", Type::getVoidTy(Context),
+          Type::getInt8PtrTy(Context));
-    B.CreateCall(StackChkFail, B.CreateGlobalStringPtr(F->getName(), "SSH"));
+        B.CreateCall(StackChkFail,
+          B.CreateGlobalStringPtr(F->getName(), "SSH"));
+       break;
+    }
   } else {
     FunctionCallee StackChkFail =
         M->getOrInsertFunction("__stack_chk_fail", Type::getVoidTy(Context));

In testing, it is quite a bit faster.  The binaries are also a bit smaller.

The ::debugtrap versus ::trap thing is an ugly wart inside clang.  I'm
reusing an instruction issuer intended for other purposes, rather than
writing a new ::ssptrap issuer for every relevant architecture.

Here is a similar diff which does this in gcc4 on the alpha:

Index: gnu/gcc/gcc/function.c
RCS file: /cvs/src/gnu/gcc/gcc/function.c,v
retrieving revision
diff -u -p -u -r1.1.1.1 function.c
--- gnu/gcc/gcc/function.c      15 Oct 2009 17:11:28 -0000
+++ gnu/gcc/gcc/function.c      24 Sep 2023 01:09:06 -0000
@@ -4031,7 +4031,11 @@ stack_protect_epilogue (void)
   if (JUMP_P (tmp))
     predict_insn_def (tmp, PRED_NORETURN, TAKEN);
+#if defined(__alpha__)
+  emit_insn (gen_illop ());
+#else
   expand_expr_stmt (targetm.stack_protect_fail ());
+#endif
   emit_label (label);
Index: gnu/gcc/gcc/config/alpha/alpha.md
RCS file: /cvs/src/gnu/gcc/gcc/config/alpha/alpha.md,v
retrieving revision 1.3
diff -u -p -u -r1.3 alpha.md
--- gnu/gcc/gcc/config/alpha/alpha.md   10 Dec 2012 18:06:12 -0000      1.3
+++ gnu/gcc/gcc/config/alpha/alpha.md   23 Sep 2023 02:14:29 -0000
@@ -7181,6 +7181,12 @@
   "ldq_u $31,0($30)")
+(define_insn "illop"
+  [(const_int 3)]
+  ""
+  "halt"
+  [(set_attr "type" "ilog")])
 ;; On Unicos/Mk we use a macro for aligning code.
 (define_insn "realign"

Alpha is also faster.

The same approach should work on sparc64, but there is an internal gcc
bug to chase first, and it may need a different approach for emitting
the illegal instruction in the right place.  clang on sparc64 actually
generates correct binaries with the diff above, but gcc is still our
default compiler.

But I'd rather have much-more-secure machines than slightly-faster machines.

So the question is, with these new less-expensive epilogues can we
enable -fstack-protector-all by default as we did with RETGUARD, and
achieve acceptable performance?

This diff turns on -fstack-protector-all for i386 and riscv64.
(I am still testing 32-bit arm).

Index: gnu/llvm/clang/lib/Driver/ToolChains/OpenBSD.h
RCS file: /cvs/src/gnu/llvm/clang/lib/Driver/ToolChains/OpenBSD.h,v
retrieving revision 1.3
diff -u -p -u -r1.3 OpenBSD.h
--- gnu/llvm/clang/lib/Driver/ToolChains/OpenBSD.h      9 Mar 2022 00:09:45 -0000       1.3
+++ gnu/llvm/clang/lib/Driver/ToolChains/OpenBSD.h      23 Sep 2023 21:36:18 
@@ -91,7 +91,13 @@ public:
   GetDefaultStackProtectorLevel(bool KernelOrKext) const override {
-    return LangOptions::SSPStrong;
+    switch (getArch()) {
+      case llvm::Triple::x86:
+      case llvm::Triple::riscv64:
+        return LangOptions::SSPReq;
+      default:
+        return LangOptions::SSPStrong;
+    }
  }

  unsigned GetDefaultDwarfVersion() const override { return 2; }

I don't feel a performance loss from this.  The first round of testing
sees a 1% slowdown or something?  Anyways I'm not seeing anything
catastrophic.

Oh, another necessary step: We have traditionally disabled RETGUARD in
the compiler tools because there is no escalation risk and in that
binary the cost is too high, so we do the same thing for these machines,
pushing it back to stack-protector-strong.  It may seem like a cheat,
but it is not a new cheat, we did this with RETGUARD years ago.

Index: gnu/usr.bin/clang/Makefile.inc
RCS file: /cvs/src/gnu/usr.bin/clang/Makefile.inc,v
retrieving revision 1.26
diff -u -p -u -r1.26 Makefile.inc
--- gnu/usr.bin/clang/Makefile.inc      12 May 2022 15:51:23 -0000      1.26
+++ gnu/usr.bin/clang/Makefile.inc      22 Sep 2023 15:19:29 -0000
@@ -32,6 +32,11 @@ CPPFLAGS+=   -DNDEBUG
     ${MACHINE_ARCH} == "mips64el" || ${MACHINE_ARCH} == "powerpc"
 CXXFLAGS+=     -fno-ret-protector
+.if ${MACHINE_ARCH} == "i386" || ${MACHINE_ARCH} == "riscv64" || \
+    ${MACHINE_ARCH} == "arm"
+CXXFLAGS+=     -fstack-protector-strong
+.endif
 .if ${MACHINE_ARCH} == "amd64" || ${MACHINE_ARCH} == "i386"
 CXXFLAGS+=     -mno-retpoline

In gcc4, the alpha can go to stack-protector-all, to achieve ROP-free.

Index: gnu/gcc/gcc/common.opt
RCS file: /cvs/src/gnu/gcc/gcc/common.opt,v
retrieving revision 1.16
diff -u -p -u -r1.16 common.opt
--- gnu/gcc/gcc/common.opt      10 Jan 2023 12:06:18 -0000      1.16
+++ gnu/gcc/gcc/common.opt      24 Sep 2023 05:01:55 -0000
@@ -871,7 +871,7 @@ Common Report Var(flag_stack_protect, 3)
 Use propolice as a stack protection method
-Common Report RejectNegative Var(flag_stack_protect, 2) VarExists
+Common Report Var(flag_stack_protect, 2) VarExists
 Use a stack protection method for every function
Index: gnu/gcc/gcc/toplev.c
RCS file: /cvs/src/gnu/gcc/gcc/toplev.c,v
retrieving revision 1.7
diff -u -p -u -r1.7 toplev.c
--- gnu/gcc/gcc/toplev.c        8 May 2017 20:58:40 -0000       1.7
+++ gnu/gcc/gcc/toplev.c        24 Sep 2023 16:06:17 -0000
@@ -1829,6 +1829,13 @@ process_options (void)
   if (flag_cx_limited_range)
     flag_complex_method = 0;
+  /* On OpenBSD/alpha, stack-protector-all tries to give us ROP-free */
+  if (flag_stack_protect == -1) {
+#ifdef __alpha__
+      flag_stack_protect = FRAME_GROWS_DOWNWARD ? 2 : 0;
+#endif
+  }
   /* Targets must be able to place spill slots at lower addresses.  If the
      target already uses a soft frame pointer, the transition is trivial.  */
   if (flag_stack_protect == -1)

Again, I don't sense a serious performance loss on alpha.

Unfinished work:

1. More detailed performance analysis.  We can give up a small amount of
   performance for this security.

2. To actually get to ROP-free, there are ASM functions which must be
   manually modified to add stack-protector prologue/epilogue, otherwise
   they expose naked RET-gadgets.

3. Continue the sparc64 effort.

4. Expose the Linux people to this surprising change.  They would
   probably appreciate the performance increase, and perhaps try the
   ROP-free changes.
