Here is an attempt to clean up backslash() in tr(1).

There are two real problems:

- We should not treat '8' or '9' as being a part of an octal escape.
They not octal digits.  This yields some strange results:

$ printf AB | tr AB '\78' | hexdump -c
0000000   @   @                                                        
0000002

  ... Octal 070 + decimal 8 is octal 0100, i.e. '@'.  We use it
  twice to fill in string2, hence two '@'.

  With this patch we now produce the correct result:

$ printf AB | obj/tr AB '\78' | hexdump -c
0000000  \a   8                                                        
0000002

  As expected 'A' -> 07 ('\a') and 'B' -> '8'.

- We should reject octal values over UCHAR_MAX.  Our tr(1) is
  byte-oriented, accepting backslashed octal values beyond
  UCHAR_MAX makes no sense in most contexts.

  One context where it makes sense: we *do* accept backslash
  octals as sequence lengths, i.e. you can do this:

        [a*\777]

  to specify 511 'a' characters.

  This is an extension to the spec.  I don't think this extension is
  particularly useful, though.  We already accept sequence lengths in
  decimal, leading-zero octal, or leading-0x hexademical already.
  Truncating the valid range of backslash octals to throw out invalid
  bytes in other contexts seems like a fair trade to me.

Then there are some stylistic things that make the code harder
to understand:

- We should handle empty escapes, i.e. the backslash is at the end of
  the NUL-terminated string, at the top of the function.  Leaves less
  guesswork for later.

  Probably worth noting that this is an extension to the spec.  POSIX
  says empty escapes are undefined.  I think ours is reasonable behavior,
  though.

- Use the for-loop conditional to terminate instead of having multiple
  break conditions in the loop body.  Makes it clearer we expect up to
  three octal digits.

While here, ANSI-fy the function definition.

Thoughts?  OK?

Index: str.c
===================================================================
RCS file: /cvs/src/usr.bin/tr/str.c,v
retrieving revision 1.12
diff -u -p -r1.12 str.c
--- str.c       5 Dec 2012 23:20:26 -0000       1.12
+++ str.c       31 Oct 2021 16:31:31 -0000
@@ -32,6 +32,7 @@
 
 #include <sys/types.h>
 
+#include <assert.h>
 #include <errno.h>
 #include <stddef.h>
 #include <stdio.h>
@@ -280,25 +281,34 @@ genseq(s)
  * an escape code or a literal character.
  */
 static int
-backslash(s)
-       STR *s;
+backslash(STR *s)
 {
-       int ch, cnt, val;
+       size_t i;
+       int ch, val;
 
-       for (cnt = val = 0;;) {
-               ch = *++s->str;
-               if (!isascii(ch) || !isdigit(ch))
-                       break;
-               val = val * 8 + ch - '0';
-               if (++cnt == 3) {
-                       ++s->str;
+       assert(*s->str == '\\');
+       s->str++;
+
+       /* Empty escapes become plain backslashes. */
+       if (*s->str == '\0') {
+               s->state = EOS;
+               return ('\\');
+       }
+
+       val = 0;
+       for (i = 0; i < 3; i++) {
+               if (s->str[i] < '0' || '7' < s->str[i])
                        break;
-               }
+               val = val * 8 + s->str[i] - '0';
        }
-       if (cnt)
+       if (i > 0) {
+               if (val > UCHAR_MAX)
+                       errx(1, "octal value out of range: %d", val);
+               s->str += i;
                return (val);
-       if (ch != '\0')
-               ++s->str;
+       }
+
+       ch = *s->str++;
        switch (ch) {
                case 'a':                       /* escape characters */
                        return ('\7');
@@ -314,9 +324,6 @@ backslash(s)
                        return ('\t');
                case 'v':
                        return ('\13');
-               case '\0':                      /*  \" -> \ */
-                       s->state = EOS;
-                       return ('\\');
                default:                        /* \x" -> x */
                        return (ch);
        }

Reply via email to