Here is an attempt to clean up backslash() in tr(1). There are two real problems:
- We should not treat '8' or '9' as being a part of an octal escape. They not octal digits. This yields some strange results: $ printf AB | tr AB '\78' | hexdump -c 0000000 @ @ 0000002 ... Octal 070 + decimal 8 is octal 0100, i.e. '@'. We use it twice to fill in string2, hence two '@'. With this patch we now produce the correct result: $ printf AB | obj/tr AB '\78' | hexdump -c 0000000 \a 8 0000002 As expected 'A' -> 07 ('\a') and 'B' -> '8'. - We should reject octal values over UCHAR_MAX. Our tr(1) is byte-oriented, accepting backslashed octal values beyond UCHAR_MAX makes no sense in most contexts. One context where it makes sense: we *do* accept backslash octals as sequence lengths, i.e. you can do this: [a*\777] to specify 511 'a' characters. This is an extension to the spec. I don't think this extension is particularly useful, though. We already accept sequence lengths in decimal, leading-zero octal, or leading-0x hexademical already. Truncating the valid range of backslash octals to throw out invalid bytes in other contexts seems like a fair trade to me. Then there are some stylistic things that make the code harder to understand: - We should handle empty escapes, i.e. the backslash is at the end of the NUL-terminated string, at the top of the function. Leaves less guesswork for later. Probably worth noting that this is an extension to the spec. POSIX says empty escapes are undefined. I think ours is reasonable behavior, though. - Use the for-loop conditional to terminate instead of having multiple break conditions in the loop body. Makes it clearer we expect up to three octal digits. While here, ANSI-fy the function definition. Thoughts? OK? Index: str.c =================================================================== RCS file: /cvs/src/usr.bin/tr/str.c,v retrieving revision 1.12 diff -u -p -r1.12 str.c --- str.c 5 Dec 2012 23:20:26 -0000 1.12 +++ str.c 31 Oct 2021 16:31:31 -0000 @@ -32,6 +32,7 @@ #include <sys/types.h> +#include <assert.h> #include <errno.h> #include <stddef.h> #include <stdio.h> @@ -280,25 +281,34 @@ genseq(s) * an escape code or a literal character. */ static int -backslash(s) - STR *s; +backslash(STR *s) { - int ch, cnt, val; + size_t i; + int ch, val; - for (cnt = val = 0;;) { - ch = *++s->str; - if (!isascii(ch) || !isdigit(ch)) - break; - val = val * 8 + ch - '0'; - if (++cnt == 3) { - ++s->str; + assert(*s->str == '\\'); + s->str++; + + /* Empty escapes become plain backslashes. */ + if (*s->str == '\0') { + s->state = EOS; + return ('\\'); + } + + val = 0; + for (i = 0; i < 3; i++) { + if (s->str[i] < '0' || '7' < s->str[i]) break; - } + val = val * 8 + s->str[i] - '0'; } - if (cnt) + if (i > 0) { + if (val > UCHAR_MAX) + errx(1, "octal value out of range: %d", val); + s->str += i; return (val); - if (ch != '\0') - ++s->str; + } + + ch = *s->str++; switch (ch) { case 'a': /* escape characters */ return ('\7'); @@ -314,9 +324,6 @@ backslash(s) return ('\t'); case 'v': return ('\13'); - case '\0': /* \" -> \ */ - s->state = EOS; - return ('\\'); default: /* \x" -> x */ return (ch); }