Re: snmp(1) initial UTF-8 support

Martijn van Duren Sat, 13 Jun 2020 01:13:32 -0700
And of course I still had a potential buffer overflow in there...

On Sat, 2020-06-13 at 09:16 +0200, Martijn van Duren wrote:
> Minor change: I forgot to forward the display_hint flag to
> smi_displayhint_os. Now -OQ and -Oq work as well.
> 
> On Thu, 2020-06-11 at 21:17 +0200, Martijn van Duren wrote:
> > Hello Ingo,
> > 
> > Thanks for looking into this.
> > 
> > On Sun, 2020-05-31 at 16:32 +0200, Ingo Schwarze wrote:
> > > Hi Martijn,
> > > 
> > > Martijn van Duren wrote on Tue, May 19, 2020 at 10:25:37PM +0200:
> > > 
> > > > So according to RFC2579 an octetstring can contain UTF-8 characters if  
> > > > so described in the DISPLAY-HINT. One of the main consumers of this is
> > > > SnmpAdminString, which is used quite a lot.
> > > > 
> > > > Now that we trimmed a little fat from snmp's oid, I wanted to fill it
> > > > up again and implemented the bare minimum DISPLAY-HINT parsing that is
> > > > required for SnmpAdminString. Other parts of the syntax can be atter
> > > > at a later state if desirable.
> > > > 
> > > > Now I decided to implement this a little different from net-snmp,
> > > > because they throw incomplete sequences directly to the terminal,
> > > > which I recon is not a sane approach.
> > > 
> > > No, definitely not.  Never throw invalid or incomplete UTF-8 at the
> > > terminal unless the user unambiguously requests just that (like
> > > with `printf "a\x80z\n"` or `cat /usr/bin/cc` or something like that).
> > > 
> > > > I choose to take the approach taken by the likes of wall(1) and replace
> > > > invalid and incomplete sequences with a '?'.
> > > 
> > > If the specification says that the user is only allowed to put valid
> > > UTF-8 into octetstrings (and not arbitrary bytes), replacing invalid
> > > bytes at output time is both acceptable and feels like the only
> > > sensible option, unless something like vis(3) is considered more
> > > appropriate.
> > 
> > RFC2579 section 3.1 bullet (3): or `t' for UTF-8
> > So it should be valid UTF-8.
> > > However, the reason why wall(1) uses '?' as the replacement character
> > > is because wall(1) must not assume that the output device can handle
> > > UTF-8 and has to work safely if all the output device can handle is
> > > ASCII.  So, wall(1) has to live with the inconvenience that when you
> > > see a question mark, you don't know whether it really is a question
> > > mark or whether it originally used to be garbage input.
> > 
> > Fair enough.
> > > If i understand correctly, here we are talking about output that is
> > > UTF-8 anyway.  So instead of (ab)using the question mark, you might
> > > consider using the U+FFFD REPLACEMENT CHARACTER.  The intended
> > > purpose of that one is to stand in for invalid and inconplete
> > > byte sequences, and also for characters that cannot be displayed.
> > 
> > I like that one, used in diff below.
> > > > This because DISPLAY-HINT already states that too long sequences
> > > > strings should be cut short at the final multibyte character fitting
> > > > inside the specified length, meaning that output can already get
> > > > mangled.
> > > > 
> > > > This also brings me to the question how we should handle DISPLAY-HINT
> > > > in the case of -Oa and -Ox. Net-snmp prioritizes DISPLAY-HINT over
> > > > these, but it has the option to disable this by disabling MIB-loading
> > > > via -m''; This implementation doesn't give us the that option (yet?).
> > > > The current diff follows net-snmp, but there is something to say to
> > > > prioritize -Ox over DISPLAY-HINT, so an admin can use it to debug
> > > > snmp-output without it being mangled. Any feedback here is welcome.
> > > > 
> > > > Once it's clear if this is the right approach I'll do a thorough search
> > > > through mib.h on which objects actually can use this new definition.
> > > 
> > > I can't really comment on what snmp(1) should do, or which settings
> > > should override which other settings.
> > 
> > I'll leave it as is for now. If people actually need to original mangled
> > data we can always revisit. It's not something major that people heavily
> > rely on.
> > > Do i understand correctly that you want to make -Oa print invalid
> > > UTF-8 to the terminal?
> > 
> > No, to be in line with net-snmp, what snmp(1) does is the following:
> > - Without a -O flag it sees if all bytes are printable and if so print
> >   the string, if not prints the full string in hex.
> > - With -Oa it prints all printable bytes and replaces unprintable
> >   characters with '.'.
> > - With -Ox it prints all bytes in hex.
> > 
> > > If so, that would sound reasonable to me,
> > > for the following reason.  Printing non-printable ASCII to the
> > > terminal is often dangerous, it can screw up or reconfigure the
> > > terminal, so -Oa is already an option that must be used with care,
> > > just like `cat /usr/bin/cc`.  While printing invalid UTF-8 is not
> > > a good idea in general, it is less dangerous than printing non-printable
> > > ASCII, so just passing the bytes through for -Oa doesn't make
> > > anything more dangerous than it already is.
> > > 
> > > Maybe -Oa should have a warning in the manual page about the dangers
> > > of using it on untrusted data?  Not sure.
> > 
> > So no, it's not dangerous.
> > > Oh, by the way, when -Oa is not specified and there is UTF-8
> > > content, don't forget to still filter out non-printable bytes:
> > > both non-printable ASCII bytes and non-printable UTF-8 characters.
> > > Right?
> > 
> > You're right. Fixed in new diff.
> > > > Index: smi.c
> > > > ===================================================================
> > > > RCS file: /cvs/src/usr.bin/snmp/smi.c,v
> > > [...]
> > > > @@ -598,6 +638,60 @@ smi_foreach(struct oid *oid)
> > > >  }
> > > >  
> > > >  int
> > > > +isu8cont(unsigned char c)
> > > > +{
> > > > +       return (c & (0x80 | 0x40)) == 0x80;
> > > > +}
> > > > +
> > > > +char *
> > > > +smi_displayhint_os(struct textconv *tc, const char *buf, size_t buflen)
> > > > +{
> > > > +       size_t octetlength, i, j, start;
> > > > +       char *displayformat;
> > > > +       char *rbuf;
> > > > +       mbstate_t mbs;
> > > > +
> > > > +       errno = 0;
> > > > +       octetlength = (size_t) strtol(tc->tc_display_hint, 
> > > > &displayformat, 10);
> > > > +       if (octetlength == 0 && errno != 0) {
> > > > +               errno = EINVAL;
> > > > +               return NULL;
> > > > +       }
> > > > +       if (displayformat[0] == 't') {
> > > > +               rbuf = malloc(octetlength + 1);
> > > 
> > > NULL pointer access ahead on ENOMEM.
> > 
> > fixed
> > > > +               if (strcmp(nl_langinfo(CODESET), "UTF-8") == 0) {
> > > 
> > > On OpenBSD, that is not the normal way of checking whether
> > > the locale is UTF-8.  Also, it is not portable because the CODESET
> > > values are not standardized.
> > 
> > I took this one from tmux(1).
> > > We usually do
> > > 
> > >   #include <stdlib.h>
> > > 
> > >   if (MB_CUR_MAX > 1) {
> > >           /* UTF-8 here */
> > > 
> > > Yes, that means if you want to do a portable version, quite a bit
> > > of additional work will be needed.  But the same holds when using
> > > CODESET.
> > 
> > Changed
> > > > +                       bzero(&mbs, sizeof(mbs));
> > > > +                       for (start = j = i = 0; i < octetlength && i < 
> > > > buflen; i++) {
> > > > +                               switch (mbrtowc(NULL, &(buf[i]), 1, 
> > > > &mbs)) {
> > > 
> > > Why are you using mbrtowc(3) here?  That function is unnecessarily
> > > complicated and should only be used when absolutely necessary, i.e.
> > > in the following two situations: (1) the function has to fail on
> > > invalid sequences but must return incomplete sequences, because more
> > > data may arrive later, then making the sequence complete OR (2) the
> > > program is multi-threaded and more than one thread might do UTF-8
> > > decoding at the same time.  Neither seems to be the case here.
> > > 
> > > Using mbtowc(3) is almost always preferable to using mbrtowc(3)
> > > because the resulting code is much easier to read.
> > 
> > I thought I was being lazy by reading one byte at a time. :-)
> > Changed.
> > > I'm delaying full review of the following bits until this is switched
> > > to mbtowc(3).
> > > 
> > > > +                               case 0:
> > > > +                                       rbuf[j++] = '\0';
> > > > +                                       break;
> > > > +                               case -1:
> > > > +                                       rbuf[j++] = '?';
> > > 
> > > You might want U+FFFD here, see above.
> > 
> > Done. Minor nit might be that we exchange a single byte for 3, where we
> > only have octetlength return buffer. But since we expect decent output
> > and the spec says we can trim from the end there's no big harm.
> > > > +                                       bzero(&mbs, sizeof(mbs));
> > > > +                                       break;
> > > > +                               case -2:
> > > > +                                       break;
> > > > +                               default:
> > > 
> > > It looks as if the printability check might be missing here.
> > > Don't you need iswprint(3) before printing to the terminal,
> > > and print U+FFFD instead if it fails?
> > 
> > I choose to go with '.' here to be in line with what -Oa does.
> > Don't know if it's better, but it feels more consistent. :-)
> > > > +                                       memcpy(&(rbuf[j]), 
> > > > &(buf[start]), i - start + 1);
> > > > +                                       j += i - start + 1;
> > > > +                                       start = i + 1;
> > > > +                                       break;
> > > > +                               }
> > > > +                       }
> > > > +               } else {
> > > > +                       for (j = i = 0; i < octetlength && i < buflen; 
> > > > i++) {
> > > > +                               if (isu8cont(buf[i]))
> > > > +                                       continue;
> > > 
> > > This is weird.
> > > 
> > > So if the user sets LC_CTYPE=C, then UTF-8 continuation bytes are
> > > silently skipped?  Consider input like
> > > 
> > >   printf "a\x80z"
> > > 
> > > That printing as "az" would seem unacceptable to me, shouldn't it
> > > rather print "a?z"?
> > > 
> > > > +                               rbuf[j++] = isprint(buf[i]) ? buf[i] : 
> > > > '?';
> > > 
> > > That statement will handle bytes with the high bit set as non-printable
> > > anyway, so i see no need for the isu8cont() test before it.
> > > 
> > > Also, why do you handle the ASCII case separately from the UTF-8
> > > case at all?  A typical mbtowc(3) loop works just fine for ASCII,
> > > too.  In rare cases, there may be a performance gain from handling
> > > ASCII seperately without mbtowc(3), but i don't assume that's the
> > > case here.
> > > 
> > > In case you prefer printing "a?z" for a valid two-byte UTF-8 character
> > > in ASCII output mode rather than "a??z", you can do the following:
> > > inside the mbtowc(3) loop, when finding an invalid sequence, check
> > > whether you are in ASCII mode (with MB_CUR_MAX == 0) and if so,
> > > skip all subsequent isu8cont() bytes.  But that's just a bit of
> > > icing on the cake, one might argue it's a matter of taste whether
> > > printing "a?z" or "a??z" is better.
> > 
> > First: I prefer a single '?' because it gives the user a better
> > indication of what the intended character was. Similar to the discussion
> > had on wall(1). That being said I'd like to stay a little more true to
> > what -Oa does. This means that I now make a best effort guess for UTF-8
> > characters (assuming input should be valid UTF-8, this seems reasonable
> > to me). Full UTF-8 characters are now printed as '?' and similar to -Oa
> > !isprint characters are returned as '.'.
> > 
> > As before I haven't added any more snmpadminstrings, but syscontact
> > should be enough for testing.
> > > Yours,
> > >   Ingo
> 
Index: mib.c
===================================================================
RCS file: /cvs/src/usr.bin/snmp/mib.c,v
retrieving revision 1.2
diff -u -p -r1.2 mib.c
--- mib.c       19 May 2020 13:41:01 -0000      1.2
+++ mib.c       13 Jun 2020 08:10:39 -0000
@@ -27,9 +27,11 @@
 #include "smi.h"
 
 static struct oid mib_tree[] = MIB_TREE;
+static struct textconv textconv_tree[] = TEXTCONV_TREE;
 
 void
 mib_init(void)
 {
        smi_mibtree(mib_tree);
+       smi_textconvtree(textconv_tree);
 }
Index: mib.h
===================================================================
RCS file: /cvs/src/usr.bin/snmp/mib.h,v
retrieving revision 1.1
diff -u -p -r1.1 mib.h
--- mib.h       9 Aug 2019 06:17:59 -0000       1.1
+++ mib.h       13 Jun 2020 08:10:39 -0000
@@ -751,7 +751,7 @@
        { MIBDECL(sysDescr) },                          \
        { MIBDECL(sysOID) },                            \
        { MIBDECL(sysUpTime) },                         \
-       { MIBDECL(sysContact) },                        \
+       { MIBDECL(sysContact), "SnmpAdminString" },     \
        { MIBDECL(sysName) },                           \
        { MIBDECL(sysLocation) },                       \
        { MIBDECL(sysServices) },                       \
@@ -1345,6 +1345,11 @@
        { MIBDECL(ipfRouteEntRouteMetric5) },           \
        { MIBDECL(ipfRouteEntStatus) },                 \
        { MIBEND }                                      \
+}
+
+#define TEXTCONV_TREE {                                        \
+       { "SnmpAdminString", "255t", BER_TYPE_OCTETSTRING }, \
+       { NULL, NULL }                                  \
 }
 
 void    mib_init(void);
Index: smi.c
===================================================================
RCS file: /cvs/src/usr.bin/snmp/smi.c,v
retrieving revision 1.9
diff -u -p -r1.9 smi.c
--- smi.c       31 May 2020 20:38:28 -0000      1.9
+++ smi.c       13 Jun 2020 08:10:39 -0000
@@ -24,10 +24,13 @@
 #include <arpa/inet.h>
 
 #include <ctype.h>
+#include <errno.h>
 #include <stdlib.h>
 #include <stdio.h>
 #include <string.h>
 #include <strings.h>
+#include <wchar.h>
+#include <wctype.h>
 
 #include "ber.h"
 #include "mib.h"
@@ -36,8 +39,13 @@
 
 #define MINIMUM(a, b)  (((a) < (b)) ? (a) : (b))
 
+int isu8cont(unsigned char);
+int isu8start(unsigned char);
+char *smi_displayhint_os(struct textconv *, int, const char *, size_t);
+
 int smi_oid_cmp(struct oid *, struct oid *);
 int smi_key_cmp(struct oid *, struct oid *);
+int smi_textconv_cmp(struct textconv *, struct textconv *);
 struct oid * smi_findkey(char *);
 
 RB_HEAD(oidtree, oid);
@@ -48,6 +56,10 @@ RB_HEAD(keytree, oid);
 RB_PROTOTYPE(keytree, oid, o_keyword, smi_key_cmp)
 struct keytree smi_keytree;
 
+RB_HEAD(textconvtree, textconv);
+RB_PROTOTYPE(textconvtree, textconv, tc_entry, smi_textconv_cmp);
+struct textconvtree smi_tctree;
+
 int
 smi_init(void)
 {
@@ -181,7 +193,7 @@ smi_debug_elements(struct ber_element *r
        fprintf(stderr, "(%u) encoding %u ",
            root->be_type, root->be_encoding);
 
-       if ((value = smi_print_element(root, 1, smi_os_default,
+       if ((value = smi_print_element(NULL, root, 1, smi_os_default,
            smi_oidl_numeric)) == NULL)
                goto invalid;
 
@@ -236,10 +248,13 @@ smi_debug_elements(struct ber_element *r
 }
 
 char *
-smi_print_element(struct ber_element *root, int print_hint,
+smi_print_element(struct ber_oid *oid, struct ber_element *root, int 
print_hint,
     enum smi_output_string output_string, enum smi_oid_lookup lookup)
 {
        char            *str = NULL, *buf, *p;
+       struct oid       okey;
+       struct oid      *object = NULL;
+       struct textconv  tckey;
        size_t           len, i, slen;
        long long        v, ticks;
        int              d;
@@ -249,6 +264,20 @@ smi_print_element(struct ber_element *ro
        char            *hint;
        int              days, hours, min, sec, csec;
 
+       if (oid != NULL) {
+               bcopy(oid, &(okey.o_id), sizeof(okey));
+               do {
+                       object = RB_FIND(oidtree, &smi_oidtree, &okey);
+                       okey.o_id.bo_n--;
+               } while (object == NULL && okey.o_id.bo_n > 0);
+               if (object != NULL && object->o_textconv == NULL &&
+                   object->o_tcname != NULL) {
+                       tckey.tc_name = object->o_tcname;
+                       object->o_textconv = RB_FIND(textconvtree, &smi_tctree,
+                           &tckey);
+               }
+       }
+
        switch (root->be_encoding) {
        case BER_TYPE_BOOLEAN:
                if (ober_get_boolean(root, &d) == -1)
@@ -379,6 +408,10 @@ smi_print_element(struct ber_element *ro
                        else
                                str = strdup("Unknown status at this OID");
                } else {
+                       if (object != NULL && object->o_textconv != NULL &&
+                           object->o_textconv->tc_syntax == root->be_encoding)
+                               return smi_displayhint_os(object->o_textconv,
+                                   print_hint, buf, root->be_len);
                        for (i = 0; i < root->be_len; i++) {
                                if (!isprint(buf[i])) {
                                        if (output_string == smi_os_default)
@@ -575,6 +608,15 @@ smi_mibtree(struct oid *oids)
        }
 }
 
+void
+smi_textconvtree(struct textconv *textconvs)
+{
+       size_t           i = 0;
+
+       for (i = 0; textconvs[i].tc_name != NULL; i++)
+               RB_INSERT(textconvtree, &smi_tctree, &(textconvs[i]));
+}
+
 struct oid *
 smi_findkey(char *name)
 {
@@ -598,6 +640,105 @@ smi_foreach(struct oid *oid)
 }
 
 int
+isu8cont(unsigned char c)
+{
+       return (c & (0x80 | 0x40)) == 0x80;
+}
+
+int
+isu8start(unsigned char c)
+{
+       return (c & 0xc0) == 0xc0;
+}
+
+#define REPLACEMENT "\357\277\275"
+#define U8MAX 4
+char *
+smi_displayhint_os(struct textconv *tc, int print_hint, const char *buf,
+    size_t buflen)
+{
+       size_t octetlength, i, j = 0, u8cnt = U8MAX;
+       int clen;
+       char *displayformat;
+       char *rbuf;
+       wchar_t wc;
+
+       errno = 0;
+       octetlength = (size_t) strtol(tc->tc_display_hint, &displayformat, 10);
+       if (octetlength == 0 && errno != 0) {
+               errno = EINVAL;
+               return NULL;
+       }
+       if (displayformat[0] == 't') {
+               rbuf = malloc(octetlength + sizeof("STRING: "));
+               if (rbuf == NULL)
+                       return NULL;
+               if (print_hint) {
+                       memcpy(rbuf, "STRING: ", sizeof("STRING: ") - 1);
+                       j = sizeof("STRING: ") - 1;
+               }
+               if (MB_CUR_MAX > 1) {
+                       for (i = 0; j < octetlength && i < buflen;) {
+                               clen = mbtowc(&wc, &(buf[i]), buflen - i);
+                               switch (clen) {
+                               case 0:
+                                       rbuf[j] = '\0';
+                                       return rbuf;
+                               case -1:
+                                       if (octetlength - j <
+                                           sizeof(REPLACEMENT) - 1) {
+                                               rbuf[j] = '\0';
+                                               return rbuf;
+                                       }
+                                       memcpy(&(rbuf[j]), REPLACEMENT,
+                                           sizeof(REPLACEMENT) - 1);
+                                       j += sizeof(REPLACEMENT) - 1;
+                                       i++;
+                                       break;
+                               default:
+                                       if (!iswprint(wc))
+                                               rbuf[j++] = '.';
+                                       else {
+                                               if (octetlength - j < clen) {
+                                                       rbuf[j] = '\0';
+                                                       return rbuf;
+                                               }
+                                               memcpy(&(rbuf[j]), &(buf[i]),
+                                                   clen);
+                                               j += clen;
+                                       }
+                                       i += clen;
+                                       break;
+                               }
+                       }
+               } else {
+                       for (j = i = 0; i < octetlength && i < buflen; i++) {
+                               if (isu8start(buf[i])) {
+                                       u8cnt = 0;
+                                       continue;
+                               }
+                               if (isu8cont(buf[i]) && u8cnt < U8MAX) {
+                                       if (u8cnt == 0)
+                                               rbuf[j++] = '?';
+                                       u8cnt++;
+                                       continue;
+                               }
+                               if (u8cnt == 0) {
+                                       u8cnt = U8MAX;
+                                       rbuf[j++] = '.';
+                               }
+                               rbuf[j++] = isprint(buf[i]) ? buf[i] : '.';
+                       }
+                       if (u8cnt == 0)
+                               rbuf[j++] = '.';
+               }
+               rbuf[j] = '\0';
+               return rbuf;
+       }
+       return NULL;
+}
+
+int
 smi_oid_cmp(struct oid *a, struct oid *b)
 {
        size_t   i;
@@ -618,5 +759,12 @@ smi_key_cmp(struct oid *a, struct oid *b
        return (strcasecmp(a->o_name, b->o_name));
 }
 
+int
+smi_textconv_cmp(struct textconv *a, struct textconv *b)
+{
+       return strcmp(a->tc_name, b->tc_name);
+}
+
 RB_GENERATE(oidtree, oid, o_element, smi_oid_cmp)
 RB_GENERATE(keytree, oid, o_keyword, smi_key_cmp)
+RB_GENERATE(textconvtree, textconv, tc_entry, smi_textconv_cmp);
Index: smi.h
===================================================================
RCS file: /cvs/src/usr.bin/snmp/smi.h,v
retrieving revision 1.2
diff -u -p -r1.2 smi.h
--- smi.h       19 May 2020 13:41:01 -0000      1.2
+++ smi.h       13 Jun 2020 08:10:39 -0000
@@ -59,18 +59,28 @@ struct oid {
 #define o_oid                   o_id.bo_id
 #define o_oidlen                o_id.bo_n
 
-       char                    *o_name;
+       const char              *o_name;
+       const char              *o_tcname;
+       struct textconv         *o_textconv;
 
        RB_ENTRY(oid)            o_element;
        RB_ENTRY(oid)            o_keyword;
 };
 
+struct textconv {
+       const char              *tc_name;
+       const char              *tc_display_hint;
+       unsigned int             tc_syntax;
+       RB_ENTRY(textconv)       tc_entry;
+};
+
 int smi_init(void);
 unsigned int smi_application(struct ber_element *);
 int smi_string2oid(const char *, struct ber_oid *);
 char *smi_oid2string(struct ber_oid *, char *, size_t, enum smi_oid_lookup);
 void smi_mibtree(struct oid *);
+void smi_textconvtree(struct textconv *);
 struct oid *smi_foreach(struct oid *);
 void smi_debug_elements(struct ber_element *);
-char *smi_print_element(struct ber_element *, int, enum smi_output_string,
-    enum smi_oid_lookup);
+char *smi_print_element(struct ber_oid *, struct ber_element *, int,
+    enum smi_output_string, enum smi_oid_lookup);
Index: snmpc.c
===================================================================
RCS file: /cvs/src/usr.bin/snmp/snmpc.c,v
retrieving revision 1.26
diff -u -p -r1.26 snmpc.c
--- snmpc.c     31 May 2020 21:01:59 -0000      1.26
+++ snmpc.c     13 Jun 2020 08:10:39 -0000
@@ -29,6 +29,7 @@
 #include <ctype.h>
 #include <err.h>
 #include <errno.h>
+#include <locale.h>
 #include <netdb.h>
 #include <poll.h>
 #include <stdio.h>
@@ -129,6 +130,8 @@ main(int argc, char *argv[])
        int ch;
        size_t i;
 
+       setlocale(LC_CTYPE, "");
+
        if (pledge("stdio inet dns unix", NULL) == -1)
                err(1, "pledge");
 
@@ -1069,7 +1072,8 @@ snmpc_print(struct ber_element *elm)
        }
 
        elm = elm->be_next;
-       value = smi_print_element(elm, smi_print_hint, output_string, 
oid_lookup);
+       value = smi_print_element(&oid, elm, smi_print_hint, output_string,
+           oid_lookup);
        if (value == NULL)
                return 0;
Re: snmp(1) initial UTF-8 support

Reply via email to