Re: snmp(1) initial UTF-8 support

Martijn van Duren Thu, 11 Jun 2020 12:17:51 -0700

Hello Ingo,

Thanks for looking into this.


On Sun, 2020-05-31 at 16:32 +0200, Ingo Schwarze wrote:
> Hi Martijn,
> 
> Martijn van Duren wrote on Tue, May 19, 2020 at 10:25:37PM +0200:
> 
> > So according to RFC2579 an octetstring can contain UTF-8 characters if  
> > so described in the DISPLAY-HINT. One of the main consumers of this is
> > SnmpAdminString, which is used quite a lot.
> > 
> > Now that we trimmed a little fat from snmp's oid, I wanted to fill it
> > up again and implemented the bare minimum DISPLAY-HINT parsing that is
> > required for SnmpAdminString. Other parts of the syntax can be atter
> > at a later state if desirable.
> > 
> > Now I decided to implement this a little different from net-snmp,
> > because they throw incomplete sequences directly to the terminal,
> > which I recon is not a sane approach.
> 
> No, definitely not.  Never throw invalid or incomplete UTF-8 at the
> terminal unless the user unambiguously requests just that (like
> with `printf "a\x80z\n"` or `cat /usr/bin/cc` or something like that).
> 
> > I choose to take the approach taken by the likes of wall(1) and replace
> > invalid and incomplete sequences with a '?'.
> 
> If the specification says that the user is only allowed to put valid
> UTF-8 into octetstrings (and not arbitrary bytes), replacing invalid
> bytes at output time is both acceptable and feels like the only
> sensible option, unless something like vis(3) is considered more
> appropriate.

RFC2579 section 3.1 bullet (3): or `t' for UTF-8
So it should be valid UTF-8.
> 
> However, the reason why wall(1) uses '?' as the replacement character
> is because wall(1) must not assume that the output device can handle
> UTF-8 and has to work safely if all the output device can handle is
> ASCII.  So, wall(1) has to live with the inconvenience that when you
> see a question mark, you don't know whether it really is a question
> mark or whether it originally used to be garbage input.

Fair enough.
> 
> If i understand correctly, here we are talking about output that is
> UTF-8 anyway.  So instead of (ab)using the question mark, you might
> consider using the U+FFFD REPLACEMENT CHARACTER.  The intended
> purpose of that one is to stand in for invalid and inconplete
> byte sequences, and also for characters that cannot be displayed.

I like that one, used in diff below.
> 
> > This because DISPLAY-HINT already states that too long sequences
> > strings should be cut short at the final multibyte character fitting
> > inside the specified length, meaning that output can already get
> > mangled.
> > 
> > This also brings me to the question how we should handle DISPLAY-HINT
> > in the case of -Oa and -Ox. Net-snmp prioritizes DISPLAY-HINT over
> > these, but it has the option to disable this by disabling MIB-loading
> > via -m''; This implementation doesn't give us the that option (yet?).
> > The current diff follows net-snmp, but there is something to say to
> > prioritize -Ox over DISPLAY-HINT, so an admin can use it to debug
> > snmp-output without it being mangled. Any feedback here is welcome.
> > 
> > Once it's clear if this is the right approach I'll do a thorough search
> > through mib.h on which objects actually can use this new definition.
> 
> I can't really comment on what snmp(1) should do, or which settings
> should override which other settings.

I'll leave it as is for now. If people actually need to original mangled
data we can always revisit. It's not something major that people heavily
rely on.
> 
> Do i understand correctly that you want to make -Oa print invalid
> UTF-8 to the terminal?

No, to be in line with net-snmp, what snmp(1) does is the following:
- Without a -O flag it sees if all bytes are printable and if so print
  the string, if not prints the full string in hex.
- With -Oa it prints all printable bytes and replaces unprintable
  characters with '.'.
- With -Ox it prints all bytes in hex.

> If so, that would sound reasonable to me,
> for the following reason.  Printing non-printable ASCII to the
> terminal is often dangerous, it can screw up or reconfigure the
> terminal, so -Oa is already an option that must be used with care,
> just like `cat /usr/bin/cc`.  While printing invalid UTF-8 is not
> a good idea in general, it is less dangerous than printing non-printable
> ASCII, so just passing the bytes through for -Oa doesn't make
> anything more dangerous than it already is.
> 
> Maybe -Oa should have a warning in the manual page about the dangers
> of using it on untrusted data?  Not sure.

So no, it's not dangerous.
> 
> Oh, by the way, when -Oa is not specified and there is UTF-8
> content, don't forget to still filter out non-printable bytes:
> both non-printable ASCII bytes and non-printable UTF-8 characters.
> Right?

You're right. Fixed in new diff.
> 
> > Index: smi.c
> > ===================================================================
> > RCS file: /cvs/src/usr.bin/snmp/smi.c,v
> [...]
> > @@ -598,6 +638,60 @@ smi_foreach(struct oid *oid)
> >  }
> >  
> >  int
> > +isu8cont(unsigned char c)
> > +{
> > +   return (c & (0x80 | 0x40)) == 0x80;
> > +}
> > +
> > +char *
> > +smi_displayhint_os(struct textconv *tc, const char *buf, size_t buflen)
> > +{
> > +   size_t octetlength, i, j, start;
> > +   char *displayformat;
> > +   char *rbuf;
> > +   mbstate_t mbs;
> > +
> > +   errno = 0;
> > +   octetlength = (size_t) strtol(tc->tc_display_hint, &displayformat, 10);
> > +   if (octetlength == 0 && errno != 0) {
> > +           errno = EINVAL;
> > +           return NULL;
> > +   }
> > +   if (displayformat[0] == 't') {
> > +           rbuf = malloc(octetlength + 1);
> 
> NULL pointer access ahead on ENOMEM.

fixed
> 
> > +           if (strcmp(nl_langinfo(CODESET), "UTF-8") == 0) {
> 
> On OpenBSD, that is not the normal way of checking whether
> the locale is UTF-8.  Also, it is not portable because the CODESET
> values are not standardized.

I took this one from tmux(1).
> 
> We usually do
> 
>       #include <stdlib.h>
> 
>       if (MB_CUR_MAX > 1) {
>               /* UTF-8 here */
> 
> Yes, that means if you want to do a portable version, quite a bit
> of additional work will be needed.  But the same holds when using
> CODESET.

Changed
> 
> > +                   bzero(&mbs, sizeof(mbs));
> > +                   for (start = j = i = 0; i < octetlength && i < buflen; 
> > i++) {
> > +                           switch (mbrtowc(NULL, &(buf[i]), 1, &mbs)) {
> 
> Why are you using mbrtowc(3) here?  That function is unnecessarily
> complicated and should only be used when absolutely necessary, i.e.
> in the following two situations: (1) the function has to fail on
> invalid sequences but must return incomplete sequences, because more
> data may arrive later, then making the sequence complete OR (2) the
> program is multi-threaded and more than one thread might do UTF-8
> decoding at the same time.  Neither seems to be the case here.
> 
> Using mbtowc(3) is almost always preferable to using mbrtowc(3)
> because the resulting code is much easier to read.

I thought I was being lazy by reading one byte at a time. :-)
Changed.
> 
> I'm delaying full review of the following bits until this is switched
> to mbtowc(3).
> 
> > +                           case 0:
> > +                                   rbuf[j++] = '\0';
> > +                                   break;
> > +                           case -1:
> > +                                   rbuf[j++] = '?';
> 
> You might want U+FFFD here, see above.

Done. Minor nit might be that we exchange a single byte for 3, where we
only have octetlength return buffer. But since we expect decent output
and the spec says we can trim from the end there's no big harm.
> 
> > +                                   bzero(&mbs, sizeof(mbs));
> > +                                   break;
> > +                           case -2:
> > +                                   break;
> > +                           default:
> 
> It looks as if the printability check might be missing here.
> Don't you need iswprint(3) before printing to the terminal,
> and print U+FFFD instead if it fails?

I choose to go with '.' here to be in line with what -Oa does.
Don't know if it's better, but it feels more consistent. :-)
> 
> > +                                   memcpy(&(rbuf[j]), &(buf[start]), i - 
> > start + 1);
> > +                                   j += i - start + 1;
> > +                                   start = i + 1;
> > +                                   break;
> > +                           }
> > +                   }
> > +           } else {
> > +                   for (j = i = 0; i < octetlength && i < buflen; i++) {
> > +                           if (isu8cont(buf[i]))
> > +                                   continue;
> 
> This is weird.
> 
> So if the user sets LC_CTYPE=C, then UTF-8 continuation bytes are
> silently skipped?  Consider input like
> 
>   printf "a\x80z"
> 
> That printing as "az" would seem unacceptable to me, shouldn't it
> rather print "a?z"?
> 
> > +                           rbuf[j++] = isprint(buf[i]) ? buf[i] : '?';
> 
> That statement will handle bytes with the high bit set as non-printable
> anyway, so i see no need for the isu8cont() test before it.
> 
> Also, why do you handle the ASCII case separately from the UTF-8
> case at all?  A typical mbtowc(3) loop works just fine for ASCII,
> too.  In rare cases, there may be a performance gain from handling
> ASCII seperately without mbtowc(3), but i don't assume that's the
> case here.
> 
> In case you prefer printing "a?z" for a valid two-byte UTF-8 character
> in ASCII output mode rather than "a??z", you can do the following:
> inside the mbtowc(3) loop, when finding an invalid sequence, check
> whether you are in ASCII mode (with MB_CUR_MAX == 0) and if so,
> skip all subsequent isu8cont() bytes.  But that's just a bit of
> icing on the cake, one might argue it's a matter of taste whether
> printing "a?z" or "a??z" is better.

First: I prefer a single '?' because it gives the user a better
indication of what the intended character was. Similar to the discussion
had on wall(1). That being said I'd like to stay a little more true to
what -Oa does. This means that I now make a best effort guess for UTF-8
characters (assuming input should be valid UTF-8, this seems reasonable
to me). Full UTF-8 characters are now printed as '?' and similar to -Oa
!isprint characters are returned as '.'.

As before I haven't added any more snmpadminstrings, but syscontact
should be enough for testing.
> 
> Yours,
>   Ingo

Index: mib.c
===================================================================
RCS file: /cvs/src/usr.bin/snmp/mib.c,v
retrieving revision 1.2
diff -u -p -r1.2 mib.c
--- mib.c       19 May 2020 13:41:01 -0000      1.2
+++ mib.c       11 Jun 2020 19:16:45 -0000
@@ -27,9 +27,11 @@
 #include "smi.h"
 
 static struct oid mib_tree[] = MIB_TREE;
+static struct textconv textconv_tree[] = TEXTCONV_TREE;
 
 void
 mib_init(void)
 {
        smi_mibtree(mib_tree);
+       smi_textconvtree(textconv_tree);
 }
Index: mib.h
===================================================================
RCS file: /cvs/src/usr.bin/snmp/mib.h,v
retrieving revision 1.1
diff -u -p -r1.1 mib.h
--- mib.h       9 Aug 2019 06:17:59 -0000       1.1
+++ mib.h       11 Jun 2020 19:16:45 -0000
@@ -751,7 +751,7 @@
        { MIBDECL(sysDescr) },                          \
        { MIBDECL(sysOID) },                            \
        { MIBDECL(sysUpTime) },                         \
-       { MIBDECL(sysContact) },                        \
+       { MIBDECL(sysContact), "SnmpAdminString" },     \
        { MIBDECL(sysName) },                           \
        { MIBDECL(sysLocation) },                       \
        { MIBDECL(sysServices) },                       \
@@ -1345,6 +1345,11 @@
        { MIBDECL(ipfRouteEntRouteMetric5) },           \
        { MIBDECL(ipfRouteEntStatus) },                 \
        { MIBEND }                                      \
+}
+
+#define TEXTCONV_TREE {                                        \
+       { "SnmpAdminString", "255t", BER_TYPE_OCTETSTRING }, \
+       { NULL, NULL }                                  \
 }
 
 void    mib_init(void);
Index: smi.c
===================================================================
RCS file: /cvs/src/usr.bin/snmp/smi.c,v
retrieving revision 1.9
diff -u -p -r1.9 smi.c
--- smi.c       31 May 2020 20:38:28 -0000      1.9
+++ smi.c       11 Jun 2020 19:16:45 -0000
@@ -24,10 +24,13 @@
 #include <arpa/inet.h>
 
 #include <ctype.h>
+#include <errno.h>
 #include <stdlib.h>
 #include <stdio.h>
 #include <string.h>
 #include <strings.h>
+#include <wchar.h>
+#include <wctype.h>
 
 #include "ber.h"
 #include "mib.h"
@@ -36,8 +39,13 @@
 
 #define MINIMUM(a, b)  (((a) < (b)) ? (a) : (b))
 
+int isu8cont(unsigned char);
+int isu8start(unsigned char);
+char *smi_displayhint_os(struct textconv *, const char *, size_t);
+
 int smi_oid_cmp(struct oid *, struct oid *);
 int smi_key_cmp(struct oid *, struct oid *);
+int smi_textconv_cmp(struct textconv *, struct textconv *);
 struct oid * smi_findkey(char *);
 
 RB_HEAD(oidtree, oid);
@@ -48,6 +56,10 @@ RB_HEAD(keytree, oid);
 RB_PROTOTYPE(keytree, oid, o_keyword, smi_key_cmp)
 struct keytree smi_keytree;
 
+RB_HEAD(textconvtree, textconv);
+RB_PROTOTYPE(textconvtree, textconv, tc_entry, smi_textconv_cmp);
+struct textconvtree smi_tctree;
+
 int
 smi_init(void)
 {
@@ -181,7 +193,7 @@ smi_debug_elements(struct ber_element *r
        fprintf(stderr, "(%u) encoding %u ",
            root->be_type, root->be_encoding);
 
-       if ((value = smi_print_element(root, 1, smi_os_default,
+       if ((value = smi_print_element(NULL, root, 1, smi_os_default,
            smi_oidl_numeric)) == NULL)
                goto invalid;
 
@@ -236,10 +248,13 @@ smi_debug_elements(struct ber_element *r
 }
 
 char *
-smi_print_element(struct ber_element *root, int print_hint,
+smi_print_element(struct ber_oid *oid, struct ber_element *root, int 
print_hint,
     enum smi_output_string output_string, enum smi_oid_lookup lookup)
 {
        char            *str = NULL, *buf, *p;
+       struct oid       okey;
+       struct oid      *object = NULL;
+       struct textconv  tckey;
        size_t           len, i, slen;
        long long        v, ticks;
        int              d;
@@ -249,6 +264,20 @@ smi_print_element(struct ber_element *ro
        char            *hint;
        int              days, hours, min, sec, csec;
 
+       if (oid != NULL) {
+               bcopy(oid, &(okey.o_id), sizeof(okey));
+               do {
+                       object = RB_FIND(oidtree, &smi_oidtree, &okey);
+                       okey.o_id.bo_n--;
+               } while (object == NULL && okey.o_id.bo_n > 0);
+               if (object != NULL && object->o_textconv == NULL &&
+                   object->o_tcname != NULL) {
+                       tckey.tc_name = object->o_tcname;
+                       object->o_textconv = RB_FIND(textconvtree, &smi_tctree,
+                           &tckey);
+               }
+       }
+
        switch (root->be_encoding) {
        case BER_TYPE_BOOLEAN:
                if (ober_get_boolean(root, &d) == -1)
@@ -379,6 +408,10 @@ smi_print_element(struct ber_element *ro
                        else
                                str = strdup("Unknown status at this OID");
                } else {
+                       if (object != NULL && object->o_textconv != NULL &&
+                           object->o_textconv->tc_syntax == root->be_encoding)
+                               return smi_displayhint_os(object->o_textconv,
+                                   buf, root->be_len);
                        for (i = 0; i < root->be_len; i++) {
                                if (!isprint(buf[i])) {
                                        if (output_string == smi_os_default)
@@ -575,6 +608,15 @@ smi_mibtree(struct oid *oids)
        }
 }
 
+void
+smi_textconvtree(struct textconv *textconvs)
+{
+       size_t           i = 0;
+
+       for (i = 0; textconvs[i].tc_name != NULL; i++)
+               RB_INSERT(textconvtree, &smi_tctree, &(textconvs[i]));
+}
+
 struct oid *
 smi_findkey(char *name)
 {
@@ -598,6 +640,98 @@ smi_foreach(struct oid *oid)
 }
 
 int
+isu8cont(unsigned char c)
+{
+       return (c & (0x80 | 0x40)) == 0x80;
+}
+
+int
+isu8start(unsigned char c)
+{
+       return (c & 0xc0) == 0xc0;
+}
+
+#define REPLACEMENT "\357\277\275"
+#define U8MAX 4
+char *
+smi_displayhint_os(struct textconv *tc, const char *buf, size_t buflen)
+{
+       size_t octetlength, i, j, u8cnt = U8MAX;
+       int clen;
+       char *displayformat;
+       char *rbuf;
+       wchar_t wc;
+
+       errno = 0;
+       octetlength = (size_t) strtol(tc->tc_display_hint, &displayformat, 10);
+       if (octetlength == 0 && errno != 0) {
+               errno = EINVAL;
+               return NULL;
+       }
+       if (displayformat[0] == 't') {
+               rbuf = malloc(octetlength + sizeof("STRING: "));
+               if (rbuf == NULL)
+                       return NULL;
+               memcpy(rbuf, "STRING: ", sizeof("STRING: ") - 1);
+               j = sizeof("STRING: ") - 1;
+               if (MB_CUR_MAX > 1) {
+                       for (i = 0; i < octetlength && i < buflen;) {
+                               clen = mbtowc(&wc, &(buf[i]), buflen - i);
+                               switch (clen) {
+                               case 0:
+                                       rbuf[j] = '\0';
+                                       return rbuf;
+                               case -1:
+                                       if (octetlength - j <
+                                           sizeof(REPLACEMENT) - 1) {
+                                               rbuf[j] = '\0';
+                                               return rbuf;
+                                       }
+                                       memcpy(&(rbuf[j]), REPLACEMENT,
+                                           sizeof(REPLACEMENT) - 1);
+                                       j += sizeof(REPLACEMENT) - 1;
+                                       i++;
+                                       break;
+                               default:
+                                       if (!iswprint(wc))
+                                               rbuf[j++] = '.';
+                                       else {
+                                               memcpy(&(rbuf[j]), &(buf[i]),
+                                                   clen);
+                                               j += clen;
+                                       }
+                                       i += clen;
+                                       break;
+                               }
+                       }
+               } else {
+                       for (j = i = 0; i < octetlength && i < buflen; i++) {
+                               if (isu8start(buf[i])) {
+                                       u8cnt = 0;
+                                       continue;
+                               }
+                               if (isu8cont(buf[i]) && u8cnt < U8MAX) {
+                                       if (u8cnt == 0)
+                                               rbuf[j++] = '?';
+                                       u8cnt++;
+                                       continue;
+                               }
+                               if (u8cnt == 0) {
+                                       u8cnt = U8MAX;
+                                       rbuf[j++] = '.';
+                               }
+                               rbuf[j++] = isprint(buf[i]) ? buf[i] : '.';
+                       }
+                       if (u8cnt == 0)
+                               rbuf[j++] = '.';
+               }
+               rbuf[j] = '\0';
+               return rbuf;
+       }
+       return NULL;
+}
+
+int
 smi_oid_cmp(struct oid *a, struct oid *b)
 {
        size_t   i;
@@ -618,5 +752,12 @@ smi_key_cmp(struct oid *a, struct oid *b
        return (strcasecmp(a->o_name, b->o_name));
 }
 
+int
+smi_textconv_cmp(struct textconv *a, struct textconv *b)
+{
+       return strcmp(a->tc_name, b->tc_name);
+}
+
 RB_GENERATE(oidtree, oid, o_element, smi_oid_cmp)
 RB_GENERATE(keytree, oid, o_keyword, smi_key_cmp)
+RB_GENERATE(textconvtree, textconv, tc_entry, smi_textconv_cmp);
Index: smi.h
===================================================================
RCS file: /cvs/src/usr.bin/snmp/smi.h,v
retrieving revision 1.2
diff -u -p -r1.2 smi.h
--- smi.h       19 May 2020 13:41:01 -0000      1.2
+++ smi.h       11 Jun 2020 19:16:45 -0000
@@ -59,18 +59,28 @@ struct oid {
 #define o_oid                   o_id.bo_id
 #define o_oidlen                o_id.bo_n
 
-       char                    *o_name;
+       const char              *o_name;
+       const char              *o_tcname;
+       struct textconv         *o_textconv;
 
        RB_ENTRY(oid)            o_element;
        RB_ENTRY(oid)            o_keyword;
 };
 
+struct textconv {
+       const char              *tc_name;
+       const char              *tc_display_hint;
+       unsigned int             tc_syntax;
+       RB_ENTRY(textconv)       tc_entry;
+};
+
 int smi_init(void);
 unsigned int smi_application(struct ber_element *);
 int smi_string2oid(const char *, struct ber_oid *);
 char *smi_oid2string(struct ber_oid *, char *, size_t, enum smi_oid_lookup);
 void smi_mibtree(struct oid *);
+void smi_textconvtree(struct textconv *);
 struct oid *smi_foreach(struct oid *);
 void smi_debug_elements(struct ber_element *);
-char *smi_print_element(struct ber_element *, int, enum smi_output_string,
-    enum smi_oid_lookup);
+char *smi_print_element(struct ber_oid *, struct ber_element *, int,
+    enum smi_output_string, enum smi_oid_lookup);
Index: snmpc.c
===================================================================
RCS file: /cvs/src/usr.bin/snmp/snmpc.c,v
retrieving revision 1.26
diff -u -p -r1.26 snmpc.c
--- snmpc.c     31 May 2020 21:01:59 -0000      1.26
+++ snmpc.c     11 Jun 2020 19:16:45 -0000
@@ -29,6 +29,7 @@
 #include <ctype.h>
 #include <err.h>
 #include <errno.h>
+#include <locale.h>
 #include <netdb.h>
 #include <poll.h>
 #include <stdio.h>
@@ -129,6 +130,8 @@ main(int argc, char *argv[])
        int ch;
        size_t i;
 
+       setlocale(LC_CTYPE, "");
+
        if (pledge("stdio inet dns unix", NULL) == -1)
                err(1, "pledge");
 
@@ -1069,7 +1072,8 @@ snmpc_print(struct ber_element *elm)
        }
 
        elm = elm->be_next;
-       value = smi_print_element(elm, smi_print_hint, output_string, 
oid_lookup);
+       value = smi_print_element(&oid, elm, smi_print_hint, output_string,
+           oid_lookup);
        if (value == NULL)
                return 0;

Re: snmp(1) initial UTF-8 support

Reply via email to