Hi Martijn,
Martijn van Duren wrote on Tue, May 19, 2020 at 10:25:37PM +0200:
> So according to RFC2579 an octetstring can contain UTF-8 characters if
> so described in the DISPLAY-HINT. One of the main consumers of this is
> SnmpAdminString, which is used quite a lot.
>
> Now that we trimmed a little fat from snmp's oid, I wanted to fill it
> up again and implemented the bare minimum DISPLAY-HINT parsing that is
> required for SnmpAdminString. Other parts of the syntax can be added
> at a later stage if desirable.
>
> Now I decided to implement this a little differently from net-snmp,
> because they throw incomplete sequences directly to the terminal,
> which I reckon is not a sane approach.
No, definitely not. Never throw invalid or incomplete UTF-8 at the
terminal unless the user unambiguously requests just that (like
with `printf "a\x80z\n"` or `cat /usr/bin/cc` or something like that).
> I chose to take the approach taken by the likes of wall(1) and replace
> invalid and incomplete sequences with a '?'.
If the specification says that the user is only allowed to put valid
UTF-8 into octetstrings (and not arbitrary bytes), replacing invalid
bytes at output time is both acceptable and feels like the only
sensible option, unless something like vis(3) is considered more
appropriate.
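For the record, a vis(3)-based version might look roughly like the
following untested sketch; the helper name encode_safe() is made up:

  #include <stdlib.h>
  #include <string.h>
  #include <vis.h>

  /*
   * Encode arbitrary bytes such that they are safe to print
   * to a terminal.  The destination buffer needs up to four
   * output bytes per input byte plus a terminating NUL,
   * as documented in vis(3).
   */
  char *
  encode_safe(const char *src)
  {
          char *dst;

          if ((dst = reallocarray(NULL, strlen(src) + 1, 4)) == NULL)
                  return NULL;
          strvis(dst, src, VIS_SAFE | VIS_OCTAL);
          return dst;
  }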
However, the reason why wall(1) uses '?' as the replacement character
is because wall(1) must not assume that the output device can handle
UTF-8 and has to work safely if all the output device can handle is
ASCII. So, wall(1) has to live with the inconvenience that when you
see a question mark, you don't know whether it really is a question
mark or whether it originally used to be garbage input.
If I understand correctly, here we are talking about output that is
UTF-8 anyway. So instead of (ab)using the question mark, you might
consider using the U+FFFD REPLACEMENT CHARACTER. The intended
purpose of that one is to stand in for invalid and incomplete
byte sequences, and also for characters that cannot be displayed.
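In UTF-8 output, emitting it amounts to three bytes, roughly like
this, where rbuf and j are the variables from your diff and the
caller has to size rbuf for up to three output bytes per replaced
input byte:

  /* Substitute U+FFFD REPLACEMENT CHARACTER for one bad input byte. */
  memcpy(&rbuf[j], "\xef\xbf\xbd", 3);
  j += 3;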
> This because DISPLAY-HINT already states that overlong strings
> should be cut short at the last multibyte character that fits
> inside the specified length, meaning that output can already get
> mangled.
>
> This also brings me to the question of how we should handle
> DISPLAY-HINT in the case of -Oa and -Ox. Net-snmp prioritizes
> DISPLAY-HINT over these, but it has the option to disable this by
> disabling MIB-loading via -m''. This implementation doesn't give us
> that option (yet?). The current diff follows net-snmp, but there is
> something to be said for prioritizing -Ox over DISPLAY-HINT, so an
> admin can use it to debug snmp output without it being mangled. Any
> feedback here is welcome.
>
> Once it's clear if this is the right approach I'll do a thorough search
> through mib.h on which objects actually can use this new definition.
I can't really comment on what snmp(1) should do, or which settings
should override which other settings.
Do I understand correctly that you want to make -Oa print invalid
UTF-8 to the terminal? If so, that would sound reasonable to me,
for the following reason. Printing non-printable ASCII to the
terminal is often dangerous: it can screw up or reconfigure the
terminal, so -Oa is already an option that must be used with care,
just like `cat /usr/bin/cc`. While printing invalid UTF-8 is not
a good idea in general, it is less dangerous than printing non-printable
ASCII, so just passing the bytes through for -Oa doesn't make
anything more dangerous than it already is.
Maybe -Oa should have a warning in the manual page about the dangers
of using it on untrusted data? Not sure.
Oh, by the way, when -Oa is not specified and there is UTF-8
content, don't forget to still filter out non-printable bytes:
both non-printable ASCII bytes and non-printable UTF-8 characters.
Right?
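Roughly along these lines, an untested sketch using the variables
from your diff, mbtowc(3) from <stdlib.h>, and iswprint(3) from
<wctype.h>; more on mbtowc(3) below:

  wchar_t wc;
  int len;

  /* Decode one character starting at buf[i]. */
  if ((len = mbtowc(&wc, &buf[i], buflen - i)) > 0 && iswprint(wc)) {
          /* Valid and printable: pass the len bytes through. */
  } else {
          /* Invalid, incomplete, or non-printable: substitute. */
  }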
> Index: smi.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/snmp/smi.c,v
[...]
> @@ -598,6 +638,60 @@ smi_foreach(struct oid *oid)
> }
>
> int
> +isu8cont(unsigned char c)
> +{
> + return (c & (0x80 | 0x40)) == 0x80;
> +}
> +
> +char *
> +smi_displayhint_os(struct textconv *tc, const char *buf, size_t buflen)
> +{
> + size_t octetlength, i, j, start;
> + char *displayformat;
> + char *rbuf;
> + mbstate_t mbs;
> +
> + errno = 0;
> + octetlength = (size_t) strtol(tc->tc_display_hint, &displayformat, 10);
> + if (octetlength == 0 && errno != 0) {
> + errno = EINVAL;
> + return NULL;
> + }
> + if (displayformat[0] == 't') {
> + rbuf = malloc(octetlength + 1);
NULL pointer access ahead on ENOMEM.
> + if (strcmp(nl_langinfo(CODESET), "UTF-8") == 0) {
On OpenBSD, that is not the normal way of checking whether
the locale is UTF-8. Also, it is not portable because the CODESET
values are not standardized.
We usually do
#include <stdlib.h>
if (MB_CUR_MAX > 1) {
/* UTF-8 here */
Yes, that means if you want to do a portable version, quite a bit
of additional work will be needed. But the same holds when using
CODESET.
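Also note that MB_CUR_MAX only reflects the user's locale after the
program has initialized it; if snmp(1) doesn't do so already, it
needs something like this once, early in main():

  #include <locale.h>

  /* Activate the character encoding from the environment. */
  setlocale(LC_CTYPE, "");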
> + bzero(&mbs, sizeof(mbs));
> + for (start = j = i = 0; i < octetlength && i < buflen;
> i++) {
> + switch (mbrtowc(NULL, &(buf[i]), 1, &mbs)) {
Why are you using mbrtowc(3) here? That function is unnecessarily
complicated and should only be used when absolutely necessary, i.e.
in the following two situations: (1) the function has to fail on
invalid sequences but must accept incomplete sequences, because more
data may arrive later that completes the sequence, OR (2) the
program is multi-threaded and more than one thread might do UTF-8
decoding at the same time. Neither seems to be the case here.
Using mbtowc(3) is almost always preferable to using mbrtowc(3)
because the resulting code is much easier to read.
I'm delaying full review of the following bits until this is switched
to mbtowc(3).
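To illustrate, such a loop might look like the following untested
sketch; the function name and the calling convention are made up:

  #include <stdlib.h>
  #include <string.h>
  #include <wctype.h>

  /*
   * Copy up to buflen bytes from buf to rbuf, replacing invalid,
   * incomplete, and non-printable characters with U+FFFD.
   * The caller provides rbuf with room for three output bytes
   * per input byte, plus one byte for the terminating NUL.
   */
  void
  sanitize_utf8(const char *buf, size_t buflen, char *rbuf)
  {
          wchar_t wc;
          size_t i = 0, j = 0;
          int len;

          while (i < buflen) {
                  len = mbtowc(&wc, &buf[i], buflen - i);
                  if (len == -1) {
                          /* Invalid: reset the state, replace one byte. */
                          mbtowc(NULL, NULL, MB_CUR_MAX);
                          memcpy(&rbuf[j], "\xef\xbf\xbd", 3);
                          j += 3;
                          i++;
                  } else if (len == 0) {
                          /* NUL terminator found in the input. */
                          break;
                  } else if (iswprint(wc)) {
                          /* Valid and printable: copy through. */
                          memcpy(&rbuf[j], &buf[i], len);
                          j += len;
                          i += len;
                  } else {
                          /* Valid but non-printable: replace. */
                          memcpy(&rbuf[j], "\xef\xbf\xbd", 3);
                          j += 3;
                          i += len;
                  }
          }
          rbuf[j] = '\0';
  }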
> + case 0:
> + rbuf[j++] = '\0';
> + break;
> + case -1:
> + rbuf[j++] = '?';
You might want U+FFFD here, see above.
> + bzero(&mbs, sizeof(mbs));
> + break;
> + case -2:
> + break;
> + default:
It looks as if the printability check might be missing here.
Don't you need iswprint(3) before printing to the terminal,
and print U+FFFD instead if it fails?
> + memcpy(&(rbuf[j]), &(buf[start]), i -
> start + 1);
> + j += i - start + 1;
> + start = i + 1;
> + break;
> + }
> + }
> + } else {
> + for (j = i = 0; i < octetlength && i < buflen; i++) {
> + if (isu8cont(buf[i]))
> + continue;
This is weird.
So if the user sets LC_CTYPE=C, then UTF-8 continuation bytes are
silently skipped? Consider input like
printf "a\x80z"
That printing as "az" would seem unacceptable to me, shouldn't it
rather print "a?z"?
> + rbuf[j++] = isprint(buf[i]) ? buf[i] : '?';
That statement will handle bytes with the high bit set as non-printable
anyway, so I see no need for the isu8cont() test before it.
Also, why do you handle the ASCII case separately from the UTF-8
case at all? A typical mbtowc(3) loop works just fine for ASCII,
too. In rare cases, there may be a performance gain from handling
ASCII separately without mbtowc(3), but I don't assume that's the
case here.
In case you prefer printing "a?z" for a valid two-byte UTF-8 character
in ASCII output mode rather than "a??z", you can do the following:
inside the mbtowc(3) loop, when replacing a bad sequence, check
whether you are in ASCII mode (with MB_CUR_MAX == 1) and if so,
skip all subsequent isu8cont() bytes. But that's just a bit of
icing on the cake; one might argue it's a matter of taste whether
printing "a?z" or "a??z" is better.
Yours,
Ingo