Hi Martijn,
sorry for the delay; i have now finally looked at the function
smi_displayhint_os() from the line "if (MB_CUR_MAX > 1) {" to the
end of the corresponding else clause. IIUC, what that code is
trying to do is iterate over the input buffer "const char *buf" of
length "size_t buflen".
Before starting, i will consider the sizing of rbuf. It is octetlength
plus strlen("STRING: ") plus 1 byte for the terminating NUL. Then
optionally, "STRING: " is copied into it, so octetlength plus one
byte for the terminating NUL remains.
However, if the "STRING: " is copied in, j is initialized to
strlen("STRING: ") and then later compared to octetlength which
remains unchanged. That feels odd. I would expect that if j starts
at strlen("STRING: "), octetlength should be increased accordingly,
or alternatively, if octetlength remains as it is, j should always
start at 0. Otherwise, the last eight bytes allocated always seem
to remain unused. Is the octetlength input parameter supposed to
include strlen("STRING: ") or not? [1]
What the else clause does (i.e. when the user selected the C locale)
seems to be this:
* j is unconditionally re-initialized to 0, so if "STRING: "
was written earlier, that gets overwritten. Is that
intentional? [2]
* UTF-8 starting bytes followed by another UTF-8 starting byte
are totally ignored. Is that intentional? It feels odd. [3]
* UTF-8 starting bytes followed by an ascii byte
are represented as ".". That seems intentional.
* UTF-8 starting bytes followed by at least one UTF-8 cont byte
are represented by "?". Up to FOUR cont bytes are totally
ignored, which seems quite odd because a UTF-8 sequence
can have at most THREE cont bytes (four bytes total including
the starting byte). [4] From the FIFTH following cont byte
onwards, "." is written for each cont byte.
Note that this is *not* trying to check whether the UTF-8 sequence
is valid (which wouldn't be trivial at all), but probably that
isn't required. It merely tries to check whether the sequence
is longer than the absolute maximum (and seems to allow one byte
too much unless i misread the code).
* A sequence "start + cont + ASCII + cont" prints "?a",
then totally ignores the second cont byte as if it were
a continuation of the initial start byte, despite the
intervening ASCII byte. [5]
The state machine doesn't appear to be fully thought out.
Did you prepare a list for yourself containing all the
possible states and all the edges of the machine?
* An initial cont byte, or one after start + ASCII, or one after
start + at least 4 cont + ASCII is represented by ".";
that seems intentional.
* In the for loop in the else clause, the "i < octetlength" feels
confusing to me. It cannot cause an overflow because in the
else clause, we always have j < i (caveat, this might only be
due to the possibly unintentional re-initializing of j to 0).
But assume an input buf containing lots of UTF-8 characters. In
that case, you abort processing long before the output rbuf is
full. Is that intentional? Either way, wouldn't "j < octetlength"
be both clearer and safer? [6]
* The input "UTF-8 start + ASCII, end of input" is represented
as "x." rather than ".x". [7]
So the idea probably is to print
* "." for clearly invalid bytes including non-printable ASCII bytes
* "?" for UTF-8 start bytes including following cont bytes that
_might_ form a valid UTF-8 sequence, accepting invalid sequences
that would be non-trivial to identify.
But it doesn't quite do that.
For the "if (MB_CUR_MAX > 1)" clause (i.e., when the user selected
a UTF-8 locale), it seems to do this:
* Embedded NUL bytes terminate processing, returning what was
decoded so far. Note that differs from what the ASCII clause
does, which represents NUL as "." and continues processing.
I don't think it can be intentional that processing stops
at different places depending on the user's locale. [8]
* The "case -1" clause obviously lacks a call
(void)mbtowc(NULL, NULL, MB_CUR_MAX);
see the mbtowc(3) manual page,
/usr/src/usr.bin/fmt/fmt.c for simple examples, or
/usr/src/usr.bin/ssh/utf8.c for a full-blown example. [9]
* In contrast to the ASCII clause, the UTF-8 clause does detect
all invalid UTF-8 sequences and writes a replacement character
for each invalid byte (rather than a replacement character
for groups of invalid bytes as the ASCII clause does).
I think this difference in behaviour makes sense because
the UTF-8 clause can easily detect such conditions while
the ASCII clause should better not try.
* For non-printable UTF-8 characters, it prints ".".
This feels inconsistent. We have:
  v-- input              ASCII    UTF-8   <- locale
  invalid byte           "."      REPLACEMENT
  non-printable char     "."      "."
  UTF-8 printable char   "?"      itself
  ASCII printable char   itself   itself
So judging from the UTF-8 case, "." means "valid but non-printable"
and something else is used for "invalid". Maybe for consistency,
it should behave similarly when the user selects an ASCII locale,
i.e. "." for "(potentially) valid but cannot be printed, either
not at all, or at least not in your locale" and "?" for "clearly
invalid", like this: [10]
  v-- input              ASCII    UTF-8   <- locale
  invalid byte           "?"      REPLACEMENT
  non-printable char     "."      "."
  UTF-8 printable char   "."      itself
  ASCII printable char   itself   itself
So i see ten potential inconsistencies in this function, marked
[1] to [10] above, that i think you should check.
Given that the UTF-8 handling appears to be non-trivial, i recommend
splitting it out into its own, self-contained function that can be
tested independently without needing the "textconv" type and that
reduces the indentation by one level, taking these arguments:
1. pointer to the output buffer: char *rbuf
2. max output string length to be put into rbuf
(not including the terminating NUL): size_t octetlength
3. input string: const char *buf
4. input string length in bytes: size_t buflen
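For concreteness, a prototype might look like this; the name and the
return value are only suggestions, not existing code:

```c
/*
 * Returns the string length written to rbuf,
 * not counting the terminating NUL.
 */
size_t
smi_displayhint_utf8(char *rbuf, size_t octetlength,
    const char *buf, size_t buflen);
```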
Maybe even put it into its own file for easier testing and to avoid
burdening the surrounding code with such a totally different topic?
Your call, of course.
Naming like obuf, olen, ibuf, ilen with indices oi and ii rather
than j and i might be clearer, but that's optional. Also, it should
be trivial to defer the malloc(3) until you know whether you actually
want "STRING: " and only allocate space for it when really needed.
For bonus points, return the string length written from the
inner pure UTF-8 function and realloc(3) down unless the
allocated space was really exhausted.
Pure code inspection so far, i didn't even try to compile.
I hope this helps and i didn't misunderstand your intentions
too badly... :-/
Yours,
Ingo