Hi Martijn,
sorry for the delay; i have now finally looked at the function
smi_displayhint_os() from the line "if (MB_CUR_MAX > 1) {" to the
end of the corresponding else clause. IIUC, what that code is
trying to do is iterate over the input buffer "const char *buf" of
length "size_t buflen".
Before starting, i will consider the sizing of rbuf. It is octetlength
plus strlen("STRING: ") plus 1 byte for the terminating NUL. Then
optionally, "STRING: " is copied into it, so octetlength plus one
byte for the terminating NUL remains.
However, if the "STRING: " is copied in, j is initialized to
strlen("STRING: ") and then later compared to octetlength which
remains unchanged. That feels odd. I would expect that if j starts
at strlen("STRING: "), octetlength should be increased accordingly,
or alternatively, if octetlength remains as it is, j should always
start at 0. Otherwise, the last eight bytes allocated always seem
to remain unused. Is the octetlength input parameter supposed to
include strlen("STRING: ") or not? [1]
What the else clause does (i.e. when the user selected the C locale)
seems to be this:
* j is unconditionally re-initialized to 0, so if "STRING: "
was written earlier, that gets overwritten. Is that
intentional? [2]
* UTF-8 starting bytes followed by another UTF-8 starting byte
are totally ignored. Is that intentional? It feels odd. [3]
* UTF-8 starting bytes followed by an ascii byte
are represented as ".". That seems intentional.
* UTF-8 starting bytes followed by at least one UTF-8 cont byte
are represented by "?". Up to FOUR cont bytes are totally
ignored, which seems quite odd because a UTF-8 sequence
can have at most THREE cont bytes (four bytes total including
the starting byte). [4] From the FIFTH following cont byte
onwards, "." is written for each cont byte.
Note that this is *not* trying to check whether the UTF-8 sequence
is valid (which wouldn't be trivial at all), but probably that
isn't required. It merely tries to check whether the sequence
is longer than the absolute maximum (and seems to allow one byte
too much unless i misread the code).
* A sequence "start + cont + ASCII + cont" prints "?a",
then totally ignores the second cont byte as if it were
a continuation of the initial start byte, despite the
intervening ASCII byte. [5]
The state machine doesn't appear to be fully thought out.
Did you prepare a list for yourself containing all the
possible states and all the edges of the machine?
* An initial cont byte, or one after start + ASCII, or one after
start + at least 4 cont + ASCII is represented by ".";
that seems intentional.
* In the for loop in the else clause, the "i < octetlength" feels
confusing to me. It cannot cause an overflow because in the
else clause, we always have j < i (caveat, this might only be
due to the possibly unintentional re-initializing of j to 0).
But assume an input buf containing lots of UTF-8 characters. In
that case, you abort processing long before the output rbuf is
full. Is that intentional? Either way, wouldn't "j < octetlength"
be both clearer and safer? [6]
* The input "UTF-8 start + ASCII, end of input" is represented
as "x." rather than ".x". [7]
So the idea probably is to print
* "." for clearly invalid bytes including non-printable ASCII bytes
* "?" for UTF-8 start bytes including following cont bytes that
_might_ form a valid UTF-8 sequence, accepting invalid sequences
that would be non-trivial to identify.
But it doesn't quite do that.
For the "if (MB_CUR_MAX > 1)" clause (i.e., when the user selected
a UTF-8 locale), it seems to do this:
* Embedded NUL bytes terminate processing, returning what was
decoded so far. Note that differs from what the ASCII clause
does, which represents NUL as "." and continues processing.
I don't think it can be intentional that processing stops
at different places depending on the user's locale. [8]
* The "case -1" clause obviously lacks a call
(void)mbtowc(NULL, NULL, MB_CUR_MAX);
see the mbtowc(3) manual page,
/usr/src/usr.bin/fmt/fmt.c for simple examples, or
/usr/src/usr.bin/ssh/utf8.c for a full-blown example. [9]
* In contrast to the ASCII clause, the UTF-8 clause does detect
all invalid UTF-8 sequences and writes a replacement character
for each invalid byte (rather than a replacement character
for groups of invalid bytes as the ASCII clause does).
I think this difference in behaviour makes sense because
the UTF-8 clause can easily detect such conditions while
the ASCII clause should better not try.
* For non-printable UTF-8 characters, it prints ".".
This feels inconsistent. We have:
  v-- input              ASCII    UTF-8   <- locale
  invalid byte           "."      REPLACEMENT
  non-printable char     "."      "."
  UTF-8 printable char   "?"      itself
  ASCII printable char   itself   itself
So judging from the UTF-8 case, "." means "valid but non-printable"
and something else is used for "invalid". Maybe for consistency,
it should behave similarly when the user selects an ASCII locale,
i.e. "." for "(potentially) valid but cannot be printed, either
not at all, or at least not in your locale" and "?" for "clearly
invalid", like this: [10]
  v-- input              ASCII    UTF-8   <- locale
  invalid byte           "?"      REPLACEMENT
  non-printable char     "."      "."
  UTF-8 printable char   "."      itself
  ASCII printable char   itself   itself
So i see ten potential inconsistencies in this function, marked
[1] to [10] above, that i think you should check.
Given that the UTF-8 handling appears to be non-trivial, i recommend
splitting it out into its own, self-contained function that can be
tested independently without needing the "textconv" type and that
reduces the indentation by one level, taking these arguments:
1. pointer to the output buffer: char *rbuf
2. max output string length to be put into rbuf
(not including the terminating NUL): size_t octetlength
3. input string: const char *buf
4. input string length in bytes: size_t buflen
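For concreteness, a prototype might look like this; the name and the
return value are only suggestions, not existing code:

```c
/*
 * Returns the string length written to rbuf,
 * not counting the terminating NUL.
 */
size_t
smi_displayhint_utf8(char *rbuf, size_t octetlength,
    const char *buf, size_t buflen);
```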
Maybe even put it into its own file for easier testing and to avoid
burdening the surrounding code with such a totally different topic?
Your call, of course.
Naming like obuf, olen, ibuf, ilen with indices oi and ii rather
than j and i might be clearer, but that's optional. Also, it should
be trivial to defer the malloc(3) until you know whether you actually
want "STRING: " and only allocate space for it when really needed.
For bonus points, return the string length written from the
inner pure UTF-8 function and realloc(3) down unless the
allocated space was really exhausted.
Pure code inspection so far, i didn't even try to compile.
I hope this helps and i didn't misunderstand your intentions
too badly... :-/
Yours,
Ingo