On Sat, Sep 23, 2023 at 12:10:41PM +0200, Walter Alejandro Iglesias wrote:
> > On Thu, Sep 21, 2023 at 02:12:50PM +0200, Stefan Sperling wrote:
> > > Your implementation lacks proper bounds checking. It accesses
> > > s[i + 3] based purely on the contents of the input string, without
> > > checking whether len < i + 3. Entering the while (i != len) loop with
> 
> You surely meant "len > i + 3" (grater than).  The patch below is wrong.
> 
> I know it doesn't matter anymore but I'm still clarifying so that no one
> wastes time trying the patch.

The bounds checks that you added is not the correct way to improve this code.

There are a lot of potentially dangerous coding mistakes, mis-understandings
and bad style going on here.  Maybe we wouldn't have the first two, if it
wasn't for the last one.

The very specific issue with s[i + 3] that Stefan mentioned does not exist,
assuming that the code is compiled with a compliant C compiler, refer to
ISO 9899 6.5.13, the && operator creates a sequence point so the
s[i + 3] that Stefan is worried about will not be evaluated unless the
previous s[i + 2] comparisons are evaluated.  Those in turn will not be
evaluated unless the s[i + 1] is evalulated, which will not be evaluated
unless s[i] == 0xf0.

Therefore, entering the loop with i == len-1 and a specially crafted input
string should not be exploitable IFF s[len]==0, as this would fail the
s[i+1] >= 0x90 test.

Except that you don't strictly terminate the input at s[len].

You terminate the string at s[i] where is the value of i after the while
loop which terminates when you get EOF.  So it's _probably_ going to be
len, but there is no guarantee because getc can return EOF on a read error.

That would mean you terminate the string early, but still use the full
supplied value of len, (which ultimately came from st_size), when you
loop through the characters checking the validity of the UTF-8 stream.

So overall, you have now added error checking code for the wrong reasons
which either nobody noticed, or nobody bothered to point out possibly due
to the unclear style of coding making review more difficult.

Additionally, you have not addressed the more important issue of using
the value of i upon leaving the first loop as the offset for adding the
terminating byte, and the value of len for the actual termination
condition of the second loop.

That might be a useful technique in the underhanded C content, but not
so much in OpenBSD code :-) :-).

But joking aside, your claim that:

On Thu, Sep 21, 2023 at 09:01:25PM +0200, Walter Alejandro Iglesias wrote:
> Notice that you saw the issue in my code (bounds checking) at a first
> glance, that's because my code is neither too complicated (citrus) nor
> too elegant (tmux), hence by far easier to read, understand and debug.
> Among other things it deals with utf-8 without using wchar.h.

is not valid.  In fact, the opposite is true.  Your code is hard to
review, and that increases the chances that bugs will be missed.

To be fair, I wouldn't recommend learning from the code in utf8_isvalid()
in tmux either.  That's also not a great example, and absolutely not
appropriate for inclusion in mail to validate UTF-8 for the purposes
of adding or not adding a content encoding header.

As a final note, surely the solution here is not to add any parsing or
other intelligence to mail itself, but just to optionally allow mail to
call an external user-configurable program with the proposed mail body
and then have that program return a simple 0 or 1 to indicate whether
to include the UTF-8 content header or not?

That way, the user can chose their own level of parsing, or even opt to
have _all_ mail sent as UTF-8, (which is often fine, since 7-bit ASCII is
a valid UTF-8 stream).

I'll try to find time to write up some demo code showing reliable ways to
comprehensively parse UTF-8 streams in C.

Reply via email to