On Sat, Sep 23, 2023 at 12:10:41PM +0200, Walter Alejandro Iglesias wrote: > > On Thu, Sep 21, 2023 at 02:12:50PM +0200, Stefan Sperling wrote: > > > Your implementation lacks proper bounds checking. It accesses > > > s[i + 3] based purely on the contents of the input string, without > > > checking whether len < i + 3. Entering the while (i != len) loop with > > You surely meant "len > i + 3" (grater than). The patch below is wrong. > > I know it doesn't matter anymore but I'm still clarifying so that no one > wastes time trying the patch.
The bounds checks that you added is not the correct way to improve this code. There are a lot of potentially dangerous coding mistakes, mis-understandings and bad style going on here. Maybe we wouldn't have the first two, if it wasn't for the last one. The very specific issue with s[i + 3] that Stefan mentioned does not exist, assuming that the code is compiled with a compliant C compiler, refer to ISO 9899 6.5.13, the && operator creates a sequence point so the s[i + 3] that Stefan is worried about will not be evaluated unless the previous s[i + 2] comparisons are evaluated. Those in turn will not be evaluated unless the s[i + 1] is evalulated, which will not be evaluated unless s[i] == 0xf0. Therefore, entering the loop with i == len-1 and a specially crafted input string should not be exploitable IFF s[len]==0, as this would fail the s[i+1] >= 0x90 test. Except that you don't strictly terminate the input at s[len]. You terminate the string at s[i] where is the value of i after the while loop which terminates when you get EOF. So it's _probably_ going to be len, but there is no guarantee because getc can return EOF on a read error. That would mean you terminate the string early, but still use the full supplied value of len, (which ultimately came from st_size), when you loop through the characters checking the validity of the UTF-8 stream. So overall, you have now added error checking code for the wrong reasons which either nobody noticed, or nobody bothered to point out possibly due to the unclear style of coding making review more difficult. Additionally, you have not addressed the more important issue of using the value of i upon leaving the first loop as the offset for adding the terminating byte, and the value of len for the actual termination condition of the second loop. That might be a useful technique in the underhanded C content, but not so much in OpenBSD code :-) :-). But joking aside, your claim that: On Thu, Sep 21, 2023 at 09:01:25PM +0200, Walter Alejandro Iglesias wrote: > Notice that you saw the issue in my code (bounds checking) at a first > glance, that's because my code is neither too complicated (citrus) nor > too elegant (tmux), hence by far easier to read, understand and debug. > Among other things it deals with utf-8 without using wchar.h. is not valid. In fact, the opposite is true. Your code is hard to review, and that increases the chances that bugs will be missed. To be fair, I wouldn't recommend learning from the code in utf8_isvalid() in tmux either. That's also not a great example, and absolutely not appropriate for inclusion in mail to validate UTF-8 for the purposes of adding or not adding a content encoding header. As a final note, surely the solution here is not to add any parsing or other intelligence to mail itself, but just to optionally allow mail to call an external user-configurable program with the proposed mail body and then have that program return a simple 0 or 1 to indicate whether to include the UTF-8 content header or not? That way, the user can chose their own level of parsing, or even opt to have _all_ mail sent as UTF-8, (which is often fine, since 7-bit ASCII is a valid UTF-8 stream). I'll try to find time to write up some demo code showing reliable ways to comprehensively parse UTF-8 streams in C.