Hi Martin, Martin Natano wrote on Thu, Nov 13, 2014 at 10:49:48AM +0100:
> While reading preconv.c two peculiarities catched my eye: > > 1. The preconv_encode() function does manual byteswapping where none is > necessary (and harmful). While the bit-shifting used to construct the > value in 'accum' might seem endian specific on first sight, it is not. > The bug manifests itself like this (\xc3\xbc being the utf8 sequence for > umlaut u): > > $ printf '.TH UMLAUT 7 2014-11-05\n.SH SYNOPSIS\n\xc3\xbc\n'|mandoc -Tutf8 > mandoc: <stdin>:3:1: WARNING: invalid escape sequence: \[uFC00000 > ... > > The correct unicode codepoint would be U+00FC, not U+FC000000. 'accum' > already contained the correct value _before_ byteswapping. Code review and testing confirmed that this analysis is completely correct. So i have committed this part of the patch. > 2. The code in preconv_encode() assumes 'accum' to be a 32 bit integer, > but the type declared is 'unsigned int'. While int is 32 bit on all > platforms that OpenBSD supports, it is not guaranteed by the standard. It is true that neither the ANSI C89 nor the ISO C99 standard requires more than 16 bit, but Henri Kemppainen correctly pointed out to me in private mail that POSIX requires at least 32 bit. Thanks for that cluestick. The mandoc toolbox is targetting POSIX platforms. If somebody wants to port it to some non-POSIX platform - which didn't happen so far, unless i missed something - those adaptions should be done out of the OpenBSD tree and not encumber the mainline code. > I think the correct type to use would therefore be uint_least32_t. Consequently, that abomination is not needed. > Below is a diff to fix both issues (tested on macppc). Thanks for reporting the issue and providing the patch! Yours, Ingo
