Re: mklocale(1): TODIGIT > 255 clobbers type and width data

Ingo Schwarze Mon, 02 May 2016 13:23:07 -0700

Hi,

Andrew Fresh wrote on Mon, May 02, 2016 at 11:55:52AM -0700:
> On Mon, May 02, 2016 at 08:34:43PM +0200, Ingo Schwarze wrote:

>> The following patch fixes the bug by limiting digit values to
>> the range 0x00 to 0xff that our LC_CTYPE file format can actually
>> store.

> Does this fix mean that these numbers (like the super cool Aegean
> numerals) and the more likely (I guess?) used Ancient Greek numbers
> and roman numerals can still not be used in calculations?

Well, how would you convert Unicode non-arabic numbers to a numeric
variable in the first place?  The wcsto*(3) functions don't handle
that; look at libc/locale/_wcstol.h and libc/locale/wctoint.h.  Nor do
the *wscanf(3) %d, %i, %o, %u, and %x conversions; look at CT_INT in
libc/stdio/vfwscanf.c.  Grepping for "re_map" and "re_rune_types"
across libc/locale shows that the C library provides no read access
to the lower byte storing the numerical value.

> As you probably noticed, there are several languages that seem to have a
> character for 1000, but I don't know how likely those are to be used
> in OpenBSD and if so, whether they will use those characters and expect
> them to work as numbers.
> 
> (I, for one, think this is almost as nice as scientific notation and way
> cooler)
> https://en.wikipedia.org/wiki/Aegean_numerals
> 
> Would it be "better" to comment out these TODIGIT entries in
> en_US.UTF-8.src noting that our tools currently don't support it?

Well, we could delete TODIGIT support outright.  We don't use it for
anything in the first place; the only effect it had during the first
ten years of its life was causing this long-lived and elusive bug.

Should I send a patch instead to kill it completely?
It doesn't really look that urgent to me,
and later cleanup is likely to get round to it anyway.

> Possibly making it a validation failure instead of forcing to 0xff
> in mklocale?

Well, gen_ctype_utf8.pl and en_US.UTF-8.src aren't really incorrect,
these characters really have these meanings.  So why should these
tools complain?

The point is really that we don't care all that much about advanced
Unicode semantics, and that mklocale(1) shouldn't screw up.

> This might make for better errors.

Who will look at errors from gen_ctype_utf8.pl and mklocale(1)
except you and me, and what are people supposed to do about
warning messages from these tools?  I don't really see the point.
Why warn about edge cases for something that is completely
unimplemented in the first place?

Todd Miller wrote:

> Wouldn't it be better to just ignore values > 0xff instead of
> clamping to 0xff?

It doesn't really matter, the value is unused in the first place.
Putting 0xff at least signals "big digit" to people looking at it
with gdb(1).  Ignoring it would put 0x00 instead, which really makes
no functional difference.  A nice bikeshed, so colourful.

The point that matters is that this irrelevant information must
not clobber character type data and width data.  In the worst
case, this could cause vulnerabilities, and it is quite hard
to evaluate what the effects of corrupt type data may be.
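
To make the clobbering concrete, here is a minimal sketch in C.  It is
not the mklocale(1) code: the flag names, the bit positions and the
three helper functions are made up for illustration.  The only thing
taken from the discussion above is that the digit value lives in the
lower byte of a word that also carries type and width bits.

```c
#include <stdint.h>

/*
 * Hypothetical layout, for illustration only: the low byte of a
 * 32-bit word holds the digit value, the bits above it hold flags.
 * The real LC_CTYPE bit assignments may differ.
 */
#define DIGIT_MASK	0x000000ffU
#define FLAG_ALPHA	0x00000100U	/* hypothetical type flag */
#define FLAG_CNTRL	0x00000200U	/* hypothetical type flag */
#define FLAG_DIGIT	0x00000400U	/* hypothetical type flag */

/* Buggy variant: a digit value above 0xff spills into the flag bits. */
static uint32_t
pack_unchecked(uint32_t flags, uint32_t digit)
{
	return flags | digit;
}

/* Clamp to what the low byte can store, as the patch does. */
static uint32_t
pack_clamped(uint32_t flags, uint32_t digit)
{
	if (digit > DIGIT_MASK)
		digit = DIGIT_MASK;
	return flags | digit;
}

/* Alternative: drop values that do not fit, leaving 0x00. */
static uint32_t
pack_ignored(uint32_t flags, uint32_t digit)
{
	if (digit > DIGIT_MASK)
		digit = 0;
	return flags | digit;
}
```

With this assumed layout, pack_unchecked(FLAG_DIGIT, 1000) yields
0x7e8, so the two hypothetical flags FLAG_ALPHA and FLAG_CNTRL get set
by accident.  pack_clamped() yields 0x4ff and pack_ignored() yields
0x400; both leave the flag bits alone, which is the part that matters.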

Yours,
  Ingo
