https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85824

            Bug ID: 85824
           Summary: regex constructor crashes under UTF-8 locale on
                    Solaris-sparc when parsing a simple character class
           Product: gcc
           Version: 4.9.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wanyingloo at gmail dot com
  Target Milestone: ---

$ cat test.cpp
#include <locale.h>
#include <regex>

int main (int argc, char *argv[]) {
    setlocale(LC_ALL, "");
    std::regex("[0-9]");
}

$ echo $LANG
en_US.UTF-8

$ g++ -std=c++11 test.cpp

$ ./a.out 
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::append
Abort (core dumped)

$ uname -a
SunOS t-solaris11sparc-02 5.11 11.3 sun4v sparc sun4v Solaris

$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/tmp/wluo/othello/solaris-sparc-packages/bin/../libexec/gcc/sparc-sun-solaris2.10/4.9.2/lto-wrapper
Target: sparc-sun-solaris2.10
Configured with: ../gcc-4.9.2/configure --prefix=/usr
--with-local-prefix=/usr/local --enable-languages=c,c++ --disable-nls
--disable-lto --enable-clocale=generic
--with-stage1-ldflags='-L/data00/builds/trprince/platform-packages-build/idir/solaris-sparc/stage1-packages/lib
-static-libgcc -static-libstdc++ -laio -lmd'
--with-boot-ldflags='-L/data00/builds/trprince/platform-packages-build/idir/solaris-sparc/stage1-packages/lib
-static-libgcc -static-libstdc++ -laio -lmd' --disable-werror
--with-libiconv-prefix=/data00/builds/trprince/platform-packages-build/idir/solaris-sparc/stage1-packages
--with-gnu-ld --with-gnu-as --disable-multiarch --disable-bootstrap
Thread model: posix
gcc version 4.9.2 (GCC) 


I can't reproduce this on Linux with the same GCC version. After some
investigation, the cause appears to be that the regex compiler doesn't account
for the implementation-defined behavior of strxfrm(). I ran the following test
on the same Solaris SPARC machine.

$ cat more_test.cpp 
#include <locale.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

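int main (int argc, char *argv[]) {
    setlocale(LC_ALL, "");
    /* A lone 0x80 byte; the cast avoids a narrowing conversion. */
    char a[] = { (char)0x80, '\0' };
    /* Clear errno first so the value printed below comes from strxfrm(). */
    errno = 0;
    printf("%zu\n", strxfrm(NULL, a, 0));
    printf("%s\n", strerror(errno));
}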

$ g++ -std=c++11 -w more_test.cpp 

$ ./a.out 
4294967295
Illegal byte sequence


In libstdc++-v3/include/bits/locale_classes.tcc, do_transform() is defined as
follows:

    do_transform(const _CharT* __lo, const _CharT* __hi) const
    {
...
              size_t __res = _M_transform(__c, __p, __len);
...
              __ret.append(__c, __res);


When _M_transform() calls strxfrm() on 0x80 under the UTF-8 locale on Solaris
SPARC, strxfrm() fails and returns (size_t)-1. That value is stored in __res,
of type size_t, as a very large number (4294967295 in the test above), so
__ret.append(__c, __res) throws std::length_error. It would be nice if the
code checked errno and issued a better error message than the one above.
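
To illustrate the failure mode in isolation, here is a minimal sketch. The
helper append_throws_length_error() is hypothetical and is not libstdc++ code;
it only mimics the append step of do_transform() with a caller-supplied
length, so the effect of passing (size_t)-1 can be seen without involving the
locale at all.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, not libstdc++ code: mimics the append step in
// do_transform() with a length supplied by the caller.
bool append_throws_length_error(const char* src, std::size_t res)
{
    std::string ret;
    try {
        // A length above max_size() is rejected with std::length_error.
        ret.append(src, res);
    } catch (const std::length_error&) {
        return true;
    }
    return false;
}
```

Calling it with static_cast<std::size_t>(-1) as the length should take the
same std::length_error path as the abort shown above, while a length that
matches the source buffer appends normally.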

When I ran this test on a Linux machine with GCC 4.9.2, glibc's strxfrm()
converted 0x80 to 6 bytes. I tend to think the Solaris SPARC libc behavior
makes more sense here, since 0x80 on its own is a continuation byte and not a
valid UTF-8 sequence, even though it's a valid UTF-8 code unit. I have no idea
why glibc converts it to 6 bytes. In any event, how strxfrm() handles 0x80
under UTF-8 is implementation-defined, and I'm not sure do_transform()
accounts for that. At the very least, it could be more defensive by checking
errno.
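
As a rough sketch of what "more defensive" could look like, the wrapper below
clears errno, asks strxfrm() for the required length, and refuses to continue
if errno was set or the result is (size_t)-1. The name checked_strxfrm() and
its bool/out-parameter interface are assumptions made for illustration; this
is not a proposed patch to do_transform().

```cpp
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical wrapper, not part of libstdc++: transforms s with strxfrm()
// and reports failure instead of trusting the return value blindly.
bool checked_strxfrm(const char* s, std::string& out)
{
    errno = 0;
    // With a size of 0 the destination may be null; only the length is
    // computed.
    std::size_t len = std::strxfrm(nullptr, s, 0);
    if (errno != 0 || len == static_cast<std::size_t>(-1))
        return false;

    std::string buf(len + 1, '\0');   // room for the terminating null
    errno = 0;
    std::size_t res = std::strxfrm(&buf[0], s, buf.size());
    if (errno != 0 || res >= buf.size())
        return false;

    buf.resize(res);
    out.swap(buf);
    return true;
}
```

A caller would check the return value and raise a descriptive error on
failure, rather than passing an unchecked length on to append().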
