https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85824
Bug ID: 85824
Summary: regex constructor crashes under UTF-8 locale on Solaris-sparc when parsing a simple character class
Product: gcc
Version: 4.9.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: wanyingloo at gmail dot com
Target Milestone: ---

$ cat test.cpp
#include <locale.h>
#include <regex>

int main (int argc, char *argv[])
{
  setlocale(LC_ALL, "");
  std::regex("[0-9]");
}

$ echo $LANG
en_US.UTF-8

$ g++ -std=c++11 test.cpp
$ ./a.out
terminate called after throwing an instance of 'std::length_error'
  what(): basic_string::append
Abort (core dumped)

$ uname -a
SunOS t-solaris11sparc-02 5.11 11.3 sun4v sparc sun4v Solaris

$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/tmp/wluo/othello/solaris-sparc-packages/bin/../libexec/gcc/sparc-sun-solaris2.10/4.9.2/lto-wrapper
Target: sparc-sun-solaris2.10
Configured with: ../gcc-4.9.2/configure --prefix=/usr --with-local-prefix=/usr/local --enable-languages=c,c++ --disable-nls --disable-lto --enable-clocale=generic --with-stage1-ldflags='-L/data00/builds/trprince/platform-packages-build/idir/solaris-sparc/stage1-packages/lib -static-libgcc -static-libstdc++ -laio -lmd' --with-boot-ldflags='-L/data00/builds/trprince/platform-packages-build/idir/solaris-sparc/stage1-packages/lib -static-libgcc -static-libstdc++ -laio -lmd' --disable-werror --with-libiconv-prefix=/data00/builds/trprince/platform-packages-build/idir/solaris-sparc/stage1-packages --with-gnu-ld --with-gnu-as --disable-multiarch --disable-bootstrap
Thread model: posix
gcc version 4.9.2 (GCC)

I can't reproduce it on Linux using the same GCC version. I did some investigation and it seems to be because regex compiler doesn't account for implementation-defined behavior of strxfrm(). I ran the following test on the same Solaris SPARC machine.

$ cat more_test.cpp
#include <locale.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main (int argc, char *argv[])
{
  setlocale(LC_ALL, "");
  char a[] = { 0x80, '\0' };
  printf("%lu\n", strxfrm(NULL, a, 0));
  printf("%s\n", strerror(errno));
}

$ g++ -std=c++11 -w more_test.cpp
$ ./a.out
4294967295
Illegal byte sequence

In libstdc++-v3/include/bits/locale_classes.tcc, do_transform() is defined as follows:

do_transform(const _CharT* __lo, const _CharT* __hi) const
{
  ...
  size_t __res = _M_transform(__c, __p, __len);
  ...
  __ret.append(__c, __res);

When _M_transform() calls strxfrm() and gets -1 back for 0x80 under the UTF-8 locale on Solaris SPARC, the -1 is stored in __res, which has type size_t, so it becomes a very large number (4294967295 in the output above). __ret.append(__c, __res) is then asked to append that many characters, which throws the std::length_error shown in the crash. I think it would be nice if the code checked errno and issued a better error message than the one above.

When I ran this test on a Linux machine with GCC 4.9.2, glibc's strxfrm() converted 0x80 to 6 bytes. I tend to think Solaris SPARC's libc behavior makes more sense here, since 0x80 on its own isn't a valid UTF-8 sequence (it is a continuation byte) even though it's a valid UTF-8 code unit. I have no idea why glibc converts it to 6 bytes.

In any event, how strxfrm() converts 0x80 under UTF-8 is implementation-defined, and I'm not sure do_transform() accounts for that. At the very least, I think it could be more defensive by checking errno.
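
To make the suggestion concrete, here is a minimal sketch of the kind of defensive check I have in mind. It is not the libstdc++ code and not a proposed patch: checked_transform() and throw_xfrm_error() are hypothetical names, the helper only handles a std::string without embedded NUL characters, and it relies on the POSIX advice to set errno to 0 before calling strxfrm() and inspect it afterwards, since no return value is reserved for errors.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: report a strxfrm() failure with a readable message.
static void throw_xfrm_error(int err)
{
  std::string msg = "strxfrm() failed";
  if (err != 0)
    {
      msg += ": ";
      msg += std::strerror(err);
    }
  throw std::runtime_error(msg);
}

// Hypothetical helper, for illustration only (not part of libstdc++).
// Returns the strxfrm() sort key of `in` under the current C locale, or
// throws std::runtime_error instead of passing a bogus length along.
// Assumption: `in` contains no embedded NUL characters.
std::string checked_transform(const std::string& in)
{
  const char* src = in.c_str();
  const std::size_t failed = static_cast<std::size_t>(-1);

  // First pass: ask for the required length only.
  errno = 0;
  const std::size_t needed = std::strxfrm(nullptr, src, 0);
  if (errno != 0 || needed == failed)
    throw_xfrm_error(errno);

  // Second pass: transform into a buffer with room for the terminator.
  std::vector<char> buf(needed + 1);
  errno = 0;
  const std::size_t written = std::strxfrm(buf.data(), src, buf.size());
  if (errno != 0 || written >= buf.size())
    throw_xfrm_error(errno);

  return std::string(buf.data(), written);
}
```

With a check like this, the 0x80 input from more_test.cpp would presumably end up as an exception carrying the "Illegal byte sequence" text on this Solaris machine, rather than as a std::length_error from basic_string::append. Whether throwing is the right response inside do_transform() is a separate question that I'm not trying to settle here.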