Re: [patch] Experiment with c++11 unicode strings
Guillaume Munch wrote:
> On 30/08/2016 at 21:12, Georg Baum wrote:
>> Guillaume Munch wrote:
>>
>>> * Why ascii_num_get_facet::do_get uses from_local8bit at some point.
>>
>> The encoding does not matter here: Our own numpunct_facet does not override truename() and falsename(), so std::numpunct::truename() and std::numpunct::falsename() are used. I don't know why Enrico chose from_local8bit and not from_ascii() here.
>>
>>> * Is the numpunct facet correctly set with the way it occurs in the code?
>>
>> I think so. Otherwise we would have the classical separator problem, e.g. outputting 2.3 as "2,3" when using a German locale. Do you see any specific problem?
>
> The custom facets circumvent using std::numpunct, e.g. ascii_num_get_facet converts from std::numpunct (which by the way is always the C locale so one could just use from_ascii("true"), etc.). This is the only place where there is a numpunct so I imagine that this is the one you are referring to with "our own numpunct".

Yes.

> Still, we cannot assume that std::numpunct is available. So either someone is confident that nothing else requires it, or it should be implemented like the other three facets.

I am confident that it is not required. Otherwise we would currently need a replacement implementation without your patch when using boost::uint32_t on mingw and cygwin. Besides, you implemented it already in char32_t_facets.h.

>> Unfortunately I cannot try the patch, since I get undefined references to the std::codecvt functions that we previously defined for gcc on Windows:
>
> Unlike std::codecvt, std::codecvt is part of the spec so it's quite a different situation... Does it match with https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66030#c7 ?

No. The bug is about gcc 5 (I am using gcc 4.9.2), and on Windows (it fails for me on Linux). Also, std::codecvt<...>::id is not among the missing symbols.

Finally I found some time to do more tests:

- Cross compiling with mingw gcc 4.9.3 works.
- Compiling with clang (linking against the same libstdc++ as gcc 4.9.2) works as well.
- Compiling with gcc 5.1 works also.

Quick tests with running the two latter versions did not show any problem, so it really looks like gcc 4.9.2 is buggy here.

Georg
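A minimal sketch of the "classical separator problem" and of the truename() point discussed above, using only the standard library. The facet comma_numpunct and the helper names are hypothetical and exist only for illustration; they are not the facets from LyX's docstream.cpp. The sketch assumes nothing beyond a C++11 compiler.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet, for illustration only: imitates a German-style
// locale by using ',' as the decimal point.
struct comma_numpunct : std::numpunct<char> {
protected:
	char do_decimal_point() const override { return ','; }
};

// Format a double through a stream imbued with the given locale.
std::string format_with(std::locale const & loc, double value)
{
	std::ostringstream os;
	os.imbue(loc);
	os << value;
	return os.str();
}

// The name used for "true" by std::numpunct<char> in the classic "C" locale.
std::string classic_truename()
{
	return std::use_facet<std::numpunct<char>>(std::locale::classic()).truename();
}
```

With the classic locale, format_with(std::locale::classic(), 2.3) gives "2.3", while a locale carrying comma_numpunct gives "2,3". This is why a stream that writes file formats should not pick up the user's numpunct.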
Re: [patch] Experiment with c++11 unicode strings
On 30/08/2016 at 21:12, Georg Baum wrote:
> Guillaume Munch wrote:
>
>> * Why ascii_num_get_facet::do_get uses from_local8bit at some point.
>
> The encoding does not matter here: Our own numpunct_facet does not override truename() and falsename(), so std::numpunct::truename() and std::numpunct::falsename() are used. I don't know why Enrico chose from_local8bit and not from_ascii() here.
>
>> * Is the numpunct facet correctly set with the way it occurs in the code?
>
> I think so. Otherwise we would have the classical separator problem, e.g. outputting 2.3 as "2,3" when using a German locale. Do you see any specific problem?

The custom facets circumvent using std::numpunct, e.g. ascii_num_get_facet converts from std::numpunct (which by the way is always the C locale so one could just use from_ascii("true"), etc.). This is the only place where there is a numpunct so I imagine that this is the one you are referring to with "our own numpunct".

Still, we cannot assume that std::numpunct is available. So either someone is confident that nothing else requires it, or it should be implemented like the other three facets.

> I do not trust versa_string, e.g. bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67557

You are right, it is more risky than I thought to use versa_string. Thanks also for the explanations about iconv.

>>> There is an additional conversion from char * to std::string. I was once told that this is less efficient (here on the LyX mailing list, but I forgot who wrote that), but I never tested that myself.
>>
>> Surely, but the combined amount of time and energy spent in this conversion, over all users' computers present and future, is surely less than the time and energy spent writing this reply...
>
> Well, having these changes combined with the string type changes in one patch causes all interested persons to spend more time. If you do not want comments on the empty_string changes then please do such changes in a separate patch next time (the string type changes can be done independently of removing empty_string() and empty_docstring()).

Don't get me wrong, I was not implying that the discussion is a loss of time. That would not be a clever way to try to make the point that the difference is negligible ;)

> Unfortunately I cannot try the patch, since I get undefined references to the std::codecvt functions that we previously defined for gcc on Windows:

Unlike std::codecvt, std::codecvt is part of the spec so it's quite a different situation... Does it match with https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66030#c7 ?
Re: [patch] Experiment with c++11 unicode strings
Guillaume Munch wrote:
> I made it work with libc++ too, which was not straightforward. In this case, the template is undefined and cannot be inherited from.
>
> Could you or somebody else please double-check the following, which I am not sure to understand correctly:
>
> * Why ascii_num_get_facet::do_get uses from_local8bit at some point.

The encoding does not matter here: Our own numpunct_facet does not override truename() and falsename(), so std::numpunct::truename() and std::numpunct::falsename() are used. I don't know why Enrico chose from_local8bit and not from_ascii() here.

> * Is the numpunct facet correctly set with the way it occurs in the code?

I think so. Otherwise we would have the classical separator problem, e.g. outputting 2.3 as "2,3" when using a German locale. Do you see any specific problem?

>> I would like it more simple as well, but it seems this is not easily possible without a C++11 compliant std::string.
>
> I agree for std::string, but for docstring I have demonstrated that it is very simple. Moreover I find it reassuring that versa_string is the precursor of the new basic_string.

If you look at docstring alone I agree, but if you look at LyX as a whole this would mean to use 3 string base classes (std::basic_string, ext::versa_string and lyx::trivial_string) instead of two (std::basic_string and lyx::trivial_string). This is more complicated IMHO. Besides that I do not trust versa_string, e.g. bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67557 is likely to exist in gcc older than 5.1. If trivstring had a (yet unknown) bug this would be annoying, but its usage is mostly limited to work done in threads (e.g. export), so data loss is not that likely. If we based our central workhorse docstring on a class which we do not know how widely it is used and what bugs might exist then I would be nervous.

> There is also a built-in conversion function between utf-8 and utf-32 in C++11 which could replace iconv. But I don't know whether there are clear advantages in replacing iconv. Iconv might still be needed for other encodings.

We definitely need other conversions. Even if we decided to drop support for non-utf8 LaTeX output (which we might do one day) we would still need it for importing .tex files.

> But then I don't understand why for from_local8bit it uses QString::fromLocal8Bit instead of iconv.

Because it is easier. If you want to use iconv, you need to know the name of the local encoding, and determining it is platform dependent. With Qt all this is nicely abstracted away.

>>>> The only thing I would do differently is the replacement of empty_string() and empty_docstring(): Those were introduced to avoid including <string>. Since <string> is now required, it would be better to initialize the default arguments with docstring() or std::string().
>>>
>>> There is not much of a difference from what I did, other than matters of taste, is there?
>>
>> There is an additional conversion from char * to std::string. I was once told that this is less efficient (here on the LyX mailing list, but I forgot who wrote that), but I never tested that myself.
>
> Surely, but the combined amount of time and energy spent in this conversion, over all users' computers present and future, is surely less than the time and energy spent writing this reply...

Well, having these changes combined with the string type changes in one patch causes all interested persons to spend more time. If you do not want comments on the empty_string changes then please do such changes in a separate patch next time (the string type changes can be done independently of removing empty_string() and empty_docstring()).

Unfortunately I cannot try the patch, since I get undefined references to the std::codecvt functions that we previously defined for gcc on Windows:

g++ -fPIC -O2 -std=c++14 -std=c++14 -fmessage-length=0 -g -Wunused-result -Wuninitialized -Winit-self -o lyx main.o BiblioInfo.o Box.o Compare.o Dimension.o EnchantChecker.o PersonalWordList.o LaTeXFonts.o PrinterParams.o Thesaurus.o liblyxcore.a liblyxmathed.a liblyxinsets.a frontends/liblyxfrontends.a frontends/qt4/liblyxqt4.a liblyxgraphics.a support/liblyxsupport.a -lmythes-1.2 -lenchant -lmagic -lz -lQt5Concurrent -lQt5Svg -lQt5Widgets -lQt5X11Extras -lQt5Gui -lQt5Core -lxcb
support/liblyxsupport.a(docstream.o):(.data.rel.ro._ZTVSt7codecvtIDic11__mbstate_tE[_ZTVSt7codecvtIDic11__mbstate_tE]+0x20): undefined reference to `std::codecvt<char32_t, char, __mbstate_t>::do_out(__mbstate_t&, char32_t const*, char32_t const*, char32_t const*&, char*, char*, char*&) const'
support/liblyxsupport.a(docstream.o):(.data.rel.ro._ZTVSt7codecvtIDic11__mbstate_tE[_ZTVSt7codecvtIDic11__mbstate_tE]+0x28): undefined reference to `std::codecvt<char32_t, char, __mbstate_t>::do_unshift(__mbstate_t&, char*, char*, char*&) const'
support/liblyxsupport.a(docstream.o):
Re: [patch] Experiment with c++11 unicode strings
On 25/08/2016 at 20:36, Georg Baum wrote:
> Guillaume Munch wrote:
>> On 22/08/2016 at 20:56, Georg Baum wrote:
>>> Our own facets work, and the implementation is confined to one file which nobody needs to look at (unless he wants to).

I made it work with libc++ too, which was not straightforward. In this case, the template is undefined and cannot be inherited from.

Could you or somebody else please double-check the following, which I am not sure to understand correctly:

* Why ascii_num_get_facet::do_get uses from_local8bit at some point.

* Is the numpunct facet correctly set with the way it occurs in the code?

Attached is the new version.

>> The point is to remove the need for trivdocstring entirely, so that we can already use docstring knowing it is thread-safe. This might not change the existing code, but could simplify the future one. (In fact I was unaware of trivstring and these threading issues until last week -- It's good to simplify the situation.)
>
> I would like it more simple as well, but it seems this is not easily possible without a C++11 compliant std::string.

I agree for std::string, but for docstring I have demonstrated that it is very simple. Moreover I find it reassuring that versa_string is the precursor of the new basic_string.

> Of course we would need some good testing, but it is good to know that there are currently no further known problems. Regarding the documentation I am not aware of anything else either.

There is also a built-in conversion function between utf-8 and utf-32 in C++11 which could replace iconv. But I don't know whether there are clear advantages in replacing iconv. Iconv might still be needed for other encodings. But then I don't understand why for from_local8bit it uses QString::fromLocal8Bit instead of iconv.

Then of course another thing to do is to remove all the now-superfluous from_ascii from the code.

>>> The only thing I would do differently is the replacement of empty_string() and empty_docstring(): Those were introduced to avoid including <string>. Since <string> is now required, it would be better to initialize the default arguments with docstring() or std::string().
>>
>> There is not much of a difference from what I did, other than matters of taste, is there?
>
> There is an additional conversion from char * to std::string. I was once told that this is less efficient (here on the LyX mailing list, but I forgot who wrote that), but I never tested that myself.

Surely, but the combined amount of time and energy spent in this conversion, over all users' computers present and future, is surely less than the time and energy spent writing this reply...

>>> And in a second step we should get rid of strfwd.h completely, the name would be wrong for all compilers then.
>>
>> I can also try to keep docstring forward-declared.
>
> This does not work with clang and libc++ AFAIK. I never liked the current mixture of forward declaration or full include depending on the used compiler (can lead to compile errors). Therefore using the full includes for all compilers is good IMHO.

OK.

From 541a3c85af9ba92f73743b072d5d7bf7b2348a16 Mon Sep 17 00:00:00 2001
From: Guillaume Munch
Date: Sat, 20 Aug 2016 16:27:52 +0100
Subject: [PATCH] typedef char32_t char_type; typedef std::u32string docstring;
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

char_type is now defined as char32_t and docstring is now defined as std::u32string on all platforms.

* Enable the use of Unicode literals, i.e. one can write:
    docstring s = U"Here is a Unicode literal☺";
    char_type ellipsis = U'\u2026';
    s == U"\u203D";
  etc.

* Remove empty_string() and empty_docstring() as they are now useless.

* Remove the now useless USE_WCHAR_T and all related code

* Note that gcc < 5.1 is not compliant with C++11 regarding thread-safety. It is still necessary to use trivstring and trivdocstring for thread-safety.

* Note that the C++11 standard does not require all facets for char32_t which are required for using char32_t streams (!). This patch reuses the ascii_*_facets from support/docstring.cpp.

* In addition, libstdc++ and libc++ have facets that cannot be derived from. A new configure test is introduced for these bugs.
---
 config/lyxinclude.m4            | 24 ++
 configure.ac                    | 20 -
 src/Buffer.h                    |  2 +-
 src/Encoding.cpp                |  4 +-
 src/Format.h                    |  2 +-
 src/LaTeX.cpp                   |  2 +-
 src/LaTeX.h                     |  8 +-
 src/LayoutFile.h                |  2 +-
 src/frontends/alert.h           |  9 +--
 src/frontends/qt4/qt_helpers.h  |  2 +-
 src/insets/InsetText.h          |  3 +-
 src/mathed/InsetMathHull.cpp    |  8 +-
 src/mathed/InsetMathUnknown.h   |  4 +-
 src/support/FileName.h          |  8 +-
 src/support/ForkedCalls.h       |  4 +-
Re: [patch] Experiment with c++11 unicode strings
Guillaume Munch wrote:
> On 22/08/2016 at 20:56, Georg Baum wrote:
>> Guillaume Munch wrote:
>>
>>> This is not the final version of the patch however because there is one big disappointment: the C++11 standard does not require several facets (including ctype) that are necessary to use stringstreams of char32_t. So these need to be defined by hand.
>>
>> Which is not a problem IMHO. Our own facets work, and the implementation is confined to one file which nobody needs to look at (unless he wants to).
>
> I was not sure because they were guarded by
>   #if !defined(USE_WCHAR_T) && defined(__GNUC__)
> and I was fearing it could have become dead code over time.

This combination is used on mingw and cygwin, so I am pretty confident that it works.

>> AFAIK we do currently not need string literals in combination with trivstring, so I do not see an advantage of replacing it with something else, especially looking at 4. IMHO we should simply keep trivstring until we can require gcc 5.1.
>
> The point is to remove the need for trivdocstring entirely, so that we can already use docstring knowing it is thread-safe. This might not change the existing code, but could simplify the future one. (In fact I was unaware of trivstring and these threading issues until last week -- It's good to simplify the situation.)

I would like it more simple as well, but it seems this is not easily possible without a C++11 compliant std::string.

>> Is there anything besides the facets that you think is not ready?
>
> * There could remain some bogus documentation and workarounds that escaped me (I could remove ones that were easy to find).
>
> * Other problems that would only appear at runtime (I learnt about facets the hard way...).

Of course we would need some good testing, but it is good to know that there are currently no further known problems. Regarding the documentation I am not aware of anything else either.

>> The only thing I would do differently is the replacement of empty_string() and empty_docstring(): Those were introduced to avoid including <string>. Since <string> is now required, it would be better to initialize the default arguments with docstring() or std::string().
>
> There is not much of a difference from what I did, other than matters of taste, is there?

There is an additional conversion from char * to std::string. I was once told that this is less efficient (here on the LyX mailing list, but I forgot who wrote that), but I never tested that myself.

>> And in a second step we should get rid of strfwd.h completely, the name would be wrong for all compilers then.
>
> I can also try to keep docstring forward-declared.

This does not work with clang and libc++ AFAIK. I never liked the current mixture of forward declaration or full include depending on the used compiler (can lead to compile errors). Therefore using the full includes for all compilers is good IMHO.

Georg
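To make the default-argument point concrete, here is a small sketch of the two styles being compared. It assumes the patch replaced empty_docstring() with an empty literal, which is what the remark about an "additional conversion" suggests but is not shown in this thread; both function names are hypothetical.

```cpp
#include <cassert>
#include <string>

typedef std::u32string docstring; // as proposed in the patch

// Hypothetical example functions, not taken from the LyX sources.

// Default built from a literal: goes through the converting constructor
// taking a char32_t const *.
docstring::size_type length_literal_default(docstring const & s = U"")
{
	return s.size();
}

// Default built with the default constructor: no conversion involved.
docstring::size_type length_ctor_default(docstring const & s = docstring())
{
	return s.size();
}
```

Both calls without arguments see an empty string, so the difference is only in how the temporary is constructed. Whether that cost is measurable was, as noted above, never tested.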
Re: [patch] Experiment with c++11 unicode strings
On 22/08/2016 at 20:56, Georg Baum wrote:
> Guillaume Munch wrote:
>
>> This is not the final version of the patch however because there is one big disappointment: the C++11 standard does not require several facets (including ctype) that are necessary to use stringstreams of char32_t. So these need to be defined by hand.
>
> Which is not a problem IMHO. Our own facets work, and the implementation is confined to one file which nobody needs to look at (unless he wants to).

I was not sure because they were guarded by
  #if !defined(USE_WCHAR_T) && defined(__GNUC__)
and I was fearing it could have become dead code over time.

There is also the option to test for an existing facet before replacing it. On the other hand I am wondering whether it's in fact good to define our own, so that there is no surprise of locale-specific behaviour that changes with the weather and the time of day.

> AFAIK we do currently not need string literals in combination with trivstring, so I do not see an advantage of replacing it with something else, especially looking at 4. IMHO we should simply keep trivstring until we can require gcc 5.1.

The point is to remove the need for trivdocstring entirely, so that we can already use docstring knowing it is thread-safe. This might not change the existing code, but could simplify the future one. (In fact I was unaware of trivstring and these threading issues until last week -- It's good to simplify the situation.)

> going from string to char * to string again can be a lossy conversion.

Right, I'll keep that in mind.

> To me the most important part (the first patch) looks very promising. Very nice work!

Thank you

> Is there anything besides the facets that you think is not ready?

* There could remain some bogus documentation and workarounds that escaped me (I could remove ones that were easy to find).

* Other problems that would only appear at runtime (I learnt about facets the hard way...).

> The only thing I would do differently is the replacement of empty_string() and empty_docstring(): Those were introduced to avoid including <string>. Since <string> is now required, it would be better to initialize the default arguments with docstring() or std::string().

There is not much of a difference from what I did, other than matters of taste, is there?

> And in a second step we should get rid of strfwd.h completely, the name would be wrong for all compilers then.

I can also try to keep docstring forward-declared.

Guillaume
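For readers unfamiliar with the C++11 Unicode literals that the first patch relies on, a minimal sketch follows. The two typedefs mirror the ones in the patch subject; the helper functions greeting() and ellipsis() are hypothetical and only there to make the values checkable.

```cpp
#include <cassert>
#include <string>

typedef char32_t char_type;       // as in the patch subject
typedef std::u32string docstring; // as in the patch subject

// Hypothetical helpers, for illustration only.

// A docstring written directly as a UTF-32 literal (ends with U+263A).
docstring greeting()
{
	return U"Here is a Unicode literal\u263A";
}

// A single code point written as a char32_t character literal.
char_type ellipsis()
{
	return U'\u2026';
}
```

Each element of such a string is one code point, so greeting().size() counts characters rather than bytes, and comparisons like s == U"\u203D" work without any from_ascii or from_utf8 call.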
Re: [patch] Experiment with c++11 unicode strings
Guillaume Munch wrote:
> Dear all,
>
> Here are a few patches proposing to improve the definitions in support/strfwd.h, results of my experiments.
>
> 1. Define docstring using the Unicode strings from C++11 (with char_type=char32_t). This allows us to write docstrings directly with the syntax U"". By extension this is necessary to have Unicode translation strings as discussed before. This is not the final version of the patch however because there is one big disappointment: the C++11 standard does not require several facets (including ctype) that are necessary to use stringstreams of char32_t. So these need to be defined by hand.

Which is not a problem IMHO. Our own facets work, and the implementation is confined to one file which nobody needs to look at (unless he wants to).

> I reused ones that are currently in support/docstring.cpp written about 10yrs ago but I wonder whether one cannot just copy ones for wchar_t from libstdc++ or libc++ (while being no expert on this matter). Help/opinions on this problem are welcome.

I would not try that. First you would need to be sure about the licensing (the difference to the stuff in 3rdparty is that it would not be optional, and the libstdc++ license changed from the time when we included the old stuff in strfwd.h), then you would need to check for compiler specific parts.

> 3. This patch addresses the issue of std::basic_string being thread-unsafe on gcc < 5.1 by noticing that the thread-safe implementation from gcc 5.1 is available for gcc >= 4.6 in <ext/vstring.h>. I think that this works well.

AFAIK we do currently not need string literals in combination with trivstring, so I do not see an advantage of replacing it with something else, especially looking at 4. IMHO we should simply keep trivstring until we can require gcc 5.1.

> 4. This patch is just to show what it would involve to completely get rid of trivstring.h in favour of <ext/vstring.h>. Given the result I do not recommend its inclusion because it adds a lot of noise with the explicit conversions.

They are not only noise, going from string to char * to string again can be a lossy conversion. We _should_ not use embedded 0 characters in strings, but I would not bet on that.

> The patches can be applied independently. (For 3. this requires a small adaptation but docstring can be made thread-safe independently of whether Unicode strings are used.)

To me the most important part (the first patch) looks very promising. Very nice work! Is there anything besides the facets that you think is not ready?

The only thing I would do differently is the replacement of empty_string() and empty_docstring(): Those were introduced to avoid including <string>. Since <string> is now required, it would be better to initialize the default arguments with docstring() or std::string(). And in a second step we should get rid of strfwd.h completely, the name would be wrong for all compilers then.

Georg
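The lossy conversion mentioned above can be shown in a few lines. This is a sketch with a hypothetical helper name, round_trip(), and it uses plain std::string rather than any LyX string class.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: go from string to char const * and back again.
// The constructor taking a char const * stops at the first 0 character,
// so anything after an embedded 0 is silently dropped.
std::string round_trip(std::string const & s)
{
	return std::string(s.c_str());
}
```

For example, a five-character string built as std::string("ab\0cd", 5) comes back from round_trip() as "ab", while a string without embedded 0 characters survives unchanged.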
[patch] Experiment with c++11 unicode strings
Dear all,

Here are a few patches proposing to improve the definitions in support/strfwd.h, results of my experiments.

1. Define docstring using the Unicode strings from C++11 (with char_type=char32_t). This allows us to write docstrings directly with the syntax U"". By extension this is necessary to have Unicode translation strings as discussed before. This is not the final version of the patch however because there is one big disappointment: the C++11 standard does not require several facets (including ctype) that are necessary to use stringstreams of char32_t. So these need to be defined by hand. I reused ones that are currently in support/docstring.cpp written about 10yrs ago but I wonder whether one cannot just copy ones for wchar_t from libstdc++ or libc++ (while being no expert on this matter). Help/opinions on this problem are welcome.

3. This patch addresses the issue of std::basic_string being thread-unsafe on gcc < 5.1 by noticing that the thread-safe implementation from gcc 5.1 is available for gcc >= 4.6 in <ext/vstring.h>. I think that this works well.

4. This patch is just to show what it would involve to completely get rid of trivstring.h in favour of <ext/vstring.h>. Given the result I do not recommend its inclusion because it adds a lot of noise with the explicit conversions.

The patches can be applied independently. (For 3. this requires a small adaptation but docstring can be made thread-safe independently of whether Unicode strings are used.)

Guillaume

From 7ca4a7383c4a77d85996d77dbb3b9f4110a83cc5 Mon Sep 17 00:00:00 2001
From: Guillaume Munch
Date: Sat, 20 Aug 2016 16:27:52 +0100
Subject: [PATCH 1/4] typedef char32_t char_type; typedef std::u32string docstring;
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

char_type is now defined as char32_t and docstring is now defined as std::u32string on all platforms.

* Enable the use of Unicode literals, i.e. one can write:
    docstring s = U"Here is a Unicode literal☺";
    char_type ellipsis = U'\u2026';
    s == U"\u203D";
  etc.

* Remove empty_string() and empty_docstring() as they are now useless.

* Remove the now useless USE_WCHAR_T and all related code

* Note that gcc < 5.1 is not compliant with C++11 regarding thread-safety. It is still necessary to use trivstring and trivdocstring for thread-safety.

* Note that the C++11 standard does not require all facets for char32_t which are required for using char32_t streams (!). This patch reuses the ascii_*_facets from support/docstring.cpp.
---
 configure.ac                          |  20 --
 src/Buffer.h                          |   2 +-
 src/Encoding.cpp                      |   4 +-
 src/Format.h                          |   2 +-
 src/LaTeX.cpp                         |   2 +-
 src/LaTeX.h                           |   8 +--
 src/LayoutFile.h                      |   2 +-
 src/frontends/alert.h                 |   9 ++-
 src/frontends/qt4/qt_helpers.h        |   2 +-
 src/insets/InsetText.h                |   3 +-
 src/mathed/InsetMathHull.cpp          |   8 +--
 src/mathed/InsetMathUnknown.h         |   4 +-
 src/support/FileName.h                |   8 +--
 src/support/ForkedCalls.h             |   4 +-
 src/support/Makefile.am               |   1 -
 src/support/Systemcall.h              |   6 +-
 src/support/docstream.cpp             |  90 -
 src/support/docstream.h               |  15 -
 src/support/docstring.cpp             |  22 ---
 src/support/docstring.h               |   5 ++
 src/support/lstrings.cpp              |  17 +
 src/support/numpunct_lyx_char_type.h  |  58 
 src/support/os.h                      |   2 +-
 src/support/strfwd.h                  | 124 +++
 src/tex2lyx/Preamble.cpp              |   6 +-
 25 files changed, 94 insertions(+), 330 deletions(-)
 delete mode 100644 src/support/numpunct_lyx_char_type.h

diff --git a/configure.ac b/configure.ac
index 75df6c7..4b609ad 100644
--- a/configure.ac
+++ b/configure.ac
@@ -133,22 +133,6 @@ LYX_CHECK_CALLSTACK_PRINTING
 # C++14 only
 LYX_CHECK_DEF(make_unique, memory, [using std::make_unique;])
 
-# Needed for our char_type
-AC_CHECK_SIZEOF(wchar_t)
-
-# Taken from gettext, needed for libiconv
-AC_CACHE_CHECK([for wchar_t], [gt_cv_c_wchar_t],
-  [AC_TRY_COMPILE([#include
-  wchar_t foo = (wchar_t)'\0';], ,
-  [gt_cv_c_wchar_t=yes], [gt_cv_c_wchar_t=no])])
-if test $gt_cv_c_wchar_t = yes; then
-  AC_DEFINE([HAVE_WCHAR_T], [1], [Define if you have the 'wchar_t' type.])
-  HAVE_WCHAR_T=1
-else
-  HAVE_WCHAR_T=0
-fi
-AC_SUBST([HAVE_WCHAR_T])
-
 # Needed for Mingw-w64
 AC_TYPE_LONG_LONG_INT
 if test "$ac_cv_type_long_long_int" = yes; then
@@ -330,10 +314,6 @@ char * strerror(int n);
 # endif
 #endif
 
-#if defined(HAVE_WCHAR_T) && SIZEOF_WCHAR_T == 4
-# define USE_WCHAR_T
-#endif
-
 #ifdef HAVE_LONG_LONG_INT
 #if SIZEOF_LONG_LONG > SIZEOF_LONG
 #define LYX_USE_LONG_LONG
diff --git a/src/Buffer.h