Re: multibyte support (round 3)
Hello, Inching forward: attached the unorm+multibyte patch, with initial fold(1) multibyte support. (haven't yet address the issues in http://lists.gnu.org/archive/html/coreutils/2016-10/msg1.html but I hope to do that soon). -assaf i18n-2017-04-04.patch.xz Description: Binary data
Re: multibyte support (round 3)
Hello Pádraig, Thanks for the review! > On Sep 19, 2016, at 09:25, Pádraig Brady wrote: > > A general point is I'm still of the opinion that > it would be better to have all conversion and checking > in unorm(1), thus simplifying/optimizing the checking/processing > in all other utils. It would be good to get an idea > of performance/overhead as the patches progress. In the latest code, the multibyte parsing in 'mbbuffer' does no conversion (except UTF-16 surrogate in Cygwin). It calls mbrtowc(3) and returns the result. Error checking is done, because it was my understanding that the coreutils program should never reject invalid sequences - instead they must be processed as single octets and output as-is. The added complexity in 'mbbuffer' is mainly reading from the file and ensuring there are always enough input octets for mbrtowc (this is very similar to the implementation in the current i18n patch). I will try to compile some performance measurements soon. > I'm thinking this could be merged in the next major > version of coreutils, which would come after the > next minor release which hopefully will be released > in the next few weeks. Thank you - obviously there is no rush, and this will need a lot more testing. > A couple of points on very quick review: > > is_utf8_locale_name() > In gnulib, "UTF-8" is commented as the only > variant that needs to be checked from the > return of locale_charset() > > is_valid_mb_character() > There can invalid characters in many single byte locales: > For example 81,8d,90,9d,93 in cp1252, as shown by: > recode -lh cp1252 | grep -C1 -F " " > > mbtowc_utf16() > The assert() triggers -Werror=suggest-attribute=noreturn > when not on cygwin, so it's better to avoid compiling > that function altogether on other platforms. I.E. put > #ifdef HAVE_UTF16_SURROGATES around the function definition > (and declaration to get a compile error rather than a link error) > All good points, I'll fix them. regards, - assaf
Re: multibyte support (round 3)
On 19/09/16 07:11, Assaf Gordon wrote: > Hello, > > Updated patch attached. > > Improvements from last time ( > http://lists.gnu.org/archive/html/coreutils/2016-09/msg00011.html ): > > 1. 'multibyte' and 'mbbuffer' are in gl/ , behave more like gnulib modules. > Tests cover all items mentioned in Markus Kuhn's UTF-8 decoder page > (https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt). > > 2. cygwin/UTF-16 surrogates are handled transparently in 'mbbuffer'. > Applications under cygwin see 'ucs4_t' and don't need to worry about > surrogates (but, wcwidth() will present some problem). Tests ensure parsing > under cygwin behaves like other systems. > > 3. 'cut' supports multibyte '-c' and '-n -b' (but not multibyte '-d' yet). > Some tests included. Very nice work, especially on the tests. A general point is I'm still of the opinion that it would be better to have all conversion and checking in unorm(1), thus simplifying/optimizing the checking/processing in all other utils. It would be good to get an idea of performance/overhead as the patches progress. I'm thinking this could be merged in the next major version of coreutils, which would come after the next minor release which hopefully will be released in the next few weeks. A couple of points on very quick review: is_utf8_locale_name() In gnulib, "UTF-8" is commented as the only variant that needs to be checked from the return of locale_charset() is_valid_mb_character() There can invalid characters in many single byte locales: For example 81,8d,90,9d,93 in cp1252, as shown by: recode -lh cp1252 | grep -C1 -F " " mbtowc_utf16() The assert() triggers -Werror=suggest-attribute=noreturn when not on cygwin, so it's better to avoid compiling that function altogether on other platforms. I.E. put #ifdef HAVE_UTF16_SURROGATES around the function definition (and declaration to get a compile error rather than a link error) thanks! Pádraig