Re: multibyte support (round 4) - tr

2018-02-02 Thread Pádraig Brady
On 31/01/18 15:33, Assaf Gordon wrote:
> Hello,
> 
> On 2018-01-30 09:10 AM, Sebastian Kisela wrote:
>>>  The patch is getting too big to attach, so it is available here:
>>>  [...]
>>>  (perhaps a non-master branch on the savannah git would be better?)
>>
>> Yes that would be nice, if that is not too problematic.
> 
> I'm inclined to do so as well.
> Any objections from others to creating a non-master branch
> dedicated to multibyte efforts on the official GNU git repository?

+1

thanks,
Pádraig




Re: multibyte support (round 4) - tr

2018-01-31 Thread Assaf Gordon

Hello,

On 2018-01-30 09:10 AM, Sebastian Kisela wrote:

>> The patch is getting too big to attach, so it is available here:
>>  [...]
>> (perhaps a non-master branch on the savannah git would be better?)
>
> Yes that would be nice, if that is not too problematic.


I'm inclined to do so as well.
Any objections from others to creating a non-master branch
dedicated to multibyte efforts on the official GNU git repository?


> I tried the `tr` part of the patch and the tests passed well.


Thank you for testing and reporting back.



> I am not sure I understand this correctly, but the patch makes wide
> use of the wchar_t type. From what I have understood so far, that is
> risky when a script on cygwin (or a similar platform) tries to
> translate a character which takes more than 2 bytes.


That is very true, and therefore the implementation is partial at best.

Especially given recent discussion here:
https://lists.gnu.org/archive/html/coreutils/2018-01/msg00035.html

Complete multibyte support in 'tr' will require a better implementation
(possibly something using 'mbbuffer' like the other programs in the patch).

> Since most of the characters ever translated will probably take no more
> than 2 bytes (which is the most important case in my opinion), do I
> understand correctly that wider characters are not considered so far?
>
> Example of a problematic use case:
> (Georgian letter AEN)
> printf '\xe1\x83\xbd' | src/tr '[:lower:]' '[:upper:]'


Please note a subtle but important issue:

The cygwin/wchar_t/UTF-16 limitation is not about how many bytes
the encoded multibyte character occupies, but about whether its decoded
Unicode codepoint is larger than 65535 (U+FFFF), in which case it does
not fit in a 16-bit wchar_t.

In your example, the UTF-8 encoding of "GEORGIAN LETTER AEN"
is indeed 3 bytes: 0xE1 0x83 0xBD.
But it encodes the Unicode codepoint U+10FD (decimal 4349),
which fits in 16 bits without a problem.
Cygwin should be able to handle that character.

The problem in cygwin would happen for characters whose Unicode
codepoint is above 65535 (also known as characters outside the "Basic
Multilingual Plane").


For example, the character "SMILING FACE WITH SUNGLASSES" is encoded
in UTF-8 as 4 bytes: 0xF0 0x9F 0x98 0x8E.
This encodes the Unicode value U+1F60E (decimal 128526), which does not
fit in 16 bits.
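
To make the distinction between encoded length and codepoint value
concrete, here is a minimal sketch of a UTF-8 decoder together with the
16-bit check. The names utf8_decode and fits_in_16bit_wchar are
hypothetical helpers written for illustration only; they are not taken
from the patch, and the decoder deliberately skips the stricter checks
(overlong forms, encoded surrogates) that a real one would need.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): decode one UTF-8 sequence
   from S (N bytes available) into *CP.  Returns the number of bytes
   consumed, or 0 if the sequence is malformed or truncated.
   Overlong forms and encoded surrogates are NOT rejected here.  */
static size_t
utf8_decode (const unsigned char *s, size_t n, uint32_t *cp)
{
  size_t len;
  uint32_t c;

  if (n == 0)
    return 0;

  /* The lead byte determines the sequence length and the first bits.  */
  if (s[0] < 0x80)
    { len = 1; c = s[0]; }
  else if ((s[0] & 0xE0) == 0xC0)
    { len = 2; c = s[0] & 0x1F; }
  else if ((s[0] & 0xF0) == 0xE0)
    { len = 3; c = s[0] & 0x0F; }
  else if ((s[0] & 0xF8) == 0xF0)
    { len = 4; c = s[0] & 0x07; }
  else
    return 0;

  if (n < len)
    return 0;

  /* Each continuation byte (10xxxxxx) contributes six more bits.  */
  for (size_t i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;
      c = (c << 6) | (s[i] & 0x3F);
    }

  *cp = c;
  return len;
}

/* True if CP can be stored in a single 16-bit wchar_t, i.e. it lies
   inside the Basic Multilingual Plane.  */
static bool
fits_in_16bit_wchar (uint32_t cp)
{
  return cp <= 0xFFFF;
}
```

With the two byte sequences above, the 3-byte input decodes to U+10FD
and passes the check, while the 4-byte input decodes to U+1F60E and
fails it.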

In cygwin such input would be returned as two 16-bit wchar_t values
(i.e. mbrtowc(3) needs to be called twice): the first is 0xD83D and the
second is 0xDE0E.
The application (e.g. 'tr') would then need to merge these two UTF-16
surrogates into one Unicode value.
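
The merge step itself is plain UTF-16 arithmetic. A minimal sketch
follows; merge_surrogates and the two range predicates are hypothetical
names used for illustration only, not functions from the patch or from
gnulib.

```c
#include <stdbool.h>
#include <stdint.h>

/* High (leading) surrogates occupy U+D800..U+DBFF.  */
static bool
is_high_surrogate (uint16_t u)
{
  return u >= 0xD800 && u <= 0xDBFF;
}

/* Low (trailing) surrogates occupy U+DC00..U+DFFF.  */
static bool
is_low_surrogate (uint16_t u)
{
  return u >= 0xDC00 && u <= 0xDFFF;
}

/* Hypothetical helper (not from the patch): combine a UTF-16 surrogate
   pair into one Unicode codepoint.  Returns UINT32_MAX if HI and LO do
   not form a valid pair.  */
static uint32_t
merge_surrogates (uint16_t hi, uint16_t lo)
{
  if (!is_high_surrogate (hi) || !is_low_surrogate (lo))
    return UINT32_MAX;

  /* Each surrogate carries 10 bits; the pair encodes cp - 0x10000.  */
  uint32_t high_bits = (uint32_t) (hi - 0xD800) << 10;
  uint32_t low_bits = (uint32_t) (lo - 0xDC00);
  return 0x10000 + (high_bits | low_bits);
}
```

For the pair above, merge_surrogates (0xD83D, 0xDE0E) gives 0x1F60E.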


---

Regarding the issue of which characters are most important - I think we
should aim to support all characters, not just the Basic Multilingual
Plane. Especially with the proliferation of emoji and new fancy
characters - those may well be used quite often in the future.


regards,
 - assaf


Re: multibyte support (round 4) - tr

2018-01-30 Thread Sebastian Kisela
Hi!


> The patch is getting too big to attach, so it is available here:
> https://files.housegordon.org/src/coreutils-multibyte-2017-12-11.patch.xz
> (perhaps a non-master branch on the savannah git would be better?)
>

Yes that would be nice, if that is not too problematic.

I tried the `tr` part of the patch and the tests passed well.

I am not sure I understand this correctly, but the patch makes wide
use of the wchar_t type. From what I have understood so far, that is
risky when a script on cygwin (or a similar platform) tries to
translate a character which takes more than 2 bytes.

Since most of the characters ever translated will probably take no more
than 2 bytes (which is the most important case in my opinion), do I
understand correctly that wider characters are not considered so far?

Example of a problematic use case:
(Georgian letter AEN)
printf '\xe1\x83\xbd' | src/tr '[:lower:]' '[:upper:]'

Thanks!

Best regards,
Sebastian.


Re: multibyte support (round 4) - tr

2017-12-23 Thread Assaf Gordon

Hello,

More progress on tr with multibyte support, available here:
https://files.housegordon.org/src/coreutils-multibyte-2017-12-23.patch.xz

translation (mostly) working:

   $ echo abcdefg | ./src/tr 'abcd' 'αβγδ'
   αβγδefg

   $ echo '1234 ABCD ΨΔΩΣ *$%()' \
       | ./src/tr -c '[:alpha:][:cntrl:]' 'Ψ'
   ΨABCDΨΨΔΩΣΨΨ

   $ echo 'ααα' | ./src/tr -s 'β' 'χ'
   αααχ

   $ echo 'aAbBcC ✀  χΧλΛσΣ' | ./src/tr '[:lower:]' '[:upper:]'
   AABBCC ✀  ΧΧΛΛΣΣ
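
For readers unfamiliar with the wide-character API, the case-conversion
example above can be approximated with standard C functions alone. This
is only a sketch under stated assumptions, not the code in the patch:
mb_toupper is a hypothetical helper, and it assumes every character fits
in a single wchar_t, which does not hold on platforms with a 16-bit
wchar_t.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not from the patch): upper-case the multibyte
   string IN (LEN bytes) into OUT (capacity OUTSZ bytes), using the
   current locale.  Returns the number of bytes written, or (size_t) -1
   on invalid or truncated input, or if OUT is too small.
   Assumes one wchar_t per character (no UTF-16 surrogate pairs).  */
static size_t
mb_toupper (const char *in, size_t len, char *out, size_t outsz)
{
  mbstate_t in_state, out_state;
  size_t used = 0;

  memset (&in_state, 0, sizeof in_state);
  memset (&out_state, 0, sizeof out_state);

  while (len > 0)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, in, len, &in_state);
      if (n == (size_t) -1 || n == (size_t) -2)
        return (size_t) -1;     /* invalid or incomplete sequence */
      if (n == 0)
        n = 1;                  /* an embedded NUL byte */

      /* Convert the character and encode it back to multibyte form.  */
      char buf[MB_LEN_MAX];
      size_t m = wcrtomb (buf, towupper (wc), &out_state);
      if (m == (size_t) -1 || m > outsz - used)
        return (size_t) -1;

      memcpy (out + used, buf, m);
      used += m;
      in += n;
      len -= n;
    }

  return used;
}
```

A caller would first run setlocale (LC_ALL, "") so that UTF-8 input is
decoded according to the user's locale; without that call only ASCII
letters are converted reliably.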


The current implementation could be a starting point for
testing and discussing specific edge cases (some tests are already
included).

It is not tuned for efficiency (neither the implementation nor its
run-time performance).

There is a lot of code duplication, because the entire current
unibyte code path is kept intact.



Comments welcome.
 - assaf