Re: [Jsource] Problems dealing with UTF-8

Don Guinn Mon, 11 Jul 2016 17:38:17 -0700

Thank you, Raul, I appreciate your questions on my reasoning. That said, I
am quite pleased with J when working with Unicode. The tools provided make
it easy.



> Can you give me an example where this would give a different result from:
>
> append=: dyad define
>   if. 131074 = x +&(3!:0) y do. x ,&(7&u:) y else. x, y end.
> )
>
> For that matter, is there some reason you would not want to use
> ,&(7&u:) if you are mixing utf-16 and utf-8 characters?
>

Sorry for the delay in responding. This is how I was asking for the default
to work. That is what I do when there is a possibility for UTF-8 to be in
data. 7&u: is a pretty powerful verb.
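
A minimal sketch of the difference between plain append and appending through 7&u:, assuming a J session where wide is the 16-bit literal type. The names s and w are made up for illustration and do not come from the thread.

```j
NB. Sketch only; s and w are made-up names, not from the thread.
s =: 'caf' , 195 169 { a.   NB. 'café' as UTF-8: 5 bytes in a char array
w =: u: 8364                NB. euro sign as one wide character

3!:0 s                      NB. 2      (char)
3!:0 w                      NB. 131072 (wide)

NB. Plain append promotes each byte of s to its own wide character,
NB. so the two bytes of the é stay as two separate characters.
3 u: s , w                  NB. 99 97 102 195 169 8364

NB. Converting both sides with 7&u: decodes the UTF-8 first.
3 u: s ,&(7&u:) w           NB. 99 97 102 233 8364
```

With pure ASCII in s both expressions give the same code points, which is why testing with ASCII alone would not show the difference.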


>
> > However, I feel that the current standard of converting
> > with u: monadic should not be allowed at all. It should be an error
> > period.
>
> Why is that?
>
> Is this because that is the only use you have? Is this because you
> believe this would break no existing code? Or is this because you
> believe that no one should ever use a 16 bit literal for non-unicode
> data in J? (For example, when dealing with binary files representing
> music, or for representing pixels?)
>

I feel that the current default for combining char with wide, which assumes
that char data above 7f is not UTF-8, is no longer a good choice. Most of the
time it is UTF-8, and I suspect that the use of Unicode in the form of UTF-8
will grow. For that reason I felt that UTF-8 should be the default. However,
if one ought to make a conscious decision about how char maps to wide, then
there should be no default at all. Browser data, by the way, is really
strange in how it supports Unicode. Fortunately most of that disappears
before we see it in J.



>
> > In the current world one never really can predict when some data may
> > appear with UTF-8 characters unexpectedly. This would force manual
> > conversion, ensuring that the proper conversion from char to wide as
> > required by the application is done. Otherwise testing with only ASCII
> > char would not catch the possible error.
>
> I feel that you have not encountered enough problems with "almost
> utf-8 data", or "utf-8 data mixed in with other binary data in a file"
> if you are saying stuff like this.
>

True. Char and wide can be used for all sorts of things. Right now, wide
being relatively new, I thought that it would not be used for other things.
But 16 bit audio files do fit nicely in wide to save space.

 

>
> For that matter, if by "manual conversion" you mean using 7&u: then I
> do not see why this should be a problem.
>
> > It seems to me that automatic conversion from char to wide assuming
> > UTF-8 is a proper choice now. It is possible that one could run into a
> > need to leave the conversion as it is now, but where would that data
> > come from?
>
> A file, most likely. Or a network stream.
>
> > And it would really be a pain to view given that J is so insistent to
> > treat char as UTF-8 when displaying.
>
> Usually you convert such data to numbers (possibly hexadecimal) when
> you want to inspect it. But you expect J to function in a transparent
> and predictable fashion, to get there.
>
> > J automatically converts integer (64 bit) into float when it can cause a
> > loss of accuracy and we accept that. How is this different?
>
> This conversion changes the shape of the data.
>

Yes it does. That is one of the major problems with UTF-8: alignment.
Working with char containing UTF-8, where the number of bytes does not agree
with the number of characters, is difficult. Converting to wide avoids these
problems. When finished, simply convert back.
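
A small sketch of the byte count versus character count problem and the round trip through wide, assuming 7 u: and 8 u: as the conversions in each direction. The name s is made up for illustration.

```j
NB. Sketch only; s is a made-up name.
s =: 'caf' , 195 169 { a.   NB. 'café' as UTF-8 bytes
# s                         NB. 5 bytes
# 7 u: s                    NB. 4 characters once converted to wide

NB. Reversing the bytes splits the two-byte é and leaves invalid UTF-8;
NB. reversing the wide form and converting back with 8 u: does not.
a. i. |. s                  NB. 169 195 102 97 99
a. i. 8 u: |. 7 u: s        NB. 195 169 102 97 99

s -: 8 u: 7 u: s            NB. 1: the round trip restores the bytes
```

Any verb that indexes, reverses, or counts items runs into the same issue on the byte form, so doing the work on the wide form and converting back at the end keeps each character as one item.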



> Thanks,
>
> --
> Raul
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
