Re: [Jsource] Problems dealing with UTF-8

bill lam Mon, 11 Jul 2016 18:31:34 -0700

What you asked for is not just about append but about promotion in
general. I think promotion should meet some requirements:

1 preserve semantics
2 same shape
3 round trip conversion

For promotion from integer to double:
1 still a number, the same number
2 same shape
3 demote to the same number
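
As a rough illustration of these three checks (sketched in Python, not J; the
helper names promote and demote are made up for this example, and the sample
values are assumed to be small enough to be exactly representable as doubles):

```python
# Hypothetical helpers modelling integer -> double promotion and back.
def promote(ints):
    """Promote each integer to a double (Python float)."""
    return [float(n) for n in ints]

def demote(floats):
    """Demote each double back to an integer."""
    return [int(x) for x in floats]

values = [0, -7, 2**40]
promoted = promote(values)

same_number = promoted == values              # 1: still the same numbers
same_shape = len(promoted) == len(values)     # 2: same shape
round_trip = demote(promoted) == values       # 3: demotes to the same numbers

# Caveat: a 64-bit integer above 2**53 may not survive the round trip.
lossy = int(float(2**53 + 1)) != 2**53 + 1
```

All three checks hold for the sample values; the last line shows where the
round trip stops being exact for very large integers.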

promotion from literal to wide using u: can satisfy these
requirements, whereas using 7&u: would break them.
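
To make the difference concrete, here is a rough model in Python (not J). It
assumes a literal can be modelled as bytes and a wide array as a list of
16-bit units; the function names are hypothetical, and the second function only
approximates what 7&u: does for valid UTF-8 input:

```python
# Model of byte-wise promotion (in the spirit of monadic u:):
# every byte becomes one 16-bit unit with the same value.
def promote_bytewise(lit):
    return list(lit)

# Demotion for the byte-wise case: valid only while every unit is below 256.
def demote_bytewise(wide):
    return bytes(wide)

# Model of UTF-8 aware promotion (in the spirit of 7&u:):
# decode the bytes as UTF-8 and return UTF-16 code units.
def promote_utf8(lit):
    data = lit.decode("utf-8").encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

lit = "é".encode("utf-8")            # two bytes: 0xC3 0xA9

bytewise = promote_bytewise(lit)     # [0xC3, 0xA9]: same length as lit
restored = demote_bytewise(bytewise) # identical to lit, so it round-trips
decoded = promote_utf8(lit)          # [0xE9]: one unit, length changed
```

The byte-wise result keeps the shape and demotes back to the original bytes.
The UTF-8 aware result has a different length, so requirement 2 no longer
holds, and the two bytes are reinterpreted as one character.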

Users should have the best knowledge of the domain/meaning/encoding
of the literals being used, and therefore should be responsible for
doing the conversion themselves.

On Mon, 11 Jul 2016, Don Guinn wrote:
> Thank you, Raul, I appreciate your questions on my reasoning. That said, I
> am quite pleased with J when working with Unicode. The tools provided make
> it easy.
> 
> 
> > Can you give me an example where this would give a different result from:
> >
> > append=: dyad define
> >   if. 131074 = x +&(3!:0) y do. x ,&(7&u:) y else. x, y end.
> > )
> >
> > For that matter, is there some reason you would not want to use
> > ,&(7&u:) if you are mixing utf-16 and utf-8 characters?
> >
> 
> Sorry for the delay in responding. This is how I was asking for the default
> to work. That is what I do when there is a possibility for UTF-8 to be in
> data. 7&u: is a pretty powerful verb.
> 
> 
> >
> > > However, I feel that the current standard of converting
> > > with u: monadic should not be allowed at all. It should be an error,
> > > period.
> >
> > Why is that?
> >
> > Is this because that is the only use you have? Is this because you
> > believe this would break no existing code? Or is this because you
> > believe that no one should ever use a 16 bit literal for non-unicode
> > data in J? (For example, when dealing with binary files representing
> > music, or for representing pixels?)
> >
> 
> I feel that the default action for combining char with wide, which assumes
> that data above 7f is not UTF-8, is no longer a good choice. Most of the
> time it is UTF-8.
> And I suspect that Unicode in the form of UTF-8 will grow. For that reason
> I felt that it should be the default action. However, if one should make a
> conscious decision as to how char maps to wide then there should be no
> default. Although browser data is really strange in how it supports
> Unicode. Fortunately most of that disappears before we see it in J.
> 
> 
> 
> >
> > > In the current world one never really can predict when some data may
> > > appear with UTF-8 characters unexpectedly. This would force manual
> > > conversion, ensuring that the proper conversion from char to wide as
> > > required by the application is done. Otherwise testing with only ASCII
> > > char would not catch the possible error.
> >
> > I feel that you have not encountered enough problems with "almost
> > utf-8 data", or "utf-8 data mixed in with other binary data in a file",
> > if you are saying stuff like this.
> >
> 
> True. Char and wide can be used for all sorts of things. Right now, wide
> being relatively new, I thought that it would not be used for other
> things. But 16 bit audio files do fit nicely in wide to save space.
> 
>  
> 
> >
> > For that matter, if by "manual conversion" you mean using 7&u: then I
> > do not see why this should be a problem.
> >
> > > It seems to me that automatic conversion from char to wide assuming
> > > UTF-8 is a proper choice now. It is possible that one could run into a
> > > need to leave the conversion as it is now, but where would that data
> > > come from?
> >
> > A file, most likely. Or a network stream.
> >
> > > And it would really be a pain to view given that J is so insistent on
> > > treating char as UTF-8 when displaying.
> >
> > Usually you convert such data to numbers (possibly hexadecimal) when
> > you want to inspect it. But you expect J to function in a transparent
> > and predictable fashion, to get there.
> >
> > > J automatically converts integer (64 bit) into float when it can cause a
> > > loss of accuracy and we accept that. How is this different?
> >
> > This conversion changes the shape of the data.
> >
> Yes it does. That is one of the major problems in dealing with UTF-8:
> alignment. Working with char containing UTF-8, where the number of bytes
> does not agree with the number of characters, is difficult. Converting to
> wide avoids these problems. When through, simply convert back.
> 
> 
> 
> > Thanks,
> >
> > --
> > Raul
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> >

-- 
regards,
====================================================
GPG key 1024D/4434BAB3 2008-08-24
gpg --keyserver subkeys.pgp.net --recv-keys 4434BAB3
gpg --keyserver subkeys.pgp.net --armor --export 4434BAB3
