Re: [X10-users] Char literals and Unicode

Jeff Sweeney Wed, 25 Aug 2010 09:12:42 -0700

I do not have an immediate need for Unicode Char literals. I just want
to nudge you in the right direction :). We write statistical software
with a heavy UI component. We need to revisit our statistical
algorithms to work in parallel computing environments. X10 is one good
option for doing this. I don't see this happening quickly or without
missteps. My hope is that our efforts mature along with those of the
X10 working group.

Generally, the issue with Char literals is a codepage problem and has
little to do with concurrency and parallel computing. So, it may not
be part of your core mandate. However, it is an extremely important
issue for production code and I believe your intent is to encourage
use of X10 in production environments. I cannot over-stress how
important codepage problems become. My rough estimate is that a third
of the bug-fixing for our venerable flagship application is related to
codepage and transcoding issues.

First, X10 Char and String hide the internal representation of
characters and strings, so programmers need not care whether it is
UTF-8, UTF-16BE, etc. These implementations count characters, not
bytes of storage. That is a good thing. Eventually, I expect
X10 will be augmented to provide rudimentary transcoding between
Unicode and other codepages. Or better, that task could be delegated
to other components such as ICU, leaving X10 String as a
simple container for Unicode.
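
To make the characters-versus-bytes distinction concrete, here is a
minimal sketch in Java (not X10). The class and method names are
hypothetical and exist only for illustration. It counts Unicode code
points rather than calling String.length(), because Java's length()
counts UTF-16 code units.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names and sample text are illustrative only.
class CharsVersusBytes {
    // Number of Unicode code points, independent of any storage encoding.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes needed to store the string as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to store the string as UTF-16BE.
    static int utf16beBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // "naive" with a diaeresis on the i, written with a Unicode
        // escape so this source file stays plain ASCII.
        String s = "na\u00efve";
        System.out.println("code points:    " + codePoints(s));
        System.out.println("UTF-8 bytes:    " + utf8Bytes(s));
        System.out.println("UTF-16BE bytes: " + utf16beBytes(s));
    }
}
```

The character count is the same whatever the storage encoding; only
the byte counts differ, which is the property a String abstraction
should expose.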

The issue with Char literals is somewhat separate and easier to
address. It depends on the source code parser. For Java, the source
code parser can read source in an ANSI codepage, UTF-8, UTF-16BE, etc.
The Java compiler has a switch (javac -encoding) that identifies the
codepage of the source. With UTF-8, for example, the source can be
plain ASCII (because ASCII is a subset of UTF-8) or it can contain
Char and String literals with unescaped Unicode characters. This is
very useful when string literals are localized. The strings are
readable in their native languages and not a jumble of escape
sequences. This makes it much easier to catch and fix mistakes.
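
Why the parser has to be told the source codepage can be sketched in
Java as follows. This is only a simulation under stated assumptions:
the class and method names are hypothetical, and the byte array stands
in for a literal as it would sit in a UTF-8 source file. Decoding the
same bytes with the wrong codepage silently yields different
characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it simulates a parser reading source bytes.
class SourceEncodingDemo {
    // Decode bytes as a parser would when told the source is UTF-8.
    static String decodeUtf8(byte[] sourceBytes) {
        return new String(sourceBytes, StandardCharsets.UTF_8);
    }

    // Decode the same bytes as a parser would when told ISO-8859-1.
    static String decodeLatin1(byte[] sourceBytes) {
        return new String(sourceBytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e, as intended by the author.
        String intended = "caf\u00e9";
        // The bytes a UTF-8 editor would write to disk for that literal.
        byte[] onDisk = intended.getBytes(StandardCharsets.UTF_8);

        String right = decodeUtf8(onDisk);
        String wrong = decodeLatin1(onDisk);

        // Correct codepage: the literal round-trips unchanged.
        System.out.println(right.equals(intended));
        // Wrong codepage: the two UTF-8 bytes of the accented letter
        // become two separate characters, so the literal grows by one.
        System.out.println(wrong.equals(intended));
        System.out.println(right.length() + " vs " + wrong.length());
    }
}
```

No error is raised in the wrong-codepage case, which is why this class
of bug tends to surface late.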

With regard to C/C++, there seems to be little that can be done to fix
them at this stage. We consider std::string to be a misnamed container
for immutable byte arrays with many equally misnamed and some dubious
methods. (Ditto for std::wstring.) This is something we constantly
reinforce and yet it still causes problems when programmers forget
it. We rely heavily on ICU for codepage support.

With all that, we also interface with native file systems, consoles,
and other applications on the ten or so platforms we support. As you
can imagine, Unicode support varies widely so we often transcode among
various ANSI and Unicode codepages.

        Jeff Sweeney

> It's both.  We have not implemented Unicode support in C++, and the
> parser does not yet understand Unicode in literals and identifiers.

> There are plans to reimplement x10.lang.String in X10, which would
> let us pick an encoding for Strings and the representation of Chars
> that is independent of that in Java.  We believe that UTF-8 is such
> an encoding for Strings, and restricting current Strings to ASCII
> will let us later add such support in a backward-compatible manner.

> Jeff, as Vijay asked, is there an immediate need for Unicode support,
> or are you simply curious?
>         Igor

> Nate Nystrom <n...@nanocow.com> wrote on 08/24/2010 08:24:18 PM:

> > For compatibility with Java, wouldn't we support Unicode rather than
> > ASCII?  I think maybe we don't support Unicode because of the C++
> > translation (representing Char as a C++ char).  Or perhaps it's just
> > that the parser was never implemented to support Unicode.
> > 
> > Nate
> > 
> > 
> > On Tue, Aug 24, 2010 at 19:18, Vijay Saraswat <vi...@saraswat.org> wrote:
> > > Indeed, currently Char is so restricted. The primary reason is
> > > compatibility with Java, so that x10.lang.String can essentially be
> > > implemented as java.lang.String.
> > >
> > > It does make sense to have a "RichString/RichChar" class as well which
> > > supports UTF-8. Is there some particular interest in getting
> > > this done sooner rather than later...?
> > >
> > > Best,
> > > Vijay
> > >
> > > Jeff Sweeney wrote:
> > >> I am reading the X10 Specification and it seems Char literals are
> > >> restricted to ASCII. Is that correct and if so why?
> -- 
> Igor Peshansky  (note the spelling change!)
> IBM T.J. Watson Research Center
> X10: Parallel Productivity and Performance (http://x10-lang.org/)
> XJ: No More Pain for XML's Gain (http://www.research.ibm.com/xj/)
> "I hear and I forget.  I see and I remember.  I do and I understand" -- Confucius