Re: String Theory

2005-03-28 Thread Larry Wall
On Mon, Mar 28, 2005 at 11:53:07AM -0500, Chip Salzenberg wrote:
: According to Larry Wall:
:  On Fri, Mar 25, 2005 at 07:38:10PM -, Chip Salzenberg wrote:
:  : And might I also ask why in Perl 6 (if not Parrot) there seems to be
:  : no type support for strings with known encodings which are not subsets
:  : of Unicode?
:  
:  Well, because the main point of Unicode is that there *are* no encodings
:  that cannot be considered subsets of Unicode.
: 
: Certainly the Unicode standard makes such a claim about itself.  There
: are people who remain unpersuaded by Unicode's advertising.  I conclude
: that they will find Perl 6 somewhat disappointing.

If it turns out to be a Real Problem, we'll fix it.  Right now I think
it's a Fake Problem, and we have more important things to worry about.
Most of the carping about Unicode is with regard to CJK unifications
that can't be represented in any one existing character set anyway.
Unicode has at least done pretty well with the round-trip guarantee for
any single existing character set.  There are certainly localization
issues with regard to default input and output transformations, and
things like changing the default collation order from Unicodian to
SJISian or Big5ian or whatever.  But those are good things to make
explicit in any event, and that's what the language-dependent level
is for.  And people who are trying to write programs across language
boundaries are already basically screwed over by their national
character sets.  You can't even go back and forth between Japanese
and English without getting all fouled up between ¥ and \.  Unicode
distinguishes them, so it's a distinction that Perl 6 *always makes*.

That being said, there's no reason in the current design that a string
that is viewed at the language level as, say, French couldn't
actually be encoded in Morse code or some such.  It's *only* the
abstract semantics at the current Unicode level that are required to
be Unicode semantics by default.  And it's as lazy as we care to make
it--when you do s/foo/bar/ on a string, it's not required to convert
the string from any particular encoding to any other.  It only has to
have the same abstract result *as if* you'd translated it to Unicode
and then back to whatever the internal form is.  Even if you don't want
to emulate Unicode in the API, there are options.  For some problems
it'd be more efficient to translate lazily, and for others it's
more efficient to just translate everything once on input and once
on output.  (It also tends to be a little cleaner to isolate lossy
translations to one spot in the program.  By the round-trip nature
of Unicode, most of the lossy translations would be on output.)
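
For a concrete sense of that, here is how the "as if" rule plays out in
modern Raku, the language this design eventually became; a minimal
sketch assuming Latin-1 wire data, not anything that existed in 2005:

  # Data arrives as Latin-1 bytes; the edit runs with abstract Unicode
  # semantics; the (potentially lossy) translation happens once, on output.
  my $wire = Buf.new(0x63, 0x61, 0x66, 0xE9);   # "café" in Latin-1
  my $str  = $wire.decode('latin-1');           # abstract Unicode view
  $str .= subst('é', 'É');                      # Unicode-level edit
  say $str.encode('latin-1').list;              # (99 97 102 201)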

But anyway, a bit about my own psychology.  I grew up as a preacher's
kid in a fundamentalist setting, and I heard a lot of arguments of the
form, "I'm not offended by this, but I'm afraid someone else might be
offended, so you shouldn't do it."  I eventually learned to discount
such arguments to preserve my own sanity, so saying someone might
be disappointed is not quite sufficient to motivate me to action.
Plus there are a lot of people out there who are never happy unless
they have something to be unhappy about.  If I thought that I could
design a language that will never disappoint anyone, I'd be a lot
stupider than I already think I am, I think.

All that being said, you can do whatever you like with Parrot, and
if you give a decent enough API, someone will link it into Perl 6.  :-)

Larry


Re: String Theory

2005-03-26 Thread Chip Salzenberg
Would this be a good time to ask for explanation for C<str> being
never Unicode, while C<Str> is always Unicode, thus leading to an
inability to box a non-Unicode string?

And might I also ask why in Perl 6 (if not Parrot) there seems to be
no type support for strings with known encodings which are not subsets
of Unicode?

If the explanations are "you have greatly misunderstood the contents
of Synopsis $foo", I will happily retire to my reading room.
-- 
Chip Salzenberg- a.k.a. -[EMAIL PROTECTED]
 "What I cannot create, I do not understand." - Richard Feynman


Re: String Theory

2005-03-26 Thread Rod Adams
Chip Salzenberg wrote:
Would this be a good time to ask for explanation for C<str> being
never Unicode, while C<Str> is always Unicode, thus leading to an
inability to box a non-Unicode string?
 

That's not quite it. C<str> is a forced Unicode level of Bytes, with
encoding "raw", which happens to not have any Unicode semantics attached
to it.

And might I also ask why in Perl 6 (if not Parrot) there seems to be
no type support for strings with known encodings which are not subsets
of Unicode?
 

There are two different things to consider at the P6 level: Unicode 
level, and encoding. Level is one of Bytes, CodePoints, Graphemes, or 
Language Dependent Characters (aka LChars aka Chars). It's the way of 
determining what a character means. This can all get a bit confusing 
for people who only speak English, since our language happens to map 
nicely into all the levels at once, with no merging of multiple code 
points into a grapheme monkey business.

Encoding is how a particular string gets mapped into bits. I see P6 as 
needing to support all the common encodings (raw, ASCII, UTF\d+[be|le]?, 
UCS\d+) out of the box, but then allowing the user to add more as they 
see fit (EBCDIC, etc).

Level and Encoding can be mixed and matched independently, except for 
the combos that don't make any sense.
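
That independence is visible in released Raku, where level shows up as
different methods on C<Str> while encoding is an argument to C<encode>;
a small sketch of released behavior, not of this 2005 proposal:

  my $s = 'naïve';
  say $s.chars;                    # 5 graphemes, whatever the encoding
  say $s.encode('utf8').bytes;     # 6 bytes  (ï takes two in UTF-8)
  say $s.encode('utf16le').bytes;  # 10 bytes (two per code unit)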

-- Rod Adams



Re: String Theory

2005-03-26 Thread Larry Wall
On Fri, Mar 25, 2005 at 07:38:10PM -, Chip Salzenberg wrote:
: Would this be a good time to ask for explanation for C<str> being
: never Unicode, while C<Str> is always Unicode, thus leading to an
: inability to box a non-Unicode string?

As Rod said, C<str> is just a way of declaring a byte buffer, for which
characters, graphemes, codepoints, and bytes all mean the
same thing.  Conversion or coercion to more abstract types must be
specified explicitly.

: And might I also ask why in Perl 6 (if not Parrot) there seems to be
: no type support for strings with known encodings which are not subsets
: of Unicode?

Well, because the main point of Unicode is that there *are* no encodings
that cannot be considered subsets of Unicode.  Perl 6 considers
itself to have abstract Unicode semantics regardless of the underlying
representation of the data, which could be Latin-1 or Big5 or UTF-76.

That being said, abstract Unicode itself has varying levels of
abstraction, which is how we end up with .codes, .graphs, and .chars
in addition to .bytes.
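
Released Raku kept that layering, though the spelling drifted: C<.chars>
counts graphemes, C<.codes> counts codepoints, and C<.bytes> ended up on
the encoded buffer rather than on the string (C<.graphs> never shipped).
A sketch of released behavior:

  my $s = "x\x[0301]";           # "x" + COMBINING ACUTE ACCENT
  say $s.chars;                  # 1 grapheme
  say $s.codes;                  # 2 codepoints (no precomposed form)
  say $s.encode('utf8').bytes;   # 3 bytes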

Larry


String Theory

2005-03-19 Thread Rod Adams
I propose that we make a few decisions about strings in Perl. I've read
all the synopses, several list threads on the topic, and a few web
guides to Unicode. I've also thought a lot about how to cleanly define
all the string related functions that we expect Perl to have in the face
of all this expanded Unicode support.
What I've come up with is that we need a rule that says:
A single string value has a single encoding and a single Unicode Level
associated with it, and you can only talk to that value on its own
terms. These will be the properties encoding and level.
However, it should be easy to coerce that string into something that
behaves some other way.

To accomplish this, I'm hijacking the C<as> method away from the Perl 5
C<sprintf> (which can be named C<to>, and which I plan to do more with
at some later point), and making it a general purpose coercion method.
The general form of this will be something like:
  multi method as ($self : ?Class $to = $self.meta.name, *%options)
The purpose of C<as> is to create a view of the invocant in some other
form. Where possible, it will return an lvalue that allows one to alter
the original invocant as if it were a C<$to>.
This makes several things easy.
 my Str $x = 'Just Another Perl Hacker' but utf8;
 my @x := $x.as(Array of uint8);
 say "@x.pop() @x.pop()";
 say $x;
Generates:
 114 101
 Just Another Perl Hack
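
Released Raku never grew an lvalue C<as> view, but the read-only half of
this example can be approximated with C<encode>, which hands back a copy
of the bytes rather than a binding, so popping the array no longer
shortens the string:

  my Str $x = 'Just Another Perl Hacker';
  my @x = $x.encode('utf8').list;
  say "@x.pop() @x.pop()";   # 114 101
  say $x;                    # Just Another Perl Hacker, unchanged
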
To make things easier, I think we need new types qw/Grapheme CodePoint
LangChar/ that all C<does> Character (ick! someone come up with a
better name for this role), along with Byte. Character is a role,
not a class, so you can't go creating instances of it.
But we could write:
 my Str $x = 'Just Another Perl Hacker';
 my @x := $x.as(Array of Character);
And then C<@x.pop()> returns whichever of
Grapheme/CodePoint/LangChar/Byte that $x thought of itself in terms of.
In other words, it's C<chop>.
Since by default, C<as> assumes the invocant type, we can convert from
one string encoding/level to another with:
 $str.as(encoding => 'utf8', level => 'graph');
But we'll make it where C<*%options> handles known encodings and levels
as boolean named parameters as well, so
 $str.as:utf8:graph;
does the same thing: makes another Str with the same contents as $str,
only with utf8 encoding and grapheme character semantics.
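
No such C<as> machinery exists in released Raku, but a rough user-space
stand-in is easy to sketch; C<as-view> is a hypothetical name, and the
levels map onto what shipped as C<.encode>, C<.ords>, and C<.comb>:

  sub as-view(Str $s, Str :$encoding = 'utf8', Str :$level = 'graph') {
      given $level {
          when 'bytes' { $s.encode($encoding).list }  # encoding matters here
          when 'code'  { $s.ords.List }               # codepoints
          when 'graph' { $s.comb.List }               # graphemes
          default      { die "unknown level '$level'" }
      }
  }
  say as-view('naïve', :level<bytes>).elems;   # 6
  say as-view('naïve', :level<graph>).elems;   # 5
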
What does all this buy us? Well... for one thing it all disappears if
you want the default semantics of what you're working with.
Second, it makes it where a position within a string can be thought of
as a single integer again. What that integer means is subject to the
C<level> of the string you're operating with.
We could probably even resurrect C<length> if we wanted to, making it
where people who don't care about Unicode don't have to care. Those who
do care exactly which length they are getting can say
C<length $str.as:graph>.
To the user, almost the entire string function library winds up looking
like it did in Perl 5.
Some side points:
It is an error to do things like C<index> with strings of different
levels, but not different encodings.
level and encoding should default to whatever the source code was
written in, if known.
C<pack> and C<unpack> should be able to be replaced with C<as> views of
compact structs (see S09).
C<as> kills C<vec>. Or at least buries it very deeply, without oxygen.
Comments?
-- Rod Adams


Re: String Theory

2005-03-19 Thread Rod Adams
It's been pointed out to me that A12 mentions:
Coercions to other classes can also be defined:
   multi sub *coerce:as (Us $us, Them ::to) { to.transmogrify($us) }
Such coercions allow both explicit conversion:
   $them = $us as Them;
as well as implicit conversions:
   my Them $them = $us;
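
For what it's worth, released Raku eventually spelled this idea as the
6.d coercion protocol: a C<COERCE> candidate on the target class plus a
coercion type in the signature. A self-contained sketch reusing the
Us/Them placeholders from the excerpt (the implicit conversion on plain
C<my Them $them = $us;> did not survive; you need the C<Them()> type):

  class Us {
      has $.value;
      method transmogrify { $!value.uc }
  }
  class Them {
      has $.payload;
      multi method COERCE(Us:D $us) { self.new(payload => $us.transmogrify) }
  }
  sub demo(Them() $t) { say $t.payload }   # Them() accepts anything coercible
  demo(Us.new(value => 'just another perl hacker'));  # JUST ANOTHER PERL HACKER
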
I read S12 in detail (actually all the S's) before posting. Neither S12
nor S13 mentions C<coerce:as>, so I missed the A12 mention of it in my
prep work.

Reading it now, my C<as> is a bit different, since I'm allowing options
for defining the encoding and Unicode level. There may be other options 
that make sense in some contexts. Of course one could view the different 
encodings and levels as subclasses of Str, which I considered at some 
point, but it felt like it was going to get rather cumbersome given the 
cross product effect of the two properties.

Also, it is unclear if C<coerce:as> returns an lvalue or not, which my
C<.as> does.

There's likely room for unification of the two ideas.
-- Rod Adams


Re: String Theory

2005-03-19 Thread Larry Wall
On Sat, Mar 19, 2005 at 05:07:49PM -0600, Rod Adams wrote:
: I propose that we make a few decisions about strings in Perl. I've read
: all the synopses, several list threads on the topic, and a few web
: guides to Unicode. I've also thought a lot about how to cleanly define
: all the string related functions that we expect Perl to have in the face
: of all this expanded Unicode support.
: 
: What I've come up with is that we need a rule that says:
: 
: A single string value has a single encoding and a single Unicode Level
: associated with it, and you can only talk to that value on its own
: terms. These will be the properties encoding and level.

You've more or less described the semantics available at the
"use bytes" level, which basically comes down to a pure OO approach where
the user has to be aware of all the types (to the extent that OO
doesn't hide that).  It's one approach to polymorphism, but I think
it shortchanges the natural polymorphism of Unicode, and the approach
of Perl to such natural polymorphisms as evident in autoconversion
between numbers and strings.  That being said, I don't think your
view is so far off my view.  More on that below.

: However, it should be easy to coerce that string into something that
: behaves some other way.

The question is, how easy?  You're proposing a mechanism that,
frankly, looks rather intrusive and makes my eyes glaze over as a
representative of the Pooh clan.  I think the typical user would rather
have at least the option of automatic coercion in a lexical scope.

But let me back up a bit.  What I want to do is to just widen your
definition of a string type slightly.  I see your current view as a
sort of degenerate case of my view.  Instead of viewing a string as
having an exact Unicode level, I prefer to think of it as having a
natural maximum and minimum level when it's born, depending on the
type of data it's trying to represent.  A memory buffer naturally has
a minimum and maximum Unicode level of bytes.  A typical Unicode
string encoded in, say, UTF-8, has a minimum Unicode level of bytes,
and maximum of chars (I'm using that to represent language-dependent
graphemes here.)  A Unicode string revealed by an abstract interface
might not allow any bytes-level view, but use codepoints for the
natural minimum, or even graphemes, but still allow any view up
to chars, as long as it doesn't go below codepoints.

A given lexical scope chooses a default Unicode view, which can be
naturally mapped for any data types that allow that view.  The question
is what to do outside of that range.  (Inside that range, I suspect
we can arrange to find a version of index($str,$targ) that works
even if $str and $targ aren't the same exact type, preferably one
that works at the current Unicode level.  I think the typical user
would prefer that we find such a function for him without him having
to play with coercions.)

If the current lexical view is outside the range allowed by the
current string, I think the default behavior is different looking up than
down.  If I'm working at the chars level, then everything looks like
chars, even if it's something smaller.  To take an extreme case,
suppose I do a chop on a string that allows the byte view as the
highest level, that is, a byte buffer.  I always get the last byte
of the string, even if the data could conceivably be interpreted as
some other encoding.  For that string, the bytes *are* the characters.
They're also the codepoints, and the graphemes.  Likewise, a string
that is max codepoints will behave like a codepoint buffer even under
higher levels.  This seems very dwimmy to me.

Going the other way, if a lower level tries to access a string whose
minimum is a higher level, it's just illegal.  In a bytes lexical context,
it will force you to be more specific about what you mean if you want
to do an operation on a string that requires a higher level of abstraction.
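
That asymmetry (views above the maximum clamp down, views below the
minimum die) is easy to state in code. A purely hypothetical sketch,
with invented names, of the bookkeeping such a design implies; nothing
like this class exists in released Raku:

  enum Level <Bytes Codes Graphs Chars>;
  class RangedStr {
      has Level $.min is required;
      has Level $.max is required;
      method view-at(Level $want --> Level) {
          die 'no view below the natural minimum'
              if $want.value < $!min.value;
          $want.value > $!max.value ?? $!max !! $want;  # clamp upward views
      }
  }
  my $buffer = RangedStr.new(min => Bytes, max => Bytes);
  say $buffer.view-at(Chars);   # Bytes: for a byte buffer, bytes are the chars
  my $opaque = RangedStr.new(min => Codes, max => Chars);
  # $opaque.view-at(Bytes) would die: the interface reveals no byte level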

As a limiting case, if you force all your incoming strings to be
minimum == maximum, and write your code at the bytes level, this
degenerates to your proposed semantics, more or less.  I don't doubt
that many folks would prefer to program at this explicit level where
all the polymorphism is supplied by the objects, but I also think a
lot of folks would prefer to think at the graphemes or chars level
by default.  It's the natural human way of chunking text.

I know this view of string polymorphism makes a bit more work for us,
but it's one of the basic Perl ideals to try to do a lot of vicarious
work in advance on behalf of the user.  That was Perl's added value
over other languages when it started out, both on the level of mad
configuration and on the level of automatic str/num/int polymorphism.
I think Perl 6 can do this on the level of Str polymorphism.

When it comes to Unicode, most other OO languages are falling into
the Lisp trap of expecting the user to think like the computer rather
than the computer like the user.  That's one of the few ideas from
Lisp I'm 

Re: String Theory

2005-03-19 Thread Rod Adams
Larry Wall wrote:
You've more or less described the semantics available at the
"use bytes" level, which basically comes down to a pure OO approach where
the user has to be aware of all the types (to the extent that OO
doesn't hide that).  It's one approach to polymorphism, but I think
it shortchanges the natural polymorphism of Unicode, and the approach
of Perl to such natural polymorphisms as evident in autoconversion
between numbers and strings.  That being said, I don't think your
view is so far off my view.  More on that below.
[ rest of post snipped, not because it isn't relevant, but because it's long 
and my responses don't match any single part of it. -- RHA ]
What I see here is a need to define what it means to coerce a string 
from one level to another.

First let me lay down my understanding of the different levels. I am 
towards the novice end of the Unicode skill level, so it'll be pretty basic.

At the byte level, all you have is 8 bits, which may have some meaning
as text if you treat them like ASCII.

You can take one or more bytes at a time, lump them together in a 
predefined way, and generate a Code Point, which is an index into the 
Unicode table of characters.

However, Unicode has a problem with what it assigns code points to, so
you put one or more code points together to form a proper character, or
grapheme.

But Unicode has another problem, where certain graphemes mean very 
different things depending on what language you happen to be in. (Mostly 
a CJK issue, from what I've read.) So we add a language dependent level, 
which is basically graphemes with an implied language.

Even if I got parts of that wrong (very possible), the main point is
that in general, a higher level takes one *or more* units of the level
below it to construct a unit at its level.

So now, there's the question of what it means to move something from one 
level to another.

We'll start with moving up to a higher level. I'll use the example of 
moving from Code Points (cpts) to Graphemes (grfs), but the talk should 
translate to other conversions.

There are two approaches I see to this:
1) Convert every cpt into an exactly equivalent grf. The lengths of the
strings are equal.
2) Scan through the string, grouping cpts into associated grfs where
possible. The resulting string length is less than or equal to that of
the input. In short, attempt to keep the same semantic meaning of the word.

I see both methods as being useful in certain contexts, but #2 is likely 
what people want more often, and is what I have in mind.

Going down the chain, you run the risk of losing information
with method #1.
However, using #2, you simply expand the relevant grfs into the 
associated group of cpts.
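
In released Raku, approach #2 corresponds, at least at the codepoint
level, to Unicode normalization: NFC groups codepoints going up, NFD
expands them going back down, and neither direction loses information.
(Graphemes can still span several NFC codepoints, so this is only the
codepoint-level analogue.)

  my $s = "e\x[0301]";   # "e" + COMBINING ACUTE ACCENT
  say $s.NFC.list;       # (233)      composed into one codepoint, é
  say $s.NFD.list;       # (101 769)  expanded back into two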

My general approach of how to convert a string from one level to another 
is to pick an encoding both levels understand, generate a bitstring from 
the old level, and then have the new level parse that bitstring into 
its level. If the start and goal don't allow this, throw an error.

I'm not certain how your views relate to all this, but I was left
with the impression that you were talking about conversions of type #1,
in which case it would make sense to outlaw downward conversions, since
it's possible the grf won't fit into a cpt.

It would also make sense that you have an "allowable levels" parameter
in such a scheme, so you know not to store a grf that can't also be a
cpt, or at least to track that after one does it, they can't go back to cpts.

Taking a step back, perhaps I didn't make it clear (or even mention) 
that my coercions were DWIMish in nature, not pure bit level unions. I 
covered String to String coercions above. For String -> Array, what
happens depends on the type of the array. For String -> Array of
Characters (back to my role), each element of the array corresponds to a
single unit of whatever the string thought a character was. However,
String -> Array of u?int\d+ would do bit level operations, and the
encoding scheme would matter greatly in this case.

We/I will have to come up with a table of what these DWIMish operations 
are, and how a user could define a new one. That likely will be an 
extension of how you decide tie should happen in Perl 6.

I also see nothing wrong with most operations between strings of two
levels autocoercing one string to the higher level of the other. Things
like C<cmp>, C<~>, and many others should be fine in this regard, as
long as they default to coercing up. I singled C<index> out, because it
deals with two strings *and* it deals with positions within those
strings, and what a given integer position means can vary greatly with
level. But even there I suppose that we could force the target's level
onto the term, and make all positions relative to the target and its
level.

As for the exact syntax of the coercion, I'm open to suggestions.
-- Rod Adams