Re: String Theory
On Mon, Mar 28, 2005 at 11:53:07AM -0500, Chip Salzenberg wrote:
: According to Larry Wall:
: : On Fri, Mar 25, 2005 at 07:38:10PM -, Chip Salzenberg wrote:
: : : And might I also ask why in Perl 6 (if not Parrot) there seems to be
: : : no type support for strings with known encodings which are not subsets
: : : of Unicode?
: :
: : Well, because the main point of Unicode is that there *are* no encodings
: : that cannot be considered subsets of Unicode.
:
: Certainly the Unicode standard makes such a claim about itself. There
: are people who remain unpersuaded by Unicode's advertising. I conclude
: that they will find Perl 6 somewhat disappointing.

If it turns out to be a Real Problem, we'll fix it. Right now I think
it's a Fake Problem, and we have more important things to worry about.
Most of the carping about Unicode is with regard to CJK unifications
that can't be represented in any one existing character set anyway.
Unicode has at least done pretty well with the round-trip guarantee
for any single existing character set.

There are certainly localization issues with regard to default input
and output transformations, and things like changing the default
collation order from Unicodian to SJISian or Big5ian or whatever. But
those are good things to make explicit in any event, and that's what
the language-dependent level is for. And people who are trying to
write programs across language boundaries are already basically
screwed over by their national character sets. You can't even go back
and forth between Japanese and English without getting all fouled up
between ¥ and \. Unicode distinguishes them, so it's a distinction
that Perl 6 *always makes*.

That being said, there's no reason in the current design that a string
that is viewed on the language level as, say, French couldn't actually
be encoded in Morse code or some such. It's *only* the abstract
semantics at the current Unicode level that are required to be Unicode
semantics by default.
And it's as lazy as we care to make it--when you do s/foo/bar/ on a
string, it's not required to convert the string from any particular
encoding to any other. It only has to have the same abstract result
*as if* you'd translated it to Unicode and then back to whatever the
internal form is. Even if you don't want to emulate Unicode in the
API, there are options. For some problems it'd be more efficient to
translate lazily, and for others it's more efficient to just translate
everything once on input and once on output. (It also tends to be a
little cleaner to isolate lossy translations to one spot in the
program. By the round-trip nature of Unicode, most of the lossy
translations would be on output.)

But anyway, a bit about my own psychology. I grew up as a preacher's
kid in a fundamentalist setting, and I heard a lot of arguments of the
form, "I'm not offended by this, but I'm afraid someone else might be
offended, so you shouldn't do it." I eventually learned to discount
such arguments to preserve my own sanity, so saying someone might be
disappointed is not quite sufficient to motivate me to action. Plus
there are a lot of people out there who are never happy unless they
have something to be unhappy about. If I thought that I could design a
language that will never disappoint anyone, I'd be a lot stupider than
I already think I am, I think.

All that being said, you can do whatever you like with Parrot, and if
you give a decent enough API, someone will link it into Perl 6. :-)

Larry
Re: String Theory
Would this be a good time to ask for explanation for C<str> being
never Unicode, while C<Str> is always Unicode, thus leading to an
inability to box a non-Unicode string?

And might I also ask why in Perl 6 (if not Parrot) there seems to be
no type support for strings with known encodings which are not subsets
of Unicode?

If the explanations are "you have greatly misunderstood the contents
of Synopsis $foo", I will happily retire to my reading room.
-- 
Chip Salzenberg - a.k.a. - [EMAIL PROTECTED]
"What I cannot create, I do not understand." - Richard Feynman
Re: String Theory
Chip Salzenberg wrote:

> Would this be a good time to ask for explanation for C<str> being
> never Unicode, while C<Str> is always Unicode, thus leading to an
> inability to box a non-Unicode string?

That's not quite it. C<str> is a forced Unicode level of Bytes, with
encoding "raw", which happens to not have any Unicode semantics
attached to it.

> And might I also ask why in Perl 6 (if not Parrot) there seems to be
> no type support for strings with known encodings which are not
> subsets of Unicode?

There are two different things to consider at the P6 level: Unicode
level, and encoding.

Level is one of Bytes, CodePoints, Graphemes, or Language Dependent
Characters (aka LChars aka Chars). It's the way of determining what a
character means. This can all get a bit confusing for people who only
speak English, since our language happens to map nicely into all the
levels at once, with none of the merging-multiple-code-points-into-a-
grapheme monkey business.

Encoding is how a particular string gets mapped into bits. I see P6 as
needing to support all the common encodings (raw, ASCII,
UTF\d+[be|le]?, UCS\d+) out of the box, but then allowing the user to
add more as they see fit (EBCDIC, etc).

Level and Encoding can be mixed and matched independently, except for
the combos that don't make any sense.

-- 
Rod Adams
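[The level distinction is easiest to see by counting the same text
three ways. Here is a small illustration in Python, standing in for
the Perl 6 semantics under discussion; note the NFC shortcut for the
grapheme count only works because this particular base+mark pair
happens to compose to a single code point:

```python
# One user-perceived character, "é", spelled as a base letter plus a
# combining acute accent, counted at three Unicode levels.
import unicodedata

s = "e\u0301"                       # 'e' + COMBINING ACUTE ACCENT

n_bytes = len(s.encode("utf-8"))    # Bytes level: UTF-8 code units
n_codes = len(s)                    # CodePoints level
# Graphemes level: NFC folds this particular pair into one code point,
# so the composed length stands in for a real grapheme count here.
n_graphs = len(unicodedata.normalize("NFC", s))

print(n_bytes, n_codes, n_graphs)   # 3 2 1
```

Plain ASCII English counts identically at every level, which is
exactly the "maps nicely into all the levels at once" point above.]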
Re: String Theory
On Fri, Mar 25, 2005 at 07:38:10PM -, Chip Salzenberg wrote:
: Would this be a good time to ask for explanation for C<str> being
: never Unicode, while C<Str> is always Unicode, thus leading to an
: inability to box a non-Unicode string?

As Rod said, str is just a way of declaring a byte buffer, for which
characters, graphemes, codepoints, and bytes all mean the same thing.
Conversion or coercion to more abstract types must be specified
explicitly.

: And might I also ask why in Perl 6 (if not Parrot) there seems to be
: no type support for strings with known encodings which are not subsets
: of Unicode?

Well, because the main point of Unicode is that there *are* no
encodings that cannot be considered subsets of Unicode. Perl 6
considers itself to have abstract Unicode semantics regardless of the
underlying representation of the data, which could be Latin-1 or Big5
or UTF-76. That being said, abstract Unicode itself has varying levels
of abstraction, which is how we end up with .codes, .graphs, and
.chars in addition to .bytes.

Larry
String Theory
I propose that we make a few decisions about strings in Perl. I've
read all the synopses, several list threads on the topic, and a few
web guides to Unicode. I've also thought a lot about how to cleanly
define all the string related functions that we expect Perl to have in
the face of all this expanded Unicode support.

What I've come up with is that we need a rule that says:

    A single string value has a single encoding and a single Unicode
    Level associated with it, and you can only talk to that value on
    its own terms. These will be the properties C<encoding> and
    C<level>.

However, it should be easy to coerce that string into something that
behaves some other way. To accomplish this, I'm hijacking the C<as>
method away from the Perl 5 C<sprintf> (which can be named C<to>, and
which I plan to do more with at some later point), and making it a
general purpose coercion method. The general form of this will be
something like:

    multi method as ($self: ?Class $to = $self.meta.name, *%options)

The purpose of C<as> is to create a view of the invocant in some other
form. Where possible, it will return an lvalue that allows one to
alter the original invocant as if it were a C<$to>. This makes several
things easy.

    my Str $x = 'Just Another Perl Hacker' but utf8;
    my @x := $x.as(Array of uint8);
    say @x.pop() ~ ' ' ~ @x.pop();
    say $x;

Generates:

    114 101
    Just Another Perl Hack

To make things easier, I think we need new types qw/Grapheme CodePoint
LangChar/ that all C<does> Character (ick! someone come up with a
better name for this role), along with Byte. Character is a role, not
a class, so you can't go creating instances of it. But we could write:

    my Str $x = 'Just Another Perl Hacker';
    my @x := $x.as(Array of Character);

And then C<@x.pop()> returns whichever of
Grapheme/CodePoint/LangChar/Byte that $x thought of itself in terms
of. In other words, it's C<chop>.
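[The bound byte view above behaves much like a mutable byte buffer in
other languages. A rough Python analogue follows; Python strings are
immutable, so the string value is modeled as a bytearray. This only
illustrates the view-with-write-back idea, not the proposed Perl 6
API:

```python
# View the ASCII string data as mutable bytes; popping removes the
# trailing bytes from the underlying buffer, mirroring how the Perl 6
# example pops 'r' (114) and 'e' (101) off "...Hacker".
buf = bytearray("Just Another Perl Hacker", "ascii")

print(buf.pop(), buf.pop())   # 114 101
print(buf.decode("ascii"))    # Just Another Perl Hack
```
]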
Since by default C<as> assumes the invocant type, we can convert from
one string encoding/level to another with:

    $str.as(encoding => 'utf8', level => 'graph');

But we'll make it where C<*%options> handles known encodings and
levels as boolean named parameters as well, so

    $str.as:utf8:graph;

does the same thing: makes another Str with the same contents as $str,
only with utf8 encoding and grapheme character semantics.

What does all this buy us? Well... for one thing, it all disappears if
you want the default semantics of what you're working with. Second, it
makes it where a position within a string can be thought of as a
single integer again. What that integer means is subject to the
C<level> of the string you're operating with. We could probably even
resurrect C<length> if we wanted to, making it where people who don't
care about Unicode don't have to care. Those who do care exactly which
length they are getting can say C<length $str.as:graph>. To the user,
almost the entire string function library winds up looking like it did
in Perl 5.

Some side points:

It is an error to do things like C<index> with strings of different
levels, but not different encodings.

C<level> and C<encoding> should default to whatever the source code
was written in, if known.

C<pack> and C<unpack> should be able to be replaced with C<as> views
of compact structs (see S09).

C<as> kills C<vec>. Or at least buries it very deeply, without oxygen.

Comments?

-- 
Rod Adams
Re: String Theory
It's been pointed out to me that A12 mentions:

    Coercions to other classes can also be defined:

        multi sub *coerce:as (Us $us, Them ::to) { to.transmogrify($us) }

    Such coercions allow both explicit conversion:

        $them = $us as Them;

    as well as implicit conversions:

        my Them $them = $us;

I read S12 in detail (actually all the S's) before posting. Neither
S12 nor S13 mentions C<coerce:as>, so I missed the A12 mention of it
in my prep work.

Reading it now, my C<as> is a bit different, since I'm allowing
options for defining the encoding and Unicode level. There may be
other options that make sense in some contexts. Of course one could
view the different encodings and levels as subclasses of Str, which I
considered at some point, but it felt like it was going to get rather
cumbersome given the cross product effect of the two properties.

Also, it is unclear if C<coerce:as> returns an lvalue or not, which my
C<.as> does.

There's likely room for unification of the two ideas.

-- 
Rod Adams
Re: String Theory
On Sat, Mar 19, 2005 at 05:07:49PM -0600, Rod Adams wrote:
: I propose that we make a few decisions about strings in Perl. I've read
: all the synopses, several list threads on the topic, and a few web
: guides to Unicode. I've also thought a lot about how to cleanly define
: all the string related functions that we expect Perl to have in the face
: of all this expanded Unicode support.
:
: What I've come up with is that we need a rule that says:
:
: A single string value has a single encoding and a single Unicode Level
: associated with it, and you can only talk to that value on its own
: terms. These will be the properties encoding and level.

You've more or less described the semantics available at the use bytes
level, which basically comes down to a pure OO approach where the user
has to be aware of all the types (to the extent that OO doesn't hide
that). It's one approach to polymorphism, but I think it shortchanges
the natural polymorphism of Unicode, and the approach of Perl to such
natural polymorphisms as evident in autoconversion between numbers and
strings. That being said, I don't think your view is so far off my
view. More on that below.

: However, it should be easy to coerce that string into something that
: behaves some other way.

The question is, how easy? You're proposing a mechanism that, frankly,
looks rather intrusive and makes my eyes glaze over as a
representative of the Pooh clan. I think the typical user would rather
have at least the option of automatic coercion in a lexical scope.

But let me back up a bit. What I want to do is to just widen your
definition of a string type slightly. I see your current view as a
sort of degenerate case of my view. Instead of viewing a string as
having an exact Unicode level, I prefer to think of it as having a
natural maximum and minimum level when it's born, depending on the
type of data it's trying to represent. A memory buffer naturally has a
minimum and maximum Unicode level of bytes.
A typical Unicode string encoded in, say, UTF-8, has a minimum Unicode
level of bytes, and a maximum of chars (I'm using that to represent
language-dependent graphemes here). A Unicode string revealed by an
abstract interface might not allow any bytes-level view, but use
codepoints for the natural minimum, or even graphemes, but still allow
any view up to chars, as long as it doesn't go below codepoints.

A given lexical scope chooses a default Unicode view, which can be
naturally mapped for any data types that allow that view. The question
is what to do outside of that range. (Inside that range, I suspect we
can arrange to find a version of index($str,$targ) that works even if
$str and $targ aren't the same exact type, preferably one that works
at the current Unicode level. I think the typical user would prefer
that we find such a function for him without him having to play with
coercions.)

If the current lexical view is outside the range allowed by the
current string, I think the default behavior is different looking up
than down. If I'm working at the chars level, then everything looks
like chars, even if it's something smaller. To take an extreme case,
suppose I do a chop on a string that allows the byte view as the
highest level, that is, a byte buffer. I always get the last byte of
the string, even if the data could conceivably be interpreted as some
other encoding. For that string, the bytes *are* the characters.
They're also the codepoints, and the graphemes. Likewise, a string
that is max codepoints will behave like a codepoint buffer even under
higher levels. This seems very dwimmy to me.

Going the other way, if a lower level tries to access a string that is
minimum a higher level, it's just illegal. In a bytes lexical context,
it will force you to be more specific about what you mean if you want
to do an operation on a string that requires a higher level of
abstraction.
As a limiting case, if you force all your incoming strings to be
minimum == maximum, and write your code at the bytes level, this
degenerates to your proposed semantics, more or less. I don't doubt
that many folks would prefer to program at this explicit level where
all the polymorphism is supplied by the objects, but I also think a
lot of folks would prefer to think at the graphemes or chars level by
default. It's the natural human way of chunking text.

I know this view of string polymorphism makes a bit more work for us,
but it's one of the basic Perl ideals to try to do a lot of vicarious
work in advance on behalf of the user. That was Perl's added value
over other languages when it started out, both on the level of mad
configuration and on the level of automatic str/num/int polymorphism.
I think Perl 6 can do this on the level of Str polymorphism. When it
comes to Unicode, most other OO languages are falling into the Lisp
trap of expecting the user to think like the computer rather than the
computer like the user. That's one of the few ideas from Lisp I'm
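[The up-versus-down asymmetry Larry describes can be sketched as a toy
model. This is plain Python with hypothetical names, not any proposed
Perl 6 interface: a view requested above the string's natural maximum
is served at that maximum (for a byte buffer, the bytes *are* the
characters), while a view below the minimum is refused.

```python
# Toy model of string values carrying a natural min/max Unicode level.
LEVELS = ["bytes", "codepoints", "graphemes", "chars"]

class Stringish:
    def __init__(self, data, lo, hi):
        self.data = data
        self.lo = LEVELS.index(lo)
        self.hi = LEVELS.index(hi)

    def view(self, level):
        want = LEVELS.index(level)
        if want > self.hi:
            want = self.hi          # looking up: clamp to the maximum
        if want < self.lo:
            # looking down below the natural minimum is illegal
            raise TypeError(f"no {level} view of this string")
        return LEVELS[want]

byte_buf = Stringish(b"abc", "bytes", "bytes")
print(byte_buf.view("chars"))       # bytes  (chop yields the last byte)

abstract = Stringish("abc", "codepoints", "chars")
print(abstract.view("graphemes"))   # graphemes
try:
    abstract.view("bytes")
except TypeError as e:
    print(e)                        # no bytes view of this string
```
]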
Re: String Theory
Larry Wall wrote:

> You've more or less described the semantics available at the use
> bytes level, which basically comes down to a pure OO approach where
> the user has to be aware of all the types (to the extent that OO
> doesn't hide that). It's one approach to polymorphism, but I think
> it shortchanges the natural polymorphism of Unicode, and the
> approach of Perl to such natural polymorphisms as evident in
> autoconversion between numbers and strings. That being said, I don't
> think your view is so far off my view. More on that below.

[ rest of post snipped, not because it isn't relevant, but because
it's long and my responses don't match any single part of it. -- RHA ]

What I see here is a need to define what it means to coerce a string
from one level to another. First let me lay down my understanding of
the different levels. I am towards the novice end of the Unicode skill
level, so it'll be pretty basic.

At the byte level, all you have is 8 bits, which may have some meaning
as text if you treat them like ASCII.

You can take one or more bytes at a time, lump them together in a
predefined way, and generate a Code Point, which is an index into the
Unicode table of characters.

However, Unicode has a problem with what it assigns code points to, so
you put one or more code points together to form a proper character,
or grapheme.

But Unicode has another problem, where certain graphemes mean very
different things depending on what language you happen to be in.
(Mostly a CJK issue, from what I've read.) So we add a language
dependent level, which is basically graphemes with an implied
language.

Even if I got parts of that wrong (very possible), the main point is
that in general, a higher level takes one _or_more_ units of the level
below it to construct a unit at its level.

So now, there's the question of what it means to move something from
one level to another. We'll start with moving up to a higher level.
I'll use the example of moving from Code Points (cpts) to Graphemes
(grfs), but the talk should translate to other conversions. There are
two approaches I see to this:

1) Convert every cpt into an exactly equivalent grf. The lengths of
the strings are equal.

2) Scan through the string, grouping cpts into associated grfs as
possible. The resulting string length is less than or equal to the
input. In short, attempt to keep the same semantic meaning of the
word.

I see both methods as being useful in certain contexts, but #2 is
likely what people want more often, and is what I have in mind.

Going down the chain, you stand the possibility of losing information
in method #1. However, using #2, you simply expand the relevant grfs
into the associated group of cpts.

My general approach of how to convert a string from one level to
another is to pick an encoding both levels understand, generate a
bitstring from the old level, and then have the new level parse that
bitstring into its level. If the start and goal don't allow this,
throw an error.

I'm not certain how your views relate to all this, but I was left with
the impression that you were talking about conversions of type #1,
which would make sense to outlaw downward conversions, since it's
possible the grf won't fit into a cpt. It would also make sense that
you have an allowable levels parameter in such a scheme, so you know
not to store a grf that can't also be a cpt, or at least to track that
after one does it, they can't go back to cpts.

Taking a step back, perhaps I didn't make it clear (or even mention)
that my coercions were DWIMish in nature, not pure bit level unions. I
covered String to String coercions above. For String -> Array, what
happens depends on the type of the array. For String -> Array of
Characters (back to my role), each element of the array corresponds to
a single unit of what the string thought a character was.
However, String -> Array of u?int\d+ would do bit level operations,
and the encoding scheme would matter greatly in this case. We/I will
have to come up with a table of what these DWIMish operations are, and
how a user could define a new one. That likely will be an extension of
how you decide tie should happen in Perl 6.

I also see nothing wrong with most operations between strings of two
levels autocoercing one string to the higher level of the other.
Things like C<cmp>, C<~>, and many others should be fine in this
regard, as long as they default to coercing up. I singled C<index>
out because it deals with two strings *and* it deals with positions
within those strings, and what a given integer position means can vary
greatly with level. But even there I suppose that we could force the
target's level onto the term, and make all positions relative to the
target and its level.

As for the exact syntax of the coercion, I'm open to suggestions.

-- 
Rod Adams
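[Conversion #2 above, grouping code points into graphemes, can be
sketched quickly. Here is a toy Python version that only handles
combining marks, a deliberate simplification of full grapheme
clustering as specified in UAX #29:

```python
# Group a code-point sequence into grapheme-like clusters by attaching
# each combining mark to the preceding base character.
import unicodedata

def clusters(s):
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch           # mark joins its base's cluster
        else:
            out.append(ch)          # a new base starts a new cluster
    return out

s = "cafe\u0301"                    # 5 code points, 4 graphemes
print(len(s), len(clusters(s)))     # 5 4
```

The result's length is less than or equal to the input, as described,
and going back down the chain is just concatenating each cluster's
code points in order.]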