Question about list context for String.chars
Hi all,

I'm writing a bunch of examples for Perl 6 PLEAC, and it seems rather natural to expect $string.chars to return a list of Unicode chars in list context; however, I can't find anything to confirm that. (The other alternatives being split and unpack.)

    # unpack
    @array = unpack("C*", $string);
    # split
    @array = split /./, $string;
    # this too?
    @array = $string.split(/./);
    # and how about this?
    @array = $string.chars;
    # and this explicit list context?
    @array = $string.chars[];

Thanks,
Marcus
Re: Question about list context for String.chars
Hi,

gcomnz wrote:
> I'm writing a bunch of examples for perl 6 pleac and it seems rather
> natural to expect $string.chars to return a list of unicode chars in
> list context, however I can't find anything to confirm that. (The other
> alternatives being split and unpack.)

I like that. If one wanted the *number* of chars/graphemes/whatever, one could still use the cheap unary + operator. And .keys, .values, .pairs, etc. don't return a plain number either, but actual contents (consistency!).

--Ingo
Whither use English?
I'm working on docs/S28draft.pod in the Pugs project. While consulting Perl 5's perlvar.pod, the issue of C<use English;> comes up. AFAICT from various sources, little has been said about this.

NOTE: http://groups-beta.google.com/group/perl.perl6.language/msg/fa241233bcfba024:
> we've already been through the whole C<use English;> thing and how no
> one uses it

What's the word? Will there be something like C<use English;>?

Regards to all,
David
Re: Whither use English?
David Vergin skribis 2005-04-11 9:44 (-0700):
> What's the word. Will there be something like use English?

Yes, and it's the default :)

Juerd
--
http://convolution.nl/maak_juerd_blij.html
http://convolution.nl/make_juerd_happy.html
http://convolution.nl/gajigu_juerd_n.html
Re: Whither use English?
On Mon, 2005-04-11 at 14:31, Juerd wrote:
> David Vergin skribis 2005-04-11 9:44 (-0700):
> > What's the word. Will there be something like use English?
>
> Yes, and it's the default :)

Yes, but it will be spelled:

    use $*LANG

;-)

Seriously, is there some reason that we would not provide a Language::Russian and Language::Nihongo? Given Perl 6, it would even be quite valid for those modules to add aliases for all of the core functions and keywords, not just global variables.

--
Aaron Sherman
Senior Systems Engineer and Toolsmith
"It's the sound of a satellite saying, 'get me down!'" -Shriekback
Re: Whither use English?
Aaron Sherman skribis 2005-04-11 14:49 (-0400):
> Yes, but it will be spelled: use $*LANG ;-)
> Seriously, is there some reason that we would not provide a
> Language::Russian and Language::Nihongo? Given Perl 6, it would even be
> quite valid for those modules to add aliases for all of the core
> functions and keywords, not just global variables.

Because providing it leads to its use, and when it gets used, knowing English is no longer enough.

I have some code that uses Dutch variable names. When I show that code to people who can't read any Dutch, they have a hard time finding out what it does and how it works. If even builtin functions become unfamiliar, figuring this out becomes impossible instead of merely hard, unless you learn the language the code is written in.

English sucks in many interesting ways, but at least it's a de facto standard and documentation will be available in it.

I'm not even sure I like the *possibility* of using non-ASCII letters in identifiers.

As a 12-year-old, I used several BASIC dialects. One time I found a Dutch BASIC. It had TOON instead of PRINT, and INVOER instead of INPUT. Even though these words were in my own language, I found using them hard just because I was used to something entirely different. You could say it only takes some getting used to, but it's easier to get used to one language than to every language a grammar exists for.

And even though I knew when I wrote it that it was a mistake, I used Esperanto identifiers in Lingua::EO::Supersignoj. You can't imagine how often I've typed "new" instead of "nova" since I released that. The next version is going to have English as the primary language, even though I love Esperanto.

I do think translating *documentation* is a very good idea. But please let that be an official project, with lots and lots of committers, because every one-man translation operation eventually dies.

Juerd
Re: Whither use English?
On 2005-04-11 15:00, Juerd wrote:
> I'm not even sure I like the *possibility* of using non-ascii letters in
> identifiers, even.

I agree that it would be a nightmare if project A used "presu" instead of "print" everywhere, while project B used "toon", etc. But non-ASCII identifiers are a good thing, because there are many places, even in the English-speaking world, even in Ugly America, where people are used to such identifiers. I want to be able to use $Å for a variable representing angstroms, to see the constant Math::Trig::π in trig functions, to declare a sub Σ that does summations, etc. etc.

And even if those don't come through in email properly, they make it through CVS/SVN commits and updates just fine. :)
Re: Question about list context for String.chars
On Mon, 2005-04-11 at 14:12, Ingo Blechschmidt wrote:
> gcomnz wrote:
> > I'm writing a bunch of examples for perl 6 pleac and it seems rather
> > natural to expect $string.chars to return a list of unicode chars in
> > list context, however I can't find anything to confirm that. (The
> > other alternatives being split and unpack.)
>
> I like that.

Same here, though I have to admit that I'm slow on this whole Unicode thing, so I'm not sure what you mean by "Unicode chars".

For example, are you expecting to get "f", "f", "i" or "ﬃ" back when you say .chars on a string containing that ligature? More interestingly, what about all of the Arabic ligatures? Someone who speaks that language might reasonably expect to get them back as multiple chars, but they have their own Unicode codepoints (e.g. U+FCF3 ARABIC LIGATURE SHADDA WITH DAMMA MEDIAL FORM, from which you might expect to get a shadda and a damma). Any Arabic speakers to confirm or deny this behavior of ligatures?

Please be aware, I'm talking about ligatures above, NOT special letters such as "æ", which are their own letters and cannot be decomposed into "a", "e" without losing information.

Given Parrot, what happens when you are presented with a Big5 string that does not have a strict Unicode equivalent? Does .chars throw an exception, or does it rely on the string to know how to "characterify" itself according to its vtable?

--
Aaron Sherman
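A rough illustration of the ligature question, sketched in Python because the Perl 6 behaviour was still undecided at this point in the thread. It assumes that Unicode compatibility decomposition (NFKD) is a fair stand-in for "decomposing a ligature"; the helper name compat_decompose is made up for this example and is not part of any Perl 6 proposal.

```python
import unicodedata


def compat_decompose(text):
    """Return the code points of text after NFKD compatibility decomposition."""
    return list(unicodedata.normalize("NFKD", text))


ffi_ligature = "\ufb03"     # LATIN SMALL LIGATURE FFI
ae_letter = "\u00e6"        # LATIN SMALL LETTER AE, a letter in its own right
shadda_damma = "\ufcf3"     # ARABIC LIGATURE SHADDA WITH DAMMA MEDIAL FORM

# The ffi ligature has a compatibility decomposition into three letters.
print(compat_decompose(ffi_ligature))

# The letter ae has no decomposition, so it comes back unchanged.
print(compat_decompose(ae_letter))

# Print the names of whatever the Arabic ligature decomposes into.
print([unicodedata.name(ch) for ch in compat_decompose(shadda_damma)])
```

Whether a .chars method should apply this kind of decomposition at all is exactly the open question; the sketch only shows that the information needed to do so is available in the Unicode character database.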
Re: Question about list context for String.chars
I have to say I'm slightly confused too for some languages, especially for syllabic alphabets. At the same time, I'm pretty clear for CJK, syllabaries, and alphabets, or at least I hope I'm clear (I guess I'm about to find out): .chars just returns the right Unicode level for whatever the string contents require.

"abc".chars would return a b c, which I'm guessing would be byte-sized usually.

[non-ASCII string lost].chars would return [non-ASCII characters lost], which can probably be expressed with UTF8?

Aaron wrote:
> Same here, though I have to admit that I'm slow on this whole Unicode
> thing, so I'm not sure what you mean by Unicode chars.
>
> For example, are you expecting to get f, f, i or the ffi ligature back
> when you say .chars? More interestingly, what about all of the Arabic
> ligatures which someone who speaks that language might reasonably
> expect to get back as multiple chars, but they have their own Unicode
> codepoint (e.g. U+FCF3 ARABIC LIGATURE SHADDA WITH DAMMA MEDIAL FORM)?
> Any Arabic speakers to confirm or deny this behavior of ligatures?

From Apocalypse 5: "Under level 2 Unicode support, a character is assumed to mean a grapheme, that is, a sequence consisting of a base character followed by 0 or more combining characters."

Marcus
Re: Whither use English?
On Mon, 2005-04-11 at 15:00, Juerd wrote:
> Aaron Sherman skribis 2005-04-11 14:49 (-0400):
> > Yes, but it will be spelled: use $*LANG ;-)
> > Seriously, is there some reason that we would not provide a
> > Language::Russian and Language::Nihongo? Given Perl 6, it would even
> > be quite valid for those modules to add aliases for all of the core
> > functions and keywords, not just global variables.
>
> Because providing it leads to its use, and when it gets used, knowing
> English is no longer enough.

I don't think you can say (as Larry has) that you want to be able to fully re-define the language from within itself and still impose the constraint that "it can't confuse people who don't know anything about my module".

You might argue that Language::Dutch should never ship with the core... that's a valid opinion, but SOMEONE is going to write it. It'd be a strange form of censorship for CPAN not to accept it. After all, there's more than one way to say it... isn't there?

> English sucks in many interesting ways, but at least it's a de facto
> standard and documentation will be available in it.

Let's think about this in terms other than someone distributing code to the masses. What about teaching? If I were going to teach the basic concepts of programming, I'd like to do so with a language whose constructs are all native. This is simply practical: having to learn vocabulary at the same time that you learn a new WAY of communicating makes it harder.

If CPAN had a Language::NYUpperEastSide, then I might consider using that for my elementary computer class rather than try to teach everyone real English AND programming in one year ;-)

> I'm not even sure I like the *possibility* of using non-ascii letters
> in identifiers, even.

I think we already have Latin-1 in identifiers... let me check. Yep:

    pugs> my $[Latin-1 identifier lost] = 1;
    undef
    pugs> $[Latin-1 identifier lost];
    1

Let's see about UTF-8:

    pugs> my $[UTF-8 identifier lost] = 1;
    undef
    pugs> $[UTF-8 identifier lost];
    1

A-yup!

--
Aaron Sherman
Re: Question about list context for String.chars
On 2005-04-11 15:40, gcomnz wrote:
> [non-ASCII string lost].chars would return [non-ASCII characters lost],
> which can probably be expressed with UTF8?

The string is probably represented internally as UTF-8, but that should have no effect on what .chars returns, which should, indeed, be [non-ASCII characters lost], that is, an array whose elements are strings that each represent one Unicode code point, irrespective of encoding.

I think that, in general, at the level of Perl code, one character should be one code point, and any higher-level support for combining and splitting should be outside the core, in Unicode::Whatever.
Re: Question about list context for String.chars
On Mon, 2005-04-11 at 15:40, gcomnz wrote:
> I have to say I'm slightly confused too for some languages, especially
> for syllabic alphabets. At the same time, I'm pretty clear for CJK,
> Syllabaries, and alphabets, or at least I hope I'm clear (I guess I'm
> about to find out), .chars just returns the right unicode level for
> whatever the string contents requires.
>
> abc.chars would return a b c, which I'm guessing would be byte size
> usually.

Fair enough.

> [non-ASCII string lost].chars would return [non-ASCII characters lost],
> which can probably be expressed with UTF8?

I think you're confusing UTF8 (which can represent ALL Unicode characters) and the UTF8 subset which consists of one-byte representations (which happens to overlap with 7-bit ASCII).

> From Apocalyps 5: Under level 2 Unicode support, a character is
> assumed to mean a grapheme, that is, a sequence consisting of a base
> character followed by 0 or more combining characters.

Hmmm... that doesn't answer the ligature question clearly though.

That answers for the case of combining diacritical marks:

    http://en.wikipedia.org/wiki/Combining_diacritical_mark

e.g. an A followed by a combining mark vs. the single pre-combined character. There are also (as I understand it) many valid combinations which do not have a pre-combined representation in Unicode.

But it does not answer for ligatures:

    http://en.wikipedia.org/wiki/Ligature_%28typography%29

which are, by definition, actually two or more unique characters which have a special typographical representation when adjacent. So, they are a single grapheme, but like I said: certain cultures would be shocked by a .chars that did not decompose their ligatures (and again, I'm mostly thinking of Arabic, so I'd defer to someone who actually speaks Arabic and knows how they deal with this).
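The Apocalypse 5 definition quoted in this thread (a base character followed by zero or more combining characters) is simple enough to sketch directly. The Python below is only an approximation under that definition: the function name graphemes is invented for the example, it relies on unicodedata.combining() to spot combining marks, and it is much cruder than the full grapheme cluster rules in Unicode's text segmentation annex.

```python
import unicodedata


def graphemes(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


decomposed = "A\u0301bc"    # 'A' + COMBINING ACUTE ACCENT, then 'b', 'c'
precomposed = "\u00c1bc"    # LATIN CAPITAL LETTER A WITH ACUTE, then 'b', 'c'
ffi_ligature = "\ufb03"     # LATIN SMALL LIGATURE FFI

# Code-point counts differ (4 vs. 3), grapheme counts agree (3 and 3).
print(len(decomposed), len(graphemes(decomposed)))
print(len(precomposed), len(graphemes(precomposed)))

# A ligature is a single code point with no combining marks, so under this
# definition it stays one grapheme, which is the gap Aaron points out.
print(graphemes(ffi_ligature))
```

Under this reading the definition settles the combining-mark case but, as noted in the thread, says nothing about splitting ligatures.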
Re: Question about list context for String.chars
> > abc.chars would return a b c, which I'm guessing would be byte size
> > usually.
>
> Fair enough.
>
> > [non-ASCII string lost].chars would return [non-ASCII characters
> > lost], which can probably be expressed with UTF8?
>
> I think you're confusing UTF8 (which can represent ALL Unicode
> characters) and the UTF8 subset which consists of one-byte
> representations (which happens to overlap with 7-bit ASCII).

Perhaps my confusion is that I thought, perhaps wrongly, that since .chars returns a count that is appropriate for the given Unicode level, then if it were able to return a list in list context, each element would have the storage size needed for the given string contents. For instance, a b c just requires one byte for each element, while Kanji would require more. I'm leaving very wide room open here for me really misunderstanding how all this works.

> > From Apocalyps 5: Under level 2 Unicode support, a character is
> > assumed to mean a grapheme, that is, a sequence consisting of a base
> > character followed by 0 or more combining characters.
>
> Hmmm... that doesn't answer the ligature question clearly though.
>
> That answers for the case of combining diacritical marks:

I read "followed by 0 or more combining characters" to mean that it is smart enough to combine the vowels in Arabic and other syllabic alphabets that use special conjuncts. However, I'm also not exactly sure that's even reasonably possible, or that it makes sense when counting characters for languages that use those.
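To make the storage-size point concrete, here is a small Python sketch (not Perl 6) that lists each character of a string alongside the number of bytes it occupies in UTF-8. The helper name utf8_sizes and the sample strings are assumptions chosen for illustration; the Kanji sample is not the one from the original message.

```python
def utf8_sizes(text):
    """Return (character, UTF-8 byte length) pairs, one per code point."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


ascii_sample = "abc"
kanji_sample = "\u65e5\u672c\u8a9e"   # three CJK ideographs

print(utf8_sizes(ascii_sample))   # one byte per character
print(utf8_sizes(kanji_sample))   # three bytes per character
```

Either way the list has one element per character; the per-element byte size is a property of the encoding, not of what a character-level view returns.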
Here documents as positional parameters to a function call
Hey all, more PLEAC conversion questions:

I can't prove from the docs that a heredoc will continue to work as a positional parameter to a function call, particularly where it's not the last thing on the line:

    die "Couldn't send mail" unless send_mail qq:to/EOTEXT/, $target;
    here doc here
    ...
    EOTEXT

Any comments?

Marcus
Re: Here documents as positional parameters to a function call
gcomnz writes:
> Hey all, more pleac conversion questions:
>
> I can't prove with the docs that a heredoc will continue to work as
> positional params to a function call, particularly where it's not the
> first param:
>
>     die "Couldn't send mail" unless send_mail qq:to/EOTEXT/, $target
>     here doc here
>     ...
>     EOTEXT

Here docs work just like in Perl 5, with two differences: they are spelled qq:to/END/, q:to/END/, etc., and the ending text can have leading whitespace, which is stripped off of the text.

Luke
Re: Question about list context for String.chars
gcomnz wrote:
> Hi all,
>
> I'm writing a bunch of examples for perl 6 pleac and it seems rather
> natural to expect $string.chars to return a list of unicode chars in
> list context, however I can't find anything to confirm that. (The other
> alternatives being split and unpack.)
>
>     # unpack
>     @array = unpack("C*", $string);
>     # split
>     @array = split /./, $string;
>     # this too?
>     @array = $string.split(/./)
>     # and how about this?
>     @array = $string.chars
>     # and this explicit list context?
>     @array = $string.chars[];
>
> Thanks,
> Marcus

Well, in general the word "chars" has come to mean "whatever a character is in the current lexical scope", typically a language-level char.

It had previously been decided that C<.chars>, etc. would return the length. I'm not about to change that without approval from @Larry.

I don't see any technical problem with saying that C<.chars> returns an array of those chars, which then gets converted to the length of the array in scalar context. Creating a list just to get its length can of course be optimized away.

My main issue is that it's giving two rather different semantics to the same method name, and leaving it to what amounts to context-based dispatching. So I don't like this idea as written.

However, I do like the idea of treating a string as an array of chars. I remember some discussion a while back about making [] on strings do something useful (but not the same thing as C<substr>), but I forget how it ended, and my brain is too fried to go hunt it down. But overall I like that idea. Then you could just say:

    @array = $string[];

Which is a lot prettier than anything you mentioned above, lets us get rid of the .split:/null/ issue, has better Huffman coding, and lets .chars have only one meaning.

For reference, what I'm thinking of having [] do is return the specified chars as a list. This should be lvaluable, so you can hack at individual chars to your heart's content. This is different from substr(), since the latter returns a string covering the range of chars, not the individual chars. Consider:

    $a = $b = "All good boys go to heaven.";
    substr($a, 9, 3) = "girl";
    $b[9..11] = "girl"[];
    say "A: $a";
    say "B: $b";

    A: All good girls go to heaven.
    B: All good girs go to heaven.

-- Rod Adams
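The difference between the two assignments in Rod's example can be mimicked in Python, whose strings are immutable, so both helpers below build a new string. The function names replace_substr and assign_chars are hypothetical stand-ins for the lvalue substr() and the proposed lvalue $string[...] slice; the second one assumes Perl's list-assignment rule that surplus right-hand values are dropped.

```python
def replace_substr(text, start, length, replacement):
    """Replace a whole range with a string of any length (like lvalue substr)."""
    return text[:start] + replacement + text[start + length:]


def assign_chars(text, indices, replacement_chars):
    """Overwrite individual characters one by one (like an lvalue char slice).

    zip() stops at the shorter input, so extra replacement characters are
    dropped and the string keeps its original length.
    """
    chars = list(text)
    for index, ch in zip(indices, replacement_chars):
        chars[index] = ch
    return "".join(chars)


original = "All good boys go to heaven."
a = replace_substr(original, 9, 3, "girl")
b = assign_chars(original, range(9, 12), "girl")

print("A:", a)   # A: All good girls go to heaven.
print("B:", b)   # B: All good girs go to heaven.
```

The range version changes the string's length, while the per-character version cannot, which is why the "l" of "girl" is lost in B.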
Re: Question about list context for String.chars
Rod wrote:
> However, I do like the idea of treating a string as an array of chars. I
> remember some discussion a while back about making [] on strings do
> something useful (but not the same thing as C<substr>), but I forget how
> it ended, and my brain is too fried to go hunt it down. But overall I
> like that idea. Then you could just say:
>
>     @array = $string[];

This all sounds nice and simple. My only question then is what about the instances where you specifically need the array of graphs, codes, bytes, or whatever? If we can do one, why not all?

I recall that a good point Larry made previously is not to bend over backward to let C programmers still think like C programmers in Perl (sorry if my munging didn't get that just right). And to be honest I only came up with this question for the cookbook (PLEAC) examples, but I'm guessing there's some reasonable use for all this stuff outside of the C-thinking world?
Re: Question about list context for String.chars
On Apr 12, 2005 12:20 AM, gcomnz wrote:
> Rod wrote:
> > However, I do like the idea of treating a string as an array of
> > chars. I remember some discussion a while back about making [] on
> > strings do something useful (but not the same thing as C<substr>),
> > but I forget how it ended, and my brain is too fried to go hunt it
> > down. But overall I like that idea. Then you could just say:
> >
> >     @array = $string[];
>
> This all sounds nice and simple. My only question then is what about
> the instances where you specifically need the array of graphs, codes,
> bytes, or whatever? If we can do one, why not all?

That's why C<$string.chars[]> was proposed -- it would be accompanied by .graphs, .codes, and .bytes.

That is all fine and dandy, but I don't think I should have to think about Unicode if I don't want to. And if I understand correctly, that means that I want everything to use chars by default. And C<$string[]> would be a nice shortcut for that.

--
matt diephouse
http://matt.diephouse.com
Re: Question about list context for String.chars
> > > However, I do like the idea of treating a string as an array of
> > > chars. I remember some discussion a while back about making [] on
> > > strings do something useful (but not the same thing as C<substr>),
> > > but I forget how it ended, and my brain is too fried to go hunt it
> > > down. But overall I like that idea. Then you could just say:
> > >
> > >     @array = $string[];
> >
> > This all sounds nice and simple. My only question then is what about
> > the instances where you specifically need the array of graphs, codes,
> > bytes, or whatever? If we can do one, why not all?
>
> That's why C<$string.chars[]> was proposed -- it would be accompanied
> by .graphs, .codes, and .bytes.
>
> That is all fine and dandy, but I don't think I should have to think
> about unicode if i don't want to. And if I understand correctly, that
> means that I want everything to use chars by default. And C<$string[]>
> would be a nice shortcut for that.

Yes, that's sort of what I was arguing for, in an underhanded way. I agree that $string[] is a good shorthand for the most common usage ($string.chars[]) too.
Re: Question about list context for String.chars
Matt Diephouse wrote:
> On Apr 12, 2005 12:20 AM, gcomnz wrote:
> > Rod wrote:
> > > However, I do like the idea of treating a string as an array of
> > > chars. I remember some discussion a while back about making [] on
> > > strings do something useful (but not the same thing as C<substr>),
> > > but I forget how it ended, and my brain is too fried to go hunt it
> > > down. But overall I like that idea. Then you could just say:
> > >
> > >     @array = $string[];
> >
> > This all sounds nice and simple. My only question then is what about
> > the instances where you specifically need the array of graphs, codes,
> > bytes, or whatever? If we can do one, why not all?
>
> That's why C<$string.chars[]> was proposed -- it would be accompanied
> by .graphs, .codes, and .bytes.
>
> That is all fine and dandy, but I don't think I should have to think
> about unicode if i don't want to. And if I understand correctly, that
> means that I want everything to use chars by default. And C<$string[]>
> would be a nice shortcut for that.

I've been meaning to ask what people think about having operators that temporarily change the current lexical Unicode level for just one single expression. I see them as solving all kinds of corner cases.

Unfortunately, I don't have a solid proposal handy, which has kept me from posting it. But since there is some interest in this, I'll throw the concept out there and see if anyone else has a good idea of what they should look like and exactly how they should work.

-- Rod Adams