Re: The .bytes/.codepoints/.graphemes methods
In article [EMAIL PROTECTED], [EMAIL PROTECTED] (Larry Wall) wrote:
> On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote:
> :     :u0     # use bytes (. is byte)
> :     :u1     # level 1 support (. is codepoint)
> :     :u2     # level 2 support (. is grapheme)
> :     :u3     # level 3 support (. is language dependent)
>
> These modifiers might get renamed to match whatever b/c/g/w convention we come up with for pragmas. The levels aren't all that intuitive, though there is a kind of progression of semantic complexity that would get lost with ordinary names.

    bytes codepts graphemes langdepends

That's a kind of progression. And "codepts" seems a natural enough abbreviation, though I don't really know what to do with language_dependent_thingummies.

Though with less typing, the initials b c g l give the same progression.

-David "except for encodings where cb, of course" Green
RE: The .bytes/.codepoints/.graphemes methods
-----Original Message-----
From: Jonadab the Unsightly One [mailto:[EMAIL PROTECTED]

> Austin Hastings [EMAIL PROTECTED] writes:
> > I think this is something that we'll want as a mode, a la case-insensitivity. Think of it as "mark insensitivity".
>
> Makes sense to me, but...
>
> > Maybe it can just roll into :i?
>
> It will probably get used in _conjunction_ with case-insensitivity quite a lot, but I suspect people will want to be able to use one without the other.
>
> Since mark-insensitivity is probably mostly a non-issue in the ASCII world, it would probably be a better candidate than average for being turned on using a Unicode character, if we're running low on letters for designating these rules.

How about :i ? :) :) :)

=Austin
Re: The .bytes/.codepoints/.graphemes methods
Luke Palmer [EMAIL PROTECTED] writes:
> Or, god forbid, a word?
>
>     m:base/que mas/
>
> We're not mathematicians: we're allowed to use more than one letter in a row to designate something :-)

Well, if it were *me*, *I* would have voted for keeping the core language 100% pure ASCII, untainted by rogue untypeable characters... So naturally :base is fine by *me*...

--
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b-()}} split//,[EMAIL PROTECTED]/ --;$\=$ ;- ();print$/
Re: The .bytes/.codepoints/.graphemes methods
Austin Hastings [EMAIL PROTECTED] writes:
> I think this is something that we'll want as a mode, a la case-insensitivity. Think of it as "mark insensitivity".

Makes sense to me, but...

> Maybe it can just roll into :i?

It will probably get used in _conjunction_ with case-insensitivity quite a lot, but I suspect people will want to be able to use one without the other.

Since mark-insensitivity is probably mostly a non-issue in the ASCII world, it would probably be a better candidate than average for being turned on using a Unicode character, if we're running low on letters for designating these rules.

--
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b-()}} split//,[EMAIL PROTECTED]/ --;$\=$ ;- ();print$/
Re: The .bytes/.codepoints/.graphemes methods
Jonadab the Unsightly One writes:
> Austin Hastings [EMAIL PROTECTED] writes:
> > I think this is something that we'll want as a mode, a la case-insensitivity. Think of it as "mark insensitivity".
>
> Makes sense to me, but...
>
> > Maybe it can just roll into :i?
>
> It will probably get used in _conjunction_ with case-insensitivity quite a lot, but I suspect people will want to be able to use one without the other.
>
> Since mark-insensitivity is probably mostly a non-issue in the ASCII world, it would probably be a better candidate than average for being turned on using a Unicode character, if we're running low on letters for designating these rules.

Or, god forbid, a word?

    m:base/que mas/

We're not mathematicians: we're allowed to use more than one letter in a row to designate something :-)

Luke
Re: The .bytes/.codepoints/.graphemes methods
--- Larry Wall [EMAIL PROTECTED] wrote:
> On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote:
> : Or was that to imply that a literal "a" in the RE would be
> : interpreted as a grapheme "a" when :u2 is active?
>
> I don't know what you mean by grapheme "a" there. If you mean, "Does it match any grapheme that happens to be exactly U+0061?", then the answer is yes.

In my original question, I meant to differentiate between 'grapheme' and 'possible component of a multibyte expression'.

> If you mean "Does it wildcard to any grapheme that uses U+0061 as the base character?", then the answer is probably no. We have not yet come up with a syntax for that kind of wildcarding, other than dropping down to codepoints [:u1 a \pM+] or some such. That may or may not be sufficient. It'd be pretty easy to define a "like a" assertion in any case.

I think this is something that we'll want as a mode, a la case-insensitivity. Think of it as "mark insensitivity".

I'm not sure if this should be language/locale dependent or not, but a basic search feature for text is "fre'd" -> "fred".

Maybe it can just roll into :i?

=Austin
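The "mark insensitivity" mode proposed above can be sketched in Python. This is only an illustration under stated assumptions: the function names `strip_marks` and `mark_insensitive_equal` are hypothetical, "mark" is taken to mean any codepoint in a Unicode M* general category after canonical decomposition, and nothing here reflects how a Perl 6 rule modifier would actually be implemented.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop every codepoint whose category is a mark (M*)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


def mark_insensitive_equal(a: str, b: str, ignore_case: bool = False) -> bool:
    """Compare two strings while ignoring combining marks.

    Case-insensitivity is a separate switch, mirroring the point made in the
    thread that people may want one mode without the other.
    """
    a, b = strip_marks(a), strip_marks(b)
    if ignore_case:
        a, b = a.casefold(), b.casefold()
    return a == b
```

Keeping `ignore_case` as its own flag is a design choice taken from the discussion: rolling mark insensitivity into :i would make the two inseparable.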
Re: The .bytes/.codepoints/.graphemes methods
On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote:
: On Tue, Jun 29, 2004 at 08:34:16AM -0700, Austin Hastings wrote:
: > This has no direct bearing on p6l, since performance is a p6i issue.
: > But perhaps in the interests of performance as well as hackery we
: > should explicitly provide some sort of variant regex behavior:
: >
: >     /a./ :bytes
: >     /a./ :graphemes
: >
: > where the first would recognize 0x61 followed by any single byte, while
: > the second would recognize 'a' followed by any number of bytes
: > composing a single grapheme.
:
: Isn't that what :u0, :u1, :u2, and :u3 are for?
:
:     :u0     # use bytes (. is byte)
:     :u1     # level 1 support (. is codepoint)
:     :u2     # level 2 support (. is grapheme)
:     :u3     # level 3 support (. is language dependent)

These modifiers might get renamed to match whatever b/c/g/w convention we come up with for pragmas. The levels aren't all that intuitive, though there is a kind of progression of semantic complexity that would get lost with ordinary names.

: These modifiers say nothing about the state of the data, but in
: general internal Perl data will already be in Normalization Form
: C, so even under :u1, the precomposed characters will usually do
: the right thing.

These days it might be that most of the data we see will be maximally decomposed rather than maximally composed. But the jury is still out on that. And in any event, :u2 and :u3 should hide that distinction.

: Note that these modifiers are for overriding
: the default support level, which was probably set by pragma at
: the top of the file.

Another way of saying that is that these modifiers are, in fact, lexically scoped pragmas with the *exact* same effect as the ordinary Unicode level pragmas. It's just that they're lexically scoped to the rest of a rule or group rather than to the rest of a block.

: Or was that to imply that a literal "a" in the RE would be
: interpreted as a grapheme "a" when :u2 is active?

I don't know what you mean by grapheme "a" there. If you mean, "Does it match any grapheme that happens to be exactly U+0061?", then the answer is yes. If you mean "Does it wildcard to any grapheme that uses U+0061 as the base character?", then the answer is probably no. We have not yet come up with a syntax for that kind of wildcarding, other than dropping down to codepoints [:u1 a \pM+] or some such. That may or may not be sufficient. It'd be pretty easy to define a "like a" assertion in any case.

Larry
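The "drop down to codepoints" idea, a base character followed by combining marks, can be approximated in Python as follows. This is a hedged sketch and not the Perl 6 rule engine: `find_base` is a hypothetical helper, it accepts zero or more marks (so a bare "a" also counts), and it works on the NFD form of the input because Python's standard `re` module has no `\pM` property class.

```python
import unicodedata
from typing import List


def is_mark(ch: str) -> bool:
    """True for codepoints in a Unicode mark category (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def find_base(text: str, base: str) -> List[str]:
    """Return each run of `base` plus any following marks, recomposed to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    hits = []
    i = 0
    while i < len(decomposed):
        if decomposed[i] == base:
            j = i + 1
            # Swallow every combining mark attached to this base character.
            while j < len(decomposed) and is_mark(decomposed[j]):
                j += 1
            hits.append(unicodedata.normalize("NFC", decomposed[i:j]))
            i = j
        else:
            i += 1
    return hits
```

Normalizing first means precomposed and decomposed input behave the same, which is the distinction the higher levels are said to hide.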
Re: The .bytes/.codepoints/.graphemes methods
On Wed, Jul 07, 2004 at 08:09:51PM -0700, Larry Wall wrote:
: On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote:
: : On Tue, Jun 29, 2004 at 08:34:16AM -0700, Austin Hastings wrote:
: : > This has no direct bearing on p6l, since performance is a p6i issue.
: : > But perhaps in the interests of performance as well as hackery we
: : > should explicitly provide some sort of variant regex behavior:
: : >
: : >     /a./ :bytes
: : >     /a./ :graphemes
: : >
: : > where the first would recognize 0x61 followed by any single byte, while
: : > the second would recognize 'a' followed by any number of bytes
: : > composing a single grapheme.
: :
: : Isn't that what :u0, :u1, :u2, and :u3 are for?
: :
: :     :u0     # use bytes (. is byte)
: :     :u1     # level 1 support (. is codepoint)
: :     :u2     # level 2 support (. is grapheme)
: :     :u3     # level 3 support (. is language dependent)
:
: These modifiers might get renamed to match whatever b/c/g/w convention
: we come up with for pragmas. The levels aren't all that intuitive, though
: there is a kind of progression of semantic complexity that would get
: lost with ordinary names.

On the flip side, a good reason to get rid of the numeric values is that in all likelihood people will continually make the mistake of thinking :u1 means one byte at a time and :u2 means two bytes at a time. And then they'll wonder why :u4 doesn't give them UTF-32...

Larry
Re: The .bytes/.codepoints/.graphemes methods
Aaron Sherman wrote:
> On Tue, 2004-06-29 at 11:34, Austin Hastings wrote:
> > (2) Perl6 should equitably support all its target locales; (3) we should set out to make sure the performance is damn fast no matter what locale we're using.
>
> Well, that's a nice theory, but you can prove that low-level encodings (e.g. ASCII, EBCDIC) will be more efficient than high-level encodings (e.g. UTF-8), so the only way to accomplish what you suggest in (2) is to break (3) by slowing down the faster handling (not what you wanted, I'm sure).

At the Parrot level, codepoint operations will generally be the most efficient, even on strings with exotic charsets. Parrot uses an internal encoding that allows O(1) access to codepoints; essentially, it uses an array of 8-, 16-, or 32-bit integers, depending on the highest codepoint value. This is the default even for character sets with shift characters, like Shift-JIS.

On strings where all codepoints have values under 256, bytewise and codepointwise lookup are equivalent; otherwise, though, bytewise lookup will actually be *slower* than codepointwise, as Parrot will maintain the illusion that each codepoint is stored in an integer that's the perfect size for it.

If you force Parrot to use the UTF-8 encoding internally then bytewise lookup becomes fastest, and codepointwise slows down a lot. But you really shouldn't do that--UTF-8 is ill-suited for actually *manipulating* text, unlike the Parrot internal encodings. (UTF-16 and UTF-32 will presumably be available too, although I've seen no specific mention of them.)

You can also force it to use a "raw" or "bytes" encoding, where bytes and codepoints are identical. But you can't store Unicode characters in such a string and have them behave in a reasonable way.

(Note: this is all based on my own, possibly false, memory.)

--
Brent "Dax" Royal-Gordon [EMAIL PROTECTED]
Perl and Parrot hacker

Oceania has always been at war with Eastasia.
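The storage scheme described above (one fixed-width integer per codepoint, with the width chosen from the highest codepoint in the string) can be illustrated with a short Python sketch. It is an assumption-laden illustration only: `pick_width`, `pack_fixed`, and `codepoint_at` are hypothetical names, little-endian byte order is an arbitrary choice, and this is not Parrot's actual code. CPython's own flexible string representation (PEP 393) follows a similar idea.

```python
def pick_width(text: str) -> int:
    """Bytes per codepoint needed to hold the highest codepoint in `text`."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1
    if highest < 0x10000:
        return 2
    return 4


def pack_fixed(text: str) -> bytes:
    """Store every codepoint in the same number of bytes (little-endian)."""
    width = pick_width(text)
    return b"".join(ord(ch).to_bytes(width, "little") for ch in text)


def codepoint_at(buf: bytes, width: int, index: int) -> str:
    """O(1) lookup: the offset is simply index * width, no scanning required."""
    start = index * width
    return chr(int.from_bytes(buf[start:start + width], "little"))
```

The trade-off is space for speed: a single high codepoint widens every element, but indexing never has to walk variable-length sequences as it would in UTF-8.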
Re: The .bytes/.codepoints/.graphemes methods
On Tue, 2004-06-29 at 11:34, Austin Hastings wrote:
> [...] when you switch to LC_ALL=<pick your favorite language>, you just get really slow performance: Apparently the 'C' locale is such a totally special case that the performance of LC_ALL=C is one or more orders of magnitude better than LC_ALL=en_US.UTF-8, even when the data is 7-bit ASCII.

Well, of course. I can't imagine a way in which this would not be true. After all, in LC_ALL=C the number of characters in a string is equal to the number of bytes in the string. In LC_ALL=en_US.UTF-8 the length of a string is dependent on what exactly you mean by length, and a lot of special cases arise. Special cases and context mean you have more code to execute for the same logical task, which means you have more processing to do.

Unicode support is expensive, even if you're just doing ASCII-as-UTF-8. That doesn't mean it's a bad thing to do, it's just that it's expensive.

> I think that (1) this is unacceptable: the temptation to switch to the 'C' locale has been too great, both at this site and on a lot of the RH support forums;

And yet, in English-speaking countries (and Hawaiian and Swahili-speaking countries for that matter) and in situations where the fidelity of certain types of string data (such as names) is not considered critical, this is a fine default, e.g. for general shell work.

> (2) Perl6 should equitably support all its target locales; (3) we should set out to make sure the performance is damn fast no matter what locale we're using.

Well, that's a nice theory, but you can prove that low-level encodings (e.g. ASCII, EBCDIC) will be more efficient than high-level encodings (e.g. UTF-8), so the only way to accomplish what you suggest in (2) is to break (3) by slowing down the faster handling (not what you wanted, I'm sure).

Of course, you want to have as much performance out of string handling as possible.

> This has no direct bearing on p6l, since performance is a p6i issue. But perhaps in the interests of performance as well as hackery we should explicitly provide some sort of variant regex behavior:
>
>     /a./ :bytes
>     /a./ :graphemes

As pointed out by others, this is already there, though I'm not sure that it would be specified that way. More likely:

    m :u0 /a./

[etc]

--
Aaron Sherman [EMAIL PROTECTED]
Senior Systems Engineer and Perl Toolsmith
http://www.ajs.com/~ajs/resume.html
Re: The .bytes/.codepoints/.graphemes methods
Larry Wall wrote:
> On Sat, Jun 26, 2004 at 12:27:38PM -0700, Brent 'Dax' Royal-Gordon wrote:
> : Issues:
> :   * Limits lvalue substr (doesn't allow it to be a different size)
> :     unless splice is used (or a substr method is also provided).
>
> That all has to be looked at anyway. What does 5 mean when you pass it to substr, anyway? (I've been trying to make it assume some implicit unit based on the current lexical scope's Unicode level, but issues remain.) We have magical string positions that have different numeric values depending on what units you view them as, but at what point does a number like 5 get translated to such a magical string position?

While we're on the topic of substr, allow me to beg. Please, can we replace substr with array-style operations like Ruby and Python? Please? Something like this would be nice:

    my $string = "Hello, World!";
    say $string[0..4];      # prints "Hello\n"
    $string[7...] = "Larry!";
    say $string;            # prints "Hello, Larry!\n"

We already have our strings acting as objects, and we have [] as a postcircumfix operator, so it's something that someone could define easily. Of course, I have no idea how to reconcile this with all the talk of Unicode other than to say that the easy stuff should be easy.

It just follows that this would also be nice for arrays, to replace splice. For me, these two functions are the most bothersome part of Perl 5, and I would love to see them go.

matt
Re: The .bytes/.codepoints/.graphemes methods
Matt Diephouse skribis 2004-06-30 20:51 (-0400):
>     my $string = "Hello, World!";
>     say $string[0..4];      # prints "Hello\n"
>     $string[7...] = "Larry!";
>     say $string;            # prints "Hello, Larry!\n"

And that array is one of bytes? graphemes?

In general, I like the idea.

In [EMAIL PROTECTED], almost the same was suggested, but implemented differently: a string's .bytes method in list context (but isn't it array context, technically?) would dwym. As would the other parts-of-string methods.

Perhaps without method, the string in array/list context can default to the default set by a lexical pragma. Which, I hope, has a default itself. (I like default defaults...)

Juerd
Re: The .bytes/.codepoints/.graphemes methods
Juerd wrote:
> Matt Diephouse skribis 2004-06-30 20:51 (-0400):
> >     my $string = "Hello, World!";
> >     say $string[0..4];      # prints "Hello\n"
> >     $string[7...] = "Larry!";
> >     say $string;            # prints "Hello, Larry!\n"
>
> And that array is one of bytes? graphemes?

I'm not really up on my Unicode, but I think .chars is what I have in mind. I want it to operate like a non-Unicode string in Perl 5. Anything Unicode can be more complex, as I think this will be the common case.

> In general, I like the idea.
>
> In [EMAIL PROTECTED], almost the same was suggested, but implemented differently: a string's .bytes method in list context (but isn't it array context, technically?) would dwym. As would the other parts-of-string methods.

Think of this as Huffmanized .chars then?

matt
Re: The .bytes/.codepoints/.graphemes methods
On Thu, 1 Jul 2004, Juerd wrote:
> Matt Diephouse skribis 2004-06-30 20:51 (-0400):
> >     my $string = "Hello, World!";
> >     say $string[0..4];      # prints "Hello\n"
> >     $string[7...] = "Larry!";
> >     say $string;            # prints "Hello, Larry!\n"
>
> And that array is one of bytes? graphemes?
>
> In general, I like the idea.
>
> In [EMAIL PROTECTED], almost the same was suggested, but implemented differently: a string's .bytes method in list context (but isn't it array context, technically?) would dwym. As would the other parts-of-string methods.

What if you could add the slice onto the method:

    my $string = "Hello, World!";
    say $string.bytes[0..4];        # prints "Hello\n"
    $string.codepoints[7...] = "Søren!";
    say $string;                    # prints "Hello, Søren!\n"

The string slicing operator would have to return an array of bytes/codepoints/etc in list context and a substr in scalar context.

~ John Williams
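Why the unit of a slice matters can be shown with a small Python sketch. The helper names `slice_bytes` and `slice_codepoints` are hypothetical, and UTF-8 is an assumed encoding; the point is only that the same index range selects different text depending on the unit.

```python
def slice_bytes(text: str, start: int, stop: int, encoding: str = "utf-8") -> bytes:
    """Slice the encoded form; the result may end in the middle of a character."""
    return text.encode(encoding)[start:stop]


def slice_codepoints(text: str, start: int, stop: int) -> str:
    """Slice by codepoint; Python's str is already indexed this way."""
    return text[start:stop]


greeting = "Hello, S\u00f8ren!"

# The same range [7, 12) in two different units:
by_codepoints = slice_codepoints(greeting, 7, 12)  # five codepoints
by_bytes = slice_bytes(greeting, 7, 12)            # five bytes, one codepoint fewer
```

With these inputs the codepoint slice is the whole name, while the byte slice stops one letter short because "ø" occupies two bytes in UTF-8. A byte slice such as `[7, 9)` would cut that character in half and could not be decoded at all.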
Re: The .bytes/.codepoints/.graphemes methods
Dan Sugalski [EMAIL PROTECTED] writes:
> > Hmm. Suppose that I have a system that is friendly to 80 byte records. I want to output meaningful strings, so I want to partition a buffer into 80-ish byte substrings, but preserve any graphemes (i.e., store the data in a legible format). How would I do that?
>
> You don't. Or if you do, you do it with a lot of pain, sweat, and annoying hard work. 80 bytes gets you somewhere between three (And this may be a *high* estimate--there may be circumstances where 80 bytes is insufficient for *one* grapheme) and 80 graphemes. This isn't something that can be made generically easy.

It's no worse than implementing word wrap.

Someone will of course implement it as a generic routine, something along the lines of

    my @line = breakunicodestringintobytebufferchunks(
        string           => $string,
        chunksize        => 80,
        keeptogether     => 'graphemes',
        extremelongparts => 'split',
        # 'split' will try to split it at a mostly-reasonable
        #   place if possible, similar to word wrap that looks
        #   for syllable boundaries.
        # 'truncate' would do the same but drop the second part,
        #   rather than putting it in the next line.
        # 'skip' would drop the whole grapheme out.
        # 'allow' would create a line longer (in bytes) than
        #   the chunksize, which is what a lot of word wrap
        #   algorithms do, but would not work if you really
        #   have to fit in a fixed-byte-size buffer. It would
        #   of course put the thing on a line by itself though,
        #   to minimize the overflow.
    );

There are reasons for doing this, e.g. if you've got Unicode text to send via a network protocol with an octet-oriented RFC, or if you're interacting with some legacy C code that has fixed-size buffers.

Someone will write the routine to do as well as can be expected, and it'll be put on the CPAN, and people who need this sort of thing will use it. I don't think the language needs to be designed around it though.

--
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b-()}} split//,[EMAIL PROTECTED]/ --;$\=$ ;- ();print$/
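A rough Python sketch of such a chunking routine follows. It rests on several assumptions that are not in the thread: the names `clusters` and `chunk_bytes` are hypothetical, UTF-8 is the target encoding, a "grapheme" is approximated as one codepoint plus any following combining marks (not the full Unicode segmentation rules), and only the 'allow' policy for oversize graphemes is implemented.

```python
import unicodedata
from typing import Iterator, List


def clusters(text: str) -> Iterator[str]:
    """Yield approximate graphemes: a codepoint plus the marks that follow it."""
    current = ""
    for ch in text:
        if current and unicodedata.category(ch).startswith("M"):
            current += ch          # a combining mark joins the cluster in progress
        else:
            if current:
                yield current
            current = ch           # any other codepoint starts a new cluster
    if current:
        yield current


def chunk_bytes(text: str, chunksize: int = 80, encoding: str = "utf-8") -> List[bytes]:
    """Greedily pack whole clusters into chunks of at most `chunksize` bytes.

    'allow' policy: a single cluster longer than `chunksize` is emitted alone
    in an over-long chunk instead of being split or dropped.
    """
    chunks: List[bytes] = []
    buf = b""
    for cluster in clusters(text):
        piece = cluster.encode(encoding)
        if buf and len(buf) + len(piece) > chunksize:
            chunks.append(buf)     # flush before the cluster that would overflow
            buf = b""
        buf += piece
    if buf:
        chunks.append(buf)
    return chunks
```

Because chunk boundaries only ever fall between clusters, each chunk decodes on its own, and joining the chunks reproduces the original encoded bytes.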
Re: The .bytes/.codepoints/.graphemes methods
Austin Hastings [EMAIL PROTECTED] writes:
> A couple of alternatives:
>
>     substr.bytes($string, 2, 4) = $substitute;

Well, that's arguably better than bsubstr.

>     substr($string.bytes, 2, 4) = $substitute;

I could live with that, although it doesn't allow mixing units. (Someone will pop in here and say that's to be construed as a feature.)

>     # Make it a pragma
>     use String(bytes);
>     substr($string, 2, 4) = $substitute;

I think a pragma should set the default unit for the current lexical scope, at least. (The default, in the absence of the pragma, is an open question; at worst the default could be to throw an exception if units aren't specified; personally I think throwing exceptions willy nilly is unPerlish.)

>     # Make it a global mode
>     set_string_mode(bytes);
>     substr($string, 2, 4) = $substitute;

I don't like this. It's no more useful than the pragma but has bigger caveats.

>     # Make it an object mode
>     $string.access_mode(bytes);
>     substr($string, 2, 4) = $substitute;

Wouldn't this add extra operations all over the place?

> > The word bytes is clearly much too long, though, much less graphemes or codepoints. I thought about this:
> >
> >     substr($string, 2b, 4b) = $substitute;
>
> Problems with:
>
>     substr($string, 0b, 1b) = $substitute;
>
> Is that binary or bytes? Also:

I figured it would conflict with something.

>     substr($string, $start b, $end b) = $substitute;
>
> Looks unintuitive.

*shrug*. I chose it because I thought the other way around looked unintuitive:

    substr($string, b $start, b $end) = $substitute;

That looks like calling a function -- which *is* what's going on, under the hood, but the other way around looks like tagging on units, which seems more natural to me.

> > With presumably g and c for graphemes and codepoints, but I rather suspect that might conflict with some other existing syntax (though I can't think of anything in particular).
>
> 0c? 0x16c ?

Ick, yes, I missed that. (I was thinking only of numbers specified in decimal.) I knew there'd be something.

> > codes and graphs is better than codepoints and graphemes, at least.
>
> In certain (IMO large) sectors of the Perl community, string processing is just about all the work there is. I submit that there needs to be a way to drive the token length to 0: either a pragma, or a global mode, or a type definition.

A pragma should set the default, IMO. I think what we're talking about here is what the syntax would be for using a unit other than the default, or for specifying the units if you haven't used the pragma to set the default.

> > You could coin the abbreviation ligs, for Language Independent Graphemes. Then some ingenious rascal can create a pragma or whatever that allows $str.b, $str.c, $str.g, and $str.l for fans of terseness.
>
> As opposed to 'ligs' meaning ligatures? Fraught with peril. :-)

I thought about that, but figured it wasn't a big deal; there are *lots* of abbreviations with more than one possible interpretation, and you just deal with having to know which one is meant.

However, it was then pointed out that it would actually be ldgs, which IMO is unpronounceable and ugly. So something else is needed for those. *shrug*. Make up a word. Call them woohickies for all I care and abbreviate it woo or just w.

> I like graphemes for the default because I hate and fear graphemes. The whole *code thing just crawls right in my ear, so having the language transparently support it would be a win.

I can see the logic in that. Personally I don't care what the default is. Almost none of my code will need to care one way or the other, and that which does can use the pragma.

Have the implications of the bytes/codepoints/graphemes/woohickies distinction for the regular expression engine been discussed already?

--
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b-()}} split//,[EMAIL PROTECTED]/ --;$\=$ ;- ();print$/
Re: The .bytes/.codepoints/.graphemes methods
--- Jonadab the Unsightly One [EMAIL PROTECTED] wrote:
> Have the implications of the bytes/codepoints/graphemes/woohickies distinction for the regular expression engine been discussed already?

Not enough. One of my current clients just rolled on to Red Hat 9, and what a steaming pile of digestive byproducts *that* turned out to be.

Apparently the default locale setting changed, so now LC_ALL= out of the box. One effect of this is irritating lack of proper behavior in the utilities. But when you switch to LC_ALL=<pick your favorite language>, you just get really slow performance: Apparently the 'C' locale is such a totally special case that the performance of LC_ALL=C is one or more orders of magnitude better than LC_ALL=en_US.UTF-8, even when the data is 7-bit ASCII.

I think that (1) this is unacceptable: the temptation to switch to the 'C' locale has been too great, both at this site and on a lot of the RH support forums; (2) Perl6 should equitably support all its target locales; (3) we should set out to make sure the performance is damn fast no matter what locale we're using.

This has no direct bearing on p6l, since performance is a p6i issue. But perhaps in the interests of performance as well as hackery we should explicitly provide some sort of variant regex behavior:

    /a./ :bytes
    /a./ :graphemes

where the first would recognize 0x61 followed by any single byte, while the second would recognize 'a' followed by any number of bytes composing a single grapheme. (I'll claim that it's legitimate to want to search for, say, any MBCs introduced via \x0F\x01, regardless of length. This is likely not supported any other way.)

=Austin
Re: The .bytes/.codepoints/.graphemes methods
Juerd [EMAIL PROTECTED] writes:
>     substr($string, 2 but graphemes, 4 but bytes);
>
> I think "but" even makes sense, if substr defaults to something.

That could be combined with a smart substr that only needs the units once (err, only needs a position object for one of the args) and knows how to convert the other number to the same units (err, same type of position object):

    substr($string, 2, 4 but bytes);

This would still allow for specifying units on both if you for some reason wanted them different (which, as Dan S points out, sounds like a bad idea, on the face of it).

:bytes is shorter than but bytes, though.

--
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b-()}} split//,[EMAIL PROTECTED]/ --;$\=$ ;- ();print$/
Re: The .bytes/.codepoints/.graphemes methods
On Tue, Jun 29, 2004 at 08:34:16AM -0700, Austin Hastings wrote:
> This has no direct bearing on p6l, since performance is a p6i issue. But perhaps in the interests of performance as well as hackery we should explicitly provide some sort of variant regex behavior:
>
>     /a./ :bytes
>     /a./ :graphemes
>
> where the first would recognize 0x61 followed by any single byte, while the second would recognize 'a' followed by any number of bytes composing a single grapheme.

Isn't that what :u0, :u1, :u2, and :u3 are for?

    :u0     # use bytes (. is byte)
    :u1     # level 1 support (. is codepoint)
    :u2     # level 2 support (. is grapheme)
    :u3     # level 3 support (. is language dependent)

These modifiers say nothing about the state of the data, but in general internal Perl data will already be in Normalization Form C, so even under :u1, the precomposed characters will usually do the right thing. Note that these modifiers are for overriding the default support level, which was probably set by pragma at the top of the file.

Or was that to imply that a literal "a" in the RE would be interpreted as a grapheme "a" when :u2 is active?

-Scott
--
Jonathan Scott Duff
Division of Nearshore Research
[EMAIL PROTECTED]
Senior Systems Analyst II
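What the first three levels count can be made concrete with a short Python sketch. The function name `unit_lengths` is hypothetical, UTF-8 is an assumed encoding, and the grapheme count is only an approximation (each run of combining marks is attached to the preceding codepoint), so treat it as an illustration rather than a definition of the levels.

```python
import unicodedata
from typing import Dict


def unit_lengths(text: str, encoding: str = "utf-8") -> Dict[str, int]:
    """Length of `text` in bytes, codepoints, and approximate graphemes."""
    graphemes = 0
    for ch in text:
        # A mark extends the previous grapheme; anything else starts a new one.
        # A leading mark with nothing before it is counted as its own grapheme.
        if graphemes == 0 or not unicodedata.category(ch).startswith("M"):
            graphemes += 1
    return {
        "bytes": len(text.encode(encoding)),
        "codepoints": len(text),
        "graphemes": graphemes,
    }


composed = unit_lengths("\u00e9")      # precomposed e-acute (NFC)
decomposed = unit_lengths("e\u0301")   # e followed by a combining acute (NFD)
```

The composed and decomposed forms differ in bytes (2 versus 3) and in codepoints (1 versus 2) but agree on one grapheme, which matches the observation that normalization form matters at the lower levels and is hidden at the grapheme level.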
Re: The .bytes/.codepoints/.graphemes methods
Larry Wall [EMAIL PROTECTED] writes:
> That all has to be looked at anyway. What does 5 mean when you pass it to substr, anyway?

I was just going to ask about substrings, and then didn't because I figured that had been hashed out already and I'd missed it...

> (I've been trying to make it assume some implicit unit based on the current lexical scope's Unicode level, but issues remain.) We have magical string positions that have different numeric values depending on what units you view them as, but at what point does a number like 5 get translated to such a magical string position?

It would be possible to have right-associative operators (that bind at least more tightly than comma and possibly very tightly) and convert a number to one of these objects, so that we can do stuff like this:

    substr($string, 2 bytes, 4 bytes) = $substitute;

Then if you pass a plain number to substr it could either assume something (possibly generating a warning) or spit an error, depending on some feature of the current lexical scope.

The word bytes is clearly much too long, though, much less graphemes or codepoints. I thought about this:

    substr($string, 2b, 4b) = $substitute;

With presumably g and c for graphemes and codepoints, but I rather suspect that might conflict with some other existing syntax (though I can't think of anything in particular). And I can't think of another abbreviation that would be remotely intuitive.

There's also the possibility of bsubstr and so on, but that leads us down the path of C, having a hillion bajillion functions with names like fgets, stoi, and fstrnclost. Having sprintf is quite enough of that, IMO.

> I dunno--it reads pretty well. Maybe these'll be heavily enough used that we should Huffmanize them down a bit:
>
>     $str.bytes
>     $str.codes
>     $str.graphs
>     $str.letters

codes and graphs is better than codepoints and graphemes, at least.

> Though "letters" is a bit inadequate to describe language-dependent graphemes, since it also divides any non-letters...I suppose we could go with .characters if we don't mind forcing a heavily overloaded word in one particular direction, culturally speaking. Except, I'd kinda like to keep them starting with different letters. (And maybe .chars should be reserved to mean whatever the default unit is in the current lexical scope, as with substr() above.)

You could coin the abbreviation ligs, for Language Independent Graphemes. Then some ingenious rascal can create a pragma or whatever that allows $str.b, $str.c, $str.g, and $str.l for fans of terseness.

--
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b-()}} split//,[EMAIL PROTECTED]/ --;$\=$ ;- ();print$/
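The "magical string position" idea, one position with different numeric values depending on the unit it is viewed in, might be sketched in Python like this. Everything here is hypothetical: the class `StrPos`, its method names, the choice to store the position as a codepoint offset, and the assumed UTF-8 encoding are illustrative decisions, not anything specified in the thread.

```python
class StrPos:
    """A position inside `text`, stored as a codepoint offset."""

    def __init__(self, text: str, codepoints: int, encoding: str = "utf-8"):
        if not 0 <= codepoints <= len(text):
            raise IndexError("codepoint offset out of range")
        self.text = text
        self.encoding = encoding
        self.codepoint_offset = codepoints

    @classmethod
    def from_bytes(cls, text: str, nbytes: int, encoding: str = "utf-8") -> "StrPos":
        """Build a position from a byte offset.

        Raises UnicodeDecodeError if the offset falls inside a character.
        """
        raw = text.encode(encoding)
        if not 0 <= nbytes <= len(raw):
            raise IndexError("byte offset out of range")
        prefix = raw[:nbytes].decode(encoding)
        return cls(text, len(prefix), encoding)

    @property
    def byte_offset(self) -> int:
        """The same position expressed in bytes of the encoded string."""
        return len(self.text[:self.codepoint_offset].encode(self.encoding))


def substr(text: str, start: StrPos, end: StrPos) -> str:
    """Extract text between two positions, whatever unit each was created from."""
    return text[start.codepoint_offset:end.codepoint_offset]
```

A plain integer never reaches `substr` in this sketch; the caller has to say which unit it means when constructing the position, which is one possible answer to the question of when a bare 5 gets translated.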
Re: The .bytes/.codepoints/.graphemes methods
On Mon, Jun 28, 2004 at 11:26:32AM -0400, Jonadab the Unsightly One wrote:
: You could coin the abbreviation ligs, for Language Independent
: Graphemes. Then some ingenious rascal can create a pragma or whatever
: that allows $str.b, $str.c, $str.g, and $str.l for fans of terseness.

Except they'd have to be ldgs. Graphemes are ligs in current parlance.

Larry
Re: The .bytes/.codepoints/.graphemes methods
Jonadab The Unsightly One [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]
> It would be possible to have right-associative operators (that bind at least more tightly than comma and possibly very tightly) and convert a number to one of these objects, so that we can do stuff like this:
>
>     substr($string, 2 bytes, 4 bytes) = $substitute;

I think that the common case will use the same units for both the index and the length. So perhaps:

    substr($string, 2, 4 :bytes)

would be more appropriate. Also, by only requiring us to write the unit once, the need for ultra-short abbreviations is reduced.

Dave.
Re: The .bytes/.codepoints/.graphemes methods
On Mon, 28 Jun 2004, Larry Wall wrote:
> On Mon, Jun 28, 2004 at 11:26:32AM -0400, Jonadab the Unsightly One wrote:
> : You could coin the abbreviation ligs, for Language Independent
> : Graphemes. Then some ingenious rascal can create a pragma or whatever
> : that allows $str.b, $str.c, $str.g, and $str.l for fans of terseness.
>
> Except they'd have to be ldgs. Graphemes are ligs in current parlance.

And 'ligs' implies ligatures. And since that'd require font, style, and possibly layout information, I think we'd rather not go there right now...

Dan

--it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: The .bytes/.codepoints/.graphemes methods
Dave Whipp skribis 2004-06-28 9:55 (-0700):
> >     substr($string, 2 bytes, 4 bytes) = $substitute;
>
>     substr($string, 2, 4 :bytes)

    substr($string, 2 but graphemes, 4 but bytes);

I think "but" even makes sense, if substr defaults to something.

Juerd
Re: The .bytes/.codepoints/.graphemes methods
On Mon, 28 Jun 2004, Juerd wrote:
> Dave Whipp skribis 2004-06-28 9:55 (-0700):
> > >     substr($string, 2 bytes, 4 bytes) = $substitute;
> >
> >     substr($string, 2, 4 :bytes)
>
>     substr($string, 2 but graphemes, 4 but bytes);
>
> I think "but" even makes sense, if substr defaults to something.

I think mixing strings, bytes, graphemes, and code points together is a phenomenally bad idea, likely to lead to many tears, much gnashing of teeth, and quite a few rampages with sharp objects, not to mention a lot of code guaranteed to fail at the edge cases.

If, as a programmer, you *really* want to run with scissors then convert your string to a binary byte buffer and go from there. At least then when you poke out an eye you won't be nearly so surprised.

Dan

--it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: The .bytes/.codepoints/.graphemes methods
--- Dan Sugalski [EMAIL PROTECTED] wrote:

On Mon, 28 Jun 2004, Juerd wrote:

Dave Whipp skribis 2004-06-28 9:55 (-0700):

    substr($string, 2 bytes, 4 bytes) = $substitute;
    substr($string, 2, 4 :bytes)

    substr($string, 2 but graphemes, 4 but bytes);

I think "but" even makes sense, if substr defaults to something.

I think mixing strings, bytes, graphemes, and code points together is a phenomenally bad idea, likely to lead to many tears, much gnashing of teeth, and quite a few rampages with sharp objects, not to mention a lot of code guaranteed to fail at the edge cases.

Hmm. Suppose that I have a system that is friendly to 80 byte records. I want to output meaningful strings, so I want to partition a buffer into 80-ish byte substrings, but preserve any graphemes (i.e., store the data in a legible format). How would I do that? The obvious answer is a gnarly little loop, but I think I'd like to have perl do that for me. Can I say something like:

    while ($buffer) {
        $output = substr($buffer, 0, 80 but bytes, units => graphemes);
        $buffer = substr($buffer, 0, length $output :graphemes);
        $cout << $output << nl; # :-)
    }

and get some dwimmery?

=Austin

If, as a programmer, you *really* want to run with scissors then convert your string to a binary byte buffer and go from there. At least then when you poke out an eye you won't be nearly so surprised.

Dan
Re: The .bytes/.codepoints/.graphemes methods
On Mon, 28 Jun 2004, Austin Hastings wrote: --- Dan Sugalski [EMAIL PROTECTED] wrote: On Mon, 28 Jun 2004, Juerd wrote: Dave Whipp skribis 2004-06-28 9:55 (-0700): substr($string, 2 bytes, 4 bytes) = $substitute; substr($string, 2, 4 :bytes) substr($string, 2 but graphemes, 4 but bytes); I think but even makes sense, if substr defaults to something. I think mixing strings, bytes, graphemes, and code points together is a phenomenally bad idea, likely to lead to many tears, much gnashing of teeth, and quite a few rampages with sharp objects, not to mention a lot of code guaranteed to fail at the edge cases. Hmm. Suppose that I have a system that is friendly to 80 byte records. I want to output meaningful strings, so I want to partition a buffer into 80-ish byte substrings, but preserve any graphemes (i.e., store the data in a legible format). How would I do that? You don't. Or if you do, you do it with a lot of pain, sweat, and annoying hard work. 80 bytes gets you somewhere between three (And this may be a *high* estimate--there may be circumstances where 80 bytes is insufficient for *one* grapheme) and 80 graphemes. This isn't something that can be made generically easy. Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
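The "gnarly little loop" in question can be sketched in Python as follows. This is an illustration under stated assumptions, not anything proposed in the thread: the names clusters and records are hypothetical, the byte count assumes UTF-8, and the grapheme boundary rule is a deliberate simplification (a base code point plus any combining marks that follow it), not the full Unicode segmentation algorithm. It also surfaces the case Dan mentions, where a single grapheme does not fit in the record at all.

```python
import unicodedata


def clusters(s: str) -> list:
    """Approximate grapheme clusters: a base code point plus trailing combining marks.

    Simplification: real grapheme segmentation (UAX #29) has many more rules.
    """
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch      # attach the mark to the preceding cluster
        else:
            out.append(ch)     # start a new cluster
    return out


def records(s: str, limit: int = 80) -> list:
    """Split s into pieces of at most `limit` UTF-8 bytes, never splitting a cluster."""
    result, current, size = [], "", 0
    for cluster in clusters(s):
        n = len(cluster.encode("utf-8"))
        if n > limit:
            # One cluster alone is too big for a record.
            raise ValueError(f"cluster needs {n} bytes, limit is {limit}")
        if size + n > limit:
            result.append(current)
            current, size = "", 0
        current += cluster
        size += n
    if current:
        result.append(current)
    return result


text = "e\u0301" * 30          # 30 clusters of 3 bytes each
for rec in records(text):
    print(len(rec.encode("utf-8")), "bytes,", len(clusters(rec)), "clusters")
```

With 3-byte clusters an 80-byte record holds 26 of them (78 bytes), so the records come out shorter than 80 bytes and of varying cluster counts, which is why the task resists a generic one-liner.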
Re: The .bytes/.codepoints/.graphemes methods
--- Jonadab the Unsightly One [EMAIL PROTECTED] wrote:

Larry Wall [EMAIL PROTECTED] writes:

(I've been trying to make it assume some implicit unit based on the current lexical scope's Unicode level, but issues remain.) We have magical string positions that have different numeric values depending on what units you view them as, but at what point does a number like 5 get translated to such a magical string position?

It would be possible to have right-associative operators (that bind at least more tightly than comma and possibly very tightly) and convert a number to one of these objects, so that we can do stuff like this:

    substr($string, 2 bytes, 4 bytes) = $substitute;

Then if you pass a plain number to substr it could either assume something (possibly generating a warning) or spit an error, depending on some feature of the current lexical scope.

A couple of alternatives:

    substr.bytes($string, 2, 4) = $substitute;
    substr($string.bytes, 2, 4) = $substitute;

    # Make it a pragma
    use String(bytes);
    substr($string, 2, 4) = $substitute;

    # Make it a global mode
    set_string_mode(bytes);
    substr($string, 2, 4) = $substitute;

    # Make it an object mode
    $string.access_mode(bytes);
    substr($string, 2, 4) = $substitute;

The word bytes is clearly much too long, though, much less graphemes or codepoints. I thought about this:

    substr($string, 2b, 4b) = $substitute;

Problems with:

    substr($string, 0b, 1b) = $substitute;

Is that binary or bytes? Also:

    substr($string, $start b, $end b) = $substitute;

Looks unintuitive.

With presumably g and c for graphemes and codepoints, but I rather suspect that might conflict with some other existing syntax (though I can't think of anything in particular).

0c? 0x16c ?

And I can't think of another abbreviation that would be remotely intuitive. There's also the possibility of bsubstr and so on, but that leads us down the path of C, having a hillion bajillion functions with names like fgets, stoi, and fstrnclost. Having sprintf is quite enough of that, IMO.
I dunno--it reads pretty well. Maybe these'll be heavily enough used that we should Huffmanize them down a bit:

    $str.bytes
    $str.codes
    $str.graphs
    $str.letters

codes and graphs is better than codepoints and graphemes, at least.

In certain (IMO large) sectors of the Perl community, string processing is just about all the work there is. I submit that there needs to be a way to drive the token length to 0: either a pragma, or a global mode, or a type definition.

Though letters is a bit inadequate to describe language-dependent graphemes, since it also divides any non-letters...I suppose we could go with .characters if we don't mind forcing a heavily overloaded word in one particular direction, culturally speaking. Except, I'd kinda like to keep them starting with different letters. (And maybe .chars should be reserved to mean whatever the default unit is in the current lexical scope, as with substr() above.)

You could coin the abbreviation ligs, for Language Independent Graphemes. Then some ingenious rascal can create a pragma or whatever that allows $str.b, $str.c, $str.g, and $str.l for fans of terseness.

As opposed to 'ligs' meaning ligatures? Fraught with peril. :-)

To me, the right thing to do is provide a 'default' way to work, and allow for changing that default to some other way. The obvious defaults are 'bytes', which gives C-like behavior (unpopular though that may presently be) and imposes little or no conceptual strain but likewise no enormous benefit, and 'graphemes'.

I like graphemes for the default because I hate and fear graphemes. The whole *code thing just crawls right in my ear, so having the language transparently support it would be a win. Having the language force me to understand this stuff, if it cannot be transparently supported, would also be a win, on a longer time scale.

=Austin
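To make the choice of default unit concrete, the following Python sketch reports the length of one string in three of the units discussed here. It is only an illustration: unit_lengths is a hypothetical name, the byte count assumes UTF-8, and the grapheme count is an approximation that counts every code point that is not a combining mark (so it is off for text that begins with a combining mark, and it ignores the finer rules of real grapheme segmentation). The language-dependent level is not attempted.

```python
import unicodedata


def unit_lengths(s: str) -> dict:
    """Length of s in bytes (UTF-8), code points, and approximate graphemes."""
    return {
        "bytes": len(s.encode("utf-8")),
        "codepoints": len(s),
        # Approximation: each non-combining code point starts one grapheme.
        "graphemes": sum(1 for ch in s if not unicodedata.combining(ch)),
    }


# "resume" with two acute accents written as separate combining marks (U+0301)
print(unit_lengths("re\u0301sume\u0301"))
# {'bytes': 10, 'codepoints': 8, 'graphemes': 6}
```

The same position therefore has three different numeric values depending on the unit, which is the property the "magical string positions" above are meant to capture.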