Re: A12: Strings
On Thu, Apr 22, 2004 at 10:34:25AM -0400, Aaron Sherman wrote:
: But, what happens if I:
:
:     {
:         use bytes;
:         my string $line = $filehandlelikething.getline;
:     }

That might depend on how $filehandlelikething was opened. A filehandle is going to return a string of the type requested when you opened it. As you've written it there, it would presumably try to do a downconversion depending on the definition of "string", which is not a built-in type. Alternately, it could be argued that automatic downconversion is a bad default, and the default should just be to die on a type mismatch unless a coercion is explicitly defined between the two types.

(C<Str> is the built-in string object type, which probably forces no conversion, presuming conversion is lazy. The builtin C<str> type I'm not so sure about. It might force an octets view. Or maybe we only have bstr, cstr, gstr and lstr, to force bytes, codepoints, graphemes, and letters, and str is an alias to the correct type under the current lexical Unicode support level.)

: Does my saying "string" enforce anything, or do I have to:
:
:     {
:         use bytes;
:         my string $line is bytes = $filehandlelikething.getline;
:     }

If you want to force a conversion, it's more likely to look like

    my $line = $filehandlelikething.getline as bstr;

or some such. That would be a downconversion, and potentially lossy or "exceptional". If what you want is to treat the internal representation of the string as a sequence of bytes, you'd have to say something else, probably a method on Str to get it to divulge its innards. In which case you're almost certainly on your own as to the interpretation of those bytes. Assume that Perl 6 will change its internal implementation of strings regularly just to keep you on your toes. :-)

Larry
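The two outcomes named above for a downconversion, "lossy" or "exceptional", can be illustrated with a short sketch. It is written in Python, not in the proposed Perl 6 syntax; the helper name downconvert and its lossy flag are hypothetical, and Latin-1 is only assumed as the target byte encoding.

```python
def downconvert(text: str, encoding: str = "latin-1", lossy: bool = False) -> bytes:
    """Convert a code point string to bytes (hypothetical helper).

    Strict mode raises on characters the target encoding cannot hold
    (the "exceptional" outcome); lossy mode substitutes '?' for them.
    """
    errors = "replace" if lossy else "strict"
    return text.encode(encoding, errors=errors)


# Every code point here is <= U+00FF, so the conversion is exact.
ok = downconvert("na\u00efve")

# U+20AC EURO SIGN has no Latin-1 byte, so strict conversion fails.
try:
    downconvert("\u20ac5")
    failed = False
except UnicodeEncodeError:
    failed = True

# The lossy variant succeeds but discards information.
lossy = downconvert("\u20ac5", lossy=True)
```

Making the strict behaviour the default mirrors the suggestion that dying on a mismatch may be a better default than silent conversion.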
Re: A12: Strings
On Wed, 2004-04-21 at 01:51, Larry Wall wrote:
> Note these just warp the defaults. Underneath is still a strongly
> typed string system. So you can say "use bytes" and know that the
> strings that *you* create are byte strings. However, if you get in a
> string from another module, you can't necessarily process it as bytes.

But, what happens if I:

    {
        use bytes;
        my string $line = $filehandlelikething.getline;
    }

Does my saying "string" enforce anything, or do I have to:

    {
        use bytes;
        my string $line is bytes = $filehandlelikething.getline;
    }

?

-- 
Aaron Sherman <[EMAIL PROTECTED]>
Senior Systems Engineer and Toolsmith
"It's the sound of a satellite saying, 'get me down!'" -Shriekback
Re: A12: Strings
On Wed, Apr 21, 2004 at 11:04:02AM +0100, Tim Bunce wrote:
: > Hashes should handle various types of built-in key strings properly
: > by default.
:
: What is "properly" for string?

The way it oughta, whatever that is... I was aiming to set policy rather than implementation there. :-)

: Is it to hash the "sequence of integers"
: as if they're 32 bits wide even if they're less? Is that sufficient?

That would be one way. The point being that the hash mustn't tell you that two strings are different when they would compare the same, even if they are in different internal representations to begin with. It's okay if the hash occasionally says two strings are the same when in fact they'd compare differently. The actual weakness is likely to be in the definition of comparison rather than the definition of the hash function, especially if we let people specify the standards of comparison for the hash keys. That says that the hash function has to either be weaker than the weakest specifiable comparison, or we have to be able to "weaken" the hash such that it doesn't lie about what might match. That sounds like research...

Well, it's probably not that bad. Much like with other sorting problems, all you have to do is keep track of a canonicalized key in addition to the "real" key. The hash is always calculated off of the canonicalized key rather than the actual key. (Whether you choose to store or recreate the canonical key is one of those space/time tradeoffs that "use less" was originally intended to solve...)

If Unicode makes your brain hurt, just think of it in terms of case sensitivity. We could have a hash that was case insensitive by always calculating the hash on a lower-cased key, and by doing comparisons between lower-cased keys (notionally, at least).

So in Unicode terms, there are probably some speed benefits if you know your keys are already canonicalized to the form required by the comparison and hash functions. That implies that the state of canonicalization must be strongly typed (presumably dynamically in Perl). Canonicalization is one of those things you really don't want to do redundantly.

Larry
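The canonicalized-key idea described above can be sketched as follows. This is an illustration in Python, not a statement of how Perl 6 hashes work; the names canonical and CanonicalKeyDict are hypothetical, and "NFC-normalize, then case-fold" is an assumed stand-in for whatever comparison standard the hash is configured with (it is a simplification of full Unicode caseless matching).

```python
import unicodedata


def canonical(key: str) -> str:
    """Assumed canonical form: NFC normalization followed by case folding."""
    return unicodedata.normalize("NFC", key).casefold()


class CanonicalKeyDict:
    """Hash table that hashes and compares on a canonical key only.

    The "real" key is stored next to the value, so it can still be
    reported back in the form the caller originally supplied.
    """

    def __init__(self):
        self._items = {}  # canonical key -> (real key, value)

    def __setitem__(self, key, value):
        self._items[canonical(key)] = (key, value)

    def __getitem__(self, key):
        return self._items[canonical(key)][1]

    def __contains__(self, key):
        return canonical(key) in self._items

    def keys(self):
        return [real for real, _ in self._items.values()]


table = CanonicalKeyDict()
table["Caf\u00e9"] = 1          # precomposed e-acute, mixed case
found = table["cafe\u0301"]     # decomposed e + combining acute, lower case
```

This version recomputes the canonical key on every access. Caching it with the string, which is the storing side of the space/time tradeoff mentioned above, would need the string type to record whether it is already canonicalized.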
Re: A12: Strings
On Tue, Apr 20, 2004 at 10:51:04PM -0700, Larry Wall wrote:
>
> Yes, that's in the works. The plan is to have four Unicode support levels.
> These would be declared by lexically scoped declarations:
>
>     use bytes 'ISO-8859-1';
>     use codepoints;
>     use graphemes;
>     use letters 'Turkish';
> Note these just warp the defaults. Underneath is still a strongly
> typed string system. So you can say "use bytes" and know that the
> strings that *you* create are byte strings. However, if you get in a
> string from another module, you can't necessarily process it as bytes.
> If you haven't specified how such a string is to be processed in
> your worldview, you're probably going to get an exception. You might
> anyway, if what you specified is an impossible downconversion.
>
> So yes, you can have "use bytes", but it puts more responsibility on
> you rather than less. You might rather just specify the type of your
> particular string or array, and stay with codepoints or graphemes in
> the general case. To the extent that we can preserve the abstraction
> that a string is just a sequence of integers, the values of which
> have some known relationship to Unicode, it should all just work.

> : Is that right, or would there be a key_type property on hashes? More to
> : the point, is it worth it, or will I be further slowing down hash access
> : because it's special-cased in the default situation?
>
> Hashes should handle various types of built-in key strings properly
> by default.

What is "properly" for string? Is it to hash the "sequence of integers" as if they're 32 bits wide even if they're less? Is that sufficient?

Tim.
Re: A12: Strings
On Tue, Apr 20, 2004 at 02:16:01PM -0400, Aaron Sherman wrote:
: Well, I have a lot to digest, but off the top of my head (and having
: nothing to do with objects, but rather the string discussion at the
: end), it would be very useful if I could assert:
:
:     no string "complex";
:
: or something like that. That is to say, I would love to have a way to
: say that my strings are just plain old C-style arrays of 8-bit
: characters.

Yes, that's in the works. The plan is to have four Unicode support levels.

    Level 0     character = byte
    Level 1     character = codepoint
    Level 2     character = grapheme
    Level 3     character = letter

These would be declared by lexically scoped declarations:

    use bytes 'ISO-8859-1';
    use codepoints;
    use graphemes;
    use letters 'Turkish';

It's possible to get into level 0 with a bare "use bytes" but then you just get "C" locale semantics. Often you might specify which 8-bit semantics are the default.

It's not possible to get into level 3 without declaring a specific language. You can't just say "use letters". Possibly there's support for "use letters :locale", but don't tell Jarkko. :-)

Note these just warp the defaults. Underneath is still a strongly typed string system. So you can say "use bytes" and know that the strings that *you* create are byte strings. However, if you get in a string from another module, you can't necessarily process it as bytes. If you haven't specified how such a string is to be processed in your worldview, you're probably going to get an exception. You might anyway, if what you specified is an impossible downconversion.

So yes, you can have "use bytes", but it puts more responsibility on you rather than less. You might rather just specify the type of your particular string or array, and stay with codepoints or graphemes in the general case. To the extent that we can preserve the abstraction that a string is just a sequence of integers, the values of which have some known relationship to Unicode, it should all just work. In particular, latin-1 is by definition the 8-bit subset of Unicode, so if you stick to those codepoints you're safe. Functions and interfaces that require 8-bit bytes will be able to convert such a string regardless of its internal representation.

: I know that at a low level Parrot is still going to have its way with
: these, but at the very least, I want to be able to put the tag in there
: (lexically or otherwise) to make me feel better about myself as a human
: being when I do:
:
:     my $n = '';
:     for @stuff -> $_ {$n ~= (defined($_)??1::0)}
:     my $stuff_as_bitvec = pack("b*",$n);
:     %state_is_known{$stuff_as_bitvec} = 1;
:
: It's going to be hard for me to accept that that operation is going to
: have to worry about codepoints... really hard. Especially so if I'm
: doing this in a tight loop as I was recently.

If you never put anything into a string bigger than U+00ff, you're guaranteed to get semantics indistinguishable from a byte string, regardless of how the characters might actually be stored. We aimed for this ideal in Perl 5 but were never quite able to achieve it in all the nooks and crannies of the language. There was just too much legacy to deal with. Jarkko took it as far as humanly possible, and in some cases farther. But hopefully we can make a clean break from the looney locale legacy with Perl 6.

: I suppose if there were a type:
:
:     my Octets $stuff_as_bitvec = '';
:     ...
:
: Then that would be a start, but even then what of the hashing operation?
: Will there be some property of a hash I have to set too?
:
:     class Octets_Num_Pair is Pair {
:         my Octets $.key;
:         my Num $.val;
:         ... redefine key management in terms of Octets ...
:     }
:     my Octets_Num_Pair %state_is_known;

Hashes aren't declared to return pairs, but rather values. If you need to change the key type it's a trait on the storage class. But...

: Is that right, or would there be a key_type property on hashes? More to
: the point, is it worth it, or will I be further slowing down hash access
: because it's special-cased in the default situation?

Hashes should handle various types of built-in key strings properly by default. It's only if you want to start hashing on objects that you have to make sure your class "does" Hashkey or some such.

Larry
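The difference between the first three levels, and the claim that strings confined to U+0000..U+00FF behave like byte strings, can be seen in a small Python sketch. Two things are assumptions of the example and not part of the proposal: UTF-8 is used as the byte encoding for the level 0 count, and the grapheme count is a rough approximation that skips combining marks instead of doing full grapheme cluster segmentation. Level 3 (letters) is language dependent and is not attempted.

```python
import unicodedata

# "resume" with two combining acute accents (U+0301), i.e. decomposed form.
text = "re\u0301sume\u0301"

# Level 0: bytes (assuming a UTF-8 encoding of the string).
byte_count = len(text.encode("utf-8"))

# Level 1: code points.
codepoint_count = len(text)

# Level 2: graphemes, approximated by not counting combining marks.
grapheme_count = sum(1 for ch in text if unicodedata.combining(ch) == 0)

# A string that stays within U+0000..U+00FF maps one-to-one onto bytes:
# each Latin-1 byte value equals the code point it encodes.
latin = "d\u00e9j\u00e0 vu"
latin_bytes = latin.encode("latin-1")
same_values = list(latin_bytes) == [ord(ch) for ch in latin]
```

Here the same six visible characters count as 10 bytes, 8 code points, or 6 graphemes, which is why the level in force changes what operations such as length and indexing mean.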
A12: Strings
Well, I have a lot to digest, but off the top of my head (and having nothing to do with objects, but rather the string discussion at the end), it would be very useful if I could assert:

    no string "complex";

or something like that. That is to say, I would love to have a way to say that my strings are just plain old C-style arrays of 8-bit characters.

I know that at a low level Parrot is still going to have its way with these, but at the very least, I want to be able to put the tag in there (lexically or otherwise) to make me feel better about myself as a human being when I do:

    my $n = '';
    for @stuff -> $_ {$n ~= (defined($_)??1::0)}
    my $stuff_as_bitvec = pack("b*",$n);
    %state_is_known{$stuff_as_bitvec} = 1;

It's going to be hard for me to accept that that operation is going to have to worry about codepoints... really hard. Especially so if I'm doing this in a tight loop as I was recently.

I suppose if there were a type:

    my Octets $stuff_as_bitvec = '';
    ...

Then that would be a start, but even then what of the hashing operation? Will there be some property of a hash I have to set too?

    class Octets_Num_Pair is Pair {
        my Octets $.key;
        my Num $.val;
        ... redefine key management in terms of Octets ...
    }
    my Octets_Num_Pair %state_is_known;

Is that right, or would there be a key_type property on hashes? More to the point, is it worth it, or will I be further slowing down hash access because it's special-cased in the default situation?

-- 
Aaron Sherman <[EMAIL PROTECTED]>
Senior Systems Engineer and Toolsmith
"It's the sound of a satellite saying, 'get me down!'" -Shriekback