Re: A12: Strings

2004-04-23 Thread Larry Wall
On Thu, Apr 22, 2004 at 10:34:25AM -0400, Aaron Sherman wrote:
: But, what happens if I:
: 
:   {
:   use bytes;
:   my string $line = $filehandlelikething.getline;
:   }

That might depend on how $filehandlelikething was opened.  A filehandle
is going to return a string of the type requested when you opened
it.  As you've written it there, it would presumably try to do a
downconversion depending on the definition of string, which is
not a built-in type.

Alternately, it could be argued that automatic downconversion is a
bad default, and the default should just be to die on a type mismatch
unless a coercion is explicitly defined between the two types.

(Str is the built-in string object type, which probably forces
no conversion, presuming conversion is lazy.  The builtin str
type I'm not so sure about.  It might force an octets view.  Or maybe
we only have bstr, cstr, gstr and lstr, to force bytes, codepoints,
graphemes, and letters, and str is an alias to the correct type under
the current lexical Unicode support level.)

: Does my saying string enforce anything, or do I have to:
: 
:   {
:   use bytes;
:   my string $line is bytes = $filehandlelikething.getline;
:   }

If you want to force a conversion, it's more likely to look like

my $line = $filehandlelikething.getline as bstr;

or some such.  That would be a downconversion, and potentially lossy or
exceptional.  If what you want is to treat the internal representation
of the string as a sequence of bytes, you'd have to say something else,
probably a method on Str to get it to divulge its innards.  In which
case you're almost certainly on your own as to the interpretation of
those bytes.  Assume that Perl 6 will change its internal implementation
of strings regularly just to keep you on your toes.  :-)
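
[Editorial aside: the "potentially lossy or exceptional" downconversion
described above can be sketched in Python, since the Perl 6 syntax in
this thread was not settled. The helper name downconvert and its lossy
flag are assumptions made for illustration; Latin-1 stands in for the
bstr view. This is not how Perl 6 defines the conversion.]

```python
def downconvert(text: str, lossy: bool = False) -> bytes:
    """Hypothetical helper: force text into one byte per character (Latin-1)."""
    # "strict" raises on characters above U+00FF; "replace" substitutes "?".
    errors = "replace" if lossy else "strict"
    return text.encode("latin-1", errors=errors)


# Every codepoint here is <= U+00FF, so the conversion is exact.
print(downconvert("caf\u00e9"))               # b'caf\xe9'

# U+20AC (euro sign) does not fit in 8 bits: the strict form is "exceptional".
try:
    downconvert("10\u20ac")
except UnicodeEncodeError as exc:
    print("exceptional:", exc.reason)

# The lossy form succeeds but silently discards information.
print(downconvert("10\u20ac", lossy=True))    # b'10?'
```

Whether a type mismatch should die by default, as suggested earlier in
this message, corresponds to making the strict branch the only one
available unless the caller opts in.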

Larry


Re: A12: Strings

2004-04-22 Thread Aaron Sherman
On Wed, 2004-04-21 at 01:51, Larry Wall wrote:

> Note these just warp the defaults.  Underneath is still a strongly
> typed string system.  So you can say use bytes and know that the
> strings that *you* create are byte strings.  However, if you get in a
> string from another module, you can't necessarily process it as bytes.

But, what happens if I:

{
use bytes;
my string $line = $filehandlelikething.getline;
}

Does my saying string enforce anything, or do I have to:

{
use bytes;
my string $line is bytes = $filehandlelikething.getline;
}

?

-- 
Aaron Sherman [EMAIL PROTECTED]
Senior Systems Engineer and Toolsmith
It's the sound of a satellite saying, 'get me down!' -Shriekback




Re: A12: Strings

2004-04-21 Thread Tim Bunce
On Tue, Apr 20, 2004 at 10:51:04PM -0700, Larry Wall wrote:
 
> Yes, that's in the works.  The plan is to have four Unicode support levels.
>
> These would be declared by lexically scoped declarations:
>
> use bytes 'ISO-8859-1';
> use codepoints;
> use graphemes;
> use letters 'Turkish';
>
> Note these just warp the defaults.  Underneath is still a strongly
> typed string system.  So you can say use bytes and know that the
> strings that *you* create are byte strings.  However, if you get in a
> string from another module, you can't necessarily process it as bytes.
> If you haven't specified how such a string is to be processed in
> your worldview, you're probably going to get an exception.  You might
> anyway, if what you specified is an impossible downconversion.
>
> So yes, you can have use bytes, but it puts more responsibility on
> you rather than less.  You might rather just specify the type of your
> particular string or array, and stay with codepoints or graphemes in
> the general case.  To the extent that we can preserve the abstraction
> that a string is just a sequence of integers, the values of which
> have some known relationship to Unicode, it should all just work.
>
> : Is that right, or would there be a key_type property on hashes? More to
> : the point, is it worth it, or will I be further slowing down hash access
> : because it's special-cased in the default situation?
>
> Hashes should handle various types of built-in key strings properly
> by default.

What is "properly" for string? Is it to hash the sequence of integers
as if they're 32 bits wide even if they're less?  Is that sufficient?

Tim.


Re: A12: Strings

2004-04-21 Thread Larry Wall
On Wed, Apr 21, 2004 at 11:04:02AM +0100, Tim Bunce wrote:
:  Hashes should handle various types of built-in key strings properly
:  by default.
: 
: What is "properly" for string?

The way it oughta, whatever that is...  I was aiming to set policy
rather than implementation there.  :-)

: Is it to hash the sequence of integers
: as if they're 32 bits wide even if they're less?  Is that sufficient?

That would be one way.  The point being that the hash mustn't tell
you that two strings are different when they would compare the same,
even if they are in different internal representations to begin with.
It's okay if the hash occasionally says two strings are the same when
in fact they'd compare differently.  The actual weakness is likely to
be in the definition of comparison rather than the definition of the
hash function, especially if we let people specify the standards of
comparison for the hash keys.  That says that the hash function has
to either be weaker than the weakest specifiable comparison, or we have
to be able to weaken the hash such that it doesn't lie about what
might match.  That sounds like research...

Well, it's probably not that bad.  Much like with other sorting
problems, all you have to do is keep track of a canonicalized key
in addition to the real key.  The hash is always calculated off of
the canonicalized key rather than the actual key.  (Whether you choose
to store or recreate the canonical key is one of those space/time
tradeoffs that "use less" was originally intended to solve...)

If Unicode makes your brain hurt, just think of it in terms of
case sensitivity.  We could have a hash that was case insensitive
by always calculating the hash on a lower-cased key, and by doing
comparisons between lower-cased keys (notionally, at least).
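
[Editorial aside: a minimal Python sketch of the canonicalized-key idea
described above. The class name CanonicalKeyDict and the helper nfc are
assumptions made for illustration; this is not a proposed Perl 6 or
Parrot design. Both the hash and the comparison go through the
canonical key, while the real key is kept alongside the value.]

```python
import unicodedata


class CanonicalKeyDict:
    """Hypothetical mapping that hashes and compares on a canonicalized key."""

    def __init__(self, canonicalize):
        self._canon = canonicalize
        self._items = {}          # canonical key -> (real key, value)

    def __setitem__(self, key, value):
        # The underlying dict only ever sees the canonical form.
        self._items[self._canon(key)] = (key, value)

    def __getitem__(self, key):
        return self._items[self._canon(key)][1]

    def __contains__(self, key):
        return self._canon(key) in self._items

    def keys(self):
        # Report the keys as the caller originally spelled them.
        return [real for real, _ in self._items.values()]


def nfc(text):
    """Canonicalize to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# Case-insensitive hash: canonical key is the lower-cased key.
by_case = CanonicalKeyDict(str.lower)
by_case["Hello"] = 1
print(by_case["HELLO"], by_case.keys())

# Unicode: precomposed and decomposed spellings land in the same slot.
by_form = CanonicalKeyDict(nfc)
by_form["caf\u00e9"] = 2
print(by_form["cafe\u0301"])
```

This sketch recreates the canonical key on every access, which is the
"time" side of the space/time tradeoff mentioned above. Storing it with
each entry would be the "space" side.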

So in Unicode terms, there are probably some speed benefits if you
know your keys are already canonicalized to the form required by
the comparison and hash functions.  That implies that the state
of canonicalization must be strongly typed (presumably dynamically
in Perl).  Canonicalization is one of those things you really don't
want to do redundantly.
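
[Editorial aside: one way to read "the state of canonicalization must
be strongly typed" is to carry the fact in the value's type, so that
already-canonical strings skip the work. The Python sketch below
assumes NFC as the canonical form. The names NFCStr and to_nfc are
hypothetical.]

```python
import unicodedata


class NFCStr(str):
    """Hypothetical marker type: a str known to already be in NFC form."""


def to_nfc(text: str) -> NFCStr:
    """Canonicalize once; return marked values unchanged."""
    if isinstance(text, NFCStr):
        return text                      # already canonical, no redundant work
    return NFCStr(unicodedata.normalize("NFC", text))


decomposed = "e\u0301"                   # "e" followed by a combining acute
canonical = to_nfc(decomposed)
print(canonical == "\u00e9")             # True: now the precomposed form
print(to_nfc(canonical) is canonical)    # True: second call does nothing
```

Note that concatenating two NFCStr values yields a plain str in Python,
which happens to be the safe behavior here, because joining two
canonical strings does not always give a canonical result.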

Larry


A12: Strings

2004-04-20 Thread Aaron Sherman
Well, I have a lot to digest, but off the top of my head (and having
nothing to do with objects, but rather the string discussion at the
end), it would be very useful if I could assert:

no string complex;

or something like that. That is to say, I would love to have a way to
say that my strings are just plain old C-style arrays of 8-bit
characters.

I know that at a low level Parrot is still going to have its way with
these, but at the very least, I want to be able to put the tag in there
(lexically or otherwise) to make me feel better about myself as a human
being when I do:

my $n = '';
for @stuff -> $_ {$n ~= (defined($_)??1::0)}
my $stuff_as_bitvec = pack("b*",$n);
%state_is_known{$stuff_as_bitvec} = 1;

It's going to be hard for me to accept that that operation is going to
have to worry about codepoints... really hard. Especially so if I'm
doing this in a tight loop as I was recently.
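
[Editorial aside: for readers unfamiliar with the idiom, here is a
Python sketch of the same operation: one bit per element, packed into
bytes and used as a hash key. The function name pack_bits and the
sample data are assumptions for illustration. The bit order (ascending
within each byte) follows my understanding of Perl 5's "b*" template.]

```python
def pack_bits(flags):
    """Pack booleans into bytes, lowest bit first within each byte."""
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


stuff = [3, None, "x", None, None, None, None, None, 0]

# One bit per element: set when the element is defined (not None).
stuff_as_bitvec = pack_bits([item is not None for item in stuff])

state_is_known = {}
state_is_known[stuff_as_bitvec] = 1      # bytes are hashable, so this just works

print(stuff_as_bitvec)                   # b'\x05\x01'
```

In Python the key is a bytes object, so no codepoint semantics are
involved in hashing it. Whether Perl 6 offers an equivalent guarantee
is exactly the question being asked here.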

I suppose if there were a type:

my Octets $stuff_as_bitvec = '';
...

Then that would be a start, but even then what of the hashing operation?
Will there be some property of a hash I have to set too?

class Octets_Num_Pair is Pair {
my Octets $.key;
my Num $.val;
... redefine key management in terms of Octets ...
}
my Octets_Num_Pair %state_is_known;

Is that right, or would there be a key_type property on hashes? More to
the point, is it worth it, or will I be further slowing down hash access
because it's special-cased in the default situation?

-- 
Aaron Sherman [EMAIL PROTECTED]
Senior Systems Engineer and Toolsmith
It's the sound of a satellite saying, 'get me down!' -Shriekback




Re: A12: Strings

2004-04-20 Thread Larry Wall
On Tue, Apr 20, 2004 at 02:16:01PM -0400, Aaron Sherman wrote:
: Well, I have a lot to digest, but off the top of my head (and having
: nothing to do with objects, but rather the string discussion at the
: end), it would be very useful if I could assert:
: 
:   no string complex;
: 
: or something like that. That is to say, I would love to have a way to
: say that my strings are just plain old C-style arrays of 8-bit
: characters.

Yes, that's in the works.  The plan is to have four Unicode support levels.

Level 0 character = byte
Level 1 character = codepoint
Level 2 character = grapheme
Level 3 character = letter

These would be declared by lexically scoped declarations:

use bytes 'ISO-8859-1';
use codepoints;
use graphemes;
use letters 'Turkish';

It's possible to get into level 0 with a bare "use bytes", but then
you just get C locale semantics.  Often you might specify which
8-bit semantics are the default.  It's not possible to get into
level 3 without declaring a specific language.  You can't just say
"use letters".  Possibly there's support for "use letters :locale",
but don't tell Jarkko.  :-)
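
[Editorial aside: the first three levels give different answers to "how
long is this string?". The Python sketch below illustrates that. The
function names are assumptions, and the grapheme count is only a crude
approximation (it skips combining marks rather than implementing real
grapheme segmentation). Level 3 is language-dependent and is not
attempted.]

```python
import unicodedata


def count_bytes(text, encoding="utf-8"):
    """Level 0 view: length depends on the chosen encoding."""
    return len(text.encode(encoding))


def count_codepoints(text):
    """Level 1 view: Python's str is already a sequence of codepoints."""
    return len(text)


def count_graphemes_approx(text):
    """Level 2 view, approximated: count codepoints that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


accented = "e\u0301"                      # "e" plus a combining acute accent
print(count_bytes(accented))              # 3
print(count_codepoints(accented))         # 2
print(count_graphemes_approx(accented))   # 1
```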

Note these just warp the defaults.  Underneath is still a strongly
typed string system.  So you can say use bytes and know that the
strings that *you* create are byte strings.  However, if you get in a
string from another module, you can't necessarily process it as bytes.
If you haven't specified how such a string is to be processed in
your worldview, you're probably going to get an exception.  You might
anyway, if what you specified is an impossible downconversion.

So yes, you can have use bytes, but it puts more responsibility on
you rather than less.  You might rather just specify the type of your
particular string or array, and stay with codepoints or graphemes in
the general case.  To the extent that we can preserve the abstraction
that a string is just a sequence of integers, the values of which
have some known relationship to Unicode, it should all just work.
In particular, latin-1 is by definition the 8-bit subset of Unicode,
so if you stick to those codepoints you're safe.  Functions and
interfaces that require 8-bit bytes will be able to convert such a
string regardless of its internal representation.
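
[Editorial aside: the claim that Latin-1 is the 8-bit subset of Unicode
can be checked directly. The Python sketch below shows that each byte
value 0..255 decodes to the codepoint with the same number, and that a
string restricted to those codepoints converts back to bytes without
loss. The helper name fits_in_bytes is an assumption.]

```python
# Every possible byte value, decoded as Latin-1.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")

# Byte value N becomes codepoint U+00NN, one for one.
codepoints = [ord(ch) for ch in as_text]
print(codepoints == list(range(256)))            # True

# The round trip back to bytes is exact.
print(as_text.encode("latin-1") == all_bytes)    # True


def fits_in_bytes(text):
    """Hypothetical check: is every codepoint within the 8-bit range?"""
    return all(ord(ch) <= 0xFF for ch in text)


print(fits_in_bytes("caf\u00e9"))                # True
print(fits_in_bytes("\u0100"))                   # False
```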

: I know that at a low level Parrot is still going to have its way with
: these, but at the very least, I want to be able to put the tag in there
: (lexically or otherwise) to make me feel better about myself as a human
: being when I do:
: 
:   my $n = '';
:   for @stuff -> $_ {$n ~= (defined($_)??1::0)}
:   my $stuff_as_bitvec = pack("b*",$n);
:   %state_is_known{$stuff_as_bitvec} = 1;
: 
: It's going to be hard for me to accept that that operation is going to
: have to worry about codepoints... really hard. Especially so if I'm
: doing this in a tight loop as I was recently.

If you never put anything into a string bigger than U+00ff, you're
guaranteed to get semantics indistinguishable from a byte string,
regardless of how the characters might actually be stored.  We aimed
for this ideal in Perl 5 but were never quite able to achieve it in
all the nooks and crannies of the language.  There was just too much
legacy to deal with.  Jarkko took it as far as humanly possible, and
in some cases farther.  But hopefully we can make a clean break from
the looney locale legacy with Perl 6.

: I suppose if there were a type:
: 
:   my Octets $stuff_as_bitvec = '';
:   ...
: 
: Then that would be a start, but even then what of the hashing operation?
: Will there be some property of a hash I have to set too?
: 
:   class Octets_Num_Pair is Pair {
:   my Octets $.key;
:   my Num $.val;
:   ... redefine key management in terms of Octets ...
:   }
:   my Octets_Num_Pair %state_is_known;

Hashes aren't declared to return pairs, but rather values.  If you need
to change the key type it's a trait on the storage class.  But...

: Is that right, or would there be a key_type property on hashes? More to
: the point, is it worth it, or will I be further slowing down hash access
: because it's special-cased in the default situation?

Hashes should handle various types of built-in key strings properly
by default.  It's only if you want to start hashing on objects that
you have to make sure your class does Hashkey or some such.
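
[Editorial aside: Python has a comparable rule, shown here only as an
analogy to "make sure your class does Hashkey". Built-in strings and
bytes hash correctly by default, while a user-defined class must supply
consistent equality and hash methods before value-equal instances find
the same slot. The class Version is a made-up example.]

```python
class Version:
    """Made-up example of an object used as a hash key."""

    def __init__(self, major, minor):
        self.major = major
        self.minor = minor

    def __eq__(self, other):
        if not isinstance(other, Version):
            return NotImplemented
        return (self.major, self.minor) == (other.major, other.minor)

    def __hash__(self):
        # Must agree with __eq__: equal objects need equal hashes.
        return hash((self.major, self.minor))


released = {Version(1, 2): "stable"}

# A distinct but equal object finds the same entry.
print(released[Version(1, 2)])           # stable
```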

Larry