Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-23 Thread John Cowan
Ray Dillinger scripsit:

> 40 of which don't count because they're not part of the repertoire of 
> normalized characters,

That is, normalizations which remove compatibility characters (NFKC
and NFKD).  There exist good reasons to keep compatibility characters,
though, in which case this characterization is inaccurate.
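
A minimal sketch of the difference between the canonical and the
compatibility normalization forms, written in Python rather than Scheme
only because its standard unicodedata module exposes all four forms
directly.  The choice of the fi ligature (U+FB01) as the test character
is an assumption made for illustration.

```python
import unicodedata

LIGATURE_FI = "\ufb01"  # LATIN SMALL LIGATURE FI, chosen here for illustration

# Canonical forms (NFC, NFD) keep the ligature;
# compatibility forms (NFKC, NFKD) replace it with "f" followed by "i".
forms = {form: unicodedata.normalize(form, LIGATURE_FI)
         for form in ("NFC", "NFD", "NFKC", "NFKD")}

for form, result in forms.items():
    print(form, [f"U+{ord(c):04X}" for c in result])

# The decomposition mapping carries a <compat> tag, i.e. it is not canonical.
print(unicodedata.decomposition(LIGATURE_FI))
```

Under this sketch a string that is only NFC- or NFD-normalized can
still contain the ligature.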

> and 88 of which are single characters that 
> change under casing operations to single characters, confusing only
> those who have already confused character lengths with codepoint
> lengths. 

"Character" is a vague term; it has five definitions in the Unicode
glossary.  You are identifying characters with DCGs, which are sensible
for some languages and purposes but misfire for others.  Tamil users
think of their abugida as a syllabary, and DCGs work well for them; Hindi
users think of their closely related abugida as either an alphabet or a
set of consonant clusters with vowel marks, depending on the ligaturing
behavior they are most familiar with.  Likewise, in Swedish ä and
ö are as distinct from a and o as i from j or G from C; in German,
the umlauted letters are mere variants of their normal counterparts.
Furthermore, Spanish é is just an e that bears word stress, whereas in
French é, è, and e are three separate entities.

The one true answer is that there is no one true answer.  Codepoints are
the irreducible minimum level: when you go down to code units or octets,
you lose too much semantic import and are in the realm of encodings of
Unicode rather than Unicode itself.  Above that there are many ways to
segment strings, some language-specific, some not.  I don't see much
point in privileging one over another.
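
The three levels named above (codepoints, code units, octets) can be
counted side by side.  This is a rough Python sketch; the helper name
sizes and the two sample strings are assumptions for illustration, not
anything proposed in this thread.

```python
def sizes(s: str) -> dict:
    """Count one string at three levels: codepoints, UTF-16 code units, UTF-8 octets."""
    return {
        "codepoints": len(s),                               # Python 3 str is a codepoint sequence
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        "utf8_octets": len(s.encode("utf-8")),
    }

e_acute_decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
g_clef = "\U0001D11E"            # MUSICAL SYMBOL G CLEF, outside the BMP

print(sizes(e_acute_decomposed))
print(sizes(g_clef))
```

The counts below the codepoint level depend on the encoding chosen,
which is the sense in which they belong to encodings of Unicode rather
than to Unicode itself.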

What's more, using DCGs means that strings are a denumerably infinite
domain of finite sequences over another denumerably infinite domain, DCGs.
Some might think that one denumerably infinite domain was sufficient.

--
My corporate data's a mess! John Cowan
It's all semi-structured, no less.  http://www.ccil.org/~cowan
But I'll be [email protected]
Using XSLT
On an XML DBMS.

___
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss


Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-23 Thread Alaric Snell-Pym

On 23 Sep 2009, at 7:07 am, Thomas Lord wrote:
> It's not completely absurd to imagine defining strings and
> string lengths inductively (take length 0 and length 1 strings
> as axiomatic and define appending) but it is a bit
> like walking the long way around the block instead of going
> two doors down.   If strings look and quack like finite
> sequences of something, it's nice to be able to reflect on that
> domain of "something".   A first-class character type is a
> natural move.


Thing is, the term "character" gives people certain expectations, that
generally fail to take into account diacritics and all that. Making
them available but calling them codepoints might be a good idea,
however...

ABS

--
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/archives/author/alaric/






Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-23 Thread Alaric Snell-Pym

On 23 Sep 2009, at 5:34 am, Brian Harvey wrote:

>> '(escape (meta alt ctrl shift #\cokebottle)) ;-)
>
> IIRC alt, like shift, actually created different glyphs; they
> weren't bucky
> bits.  So no user code would ever see anything like this; it'd be
> (meta ctrl #\someothercharacter)
>
> Maybe you meant to say
> (hyper super meta ctrl #\cokebottle) ?

Heh, no, it's just a reference to the old joke that EMACS stands for
escape-meta-alt-control-shift-... ;-)

ABS

--
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/archives/author/alaric/






Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-23 Thread Ray Dillinger
On Wed, 2009-09-23 at 01:59 -0400, John Cowan wrote:
> Thomas Lord scripsit:
> 
> > That is not a problem with Unicode.  That is a problem with 
> > the assumption that there is a bijection between upcase
> > and downcase characters - an assumption violated by one
> > character in one language.  
> 
> A lot more than one.  In addition to ess-zet, there are:
> 
> 13 Latin and Armenian ligatures that uppercase to two characters

... which can never appear in normalized strings because they have 
canonical decompositions 

> 61 Latin and Greek lowercase letters with diacritics that
> uppercase to the uppercase base character followed by the
> combining diacritic(s)

... which are single characters represented with one codepoint that 
uppercase into single characters represented with two codepoints - 
a non-problem if you keep in mind that characters and codepoints are
different ideas 

> I with dot, which lowercases (in non-Turkic contexts) to
> i followed by combining dot in order to maintain canonical
> equivalence rules (only one dot is displayed)

... Which is also a single character represented with one codepoint 
converting to a single character represented with two codepoints ...

> 27 Greek titlecase combinations of an uppercase vowel with
> diacritic(s) followed by a lowercase iota which uppercase to
> the same vowel followed by an uppercase iota.

... Which have canonical decompositions and can never appear in a 
normalized string, and whose normalized forms also have the same 
number of *characters* after a case operation even though the 
number of *codepoints* is different...

> That makes 103 characters altogether that don't work in char-upcase
> or char-downcase.

40 of which don't count because they're not part of the repertoire of 
normalized characters, and 88 of which are single characters that 
change under casing operations to single characters, confusing only
those who have already confused character lengths with codepoint
lengths. 

And exactly one of which *does* count because it's actually a 
different number of characters after the casing operation.

Bear





Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-23 Thread Ray Dillinger
On Tue, 2009-09-22 at 23:57 -0400, Aubrey Jaffer wrote:
> | From: Thomas Lord 
>  | Date: Tue, 22 Sep 2009 20:38:09 -0700
>  | 
>  | On Tue, 2009-09-22 at 20:57 -0400, Aubrey Jaffer wrote:
>  | > Unicode doesn't play well with a character datatype.  Downcasing
>  | > or foldcasing a single scalar-value can result in a length 2
>  | > string.
>  | 
>  | That is not a problem with Unicode.  That is a problem with 
>  | the assumption that there is a bijection between upcase
>  | and downcase characters - an assumption violated by one
>  | character in one language.  
> 
> There are other ligatures which have this property.  A Latin (English)
> example is (lowercase) "ﬁ" (U+FB01).  Upcasing it gives "FI";
> downcasing leaves it unchanged, foldcasing yields "fi".

The "ﬁ" character, however, has a canonical decomposition, and may 
never appear in a normalized string; it is replaced by "fi".  If 
you're talking about normalized strings, it is indeed true that 
there is only one character in one language that upcases to a 
different number of characters. 

There are several that upcase or foldcase to a different number of 
codepoints, but that's a different problem and should be below the 
level of abstraction provided by strings. 

Bear





Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-23 Thread John Cowan
Thomas Lord scripsit:

> The chickens will come home to roost on that hack right around the
> time that the Unicode Consortium runs out of code points to assign.

Willing to put your money where your mouth is?  I've already won several
bets of this type.  If you'll put a time limit on it (so that the bet
won't become a lien on my estate) and agree that the bet is off if and
when humanity begins communicating with ETIs, then I'll be happy to take
such a bet for a reasonable amount.

Characters aren't like IP addresses.  There is simply no place on
Earth where large, totally unknown character sets are at all likely to
be lurking.  Although not all scripts are encoded, all scripts have been
enumerated and their approximate size computed.  It is very unlikely that
we will even break out of the currently specified range of planes 0-3,
special plane E, and private-use planes F and 10.

--
While staying with the Asonu, I met a man from  John Cowan
the Candensian plane, which is very much like   [email protected]
ours, only more of it consists of Toronto.  http://www.ccil.org/~cowan
--Ursula K. Le Guin, Changing Planes



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-22 Thread Thomas Lord
I started writing a silly jokey response to BH
and realized that, for reasons unclear, Alaric's 
original message is one I either didn't get or that
got mysteriously (and wrongly) reclassified as junk
mail.

So, to Alaric, first:

> Shoving these modifiers into high bits in characters that are
> represented as some fixed-width cell size is a hack

Yes.  Yes, it is.  Thanks for noticing.

The chickens will come home to roost on that hack right
around the time that the Unicode Consortium runs out of
code points to assign.


On Tue, 2009-09-22 at 21:34 -0700, Brian Harvey wrote:
> > '(escape (meta alt ctrl shift #\cokebottle)) ;-)
> 
> IIRC alt, like shift, actually created different glyphs; they weren't bucky
> bits.  So no user code would ever see anything like this; it'd be
> (meta ctrl #\someothercharacter)
> 
> Maybe you meant to say
> (hyper super meta ctrl #\cokebottle) ?

Everything goes better with a level of indirection.
So the Butler says.
And it's often the Butler what done it.

Keyboards give you whatever they give you and you can normalize
it to whatever you like.

Back to Alaric:

> Storing function keys as symbols means you can easily deal
> with good old Sun keyboards [etc.]

Oh, good.  Because, you know, it's not like we haven't
easily dealt with old Sun keyboards (etc.) for about as
long as they've been around.   Er, oops actually we 
have.  Along the lines I described.

Lists or vectors (arrays in Emacs lisp) of symbols and 
chars are *also* a perfectly fine representation for 
event sequences.  It's quite nice to have both, actually.

-t






Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-22 Thread Thomas Lord
On Tue, 2009-09-22 at 23:57 -0400, Aubrey Jaffer wrote:
>  | A sequence of what now?   What exactly is it represented as a 
>  | string of length 1?
> 
> (string-ref "abc" 1)  --> "b".

As I mentioned to John, sure... you *can* inductively
define strings and string-length that way but you
aren't going to succeed in doing it without effectively
embedding something like codepoints in your model.
It's also non-standard to talk about lists of things
without specifying a separate, underlying domain of
list elements.   It's also realistic in implementations
that you have to recognize some underlying domain of elements.

So let Scheme reflect on the existence of characters
per se: the type of elements of the finite lists which
are strings.

-t





Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-22 Thread Thomas Lord
On Wed, 2009-09-23 at 01:59 -0400, John Cowan wrote:
> Thomas Lord scripsit:
> 
> > That is not a problem with Unicode.  That is a problem with 
> > the assumption that there is a bijection between upcase
> > and downcase characters - an assumption violated by one
> > character in one language.  
> 
> A lot more than one. 

I stand corrected.  There is more evidence for my
point than I thought.

> That makes 103 characters altogether that don't work in char-upcase
> or char-downcase.

Cool.

> > A sequence of what now?   What exactly is it represented as a 
> > string of length 1?
> 
> A Unicode codepoint.  These languages have no representation of
> codepoints, but they do have representations of sequences of codepoints.
> This is not paradoxical.

Yes, that's my point.

It's not completely absurd to imagine defining strings and
string lengths inductively (take length 0 and length 1 strings
as axiomatic and define appending) but it is a bit
like walking the long way around the block instead of going
two doors down.   If strings look and quack like finite 
sequences of something, it's nice to be able to reflect on that
domain of "something".   A first-class character type is a 
natural move.

It's a little unfair to suggest that because Javascript and
Python lack first class characters, perhaps Scheme should do
without them as well.  Neither Javascript nor Python is as 
general purpose a language as Scheme, nor are they deliberately 
conceived of as multi-paradigm languages to the degree that
Scheme is.

-t





Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-22 Thread John Cowan
Thomas Lord scripsit:

> That is not a problem with Unicode.  That is a problem with 
> the assumption that there is a bijection between upcase
> and downcase characters - an assumption violated by one
> character in one language.  

A lot more than one.  In addition to ess-zet, there are:

13 Latin and Armenian ligatures that uppercase to two characters

61 Latin and Greek lowercase letters with diacritics that
uppercase to the uppercase base character followed by the
combining diacritic(s)

I with dot, which lowercases (in non-Turkic contexts) to
i followed by combining dot in order to maintain canonical
equivalence rules (only one dot is displayed)

27 Greek titlecase combinations of an uppercase vowel with
diacritic(s) followed by a lowercase iota which uppercase to
the same vowel followed by an uppercase iota.

That makes 103 characters altogether that don't work in char-upcase
or char-downcase.
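
A short Python sketch of a few such cases.  It assumes a Python 3
interpreter, whose str.upper and str.lower apply the full
(string-valued) Unicode case mappings; the dictionary of sample
characters is an illustrative selection, one per category above, not
the full list.

```python
examples = {
    "ess-zet": "\u00df",                                    # LATIN SMALL LETTER SHARP S
    "fi ligature": "\ufb01",                                # LATIN SMALL LIGATURE FI
    "j with caron": "\u01f0",                               # base letter plus diacritic
    "alpha with psili and prosgegrammeni": "\u1f88",        # Greek titlecase form
}

for name, ch in examples.items():
    upper = ch.upper()
    print(f"{name}: 1 codepoint -> {len(upper)} codepoints",
          [f"U+{ord(c):04X}" for c in upper])

# I with dot above lowercases to i followed by COMBINING DOT ABOVE.
dotted_i_lower = "\u0130".lower()
print([f"U+{ord(c):04X}" for c in dotted_i_lower])
```

In each case a procedure from one codepoint to one codepoint has no
correct value to return, whereas a string-to-string operation does.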

> A sequence of what now?   What exactly is it represented as a 
> string of length 1?

A Unicode codepoint.  These languages have no representation of
codepoints, but they do have representations of sequences of codepoints.
This is not paradoxical.

-- 
John Cowan  http://www.ccil.org/~cowan
It's like if you meet an really old, really rich guy covered in liver
spots and breathing with an oxygen tank, and you say, "I want to be
rich, too, so I'm going to start walking with a cane and I'm going to
act crotchety and I'm going to get liver disease. --Wil Shipley



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-22 Thread John Cowan
Aubrey Jaffer scripsit:

> If anyone cares, other Unicode-supporting language development efforts
> seem to be moving away from the character datatype:
> 
>  According to , JavaScript
>  lacks chars:

[...]

>  Ruby 1.8 used integers for chars (like C).  Ruby 1.9 returns length 1
>  strings from indexing strings.

[...]

>  Python lacks chars:

[...]

Perl, Basic, Q, and Pure also lack characters.  Haskell uses integers as
characters and lists of integers as strings.

-- 
So they play that [tune] on John Cowan
their fascist banjos, eh?   [email protected]
--Great-Souled Sam  http://www.ccil.org/~cowan



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-22 Thread Brian Harvey
> '(escape (meta alt ctrl shift #\cokebottle)) ;-)

IIRC alt, like shift, actually created different glyphs; they weren't bucky
bits.  So no user code would ever see anything like this; it'd be
(meta ctrl #\someothercharacter)

Maybe you meant to say
(hyper super meta ctrl #\cokebottle) ?



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-22 Thread Aubrey Jaffer
 | From: Thomas Lord 
 | Date: Tue, 22 Sep 2009 20:38:09 -0700
 | 
 | On Tue, 2009-09-22 at 20:57 -0400, Aubrey Jaffer wrote:
 | > Unicode doesn't play well with a character datatype.  Downcasing
 | > or foldcasing a single scalar-value can result in a length 2
 | > string.
 | 
 | That is not a problem with Unicode.  That is a problem with 
 | the assumption that there is a bijection between upcase
 | and downcase characters - an assumption violated by one
 | character in one language.  

There are other ligatures which have this property.  A Latin (English)
example is (lowercase) "ﬁ" (U+FB01).  Upcasing it gives "FI";
downcasing leaves it unchanged, foldcasing yields "fi".

 | > If anyone cares, other Unicode-supporting language development
 | > efforts seem to be moving away from the character datatype:
 | 
 | >  According to ,
 | >  JavaScript lacks chars:
 | 
 | >  String is a sequence of zero or more Unicode characters. There
 | >  is no separate character type.  A character is represented as a
 | >  string of length 1.
 | 
 | A sequence of what now?   What exactly is it represented as a 
 | string of length 1?

(string-ref "abc" 1)  --> "b".



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-22 Thread Thomas Lord
On Tue, 2009-09-22 at 20:57 -0400, Aubrey Jaffer wrote:
> Unicode doesn't play well with a character datatype.  Downcasing or
> foldcasing a single scalar-value can result in a length 2 string.

That is not a problem with Unicode.  That is a problem with 
the assumption that there is a bijection between upcase
and downcase characters - an assumption violated by one
character in one language.  

> If anyone cares, other Unicode-supporting language development efforts
> seem to be moving away from the character datatype:

>  According to , JavaScript
>  lacks chars:

>  String is a sequence of zero or more Unicode characters. There is no
>  separate character type.  A character is represented as a string of
>  length 1.

A sequence of what now?   What exactly is it represented as a 
string of length 1?



-t





Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-22 Thread Aubrey Jaffer
 | Date: Sun, 20 Sep 2009 10:29:51 -1000 (HST)
 | From: Shiro Kawai 
 | 
 | From: Thomas Lord 
 | Date: Sun, 20 Sep 2009 12:07:51 -0700
 | 
 | ...
 | > I noticed in the list of string-set! uses that Aubrey
 | > posted from SLIB, one of the uses came from a library
 | > that provided a "format" function: something that takes
 | > a format string and a bunch of other parameters and
 | > creates a new string (like sprintf in C).  That 
 | > strikes me as another case where string mutation is
 | > very handy for avoiding excess data copying and 
 | > consing.
 | 
 | Here I'd like to hear from Aubrey; to me, formatting
 | is one part that string builder type pattern makes
 | much more sense, since the length of the final string
 | isn't generally known beforehand (and Gauche's format
 | is implemented so).   What kind of advantage
 | did you see when you use string-set! in format?

Dirk Lutzebaeck and Ken Dickey were the original authors.  Most of the
uses of STRING-SET! are filling the mantissa and exponent strings
created by MAKE-STRING.  All look to be putting the character at the
end; so I expect a string-port would work as well.
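
To make the comparison concrete, here is a rough sketch of the two
styles in Python, where io.StringIO stands in for a Scheme string port
and a preallocated list stands in for MAKE-STRING plus STRING-SET!.
The function names and the sample digits are hypothetical; this is not
the SLIB code.

```python
import io

def digits_via_buffer(digits, width):
    """Analogue of make-string + string-set!: preallocate, then fill by index.

    The caller must guess a width that is large enough in advance.
    """
    buf = [" "] * width
    for i, d in enumerate(digits):
        buf[i] = str(d)
    return "".join(buf[:len(digits)])

def digits_via_port(digits):
    """Analogue of a string port: append at the end, no length needed up front."""
    port = io.StringIO()
    for d in digits:
        port.write(str(d))
    return port.getvalue()

mantissa = [3, 1, 4, 1, 5, 9]
print(digits_via_buffer(mantissa, width=16))
print(digits_via_port(mantissa))
```

When every write lands at the end, as described above, the two produce
the same string and the port version never needs the final length.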

 | > In any application where I/O filtering (read 
 | > some input, tweak it, write output) needs to be
 | > efficient, again, to avoid excessive data copying
 | > string mutation is a big boon.
 | 
 | ...
 | > Given the problematics of Unicode encoding,
 | > I think the time is ripe to bite the bullet and
 | > make the primitive string-replace! (which replaces
 | > in situ an arbitrary substring with an arbitrary string).
 | 
 | Right.  I always feel that just protecting string-set! and
 | string-fill! doesn't make sense.  If the mutable-string camp
 | insists on length-changing operations as well, then it makes
 | much more sense.

Unicode doesn't play well with a character datatype.  Downcasing or
foldcasing a single scalar-value can result in a length 2 string.
If anyone cares, other Unicode-supporting language development efforts
seem to be moving away from the character datatype:

 According to , JavaScript
 lacks chars:

   String is a sequence of zero or more Unicode characters. There is no
   separate character type.  A character is represented as a string of
   length 1.

 Ruby 1.8 used integers for chars (like C).  Ruby 1.9 returns length 1
 strings from indexing strings.

 According to
 
 Python lacks chars:

   Characters

   Python has no character type (in contrast to Pascal and C/C++).
   Although a string is a sequence type, the elements of a string are
   not "true" objects by themselves.

   Strings of length one are used as characters, e.g. in the built-in
   functions chr() and ord().



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-22 Thread Alaric Snell-Pym

On 22 Sep 2009, at 5:05 pm, Thomas Lord wrote:

>> None of this requires what you actually suggested, though, which is
>> storing modifiers in bucky bits. Instead you're storing them in
>> prefixes, like M- and S-.
>
> Nothing *requires* any particular thing in this
> space.  For the record, so that you're clear how
> Emacs is working here, (length "\M-&") => 1
> This simplifies things.  The code that handles
> a key binding specification doesn't have to have
> special cases - it treats the character "\M-&" the
> same way it treats the character "x".

If you treat these keysequences as lists of chars and symbols, then
you get that, too:

(length '((meta #\&))) => 1
(length '((ctrl #\a) #\b 'f11)) => 3

Shoving these modifiers into high bits in characters that are
represented as some fixed-width cell size is a hack; it constrains you
to a character representation, and it gives you a limited number of
modifier bits to play with, and both are things you could regret one
day.

Storing function keys as symbols means you can easily deal with good
old Sun keyboards with Open and Help buttons on, and those newfangled
media keyboards with Louder and Quieter and Favourites buttons on; and
implementing modifiers by wrapping lists around symbols (for function
keys) or characters means you can handle as many modifiers as the
future throws at you, and have a uniform modifier abstraction covering
both.
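
The same shape can be sketched outside Lisp.  The Python below is only
an illustration of the idea: Sym, Modified and modified are
hypothetical names, with Sym standing in for a Scheme symbol and a
one-character string standing in for a character.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sym:
    """Stand-in for a symbol naming a function key, e.g. Sym("f11")."""
    name: str

@dataclass(frozen=True)
class Modified:
    """Any number of modifiers wrapped around a character or a Sym."""
    modifiers: frozenset
    key: object  # a one-character str or a Sym

def modified(*args):
    """modified("ctrl", "a") plays the role of (ctrl #\\a)."""
    *mods, key = args
    return Modified(frozenset(mods), key)

sequence = [modified("ctrl", "a"), "b", Sym("f11")]
print(len(sequence))                     # 3 events
print(len([modified("meta", "&")]))      # 1 event

# The same wrapper covers characters and function keys, and the order
# in which modifiers are listed does not matter.
print(modified("meta", "ctrl", Sym("f11")) == modified("ctrl", "meta", Sym("f11")))
```

Adding a new modifier is then just a new name in the set, with no bit
width to run out of.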

'(escape (meta alt ctrl shift #\cokebottle)) ;-)

>
> -t

 >

ABS

--
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/archives/author/alaric/






Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-22 Thread Thomas Lord
On Tue, 2009-09-22 at 09:13 +0100, Alaric Snell-Pym wrote:
> On 22 Sep 2009, at 12:54 am, Thomas Lord wrote:
> 
> [function keys]
> > One could assign them -- not too implausibly --
> > to Unicode's circled numbers (U+2460 onward).
> 
> Yeah, but then how does a user type those? Even if no human keyboard
> has them, most Unicode keyboard drivers have some mechanism for
> entering arbitrary codepoints.

As in Emacs, you normalize events.  That is,
the keyboard sends whatever special code it
sends for a function key but before you look
that code up in a keymap, translate it to 
whatever you like.


> 
> > A very common situation is having a start-up file
> > that sets key-bindings.  Here are two of mine:
> >
> >  (global-set-key "\M-&" 'interactive-background-command)
> >
> > (That's Emacs lisp, not Scheme.)
> >
> > That helps to illustrate how it is convenient to
> > humans to write these things as strings.
> >
> > And, here's one I notice from a famous Emacs lisp
> > extension package called "calc":
> >
> > (define-key calc-mode-map (format "r%c" x) 'calc-recall-quick)
> >
> > Notice that FORMAT - a procedure for formatting strings -
> > is being used to generate a particular keybinding in a
> > systematic way, automatically.
> 
> None of this requires what you actually suggested, though, which is
> storing modifiers in bucky bits. Instead you're storing them in
> prefixes, like M- and S-.


Nothing *requires* any particular thing in this
space.  For the record, so that you're clear how
Emacs is working here, (length "\M-&") => 1
This simplifies things.  The code that handles
a key binding specification doesn't have to have
special cases - it treats the character "\M-&" the 
same way it treats the character "x".

-t



> 
> >
> > -t
> >
> 
> 
> ABS
> 
> --
> Alaric Snell-Pym
> Work: http://www.snell-systems.co.uk/
> Play: http://www.snell-pym.org.uk/alaric/
> Blog: http://www.snell-pym.org.uk/archives/author/alaric/
> 
> 
> 




Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-22 Thread John Cowan
Alaric Snell-Pym scripsit:

> Yeah, but then how does a user type those? Even if no human keyboard
> has them, most Unicode keyboard drivers have some mechanism for
> entering arbitrary codepoints.

There are 66 reserved non-characters in Unicode, codepoints that will
never be used in interchange and therefore are free for all applications
to use internally for purposes like this.  They are U+FDD0 to U+FDEF
(32 codepoints) plus U+xFFFE through U+xFFFF where x = 0 .. 10 hex (17*2 =
34 codepoints).
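
The count can be checked mechanically.  A small Python sketch (the
function name noncharacters is just a label chosen here):

```python
def noncharacters():
    """Enumerate the Unicode noncharacters as integer codepoints."""
    block = list(range(0xFDD0, 0xFDF0))              # U+FDD0 .. U+FDEF
    plane_ends = [(plane << 16) | low                # last two codepoints of each plane
                  for plane in range(0x11)           # planes 0 .. 10 hex
                  for low in (0xFFFE, 0xFFFF)]
    return block + plane_ends

nonchars = noncharacters()
print(len(nonchars))
print(f"U+{nonchars[0]:04X} .. U+{nonchars[-1]:04X}")
```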

-- 
John Cowan    http://www.ccil.org/~cowan
"Any legal document draws most of its meaning from context.  A telegram
that says 'SELL HUNDRED THOUSAND SHARES IBM SHORT' (only 190 bits in
5-bit Baudot code plus appropriate headers) is as good a legal document
as any, even sans digital signature." --me



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-22 Thread Alaric Snell-Pym

On 22 Sep 2009, at 12:54 am, Thomas Lord wrote:

[function keys]
> One could assign them -- not too implausibly --
> to Unicode's circled numbers (U+2460 onward).

Yeah, but then how does a user type those? Even if no human keyboard
has them, most Unicode keyboard drivers have some mechanism for
entering arbitrary codepoints.

> A very common situation is having a start-up file
> that sets key-bindings.  Here are two of mine:
>
>  (global-set-key "\M-&" 'interactive-background-command)
>
> (That's Emacs lisp, not Scheme.)
>
> That helps to illustrate how it is convenient to
> humans to write these things as strings.
>
> And, here's one I notice from a famous Emacs lisp
> extension package called "calc":
>
> (define-key calc-mode-map (format "r%c" x) 'calc-recall-quick)
>
> Notice that FORMAT - a procedure for formatting strings -
> is being used to generate a particular keybinding in a
> systematic way, automatically.

None of this requires what you actually suggested, though, which is
storing modifiers in bucky bits. Instead you're storing them in
prefixes, like M- and S-.

>
> -t
>


ABS

--
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/archives/author/alaric/






Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-22 Thread Alaric Snell-Pym

On 21 Sep 2009, at 11:43 pm, Andrew Reilly wrote:

> On Mon, Sep 21, 2009 at 08:49:03AM -0700, Thomas Lord wrote:
>> I don't think so.   For example, I like the idea
>> of using codepoints with buckybits as the names
>> of keyboard events.
>
> Isn't that a fairly gratuitous example of an ad-hoc storage
> optimization for a very specific application, and therefore
> not much of an argument for putting something into "thing-1"?
> (Where do you put the bucky-bits when the input is EBCDIC?
> What's the codepoint for "F11"?)


I much prefer representing key events as a list of characters (for
actual character events) and symbols (for function keys). 'f11 is then
the 'codepoint' for F11 :-)

ABS

--
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/archives/author/alaric/






Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-21 Thread Thomas Lord
On Tue, 2009-09-22 at 08:43 +1000, Andrew Reilly wrote:
> On Mon, Sep 21, 2009 at 08:49:03AM -0700, Thomas Lord wrote:
> > I don't think so.   For example, I like the idea
> > of using codepoints with buckybits as the names
> > of keyboard events.
> 
> Isn't that a fairly gratuitous example of an ad-hoc storage
> optimization 

No.  It's not about storage optimization at all.
It's about letting people write keysequences as 
string literals and manipulate them using string
operations (like string<? [...])

> for a very specific application, 

It is a fairly specific use but it is not application-
specific.  We can make other examples very different
in character, if you like.

> and therefore
> not much of an argument for putting something into "thing-1"?
> (Where do you put the bucky-bits when the input is EBCDIC?

Nobody is talking about "putting bucky-bits into WG1 Scheme"
per se.   You would put bucky-bits into EBCDIC by
using fat codepoints and putting them in the high bits,
same as in Unicode.

> What's the codepoint for "F11"?)

For that matter, what is the codepoint for,
say, the down transition of a left mouse button?

Traditionally, function keys like "F11" have
been handled in a variety of ways.   In GNU
Emacs these days, a keysequence can be either
a string or an array that includes a mix of 
characters and symbols.  Symbols are used for
keys like F11.  The names of the symbols are
significant - for example, S-F11 parses as a 
shifted version of F11.

In the days of yore, function keys on terminals
typically generated peculiar sequences of other
(non-function-key) characters.   In some situations,
precisely what sequence was generated could be 
reprogrammed by the user.

There was talk, in the past, of adding function 
key codepoints to Unicode.  I suppose there may
be such talk in the future as well.

One could, of course, assign them to codepoints
in a private use area.

One could assign them -- not too implausibly -- 
to Unicode's circled numbers (U+2460 onward).



> > It's a parsimonious choice
> > because it gives me a human-friendly print/read 
> > syntax for individual events and sequences of 
> > events.

> Aren't keyboard events delivered through some sort of GUI
> callback these days?

Humans have to specify their preferred keybindings
in many systems.



> > I can sort a set of strings representing
> > key sequences using string<? [...]
> > equality using string=? or string-ci=?  Take
> > substrings.  Concatenate strings.  Even upcasing 
> > and downcasing are useful.
> 
> How do any of those work when the "characters" have been
> peppered with bucky-bits?

For example: 

Assuming your characters use Unicode codepoints
as the underlying unmodified character set,
then they work by extension of the unicode rules.

For example, the upcase of M-x is M-X.

Of course, the shift-modified version of M-x is
(in my preference; it's a minor issue) S-M-x, which
is distinct from M-X (but which most keyboard set-ups
can't generate).
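
As a rough sketch of the idea (not taken from any actual implementation),
modifier flags can live in bits above the 21 that Unicode needs, and casing
can then act on the base codepoint alone.  The bit positions and every
procedure name below are made up for illustration; the positions merely
resemble the ones GNU Emacs uses.  The code assumes a case-sensitive Scheme
with char-upcase available (R7RS puts it in (scheme char)).

```scheme
;; Assumed layout: low 22 bits hold the base codepoint, flags sit above.
(define base-limit (expt 2 22))
(define shift-bit  (expt 2 25))   ; hypothetical position for S-
(define meta-bit   (expt 2 27))   ; hypothetical position for M-

;; Recover the unmodified codepoint from a key code.
(define (key-base key) (modulo key base-limit))

;; True if the given modifier bit is set in the key code.
(define (has-modifier? key bit)
  (odd? (quotient key bit)))

;; Set a modifier bit unless it is already present.
(define (add-modifier key bit)
  (if (has-modifier? key bit) key (+ key bit)))

;; Casing touches only the base codepoint; the flags ride along unchanged.
(define (key-upcase key)
  (let ((base (key-base key)))
    (+ (- key base)
       (char->integer (char-upcase (integer->char base))))))

(define meta-x       (add-modifier (char->integer #\x) meta-bit))  ; M-x
(define meta-cap-x   (key-upcase meta-x))                          ; M-X
(define shift-meta-x (add-modifier meta-x shift-bit))              ; S-M-x
```

Under these assumptions meta-cap-x and shift-meta-x are different integers,
which matches the point that S-M-x and M-X are distinct events.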



> Sorry about all the questions: I have never needed to code an
> editor (or anything much that had real-time keyboard input),
> so I don't know the issues.  My intuition says that anything
> being pressed by a human is happening slowly enough and few
> enough that no effort on space or time optimization (of the
> keystroke recording) can be necessary.


I suspect that you are not an Emacs user, or
at least that you don't customize or write extensions
for Emacs.

A very common situation is having a start-up file
that sets key-bindings.  Here is one of mine:

  (global-set-key "\M-&" 'interactive-background-command)

(That's Emacs Lisp, not Scheme.)

That helps to illustrate how it is convenient to
humans to write these things as strings.

And, here's one I notice from a famous Emacs Lisp 
extension package called "calc":

 (define-key calc-mode-map (format "r%c" x) 'calc-recall-quick)

Notice that FORMAT - a procedure for formatting strings - 
is being used to generate a particular keybinding in a
systematic way, automatically.

-t






> Cheers,
> 




Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-21 Thread Andrew Reilly
On Mon, Sep 21, 2009 at 08:49:03AM -0700, Thomas Lord wrote:
> I don't think so.   For example, I like the idea
> of using codepoints with buckybits as the names
> of keyboard events.

Isn't that a fairly gratuitous example of an ad-hoc storage
optimization for a very specific application, and therefore
not much of an argument for putting something into "thing-1"?
(Where do you put the bucky-bits when the input is EBCDIC?
What's the codepoint for "F11"?)

> It's a parsimonious choice
> because it gives me a human-friendly print/read 
> syntax for individual events and sequences of 
> events.

Aren't keyboard events delivered through some sort of GUI
callback these days?

> I can sort a set of strings representing
> key sequences using string<? [...]
> equality using string=? or string-ci=?  Take
> substrings.  Concatenate strings.  Even upcasing 
> and downcasing are useful.

How do any of those work when the "characters" have been
peppered with bucky-bits?

Sorry about all the questions: I have never needed to code an
editor (or anything much that had real-time keyboard input),
so I don't know the issues.  My intuition says that anything
being pressed by a human is happening slowly enough and few
enough that no effort on space or time optimization (of the
keystroke recording) can be necessary.

Cheers,

-- 
Andrew



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-21 Thread Alaric Snell-Pym

On 21 Sep 2009, at 5:14 pm, John Cowan wrote:

>> [I]f you want to access it as bytes, you do "(make-
>> u8vector-aliasing-blob  [  []])" to
>> get an SRFI-4 byte vector (endianness is a bit moot for u8s, but
>> included for consistency with the corresponding procedures for u16
>> and
>> up, and it defaults to platform endianness), or "(make-record-
>> aliasing-
>> blob   [])", where  is some list of
>> field descriptions documenting the layout of a binary structure.
>
> I prefer a lower level (on top of which both of these are easily
> layered)
> like that of SRFI 74 or R6RS, which says "get/set at a specified byte
> offset, a value of ".  More details in a later posting,
> hopefully
> to become a SRFI.

Yes, that is a valid way of implementing a binary-interop system as a
basic mechanism plus a layer that gives you arrays (and structs and
arrays of structs with arrays in etc), just as long as the end user
gets a nice high level API like that to play with, perhaps after
installing a library to give them it, rather than having to do things
in unpleasant or implementation-optimisation-hurting ways such as
implementing everything themselves on top of raw bytes ;-)

*looks at R6RS bytevectors*

Yes, they look OK... as a blob type ;-) bytevector definitely implies
u8vector to me!

ABS

--
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/archives/author/alaric/






Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-21 Thread John Cowan
Alaric Snell-Pym scripsit:

> > Does anyone think that bytevectors are a necessary feature of
> > WG1 Scheme?
> 
> Not me.
> 
> > I can see why WG2 should have both strings and bytevectors, but WG1
> > shouldn't (imho) include anything about bytes, binary, etc.

I'll give my views on binary I/O in a later posting.

> > Maybe not even bitwise logical operators.

They are easily layered over bignums, which are layered (with some loss of
space efficiency) over fixnums.  Or bignums can be layered over blobs,
as the old Unix implementation of dc(1) did, with its base-100 bignums.
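
A minimal sketch of that layering, assuming only exact non-negative
integers and ordinary arithmetic; the names arith-and and arith-xor are
hypothetical, and negative operands (two's-complement semantics) are left out:

```scheme
;; Bitwise AND of two non-negative exact integers, one bit per recursion step.
(define (arith-and a b)
  (if (or (= a 0) (= b 0))
      0
      (+ (* 2 (arith-and (quotient a 2) (quotient b 2)))
         (if (and (odd? a) (odd? b)) 1 0))))

;; Bitwise XOR built the same way: the low bit is 1 when the parities differ.
(define (arith-xor a b)
  (if (and (= a 0) (= b 0))
      0
      (+ (* 2 (arith-xor (quotient a 2) (quotient b 2)))
         (if (eq? (odd? a) (odd? b)) 0 1))))
```

For instance, (arith-and 12 10) is 8 and (arith-xor 12 10) is 6.  This
costs one division per bit, which is the kind of efficiency loss one accepts
in a definitional implementation.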

> > (Of course all that stuff will be loadable as optional libraries,
> > like the rest of WG2's features.)
> 
> I know where SRFI-4 is when I want it (I see no need for a special
> "bytevector" type).

> FWIW, though, I do think there's some value in a low-level and optional
> mechanism for representing a block of memory, without attaching any
> particular semantics to it. Let's call it a blob.

R6RS bytevectors are in fact such blobs; apparently someone thought the
name "blob" (used in SRFI 74) unattractive or otherwise undesirable,
despite its well-established status as a technical term.

> [I]f you want to access it as bytes, you do "(make-
> u8vector-aliasing-blob  [  []])" to
> get an SRFI-4 byte vector (endianness is a bit moot for u8s, but
> included for consistency with the corresponding procedures for u16 and
> up, and it defaults to platform endianness), or "(make-record-aliasing-
> blob   [])", where  is some list of
> field descriptions documenting the layout of a binary structure. 

I prefer a lower level (on top of which both of these are easily layered)
like that of SRFI 74 or R6RS, which says "get/set at a specified byte
offset, a value of ".  More details in a later posting, hopefully
to become a SRFI.

Tom Lord scripsit:

> Suppose that WG1 Scheme requires regular vectors, a way to create
> disjoint types, and fixnums.  Then it is possible for library-code to
> define objects with the semantics of bytevectors, though lacking an
> assurance of a compact representation for these.  As an adjunct to
> the core specification (by appendix or via inclusion by reference)
> bytevectors can be given a (usable, not optimal) definitional
> implementation.

I agree entirely, which is why I removed blobs from my Thing One proposals.

-- 
Normally I can handle panic attacks on my own;   John Cowan 
but panic is, at the moment, a way of life.  http://www.ccil.org/~cowan



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-21 Thread Thomas Lord
On Mon, 2009-09-21 at 00:40 -0700, Brian Harvey wrote:
> > I don't know about you but I *do* regard strings primarily as 
> > text intended for human readability
> 
> +1
> 
> Does anyone think that bytevectors are a necessary feature of WG1 Scheme?
> I can see why WG2 should have both strings and bytevectors, but WG1 shouldn't
> (imho) include anything about bytes, binary, etc.  Maybe not even bitwise
> logical operators.  (Of course all that stuff will be loadable as optional
> libraries, like the rest of WG2's features.)

Suppose that WG1 Scheme requires regular vectors,
a way to create disjoint types, and fixnums.  
Then it is possible for library-code to define
objects with the semantics of bytevectors, though
lacking an assurance of a compact representation for
these.  As an adjunct to the core specification (by
appendix or via inclusion by reference) bytevectors
can be given a (usable, not optimal) definitional 
implementation.

That seems the best of both worlds, to me.
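
To make that concrete, here is a sketch of such a definitional
implementation, assuming SRFI 9 / R7RS define-record-type as the way to
create a disjoint type.  Every name in it (<bv>, make-bv, bv-ref and so on)
is hypothetical, and it makes no attempt at a compact representation:

```scheme
;; A bytevector-like object: a record wrapping an ordinary vector.
(define-record-type <bv>
  (wrap-bv cells)
  bv?
  (cells bv-cells))

;; A byte is an exact integer in the range 0..255.
(define (byte? x)
  (and (integer? x) (exact? x) (<= 0 x 255)))

;; Create a bytevector of n bytes, each initialised to fill.
(define (make-bv n fill)
  (if (byte? fill)
      (wrap-bv (make-vector n fill))
      (error "make-bv: fill is not a byte" fill)))

(define (bv-length b) (vector-length (bv-cells b)))
(define (bv-ref b i) (vector-ref (bv-cells b) i))

;; Store a byte, rejecting anything outside 0..255.
(define (bv-set! b i x)
  (if (byte? x)
      (vector-set! (bv-cells b) i x)
      (error "bv-set!: not a byte" x)))
```

An implementation that cares about space is free to replace this with a
packed representation, as long as the observable behaviour stays the same.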

-t





Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-21 Thread Thomas Lord
On Sun, 2009-09-20 at 22:19 -0700, Ray Dillinger wrote:
> Isn't it true that the "more general" uses of strings would 
> all be equally or better served by binary buffers (bytevectors,
> uniform numeric vectors, whatever)?

I don't think so.   For example, I like the idea
of using codepoints with buckybits as the names
of keyboard events.  It's a parsimonious choice
because it gives me a human-friendly print/read 
syntax for individual events and sequences of 
events.  I can sort a set of strings representing
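key sequences using string<? [...]
equality using string=? or string-ci=?  Take
substrings.  Concatenate strings.  Even upcasing 
and downcasing are useful.

> I don't know about you but I *do* regard strings primarily as 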
> text intended for human readability and to be manipulated in 
> linguistically (or at least character-oriented) significant 
> ways.  Whenever I find myself doing something else with one, 
> I realize that I am no longer using it as a string.

What counts as a linguistic use gets a bit 
fuzzy, though, as the example of keysequences
shows.

-t




On Sun, 2009-09-20 at 22:19 -0700, Ray Dillinger wrote:
> On Sun, 2009-09-20 at 12:17 -0700, Thomas Lord wrote:
> > On Sun, 2009-09-20 at 08:56 -0700, Ray Dillinger wrote:
> 
> > > This is why I believe that the best semantics for string-length, 
> > > indexes in strings, etc, is that they should count characters 
> > > rather than codepoints.  And this is one of the things that I 
> > > believed then and still believe now that R6RS got wrong.
> > 
> > That's a reasonable view when a string is being regarded
> > primarily as human text to be manipulated in linguistically
> > significant ways.   Strings as a data structure are more 
> > general than that, though.
> 
> Isn't it true that the "more general" uses of strings would 
> all be equally or better served by binary buffers (bytevectors,
> uniform numeric vectors, whatever)?
> 
> I don't know about you but I *do* regard strings primarily as 
> text intended for human readability and to be manipulated in 
> linguistically (or at least character-oriented) significant 
> ways.  Whenever I find myself doing something else with one, 
> I realize that I am no longer using it as a string.
> 
>   Bear
> 
> 




Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-21 Thread Alaric Snell-Pym

On 21 Sep 2009, at 8:40 am, Brian Harvey wrote:

>> I don't know about you but I *do* regard strings primarily as
>> text intended for human readability
>
> +1

+1 from me too.

> Does anyone think that bytevectors are a necessary feature of WG1
> Scheme?

Not me.

> I can see why WG2 should have both strings and bytevectors, but WG1
> shouldn't
> (imho) include anything about bytes, binary, etc.  Maybe not even
> bitwise
> logical operators.  (Of course all that stuff will be loadable as
> optional
> libraries, like the rest of WG2's features.)

I know where SRFI-4 is when I want it (I see no need for a special
"bytevector" type).

FWIW, though, I do think there's some value in a low-level and
optional mechanism for representing a block of memory, without
attaching any particular semantics to it. Let's call it a blob. There
would be no blob-ref; if you want to access it as bytes, you do "(make-
u8vector-aliasing-blob  [  []])" to
get an SRFI-4 byte vector (endianness is a bit moot for u8s, but
included for consistency with the corresponding procedures for u16 and
up, and it defaults to platform endianness), or "(make-record-aliasing-
blob   [])", where  is some list of
field descriptions documenting the layout of a binary structure. This
sort of arrangement helps a lot when dealing with file formats and the
like. Perhaps even "(make-string-from-blob [blob] [encoding] [
])". When I have some free time, I plan to write an
implementation of it for Chicken, that allows the underlying blobs to
manage their own storage so it's easier to interoperate with C code
via the FFI without needing to copy large blocks of memory into and
out of the GC heap; I'd hope to spit an SRFI out when I have a working
implementation...

ABS

--
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/archives/author/alaric/






Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-21 Thread Brian Harvey
> I don't know about you but I *do* regard strings primarily as 
> text intended for human readability

+1

Does anyone think that bytevectors are a necessary feature of WG1 Scheme?
I can see why WG2 should have both strings and bytevectors, but WG1 shouldn't
(imho) include anything about bytes, binary, etc.  Maybe not even bitwise
logical operators.  (Of course all that stuff will be loadable as optional
libraries, like the rest of WG2's features.)



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread Ray Dillinger
On Sun, 2009-09-20 at 12:17 -0700, Thomas Lord wrote:
> On Sun, 2009-09-20 at 08:56 -0700, Ray Dillinger wrote:

> > This is why I believe that the best semantics for string-length, 
> > indexes in strings, etc, is that they should count characters 
> > rather than codepoints.  And this is one of the things that I 
> > believed then and still believe now that R6RS got wrong.
> 
> That's a reasonable view when a string is being regarded
> primarily as human text to be manipulated in linguistically
> significant ways.   Strings as a data structure are more 
> general than that, though.

Isn't it true that the "more general" uses of strings would 
all be equally or better served by binary buffers (bytevectors,
uniform numeric vectors, whatever)?

I don't know about you but I *do* regard strings primarily as 
text intended for human readability and to be manipulated in 
linguistically (or at least character-oriented) significant 
ways.  Whenever I find myself doing something else with one, 
I realize that I am no longer using it as a string.

Bear





Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread Andre van Tonder

On Sun, 20 Sep 2009, Jeff Read wrote:

> On Sun, Sep 20, 2009 at 12:24 PM, Abdulaziz Ghuloum  wrote:
>
> > Since the time R6RS was being discussed, available memory
> > has more than doubled.  By the time R7RS is finalized (if
> > ever), memory would at least quadruple again.  This is
> > why I asked John about how many years he's been using the
> > "space efficiency" argument; to me, that argument has been
> > obsolete about 2 years after I've first heard it.
>
> I'm sure there are those who will want to see R7RS on their PDP-10 or
> Amiga 2000. :)
>
> Seriously, though, assuming that the primary use case for your
> language will be desktop or server PC's is folly. As others have
> pointed out, there are still plenty of places where every byte counts.

Count me in.  My computer is nine years old.  It runs perfectly
fine and I am very happy with it.  I would like applications to remain
space efficient so that I don't have to replace it unnecessarily.

To give just one example, Windows Vista has been unpopular for
very good reason.  Having to upgrade one's hardware to be able
to run new software must be good for someone's pockets, but certainly
not mine or any other user's.

Andre


Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread Shiro Kawai
From: Thomas Lord 
Subject: Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: 
string-set! must die
Date: Sun, 20 Sep 2009 12:07:51 -0700

> > RnRS abandoning mutable strings does *not* prevent such
> > tiny Scheme from having mutable strings as an implementation's
> > extension. 
> 
> And vice versa.   

No.  There's an asymmetry here.   

* Scheme with mutable-only strings can still use
  string libraries that are written for immutable strings.
* Scheme with immutable-only strings cannot use
  string libraries that are written for mutable strings.

> Both are "nice to have" and I would expect that
> most implementations will want to support both.
> It would be good to sanctify some specification 
> of both types and how they relate.

The problem is that introducing mutable strings suddenly
bloats the spec.  You have to mark which operations return
immutable strings.   Substring-like operations need two
versions, one returning a fresh string and the other returning
a possibly shared string.

Of course the same can be said of pairs and vectors, but
the usage pattern is pretty different.  I don't think we
should overgeneralize here.

> > [...]
> > Requiring string ports (string builder) shouldn't be much
> > burden to the tiny Scheme; 
> 
> String ports are an example of a generic
> problem for which disjointed, piecemeal 
> solutions seem the wrong approach (puns intended).
[snip]

Yes, it's good to have a small, clear core with a generic
approach.  Let's have it.  And what does it have
to do with mutable/immutable strings?

> > I feel that your discussion explains why mutable
> > string benefits tiny Scheme, but doesn't support why
> > mutable strings should be in the standard.
> 
> That is because we have to first agree on the 
> desired form and function of the standard.
[...]
> My thought for R7/small is for an even smaller
> than traditional core, with "the rest" given both
> narrative and code definitions.
[snip]

I basically agree with your discussion here.  From my side,
we can provide the code definitions of, say, string
ports, via mutable vectors and vector->string; so far
it seems orthogonal to the mutable/immutable string discussion.

> > It is plausible, but could you support your opinion with
> > some concrete observation, experience, or algorithms?
> > The counter observation of that 9 years of experience in
> > Gauche community.
> 
> One of the more fun projects I've done in Scheme
> was an Emacs-like text editor.   For that, I found
> a very nice data-structure (good trade-offs) was
> a kind of unholy mix of "gap buffers" (like in GNU
> Emacs) with "ropes" (big strings represented as 
> (in this case) splay trees of smaller strings).  
> Modifying strings in the middle was important to 
> good performance for this.  Not being able to modify
> strings in the middle with expected-case decent
> efficiency would have meant too much copying of data
> or too high a fragmentation of long strings.

If you represent the entire text in an elaborate
structure, why do you need the leaves to be Scheme
strings?   You cannot treat the entire text or
subtrees of it as a Scheme string anyway; you need
a special API to deal with them.  Then you can just
use mutable vectors in the leaf nodes as well.
(Of course, if your Scheme has mutable strings
then it's ok to use them.  A portable library with
optional implementation-specific optimizations
can be configured either way.)

> I noticed in the list of string-set! uses that Aubrey
> posted from SLIB, one of the uses came from a library
> that provided a "format" function: something that takes
> a format string and a bunch of other parameters and
> creates a new string (like sprintf in C).  That 
> strikes me as another case where string mutation is
> very handy for avoiding excess data copying and 
> consing.

Here I'd like to hear from Aubrey; to me, formatting
is one place where a string-builder pattern makes
much more sense, since the length of the final string
isn't generally known beforehand (and Gauche's format
is implemented that way).   What kind of advantage
did you see when you used string-set! in format?
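
As an illustration of the string-builder pattern (a sketch only, not
Gauche's or SLIB's actual format), here is a toy formatter that accumulates
output in a string port and never mutates a string.  The name tiny-format
and its single "~a" directive are hypothetical; open-output-string and
get-output-string are the SRFI 6 / R7RS names (R6RS spells this
open-string-output-port).

```scheme
;; Replace each "~a" in template with the next argument, displayed.
(define (tiny-format template . args)
  (let ((out (open-output-string))
        (n (string-length template)))
    (let loop ((i 0) (args args))
      (cond ((>= i n)
             ;; End of template: extract everything written so far.
             (get-output-string out))
            ((and (< (+ i 1) n)
                  (char=? (string-ref template i) #\~)
                  (char=? (string-ref template (+ i 1)) #\a)
                  (pair? args))
             ;; Directive with an argument available: emit it, skip "~a".
             (display (car args) out)
             (loop (+ i 2) (cdr args)))
            (else
             ;; Ordinary character (or a directive with no argument left).
             (write-char (string-ref template i) out)
             (loop (+ i 1) args))))))
```

The final length never has to be known in advance; the port grows as needed
and the result string is produced once at the end.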

> In any application where I/O filtering (read 
> some input, tweak it, write output) needs to be
> efficient, again, to avoid excessive data copying
> string mutation is a big boon.

Any *PORTABLE* I/O filtering in the character/string
domain has to accept the fact that an arbitrary 
binary<->character conversion could be inserted during
input and output.  If you don't like that, you need
to roll your own with binary I/O and bytevectors.

If you're writing for a specific situation where
external and internal encoding match (which is rather

Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread Thomas Lord
On Sun, 2009-09-20 at 15:33 -0400, John Cowan wrote:
> Thomas Lord scripsit:
> 
> > It occurs to me that there is no reason to 
> > have immutable vectors in the core.  They can
> > be created out of mutable vectors and the
> > ability to define new disjoint types.
> 
> There's no reason to have anything but lambda.
> It's provably universal.  End of story.


Nope, don't buy it. 

What you are pointing towards is enough to
make a mathematical model of Scheme.  (I'll
spot you syntactic and lexical extensions and
I/O).

What I am pointing towards is, indeed, "fatter"
than that - but it is also 

1) a description of a useful implementation
   technique

2) a description that can be usefully used to 
   reason about the operational behavior of
   programs (e.g., performance analysis),
   at least in broad terms, in a straightforward
   way.

You could do a kind of "SIOD"-style implementation
of what I'm describing and get something useful.
Not so much with a "lambda-only" thing.

And... hmmm that's an interesting idea.  
Something to do...   Prbly take more than "OD",
though.

-t








Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread Alex Queiroz
Hallo,

On Sun, Sep 20, 2009 at 4:26 PM, Thomas Lord  wrote:
> Abdulaziz,
>
> The size of primary memory has kept growing but is
> actually reaching limits.   And, anyway, it DOESN'T
> MATTER.   You can make primary memory a TB and you'll
> *still* want to use compact encodings for Unicode
> characters.
>
> You may ask why.  Think of how memory works.  It is a
> hierarchy and levels of the hierarchy are connected by
> various busses.   The CPU has only so much storage for
> register values.  You have multiple layers of caches,
> which again are of limited size.   Beyond main memory
> you have tertiary storage and network connections.
>
> Space in the CPU and caches is limited.  The busses
> that connect the layers have limited BANDWIDTH.
> Tertiary can be infinite for all practical purposes.
> Main can be ridiculously large.  You will still care
> about compact string data if you care about performance.
> That fact isn't changing anytime soon.

 Indeed: http://people.redhat.com/drepper/cpumemory.pdf

-- 
-alex
http://www.ventonegro.org/



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread John Cowan
Thomas Lord scripsit:

> It occurs to me that there is no reason to 
> have immutable vectors in the core.  They can
> be created out of mutable vectors and the
> ability to define new disjoint types.

There's no reason to have anything but lambda.
It's provably universal.  End of story.

-- 
There is / One art  John Cowan 
No more / No less   http://www.ccil.org/~cowan
To do / All things
With art- / Lessness --Piet Hein



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread Thomas Lord
Abdulaziz,

The size of primary memory has kept growing but is
actually reaching limits.   And, anyway, it DOESN'T
MATTER.   You can make primary memory a TB and you'll
*still* want to use compact encodings for Unicode
characters.

You may ask why.  Think of how memory works.  It is a
hierarchy and levels of the hierarchy are connected by
various busses.   The CPU has only so much storage for
register values.  You have multiple layers of caches,
which again are of limited size.   Beyond main memory
you have tertiary storage and network connections.

Space in the CPU and caches is limited.  The busses 
that connect the layers have limited BANDWIDTH.
Tertiary can be infinite for all practical purposes.
Main can be ridiculously large.  You will still care
about compact string data if you care about performance.
That fact isn't changing anytime soon.

-t


On Sun, 2009-09-20 at 11:30 -0500, Brian Mastenbrook wrote:
> On Sep 20, 2009, at 11:24 AM, Abdulaziz Ghuloum wrote:
> 
> > Unicode code points currently require 21 bits of storage
> > if represented uniformly.  That number is unlikely to
> > increase beyond 32 bits in any foreseeable future.
> >
> > Storage (disk, ram, L_n cache, etc.) has been increasing
> > exponentially on all computational devices (personal
> > computers, PDAs, phones, ipods, et cetera).  The desktop
> > I purchased 14 years ago had 4MBs of RAM, and my current
> > laptop has 4GBs: that's a 1000 fold increase.
> >
> > Now maybe in the distant past, I would've cringed if I
> > even thought of using 4 bytes per character in a string.
> > Today, I can use 4 bytes *and* at the same time hold
> > hundreds of times more data in memory than I could back
> > then.  As memory increases, I would worry about the 4
> > bytes even less and less.  As time passes, I find the
> > arguments for "memory efficient" representations less
> > appealing.
> >
> > Since the time R6RS was being discussed, available memory
> > has more than doubled.  By the time R7RS is finalized (if
> > ever), memory would at least quadruple again.  This is
> > why I asked John about how many years he's been using the
> > "space efficiency" argument; to me, that argument has been
> > obsolete about 2 years after I've first heard it.
> 
> I think it's mistaken to conclude that because RAM is growing, space  
> efficiency is unimportant - just as much as it's mistaken to say that  
> because CPU speeds are growing, native compilers are obsolete now.  
> Many people would rather use the available RAM to increase the amount  
> of data their program can process, not to waste it on space- 
> inefficient representations.
> 
> --
> Brian Mastenbrook
> [email protected]
> http://brian.mastenbrook.net/
> 


Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread Joe Marshall
Just my .02:

I can't see how mutable strings and Unicode can coexist at the
user's abstraction level.  The implementor might want to have
a way to tweak code points or bytes or such, but the average
user probably doesn't want to become a Unicode expert.  He
mostly wants to print stuff or parse stuff, and you don't need
access to the bits and bytes if the abstraction is good.

-- 
~jrm



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread Thomas Lord
On Sun, 2009-09-20 at 08:56 -0700, Ray Dillinger wrote:
> On Sat, 2009-09-19 at 16:48 -0700, Thomas Lord wrote:
> 
> > []
> 
> If you are appending and taking substrings, the codepoint level 
> is one of several wrong choices to make about where to allow
> string divisions, for exactly this reason.
> 
> What human beings think of as characters, are represented in unicode
> by a base codepoint plus nondefective sequence of combining 
> modifiers and variant selectors, each of which is also a codepoint.

Certainly.   I've long been attracted to your
notion to have a string-type that works in the way
you've been describing for quite a while.

In particular, I think it is very important that
the definition of a "character" in Scheme not 
preclude the possibility of a string type of the 
sort you describe.

However, as a person who likes "systems programming"
and writing regexp matchers and even implementing basic
Unicode algorithms: I really want choices.  I want to be
able to have encoding unit, codepoint, and full character
strings (not necessarily all in the same string-like object).
And I want to be able to define some generics that work on
all of these (where that makes sense) as well as procedures
that require just a particular kind of string.




> The sequence is usually length zero, but since you're talking about
> renormalizing after divisions, you're already talking about cases 
> where the sequence is nonempty. 
> 
> If you allow division of strings on codepoint boundaries which 
> are not also character boundaries, you can "renormalize" but in 
> this case the renormalization operation makes no semantic sense. 
> You have created characters that were not there, you have 
> vanished characters that were there, you have changed characters 
> into different characters, and so on.  These are not sensible 
> operations; these are bugs.
> 
> If you restrict string division to character boundaries, then 
> you have no need to "renormalize" because by not dividing strings 
> in mid-character or joining strings that start or end with partial
> characters, you never create a denormalized string. Further, 
> the characters on each side of the division are the same 
> characters that were there in the undivided string, so the 
> user does not experience this class of inconsistencies and 
> bugs.
> 
> This is why I believe that the best semantics for string-length, 
> indexes in strings, etc, is that they should count characters 
> rather than codepoints.  And this is one of the things that I 
> believed then and still believe now that R6RS got wrong.

That's a reasonable view when a string is being regarded
primarily as human text to be manipulated in linguistically
significant ways.   Strings as a data structure are more 
general than that, though.

-t





Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread Thomas Lord
I think my list for the core of small Scheme 
can be simplified.

I called for syntactic and lexical abstraction,
basic ports, *mutable and immutable* vectors,
fixnums, lambda, program-made disjoint types,
and some way to make generics (mutable but 
applicable objects).

It occurs to me that there is no reason to 
have immutable vectors in the core.  They can
be created out of mutable vectors and the
ability to define new disjoint types.
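
As a rough illustration of that last point, here is a minimal sketch in Python (not Scheme, purely for illustration) of an immutable vector built from a mutable one plus a new disjoint type. The class name `ImmutableVector` and the method name `ref` are hypothetical, and Python only hides the inner list by convention, whereas a real disjoint-type constructor would make it inaccessible.

```python
class ImmutableVector:
    """A distinct type that wraps a mutable list and exposes no mutators."""

    def __init__(self, items):
        # Copy the input so later changes to the caller's list cannot leak in.
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def ref(self, index):
        # Read-only access; there is deliberately no corresponding setter.
        return self._items[index]


source = [1, 2, 3]
frozen = ImmutableVector(source)
source[0] = 99          # mutating the original list ...
print(frozen.ref(0))    # ... does not affect the wrapped copy
```

The only operations the wrapper offers are length and indexed read, so immutability comes from the interface rather than from a second primitive type.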

-t



On Sat, 2009-09-19 at 16:30 -0700, Thomas Lord wrote:
> As a kind of counter-proposal, a perhaps sane "core Scheme"
> (a subset of "small Scheme") could offer:
> 
> 1. *Some* low-level solution for syntactic abstraction
>    (I won't dwell on my opinion of what exactly,
>    since that would be controversial.)
> 
> 2. *Some* low-level solution for reader extensions.
> 
> 3. *Some* low-level but fairly abstract system of 
>    ports and an environment in which certain resources
>    (like libraries) can be "opened" to create a port.
> 
> 4. Low-level vectors - both immutable and mutable.
> 
> 5. Lambda.
> 
> 6. A constructor for disjoint types that can wrap
>    an arbitrary value.
> 
> 7. Fixnums.  (I/O on low-level ports consumes and yields these.)
> 
> 8. *Perhaps* mutable lambdas (objects which can be 
>    applied but you can modify them to change which
>    simple lambda is invoked when applying them).
> 
> 
> That'd be about it.
> 
> Everything in Small Scheme can be *explained* fairly 
> well in terms of those things.   For example, cons-pairs
> are length-two vectors wrapped up as a disjoint type.
> Flonums can be explained as a fixnum or pair of fixnums
> wrapped up as a disjoint type.
> 
> Real "core Scheme" code can actually implement those
> familiar types in that way.   In a truly minimalist implementation
> that would actually be potentially useful.   As a semantic
> model, it would be useful.
> 
> Of course, most implementations would natively implement
> many more types and features than what I described.
> But the specification for those additional types and
> features could be expressed quite precisely as core
> Scheme code.
> 
> There would be less pressure, in this kind of approach,
> to haggle over questions like "Strings: mutable or not?"
> We can define both.  We can treat the traditional string
> operators as generics that can work on either.  We can
> quibble over exactly which ones are *required* in Small
> Scheme but also enjoy that Small Scheme supports either
> one.
> 
> -t
> 
> 
> 
> On Sat, 2009-09-19 at 01:24 -0400, John Cowan wrote:
> > This is a proposal for the removal of string-set! (and consequently
> > string-fill!) from the R7RS small Scheme language.  I am publishing this
> > document to invite wide comment.  There is nothing official about it.
> > I very gratefully acknowledge the kind help of Alex Shinn, who provided
> > the topic sentences for most of the paragraphs below.  However, I retain
> > sole responsibility for this document, including all errors.
> > 
> > I believe that despite the prescription of the draft WG1 charter that
> > no features of IEEE Scheme (a subset of R4RS) should be removed from
> > R7RS small Scheme, an exception should be made for string-set!, for at
> > least the following reasons:
> > 
> > 1) Immutable strings are more purely functional, and allow many
> > optimizations, such as being transparently and freely shareable between
> > procedures and between threads without concern for uncontrolled mutation.
> > For this and other reasons, the general trend in new languages/runtimes
> > such as Java and C# is toward immutable strings; unfortunately, this is
> > the kind of argument that Schemers usually don't like, so I won't bother
> > mentioning it.  :-)
> > 
> > 2) Algorithms where you want to modify strings in the middle are rare,
> > and many of the classic devices (such as string-upcase!, a procedure that
> > mutates a string in place) are awkward or impossible with representations
> > that make use of characters of variable length such as UTF-8.  Typical
> > string algorithms want to also be able to do insertions and deletions,
> > which are not directly possible with classical Scheme strings.  Better
> > representations such as trees of immutable strings do allow such changes,
> > as well as making string appends O(n) in the number of strings rather
> > than in the sum of their lengths.
> > 
> > 3) If strings are immutable, it's possible to have both fast O(1)
> > access to individual characters or substrings, and fairly space-efficient
> > representation of full Unicode strings, by using different representations
> > > for strings drawn from different character repertoires.  For example,
> > > an implementation might use 8-bit code units when all characters are
> > > less than #x100, 16-bit code units when all characters are less than
> > > #x10000, and 32-bit code units otherwise.
> > 
> > Unfortunately, mutating even a single character in such a representation
> > may require the entire string t

Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread Thomas Lord
On Sat, 2009-09-19 at 16:13 -1000, Shiro Kawai wrote:
> From: Thomas Lord 

> > []

> RnRS abandoning mutable strings does *not* prevent such a
> tiny Scheme from having mutable strings as an implementation
> extension. 

And vice versa.   

Both are "nice to have" and I would expect that
most implementations will want to support both.
It would be good to sanctify some specification 
of both types and how they relate.


> [...]
> Requiring string ports (string builders) shouldn't be much of a
> burden on the tiny Scheme; 

String ports are an example of a generic
problem for which disjointed, piecemeal 
solutions seem the wrong approach (puns intended).
String ports are not the only kind of alternative
port desirable.  Similarly it would be reasonable
to support and encourage new types of numbers,
new types of sequences, and so forth.

In other words, it comes up in many areas that
we look at some traditional "primitive" type 
in Scheme and want to be able to create "variations".
String ports are just one example.

Common Lisp sets an interesting example by
defining certain generic functions, such
as the sequence functions.   It provides a 
mechanism for creating new, disjoint types
and for creating and extending the behavior
of generics.

Core Scheme should get, as simply as practical,
to support for generics and program-created 
disjoint types - and then define these other 
things (like string ports) by giving implementations
in terms of those.  (Implementations are not
obligated to use the definitional implementations,
of course.)


> I feel that your discussion explains why mutable
> strings benefit a tiny Scheme, but doesn't show why
> mutable strings should be in the standard.

That is because we have to first agree on the 
desired form and function of the standard.

The classic old "50 pagers" made distinctions
between "core", "syntax", and "library" specifications.
The hint was that given the specified core and 
primitive syntax, the rest could be defined in terms
of those.   Oddly, in my opinion, library items 
were given only narrative definitions.

My thought for R7/small is for an even smaller
than traditional core, with "the rest" given both
narrative and code definitions.

I am not sure I would want to see "the rest"
polarized into "required" and "optional".  Rather,
"the rest" should comprise just about everything 
commonly found in implementations and anything widely
acclaimed, and various subsets of what is provided
or not provided from those things can be given names.

That gives a lattice of feature sets where each point
is what might be "built in".  It gives the opportunity
to give names to acclaimed "sweet spots" on that lattice.
Of course, using the definitional implementations of
features not found in a given implementation, programs can
always add any features they need but that aren't built-in.

Implementations then "conform" if they provide that
tiny core - e.g., with only vectors and fixnums, generics,
extensible syntax and lexical language, user-made disjoint
types, lambda, and minimal ports.   Implementations are
"correct" to the extent that any additional standard features
they happen to build in conform to the definitional implementations
of those features.

That would, as a side effect, take the R7 authors out of
directly enumerating what an implementation MUST build 
in beyond the tiny core.   The choice of sweet spots on 
the lattice of feature mixes is not one that requires 
community-wide agreement.


> > > 2) Algorithms where you want to modify strings in the middle are rare,
> > 
> > Claims like this always make the hairs on the back of my 
> > neck stand up.  There are two problems with them.
> > First, there isn't a really good empirical way to establish
> > such claims.   Second, rarity per se, is not the most 
> > important consideration.
> [...]
> > Rarity is not an especially compelling argument.  More 
> > important is *importance*.   The question is less "how often
> > do I need to reach for string mutation?" so much as the
> > question is "how painful is it if when I want string mutation
> > I can't have it?".
> 
> It is plausible, but could you support your opinion with
> some concrete observation, experience, or algorithms?
> My counter-observation is based on 9 years of experience in
> the Gauche community.

One of the more fun projects I've done in Scheme
was an Emacs-like text editor.   For that, I found
a very nice data-structure (good trade-offs) was
a kind of unholy mix of "gap buffers" (like in GNU
Emacs) with "ropes" (big strings represented as 
(in this case) splay trees of smaller strings).  
Modifying strings in the middle was important to 
good performance for this.  Not being able to modify
strings in the middle with expected-case decent
efficiency would have meant too much copying of data
or too high a fragmentation of long strings.
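
For readers unfamiliar with ropes, the following is a minimal sketch in Python (not Scheme, purely for illustration) of a tree of immutable string pieces in which an insertion in the middle copies only the one leaf it splits. It is a plain unbalanced binary tree, not the splay-tree and gap-buffer hybrid described above, and all names (`Leaf`, `Concat`, `rope_split`, `rope_insert`, `rope_ref`, `rope_to_string`) are hypothetical.

```python
class Leaf:
    """An immutable piece of text."""
    def __init__(self, text):
        self.text = text
        self.length = len(text)


class Concat:
    """An interior node joining two ropes; building one copies no text."""
    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.length = left.length + right.length


def rope_ref(node, index):
    # Walk down the tree, adjusting the index when stepping to the right.
    while isinstance(node, Concat):
        if index < node.left.length:
            node = node.left
        else:
            index -= node.left.length
            node = node.right
    return node.text[index]


def rope_split(node, index):
    # Return two ropes holding the first `index` characters and the rest.
    if isinstance(node, Leaf):
        return Leaf(node.text[:index]), Leaf(node.text[index:])
    if index < node.left.length:
        head, tail = rope_split(node.left, index)
        return head, Concat(tail, node.right)
    head, tail = rope_split(node.right, index - node.left.length)
    return Concat(node.left, head), tail


def rope_insert(node, index, text):
    # Build a new rope; the original rope is left untouched.
    head, tail = rope_split(node, index)
    return Concat(Concat(head, Leaf(text)), tail)


def rope_to_string(node):
    if isinstance(node, Leaf):
        return node.text
    return rope_to_string(node.left) + rope_to_string(node.right)


greeting = Concat(Leaf("Hello, "), Leaf("world"))
longer = rope_insert(greeting, 7, "big ")
print(rope_to_string(longer))
print(rope_to_string(greeting))
```

Because nothing is mutated, the old rope remains valid after the insertion. A production version would also need rebalancing, which this sketch omits.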

I noticed in the list of string-set! uses that Aubrey
posted from SLIB, one of the uses came from a library
that provided a

Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread John Cowan
Jeff Read scripsit:

> Seriously, though, assuming that the primary use case for your language
> will be desktop or server PC's is folly. As others have pointed out,
> there are still plenty of places where every byte counts.

I think that will be the use case for large Scheme, but small Scheme
must cover a wider range of processors.

> (Imho it is not the place for a Scheme standard to specify how strings
> will be implemented, but rather to specify the semantics of their
> interface, and leave implementation details up to the implementors.)

Quite.  But pretending that implementation considerations don't matter
at all isn't very sensible either.

-- 
John Cowan  http://ccil.org/~cowan
There are books that are at once excellent and boring.  Those that at
once leap to the mind are Thoreau's Walden, Emerson's Essays, George
Eliot's Adam Bede, and Landor's Dialogues.  --Somerset Maugham



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread Jeff Read
On Sun, Sep 20, 2009 at 12:24 PM, Abdulaziz Ghuloum  wrote:
>
> Since the time R6RS was being discussed, available memory
> has more than doubled.  By the time R7RS is finalized (if
> ever), memory would at least quadruple again.  This is
> why I asked John about how many years he's been using the
> "space efficiency" argument; to me, that argument has been
> obsolete about 2 years after I've first heard it.
>

I'm sure there are those who will want to see R7RS on their PDP-10 or
Amiga 2000. :)

Seriously, though, assuming that the primary use case for your
language will be desktop or server PC's is folly. As others have
pointed out, there are still plenty of places where every byte counts.

(Imho it is not the place for a Scheme standard to specify how strings
will be implemented, but rather to specify the semantics of their
interface, and leave implementation details up to the implementors.)

--Jeff



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread John Cowan
Ray Dillinger scripsit:

> If you are appending and taking substrings, the codepoint level is one
> of several wrong choices to make about where to allow string divisions,
> for exactly this reason.

You understate the case: *every* level is a wrong choice for some purposes
(and a right choice for others).

> What human beings think of as characters, are represented in unicode
> by a base codepoint plus nondefective sequence of combining modifiers
> and variant selectors, each of which is also a codepoint.

The DGC level (which you are describing) is also arbitrary; for some
languages it works well, for others not.

For example, in all (mainstream) Indic scripts the DGC is a consonant
with zero or one vowel added, and this is indeed right for Tamil, whose
users think of it as a syllabary.  In Hindi, though, it's more common
to think of *all* the consonants before a vowel as being part of the
character, even though they are in different DGCs according to Unicode,
because that's the way they (mostly) ligature together.

-- 
John Cowan
Income tax, if I may be pardoned for saying so, is a tax on income.
        --Lord Macnaghten (1901)



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread John Cowan
Abdulaziz Ghuloum scripsit:

> Unicode code points currently require 21 bits of storage
> if represented uniformly.  That number is unlikely to
> increase beyond 32 bits in any foreseeable future.

It's extremely unlikely to go past 21 bits, in fact.  (I've
won several bets of the form "Unicode won't need an increase
in bits in the next N years").  If we meet the Galactic Empire,
we may need to go to a 64-bit code, but we'll mostly need that
at the interface to the local net.
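
The 21-bit figure can be checked with a line of arithmetic, shown here as a small Python sketch. It relies only on the fact that the Unicode code space ends at U+10FFFF; the constant name `MAX_CODE_POINT` is just a label chosen for this example.

```python
import sys

# The highest code point in the Unicode code space.
MAX_CODE_POINT = 0x10FFFF

# Number of bits needed to represent every code point uniformly.
bits = MAX_CODE_POINT.bit_length()

print(bits)
print(sys.maxunicode == MAX_CODE_POINT)
```

Since 2^20 = #x100000 is below #x10FFFF and 2^21 = #x200000 is above it, 20 bits are too few and 21 are enough.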

> Storage (disk, ram, L_n cache, etc.) has been increasing
> exponentially on all computational devices (personal
> computers, PDAs, phones, ipods, et cetera).  

True of consumer devices.  Not true of industrial embedded devices.

-- 
John Cowan  http://www.ccil.org/~cowan
Unless it was by accident that I had offended someone, I never apologized.
        --Quentin Crisp



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread Abdulaziz Ghuloum

On Sep 20, 2009, at 7:28 PM, Arthur A. Gleckler wrote:

> I can assure you that, on embedded platforms like Android,
> we're still fighting for every byte we can get.

Sure.  So you have to be clever today and undo your cleverness
tomorrow when the tradeoffs change.  I've done that before for
other things, like optimizing for code size instead of speed,
and that's fine.

Aziz,,,



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread Abdulaziz Ghuloum

On Sep 20, 2009, at 7:30 PM, Brian Mastenbrook wrote:

> I think it's mistaken to conclude that because RAM is growing, space  
> efficiency is unimportant - just as much as it's mistaken to say  
> that because CPU speeds are growing, native compilers are obsolete  
> now.

I presume you meant "optimizing compilers".

> Many people would rather use the available RAM to increase the
> amount of data their program can process, not to waste it on
> space-inefficient representations.

Of course there are tradeoffs.  You can use any good compression  
scheme (you're by no means limited to utf-8) to make some  
representation more space-efficient and less time-efficient.  You have  
to take common usage into account when choosing a representation.   
Special (uncommon) requirements may need custom solutions, and I can't  
argue against that.

Aziz,,,



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread Brian Mastenbrook
On Sep 20, 2009, at 11:24 AM, Abdulaziz Ghuloum wrote:

> Unicode code points currently require 21 bits of storage
> if represented uniformly.  That number is unlikely to
> increase beyond 32 bits in any foreseeable future.
>
> Storage (disk, ram, L_n cache, etc.) has been increasing
> exponentially on all computational devices (personal
> computers, PDAs, phones, ipods, et cetera).  The desktop
> I purchased 14 years ago had 4MBs of RAM, and my current
> laptop has 4GBs: that's a 1000 fold increase.
>
> Now maybe in the distant past, I would've cringed if I
> even thought of using 4 bytes per character in a string.
> Today, I can use 4 bytes *and* at the same time hold
> hundreds of times more data in memory than I could back
> then.  As memory increases, I would worry about the 4
> bytes even less and less.  As time passes, I find the
> arguments for "memory efficient" representations less
> appealing.
>
> Since the time R6RS was being discussed, available memory
> has more than doubled.  By the time R7RS is finalized (if
> ever), memory would at least quadruple again.  This is
> why I asked John about how many years he's been using the
> "space efficiency" argument; to me, that argument has been
> obsolete about 2 years after I've first heard it.

I think it's mistaken to conclude that because RAM is growing, space
efficiency is unimportant - just as much as it's mistaken to say that
because CPU speeds are growing, native compilers are obsolete now.
Many people would rather use the available RAM to increase the amount
of data their program can process, not to waste it on
space-inefficient representations.

--
Brian Mastenbrook
[email protected]
http://brian.mastenbrook.net/



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread Arthur A. Gleckler
> Since the time R6RS was being discussed, available memory
> has more than doubled.  By the time R7RS is finalized (if
> ever), memory would at least quadruple again.  This is
> why I asked John about how many years he's been using the
> "space efficiency" argument; to me, that argument has been
> obsolete about 2 years after I've first heard it.

I can assure you that, on embedded platforms like Android, we're still
fighting for every byte we can get.  Somehow, we always seem to run
out, so I'm nervous about arguments like this.



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread Abdulaziz Ghuloum

On Sep 20, 2009, at 3:59 PM, Andy Wingo wrote:

>> For how many years have you been arguing for "space efficient"
>> internal representations and no one has been listening?  Do you
>> know why?  [Hint: it's not because implementors don't care about
>> space efficiency]
>
> I would be interested in knowing your argument :)

Unicode code points currently require 21 bits of storage
if represented uniformly.  That number is unlikely to
increase beyond 32 bits in any foreseeable future.

Storage (disk, ram, L_n cache, etc.) has been increasing
exponentially on all computational devices (personal
computers, PDAs, phones, ipods, et cetera).  The desktop
I purchased 14 years ago had 4MBs of RAM, and my current
laptop has 4GBs: that's a 1000 fold increase.

Now maybe in the distant past, I would've cringed if I
even thought of using 4 bytes per character in a string.
Today, I can use 4 bytes *and* at the same time hold
hundreds of times more data in memory than I could back
then.  As memory increases, I would worry about the 4
bytes even less and less.  As time passes, I find the
arguments for "memory efficient" representations less
appealing.

Since the time R6RS was being discussed, available memory
has more than doubled.  By the time R7RS is finalized (if
ever), memory would at least quadruple again.  This is
why I asked John about how many years he's been using the
"space efficiency" argument; to me, that argument has been
obsolete about 2 years after I've first heard it.

Aziz,,,



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread Ray Dillinger
On Sat, 2009-09-19 at 16:48 -0700, Thomas Lord wrote:

> Yes, but when you are building a string-like mutable
> type, appending and taking substrings, suddenly you 
> are renormalizing on every operation.

If you are appending and taking substrings, the codepoint level 
is one of several wrong choices to make about where to allow
string divisions, for exactly this reason.

What human beings think of as characters, are represented in unicode
by a base codepoint plus nondefective sequence of combining 
modifiers and variant selectors, each of which is also a codepoint.
The sequence is usually length zero, but since you're talking about
renormalizing after divisions, you're already talking about cases 
where the sequence is nonempty. 

If you allow division of strings on codepoint boundaries which 
are not also character boundaries, you can "renormalize" but in 
this case the renormalization operation makes no semantic sense. 
You have created characters that were not there, you have 
vanished characters that were there, you have changed characters 
into different characters, and so on.  These are not sensible 
operations; these are bugs.

If you restrict string division to character boundaries, then 
you have no need to "renormalize" because by not dividing strings 
in mid-character or joining strings that start or end with partial
characters, you never create a denormalized string. Further, 
the characters on each side of the division are the same 
characters that were there in the undivided string, so the 
user does not experience this class of inconsistencies and 
bugs.

This is why I believe that the best semantics for string-length, 
indexes in strings, etc, is that they should count characters 
rather than codepoints.  And this is one of the things that I 
believed then and still believe now that R6RS got wrong.
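
To make the hazard concrete, here is a minimal sketch in Python (not Scheme, purely for illustration) using the standard `unicodedata` module. It splits a string between a base letter and its combining accent, then renormalizes the pieces. The helper name `nfc` and the sample strings are chosen only for this example.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC (canonical composition)."""
    return unicodedata.normalize("NFC", text)


# "cafe" followed by U+0301 COMBINING ACUTE ACCENT: five code points
# that a reader sees as the four characters c, a, f, e-acute.
word = "cafe\u0301"

# Divide on a code point boundary that is not a character boundary.
head, tail = word[:4], word[4:]

print(len(word), len(nfc(word)))   # code point counts before and after NFC
print(nfc(head))                   # the accented letter has become plain e
print(nfc("a" + tail))             # the stray accent now decorates an a
```

Normalizing the whole word yields the single code point U+00E9 for the final letter. Normalizing after the split instead leaves a plain "e" on one side, and joining the orphaned accent to another string produces U+00E1, a character that appeared nowhere in the original text.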

Bear






Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread Brian Mastenbrook
On Sep 20, 2009, at 7:59 AM, Andy Wingo wrote:

> Guile does this FWIW; though it skips 16-bit, having either latin-1 or
> utf-32.

Out of curiosity, why don't implementations use 24 bits per code point  
as the "full" Unicode string representation? Is the combination of the  
extra time spent in unaligned loads and masking not worth the space  
efficiency benefit?
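
For concreteness, a 24-bit layout could look like the following minimal Python sketch (not Scheme, purely for illustration). The names `pack24` and `ref24` are hypothetical, and little-endian byte order is an arbitrary assumption. The slice and reassembly inside `ref24` stand in for the unaligned load and masking that a native implementation would need.

```python
def pack24(text):
    """Store each code point in exactly three bytes (little-endian)."""
    out = bytearray()
    for ch in text:
        out += ord(ch).to_bytes(3, "little")
    return bytes(out)


def ref24(packed, index):
    """O(1) indexed access: element i starts at byte offset 3 * i."""
    start = 3 * index
    return chr(int.from_bytes(packed[start:start + 3], "little"))


sample = "a\u03bb\U0001F600"           # 1-, 2-, and 3-byte-wide code points
packed = pack24(sample)
print(len(packed))                      # three bytes per code point
print(len(sample.encode("utf-32-le")))  # four bytes per code point
print(ref24(packed, 2) == sample[2])
```

Three bytes suffice because no code point needs more than 21 bits. The saving over UTF-32 is one byte in four, paid for with elements that no longer sit on 4-byte boundaries.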

--
Brian Mastenbrook
[email protected]
http://brian.mastenbrook.net/



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-20 Thread Andy Wingo
Hello,

On Sat 19 Sep 2009 20:54, Abdulaziz Ghuloum  writes:

> On Sep 19, 2009, at 8:24 AM, John Cowan wrote:
>
>> 3) If strings are immutable, it's possible to have both fast O(1)
>> access to individual characters or substrings, and fairly
>> space-efficient representation of full Unicode strings, by using
>> different representations for strings drawn from different character
>> repertoires.  For example, an implementation might use 8-bit code
>> units when all characters are less than #x100, 16-bit code units when
>> all characters are less than #x10000, and 32-bit code units otherwise.

Guile does this FWIW; though it skips 16-bit, having either latin-1 or
utf-32.
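
The width-selection idea can be sketched in a few lines of Python (not Scheme, and not Guile's actual code). The sketch assumes the three-tier scheme from the quoted proposal, with #x10000 as the 16-bit threshold; the function names `code_unit_width` and `encode_fixed_width` are hypothetical.

```python
def code_unit_width(text):
    """Bytes per code unit needed to hold every character of `text`."""
    highest = max((ord(ch) for ch in text), default=0)
    if highest < 0x100:
        return 1
    if highest < 0x10000:
        return 2
    return 4


def encode_fixed_width(text):
    """Encode with the narrowest uniform width; returns (width, bytes)."""
    width = code_unit_width(text)
    data = b"".join(ord(ch).to_bytes(width, "little") for ch in text)
    return width, data


narrow = "hello"
# The effect of a string-set! that stores U+03BB at index 1.
widened = narrow[:1] + "\u03bb" + narrow[2:]

print(encode_fixed_width(narrow)[0])    # width before the change
print(encode_fixed_width(widened)[0])   # width after the change
```

Replacing a single character with one outside the current repertoire changes the width of every code unit, which is why such a representation sits more comfortably with immutable strings: an in-place update would have to re-encode the whole string.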

> For how many years have you been arguing for "space efficient"
> internal representations and no one has been listening?  Do you
> know why?  [Hint: it's not because implementors don't care about
> space efficiency]

I would be interested in knowing your argument :)

(Not that I have a horse in this race.)

Happy hacking,

Andy
-- 
http://wingolog.org/



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread John Cowan
Abdulaziz Ghuloum scripsit:

> For how many years have you been arguing for "space efficient"
> internal representations and no one has been listening?  Do you
> know why?  [Hint: it's not because implementors don't care about
> space efficiency]

I have no idea why, unless it is that this implementation works
poorly in the presence of string-set!.

-- 
John Cowan  http://ccil.org/~cowan
Nobody expects the RESTifarian Inquisition!  Our chief weapon is
surprise ... surprise and tedium  ... tedium and surprise 
Our two weapons are tedium and surprise ... and ruthless disregard
for unpleasant facts  Our three weapons are tedium, surprise, and
ruthless disregard ... and an almost fanatical devotion to Roy Fielding



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Shiro Kawai
From: Per Bothner 
Subject: Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: 
string-set! must die
Date: Sat, 19 Sep 2009 19:59:00 -0700

> On 09/19/2009 07:32 PM, Shiro Kawai wrote:
> > I think nobody opposes generally mutable "string-like" data
> > structure, which allows length-changing mutation as well.
> > Immutable-string camp just thinks such data structure can be
> > built on top of immutable primitive strings.
> 
> You could do that, but better would be to use a mutable
> char vector or a bytebuffer as a gap-buffer.
> 
> I.e. you build such structure on top of a *mutable*
> primitive (i.e. implementation-specific) buffer.

Ok, so I retract that.  The point is that we should
separate (1) whether the standard has constant-length
string mutation or not, and (2) how the standard provides
general (length-changing) mutation.

--shiro





Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Per Bothner
On 09/19/2009 07:32 PM, Shiro Kawai wrote:
> I think nobody opposes generally mutable "string-like" data
> structure, which allows length-changing mutation as well.
> Immutable-string camp just thinks such data structure can be
> built on top of immutable primitive strings.

You could do that, but better would be to use a mutable
char vector or a bytebuffer as a gap-buffer.

I.e. you build such structure on top of a *mutable*
primitive (i.e. implementation-specific) buffer.
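
A gap buffer over a mutable character vector can be sketched as follows, in Python rather than Scheme and purely for illustration. The class name `GapBuffer`, its method names, and the doubling growth policy are assumptions made for this example, and a Python list of one-character strings stands in for the char vector.

```python
class GapBuffer:
    """Text stored in a mutable vector with a movable gap of free cells."""

    def __init__(self, text="", capacity=16):
        size = max(capacity, 2 * len(text), 1)
        self._buf = [""] * size
        self._buf[:len(text)] = list(text)   # same-length slice assignment
        self._gap_start = len(text)          # first free cell
        self._gap_end = size                 # first used cell after the gap

    def __len__(self):
        return len(self._buf) - (self._gap_end - self._gap_start)

    def _move_gap(self, pos):
        # Slide characters across the gap until it starts at `pos`.
        while self._gap_start > pos:
            self._gap_start -= 1
            self._gap_end -= 1
            self._buf[self._gap_end] = self._buf[self._gap_start]
        while self._gap_start < pos:
            self._buf[self._gap_start] = self._buf[self._gap_end]
            self._gap_start += 1
            self._gap_end += 1

    def _grow(self):
        # Open a fresh gap at the current position, doubling the vector.
        extra = max(len(self._buf), 1)
        self._buf[self._gap_end:self._gap_end] = [""] * extra
        self._gap_end += extra

    def insert(self, pos, ch):
        """Insert the single character `ch` before position `pos`."""
        if not 0 <= pos <= len(self):
            raise IndexError(pos)
        self._move_gap(pos)
        if self._gap_start == self._gap_end:
            self._grow()
        self._buf[self._gap_start] = ch
        self._gap_start += 1

    def delete(self, pos):
        """Delete the character at position `pos` by widening the gap."""
        if not 0 <= pos < len(self):
            raise IndexError(pos)
        self._move_gap(pos)
        self._gap_end += 1

    def text(self):
        return "".join(self._buf[:self._gap_start] + self._buf[self._gap_end:])


buf = GapBuffer("helo")
buf.insert(3, "l")
print(buf.text())
```

Repeated edits near the same position cost O(1) each once the gap is there, which is the property that makes the structure attractive for editors. Only moving the gap is proportional to the distance travelled.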
-- 
--Per Bothner
[email protected]   http://per.bothner.com/



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Shiro Kawai
From: Lynn Winebarger 
Subject: Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: 
string-set! must die
Date: Sat, 19 Sep 2009 20:57:24 -0400

> Your proposal should be broken into two parts.  The first is abandoning
> the representation of strings as character vectors in favor of data
> structures that support faster string operations.  The second is the
> mandate of immutability.  These are two distinct and independent
> proposals.
> 
> A better reason for string-copy on an immutable string is to obtain
> a mutable string.   You can't rule out that some algorithm might
> find it useful to share computations about the internal components
> of a string.

We should distinguish which kind of mutability we're talking about.

Past RnRS supports mutable strings via string-set! and string-fill!.
They are constant-length operations.

Some Schemes support length-changing mutation operations.  That
implies the implementation of strings is more than a mere
fixed-length array of characters.

If you argue that "some algorithm might find it useful to share
computations about the internal components of a string", I think
you really want flexible mutation, including length-changing
operations.

I think nobody opposes generally mutable "string-like" data
structure, which allows length-changing mutation as well.
Immutable-string camp just thinks such data structure can be
built on top of immutable primitive strings.

Certainly it's a plausible counter-argument to call for more
flexible primitive strings that allow length-changing mutation.
I don't see a benefit, however, in sticking to constant-length
mutation.

--shiro



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Shiro Kawai
From: Thomas Lord 
Subject: Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: 
string-set! must die
Date: Sat, 19 Sep 2009 15:49:47 -0700

> It is the case that a tiny implementation of small
> Scheme - one small in footprint and simple in implementation - 
> is unlikely to have much by the way of sophisticated
> optimizations.   
> 
> If we're going to talk about optimizations for a 
> small dialect of Scheme, I think we have to talk about
> an implementation class that is somewhere in the middle:
> it might not (except through libraries) offer a "big Scheme"
> environment but aims to provide a rich environment with a
> sophisticated implementation.

RnRS abandoning mutable strings does *not* prevent such a
tiny Scheme from having mutable strings as an implementation
extension.   Such a tiny Scheme can still use all of the portable
RnRS (immutable) string library.  Code written specifically
for the tiny Scheme, implementing optimizations specific to it,
is by definition non-portable.  Such code can freely take
advantage of the tiny Scheme's mutable strings.

Requiring string ports (string builders) shouldn't be much of a
burden on the tiny Scheme; they are trivial to implement
on top of mutable strings, especially if the tiny Scheme
uses only ASCII or ISO-8859 characters.
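
To make the string-port idea concrete, here is a minimal sketch of
building a string without string-set!.  It assumes an implementation
that provides string ports in the SRFI 6 / later R7RS style
(open-output-string, get-output-string); repeat-char is a hypothetical
helper name used only for illustration.

```scheme
(import (scheme base))

;; Hypothetical helper: build a string of n copies of c by writing
;; to a string port instead of filling a buffer with string-set!.
(define (repeat-char c n)
  (let ((out (open-output-string)))
    (do ((i 0 (+ i 1)))
        ((= i n) (get-output-string out))
      (write-char c out))))
```

The caller never sees a mutable string; how the port accumulates
characters internally is left to the implementation.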

> Those considerations lead me to the conclusion we
> should really have both mutable and immutable strings.

I feel that your discussion explains why mutable
strings benefit a tiny Scheme, but it doesn't explain why
mutable strings should be in the standard.

> > 2) Algorithms where you want to modify strings in the middle are rare,
> 
> Claims like this always make the hairs on the back of my 
> neck stand up.  There are two problems with them.
> First, there isn't a really good empirical way to establish
> such claims.   Second, rarity per se, is not the most 
> important consideration.
[...]
> Rarity is not an especially compelling argument.  More 
> important is *importance*.   The question is less "how often
> do I need to reach for string mutation?" so much as the
> question is "how painful is it if when I want string mutation
> I can't have it?".

That is plausible, but could you support your opinion with
some concrete observations, experience, or algorithms?
My counter-observation is based on 9 years of experience in
the Gauche community.

> In the absence of mutation, when people want to implement
> "a string like thing whose contents and length can 
> change over time" 

string-set! and string-fill! aren't length-changing operations,
so discussing the length-changing case is somewhat irrelevant.

Surely a length-changing operation is useful.  Mutable strings
via string-set! don't give it to you, though.

--shiro



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Per Bothner
On 09/19/2009 06:46 PM, Aubrey Jaffer wrote:
> Below are the occurrences of string-set! in SLIB.  One frequent pattern
> of use is to make a string using MAKE-STRING; then fill it with
> STRING-SET!.  Accumulating characters with STRING-APPEND is a bad
> idea; it turns an O(N) process into O(N^2).  So I guess lists-of-chars
> and LIST->STRING would be the alternative to MAKE-STRING with
> STRING-SET!.

The solution is a (mutable) text or string-buffer type,
with a text-append! operation.

"Small Scheme" probably only needs:

(make-text)
(text-append! TEXT CHARACTER)
(text-append! TEXT STRING)
(text->string TEXT)

It is easy to emulate these in "legacy Scheme", so
they shouldn't cause any problems for SLIB.  (A TEXT
could be represented as a pair containing a mutable string
plus an integer length.  Or if you don't have mutable
strings: a pair of a vector of characters plus an integer.)
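
As a rough sketch of the first emulation (a pair of a mutable string
plus an integer length), something like the following would do.  The
initial capacity of 16, the doubling growth policy, and the helper
text-reserve! are assumptions made for illustration; string-copy! is
taken with its later R7RS signature (to at from [start [end]]).

```scheme
(import (scheme base))

;; A TEXT is a pair (buffer . length); buffer is a mutable string.
(define (make-text) (cons (make-string 16 #\space) 0))

;; Hypothetical helper: grow the buffer by doubling until it can
;; hold `needed` characters, copying the used prefix across.
(define (text-reserve! text needed)
  (let ((buf (car text)))
    (when (> needed (string-length buf))
      (let loop ((cap (* 2 (string-length buf))))
        (if (< cap needed)
            (loop (* 2 cap))
            (let ((new (make-string cap #\space)))
              (string-copy! new 0 buf 0 (cdr text))
              (set-car! text new)))))))

;; Accepts either a character or a string, as in the interface above.
(define (text-append! text x)
  (let* ((s (if (char? x) (string x) x))
         (len (cdr text))
         (new-len (+ len (string-length s))))
    (text-reserve! text new-len)
    (string-copy! (car text) len s)
    (set-cdr! text new-len)))

;; Return a fresh string holding only the used part of the buffer.
(define (text->string text)
  (substring (car text) 0 (cdr text)))
```

With doubling, appending N characters one at a time costs O(N)
overall rather than O(N^2).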

"Big Scheme" might add insertion, deletion, slices, Unicode
normalization, and whatever.
-- 
--Per Bothner
[email protected]   http://per.bothner.com/



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Aubrey Jaffer
 | From: Thomas Lord 
 | Date: Sat, 19 Sep 2009 15:49:47 -0700
 | 
 | On Sat, 2009-09-19 at 01:24 -0400, John Cowan wrote:
 | []
 | > 2) Algorithms where you want to modify strings in the middle are rare,
 | 
 | Claims like this always make the hairs on the back of my 
 | neck stand up.  There are two problems with them.
 | First, there isn't a really good empirical way to establish
 | such claims.   Second, rarity per se, is not the most 
 | important consideration.
 | 
 | Surely if we scan the source for PLT or various Scheme
 | compilers, SLIB, and similar systems we'll find that 
 | string mutation is less common than immutable uses of
 | strings.   Well, I haven't actually *checked* but I wouldn't
 | be surprised.   Still, that tells us little about what
 | code there is that we can't see and won't hear about and it
 | tells us even less about what kinds of code people will 
 | want to write in 3 years.

Below are the occurrences of string-set! in SLIB.  One frequent pattern
of use is to make a string using MAKE-STRING; then fill it with
STRING-SET!.  Accumulating characters with STRING-APPEND is a bad
idea; it turns an O(N) process into O(N^2).  So I guess lists-of-chars
and LIST->STRING would be the alternative to MAKE-STRING with
STRING-SET!.
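
For comparison, here is a small sketch of the two accumulation
styles.  The procedure names read-all-slow and read-all-fast are
hypothetical and not taken from SLIB.

```scheme
(import (scheme base))

;; O(N^2): every string-append copies all characters read so far.
(define (read-all-slow port)
  (let loop ((acc ""))
    (let ((c (read-char port)))
      (if (eof-object? c)
          acc
          (loop (string-append acc (string c)))))))

;; O(N): cons each character onto a list, then convert once.
(define (read-all-fast port)
  (let loop ((chars '()))
    (let ((c (read-char port)))
      (if (eof-object? c)
          (list->string (reverse chars))
          (loop (cons c chars))))))
```

Both return the same string; the list version trades the quadratic
copying for a temporary list of N characters.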

  -=-=-=-=-

  array.scm:375:  ((if (string? store) string-set! vector-set!)

To implement uniform character arrays.

  byte.scm:89:  (string-set! new idx (integer->char (byte-ref bts idx))

Byte-vectors implemented from strings.

  chap.scm

CHAP:NEXT-STRING increments (in the lexicographic order) a copy of a
given string.

  format.scm:725: (string-set! format:fn-str format:fn-len c)
  format.scm:728: (string-set! format:en-str format:en-len c)
  format.scm:769: (string-set! format:en-str format:en-len c)
  format.scm:779: (string-set! format:fn-str i
  format.scm:785: (string-set! format:fn-str i #\0
  format.scm:794:  (string-set! format:fn-str (- i n) (string-ref format:fn-str i
  format.scm:805:  (string-set! format:fn-str 0 #\1)
  format.scm:810:  (string-set! format:fn-str i (integer->char
  format.scm:837:(string-set! format:fn-str format:fn-len #\0)
  format.scm:1631:  (string-set! cap-str i (char-downcase c))
  format.scm:1634:(string-set! cap-str i (char-upcase c)

One use is capitalizing a copy of a string.  The other occurrences are
harder to figure out.

  genwrite.scm:261:  (string-set! result k (string-ref str j))

Fills in the result of a call to MAKE-STRING.

  getparam.scm:109:(string-set! str i #\-))

Converting space to "-" in copy of a string.

  http-cgi.scm:

HTTP:READ-QUERY-STRING and CGI:READ-QUERY-STRING fill strings returned
by MAKE-STRING.

  lineio.scm:60:  (string-set! str i char)

READ-LINE! fills an argument string with characters read from a port.

  matfile.scm:129:   (string-set! namstr idx (read-char port)))

MATFILE:READ-MATRIX reads characters into a string returned by MAKE-STRING.

  printf.scm:

SPRINTF stores output into string passed as argument.

  sc2.scm:

SUBSTRING-MOVE-LEFT!, SUBSTRING-MOVE-RIGHT!, and SUBSTRING-FILL!

  sc4opt.scm:35:  (string-set! s i obj)))

STRING-FILL!

  scanf.scm:105:  (string-set! str i (read-input-char
  scanf.scm:225: (string-set! str i (read-input-char)

Filling string made by MAKE-STRING.

  strcase.scm:

STRING-UPCASE!, STRING-DOWNCASE!, and STRING-CAPITALIZE!

  strport.scm:39:(string-set! buf i c)

CALL-WITH-OUTPUT-STRING fills string created by MAKE-STRING or
extended with STRING-APPEND.

  transact.scm:116: (string-set! str idx (read-char iport))
  transact.scm:121:(string-set! name idx (read-char iport))

WORD-LOCK:CERTIFICATE fills result of MAKE-STRING with characters read
from a port.

  xml-parse.scm:263:  (string-set! buffer i c)
  xml-parse.scm:313:(string-set! buffer i c)
  xml-parse.scm:331:(string-set! buffer i c)
  xml-parse.scm:349:(else (string-set! buffer idx chr))

Seems to be filling result of MAKE-STRING with read characters.

  yasyn.scm:27:   (cons string-ref string-set!))

Not a clue.



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Lynn Winebarger
On Sat, Sep 19, 2009 at 1:24 AM, John Cowan  wrote:
> This is a proposal for the removal of string-set! (and consequently
> string-fill!) from the R7RS small Scheme language.  I am publishing this

Your proposal should be broken into two parts.  The first is abandoning
the representation of strings as character vectors in favor of data
structures that support faster string operations.  The second is the
mandate of immutability.  These are two distinct and independent
proposals.

A better reason for string-copy on an immutable string is to obtain
a mutable string.   You can't rule out that some algorithm might
find it useful to share computations about the internal components
of a string.

Lynn



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Abdulaziz Ghuloum

On Sep 19, 2009, at 8:24 AM, John Cowan wrote:

> 3) If strings are immutable, it's possible to have both fast O(1)
> access to individual characters or substrings, and fairly
> space-efficient representation of full Unicode strings, by using
> different representations for strings drawn from different character
> repertoires.  For example, an implementation might use 8-bit code
> units when all characters are less than #x100, 16-bit code units when
> all characters are less than #x10000, and 32-bit code units otherwise.

John,

For how many years have you been arguing for "space efficient"
internal representations and no one has been listening?  Do you
know why?  [Hint: it's not because implementors don't care about
space efficiency]

Aziz,,,



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Aubrey Jaffer
 | From: Thomas Lord 
 | Date: Sat, 19 Sep 2009 15:49:47 -0700
 | 
 | On Sat, 2009-09-19 at 01:24 -0400, John Cowan wrote:
 | []
 | > 2) Algorithms where you want to modify strings in the middle are rare,
 | 
 | Claims like this always make the hairs on the back of my 
 | neck stand up.  There are two problems with them.
 | First, there isn't a really good empirical way to establish
 | such claims.   Second, rarity per se, is not the most 
 | important consideration.
 | 
 | Surely if we scan the source for PLT or various Scheme
 | compilers, SLIB, and similar systems we'll find that 
 | string mutation is less common than immutable uses of
 | strings.   Well, I haven't actually *checked* but I wouldn't
 | be surprised.   Still, that tells us little about what
 | code there is that we can't see and won't hear about and it
 | tells us even less about what kinds of code people will 
 | want to write in 3 years.

Here are the occurrences of string-set! in SLIB:

  array.scm:375:  ((if (string? store) string-set! vector-set!)

Uniform character arrays.

  byte.scm:89:  (string-set! new idx (integer->char (byte-ref bts idx))

Byte-vectors implemented from strings.

  chap.scm

CHAP:NEXT-STRING increments (in the lexicographic order) a copy of a
given string.

  format.scm:725: (string-set! format:fn-str format:fn-len c)
  format.scm:728: (string-set! format:en-str format:en-len c)
  format.scm:769: (string-set! format:en-str format:en-len c)
  format.scm:779: (string-set! format:fn-str i
  format.scm:785: (string-set! format:fn-str i #\0
  format.scm:794:  (string-set! format:fn-str (- i n) (string-ref format:fn-str i
  format.scm:805:  (string-set! format:fn-str 0 #\1)
  format.scm:810:  (string-set! format:fn-str i (integer->char
  format.scm:837:(string-set! format:fn-str format:fn-len #\0)
  format.scm:1631:  (string-set! cap-str i (char-downcase c))
  format.scm:1634:(string-set! cap-str i (char-upcase c)
  genwrite.scm:261:  (string-set! result k (string-ref str j))
  getparam.scm:109:(string-set! str i #\-))
  http-cgi.scm:88:(string-set! str idx chr)
  http-cgi.scm:317:   (string-set! str idx chr)
  lineio.scm:60:  (string-set! str i char)
  matfile.scm:129:   (string-set! namstr idx (read-char port)))
  printf.scm:171:(string-set! res i
  printf.scm:175:  (string-set! res i #\0)
  printf.scm:573: (string-set! s cnt (string-ref x i))
  printf.scm:583:(string-set! s cnt (if (char? x) x #\?))
  sc2.scm:26:(string-set! string2 j (string-ref string1 i
  sc2.scm:33:(string-set! string2 j (string-ref string1 i
  sc2.scm:39:(string-set! string i char)))
  sc4opt.scm:35:  (string-set! s i obj)))
  scanf.scm:105:  (string-set! str i (read-input-char
  scanf.scm:225: (string-set! str i (read-input-char)
  strcase.scm:20:(string-set! str i (char-upcase (string-ref str i)
  strcase.scm:28:(string-set! str i (char-downcase (string-ref str i)
  strcase.scm:41:   (string-set! str i (char-downcase c))
  strcase.scm:44: (string-set! str i (char-upcase c
  strport.scm:39:(string-set! buf i c)
  transact.scm:116: (string-set! str idx (read-char iport))
  transact.scm:121:(string-set! name idx (read-char iport))
  vet.scm:53:  string-ci>? string-length string-ref string-set! string<=?
  xml-parse.scm:263:  (string-set! buffer i c)
  xml-parse.scm:313:(string-set! buffer i c)
  xml-parse.scm:331:(string-set! buffer i c)
  xml-parse.scm:349:(else (string-set! buffer idx chr))
  yasyn.scm:27:   (cons string-ref string-set!))



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Thomas Lord
On Sat, 2009-09-19 at 19:19 -0400, John Cowan wrote:
> > An implementation that implicitly, constantly, renormalizes all
> > strings -- even immutable strings --
> 
> Immutable strings don't need renormalization.  You normalize them when
> you create them, period.

Yes, but when you are building a string-like mutable
type, appending and taking substrings, suddenly you 
are renormalizing on every operation.

Surely you want "string-set", no?  (or,
if you prefer, "string-replace").

-t






Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Thomas Lord
As a kind of counter-proposal, a perhaps sane "core Scheme"
(a subset of "small Scheme") could offer:

1. *Some* low-level solution for syntactic abstraction
   (I won't dwell on my opinion of what exactly,
   since that would be controversial.)

2. *Some* low-level solution for reader extensions.

3. *Some* low-level but fairly abstract system of 
   ports and an environment in which certain resources
   (like libraries) can be "opened" to create a port.

4. Low-level vectors - both immutable and mutable.

5. Lambda.

6. A constructor for disjoint types that can wrap
   an arbitrary value.

7. Fixnums.  (I/O on low-level ports consumes and yields these.)

8. *Perhaps* mutable lambdas (objects which can be 
   applied but you can modify them to change which
   simple lambda is invoked when applying them).


That'd be about it.

Everything in Small Scheme can be *explained* fairly 
well in terms of those things.   For example, cons-pairs
are length-two vectors wrapped up as a disjoint type.
Flonums can be explained as a fixnum or pair of fixnums
wrapped up as a disjoint type.
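
As an illustration of the cons-pair case, here is a minimal sketch.
It uses define-record-type (SRFI 9 style) as a stand-in for the
disjoint-type constructor of item 6, which is an assumption; all of
the names (my-pair, wrap-pair, my-cons and so on) are hypothetical.

```scheme
(import (scheme base))

;; A disjoint type whose single field holds a length-two vector.
(define-record-type my-pair
  (wrap-pair slots)
  my-pair?
  (slots pair-slots))

(define (my-cons a d) (wrap-pair (vector a d)))
(define (my-car p) (vector-ref (pair-slots p) 0))
(define (my-cdr p) (vector-ref (pair-slots p) 1))

;; Mutation goes through the underlying mutable vector.
(define (my-set-car! p v) (vector-set! (pair-slots p) 0 v))
```

Because my-pair is disjoint from vectors, (vector? (my-cons 1 2)) is
false even though a vector sits underneath.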

Real "core Scheme" code can actually implement those
familiar types in that way.   In a truly minimalist implementation
that would actually be potentially useful.   As a semantic
model, it would be useful.

Of course, most implementations would natively implement
many more types and features than what I described.
But the specification for those additional types and
features could be expressed quite precisely as core
Scheme code.

There would be less pressure, in this kind of approach,
to haggle over questions like "Strings: mutable or not?"
We can define both.  We can treat the traditional string
operators as generics that can work on either.  We can
quibble over exactly which ones are *required* in Small
Scheme but also enjoy that Small Scheme supports either
one.

-t



On Sat, 2009-09-19 at 01:24 -0400, John Cowan wrote:
> This is a proposal for the removal of string-set! (and consequently
> string-fill!) from the R7RS small Scheme language.  I am publishing this
> document to invite wide comment.  There is nothing official about it.
> I very gratefully acknowledge the kind help of Alex Shinn, who provided
> the topic sentences for most of the paragraphs below.  However, I retain
> sole responsibility for this document, including all errors.
> 
> I believe that despite the prescription of the draft WG1 charter that
> no features of IEEE Scheme (a subset of R4RS) should be removed from
> R7RS small Scheme, an exception should be made for string-set!, for at
> least the following reasons:
> 
> 1) Immutable strings are more purely functional, and allow many
> optimizations, such as being transparently and freely shareable between
> procedures and between threads without concern for uncontrolled mutation.
> For this and other reasons, the general trend in new languages/runtimes
> such as Java and C# is toward immutable strings; unfortunately, this is
> the kind of argument that Schemers usually don't like, so I won't bother
> mentioning it.  :-)
> 
> 2) Algorithms where you want to modify strings in the middle are rare,
> and many of the classic devices (such as string-upcase!, a procedure that
> mutates a string in place) are awkward or impossible with representations
> that make use of characters of variable length such as UTF-8.  Typical
> string algorithms want to also be able to do insertions and deletions,
> which are not directly possible with classical Scheme strings.  Better
> representations such as trees of immutable strings do allow such changes,
> as well as making string appends O(n) in the number of strings rather
> than in the sum of their lengths.
> 
> 3) If strings are immutable, it's possible to have both fast O(1)
> access to individual characters or substrings, and fairly space-efficient
> representation of full Unicode strings, by using different representations
> for strings drawn from different character repertoires.  For example,
> an implementation might use 8-bit code units when all characters are
> less than #x100, 16-bit code units when all characters are less than
> #x10000, and 32-bit code units otherwise.
> 
> Unfortunately, mutating even a single character in such a representation
> may require the entire string to be copied, which means that it also
> requires indirection through a separate header that can be redirected
> to point to the newly allocated code unit sequence.  Immutable strings
> can just *be* their sequences, with a few extra bits indicating the
> size of the code units, although this design does prevent easy sharing
> of substrings.
> 
> 4) As currently designed, strings are functionally just vectors of
> characters.  In an 8-bit world, using the traditional representation
> of strings carries a 4:1 storage advantage, making it worthwhile
> to distinguish them clearly from general vectors.  But 21-bit Unicode
> characters are a much better fit, if represented as immediate (unboxed)
> v

Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread John Cowan
Thomas Lord scripsit:

> This is true as well for immutable cons pairs and immutable vectors.

Nothing I have said should give you the notion that I am opposed to
those things.  I do hold, however, that strings are more nearly atomic
in actual use than either lists or vectors.

> It is the case that a tiny implementation of small Scheme - one small
> in footprint and simple in implementation - is unlikely to have much
> by the way of sophisticated optimizations.

Check out Chibi Scheme sometime.  Naive beyond belief in some ways,
intensely sophisticated in others.

> Rarity is not an especially compelling argument.  More important is
> *importance*.   The question is less "how often do I need to reach
> for string mutation?" so much as the question is "how painful is it
> if when I want string mutation I can't have it?".

You can have it if you're willing to tolerate indirection.  Especially if
not having it makes the rest of your string-slinging faster and better.

> Specifying only immutable strings will not save programs the kind of
> expenses you are talking about.  It will only give programs fewer ways
> in which to deal with them.

By definition, removing a feature means less flexibility.  (Note well,
I am *not* arguing that conformant small Schemes *remove* or *lack*
string-set!.)  The question is, is the flexibility worth the cost?

In early Fortran, it was possible to change the value of a numeric
constant by passing it by reference to a procedure.  It was quickly
discovered that allowing a random subset of the literal 6s in a program
(perhaps all) changed to 20s was a flexibility not worth having.

> An implementation that implicitly, constantly, renormalizes all strings --
> even immutable strings --

Immutable strings don't need renormalization.  You normalize them when
you create them, period.

> Surely there will be no wise way to write code which makes such a use
> of string-copy in a meaningfully portable way.

Probably not.  It's an implementation hack, a concession to the
limitations of GCs.  Java survives this problem just fine, even without
special GCs.

-- 
John Cowan   http://www.ccil.org/~cowan   [email protected]
No, John.  I want formats that are actually useful, rather than
over-featured megaliths that address all questions by piling on
ridiculous internal links in forms which are hideously over-complex.
        --Simon St. Laurent on xml-dev



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Thomas Lord
On Sat, 2009-09-19 at 01:24 -0400, John Cowan wrote:
[]
> 1) Immutable strings are more purely functional, and allow many
> optimizations, such as being transparently and freely shareable between
> procedures and between threads without concern for uncontrolled mutation.

This is true as well for immutable cons pairs
and immutable vectors.

It is also true that these immutable representations
prevent many optimizations so there is a bit of a 
toss-up between the two.

It is the case that a tiny implementation of small
Scheme - one small in footprint and simple in implementation - 
is unlikely to have much by the way of sophisticated
optimizations.   

If we're going to talk about optimizations for a 
small dialect of Scheme, I think we have to talk about
an implementation class that is somewhere in the middle:
it might not (except through libraries) offer a "big Scheme"
environment but aims to provide a rich environment with a
sophisticated implementation.

Those considerations lead me to the conclusion we
should really have both mutable and immutable strings.
(And mutable and immutable variants on several other
types, as well.)


> 2) Algorithms where you want to modify strings in the middle are rare,

Claims like this always make the hairs on the back of my 
neck stand up.  There are two problems with them.
First, there isn't a really good empirical way to establish
such claims.   Second, rarity per se, is not the most 
important consideration.

Surely if we scan the source for PLT or various Scheme
compilers, SLIB, and similar systems we'll find that 
string mutation is less common than immutable uses of
strings.   Well, I haven't actually *checked* but I wouldn't
be surprised.   Still, that tells us little about what
code there is that we can't see and won't hear about and it
tells us even less about what kinds of code people will 
want to write in 3 years.

Rarity is not an especially compelling argument.  More 
important is *importance*.   The question is less "how often
do I need to reach for string mutation?" so much as the
question is "how painful is it if when I want string mutation
I can't have it?".

Complex numbers are an example of a feature whose use
is (probably) relatively rare.  Yet their absence when
you want them can be quite painful.



> 3) If strings are immutable, it's possible to have both fast O(1)
> access to individual characters or substrings, and fairly space-efficient
> representation of full Unicode strings, by using different representations
> for strings drawn from different character repertoires. [...]

Yes, but this is also true of mutable strings.  You
argue otherwise thusly:

> Unfortunately, mutating even a single character in such a representation
> may require the entire string to be copied, which means that it also
> requires indirection through a separate header that can be redirected
> to point to the newly allocated code unit sequence.  Immutable strings
> can just *be* their sequences, with a few extra bits indicating the
> size of the code units, although this design does prevent easy sharing
> of substrings.


In the absence of mutation, when people want to implement
"a string like thing whose contents and length can 
change over time" they will: (a) use an indirect header;
(b) copy strings or construct new "ropes" on *every* mutation,
not just the ones that change the maximum code-point size
in the string.

Specifying only immutable strings will not save programs the
kind of expenses you are talking about.  It will only give 
programs fewer ways in which to deal with them.
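
To make points (a) and (b) concrete, here is a rough sketch of such
a string-like thing built over strings that are never modified once
created.  The record name mstring and its procedures are
hypothetical, and define-record-type (SRFI 9 style) is assumed.

```scheme
(import (scheme base))

;; (a) The indirect header: one mutable field that points at a
;; string treated as immutable.
(define-record-type mstring
  (make-mstring contents)
  mstring?
  (contents mstring-contents set-mstring-contents!))

;; (b) Each "mutation" builds a whole new string and redirects the
;; header, so a single-character update costs O(n).
(define (mstring-set! ms k c)
  (let ((s (mstring-contents ms)))
    (set-mstring-contents!
     ms
     (string-append (substring s 0 k)
                    (string c)
                    (substring s (+ k 1) (string-length s))))))
```

A rope of immutable pieces would reduce the copying, but the header
indirection stays either way.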



> 4) As currently designed, strings are functionally just vectors of
> characters.  In an 8-bit world, using the traditional representation
> of strings carries a 4:1 storage advantage, making it worthwhile
> to distinguish them clearly from general vectors.  But 21-bit Unicode
> characters are a much better fit, if represented as immediate (unboxed)
> values, for general vectors using 32-bit pointers.  Granted that not all
> small Scheme systems will provide full Unicode support, general vectors
> start to look much less expensive than they once were.  In short: if
> you want something that behaves like a vector of characters, simply use
> a general vector that contains characters.


This bit really confuses me because there is nothing
in your arguments for the immutability of strings that
would appear not to apply to vectors as well.   This is
especially the case if we contemplate "homogeneous vectors"
over types that may have (like codepoints) a variable-width
representation.

Moreover, using a heterogeneous vector representation for
a mutable string has the distinct disadvantage that, in any
natural representation, each character in the vector must
carry a type tag and no string algorithm can assume that
the string contains only characters.



> 5) Making strings immutable also permits a design in which all strings
> are Unicode-normalized.  Though this has its own costs (for example,
> 

Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread John Cowan
Brian Mastenbrook scripsit:

> I believe that this is expressible in terms of weak pointers and  
> finalizers (both of which are supported by the Boehm collector).[...]

I grant that this technique works, but it's very hairy; I suspect
it would cost more in time (finalizers are expensive) than it would
save in space.

> I'm not proposing that immutable strings be interned, but I don't
> think any promises of identity should be made. In particular,
> `make-string' and other string-building objects should be free to
> return the same (eqv?) object given the same inputs. Some string
> representations may not even have an identity: on 64-bit systems, it
> may be natural to provide an immediate representation for strings of
> one or two code points (or more, if a variable-width encoding is used).

That's quite true.  I'll have to consider what I think about that.

-- 
You know, you haven't stopped talking   John Cowan
since I came here. You must have been   http://www.ccil.org/~cowan
vaccinated with a phonograph [email protected]
--Rufus T. Firefly



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread John Cowan
Joe Marshall scripsit:

> Suppose there were a standard mechanism of `deprecating' some language
> feature.

Most standards bodies do have such a concept.  Of course, the deprecated
features are rarely removed: Hollerith constants have been deprecated in
Fortran since 1966, but many compilers still provide them for compiling
dusty decks.

Note that 13 features were removed between R2RS and R3RS, and 18 more
between R3RS and R4RS.  3 features were removed completely between
R5RS and R6RS, and 9 features exiled to the R5RS compatibility library,
which is not imported by default.

> There's an asymmetry here.  A new, untested feature is usually easy to add
> because it is unlikely to break anything that already exists.

All too fatally easy.  I have currently only 4 bound identifiers that
don't correspond to something in R5RS, R6RS, or a widely implemented SRFI.
(I renamed a few.)  Those are case-folding, rational-expt, string->vector,
and vector->string.

> But an existing feature is impossible to remove because it is unknown
> if it will break existing code.

Hence the Hollerith strings:  6HABCDEF for 'ABCDEF'.

> It's easy to standardize on a new feature if everyone has independently
> added it as an extension.  The consensus is obvious.

Hence most of my list.

-- 
John Cowan  [email protected]  http://ccil.org/~cowan
The competent programmer is fully aware of the strictly limited size of his own
skull; therefore he approaches the programming task in full humility, and among
other things he avoids clever tricks like the plague.  --Edsger Dijkstra



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Jeff Bezanson
Ray Dillinger wrote:
> The point about strings having formerly enjoyed a 4:1 storage
> advantage (8 vs. 32 bits) and 21-bit characters being a better fit to 
> 32-bit words?  Firstly, irrelevant.  Small scheme should
> be about semantics, not about hardware.  Secondly, incorrect. The 
> primary encoding used by Operating Systems and underlying libraries is 
> UTF8, which still enjoys a 4:1 storage advantage
> for most strings.  Thirdly, it's growing more incorrect. Since new 
> computers these days use 64-bit pointers, the advantage of packed UTF8 
> strings over general vectors is rapidly shifting to
> 8:1.   

This is all true, and I'm a big fan of UTF8, but string-set! is
especially difficult (and inefficient) to provide for UTF8 strings.
Wanting the space advantages of UTF8 is a reason to prefer immutable
strings.
I think this is what John's point meant: that implementations with
Unicode and a fast string-set! would probably be using 32 bits (or 16
bits) per character.
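
A tiny sketch of why in-place replacement is awkward for UTF8: the
encoded width depends on the character.  This assumes an
implementation with full Unicode characters and the R7RS-style
string->utf8; utf8-width is a hypothetical helper name.

```scheme
(import (scheme base))

;; Number of octets needed to encode the character c as UTF-8.
(define (utf8-width c)
  (bytevector-length (string->utf8 (string c))))

;; (utf8-width #\a) is 1 while (utf8-width #\x3bb) is 2, so storing
;; U+03BB over an "a" in a packed UTF-8 buffer would have to shift
;; every later octet, and may require reallocating the buffer.
```

A fixed-width representation avoids the shifting, at the cost of the
space advantage being discussed.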



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Brian Mastenbrook

On Sep 19, 2009, at 1:37 PM, John Cowan wrote:

>> I shouldn't have to use `string-copy' for this; my implementation
>> should do it for me. If there's no user-exposed backpointer from the
>> substring to the original string, the GC can dispose of the original
>> string and copy out the retaining displaced substrings when it makes
>> sense to do so.
>
> Nevertheless, few extant GCs do so, notably not the Boehm
> conservative GC.
> So string-copy is a kludge, but a useful kludge.

I believe that this is expressible in terms of weak pointers and  
finalizers (both of which are supported by the Boehm collector). Each  
non-displaced string object holds a weak pointer to any displaced  
substrings, and the actual string storage object is separately  
allocated and pointed to by all displaced substrings. When the  
original string object is reaped, its finalizer has the opportunity to  
decide whether to keep the original string storage object around in  
its entirety, to trim or split the original string storage object, or  
to copy out the displaced substrings and turn them into non-displaced  
strings. Any time a displaced substring object is reaped, this process  
can be re-run.
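
Weak pointers and finalizers are not part of any Scheme standard, so the
following is only a sketch of the representation and of the copy-out step that
such a finalizer might perform, assuming R7RS (scheme base).  The slice record
and the procedures slice-of, slice->string and slice-detach are hypothetical
names introduced for illustration.

```scheme
(import (scheme base))

;; A displaced substring: a view onto separately allocated storage.
(define-record-type slice
  (make-slice storage start end)
  slice?
  (storage slice-storage)
  (start slice-start)
  (end slice-end))

;; O(1) displaced substring.  It shares STR, so the whole of STR stays
;; reachable for as long as the slice does.
(define (slice-of str start end)
  (make-slice str start end))

;; Copy the viewed characters out into a fresh string.
(define (slice->string s)
  (substring (slice-storage s) (slice-start s) (slice-end s)))

;; Detach: return an equivalent slice whose storage holds only the
;; characters in use, so the original large storage can be collected.
(define (slice-detach s)
  (let ((copy (slice->string s)))
    (make-slice copy 0 (string-length copy))))
```

Calling slice-detach by hand plays the same role as the string-copy kludge; the
proposal above amounts to letting the collector decide when to run it.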

>> There's really no sense to providing a copy operation for an
>> immutable type.
>
> That's rather strong.  The proposed strings are immutable but not
> interned; two strings that are equal? may still not be eqv?.

I'm not proposing that immutable strings be interned, but I don't
think any promises of identity should be made. In particular,
`make-string' and other string-building procedures should be free to
return the same (eqv?) object given the same inputs. Some string
representations may not even have an identity: on 64-bit systems, it
may be natural to provide an immediate representation for strings of
one or two code points (or more, if a variable-width encoding is used).
--
Brian Mastenbrook
[email protected]
http://brian.mastenbrook.net/



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Joe Marshall
Just to take this up one level.

Suppose there were a standard mechanism of `deprecating' some language feature.
The mechanism would warn you that the feature is discouraged, but continue to
permit use.  Let's pretend that `deprecation' has been in the language since
R3RS and nearly every Scheme supported it.
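
No such mechanism is standardized, but one possible shape for it can be
sketched with syntax-rules, assuming R7RS (scheme base) and (scheme write).
The macro name define-deprecated and the wrapper name deprecated-string-set!
are hypothetical.

```scheme
(import (scheme base) (scheme write))

;; (define-deprecated name impl) binds NAME to a procedure that prints a
;; warning on the error port and then calls IMPL with the same arguments,
;; so the feature is discouraged but still permitted.
(define-syntax define-deprecated
  (syntax-rules ()
    ((_ name impl)
     (define (name . args)
       (let ((port (current-error-port)))
         (display "warning: " port)
         (write 'name port)
         (display " is deprecated" port)
         (newline port))
       (apply impl args)))))

;; Hypothetical use: keep string mutation available, with a warning.
(define-deprecated deprecated-string-set! string-set!)

(define s (make-string 3 #\a))
(deprecated-string-set! s 1 #\b)   ; warns on the error port, then mutates s
```

A real mechanism would more plausibly warn once at compile or expansion time
rather than on every call; the run-time warning here only keeps the sketch
short.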

Now we fast forward to, say, 2012.  Nearly every implementation of Scheme has
deprecated `string-set!' and perhaps `set-car!' and `set-cdr!'.  Now, instead
of arguing about removing these from the language, it's pretty much a
no-brainer.

There's an asymmetry here.  A new, untested feature is usually easy to add
because it is unlikely to break anything that already exists.  But an existing
feature is impossible to remove because it is unknown if it will break
existing code.  It's easy to standardize on a new feature if everyone has
independently added it as an extension.  The consensus is obvious.  Old
features don't have a mechanism to become moribund.



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread John Cowan
This is a consolidated response to a few points in people's messages:

Peter Bex scripsit:

> It makes a whole lot of sense for ASCII-only implementations (which
> should be allowed by Thing One, I believe ~everyone agrees on) to
> supply a string-set!.  Why wouldn't they, it's a cheap operation in
> such an environment.

Even in an ASCII-only Scheme, there are still the same concerns as
before with sharing, either between procedures or between threads.
With any mutable object, the question "Who owns it?" must be asked
and answered.

Shiro Kawai scripsit:

> Gauche has deprecated string mutation from the beginning (it costs O(n)
> for every string-set! [...])

Wow.

> I remember a couple of times I rewrote the part using string-set! with
> string ports when porting third-party libraries.   I haven't yet seen
> a nontrivial usage of string-set! that can't easily be replaced by
> string ports or by a vector that is then converted to a string.

That is one reason I want string ports to be a core part of Thing One,
even if file ports are not present.

> It is a valid point.  However, I'm optimistic that it can be solved by
> a "text srfi" that implements generic mutable text on top of mutable
> strings.

I assume that you mean "immutable strings".

Taylor Campbell has sketched just such a library at
http://mumble.net/~campbell/proposals/new-text.txt .  His version is
a bit too general for my taste (he allows that the primitive unit of
strings may *not* be characters, but bytes or something between), but
it gives a good indication of what one might want.

Brian Mastenbrook scripsit:

> I'll add another reason for getting rid of mutable strings, as well as a  
> rejoinder to reason #4:

I've used this to replace my reason #4, and also to supplement reason #2,
which now talks of gap buffers as well as ropes.

> I shouldn't have to use `string-copy' for this; my implementation should  
> do it for me. If there's no user-exposed backpointer from the substring to  
> the original string, the GC can dispose of the original string and copy  
> out the retaining displaced substrings when it makes sense to do so.  

Nevertheless, few extant GCs do so, notably not the Boehm conservative GC.
So string-copy is a kludge, but a useful kludge.

> There's really no sense to providing a copy operation for an immutable
> type.

That's rather strong.  The proposed strings are immutable but not interned;
two strings that are equal? may still not be eqv?.

-- 
John Cowan    http://www.ccil.org/~cowan    [email protected]
Is not a patron, my Lord [Chesterfield], one who looks with unconcern
on a man struggling for life in the water, and when he has reached
ground encumbers him with help?  --Samuel Johnson



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread John Cowan
Ray Dillinger scripsit:

> If we want to drop a feature that basic, then surely we should first
> deprecate it and move it to a library (R7) and second make the library
> optional (R8) so that its eventual absence does not come as a shock
> to users.

As noted, that's what R6 did.

> In the case of mutation operators in particular, people will just
> write a "portable string-set!" library in terms of vectors or lists,

I took a look at the SRFI-13 reference implementation to see what use it
makes of string-set!, and the use is basically only for creating strings
a little at a time.  In general, the code allocates fixed-length strings
and then uses string-set! to construct them.  In such circumstances, the
general Builder pattern works well: have a mutable builder object which
can be configured, and then a build operator that constructs from it an
immutable object for all further use (the builder can then be discarded).

> and then we will have many incompatible kinds of incompatible strings
> to go with our many incompatible object systems.

To discourage such proliferation, it would be sensible to add
string->vector and vector->string to the small Scheme core.  This follows
the Builder pattern, with the former procedure constructing the builder
and the latter constructing the immutable final result.
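
As a minimal sketch of that pattern, assuming string->vector and
vector->string behave as they do in R7RS (scheme base), a one-character update
can be written without string-set!.  The procedure name string-with-char is
hypothetical.

```scheme
(import (scheme base))

;; Return a new string equal to STR except that position K holds CH.
;; The vector is the mutable builder; the string is the final product.
(define (string-with-char str k ch)
  (let ((builder (string->vector str)))   ; construct the builder
    (vector-set! builder k ch)            ; configure it by mutation
    (vector->string builder)))            ; build the finished string

(string-with-char "cat" 0 #\b)            ; returns "bat"; "cat" is untouched
```

Any number of vector-set! calls can go between the two conversions, so the two
copies are paid once per string built, not once per character changed.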

> The point about strings having formerly enjoyed a 4:1 storage
> advantage (8 vs. 32 bits) and 21-bit characters being a better fit
> to 32-bit words?  Firstly, irrelevant.  [...] Secondly, incorrect.
> [...]  Thirdly, it's growing more incorrect.  [...]

Quite right on all points.  I have removed this argument from the
living version of the document, accessible from the link at the
bottom of http://tinyurl.com/thing-one .

From a later posting:

> A valid abstraction from bytevectors to strings could be made; strings
> can be implemented correctly in terms of bytevectors, enforcing a
> choice of representation by fiat.

Indeed.

> We could move all the character and string operations to libraries
> and make bytevectors primitive

The only reason blobs aren't part of the core in my small Scheme
proposals is that they *can* be emulated with general vectors.  I would
hope that a small Scheme implementation would support them directly,
though, and I intend to write a SRFI about them.

> (and I'd recommend it if I could figure out how to make the _syntax_
> for string and character literals importable from a library).

I've struggled with that too.  See my proposal for semi-scoped extensible
syntax; it's linked at the bottom of http://tinyurl.com/thing-one .

> But abstracting in the other direction, as we had been doing before
> bytevectors came along in R6 and as a lot of older code still does,
> is invalid in the presence of choices about how to represent strings.

Absolutely.  Even if there was only one known encoding of characters
(which has never been true, ever), it would be data punning: bytes are
not really the same semantic domain as characters.

-- 
John Cowan    http://www.ccil.org/~cowan    [email protected]
If you have ever wondered if you are in hell, it has been said, then
you are on a well-traveled road of spiritual inquiry.  If you are
absolutely sure you are in hell, however, then you must be on the
Cross Bronx Expressway.  --Alan Feuer, NYTimes, 2002-09-20



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Ray Dillinger
On Sat, 2009-09-19 at 01:49 -1000, Shiro Kawai wrote:

> We *should* use bytevectors for systems programming when we
> need a mutable, fixed-length byte buffer.  How many
> confusions and bugs have arisen because the C API conflates strings
> and bytevectors into char[]?   We should learn from that.

When we want a buffer of bytes for systems programming purposes,
and we have to care about things like binary representation or 
encoding, we are manifestly not using that buffer as a string.
Since the word "string" has bytevectorish overloadings from 
languages like C, I'll amplify: we are not using that buffer as
text that represents characters in human languages.  Those 
human-language characters can be mapped to binary in any way 
at all, and if that doesn't matter because they still represent
the same characters, then it's text because the characters that 
it represents are primary.  If it does matter, then it's bytes 
instead and its meaning, if any, as characters has to be 
regarded as just a coincidence.

By assuming ascii representation, we were accustomed, in R5RS 
scheme, to doing a bunch of binary stuff like reading JPG files, 
etc, in terms of strings and character input - but that code 
breaks horribly if the character representation changes, and 
the "abstraction" from strings to bytevectors is a false one.
Bytevectors were new in R6RS, and badly needed for such 
operations *especially* since R6 mandates unicode.

A valid abstraction from bytevectors to strings could be made;
strings can be implemented correctly in terms of bytevectors, 
enforcing a choice of representation by fiat.  We could move 
all the character and string operations to libraries and 
make bytevectors primitive (and I'd recommend it if I could 
figure out how to make the _syntax_ for string and character 
literals importable from a library).  But abstracting in the 
other direction, as we had been doing before bytevectors came 
along in R6 and as a lot of older code still does, is invalid 
in the presence of choices about how to represent strings.  

Bear








Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Brian Mastenbrook
On Sat, 19 Sep 2009 00:24:58 -0500, John Cowan wrote:

> This is a proposal for the removal of string-set! (and consequently
> string-fill!) from the R7RS small Scheme language.  I am publishing this
> document to invite wide comment.  There is nothing official about it.
> I very gratefully acknowledge the kind help of Alex Shinn, who provided
> the topic sentences for most of the paragraphs below.  However, I retain
> sole responsibility for this document, including all errors.
>
> I believe that despite the prescription of the draft WG1 charter that
> no features of IEEE Scheme (a subset of R4RS) should be removed from
> R7RS small Scheme, an exception should be made for string-set!, for at
> least the following reasons:

[Snipped a list of points, most of which I agree with]

> 4) As currently designed, strings are functionally just vectors of
> characters.  In an 8-bit world, using the traditional representation
> of strings carries a 4:1 storage advantage, making it worthwhile
> to distinguish them clearly from general vectors.  But 21-bit Unicode
> characters are a much better fit, if represented as immediate (unboxed)
> values, for general vectors using 32-bit pointers.  Granted that not all
> small Scheme systems will provide full Unicode support, general vectors
> start to look much less expensive than they once were.  In short: if
> you want something that behaves like a vector of characters, simply use
> a general vector that contains characters.

I don't think there's any point to using a general vector of characters as  
a replacement for mutable strings.

I'll add another reason for getting rid of mutable strings, as well as a  
rejoinder to reason #4:

6) There's no general utility to `string-set!' without also the ability to  
insert and delete characters in a string. Programs that work with text  
generally want either an immutable string or an editable string into which  
characters can be inserted and deleted. Editable strings are typically  
represented as gap buffers. It's possible to use mutable strings to build  
up an editable string representation, but this is not evidence of general  
utility of fixed-length mutable strings in my opinion.
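
For readers unfamiliar with the structure, here is a minimal gap-buffer
sketch, assuming R7RS (scheme base).  All names (gap-buffer, gb-move!,
gb-insert!, gb-delete!, gb->string) are hypothetical, the buffer has a fixed
capacity, positions are not range-checked, and a general vector is used only
for brevity; an implementation aiming for compactness would use packed
storage.

```scheme
(import (scheme base))

;; Text lives in BUF on both sides of an unused gap [gap-start, gap-end).
(define-record-type gap-buffer
  (%make-gap-buffer buf gap-start gap-end)
  gap-buffer?
  (buf gb-buf)
  (gap-start gb-gap-start set-gb-gap-start!)
  (gap-end gb-gap-end set-gb-gap-end!))

(define (make-gap-buffer capacity)
  (%make-gap-buffer (make-vector capacity #\space) 0 capacity))

;; Slide the gap, one character at a time, until it starts at POS.
(define (gb-move! gb pos)
  (let ((buf (gb-buf gb)))
    (let loop ()
      (cond ((< pos (gb-gap-start gb))
             ;; Move the character just before the gap to just after it.
             (set-gb-gap-start! gb (- (gb-gap-start gb) 1))
             (set-gb-gap-end! gb (- (gb-gap-end gb) 1))
             (vector-set! buf (gb-gap-end gb)
                          (vector-ref buf (gb-gap-start gb)))
             (loop))
            ((> pos (gb-gap-start gb))
             ;; Move the character just after the gap to just before it.
             (vector-set! buf (gb-gap-start gb)
                          (vector-ref buf (gb-gap-end gb)))
             (set-gb-gap-start! gb (+ (gb-gap-start gb) 1))
             (set-gb-gap-end! gb (+ (gb-gap-end gb) 1))
             (loop))))))

;; Insert CH so that it becomes the character at position POS.
(define (gb-insert! gb pos ch)
  (when (= (gb-gap-start gb) (gb-gap-end gb))
    (error "gap buffer full (growth is omitted in this sketch)"))
  (gb-move! gb pos)
  (vector-set! (gb-buf gb) (gb-gap-start gb) ch)
  (set-gb-gap-start! gb (+ (gb-gap-start gb) 1)))

;; Delete the character at position POS by widening the gap over it.
(define (gb-delete! gb pos)
  (gb-move! gb pos)
  (when (= (gb-gap-end gb) (vector-length (gb-buf gb)))
    (error "nothing to delete at this position"))
  (set-gb-gap-end! gb (+ (gb-gap-end gb) 1)))

;; Produce an ordinary string from the text on both sides of the gap.
(define (gb->string gb)
  (let ((buf (gb-buf gb)))
    (string-append
     (vector->string buf 0 (gb-gap-start gb))
     (vector->string buf (gb-gap-end gb) (vector-length buf)))))
```

Repeated edits near the same position only touch the gap boundary, which is
why insertion and deletion are cheap for typical editing patterns.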

Nothing will give you immutable strings if you don't already have them. On  
the other hand, if the core Scheme strings were immutable and an editable  
strings library were provided, existing users of mutable strings could  
convert to using the latter representation with little effort, and writing  
programs which require an editable strings representation would be vastly  
simplified.

It's much easier for an implementation that uses a variable-width internal  
string representation to provide immutable strings and editable strings  
than to provide only mutable fixed-length strings. When represented as a  
gap buffer, editable strings retain the 4-to-1 or 8-to-1 compactness  
advantage of strings over general vectors of characters. The  
implementation of editable strings is not significantly more complex than  
the implementation of mutable fixed-length strings. Complex algorithms  
expressed in terms of `string-set!' can be rewritten in terms of insert  
and delete operations with a great increase in clarity.

> As a consequence of removing string-set!, string-fill! (not in IEEE
> Scheme) becomes impossible and string-copy less useful.  I do not propose
> to remove string-copy, however, because it can eliminate space leaks
> that are caused by taking a small shared substring of a large existing
> string: when the larger string should be GC'ed, it is retained as a
> whole because of the shared substring.  Using string-copy judiciously
> can prevent such leaks.

I shouldn't have to use `string-copy' for this; my implementation should  
do it for me. If there's no user-exposed backpointer from the substring to  
the original string, the GC can dispose of the original string and copy  
out the retaining displaced substrings when it makes sense to do so.  
Implementations which don't want to implement this level of GC complexity  
can make `substring' always copy. There's really no sense to providing a  
copy operation for an immutable type.
--
Brian Mastenbrook
[email protected]
http://brian.mastenbrook.net/



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Shiro Kawai
+1 for immutable strings.

From: Ray Dillinger
Subject: Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die
Date: Sat, 19 Sep 2009 02:41:45 -0700

> On Sat, 2009-09-19 at 01:24 -0400, John Cowan wrote:
> > This is a proposal for the removal of string-set! (and consequently
> > string-fill!) from the R7RS small Scheme language.  
> 
> 
> While I certainly think that string-set! is in fact unnecessary, 
> I do not agree that it should be immediately removed.  I'd support
> moving mutable-strings to a library rather than keeping them in the
> core language, but the world is not yet ready for scheme without 
> mutable strings.
[...]
> If we want to drop a feature that basic, then surely we should 
> first deprecate it and move it to a library (R7) and second 
> make the library optional (R8) so that its eventual absence 
> does not come as a shock to users. 

We've already cleared step 1; in R6RS string-set! and string-fill!
are *already* in a separate library (rnrs mutable-strings)
which is not imported if you just say (import (rnrs (6))).

> There is not a widespread scheme that has no string-set!, nor a 
> large body of scheme code that is known to run well without it. 

Right.  But Gauche has deprecated string mutation from the
beginning (it costs O(n) for every string-set!, since Gauche's
string is just a pointer to an immutable string body and
every mutation requires copying the body), and we've lived
happily in that world for years.   I remember a couple of times
I rewrote the part using string-set! with string ports when
porting third-party libraries.   I haven't yet seen a nontrivial
usage of string-set! that can't easily be replaced by
string ports or by a vector that is then converted to a string.
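
A minimal sketch of the string-port style of rewrite, assuming the string
ports of R7RS (scheme base) (open-output-string and get-output-string, as in
SRFI 6).  The procedure name string-replace-char is hypothetical and is not
taken from Gauche or from the ported libraries mentioned above.

```scheme
(import (scheme base))

;; Replace every occurrence of OLD in STR with NEW by writing the result
;; to a string port, instead of copying STR and calling string-set!.
(define (string-replace-char str old new)
  (let ((out (open-output-string)))
    (string-for-each
     (lambda (ch)
       (write-char (if (char=? ch old) new ch) out))
     str)
    (get-output-string out)))

(string-replace-char "a-b-c" #\- #\_)   ; returns "a_b_c"
```

Because the port accumulates output sequentially, the same shape also covers
rewrites that change the length of the string, which string-set! cannot do.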

> In the case of mutation operators in particular, people
> will just write a "portable string-set!" library in terms of 
> vectors or lists, and then we will have many incompatible kinds
> of incompatible strings to go with our many incompatible object 
> systems.

It is a valid point.  However, I'm optimistic that it can
be solved by a "text srfi" that implements generic mutable
text on top of mutable strings.  In particular, such a library
would allow length-changing mutation.

> Thirdly, it's growing more incorrect. Since new computers
> these days use 64-bit pointers, the advantage of packed UTF8
> strings over general vectors is rapidly shifting to 8:1. 

Good point.  This also supports using a packed UTF-8 string
representation (or several different representations chosen
according to the characters, e.g. ASCII-only, 2 octets per char,
etc.), which favors immutable strings.


From: Peter Bex
Subject: Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die
Date: Sat, 19 Sep 2009 13:01:25 +0200

> Agreed. It makes a whole lot of sense for ASCII-only implementations
> (which should be allowed by Thing One, I believe ~everyone agrees on)
> to supply a string-set!.  Why wouldn't they, it's a cheap operation in
> such an environment.  Also, ASCII-only schemes would probably be the more
> minimalistic ones, which likely are used in systems programming, where
> string-set! can still be quite a useful operation.

We *should* use bytevectors for systems programming when we
need a mutable, fixed-length byte buffer.  How many
confusions and bugs have arisen because the C API conflates strings
and bytevectors into char[]?   We should learn from that.

--shiro



Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Peter Bex
On Sat, Sep 19, 2009 at 11:34:57AM +0100, Alaric Snell-Pym wrote:
> > If we want to drop a feature that basic, then surely we should
> > first deprecate it and move it to a library (R7) and second
> > make the library optional (R8) so that its eventual absence
> > does not come as a shock to users.
> 
> I think that's basically the long-term of what John's suggesting, by
> removing it from the Thing One core; the only point of contention here
> is whether it's an *optional* library or not...

I agree this is the important point of the discussion.

> > Scheme has always had as one of its strengths the fact that you
> > can use it to express many different paradigms of programming,
> > and I think that taking steps to *reduce* its value for one or
> > more paradigms is a mistake.
> 
> The thing is, the sheer possibility of string-set! makes other string
> operations consume more time/space resources, which reduces their
> value. It's a choice between two costs which have different effects
> for different situations.
> 
> I'd like to see:
> 
> 1) Some implementations with immutable strings
> 2) Some implementations with mutable strings
> 3) Some implementations that give you both, where all string
> operations that don't mutate work seamlessly on both, with explicit
> make-mutable-string and make-immutable-string operations, and a
> parameter that selects whether newly created strings are one or the
> other (from read, *->string, etc).

Agreed. It makes a whole lot of sense for ASCII-only implementations
(which should be allowed by Thing One, I believe ~everyone agrees on)
to supply a string-set!.  Why wouldn't they, it's a cheap operation in
such an environment.  Also, ASCII-only schemes would probably be the more
minimalistic ones, which likely are used in systems programming, where
string-set! can still be quite a useful operation.

Cheers,
Peter
-- 
http://sjamaan.ath.cx
--
"The process of preparing programs for a digital computer
 is especially attractive, not only because it can be economically
 and scientifically rewarding, but also because it can be an aesthetic
 experience much like composing poetry or music."
-- Donald Knuth




Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Alaric Snell-Pym

On 19 Sep 2009, at 10:41 am, Ray Dillinger wrote:

> If we want to drop a feature that basic, then surely we should
> first deprecate it and move it to a library (R7) and second
> make the library optional (R8) so that its eventual absence
> does not come as a shock to users.

I think that's basically the long-term of what John's suggesting, by
removing it from the Thing One core; the only point of contention here
is whether it's an *optional* library or not...

> Scheme has always had as one of its strengths the fact that you
> can use it to express many different paradigms of programming,
> and I think that taking steps to *reduce* its value for one or
> more paradigms is a mistake.

The thing is, the sheer possibility of string-set! makes other string
operations consume more time/space resources, which reduces their
value. It's a choice between two costs which have different effects
for different situations.

I'd like to see:

1) Some implementations with immutable strings
2) Some implementations with mutable strings
3) Some implementations that give you both, where all string
operations that don't mutate work seamlessly on both, with explicit
make-mutable-string and make-immutable-string operations, and a
parameter that selects whether newly created strings are one or the
other (from read, *->string, etc).

ABS

--
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/archives/author/alaric/






Re: [r6rs-discuss] Proposed NON-features for small Scheme, part 8: string-set! must die

2009-09-19 Thread Ray Dillinger
On Sat, 2009-09-19 at 01:24 -0400, John Cowan wrote:
> This is a proposal for the removal of string-set! (and consequently
> string-fill!) from the R7RS small Scheme language.  


While I certainly think that string-set! is in fact unnecessary, 
I do not agree that it should be immediately removed.  I'd support
moving mutable-strings to a library rather than keeping them in the
core language, but the world is not yet ready for scheme without 
mutable strings.

I agree, in principle, that string-set! &co are more problematic, 
for several reasons, in a unicode environment.

And functional strings (ie, strings as immutable values) are also 
a better fit for multiprocessing, which is becoming more common
in hardware. 

In designing a new lisp at this point, I would not (indeed, 
did not) include them. 

But: 

We are not designing a new lisp here.  We are standardizing the 
next scheme. 

There is not a widespread scheme that has no string-set!, nor a 
large body of scheme code that is known to run well without it. 

If we want to drop a feature that basic, then surely we should 
first deprecate it and move it to a library (R7) and second 
make the library optional (R8) so that its eventual absence 
does not come as a shock to users. 

And:

I think that the absence of string mutation would amount to 
an attempt to make it hard to write bad code.  And my experience 
of such attempts is that in practice it makes the bad code 
worse.  In the case of mutation operators in particular, people
will just write a "portable string-set!" library in terms of 
vectors or lists, and then we will have many incompatible kinds
of incompatible strings to go with our many incompatible object 
systems.

Scheme has always had as one of its strengths the fact that you 
can use it to express many different paradigms of programming, 
and I think that taking steps to *reduce* its value for one or 
more paradigms is a mistake. 

Finally:

The point about strings having formerly enjoyed a 4:1 storage 
advantage (8 vs. 32 bits) and 21-bit characters being a better 
fit to 32-bit words?  Firstly, irrelevant.  Small scheme should
be about semantics, not about hardware.  Secondly, incorrect. 
The primary encoding used by Operating Systems and underlying 
libraries is UTF8, which still enjoys a 4:1 storage advantage
for most strings.  Thirdly, it's growing more incorrect. Since 
new computers these days use 64-bit pointers, the advantage of 
packed UTF8 strings over general vectors is rapidly shifting to
8:1. 
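
The ratios can be checked with a little arithmetic, sketched here assuming
R7RS (scheme base).  The procedures utf8-bytes and vector-bytes are
hypothetical; they count payload only, ignore object headers, and assume a
general vector stores one unboxed character per machine word.

```scheme
(import (scheme base))

;; Bytes occupied by the packed UTF-8 encoding of STR.
(define (utf8-bytes str)
  (bytevector-length (string->utf8 str)))

;; Bytes occupied by a general vector of STR's characters when each
;; slot is one word of WORD-SIZE bytes.
(define (vector-bytes str word-size)
  (* word-size (string-length str)))

(utf8-bytes "string-set!")       ; 11 ASCII characters take 11 bytes
(vector-bytes "string-set!" 4)   ; 44 bytes with 32-bit words, i.e. 4:1
(vector-bytes "string-set!" 8)   ; 88 bytes with 64-bit words, i.e. 8:1
```

The ratio narrows for text outside ASCII, where UTF-8 needs two to four bytes
per code point, which is why the claim above is qualified with "for most
strings".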

Bear


