Re: [racket-dev] `string-split'

2012-04-19 Thread Eli Barzilay
[Changed title to talk about each one separately.]

Two hours ago, Laurent wrote:
 One string function that I often find useful in various scripting
 languages is a `string-split' (explode in php).  It can be done with
 `regexp-split', but having something more along the lines of a
 `string-split' should belong to a racket/string lib I think.  Plus
 it would be symmetric with `string-join', which already is in
 racket/string (or at least a doc line pointing to regexp-split
 should be added there).

If you mean something like this:

  (define (string-split str) (regexp-match* #px"\\S+" str))

?

If so, then I see a much weaker point for it -- unlike other small
utilities, this one doesn't even compose two function calls.

The very weak point here is if you want a default argument that
specifies the gaps to split on rather than the words:

  (define (string-split str [sep #px"\\s+"])
    (remove* '("") (regexp-split sep str)))

but that *does* use regexps, so I don't see the point, still...
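
For comparison, the two definitions above can be run side by side. This is only a sketch: the names `split-words` and `split-on-gaps` are hypothetical (chosen so that nothing in racket/string is shadowed), and the sample input is made up.

```racket
#lang racket/base
;; Hypothetical names; neither is the eventual library function.

;; Variant 1: collect the runs of non-whitespace characters.
(define (split-words str)
  (regexp-match* #px"\\S+" str))

;; Variant 2: split on the gaps, then drop the empty strings that
;; leading, trailing, or repeated separators leave behind.
(define (split-on-gaps str [sep #px"\\s+"])
  (remove* '("") (regexp-split sep str)))

(split-words "  foo bar  baz ")    ; => '("foo" "bar" "baz")
(split-on-gaps "  foo bar  baz ")  ; => '("foo" "bar" "baz")
```

On whitespace-separated input the two agree; they differ only in whether the caller describes the words or the gaps.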

-- 
  ((lambda (x) (x x)) (lambda (x) (x x)))  Eli Barzilay:
http://barzilay.org/   Maze is Life!

_
  Racket Developers list:
  http://lists.racket-lang.org/dev


Re: [racket-dev] `string-split'

2012-04-19 Thread Sam Tobin-Hochstadt
On Thu, Apr 19, 2012 at 8:21 AM, Eli Barzilay <e...@barzilay.org> wrote:

 Two hours ago, Laurent wrote:
 One string function that I often find useful in various scripting
 languages is a `string-split' (explode in php).  It can be done with
 `regexp-split', but having something more along the lines of a
 `string-split' should belong to a racket/string lib I think.  Plus
 it would be symmetric with `string-join', which already is in
 racket/string (or at least a doc line pointing to regexp-split
 should be added there).

 If you mean something like this:

  (define (string-split str) (regexp-match* #px"\\S+" str))

 ?

 If so, then I see a much weaker point for it -- unlike other small
 utilities, this one doesn't even compose two function calls.

It composes one function call (with an extremely complex API) with one
domain-specific language (that lots of people don't
know/understand/use) into one extremely simple but useful function.

 The very weak point here is if you want a default argument that
 specifies the gaps to split on rather than the words:

  (define (string-split str [sep #px"\\s+"])
    (remove* '("") (regexp-split sep str)))

 but that *does* use regexps, so I don't see the point, still...

Note that (string-split str ";") works given that implementation,
which I think makes it both easy-to-understand and useful.
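
To make the point concrete, here is a small sketch of passing a plain string as `sep` to the definition quoted above (renamed to the hypothetical `split-on-gaps`). It assumes only what `regexp-split` documents: a string pattern is compiled as a regexp, so regexp metacharacters keep their special meaning.

```racket
#lang racket/base
;; The quoted definition, under a hypothetical name.
(define (split-on-gaps str [sep #px"\\s+"])
  (remove* '("") (regexp-split sep str)))

;; A string with no metacharacters behaves like a literal separator.
(split-on-gaps "a;b;;c" ";")  ; => '("a" "b" "c")

;; "." is a regexp wildcard, so every character counts as a separator
;; and only empty strings (which are then removed) remain.
(split-on-gaps "a.b.c" ".")   ; => '()
```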
-- 
sam th
sa...@ccs.neu.edu



Re: [racket-dev] `string-split'

2012-04-19 Thread Laurent
 (define (string-split str [sep #px"\\s+"])
   (remove* '("") (regexp-split sep str)))


Nearly, I meant something more like this:

(define (string-split str [splitter " "])
  (regexp-split (regexp-quote splitter) str))

No regexp from the user POV, and much easier to use with little knowledge.
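
A short sketch of what `regexp-quote` buys here, using the hypothetical name `split-literal` for the definition above; the sample strings are made up. The separator is matched character for character, even when it contains regexp metacharacters.

```racket
#lang racket/base
;; The definition above, under a hypothetical name.
(define (split-literal str [splitter " "])
  (regexp-split (regexp-quote splitter) str))

(split-literal "a b c")        ; => '("a" "b" "c")
(split-literal "a*b*c" "*")    ; => '("a" "b" "c")
(split-literal "1.5,2.5" ".")  ; => '("1" "5,2" "5")
```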


Re: [racket-dev] `string-split'

2012-04-19 Thread Matthew Flatt
I agree with this: we should add `string-split', the one-argument case
should be as Eli wrote, and the two-argument case should be as Laurent
wrote. (Probably the optional second argument should be string-or-#f,
where #f means to use #px"\\s+".)
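
One possible reading of this proposal as code. It is a sketch only, with the hypothetical name `split-or-words`, not a committed design:

```racket
#lang racket/base
;; sep is either a string (taken literally) or #f (split into words).
(define (split-or-words str [sep #f])
  (if sep
      (regexp-split (regexp-quote sep) str)  ; two-argument case
      (regexp-match* #px"\\S+" str)))        ; one-argument case

(split-or-words " a  b ")      ; => '("a" "b")
(split-or-words "a,b,,c" ",")  ; => '("a" "b" "" "c")
```

Note that in this sketch the two-argument case keeps empty strings while the one-argument case never produces them.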

At Thu, 19 Apr 2012 14:30:31 +0200, Laurent wrote:
  (define (string-split str [sep #px"\\s+"])
    (remove* '("") (regexp-split sep str)))
 
 
 Nearly, I meant something more like this:
 
 (define (string-split str [splitter " "])
   (regexp-split (regexp-quote splitter) str))
 
 No regexp from the user POV, and much easier to use with little knowledge.


Re: [racket-dev] `string-split'

2012-04-19 Thread Matthias Felleisen

I think Laurent pointed out in his initial message that beginners may be
intimidated by regexps. I agree. Plus, someone who isn't fluent in regexps may
be more comfortable with string-split. Last but not least, a program documents
itself more clearly with string-split than with a regexp.



On Apr 19, 2012, at 8:21 AM, Eli Barzilay wrote:

 [Changed title to talk about each one separately.]
 
 Two hours ago, Laurent wrote:
 One string function that I often find useful in various scripting
 languages is a `string-split' (explode in php).  It can be done with
 `regexp-split', but having something more along the lines of a
 `string-split' should belong to a racket/string lib I think.  Plus
 it would be symmetric with `string-join', which already is in
 racket/string (or at least a doc line pointing to regexp-split
 should be added there).
 
 If you mean something like this:
 
  (define (string-split str) (regexp-match* #px"\\S+" str))
 
 ?
 
 If so, then I see a much weaker point for it -- unlike other small
 utilities, this one doesn't even compose two function calls.
 
 The very weak point here is if you want a default argument that
 specifies the gaps to split on rather than the words:
 
  (define (string-split str [sep #px"\\s+"])
    (remove* '("") (regexp-split sep str)))
 
 but that *does* use regexps, so I don't see the point, still...
 
 -- 
  ((lambda (x) (x x)) (lambda (x) (x x)))  Eli Barzilay:
http://barzilay.org/   Maze is Life!
 


Re: [racket-dev] `string-split'

2012-04-19 Thread Laurent
On Thu, Apr 19, 2012 at 14:33, Matthew Flatt <mfl...@cs.utah.edu> wrote:

 I agree with this: we should add `string-split', the one-argument case
 should be as Eli wrote,


About this I'm not sure, as one cannot reproduce this behavior by providing
an argument (or it could make the difference between string-as-not-regexps
and regexps? Wouldn't this be different from other places?).
It would then appear somewhat magical. To me the " " default splitter seems
more intuitive.

Laurent


 and the two-argument case should be as Laurent
 wrote. (Probably the optional second argument should be string-or-#f,
 where #f means to use #px"\\s+".)

 At Thu, 19 Apr 2012 14:30:31 +0200, Laurent wrote:
   (define (string-split str [sep #px"\\s+"])
     (remove* '("") (regexp-split sep str)))
  
 
  Nearly, I meant something more like this:
 
  (define (string-split str [splitter " "])
    (regexp-split (regexp-quote splitter) str))
 
  No regexp from the user POV, and much easier to use with little
 knowledge.


Re: [racket-dev] `string-split'

2012-04-19 Thread Matthew Flatt
At Thu, 19 Apr 2012 14:43:44 +0200, Laurent wrote:
 On Thu, Apr 19, 2012 at 14:33, Matthew Flatt <mfl...@cs.utah.edu> wrote:
 
  I agree with this: we should add `string-split', the one-argument case
  should be as Eli wrote,
 
 
 About this I'm not sure, as one cannot reproduce this behavior by providing
 an argument (or it could make the difference between string-as-not-regexps
 and regexps? Wouldn't this be different from other places?).

I'm suggesting that supplying `#f' as the argument would be the same as
not supplying the argument.

It is a special case, though. I don't mind the specialness here,
because I see the job of `string-split' as making a couple of useful
special cases easy (as opposed to the generality of `regexp-split').


 It would then appear somewhat magical. To me the " " default splitter seems
 more intuitive.
 
 Laurent
 
 
  and the two-argument case should be as Laurent
  wrote. (Probably the optional second argument should be string-or-#f,
  where #f means to use #px"\\s+".)
 
  At Thu, 19 Apr 2012 14:30:31 +0200, Laurent wrote:
   (define (string-split str [sep #px"\\s+"])
     (remove* '("") (regexp-split sep str)))
   
  
   Nearly, I meant something more like this:
  
   (define (string-split str [splitter " "])
     (regexp-split (regexp-quote splitter) str))
  
   No regexp from the user POV, and much easier to use with little
  knowledge.


Re: [racket-dev] `string-split'

2012-04-19 Thread Laurent
On Thu, Apr 19, 2012 at 14:53, Matthew Flatt <mfl...@cs.utah.edu> wrote:

 At Thu, 19 Apr 2012 14:43:44 +0200, Laurent wrote:
  On Thu, Apr 19, 2012 at 14:33, Matthew Flatt <mfl...@cs.utah.edu> wrote:
 
   I agree with this: we should add `string-split', the one-argument case
   should be as Eli wrote,
 
 
  About this I'm not sure, as one cannot reproduce this behavior by
 providing
  an argument (or it could make the difference between
 string-as-not-regexps
  and regexps? Wouldn't this be different from other places?).

 I'm suggesting that supplying `#f' as the argument would be the same as
 not supplying the argument.

 It is a special case, though. I don't mind the specialness here,
 because I see the job of `string-split' as making a couple of useful
 special cases easy (as opposed to the generality of `regexp-split').


Then instead of #f one idea is to go one step further and consider
different useful cases based on input symbols like 'whitespaces,
'non-alpha, etc. ? Or even a list of string/symbols that can be used as a
splitter.
That would make a more powerful function for sure. (It's just that I'm
troubled by the uniqueness of this magical default argument)

Laurent





  It would then appear somewhat magical. To me the " " default splitter
 seems
  more intuitive.
 
  Laurent
 
 
   and the two-argument case should be as Laurent
   wrote. (Probably the optional second argument should be string-or-#f,
   where #f means to use #px"\\s+".)
  
   At Thu, 19 Apr 2012 14:30:31 +0200, Laurent wrote:
    (define (string-split str [sep #px"\\s+"])
      (remove* '("") (regexp-split sep str)))

   
Nearly, I meant something more like this:
   
(define (string-split str [splitter " "])
  (regexp-split (regexp-quote splitter) str))
   
No regexp from the user POV, and much easier to use with little
   knowledge.


Re: [racket-dev] `string-split'

2012-04-19 Thread Eli Barzilay
A few minutes ago, Laurent wrote:
 
 Then instead of #f one idea is to go one step further and consider
 different useful cases based on input symbols like 'whitespaces,
 'non-alpha, etc. ? Or even a list of string/symbols that can be used
 as a splitter.  That would make a more powerful function for
 sure. (It's just that I'm troubled by the uniqueness of this magical
 default argument)

(This is something that I do object to...  It leads to srfi-14 which
is one overkill way for that, and we already have regexps that do
that.  So I think that simple is a major point.)

-- 
  ((lambda (x) (x x)) (lambda (x) (x x)))  Eli Barzilay:
http://barzilay.org/   Maze is Life!


Re: [racket-dev] `string-split'

2012-04-19 Thread Eli Barzilay
[Meta-note: I'm not flatly objecting to these, just trying to
clarify the exact behavior and the possible effects on other
functions.]

10 minutes ago, Laurent wrote:
  
 
  (define (string-split str [sep #px"\\s+"])
    (remove* '("") (regexp-split sep str)))
 
 Nearly, I meant something more like this:
 
 (define (string-split str [splitter " "])
   (regexp-split (regexp-quote splitter) str))
 
 No regexp from the user POV, and much easier to use with little
 knowledge.

That doesn't seem right -- with this you get

  -> (string-split " st  ring")
  '("" "st" "" "ring")

which is why I think that the above is a better definition in terms of
newbie-ness.


10 minutes ago, Matthew Flatt wrote:
 I agree with this: we should add `string-split', the one-argument case
 should be as Eli wrote, and the two-argument case should be as Laurent
 wrote. (Probably the optional second argument should be string-or-#f,
 where #f means to use #px\\s+.)

Continuing with this line, it seems that a better definition is as
follows:

  (define (string-split str [sep " "])
    (remove* '("") (regexp-split (regexp-quote (or sep " ")) str)))

Except that the full definition could be a bit more efficient.
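
A quick sketch of how this revised definition behaves, under the hypothetical name `split-drop-empties`. The sample inputs are made up, and `#f` is accepted only because of the `(or sep " ")` in the body:

```racket
#lang racket/base
;; The revised definition, under a hypothetical name.
(define (split-drop-empties str [sep " "])
  (remove* '("") (regexp-split (regexp-quote (or sep " ")) str)))

(split-drop-empties " st  ring")  ; => '("st" "ring")
(split-drop-empties "aXXby" "X")  ; => '("a" "by")
(split-drop-empties "a b" #f)     ; => '("a" "b")
```

The separator is still taken literally, but leading, trailing, and doubled separators no longer leave empty strings in the result.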

Four questions:

1. Laurent: Does this make more sense?

2. Matthew: Is there any reason to make the #f-as-default part of the
   interface?  (Even with the new reply I don't see a necessity for
   this -- if the target is newbies, then I think that keeping it as a
   string is simpler...)

3. There's also the point of how this optional argument plays with
   other functions in `racket/string'.  If it works as above, then
   `string-trim' and `string-normalize-spaces' should change
   accordingly so they take the same kind of input simplified
   regexp.

4. Related to Q3: what does "xy" as that argument mean exactly?
   a. #rx"[xy]"
   b. #rx"[xy]+"
   c. #rx"xy"
   d. #rx"(?:xy)+"
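
The four readings give different results on the same input. A small sketch, using a made-up sample string and plain `regexp-split` with each candidate pattern (the helper `split-as` and the table `readings` are hypothetical):

```racket
#lang racket/base
(define sample "1xy2yx3xyxy4") ; made-up input

;; One regexp per reading of the separator string "xy".
(define readings
  (list (cons 'a #rx"[xy]")      ; any one of the characters
        (cons 'b #rx"[xy]+")     ; a run of any of the characters
        (cons 'c #rx"xy")        ; the literal string
        (cons 'd #rx"(?:xy)+"))) ; a run of the literal string

(define (split-as reading)
  (regexp-split (cdr (assq reading readings)) sample))

(split-as 'a) ; => '("1" "" "2" "" "3" "" "" "" "4")
(split-as 'b) ; => '("1" "2" "3" "4")
(split-as 'c) ; => '("1" "2yx3" "" "4")
(split-as 'd) ; => '("1" "2yx3" "4")
```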

-- 
  ((lambda (x) (x x)) (lambda (x) (x x)))  Eli Barzilay:
http://barzilay.org/   Maze is Life!



Re: [racket-dev] `string-split'

2012-04-19 Thread Laurent
Continuing with this line, it seems that a better definition is as
 follows:

  (define (string-split str [sep " "])
    (remove* '("") (regexp-split (regexp-quote (or sep " ")) str)))

 Except that the full definition could be a bit more efficient.

 Three questions:

 1. Laurent: Does this make more sense?


Yes, this definitely makes more sense to me.
It would then treat (string-split "aXXby" "X") just like the " " case.

Although if you want to find the columns of a latex line like x  y  z
you will have the wrong result.
Maybe use an optional argument to remove the empty strings? (not sure)


 2. Matthew: Is there any reason to make the #f-as-default part of the
   interface?  (Even with the new reply I don't see a necessity for
   this -- if the target is newbies, then I think that keeping it as a
   string is simpler...)


There is probably no need for #f with the new spec.

4. Related to Q3: what does "xy" as that argument mean exactly?
   a. #rx"[xy]"
   b. #rx"[xy]+"
   c. #rx"xy"
   d. #rx"(?:xy)+"


Good question. d. would be the simplest case for newbies, but b. might be
more useful.
I think several other languages avoid this issue by using only one
character as the separator.

Laurent


Re: [racket-dev] `string-split'

2012-04-19 Thread Laurent

 4. Related to Q3: what does "xy" as that argument mean exactly?
   a. #rx"[xy]"
   b. #rx"[xy]+"
   c. #rx"xy"
   d. #rx"(?:xy)+"


 Good question. d. would be the simplest case for newbies, but b. might be
 more useful.


It would make more sense that a string really is a string, not a set of
characters.
Without going as far as srfi-14, a set could be a list of strings or
characters, but maybe this is not needed.

Laurent


Re: [racket-dev] `string-split'

2012-04-19 Thread Eli Barzilay
Just now, Laurent wrote:
 1. Laurent: Does this make more sense?
 
 Yes, this definitely makes more sense to me.  It would then treat
 (string-split "aXXby" "X") just like the " " case.
 
 Although if you want to find the columns of a latex line like x 
 y  z you will have the wrong result.  Maybe use an optional
 argument to remove the empty strings? (not sure)

(This complicates things...)

First, I don't think that there's a need to make it able to do stuff
like that -- either you go with regexps, or you use combinations like

  (map string-trim (string-split x  y  z ))


 4. Related to Q3: what does "xy" as that argument mean exactly?
   a. #rx"[xy]"
   b. #rx"[xy]+"
   c. #rx"xy"
   d. #rx"(?:xy)+"
 
 Good question. d. would be the simplest case for newbies, but
 b. might be more useful.  I think several other languages avoid this
 issue by using only one character as the separator.

The complication is that with " " or " \t" it seems that you'd want b,
and with "&" you'd want c.  (Maybe even make "&" equivalent to
#rx" *& *" -- that looks like it's too much guessing.)

And you're also making a point for:

  e. Throw an error, must be a single-character string.
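
Option e could be sketched as follows. The name `split-on-char` and the error message are assumptions, and this version keeps empty strings in the result, a point the thread has not settled:

```racket
#lang racket/base
;; Hypothetical: accept only a one-character separator string.
(define (split-on-char str [sep " "])
  (unless (and (string? sep) (= (string-length sep) 1))
    (raise-argument-error 'split-on-char
                          "a single-character string"
                          sep))
  ;; The single character is matched literally.
  (regexp-split (regexp-quote sep) str))

(split-on-char "a,b,,c" ",")  ; => '("a" "b" "" "c")
;; (split-on-char "a b" "xy") raises a contract error.
```

This sidesteps the question of what a multi-character separator means, at the cost of rejecting inputs that options a through d would accept.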

BTW, this question is important because it affects other functions, so
I'd like to resolve it before doing anything.

-- 
  ((lambda (x) (x x)) (lambda (x) (x x)))  Eli Barzilay:
http://barzilay.org/   Maze is Life!



Re: [racket-dev] `string-split'

2012-04-19 Thread Eli Barzilay
An hour and a half ago, Ryan Culpepper wrote:
 Instead of trying to design a 'string-split' that is both
 miraculously intuitive and profoundly flexible, why not design it
 like a Model-T

Invalid analogy: the issue is not flexibility, it's making something
that is simple (first) and useful (second) in most cases.


An hour and a half ago, Michael W wrote:
 (TL;DR: I'd suggest two functions: one (string-words str) function
 that does Eli's way, and one (string-split str sep) that does it
 Laurent's way).

I don't think that we argued on what it should do, rather it looks
like we're both looking for whatever option looks best...


    -> (string-split " st  ring")
    '("" "st" "" "ring")
  
  which is why I think that the above is a better definition in terms of
  newbie-ness.
 
 No, every other language I've worked with does that.
 [...]

The examples you're quoting are the equivalents of our `regexp-split',
which works in a similar way and is not going to change.  We're
talking about some watered-down version that is easier to use.


Just now, Laurent wrote:
 (TL;DR: I'd suggest two functions: one (string-words str)
 function that does Eli's way, and one (string-split str sep)
 that does it Laurent's way).
 
 That would be a good option to me, considering that my way is with
 remaining ""s in the output list.  The question remains if a string
 can be accepted for sep, in which case the empty string must be
 considered, as pointed out in the Lua discussion. Though a single
 char should be sufficient for nearly all simple cases.

I think that I have a good conclusion here, I'll post on a new thread.

-- 
  ((lambda (x) (x x)) (lambda (x) (x x)))  Eli Barzilay:
http://barzilay.org/   Maze is Life!