Re: [racket-dev] `string-split'

2012-04-19 Thread Eli Barzilay
An hour and a half ago, Ryan Culpepper wrote:
> Instead of trying to design a 'string-split' that is both
> miraculously intuitive and profoundly flexible, why not design it
> like a Model-T

Invalid analogy: the issue is not flexibility, it's making something
that is simple (first) and useful (second) in most cases.


An hour and a half ago, Michael W wrote:
> (TL;DR: I'd suggest two functions: one (string-words str) function
> that does Eli's way, and one (string-split str sep) that does it
> Laurent's way).

I don't think that we argued about what it should do; rather, it looks
like we're both looking for whatever option looks best...


> >   -> (string-split " st  ring")
> >   '("" "st" "" "ring")
> > 
> > which is why I think that the above is a better definition in terms of
> > newbie-ness.
> 
> No, every other language I've worked with does that.
> [...]

The examples you're quoting are the equivalents of our `regexp-split',
which works in a similar way and is not going to change.  We're
talking about some watered-down version that is easier to use.


Just now, Laurent wrote:
> (TL;DR: I'd suggest two functions: one (string-words str)
> function that does Eli's way, and one (string-split str sep)
> that does it Laurent's way).
> 
> That would be a good option to me, considering that "my way" is with
> remaining ""s in the output list.  The question remains if a string
> can be accepted for sep, in which case the empty string must be
> considered, as pointed out in the Lua discussion. Though a single
> char should be sufficient for nearly all simple cases.

I think that I have a good conclusion here, I'll post on a new thread.

-- 
  ((lambda (x) (x x)) (lambda (x) (x x)))  Eli Barzilay:
http://barzilay.org/   Maze is Life!
_
  Racket Developers list:
  http://lists.racket-lang.org/dev


Re: [racket-dev] `string-split'

2012-04-19 Thread Laurent
>
> (TL;DR: I'd suggest two functions: one (string-words str)
> function that does Eli's way, and one (string-split str sep) that
> does it Laurent's way).


That would be a good option to me, considering that "my way" is with
remaining ""s in the output list.
The question remains if a string can be accepted for sep, in which case the
empty string must be considered, as pointed out in the Lua discussion.
Though a single char should be sufficient for nearly all simple cases.

Laurent


Re: [racket-dev] `string-split'

2012-04-19 Thread Michael W
(TL;DR: I'd suggest two functions: one (string-words str)
function that does Eli's way, and one (string-split str sep) that
does it Laurent's way).

50 minutes ago, Eli Barzilay wrote:
> That doesn't seem right -- with this you get
> 
>   -> (string-split " st  ring")
>   '("" "st" "" "ring")
> 
> which is why I think that the above is a better definition in terms of
> newbie-ness.

No, every other language I've worked with does that.

$ python
Python 3.2.2 (default, Nov 21 2011, 16:51:01) 
[GCC 4.6.2] on linux2
Type "help", "copyright", "credits" or "license" for more
information.
>>> " st  ring".split(" ")
['', 'st', '', 'ring']


$ node
> " st  ring".split(" ")
[ '', 'st', '', 'ring' ]


$ php -a
php > var_dump(split(" ", " str  ing"));
array(4) {
  [0]=>
  string(0) ""
  [1]=>
  string(3) "str"
  [2]=>
  string(0) ""
  [3]=>
  string(3) "ing"
}

Haskell uses two functions; one which eliminates contiguous runs
and one which doesn't (and comes from an entire external library,
sheesh! though it's easy to write your own):
$ ghci
Prelude> words " str  ing"
["str","ing"]
Prelude> Data.List.Split.splitOn " "  " str  ing"
["","str","","ing"]


Ruby has the weirdest behavior, which I consider to be a bug:

$ irb
irb(main):001:0> " st  ring".split(" ")
=> ["st", "ring"]
irb(main):002:0> " st  ring".split(/ /)
=> ["", "st", "", "ring"]

The ruby docs say:
http://www.ruby-doc.org/core-1.9.3/String.html
If pattern is a String, then its contents are used as the
delimiter when splitting str. If pattern is a single space, str
is split on whitespace, with leading whitespace and runs of
contiguous whitespace characters ignored.
If pattern is a Regexp, str is divided where the pattern matches.
Whenever the pattern matches a zero-length string, str is split
into individual characters. If pattern contains groups, the
respective matches will be returned in the array as well.

In looking for Lua (which doesn't include one, by the way), I
found http://lua-users.org/wiki/SplitJoin which has a big summary
of the issues.

-- 
For the Future!
_mike


Re: [racket-dev] `string-split'

2012-04-19 Thread Ryan Culpepper
Instead of trying to design a 'string-split' that is both miraculously 
intuitive and profoundly flexible, why not design it like a Model-T and 
then write a guide/cookbook for how to use regexps to do all of the 
common cases that the extremely limited 'string-split' doesn't handle?


I suspect that writing such a guide will expose a few cases where common 
patterns can be turned into functions (similar to 'regexp-replace-quote').


Ryan


On 04/19/2012 07:27 AM, Eli Barzilay wrote:
> Just now, Laurent wrote:
> > 1. Laurent: Does this make more sense?
> >
> > Yes, this definitely makes more sense to me.  It would then treat
> > (string-split "aXXby" "X") just like the " " case.
> >
> > Although if you want to find the columns of a latex line like "x &&
> > y & z" you will have the wrong result.  Maybe use an optional
> > argument to remove the empty strings? (not sure)
>
> (This complicates things...)
>
> First, I don't think that there's a need to make it able to do stuff
> like that -- either you go with regexps, or you use combinations like
>
>   (map string-trim (string-split "x && y & z" "&"))
>
> > 4. Related to Q3: what does "xy" as that argument mean exactly?
> >   a. #rx"[xy]"
> >   b. #rx"[xy]+"
> >   c. #rx"xy"
> >   d. #rx"(?:xy)+"
> >
> > Good question. d. would be the simplest case for newbies, but
> > b. might be more useful.  I think several other languages avoid this
> > issue by using only one character as the separator.
>
> The complication is that with " " or " \t" it seems that you'd want b,
> and with "&" you'd want c.  (Maybe even make "&" equivalent to
> #rx" *& *" -- that looks like it's too much guessing.)
>
> And you're also making a point for:
>
>   e. Throw an error, must be a single-character string.
>
> BTW, this question is important because it affects other functions, so
> I'd like to resolve it before doing anything.



Re: [racket-dev] `string-split'

2012-04-19 Thread Laurent
> > 4. Related to Q3: what does "xy" as that argument mean exactly?
> >   a. #rx"[xy]"
> >   b. #rx"[xy]+"
> >   c. #rx"xy"
> >   d. #rx"(?:xy)+"
> >
> > Good question. d. would be the simplest case for newbies, but
> > b. might be more useful.  I think several other languages avoid this
> > issue by using only one character as the separator.
>
> The complication is that with " " or " \t" it seems that you'd want b,
> and with "&" you'd want c.  (Maybe even make "&" equivalent to
> #rx" *& *" -- that looks like it's too much guessing.)
>
> And you're also making a point for:
>
>  e. Throw an error, must be a single-character string.
>
> BTW, this question is important because it affects other functions, so
> I'd like to resolve it before doing anything.
>

If we make things as simple-but-useful as possible, then I'd go for a
single char separator with option b/d. (I don't think there are many cases
where one would want a string as a separator?)
Personally, I don't like much when functions ask for a character because
the #\ looks ugly to me, but it still makes more sense than asking for a
string that must have a single character.

Laurent


Re: [racket-dev] `string-split'

2012-04-19 Thread Eli Barzilay
Just now, Laurent wrote:
> 1. Laurent: Does this make more sense?
> 
> Yes, this definitely makes more sense to me.  It would then treat
> (string-split "aXXby" "X") just like the " " case.
> 
> Although if you want to find the columns of a latex line like "x &&
> y & z" you will have the wrong result.  Maybe use an optional
> argument to remove the empty strings? (not sure)

(This complicates things...)

First, I don't think that there's a need to make it able to do stuff
like that -- either you go with regexps, or you use combinations like

  (map string-trim (string-split "x && y & z" "&"))
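
For readers following along outside Racket, here is a rough Python sketch of that split-then-trim combination. It assumes a split that keeps empty fields; the function names are made up for illustration, and `str.split` and `str.strip` merely stand in for `string-split` and `string-trim`.

```python
def split_columns(line, sep="&"):
    # Split on the literal separator, keeping empty fields,
    # then trim the whitespace around each cell.
    return [cell.strip() for cell in line.split(sep)]


def split_columns_drop_empty(line, sep="&"):
    # Same as above, but empty cells are dropped afterwards,
    # which loses the empty column in the LaTeX-style example.
    return [cell for cell in split_columns(line, sep) if cell != ""]


print(split_columns("x && y & z"))             # ['x', '', 'y', 'z']
print(split_columns_drop_empty("x && y & z"))  # ['x', 'y', 'z']
```

The first result keeps the empty column between the two `&`s; the second shows what is lost if empty strings are always removed.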


> 4. Related to Q3: what does "xy" as that argument mean exactly?
>   a. #rx"[xy]"
>   b. #rx"[xy]+"
>   c. #rx"xy"
>   d. #rx"(?:xy)+"
> 
> Good question. d. would be the simplest case for newbies, but
> b. might be more useful.  I think several other languages avoid this
> issue by using only one character as the separator.

The complication is that with " " or " \t" it seems that you'd want b,
and with "&" you'd want c.  (Maybe even make "&" equivalent to
#rx" *& *" -- that looks like it's too much guessing.)

And you're also making a point for:

  e. Throw an error, must be a single-character string.

BTW, this question is important because it affects other functions, so
I'd like to resolve it before doing anything.

-- 
  ((lambda (x) (x x)) (lambda (x) (x x)))  Eli Barzilay:
http://barzilay.org/   Maze is Life!



Re: [racket-dev] `string-split'

2012-04-19 Thread Laurent
>> 4. Related to Q3: what does "xy" as that argument mean exactly?
>>   a. #rx"[xy]"
>>   b. #rx"[xy]+"
>>   c. #rx"xy"
>>   d. #rx"(?:xy)+"
>>
>
> Good question. d. would be the simplest case for newbies, but b. might be
> more useful.
>

It would make more sense that a string really is a string, not a set of
characters.
Without going as far as srfi-14, a set could be a list of strings or
characters, but maybe this is not needed.

Laurent


Re: [racket-dev] `string-split'

2012-04-19 Thread Laurent
> Continuing with this line, it seems that a better definition is as
> follows:
>
>   (define (string-split str [sep " "])
>     (remove* '("") (regexp-split (regexp-quote (or sep " ")) str)))
>
> Except that the full definition could be a bit more efficient.
>
> Four questions:
>
> 1. Laurent: Does this make more sense?
>

Yes, this definitely makes more sense to me.
It would then treat (string-split "aXXby" "X") just like the " " case.

Although if you want to find the columns of a latex line like "x && y & z"
you will have the wrong result.
Maybe use an optional argument to remove the empty strings? (not sure)


> 2. Matthew: Is there any reason to make the #f-as-default part of the
>   interface?  (Even with the new reply I don't see a necessity for
>   this -- if the target is newbies, then I think that keeping it as a
>   string is simpler...)
>

There is probably no need for #f with the new spec.

> 4. Related to Q3: what does "xy" as that argument mean exactly?
>   a. #rx"[xy]"
>   b. #rx"[xy]+"
>   c. #rx"xy"
>   d. #rx"(?:xy)+"
>

Good question. d. would be the simplest case for newbies, but b. might be
more useful.
I think several other languages avoid this issue by using only one
character as the separator.

Laurent


Re: [racket-dev] `string-split'

2012-04-19 Thread Eli Barzilay
[Meta-note: I'm not just flatly objecting to these, just trying to
clarify the exact behavior and the possible effects on other
functions.]

10 minutes ago, Laurent wrote:
>  
> 
>  (define (string-split str [sep #px"\\s+"])
>    (remove* '("") (regexp-split sep str)))
> 
> Nearly, I meant something more like this:
> 
> (define (string-split str [splitter " "])
>   (regexp-split (regexp-quote splitter) str))
> 
> No regexp from the user POV, and much easier to use with little
> knowledge.

That doesn't seem right -- with this you get

  -> (string-split " st  ring")
  '("" "st" "" "ring")

which is why I think that the above is a better definition in terms of
newbie-ness.
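
The difference between the two definitions can be sketched in Python, with `re.escape` playing the role of `regexp-quote`. The function names are hypothetical, and this is only an approximation of the Racket definitions under discussion.

```python
import re


def split_keep_empties(s, sep=" "):
    # Literal separator, empty pieces kept
    # (the quoted-separator regexp-split version).
    return re.split(re.escape(sep), s)


def split_drop_empties(s, sep=" "):
    # Same split, then drop the empty strings,
    # which is what (remove* '("") ...) does.
    return [piece for piece in re.split(re.escape(sep), s) if piece != ""]


print(split_keep_empties(" st  ring"))  # ['', 'st', '', 'ring']
print(split_drop_empties(" st  ring"))  # ['st', 'ring']
```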


10 minutes ago, Matthew Flatt wrote:
> I agree with this: we should add `string-split', the one-argument case
> should be as Eli wrote, and the two-argument case should be as Laurent
> wrote. (Probably the optional second argument should be string-or-#f,
> where #f means to use #px"\\s+".)

Continuing with this line, it seems that a better definition is as
follows:

  (define (string-split str [sep " "])
    (remove* '("") (regexp-split (regexp-quote (or sep " ")) str)))

Except that the full definition could be a bit more efficient.

Four questions:

1. Laurent: Does this make more sense?

2. Matthew: Is there any reason to make the #f-as-default part of the
   interface?  (Even with the new reply I don't see a necessity for
   this -- if the target is newbies, then I think that keeping it as a
   string is simpler...)

3. There's also the point of how this optional argument plays with
   other functions in `racket/string'.  If it works as above, then
   `string-trim' and `string-normalize-spaces' should change
   accordingly so they take the same kind of input simplified
   "regexp".

4. Related to Q3: what does "xy" as that argument mean exactly?
   a. #rx"[xy]"
   b. #rx"[xy]+"
   c. #rx"xy"
   d. #rx"(?:xy)+"
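
To make the four readings concrete, here is a small Python sketch that builds each candidate pattern from a separator string and splits a sample input with it. The helper names are invented for illustration, the separator is assumed to be non-empty, and Python's `re` only approximates Racket's `#rx` syntax.

```python
import re


def candidate_patterns(sep):
    # Build the four interpretations (a-d) of a separator string.
    # Assumes sep is non-empty.
    char_class = "[" + "".join(re.escape(c) for c in sep) + "]"
    literal = re.escape(sep)
    return {
        "a": char_class,                 # any one of the characters
        "b": char_class + "+",           # a run of any of the characters
        "c": literal,                    # the exact string
        "d": "(?:" + literal + ")+",     # a run of the exact string
    }


def split_all(s, sep):
    # Split s under each interpretation and return the results by label.
    return {label: re.split(pattern, s)
            for label, pattern in candidate_patterns(sep).items()}


for label, pieces in split_all("1xy2yx3xyxy4", "xy").items():
    print(label, pieces)
# a ['1', '', '2', '', '3', '', '', '', '4']
# b ['1', '2', '3', '4']
# c ['1', '2yx3', '', '4']
# d ['1', '2yx3', '4']
```

Only c and d treat "yx" as ordinary text, and only b and d collapse repeated separators.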

-- 
  ((lambda (x) (x x)) (lambda (x) (x x)))  Eli Barzilay:
http://barzilay.org/   Maze is Life!



Re: [racket-dev] `string-split'

2012-04-19 Thread Eli Barzilay
A few minutes ago, Laurent wrote:
> 
> Then instead of #f one idea is to go one step further and consider
> different useful cases based on input symbols like 'whitespaces,
> 'non-alpha, etc. ? Or even a list of string/symbols that can be used
> as a splitter.  That would make a more powerful function for
> sure. (It's just that I'm troubled by the uniqueness of this magical
> default argument)

(This is something that I do object to...  It leads to srfi-14 which
is one overkill way for that, and we already have regexps that do
that.  So I think that "simple" is a major point.)

-- 
  ((lambda (x) (x x)) (lambda (x) (x x)))  Eli Barzilay:
http://barzilay.org/   Maze is Life!


Re: [racket-dev] `string-split'

2012-04-19 Thread Laurent
On Thu, Apr 19, 2012 at 14:53, Matthew Flatt  wrote:

> At Thu, 19 Apr 2012 14:43:44 +0200, Laurent wrote:
> > On Thu, Apr 19, 2012 at 14:33, Matthew Flatt  wrote:
> >
> > > I agree with this: we should add `string-split', the one-argument case
> > > should be as Eli wrote,
> >
> >
> > About this I'm not sure, as one cannot reproduce this behavior by providing
> > an argument (or it could make the difference between string-as-not-regexps
> > and regexps? Wouldn't this be different from other places?).
>
> I'm suggesting that supplying `#f' as the argument would be the same as
> not supplying the argument.
>
> It is a special case, though. I don't mind the specialness here,
> because I see the job of `string-split' as making a couple of useful
> special cases easy (as opposed to the generality of `regexp-split').
>

Then instead of #f one idea is to go one step further and consider
different useful cases based on input symbols like 'whitespaces,
'non-alpha, etc. ? Or even a list of string/symbols that can be used as a
splitter.
That would make a more powerful function for sure. (It's just that I'm
troubled by the uniqueness of this magical default argument)

Laurent



>
>
> > It would then appear somewhat magical. To me the " " default splitter seems
> > more intuitive.
> >
> > Laurent
> >
> >
> > > and the two-argument case should be as Laurent
> > > wrote. (Probably the optional second argument should be string-or-#f,
> > > where #f means to use #px"\\s+".)
> > >
> > > At Thu, 19 Apr 2012 14:30:31 +0200, Laurent wrote:
> > > > >  (define (string-split str [sep #px"\\s+"])
> > > > >    (remove* '("") (regexp-split sep str)))
> > > > >
> > > >
> > > > Nearly, I meant something more like this:
> > > >
> > > > (define (string-split str [splitter " "])
> > > >   (regexp-split (regexp-quote splitter) str))
> > > >
> > > > No regexp from the user POV, and much easier to use with little
> > > > knowledge.
> > >
>


Re: [racket-dev] `string-split'

2012-04-19 Thread Matthew Flatt
At Thu, 19 Apr 2012 14:43:44 +0200, Laurent wrote:
> On Thu, Apr 19, 2012 at 14:33, Matthew Flatt  wrote:
> 
> > I agree with this: we should add `string-split', the one-argument case
> > should be as Eli wrote,
> 
> 
> About this I'm not sure, as one cannot reproduce this behavior by providing
> an argument (or it could make the difference between string-as-not-regexps
> and regexps? Wouldn't this be different from other places?).

I'm suggesting that supplying `#f' as the argument would be the same as
not supplying the argument.

It is a special case, though. I don't mind the specialness here,
because I see the job of `string-split' as making a couple of useful
special cases easy (as opposed to the generality of `regexp-split').


> It would then appear somewhat magical. To me the " " default splitter seems
> more intuitive.
> 
> Laurent
> 
> 
> > and the two-argument case should be as Laurent
> > wrote. (Probably the optional second argument should be string-or-#f,
> > where #f means to use #px"\\s+".)
> >
> > At Thu, 19 Apr 2012 14:30:31 +0200, Laurent wrote:
> > > >  (define (string-split str [sep #px"\\s+"])
> > > >    (remove* '("") (regexp-split sep str)))
> > > >
> > >
> > > Nearly, I meant something more like this:
> > >
> > > (define (string-split str [splitter " "])
> > >   (regexp-split (regexp-quote splitter) str))
> > >
> > > No regexp from the user POV, and much easier to use with little
> > > knowledge.
> >


Re: [racket-dev] `string-split'

2012-04-19 Thread Laurent
On Thu, Apr 19, 2012 at 14:33, Matthew Flatt  wrote:

> I agree with this: we should add `string-split', the one-argument case
> should be as Eli wrote,


About this I'm not sure, as one cannot reproduce this behavior by providing
an argument (or it could make the difference between string-as-not-regexps
and regexps? Wouldn't this be different from other places?).
It would then appear somewhat magical. To me the " " default splitter seems
more intuitive.

Laurent


> and the two-argument case should be as Laurent
> wrote. (Probably the optional second argument should be string-or-#f,
> where #f means to use #px"\\s+".)
>
> At Thu, 19 Apr 2012 14:30:31 +0200, Laurent wrote:
> > >  (define (string-split str [sep #px"\\s+"])
> > >    (remove* '("") (regexp-split sep str)))
> > >
> >
> > Nearly, I meant something more like this:
> >
> > (define (string-split str [splitter " "])
> >   (regexp-split (regexp-quote splitter) str))
> >
> > No regexp from the user POV, and much easier to use with little
> > knowledge.
>


Re: [racket-dev] `string-split'

2012-04-19 Thread Matthias Felleisen

I think Laurent pointed out in his initial message that beginners may be
intimidated by regexps. I agree. Plus, someone who isn't fluent in regexps may
be more comfortable with string-split. Last but not least, a program documents
itself more clearly with string-split than with a regexp.



On Apr 19, 2012, at 8:21 AM, Eli Barzilay wrote:

> [Changed title to talk about each one separately.]
> 
> Two hours ago, Laurent wrote:
>> One string function that I often find useful in various scripting
>> languages is a `string-split' (explode in php).  It can be done with
>> `regexp-split', but having something more along the lines of a
>> `string-split' should belong to a racket/string lib I think.  Plus
>> it would be symmetric with `string-join', which already is in
>> racket/string (or at least a doc line pointing to regexp-split
>> should be added there).
> 
> If you mean something like this:
> 
>  (define (string-split str) (regexp-match* #px"\\S+" str))
> 
> ?
> 
> If so, then I see a much weaker point for it -- unlike other small
> utilities, this one doesn't even compose two function calls.
> 
> The very weak point here is if you want a default argument that
> specifies the gaps to split on rather than the words:
> 
>  (define (string-split str [sep #px"\\s+"])
>    (remove* '("") (regexp-split sep str)))
> 
> but that *does* use regexps, so I don't see the point, still...
> 
> -- 
>  ((lambda (x) (x x)) (lambda (x) (x x)))  Eli Barzilay:
>http://barzilay.org/   Maze is Life!
> 




Re: [racket-dev] `string-split'

2012-04-19 Thread Matthew Flatt
I agree with this: we should add `string-split', the one-argument case
should be as Eli wrote, and the two-argument case should be as Laurent
wrote. (Probably the optional second argument should be string-or-#f,
where #f means to use #px"\\s+".)

At Thu, 19 Apr 2012 14:30:31 +0200, Laurent wrote:
> >  (define (string-split str [sep #px"\\s+"])
> >    (remove* '("") (regexp-split sep str)))
> >
> 
> Nearly, I meant something more like this:
> 
> (define (string-split str [splitter " "])
>   (regexp-split (regexp-quote splitter) str))
> 
> No regexp from the user POV, and much easier to use with little knowledge.


Re: [racket-dev] `string-split'

2012-04-19 Thread Laurent
>  (define (string-split str [sep #px"\\s+"])
>    (remove* '("") (regexp-split sep str)))
>

Nearly, I meant something more like this:

(define (string-split str [splitter " "])
  (regexp-split (regexp-quote splitter) str))

No regexp from the user POV, and much easier to use with little knowledge.


Re: [racket-dev] `string-split'

2012-04-19 Thread Sam Tobin-Hochstadt
On Thu, Apr 19, 2012 at 8:21 AM, Eli Barzilay  wrote:
>
> Two hours ago, Laurent wrote:
>> One string function that I often find useful in various scripting
>> languages is a `string-split' (explode in php).  It can be done with
>> `regexp-split', but having something more along the lines of a
>> `string-split' should belong to a racket/string lib I think.  Plus
>> it would be symmetric with `string-join', which already is in
>> racket/string (or at least a doc line pointing to regexp-split
>> should be added there).
>
> If you mean something like this:
>
>  (define (string-split str) (regexp-match* #px"\\S+" str))
>
> ?
>
> If so, then I see a much weaker point for it -- unlike other small
> utilities, this one doesn't even compose two function calls.

It composes one function call (with an extremely complex API) with one
domain-specific language (that lots of people don't
know/understand/use) into one extremely simple but useful function.

> The very weak point here is if you want a default argument that
> specifies the gaps to split on rather than the words:
>
>  (define (string-split str [sep #px"\\s+"])
>    (remove* '("") (regexp-split sep str)))
>
> but that *does* use regexps, so I don't see the point, still...

Note that (string-split str ";") works given that implementation,
which I think makes it both easy-to-understand and useful.
-- 
sam th
sa...@ccs.neu.edu



Re: [racket-dev] `string-split'

2012-04-19 Thread Eli Barzilay
[Changed title to talk about each one separately.]

Two hours ago, Laurent wrote:
> One string function that I often find useful in various scripting
> languages is a `string-split' (explode in php).  It can be done with
> `regexp-split', but having something more along the lines of a
> `string-split' should belong to a racket/string lib I think.  Plus
> it would be symmetric with `string-join', which already is in
> racket/string (or at least a doc line pointing to regexp-split
> should be added there).

If you mean something like this:

  (define (string-split str) (regexp-match* #px"\\S+" str))

?

If so, then I see a much weaker point for it -- unlike other small
utilities, this one doesn't even compose two function calls.

The very weak point here is if you want a default argument that
specifies the gaps to split on rather than the words:

  (define (string-split str [sep #px"\\s+"])
    (remove* '("") (regexp-split sep str)))

but that *does* use regexps, so I don't see the point, still...
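
The two formulations (collect the words, or split on the gaps and discard empties) can be compared in a rough Python sketch. The function names are hypothetical, and `re.findall` and `re.split` are used here as stand-ins for `regexp-match*` and `regexp-split`.

```python
import re


def words_by_matching(s):
    # Collect maximal runs of non-whitespace characters.
    return re.findall(r"\S+", s)


def words_by_splitting(s, sep=r"\s+"):
    # Split on the gaps (sep is a regexp here), then drop empty strings.
    return [w for w in re.split(sep, s) if w != ""]


sample = "  foo bar\tbaz "
print(words_by_matching(sample))          # ['foo', 'bar', 'baz']
print(words_by_splitting(sample))         # ['foo', 'bar', 'baz']
print(words_by_splitting("a;b;;c", ";"))  # ['a', 'b', 'c']
```

With the default separator the two agree; the second form additionally accepts a different separator, at the cost of exposing a regexp in the interface.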

-- 
  ((lambda (x) (x x)) (lambda (x) (x x)))  Eli Barzilay:
http://barzilay.org/   Maze is Life!
