Re: [julia-users] Re: How to un-substring a string?!

2015-07-22 Thread andrew cooke

fwiw, https://github.com/JuliaLang/julia/issues/12262 now contains a 
summary of this thread.


On Tuesday, 21 July 2015 22:03:39 UTC-3, andrew cooke wrote:


 unfortunately, the semantics for the match don't seem to be the same.

 if you use a substring then ^ binds to the start of the substring.

 if you use match(...) with an offset then ^ binds to the start of the 
 underlying string.

 at least, that's how i understand the following:

 julia> match(r"^\s", "a b")

 julia> match(r"^\s", "a b"[2:end])
 RegexMatch(" ")

 julia> match(r"^\s", "a b", 2)

 julia>

 which ruins everything for me again.  :o(

 well, i guess not totally - i can still use the copy and copy again 
 approach, but am making loads of huge copies.

 andrew


 On Tuesday, 21 July 2015 21:25:09 UTC-3, Yichao Yu wrote:

 On Tue, Jul 21, 2015 at 8:19 PM, andrew cooke and...@acooke.org wrote: 
  
  maybe it should check to see if the end matches? 

 My guess, is that this is an optimization that won't have too much 
 benefit because in most cases you can just call `match` with a start 
 idx. 

  
  @glen - i think that will construct an ascii string, which isn't what 
 you 
  want if the underlying data are unicode.  i've always assumed string() 
 does 
  something smart and returns the right thing, but haven't checked... 

 And from the code I pasted, `utf8` should probably do the copy you want.

  
  andrew 
  
  
  On Tuesday, 21 July 2015 20:49:12 UTC-3, Yichao Yu wrote: 
  
  On Tue, Jul 21, 2015 at 7:38 PM, andrew cooke and...@acooke.org 
 wrote: 
   
   ah.  for some reason i was thinking they were invisible (somewhere 
 below 
   julia). 
   
   ok, thanks.  so that explains things more clearly 
   
   ...except that(!) using SubString(s, i, endof(s)) and passing *that* 
 to 
   match still gives the memory issue. 
  
  Hmmm, 
  
  ``` 
  match(re::Regex, str::Union{ByteString,SubString}, idx::Integer, 
  add_opts::UInt32=UInt32(0)) = 
  match(re, utf8(str), idx, add_opts) 
  ``` 
  
  So match on a substring does a copy. I'm guessing this is because pcre 
  expect a c string (i.e. NULL terminated?) 
  
   
   so there's still something odd that i don't understand.  maybe it's 
 just 
   that the regexp lib doesn't know about SubString. 
   
   andrew 
   
   
   
   On Tuesday, 21 July 2015 20:32:53 UTC-3, Yichao Yu wrote: 
   
   On Tue, Jul 21, 2015 at 7:26 PM, andrew cooke and...@acooke.org 
   wrote: 

ok, so match(regex, string, index) solves the problem. 
  presumably it 
exists 
exactly for this reason? 
   
   At least I think this is a valid usecase. 
   

andrew 


On Tuesday, 21 July 2015 20:23:57 UTC-3, andrew cooke wrote: 


hmm.  ignore that last statement (same problem).  still checking 
 / 
confused.  sorry. 

On Tuesday, 21 July 2015 20:20:46 UTC-3, andrew cooke wrote: 


i think that returns a substring (i.e. a view onto the backing
string). 
   
   ``` 
   julia> typeof("aaa"[2:end])
   ASCIIString

   julia> SubString("aaa", 2, 3)
   "aa"

   julia> typeof(SubString("aaa", 2, 3))
   SubString{ASCIIString}
   ``` 
   
but i am not sure.  i did read a discussion somewhere saying 
 that 
because of 
this you should use bytestring(...) before passing a string to 
 c. 
which is 
all the evidence i have for my guess. 

incidentally, match(...) has a method that takes the offset to 
start 
at 
as an argument.  so i can avoid s[i:end] and just pass i into 
 match 
(i 
just 
found this). 

however, somewhat surprisingly, it also has the same problem. 

andrew 


On Tuesday, 21 July 2015 20:15:58 UTC-3, Yichao Yu wrote: 

On Tue, Jul 21, 2015 at 7:08 PM, Jameson Nash 
 vtj...@gmail.com 
wrote: 
 does `copy` work? although `bytestring` also seems like a 
 good 
 method 
 for 
 this also. it seems wrong to me also that `match` is making 
 a 
 copy 
 of 
 the 
 original string (if that is indeed what it is doing) 

Isn't it `s[i:end]` that is doing the copy? 

 
 On Tue, Jul 21, 2015 at 6:57 PM andrew cooke 
 and...@acooke.org 
 wrote: 
 
 
 string(bytestring(...)) seems to do it.  would appreciate 
 any 
 more 
 efficient solutions (and confirmation the analysis is 
 correct - 
 is 
 this 
 worth filing as an issue?) 
 
 
 On Tuesday, 21 July 2015 19:33:05 UTC-3, andrew cooke 
 wrote: 
 
 
 well, this was fun...  the following code rapidly triggers 
 the 
 OOM 
 killer 
 on my machine (julia 0.4 trunk): 
 
 s = repeat("a", 100)
 l = Any[]
 r = r"^\w"
 
 for i in 1:length(s) 
 m = match(r, s[i:end]) 
 push!(l, m.match) 
 end 
 
 note that: (1) the regexp is only matching one character, 
 so 
 the 
 array l 
 is at most a million characters long. 
 
 what i think is 

Re: [julia-users] Re: How to un-substring a string?!

2015-07-21 Thread andrew cooke

maybe it should check to see if the end matches?

@glen - i think that will construct an ascii string, which isn't what you 
want if the underlying data are unicode.  i've always assumed string() does 
something smart and returns the right thing, but haven't checked...

andrew


On Tuesday, 21 July 2015 20:49:12 UTC-3, Yichao Yu wrote:

 On Tue, Jul 21, 2015 at 7:38 PM, andrew cooke and...@acooke.org wrote:
  
  ah.  for some reason i was thinking they were invisible (somewhere below 
  julia). 
  
  ok, thanks.  so that explains things more clearly 
  
  ...except that(!) using SubString(s, i, endof(s)) and passing *that* to 
  match still gives the memory issue. 

 Hmmm, 

 ``` 
 match(re::Regex, str::Union{ByteString,SubString}, idx::Integer, 
 add_opts::UInt32=UInt32(0)) = 
 match(re, utf8(str), idx, add_opts) 
 ``` 

 So match on a substring does a copy. I'm guessing this is because pcre 
 expect a c string (i.e. NULL terminated?) 

  
  so there's still something odd that i don't understand.  maybe it's just 
  that the regexp lib doesn't know about SubString. 
  
  andrew 
  
  
  
  On Tuesday, 21 July 2015 20:32:53 UTC-3, Yichao Yu wrote: 
  
  On Tue, Jul 21, 2015 at 7:26 PM, andrew cooke and...@acooke.org 
 wrote: 
   
   ok, so match(regex, string, index) solves the problem.  presumably it 
   exists 
   exactly for this reason? 
  
  At least I think this is a valid usecase. 
  
   
   andrew 
   
   
   On Tuesday, 21 July 2015 20:23:57 UTC-3, andrew cooke wrote: 
   
   
   hmm.  ignore that last statement (same problem).  still checking / 
   confused.  sorry. 
   
   On Tuesday, 21 July 2015 20:20:46 UTC-3, andrew cooke wrote: 
   
   
   i think that returns a substring (i.e. a view onto the backing
 string). 
  
  ``` 
  julia> typeof("aaa"[2:end])
  ASCIIString

  julia> SubString("aaa", 2, 3)
  "aa"

  julia> typeof(SubString("aaa", 2, 3))
  SubString{ASCIIString}
  ``` 
  
   but i am not sure.  i did read a discussion somewhere saying that 
   because of 
   this you should use bytestring(...) before passing a string to c. 
   which is 
   all the evidence i have for my guess. 
   
   incidentally, match(...) has a method that takes the offset to 
 start 
   at 
   as an argument.  so i can avoid s[i:end] and just pass i into match 
 (i 
   just 
   found this). 
   
   however, somewhat surprisingly, it also has the same problem. 
   
   andrew 
   
   
   On Tuesday, 21 July 2015 20:15:58 UTC-3, Yichao Yu wrote: 
   
   On Tue, Jul 21, 2015 at 7:08 PM, Jameson Nash vtj...@gmail.com 
   wrote: 
does `copy` work? although `bytestring` also seems like a good 
method 
for 
this also. it seems wrong to me also that `match` is making a 
 copy 
of 
the 
original string (if that is indeed what it is doing) 
   
   Isn't it `s[i:end]` that is doing the copy? 
   

On Tue, Jul 21, 2015 at 6:57 PM andrew cooke and...@acooke.org 

wrote: 


string(bytestring(...)) seems to do it.  would appreciate any 
 more 
efficient solutions (and confirmation the analysis is correct - 
 is 
this 
worth filing as an issue?) 


On Tuesday, 21 July 2015 19:33:05 UTC-3, andrew cooke wrote: 


well, this was fun...  the following code rapidly triggers the 
OOM 
killer 
on my machine (julia 0.4 trunk): 

s = repeat("a", 100)
l = Any[]
r = r"^\w"

for i in 1:length(s) 
m = match(r, s[i:end]) 
push!(l, m.match) 
end 

note that: (1) the regexp is only matching one character, so 
 the 
array l 
is at most a million characters long. 

what i think is happening (but this is only a guess) is that 
s[i:end] is 
being passed through to the c level regexp library as a new
string. 
the 
result (m.match) is then a substring into that.  because the 
substring is 
kept around, the backing string cannot be collected.  and so 
there's 
an n^2 
memory use. 

ideally, i don't think a new copy of the string should be 
 passed 
to 
the 
regexp engine.  maybe i am wrong? 

anyway, for now, if the above is right, i need some way to 
 copy 
m.match. 
as far as i can tell string() doesn't help.  so what works? 
  or 
am i 
wrong? 

thanks, 
andrew 



Re: [julia-users] Re: How to un-substring a string?!

2015-07-21 Thread andrew cooke

unfortunately, the semantics for the match don't seem to be the same.

if you use a substring then ^ binds to the start of the substring.

if you use match(...) with an offset then ^ binds to the start of the 
underlying string.

at least, that's how i understand the following:

julia> match(r"^\s", "a b")

julia> match(r"^\s", "a b"[2:end])
RegexMatch(" ")

julia> match(r"^\s", "a b", 2)

julia>

which ruins everything for me again.  :o(

well, i guess not totally - i can still use the copy and copy again 
approach, but am making loads of huge copies.
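
A minimal sketch of the difference described above, assuming a current Julia 1.x rather than the 0.4 trunk used in this thread. The variable names (`whole`, `via_view`, `via_offset`) are only for illustration and do not come from the thread.

```julia
s = "a b"

# ^ is only tried at the very start of the subject string, which is 'a'
whole = match(r"^\s", s)

# SubString is a view that begins at index 2, so ^ binds to the start of the view
via_view = match(r"^\s", SubString(s, 2))

# the offset form starts scanning at index 2, but ^ still means "start of s"
via_offset = match(r"^\s", s, 2)

println(whole === nothing)
println(via_view === nothing ? "no match" : repr(via_view.match))
println(via_offset === nothing)
```

So an offset is not a drop-in replacement for a substring when the pattern is anchored with ^.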

andrew


On Tuesday, 21 July 2015 21:25:09 UTC-3, Yichao Yu wrote:

 On Tue, Jul 21, 2015 at 8:19 PM, andrew cooke and...@acooke.org wrote:
  
  maybe it should check to see if the end matches? 

 My guess, is that this is an optimization that won't have too much 
 benefit because in most cases you can just call `match` with a start 
 idx. 

  
  @glen - i think that will construct an ascii string, which isn't what 
 you 
  want if the underlying data are unicode.  i've always assumed string() 
 does 
  something smart and returns the right thing, but haven't checked... 

 And from the code I pasted, `utf8` should probably do the copy you want.

  
  andrew 
  
  
  On Tuesday, 21 July 2015 20:49:12 UTC-3, Yichao Yu wrote: 
  
  On Tue, Jul 21, 2015 at 7:38 PM, andrew cooke and...@acooke.org 
 wrote: 
   
   ah.  for some reason i was thinking they were invisible (somewhere 
 below 
   julia). 
   
   ok, thanks.  so that explains things more clearly 
   
   ...except that(!) using SubString(s, i, endof(s)) and passing *that* 
 to 
   match still gives the memory issue. 
  
  Hmmm, 
  
  ``` 
  match(re::Regex, str::Union{ByteString,SubString}, idx::Integer, 
  add_opts::UInt32=UInt32(0)) = 
  match(re, utf8(str), idx, add_opts) 
  ``` 
  
  So match on a substring does a copy. I'm guessing this is because pcre 
  expect a c string (i.e. NULL terminated?) 
  
   
   so there's still something odd that i don't understand.  maybe it's 
 just 
   that the regexp lib doesn't know about SubString. 
   
   andrew 
   
   
   
   On Tuesday, 21 July 2015 20:32:53 UTC-3, Yichao Yu wrote: 
   
   On Tue, Jul 21, 2015 at 7:26 PM, andrew cooke and...@acooke.org 
   wrote: 

ok, so match(regex, string, index) solves the problem.  presumably 
 it 
exists 
exactly for this reason? 
   
   At least I think this is a valid usecase. 
   

andrew 


On Tuesday, 21 July 2015 20:23:57 UTC-3, andrew cooke wrote: 


hmm.  ignore that last statement (same problem).  still checking 
 / 
confused.  sorry. 

On Tuesday, 21 July 2015 20:20:46 UTC-3, andrew cooke wrote: 


i think that returns a substring (i.e. a view onto the backing
string). 
   
   ``` 
   julia> typeof("aaa"[2:end])
   ASCIIString

   julia> SubString("aaa", 2, 3)
   "aa"

   julia> typeof(SubString("aaa", 2, 3))
   SubString{ASCIIString}
   ``` 
   
but i am not sure.  i did read a discussion somewhere saying 
 that 
because of 
this you should use bytestring(...) before passing a string to 
 c. 
which is 
all the evidence i have for my guess. 

incidentally, match(...) has a method that takes the offset to 
start 
at 
as an argument.  so i can avoid s[i:end] and just pass i into 
 match 
(i 
just 
found this). 

however, somewhat surprisingly, it also has the same problem. 

andrew 


On Tuesday, 21 July 2015 20:15:58 UTC-3, Yichao Yu wrote: 

On Tue, Jul 21, 2015 at 7:08 PM, Jameson Nash vtj...@gmail.com 

wrote: 
 does `copy` work? although `bytestring` also seems like a 
 good 
 method 
 for 
 this also. it seems wrong to me also that `match` is making a 
 copy 
 of 
 the 
 original string (if that is indeed what it is doing) 

Isn't it `s[i:end]` that is doing the copy? 

 
 On Tue, Jul 21, 2015 at 6:57 PM andrew cooke 
 and...@acooke.org 
 wrote: 
 
 
 string(bytestring(...)) seems to do it.  would appreciate 
 any 
 more 
 efficient solutions (and confirmation the analysis is 
 correct - 
 is 
 this 
 worth filing as an issue?) 
 
 
 On Tuesday, 21 July 2015 19:33:05 UTC-3, andrew cooke wrote: 
 
 
 well, this was fun...  the following code rapidly triggers 
 the 
 OOM 
 killer 
 on my machine (julia 0.4 trunk): 
 
 s = repeat("a", 100)
 l = Any[]
 r = r"^\w"
 
 for i in 1:length(s) 
 m = match(r, s[i:end]) 
 push!(l, m.match) 
 end 
 
 note that: (1) the regexp is only matching one character, 
 so 
 the 
 array l 
 is at most a million characters long. 
 
 what i think is happening (but this is only a guess) is 
 that 
 s[i:end] is 
 being passed through to the c level regexp library as a new
 string. 
 the 
 

Re: [julia-users] Re: How to un-substring a string?!

2015-07-21 Thread andrew cooke

yeah, i guess in this case that's what the lib wants, so you need to force 
conversion if you don't have utf8.

On Tuesday, 21 July 2015 21:25:09 UTC-3, Yichao Yu wrote:

 And from the code I pasted, `utf8` should probably do the copy you want.



Re: [julia-users] Re: How to un-substring a string?!

2015-07-21 Thread Yichao Yu
On Tue, Jul 21, 2015 at 8:19 PM, andrew cooke and...@acooke.org wrote:

 maybe it should check to see if the end matches?

My guess, is that this is an optimization that won't have too much
benefit because in most cases you can just call `match` with a start
idx.


 @glen - i think that will construct an ascii string, which isn't what you
 want if the underlying data are unicode.  i've always assumed string() does
 something smart and returns the right thing, but haven't checked...

And from the code I pasted, `utf8` should probably do the copy you want.


 andrew


 On Tuesday, 21 July 2015 20:49:12 UTC-3, Yichao Yu wrote:

 On Tue, Jul 21, 2015 at 7:38 PM, andrew cooke and...@acooke.org wrote:
 
  ah.  for some reason i was thinking they were invisible (somewhere below
  julia).
 
  ok, thanks.  so that explains things more clearly
 
  ...except that(!) using SubString(s, i, endof(s)) and passing *that* to
  match still gives the memory issue.

 Hmmm,

 ```
 match(re::Regex, str::Union{ByteString,SubString}, idx::Integer,
 add_opts::UInt32=UInt32(0)) =
 match(re, utf8(str), idx, add_opts)
 ```

 So match on a substring does a copy. I'm guessing this is because pcre
 expect a c string (i.e. NULL terminated?)

 
  so there's still something odd that i don't understand.  maybe it's just
  that the regexp lib doesn't know about SubString.
 
  andrew
 
 
 
  On Tuesday, 21 July 2015 20:32:53 UTC-3, Yichao Yu wrote:
 
  On Tue, Jul 21, 2015 at 7:26 PM, andrew cooke and...@acooke.org
  wrote:
  
   ok, so match(regex, string, index) solves the problem.  presumably it
   exists
   exactly for this reason?
 
  At least I think this is a valid usecase.
 
  
   andrew
  
  
   On Tuesday, 21 July 2015 20:23:57 UTC-3, andrew cooke wrote:
  
  
   hmm.  ignore that last statement (same problem).  still checking /
   confused.  sorry.
  
   On Tuesday, 21 July 2015 20:20:46 UTC-3, andrew cooke wrote:
  
  
   i think that returns a substring (i.e. a view onto the backing
   string).
 
  ```
  julia> typeof("aaa"[2:end])
  ASCIIString

  julia> SubString("aaa", 2, 3)
  "aa"

  julia> typeof(SubString("aaa", 2, 3))
  SubString{ASCIIString}
  ```
 
   but i am not sure.  i did read a discussion somewhere saying that
   because of
   this you should use bytestring(...) before passing a string to c.
   which is
   all the evidence i have for my guess.
  
   incidentally, match(...) has a method that takes the offset to
   start
   at
   as an argument.  so i can avoid s[i:end] and just pass i into match
   (i
   just
   found this).
  
   however, somewhat surprisingly, it also has the same problem.
  
   andrew
  
  
   On Tuesday, 21 July 2015 20:15:58 UTC-3, Yichao Yu wrote:
  
   On Tue, Jul 21, 2015 at 7:08 PM, Jameson Nash vtj...@gmail.com
   wrote:
does `copy` work? although `bytestring` also seems like a good
method
for
this also. it seems wrong to me also that `match` is making a
copy
of
the
original string (if that is indeed what it is doing)
  
   Isn't it `s[i:end]` that is doing the copy?
  
   
On Tue, Jul 21, 2015 at 6:57 PM andrew cooke and...@acooke.org
wrote:
   
   
string(bytestring(...)) seems to do it.  would appreciate any
more
efficient solutions (and confirmation the analysis is correct -
is
this
worth filing as an issue?)
   
   
On Tuesday, 21 July 2015 19:33:05 UTC-3, andrew cooke wrote:
   
   
well, this was fun...  the following code rapidly triggers the
OOM
killer
on my machine (julia 0.4 trunk):
   
s = repeat("a", 100)
l = Any[]
r = r"^\w"
   
for i in 1:length(s)
m = match(r, s[i:end])
push!(l, m.match)
end
   
note that: (1) the regexp is only matching one character, so
the
array l
is at most a million characters long.
   
what i think is happening (but this is only a guess) is that
s[i:end] is
being passed through to the c level regexp library as a new
string.
the
result (m.match) is then a substring into that.  because the
substring is
kept around, the backing string cannot be collected.  and so
there's
an n^2
memory use.
   
ideally, i don't think a new copy of the string should be
passed
to
the
regexp engine.  maybe i am wrong?
   
anyway, for now, if the above is right, i need some way to
copy
m.match.
as far as i can tell string() doesn't help.  so what works?
or
am i
wrong?
   
thanks,
andrew


[julia-users] Re: How to un-substring a string?!

2015-07-21 Thread andrew cooke

string(bytestring(...)) seems to do it.  would appreciate any more 
efficient solutions (and confirmation the analysis is correct - is this 
worth filing as an issue?)
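
A sketch of the same workaround in Julia 1.x terms. This is an assumption about how one would write it today, not something from the thread: `bytestring` and `ASCIIString` no longer exist there, `m.match` is a `SubString{String}`, and calling `String(...)` on a SubString allocates a new, independent string.

```julia
s = repeat("a", 100)
l = String[]
r = r"^\w"

for i in eachindex(s)
    # SubString is a view, so s[i:end] is never copied
    m = match(r, SubString(s, i))
    m === nothing && continue
    # copy only the matched text, so the stored value does not keep a view alive
    push!(l, String(m.match))
end

println(length(l))
```

Each element of `l` is then a one-character `String` of its own rather than a view into a larger backing string.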

On Tuesday, 21 July 2015 19:33:05 UTC-3, andrew cooke wrote:


 well, this was fun...  the following code rapidly triggers the OOM killer 
 on my machine (julia 0.4 trunk):

 s = repeat("a", 100)
 l = Any[]
 r = r"^\w"

 for i in 1:length(s)
 m = match(r, s[i:end])
 push!(l, m.match)
 end

 note that: (1) the regexp is only matching one character, so the array l 
 is at most a million characters long.

 what i think is happening (but this is only a guess) is that s[i:end] is 
 being passed through to the c level regexp library as a new string.  the
 result (m.match) is then a substring into that.  because the substring is 
 kept around, the backing string cannot be collected.  and so there's an n^2 
 memory use.

 ideally, i don't think a new copy of the string should be passed to the 
 regexp engine.  maybe i am wrong?

 anyway, for now, if the above is right, i need some way to copy m.match.  
 as far as i can tell string() doesn't help.  so what works?  or am i wrong?

 thanks,
 andrew



Re: [julia-users] Re: How to un-substring a string?!

2015-07-21 Thread Jameson Nash
does `copy` work? although `bytestring` also seems like a good method for
this also. it seems wrong to me also that `match` is making a copy of the
original string (if that is indeed what it is doing)

On Tue, Jul 21, 2015 at 6:57 PM andrew cooke and...@acooke.org wrote:


 string(bytestring(...)) seems to do it.  would appreciate any more
 efficient solutions (and confirmation the analysis is correct - is this
 worth filing as an issue?)


 On Tuesday, 21 July 2015 19:33:05 UTC-3, andrew cooke wrote:


 well, this was fun...  the following code rapidly triggers the OOM killer
 on my machine (julia 0.4 trunk):

 s = repeat("a", 100)
 l = Any[]
 r = r"^\w"

 for i in 1:length(s)
 m = match(r, s[i:end])
 push!(l, m.match)
 end

 note that: (1) the regexp is only matching one character, so the array l
 is at most a million characters long.

 what i think is happening (but this is only a guess) is that s[i:end] is
 being passed through to the c level regexp library as a new string.  the
 result (m.match) is then a substring into that.  because the substring is
 kept around, the backing string cannot be collected.  and so there's an n^2
 memory use.

 ideally, i don't think a new copy of the string should be passed to the
 regexp engine.  maybe i am wrong?

 anyway, for now, if the above is right, i need some way to copy m.match.
 as far as i can tell string() doesn't help.  so what works?  or am i wrong?

 thanks,
 andrew




Re: [julia-users] Re: How to un-substring a string?!

2015-07-21 Thread andrew cooke

(i was quite impressed that reverse(reverse(...)) didn't help either).

On Tuesday, 21 July 2015 20:11:35 UTC-3, andrew cooke wrote:


 deepcopy didn't.  i haven't actually tried copy.  hang on...  [computer 
 hangs; oom killer steps in].  nope!

 On Tuesday, 21 July 2015 20:08:33 UTC-3, Jameson wrote:

 does `copy` work? although `bytestring` also seems like a good method for 
 this also. it seems wrong to me also that `match` is making a copy of the 
 original string (if that is indeed what it is doing)

 On Tue, Jul 21, 2015 at 6:57 PM andrew cooke and...@acooke.org wrote:


 string(bytestring(...)) seems to do it.  would appreciate any more 
 efficient solutions (and confirmation the analysis is correct - is this 
 worth filing as an issue?)


 On Tuesday, 21 July 2015 19:33:05 UTC-3, andrew cooke wrote:


 well, this was fun...  the following code rapidly triggers the OOM 
 killer on my machine (julia 0.4 trunk):

 s = repeat("a", 100)
 l = Any[]
 r = r"^\w"

 for i in 1:length(s)
 m = match(r, s[i:end])
 push!(l, m.match)
 end

 note that: (1) the regexp is only matching one character, so the array 
 l is at most a million characters long.

 what i think is happening (but this is only a guess) is that s[i:end] 
 is being passed through to the c level regexp library as a new string.  the
 result (m.match) is then a substring into that.  because the substring is 
 kept around, the backing string cannot be collected.  and so there's an 
 n^2 
 memory use.

 ideally, i don't think a new copy of the string should be passed to the 
 regexp engine.  maybe i am wrong?

 anyway, for now, if the above is right, i need some way to copy 
 m.match.  as far as i can tell string() doesn't help.  so what works?  or 
 am i wrong?

 thanks,
 andrew



Re: [julia-users] Re: How to un-substring a string?!

2015-07-21 Thread andrew cooke

deepcopy didn't.  i haven't actually tried copy.  hang on...  [computer 
hangs; oom killer steps in].  nope!

On Tuesday, 21 July 2015 20:08:33 UTC-3, Jameson wrote:

 does `copy` work? although `bytestring` also seems like a good method for 
 this also. it seems wrong to me also that `match` is making a copy of the 
 original string (if that is indeed what it is doing)

 On Tue, Jul 21, 2015 at 6:57 PM andrew cooke and...@acooke.org wrote:


 string(bytestring(...)) seems to do it.  would appreciate any more 
 efficient solutions (and confirmation the analysis is correct - is this 
 worth filing as an issue?)


 On Tuesday, 21 July 2015 19:33:05 UTC-3, andrew cooke wrote:


 well, this was fun...  the following code rapidly triggers the OOM 
 killer on my machine (julia 0.4 trunk):

 s = repeat("a", 100)
 l = Any[]
 r = r"^\w"

 for i in 1:length(s)
 m = match(r, s[i:end])
 push!(l, m.match)
 end

 note that: (1) the regexp is only matching one character, so the array l 
 is at most a million characters long.

 what i think is happening (but this is only a guess) is that s[i:end] is 
 being passed through to the c level regexp library as a new string.  the
 result (m.match) is then a substring into that.  because the substring is 
 kept around, the backing string cannot be collected.  and so there's an n^2 
 memory use.

 ideally, i don't think a new copy of the string should be passed to the 
 regexp engine.  maybe i am wrong?

 anyway, for now, if the above is right, i need some way to copy 
 m.match.  as far as i can tell string() doesn't help.  so what works?  or 
 am i wrong?

 thanks,
 andrew



Re: [julia-users] Re: How to un-substring a string?!

2015-07-21 Thread Yichao Yu
On Tue, Jul 21, 2015 at 7:08 PM, Jameson Nash vtjn...@gmail.com wrote:
 does `copy` work? although `bytestring` also seems like a good method for
 this also. it seems wrong to me also that `match` is making a copy of the
 original string (if that is indeed what it is doing)

Isn't it `s[i:end]` that is doing the copy?


 On Tue, Jul 21, 2015 at 6:57 PM andrew cooke and...@acooke.org wrote:


 string(bytestring(...)) seems to do it.  would appreciate any more
 efficient solutions (and confirmation the analysis is correct - is this
 worth filing as an issue?)


 On Tuesday, 21 July 2015 19:33:05 UTC-3, andrew cooke wrote:


 well, this was fun...  the following code rapidly triggers the OOM killer
 on my machine (julia 0.4 trunk):

 s = repeat("a", 100)
 l = Any[]
 r = r"^\w"

 for i in 1:length(s)
 m = match(r, s[i:end])
 push!(l, m.match)
 end

 note that: (1) the regexp is only matching one character, so the array l
 is at most a million characters long.

 what i think is happening (but this is only a guess) is that s[i:end] is
 being passed through to the c level regexp library as a new string.  the
 result (m.match) is then a substring into that.  because the substring is
 kept around, the backing string cannot be collected.  and so there's an n^2
 memory use.

 ideally, i don't think a new copy of the string should be passed to the
 regexp engine.  maybe i am wrong?

 anyway, for now, if the above is right, i need some way to copy m.match.
 as far as i can tell string() doesn't help.  so what works?  or am i wrong?

 thanks,
 andrew


Re: [julia-users] Re: How to un-substring a string?!

2015-07-21 Thread andrew cooke

i think that returns a substring (i.e. a view onto the backing string).  but
i am not sure.  i did read a discussion somewhere saying that because of 
this you should use bytestring(...) before passing a string to c. which is 
all the evidence i have for my guess.
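
A small sketch of the copy-versus-view distinction being guessed at here, assuming Julia 1.x (where the concrete type is `String` rather than `ASCIIString`). The names `copied` and `viewed` are made up for the example.

```julia
s = "aaa"

# indexing with a range builds a new, independent string
copied = s[2:end]

# SubString wraps the original string together with an offset and a length
viewed = SubString(s, 2, 3)

println(typeof(copied))
println(typeof(viewed))
println(copied == viewed)
```

The two compare equal as text, but only `viewed` keeps a reference to the backing string `s`.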

incidentally, match(...) has a method that takes the offset to start at as 
an argument.  so i can avoid s[i:end] and just pass i into match (i just 
found this).

however, somewhat surprisingly, it also has the same problem.

andrew


On Tuesday, 21 July 2015 20:15:58 UTC-3, Yichao Yu wrote:

 On Tue, Jul 21, 2015 at 7:08 PM, Jameson Nash vtj...@gmail.com wrote:
  does `copy` work? although `bytestring` also seems like a good method 
 for 
  this also. it seems wrong to me also that `match` is making a copy of 
 the 
  original string (if that is indeed what it is doing) 

 Isn't it `s[i:end]` that is doing the copy? 

  
  On Tue, Jul 21, 2015 at 6:57 PM andrew cooke and...@acooke.org wrote:
  
  
  string(bytestring(...)) seems to do it.  would appreciate any more 
  efficient solutions (and confirmation the analysis is correct - is this 
  worth filing as an issue?) 
  
  
  On Tuesday, 21 July 2015 19:33:05 UTC-3, andrew cooke wrote: 
  
  
  well, this was fun...  the following code rapidly triggers the OOM 
 killer 
  on my machine (julia 0.4 trunk): 
  
  s = repeat("a", 100)
  l = Any[]
  r = r"^\w"
  
  for i in 1:length(s) 
  m = match(r, s[i:end]) 
  push!(l, m.match) 
  end 
  
  note that: (1) the regexp is only matching one character, so the array 
 l 
  is at most a million characters long. 
  
  what i think is happening (but this is only a guess) is that s[i:end] 
 is 
  being passed through to the c level regexp library as a new string.
  the 
  result (m.match) is then a substring into that.  because the substring 
 is 
  kept around, the backing string cannot be collected.  and so there's 
 an n^2 
  memory use. 
  
  ideally, i don't think a new copy of the string should be passed to 
 the 
  regexp engine.  maybe i am wrong? 
  
  anyway, for now, if the above is right, i need some way to copy 
 m.match. 
  as far as i can tell string() doesn't help.  so what works?  or am i 
 wrong? 
  
  thanks, 
  andrew 



Re: [julia-users] Re: How to un-substring a string?!

2015-07-21 Thread andrew cooke

hmm.  ignore that last statement (same problem).  still checking / 
confused.  sorry.

On Tuesday, 21 July 2015 20:20:46 UTC-3, andrew cooke wrote:


 i think that returns a substring (i.e. a view onto the backing string).  but
 i am not sure.  i did read a discussion somewhere saying that because of 
 this you should use bytestring(...) before passing a string to c. which is 
 all the evidence i have for my guess.

 incidentally, match(...) has a method that takes the offset to start at as 
 an argument.  so i can avoid s[i:end] and just pass i into match (i just 
 found this).

 however, somewhat surprisingly, it also has the same problem.

 andrew


 On Tuesday, 21 July 2015 20:15:58 UTC-3, Yichao Yu wrote:

 On Tue, Jul 21, 2015 at 7:08 PM, Jameson Nash vtj...@gmail.com wrote: 
  does `copy` work? although `bytestring` also seems like a good method 
 for 
  this also. it seems wrong to me also that `match` is making a copy of 
 the 
  original string (if that is indeed what it is doing) 

 Isn't it `s[i:end]` that is doing the copy? 

  
  On Tue, Jul 21, 2015 at 6:57 PM andrew cooke and...@acooke.org 
 wrote: 
  
  
  string(bytestring(...)) seems to do it.  would appreciate any more 
  efficient solutions (and confirmation the analysis is correct - is 
 this 
  worth filing as an issue?) 
  
  
  On Tuesday, 21 July 2015 19:33:05 UTC-3, andrew cooke wrote: 
  
  
  well, this was fun...  the following code rapidly triggers the OOM 
 killer 
  on my machine (julia 0.4 trunk): 
  
  s = repeat("a", 100)
  l = Any[]
  r = r"^\w"
  
  for i in 1:length(s) 
  m = match(r, s[i:end]) 
  push!(l, m.match) 
  end 
  
  note that: (1) the regexp is only matching one character, so the 
 array l 
  is at most a million characters long. 
  
  what i think is happening (but this is only a guess) is that s[i:end] 
 is 
  being passed through to the c level regexp library as a new string.
  the 
  result (m.match) is then a substring into that.  because the 
 substring is 
  kept around, the backing string cannot be collected.  and so there's 
 an n^2 
  memory use. 
  
  ideally, i don't think a new copy of the string should be passed to 
 the 
  regexp engine.  maybe i am wrong? 
  
  anyway, for now, if the above is right, i need some way to copy 
 m.match. 
  as far as i can tell string() doesn't help.  so what works?  or am i 
 wrong? 
  
  thanks, 
  andrew 



Re: [julia-users] Re: How to un-substring a string?!

2015-07-21 Thread andrew cooke

ok, so match(regex, string, index) solves the problem.  presumably it 
exists exactly for this reason?
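
A sketch of the offset form of the loop, assuming Julia 1.x. The pattern here deliberately has no `^`, and the `m.offset != i` check is an illustrative stand-in for anchoring, not something proposed in the thread.

```julia
s = repeat("a", 100)
r = r"\w"            # no ^: with an offset, ^ would still refer to the start of s
l = String[]

for i in eachindex(s)
    # start scanning at index i without building s[i:end]
    m = match(r, s, i)
    # keep only matches that begin exactly at i, emulating an anchored match
    (m === nothing || m.offset != i) && continue
    push!(l, String(m.match))
end

println(length(l))
```

Whether this behaves the same as matching against a substring depends on the pattern, so it is worth checking anchored patterns separately.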

andrew

On Tuesday, 21 July 2015 20:23:57 UTC-3, andrew cooke wrote:


 hmm.  ignore that last statement (same problem).  still checking / 
 confused.  sorry.

 On Tuesday, 21 July 2015 20:20:46 UTC-3, andrew cooke wrote:


 i think that returns a substring (i.e. a view onto the backing string).
 but i am not sure.  i did read a discussion somewhere saying that because 
 of this you should use bytestring(...) before passing a string to c. which 
 is all the evidence i have for my guess.

 incidentally, match(...) has a method that takes the offset to start at 
 as an argument.  so i can avoid s[i:end] and just pass i into match (i just 
 found this).

 however, somewhat surprisingly, it also has the same problem.

 andrew


 On Tuesday, 21 July 2015 20:15:58 UTC-3, Yichao Yu wrote:

 On Tue, Jul 21, 2015 at 7:08 PM, Jameson Nash vtj...@gmail.com wrote: 
  does `copy` work? although `bytestring` also seems like a good method for
  this also. it seems wrong to me also that `match` is making a copy of the
  original string (if that is indeed what it is doing)

 Isn't it `s[i:end]` that is doing the copy? 

  
  On Tue, Jul 21, 2015 at 6:57 PM andrew cooke and...@acooke.org 
 wrote: 
  
  
  string(bytestring(...)) seems to do it.  would appreciate any more
  efficient solutions (and confirmation the analysis is correct - is this
  worth filing as an issue?)
  
  



Re: [julia-users] Re: How to un-substring a string?!

2015-07-21 Thread Yichao Yu
On Tue, Jul 21, 2015 at 7:26 PM, andrew cooke and...@acooke.org wrote:

 ok, so match(regex, string, index) solves the problem.  presumably it exists
 exactly for this reason?

At least I think this is a valid usecase.


 andrew


 On Tuesday, 21 July 2015 20:23:57 UTC-3, andrew cooke wrote:


 hmm.  ignore that last statement (same problem).  still checking /
 confused.  sorry.

 On Tuesday, 21 July 2015 20:20:46 UTC-3, andrew cooke wrote:


 i think that returns a substring (i.e. a view onto the backing string).

```
julia> typeof("aaa"[2:end])
ASCIIString

julia> SubString("aaa", 2, 3)
"aa"

julia> typeof(SubString("aaa", 2, 3))
SubString{ASCIIString}
```

 but i am not sure.  i did read a discussion somewhere saying that because of
 this you should use bytestring(...) before passing a string to c. which is
 all the evidence i have for my guess.

 incidentally, match(...) has a method that takes the offset to start at
 as an argument.  so i can avoid s[i:end] and just pass i into match (i just
 found this).

 however, somewhat surprisingly, it also has the same problem.

 andrew
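
The REPL session earlier in this message is from Julia 0.4. A rough equivalent for current Julia (1.x), where `ASCIIString` has been folded into `String`, could look like the following. The variable names are only for illustration.

```julia
s = "aaa"
sliced = s[2:end]            # range indexing allocates a new string
viewed = SubString(s, 2, 3)  # a view that refers back to s

println(typeof(sliced))
println(typeof(viewed))
println(viewed == sliced)
```

The point carries over unchanged: range indexing copies, and only an explicit `SubString` gives a view onto the backing string.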




Re: [julia-users] Re: How to un-substring a string?!

2015-07-21 Thread andrew cooke

ah.  for some reason i was thinking they were invisible (somewhere below 
julia).

ok, thanks.  so that explains things more clearly

...except that(!) using SubString(s, i, endof(s)) and passing *that* to 
match still gives the memory issue.

so there's still something odd that i don't understand.  maybe it's just 
that the regexp lib doesn't know about SubString.

andrew






Re: [julia-users] Re: How to un-substring a string?!

2015-07-21 Thread Glen H
I've been using ascii().




Re: [julia-users] Re: How to un-substring a string?!

2015-07-21 Thread Yichao Yu
On Tue, Jul 21, 2015 at 7:38 PM, andrew cooke and...@acooke.org wrote:

 ah.  for some reason i was thinking they were invisible (somewhere below
 julia).

 ok, thanks.  so that explains things more clearly

 ...except that(!) using SubString(s, i, endof(s)) and passing *that* to
 match still gives the memory issue.

Hmmm,

```
match(re::Regex, str::Union{ByteString,SubString}, idx::Integer,
      add_opts::UInt32=UInt32(0)) =
    match(re, utf8(str), idx, add_opts)
```

So match on a substring does a copy. I'm guessing this is because PCRE
expects a C string (i.e. NUL-terminated)?


 so there's still something odd that i don't understand.  maybe it's just
 that the regexp lib doesn't know about SubString.

 andrew
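
For comparison, a sketch of the same call pattern on current Julia (1.x). As far as I know the `utf8(str)` conversion shown above is gone there and `match` accepts a `SubString{String}` directly, but treat that as an assumption to check against your own version. The sample text is invented, and `lastindex` stands in for the old `endof`.

```julia
s = "a b"
v = SubString(s, 2, lastindex(s))   # view of " b", no copy of the characters

m = match(r"^\s", v)                # ^ anchors at the start of the view
println(m === nothing ? "no match" : repr(m.match))
```

Whether the result then holds on to a copy or to the original string depends on the Julia version, so it is worth measuring memory use before relying on this.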


