Re: [julia-users] Re: How to un-substring a string?!
fwiw, https://github.com/JuliaLang/julia/issues/12262 now contains a summary of this thread.

On Tuesday, 21 July 2015 22:03:39 UTC-3, andrew cooke wrote:

unfortunately, the semantics for the match don't seem to be the same. if you use a substring then ^ binds to the start of the substring. if you use match(...) with an offset then ^ binds to the start of the underlying string. at least, that's how i understand the following:

    julia> match(r"^\s", "a b")

    julia> match(r"^\s", "a b"[2:end])
    RegexMatch(" ")

    julia> match(r"^\s", "a b", 2)

    julia>

which ruins everything for me again. :o(

well, i guess not totally - i can still use the copy and copy again approach, but am making loads of huge copies.

andrew
Re: [julia-users] Re: How to un-substring a string?!
maybe it should check to see if the end matches?

@glen - i think that will construct an ascii string, which isn't what you want if the underlying data are unicode. i've always assumed string() does something smart and returns the right thing, but haven't checked...

andrew

On Tuesday, 21 July 2015 20:49:12 UTC-3, Yichao Yu wrote:

On Tue, Jul 21, 2015 at 7:38 PM, andrew cooke and...@acooke.org wrote:

ah. for some reason i was thinking they were invisible (somewhere below julia). ok, thanks. so that explains things more clearly ...except that(!) using SubString(s, i, endof(s)) and passing *that* to match still gives the memory issue.

Hmmm,

```
match(re::Regex, str::Union{ByteString,SubString}, idx::Integer, add_opts::UInt32=UInt32(0)) =
    match(re, utf8(str), idx, add_opts)
```

So match on a substring does a copy. I'm guessing this is because pcre expects a C string (i.e. NULL-terminated?)

so there's still something odd that i don't understand. maybe it's just that the regexp lib doesn't know about SubString.

andrew
Re: [julia-users] Re: How to un-substring a string?!
unfortunately, the semantics for the match don't seem to be the same. if you use a substring then ^ binds to the start of the substring. if you use match(...) with an offset then ^ binds to the start of the underlying string. at least, that's how i understand the following:

    julia> match(r"^\s", "a b")

    julia> match(r"^\s", "a b"[2:end])
    RegexMatch(" ")

    julia> match(r"^\s", "a b", 2)

    julia>

which ruins everything for me again. :o(

well, i guess not totally - i can still use the copy and copy again approach, but am making loads of huge copies.

andrew

On Tuesday, 21 July 2015 21:25:09 UTC-3, Yichao Yu wrote:

On Tue, Jul 21, 2015 at 8:19 PM, andrew cooke and...@acooke.org wrote:

maybe it should check to see if the end matches?

My guess is that this is an optimization that won't have too much benefit because in most cases you can just call `match` with a start idx.

@glen - i think that will construct an ascii string, which isn't what you want if the underlying data are unicode. i've always assumed string() does something smart and returns the right thing, but haven't checked...

And from the code I pasted, `utf` should probably do the copy you want.
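The anchoring difference reported in this message can be checked with a short script. This is only a sketch, and it assumes a current Julia 1.x release rather than the 0.4 trunk discussed in the thread. The `\G` pattern in the last line is not something anyone in the thread proposed: it is a PCRE anchor that binds to the start offset instead of the start of the string, shown here as one possible way to get substring-like anchoring without slicing. The variable names are made up for illustration.

```julia
s = "a b"

# ^ is anchored to the start of whatever string object match receives.
whole  = match(r"^\s", s)                # position 1 is 'a', so no match
sliced = match(r"^\s", SubString(s, 2))  # ^ binds to the start of the view
offset = match(r"^\s", s, 2)             # ^ still means the start of s

# Assumption: \G anchors at the start offset passed to match.
anchored = match(r"\G\s", s, 2)

println((whole, sliced, offset, anchored))
```

If `\G` behaves the same way on the Julia version in use, it would let the offset form of `match` stand in for the substring form in the loop from the original post.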
Re: [julia-users] Re: How to un-substring a string?!
yeah, i guess in this case that's what the lib wants, so you need to force conversion if you don't have utf8.

On Tuesday, 21 July 2015 21:25:09 UTC-3, Yichao Yu wrote:

And from the code I pasted, `utf` should probably do the copy you want.
Re: [julia-users] Re: How to un-substring a string?!
On Tue, Jul 21, 2015 at 8:19 PM, andrew cooke and...@acooke.org wrote:

maybe it should check to see if the end matches?

My guess is that this is an optimization that won't have too much benefit because in most cases you can just call `match` with a start idx.

@glen - i think that will construct an ascii string, which isn't what you want if the underlying data are unicode. i've always assumed string() does something smart and returns the right thing, but haven't checked...

And from the code I pasted, `utf` should probably do the copy you want.

andrew
[julia-users] Re: How to un-substring a string?!
string(bytestring(...)) seems to do it. would appreciate any more efficient solutions (and confirmation the analysis is correct - is this worth filing as an issue?)

On Tuesday, 21 July 2015 19:33:05 UTC-3, andrew cooke wrote:

well, this was fun...

the following code rapidly triggers the OOM killer on my machine (julia 0.4 trunk):

    s = repeat("a", 100)
    l = Any[]
    r = r"^\w"
    for i in 1:length(s)
        m = match(r, s[i:end])
        push!(l, m.match)
    end

note that: (1) the regexp is only matching one character, so the array l is at most a million characters long.

what i think is happening (but this is only a guess) is that s[i:end] is being passed through to the c level regexp library as a new string. the result (m.match) is then a substring into that. because the substring is kept around, the backing string cannot be collected. and so there's an n^2 memory use.

ideally, i don't think a new copy of the string should be passed to the regexp engine. maybe i am wrong?

anyway, for now, if the above is right, i need some way to copy m.match. as far as i can tell string() doesn't help. so what works? or am i wrong?

thanks,
andrew
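For anyone reading this on a newer Julia, the "un-substring" step might look like the sketch below. It assumes Julia 1.x, where `bytestring` no longer exists and `String(x)` makes an independent copy of a `SubString`; it has not been checked against the 0.4 trunk used above. The name `matches` is made up for illustration, and `SubString(s, i)` replaces `s[i:end]` so that the tail is a view rather than a fresh copy.

```julia
s = repeat("a", 100)
r = r"^\w"

matches = String[]
for i in eachindex(s)
    # SubString(s, i) is a view onto s from index i to the end.
    m = match(r, SubString(s, i))
    m === nothing && continue
    # String(...) copies only the matched characters, so the stored
    # value no longer refers to any larger backing string.
    push!(matches, String(m.match))
end

println(length(matches))
```

The point of the explicit `String(...)` call is that each stored element owns its own bytes, which is what `string(bytestring(...))` was being used for in this message.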
Re: [julia-users] Re: How to un-substring a string?!
does `copy` work? although `bytestring` also seems like a good method for this also.

it seems wrong to me also that `match` is making a copy of the original string (if that is indeed what it is doing)

On Tue, Jul 21, 2015 at 6:57 PM andrew cooke and...@acooke.org wrote:

string(bytestring(...)) seems to do it. would appreciate any more efficient solutions (and confirmation the analysis is correct - is this worth filing as an issue?)
Re: [julia-users] Re: How to un-substring a string?!
(i was quite impressed that reverse(reverse(...)) didn't help either).

On Tuesday, 21 July 2015 20:11:35 UTC-3, andrew cooke wrote:

deepcopy didn't. i haven't actually tried copy. hang on... [computer hangs; oom killer steps in]. nope!
Re: [julia-users] Re: How to un-substring a string?!
deepcopy didn't. i haven't actually tried copy. hang on... [computer hangs; oom killer steps in]. nope!

On Tuesday, 21 July 2015 20:08:33 UTC-3, Jameson wrote:

does `copy` work? although `bytestring` also seems like a good method for this also.

it seems wrong to me also that `match` is making a copy of the original string (if that is indeed what it is doing)
Re: [julia-users] Re: How to un-substring a string?!
On Tue, Jul 21, 2015 at 7:08 PM, Jameson Nash vtjn...@gmail.com wrote:

does `copy` work? although `bytestring` also seems like a good method for this also.

it seems wrong to me also that `match` is making a copy of the original string (if that is indeed what it is doing)

Isn't it `s[i:end]` that is doing the copy?
Re: [julia-users] Re: How to un-substring a string?!
i think that returns a substring (i.e. a view onto the backing string). but i am not sure. i did read a discussion somewhere saying that because of this you should use bytestring(...) before passing a string to c. which is all the evidence i have for my guess.

incidentally, match(...) has a method that takes the offset to start at as an argument. so i can avoid s[i:end] and just pass i into match (i just found this). however, somewhat surprisingly, it also has the same problem.

andrew

On Tuesday, 21 July 2015 20:15:58 UTC-3, Yichao Yu wrote:

On Tue, Jul 21, 2015 at 7:08 PM, Jameson Nash vtj...@gmail.com wrote:

does `copy` work? although `bytestring` also seems like a good method for this also.

it seems wrong to me also that `match` is making a copy of the original string (if that is indeed what it is doing)

Isn't it `s[i:end]` that is doing the copy?
Re: [julia-users] Re: How to un-substring a string?!
hmm. ignore that last statement (same problem). still checking / confused. sorry.

On Tuesday, 21 July 2015 20:20:46 UTC-3, andrew cooke wrote:

i think that returns a substring (i.e. a view onto the backing string). but i am not sure. i did read a discussion somewhere saying that because of this you should use bytestring(...) before passing a string to c. which is all the evidence i have for my guess.

incidentally, match(...) has a method that takes the offset to start at as an argument. so i can avoid s[i:end] and just pass i into match (i just found this). however, somewhat surprisingly, it also has the same problem.

andrew
Re: [julia-users] Re: How to un-substring a string?!
ok, so match(regex, string, index) solves the problem. presumably it exists exactly for this reason?

andrew

On Tuesday, 21 July 2015 20:23:57 UTC-3, andrew cooke wrote:

hmm. ignore that last statement (same problem). still checking / confused. sorry.
Re: [julia-users] Re: How to un-substring a string?!
On Tue, Jul 21, 2015 at 7:26 PM, andrew cooke and...@acooke.org wrote: ok, so match(regex, string, index) solves the problem. presumably it exists exactly for this reason? At least I think this is a valid usecase. andrew On Tuesday, 21 July 2015 20:23:57 UTC-3, andrew cooke wrote: hmm. ignore that last statement (same problem). still checking / confused. sorry. On Tuesday, 21 July 2015 20:20:46 UTC-3, andrew cooke wrote: i think that returns a substring (ir a view onto the backing string). ``` julia typeof(aaa[2:end]) ASCIIString julia SubString(aaa, 2, 3) aa julia typeof(SubString(aaa, 2, 3)) SubString{ASCIIString} ``` but i am not sure. i did read a discussion somewhere saying that because of this you should use bytestring(...) before passing a string to c. which is all the evidence i have for my guess. incidentally, match(...) has a method that takes the offset to start at as an argument. so i can avoid s[i:end] and just pass i into match (i just found this). however, somewhat surprisingly, it also has the same problem. andrew On Tuesday, 21 July 2015 20:15:58 UTC-3, Yichao Yu wrote: On Tue, Jul 21, 2015 at 7:08 PM, Jameson Nash vtj...@gmail.com wrote: does `copy` work? although `bytestring` also seems like a good method for this also. it seems wrong to me also that `match` is making a copy of the original string (if that is indeed what it is doing) Isn't it `s[i:end]` that is doing the copy? On Tue, Jul 21, 2015 at 6:57 PM andrew cooke and...@acooke.org wrote: string(bytestring(...)) seems to do it. would appreciate any more efficient solutions (and confirmation the analysis is correct - is this worth filing as an issue?) On Tuesday, 21 July 2015 19:33:05 UTC-3, andrew cooke wrote: well, this was fun... 
the following code rapidly triggers the OOM killer on my machine (julia 0.4 trunk):

    s = repeat("a", 1000000)
    l = Any[]
    r = r"^\w"
    for i in 1:length(s)
        m = match(r, s[i:end])
        push!(l, m.match)
    end

note that: (1) the regexp is only matching one character, so the array l is at most a million characters long.

what i think is happening (but this is only a guess) is that s[i:end] is being passed through to the c level regexp library as a new string. the result (m.match) is then a substring into that. because the substring is kept around, the backing string cannot be collected. and so there's an n^2 memory use.

ideally, i don't think a new copy of the string should be passed to the regexp engine. maybe i am wrong?

anyway, for now, if the above is right, i need some way to copy m.match. as far as i can tell string() doesn't help. so what works? or am i wrong?

thanks, andrew
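A minimal sketch of the copy being asked for, assuming a current Julia 1.x rather than the 0.4 trunk used in the thread. There `String(x)` plays the role that `bytestring(x)` had, and the field access `v.string` (the parent of a `SubString`) is an implementation detail, used only to show what stays reachable. The names `views` and `copies` are illustrative, and the input is kept small on purpose.

```julia
# Assumes Julia 1.x: String(x) replaces the 0.4-era bytestring(x).
s = repeat("a", 1_000)          # deliberately small input
r = r"^\w"

views  = SubString{String}[]    # what the original loop effectively stores
copies = String[]               # independent one-character strings

for i in 1:length(s)
    m = match(r, s[i:end])          # s[i:end] allocates a fresh String
    push!(views, m.match)           # keeps that whole String reachable
    push!(copies, String(m.match))  # copies only the matched bytes
end

# Bytes kept reachable by each approach (parent strings vs. copies).
retained_by_views  = sum(sizeof(v.string) for v in views)
retained_by_copies = sum(sizeof, copies)
println((retained_by_views, retained_by_copies))
```

With 1,000 characters the views keep 500,500 bytes of backing strings reachable, against 1,000 bytes for the copies, which is the n^2 growth described in the post.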
Re: [julia-users] Re: How to un-substring a string?!
ah. for some reason i was thinking they were invisible (somewhere below julia). ok, thanks. so that explains things more clearly

...except that(!) using SubString(s, i, endof(s)) and passing *that* to match still gives the memory issue. so there's still something odd that i don't understand. maybe it's just that the regexp lib doesn't know about SubString.

andrew

On Tuesday, 21 July 2015 20:32:53 UTC-3, Yichao Yu wrote:

On Tue, Jul 21, 2015 at 7:26 PM, andrew cooke and...@acooke.org wrote:

ok, so match(regex, string, index) solves the problem. presumably it exists exactly for this reason?

At least I think this is a valid use case.

andrew

On Tuesday, 21 July 2015 20:23:57 UTC-3, andrew cooke wrote:

hmm. ignore that last statement (same problem). still checking / confused. sorry.

On Tuesday, 21 July 2015 20:20:46 UTC-3, andrew cooke wrote:

i think that returns a substring (i.e. a view onto the backing string).

```
julia> typeof("aaa"[2:end])
ASCIIString

julia> SubString("aaa", 2, 3)
"aa"

julia> typeof(SubString("aaa", 2, 3))
SubString{ASCIIString}
```

but i am not sure. i did read a discussion somewhere saying that because of this you should use bytestring(...) before passing a string to c. which is all the evidence i have for my guess.

incidentally, match(...) has a method that takes the offset to start at as an argument. so i can avoid s[i:end] and just pass i into match (i just found this). however, somewhat surprisingly, it also has the same problem.

andrew

On Tuesday, 21 July 2015 20:15:58 UTC-3, Yichao Yu wrote:

On Tue, Jul 21, 2015 at 7:08 PM, Jameson Nash vtj...@gmail.com wrote:

does `copy` work? although `bytestring` also seems like a good method for this. it seems wrong to me also that `match` is making a copy of the original string (if that is indeed what it is doing)

Isn't it `s[i:end]` that is doing the copy?

On Tue, Jul 21, 2015 at 6:57 PM andrew cooke and...@acooke.org wrote:

string(bytestring(...)) seems to do it.
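For reference, the difference between the two forms in the REPL session quoted above can be sketched as follows, assuming Julia 1.x, where `ASCIIString` has been folded into `String`. The names `copied`, `sub` and `owned` are illustrative only.

```julia
s = "aaa"

copied = s[2:end]             # range indexing allocates a new String
sub    = SubString(s, 2, 3)   # a view that references s, no bytes copied
owned  = String(sub)          # turns the view back into an independent String

println(typeof(copied))       # String
println(typeof(sub))          # SubString{String}
println(typeof(owned))        # String
```

All three hold the same characters; only `sub` keeps a reference to `s`.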
Re: [julia-users] Re: How to un-substring a string?!
I've been using ascii().

On Tuesday, July 21, 2015 at 7:38:28 PM UTC-4, andrew cooke wrote:

ah. for some reason i was thinking they were invisible (somewhere below julia). ok, thanks. so that explains things more clearly

...except that(!) using SubString(s, i, endof(s)) and passing *that* to match still gives the memory issue. so there's still something odd that i don't understand. maybe it's just that the regexp lib doesn't know about SubString.

andrew

On Tuesday, 21 July 2015 20:32:53 UTC-3, Yichao Yu wrote:

On Tue, Jul 21, 2015 at 7:26 PM, andrew cooke and...@acooke.org wrote:

ok, so match(regex, string, index) solves the problem. presumably it exists exactly for this reason?

At least I think this is a valid use case.

andrew

On Tuesday, 21 July 2015 20:23:57 UTC-3, andrew cooke wrote:

hmm. ignore that last statement (same problem). still checking / confused. sorry.

On Tuesday, 21 July 2015 20:20:46 UTC-3, andrew cooke wrote:

i think that returns a substring (i.e. a view onto the backing string).

```
julia> typeof("aaa"[2:end])
ASCIIString

julia> SubString("aaa", 2, 3)
"aa"

julia> typeof(SubString("aaa", 2, 3))
SubString{ASCIIString}
```

but i am not sure. i did read a discussion somewhere saying that because of this you should use bytestring(...) before passing a string to c. which is all the evidence i have for my guess.

incidentally, match(...) has a method that takes the offset to start at as an argument. so i can avoid s[i:end] and just pass i into match (i just found this). however, somewhat surprisingly, it also has the same problem.

andrew

On Tuesday, 21 July 2015 20:15:58 UTC-3, Yichao Yu wrote:

On Tue, Jul 21, 2015 at 7:08 PM, Jameson Nash vtj...@gmail.com wrote:

does `copy` work? although `bytestring` also seems like a good method for this. it seems wrong to me also that `match` is making a copy of the original string (if that is indeed what it is doing)

Isn't it `s[i:end]` that is doing the copy?
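A small sketch of why `ascii()` may not be the right tool when the data is not plain ASCII, assuming Julia 1.x. `String(x)` copies any `SubString`, while `ascii(x)` is expected to raise an error on non-ASCII input. The sample text and the names `word`, `owned` and `ascii_ok` are made up for illustration.

```julia
word = split("naïve café")[1]   # a SubString{String} into the original text

owned = String(word)            # independent copy; any Unicode is fine

# ascii() only accepts pure-ASCII data, so record whether it succeeds here.
ascii_ok = try
    ascii(word)
    true
catch
    false
end

println(owned, " ", ascii_ok)
```

For text that is known to be ASCII, such as `split("plain text")[1]`, both calls return an ordinary `String`.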
Re: [julia-users] Re: How to un-substring a string?!
On Tue, Jul 21, 2015 at 7:38 PM, andrew cooke and...@acooke.org wrote:

ah. for some reason i was thinking they were invisible (somewhere below julia). ok, thanks. so that explains things more clearly

...except that(!) using SubString(s, i, endof(s)) and passing *that* to match still gives the memory issue.

Hmmm,

```
match(re::Regex, str::Union{ByteString,SubString}, idx::Integer,
      add_opts::UInt32=UInt32(0)) =
    match(re, utf8(str), idx, add_opts)
```

So match on a substring does a copy. I'm guessing this is because pcre expects a C string (i.e. NULL terminated?)

so there's still something odd that i don't understand. maybe it's just that the regexp lib doesn't know about SubString.

andrew

On Tuesday, 21 July 2015 20:32:53 UTC-3, Yichao Yu wrote:

On Tue, Jul 21, 2015 at 7:26 PM, andrew cooke and...@acooke.org wrote:

ok, so match(regex, string, index) solves the problem. presumably it exists exactly for this reason?

At least I think this is a valid use case.

andrew

On Tuesday, 21 July 2015 20:23:57 UTC-3, andrew cooke wrote:

hmm. ignore that last statement (same problem). still checking / confused. sorry.

On Tuesday, 21 July 2015 20:20:46 UTC-3, andrew cooke wrote:

i think that returns a substring (i.e. a view onto the backing string).

```
julia> typeof("aaa"[2:end])
ASCIIString

julia> SubString("aaa", 2, 3)
"aa"

julia> typeof(SubString("aaa", 2, 3))
SubString{ASCIIString}
```

but i am not sure. i did read a discussion somewhere saying that because of this you should use bytestring(...) before passing a string to c. which is all the evidence i have for my guess.

incidentally, match(...) has a method that takes the offset to start at as an argument. so i can avoid s[i:end] and just pass i into match (i just found this). however, somewhat surprisingly, it also has the same problem.

andrew

On Tuesday, 21 July 2015 20:15:58 UTC-3, Yichao Yu wrote:

On Tue, Jul 21, 2015 at 7:08 PM, Jameson Nash vtj...@gmail.com wrote:

does `copy` work?
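The offset form of `match` discussed in this thread can be sketched like this, assuming Julia 1.x. The pattern drops the `^` anchor on purpose, so the sketch does not depend on how `^` behaves with a start index. `v.string` is an internal field of `SubString`, used here only to show that every stored match has a parent the size of the original string.

```julia
s = repeat("a", 1_000)
r = r"\w"                 # no ^ anchor, see the note above

l = SubString{String}[]
for i in 1:length(s)
    m = match(r, s, i)    # search starts at index i; no s[i:end] is built
    push!(l, m.match)     # a view into the string passed to match
end

# Every stored view has a parent as large as the original string.
same_parent_size = all(v -> sizeof(v.string) == sizeof(s), l)
println(same_parent_size)
```

If the matches need to outlive `s`, wrapping each one in `String(m.match)` gives independent copies.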