[Rd] splitting strings efficiently
I have a very long list of strings. Each string contains multiple values separated by semicolons, and I need to turn each string into a vector of those values. I know I can do this laboriously with loops, nchar, and substr, but it is terribly slow. Is there a basic R function that handles this situation? If not, is there a faster way than what I am doing now, which is to lapply the following function?

Thanks,
Mark

###
string.tokenizer.func <- function(string, separator) {
    new.vec <- NULL
    newString <- ""
    if (is.null(string)) {
        new.vec <- ""
    } else {
        # walk the string one character at a time
        for (i in 1:(nchar(string) + 1)) {
            if (substr(string, i, i) == separator) {
                # separator found: store the token and start a new one
                new.vec <- c(new.vec, newString)
                newString <- ""
            } else {
                newString <- paste(newString, substr(string, i, i), sep = "")
            }
        }
        # store the final token
        new.vec <- c(new.vec, newString)
    }
    new.vec
}

Mark W. Kimpel MD
Neuroinformatics
Dept. of Psychiatry
Indiana University School of Medicine
15032 Hunter Court, Westfield, IN 46074
(317) 490-5129 Work, Mobile & VoiceMail
(317) 399-1219 Home
Skype: mkimpel

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] splitting strings efficiently
I knew there HAD to be a basic function, but help.search("split string") and help("string") did not find it. Thanks for the help on this elementary question.

Mark

On Wed, Sep 24, 2008 at 12:17 PM, Erik Iverson wrote:
> ?strsplit
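For readers landing on this thread, here is a minimal sketch of how strsplit() can stand in for the hand-written tokenizer. The sample vector `x` is invented for illustration and does not come from the original post.

```r
# Hypothetical sample data standing in for the poster's long list of strings.
x <- c("12;13;14", "15;16;17", "a;b")

# strsplit() is vectorised over its first argument and returns a list
# holding one character vector of tokens per input string.
pieces <- strsplit(x, ";")

# Tokens of the first string.
first <- pieces[[1]]
print(first)

# Number of tokens found in each string.
counts <- sapply(pieces, length)
print(counts)
```

One difference from the loop-based function is worth keeping in mind: strsplit() does not return an empty final token when the string ends with the separator, whereas the loop above does.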
Re: [Rd] splitting strings efficiently
Also, one can create a text connection and read it using read.table, scan, etc.

s <- c("12;13;14", "15;16;17")
read.table(textConnection(s), sep = ";")
# or
scan(textConnection(s), sep = ";")
Re: [Rd] function can permanently modify calling function via substitute?
On Wed, 24 Sep 2008, Peter Dalgaard wrote:

> Perry de Valpine wrote:
> > Dear R-devel:
> >
> > The following code seems to allow one function to permanently modify a
> > calling function. I did not expect this would be allowed (short of more
> > creative gymnastics) and wonder if it is really intended. (I can see
> > other ways to accomplish the intended task of this code [e.g. via
> > match.call instead of substitute below] that do not trigger the problem,
> > but I don't think that is the point.)
> >
> > do.nothing <- function(blah) {force(blah)}
> >
> > do.stuff.with.call <- function(mycall) {
> >   raw.mycall <- substitute(mycall);  # expected raw.mycall would be local
> >   print( sys.call() )
> >   # do.nothing( raw.mycall );  # See below re: commented lines.
> >   # .Call( "showNAMED", raw.mycall[[2]] )
> >   force( mycall );  # not relevant where (or whether) this is done
> >   raw.mycall[[2]] <- runif(1);  # permanently modifies try.me on the first time only
> >   # .Call( "showNAMED", raw.mycall[[2]] )
> >   raw.mycall
> > }
> >
> > gumbo <- function(x) {
> >   writeLines( paste( "gumbo : x =", x ) )
> >   return(x);
> > }
> >
> > try.me <- function() {
> >   one.val <- 111;
> >   one.ans <- do.stuff.with.call( mycall = gumbo( x = one.val ) );
> >   one.ans
> > }
> >
> > # after source()ing the above:
> > deparse(try.me)
> > [1] "function () "
> > [2] "{"
> > [3] "    one.val <- 111"
> > [4] "    one.ans <- do.stuff.with.call(mycall = gumbo(x = one.val))"
> > [5] "    one.ans"
> > [6] "}"
> >
> > try.me()
> > do.stuff.with.call(mycall = gumbo(x = one.val))
> > gumbo : x = 0.396524668671191
> > gumbo(x = 0.396524668671191)
> >
> > deparse(try.me)
> > [1] "function () "
> > [2] "{"
> > [3] "    one.val <- 111"
> > [4] "    one.ans <- do.stuff.with.call(mycall = gumbo(x = 0.396524668671191))"
> > [5] "    one.ans"
> > [6] "}"
> >
> > try.me()
> > do.stuff.with.call(mycall = gumbo(x = 0.396524668671191))
> > gumbo : x = 0.396524668671191
> > gumbo(x = 0.0078618151601404)
> >
> > deparse(try.me)
> > [1] "function () "
> > [2] "{"
> > [3] "    one.val <- 111"
> > [4] "    one.ans <- do.stuff.with.call(mycall = gumbo(x = 0.396524668671191))"
> > [5] "    one.ans"
> > [6] "}"
> >
> > So, after the first call of try.me(), do.stuff.with.call has permanently
> > replaced the name one.val in line 2 of try.me with a numeric (0.396...).
> > Subsequent calls from try.me to do.stuff.with.call now reflect that
> > change, but do.stuff.with.call does not modify the try.me object again.
> > (Note this means one needs to keep reloading try.me to investigate.)
> >
> > If this is a problem worth investigating, here are a couple of other
> > observations that may be relevant but are obviously speculative.
> >
> > 1. If the third line of do.stuff.with.call is uncommented (and try.me
> > also reloaded), the unexpected behavior does not occur. Since do.nothing
> > is eponymous, I was surprised because I believed it should not impact
> > any other behavior. Speculating with limited knowledge, I thought this
> > might implicate something that is supposed to stay under the hood, such
> > as the `call by value' illusion described in the R internals
> > documentation.
> >
> > 2. Poking slightly further, I looked at the NAMED values using this C
> > code via R CMD SHLIB and dyn.load:
> >
> > #include <R.h>
> > #include <Rdefines.h>
> > SEXP showNAMED(SEXP obj) {
> >   Rprintf("%i\n", NAMED(obj));
> >   return(R_NilValue);
> > }
> >
> > Uncommenting the .Call lines in do.stuff.with.call (with the do.nothing
> > line re-commented) reveals that on the first time do.stuff.with.call is
> > called from try.me, raw.mycall[[2]] has NAMED == 1 both before and after
> > the `[[<-` line. On subsequent calls it has NAMED == 2 before and
> > NAMED == 1 after. If I follow how NAMED is used, this seems relevant.
>
> Yes and no. This does sound like a bug and NAMED is likely involved, but
> I don't think raw.mycall[[2]] is the thing to look at. More likely, the
> issue is that raw.mycall itself has NAMED == 1, because otherwise the
> [[<- assignment would duplicate it first. This suggests that substitute
> has the bug.

Our extraction functions, like [[, bump up the NAMED value for components to the value for the container (or to 2; it doesn't look like we are consistent here). substitute() doesn't do that, and perhaps could. But arguably the extraction point is where the promise (from which substitute gets the expression) is created. We could have mkPromise test for NAMED == 2 and bump up if it isn't. We could also have parse create all LANGSXPs with NAMED == 2, but that leaves out programmatically created functions. Either change fixes this bug; I am not sure which is the best one (or whether we should do both). Changing mkPromise is more conservative and potentially a little more costly, but probably not enough to notice.

luke

--
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone: 319-335-3386
Department of Statistics and        Fax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email: [EMAIL PROTECTED]
Iowa City, IA 52242                 WWW:   http://www.stat.uiowa.edu
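A minimal sketch for checking whether a given R build shows the behavior discussed in this message. It is a pared-down version of the code quoted above (the printing and the .Call lines are dropped), and the variable names `before`, `result`, and `after` are introduced here for illustration only. On a build where the caller is protected from modification, one would expect the final comparison to be TRUE.

```r
# Pared-down version of the reported example.
do.stuff.with.call <- function(mycall) {
  raw.mycall <- substitute(mycall)   # unevaluated argument expression
  force(mycall)                      # evaluate the promise
  raw.mycall[[2]] <- runif(1)        # meant to change only the local copy
  raw.mycall
}

gumbo <- function(x) x

try.me <- function() {
  one.val <- 111
  do.stuff.with.call(mycall = gumbo(x = one.val))
}

# Compare the caller's body before and after one call.
before <- deparse(body(try.me))
result <- try.me()
after  <- deparse(body(try.me))

# TRUE means try.me was left untouched by do.stuff.with.call.
unchanged <- identical(before, after)
print(unchanged)
```

Because the reported problem shows up only on the first call after try.me is defined, the definitions need to be re-sourced before each check.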
Re: [Rd] splitting strings efficiently
For strsplit(), note that fixed=TRUE is much faster.

/HB
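A rough way to see the difference on one's own machine is sketched below. The test vector is made up for illustration, and the timings will vary with hardware and R version, so no particular numbers are claimed here.

```r
# Hypothetical test data: 10,000 copies of a semicolon-separated string.
x <- rep("12;13;14;15;16", 1e4)

# Default behavior treats the separator as a regular expression.
t.regex <- system.time(r1 <- strsplit(x, ";"))

# fixed = TRUE matches the separator literally, skipping the regex engine.
t.fixed <- system.time(r2 <- strsplit(x, ";", fixed = TRUE))

# For a plain separator such as ";" both calls give the same tokens.
same <- identical(r1, r2)
print(same)

# Elapsed seconds for each variant.
print(c(regex = t.regex[["elapsed"]], fixed = t.fixed[["elapsed"]]))
```

Apart from speed, fixed = TRUE also matters for correctness when the separator is a character with special meaning in regular expressions, such as "." or "|", since without it the separator is interpreted as a pattern.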