Re: [R] vectorization of loops in R
Hi,

besides tapply and aggregate, split with *apply could be used:

    sapply(with(df, split(z, y)), mean)

Cheers
Petr

> -----Original Message-----
> From: R-help On Behalf Of Luigi Marongiu
> Sent: Wednesday, November 17, 2021 2:21 PM
> To: r-help
> Subject: [R] vectorization of loops in R
>
> The problem is that I have one million `y` values, so the work takes
> almost a day. I understand that vectorization will speed up the
> procedure. But how shall I write the procedure in vectorial terms?

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] vectorization of loops in R
Have a look at the base functions tapply and aggregate. For example, see:

- https://cran.r-project.org/doc/manuals/r-release/R-intro.html#The-function-tapply_0028_0029-and-ragged-arrays
- https://online.stat.psu.edu/stat484/lesson/9/9.2
- or ?tapply and ?aggregate.

Also your current code seems to contain an error: `s = df[df$y == i,]` should be `s = df$z[df$y == i]` (and then `mean(s)` rather than `mean(s$z)`), I think.

HTH,
Jan

On 17-11-2021 14:20, Luigi Marongiu wrote:
> The problem is that I have one million `y` values, so the work takes
> almost a day. I understand that vectorization will speed up the
> procedure. But how shall I write the procedure in vectorial terms?
Re: [R] vectorization of loops in R
If I follow what you are trying to do, you want the mean of z for each value of y.

    tapply(df$z, df$y, mean)

> On Nov 17, 2021, at 8:20 AM, Luigi Marongiu wrote:
>
> I have a dataframe with 3 variables. I want to loop through it to get
> the mean value of the variable `z`
>
> The problem is that I have one million `y` values, so the work takes
> almost a day. I understand that vectorization will speed up the
> procedure. But how shall I write the procedure in vectorial terms?

--
Kevin E. Thorpe
Head of Biostatistics, Applied Health Research Centre (AHRC)
Li Ka Shing Knowledge Institute of St. Michael’s Hospital
Assistant Professor, Dalla Lana School of Public Health
University of Toronto
email: kevin.tho...@utoronto.ca Tel: 416.864.5776 Fax: 416.864.3016
[R] vectorization of loops in R
Hello,
I have a dataframe with 3 variables. I want to loop through it to get
the mean value of the variable `z`, as follows:
```
df = data.frame(x = c(rep(1,5), rep(2,5), rep(3,5)),
                y = rep(letters[1:5],3),
                z = rnorm(15),
                stringsAsFactors = FALSE)
m = vector()
for (i in unique(df$y)) {
  s = df[df$y == i,]
  m = append(m, mean(s$z))
}
names(m) = unique(df$y)
> (m)
         a          b          c          d          e
-0.6355382 -0.4218053 -0.7256680 -0.8320783 -0.2587004
```
The problem is that I have one million `y` values, so the work takes
almost a day. I understand that vectorization will speed up the
procedure. But how shall I write the procedure in vectorial terms?
Thank you
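For reference, a minimal sketch that runs the three suggestions from the replies (tapply, aggregate, split plus sapply) next to the loop from the post. The seed and the object names `m_loop`, `m_tapply`, `m_split` and `m_agg` are assumptions added for illustration only. No timings are shown; how much faster the grouped calls are on a million groups would have to be measured.

```r
set.seed(42)  # assumed seed, only so the run is reproducible
df <- data.frame(x = rep(1:3, each = 5),
                 y = rep(letters[1:5], 3),
                 z = rnorm(15),
                 stringsAsFactors = FALSE)

# Loop as in the post: one logical scan of df$y per unique value of y
m_loop <- vector("numeric")
for (i in unique(df$y)) {
  m_loop <- append(m_loop, mean(df$z[df$y == i]))
}
names(m_loop) <- unique(df$y)

# Grouped alternatives from the replies
m_tapply <- tapply(df$z, df$y, mean)                 # named 1-d array
m_split  <- sapply(split(df$z, df$y), mean)          # named numeric vector
m_agg    <- aggregate(z ~ y, data = df, FUN = mean)  # data frame with columns y and z

print(m_tapply)
```

All three group the data once instead of rescanning `df$y` for every group, which is where the loop spends its time when there are many distinct `y` values.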
Re: [R] Vectorization in a random order
I think you answered your own question. For loops are not a boogeyman... poor memory management is.

Algorithms that are sensitive to evaluation sequence are often not very re-usable, and certainly not parallelizable. If you have a specific algorithm in mind, there may be some advice we can give you about optimization, but as it stands I think you know how to get a working implementation.
--
Sent from my phone. Please excuse my brevity.

On November 10, 2016 5:06:07 AM PST, Thomas Chesney wrote:
>Is there a way to use vectorization where the elements are evaluated in
>a random order?
Re: [R] Vectorization in a random order
You are mistaken. apply() is *not* vectorized. It is a disguised loop.

For true vectorization at the C level, the answer must be no, as the whole point is to treat the argument as a whole object and hide the iterative details. However, as you indicated, you can always manually randomize the indexing that is being iterated over and even write a function to do it if you like; e.g. (warning: essentially untested and probably clumsy as well as buggy)

    randapply <- function(X, MARGIN, FUN, ...)
    {
       d <- dim(X)
       ix <- as.list(rep(TRUE, length(d)))
       for (i in MARGIN) ix[[i]] <- sample(seq_len(d[i]), d[i])
       X <- do.call("[", c(list(X), ix))
       apply(X, MARGIN, FUN, ...)
    }

    > a <- array(1:24, dim = 2:4)
    > randapply(a, 3, mean)
    [1]  9.5 21.5 15.5  3.5
    > randapply(a, 3, mean)
    [1] 21.5  3.5 15.5  9.5

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Thu, Nov 10, 2016 at 5:06 AM, Thomas Chesney wrote:
> Is there a way to use vectorization where the elements are evaluated in a
> random order?
Re: [R] Vectorization in a random order
    nBuyMat <- data.frame(matrix(rnorm(28), 7, 4))
    nBuyMat
    nBuy <- nrow(nBuyMat)
    sample(1:nBuy, nBuy, replace=FALSE)
    sample(1:nBuy)
    sample(nBuy)
    ?sample
    apply(nBuyMat[sample(1:nBuy,nBuy, replace=FALSE),], 1, function(x) sum(x))
    apply(nBuyMat[sample(nBuy),], 1, function(x) sum(x))

The defaults for sample do what you have requested. If the original row identification matters, then be sure to use either a matrix with rownames or a data.frame.

Rich

Sent from my iPhone

> On Nov 10, 2016, at 08:06, Thomas Chesney wrote:
>
> for (b in sample(1:nBuy,nBuy, replace=FALSE)){
>
> }
>
> but
>
> apply(nBuyMat, 1, function(x))
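A minimal sketch of the point about row identification, assuming a small matrix with made-up row names (`agent1` to `agent7`) and an arbitrary seed. It visits the rows in a shuffled order and then uses the row names to put the results back in the original order.

```r
set.seed(1)  # assumed seed for reproducibility
nBuy <- 7
nBuyMat <- matrix(rnorm(nBuy * 4), nrow = nBuy,
                  dimnames = list(paste0("agent", seq_len(nBuy)), NULL))

ord <- sample(nBuy)   # random permutation of 1..nBuy

# apply() still loops internally; it simply visits the rows in shuffled order
shuffled <- apply(nBuyMat[ord, , drop = FALSE], 1, sum)

# Row names travel with the rows, so results can be mapped back
restored <- shuffled[rownames(nBuyMat)]
```

For a pure function such as `sum` the per-row result does not depend on the visiting order; the order only matters when the function has side effects, for example when each agent updates state that later agents read.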
[R] Vectorization in a random order
Is there a way to use vectorization where the elements are evaluated in a random order?

For instance, if the code is to be run on each row of a matrix with nBuy rows, the following will do the job

    for (b in sample(1:nBuy,nBuy, replace=FALSE)){

    }

but

    apply(nBuyMat, 1, function(x))

will be run, I believe, in the same order each time (Row1, then Row2, then Row3 etc.)

This is important for building agent based models (the classic explanation of this is probably Huberman & Glance's response to Nowak & May's 1992 Nature article - Evolutionary games and computer simulations, http://www.pnas.org/content/90/16/7716.abstract)

Thank you,

Thomas
http://www.nottingham.ac.uk/~liztc/Personal/index.html
Re: [R] vectorization of rolling function
Great, many thanks for your help Jeff. Apologies for the HTML format, I'll be more careful next time.

Arnaud

On 08/12/2014 08:25, Jeff Newmiller wrote:
> In many cases, loops can be vectorized. However, near as I can tell this
> is an example of an algorithm that simply needs a loop [1].
>
> [1] http://stackoverflow.com/questions/7153586/can-i-vectorize-a-calculation-which-depends-on-previous-elements
Re: [R] vectorization of rolling function
Please don't post in HTML... you may not recognize it, but the receiving end does not necessarily (and in this case did not) look like the sending end, and the cleanup can impede answers you are hoping to get.

In many cases, loops can be vectorized. However, near as I can tell this is an example of an algorithm that simply needs a loop [1].

One bit of advice: the coredata function is horribly slow. Just converting your time series objects to numeric vectors for the purpose of this computation sped up the algorithm by 500x on 1 point series. Converting it to inline C++ as below sped it up by yet another factor of 40x. 2x is nothing to sneeze at.

    ###
    ## optional temporary setup for windows
    ## assumes you have installed Rtools
    gcc <- "C:\\Rtools\\bin"
    rtools <- "C:\\Rtools\\gcc-4.6.3\\bin"
    path <- strsplit(Sys.getenv("PATH"), ";")[[1]]
    new_path <- c(rtools, gcc, path)
    new_path <- new_path[!duplicated(tolower(new_path))]
    Sys.setenv(PATH = paste(new_path, collapse = ";"))
    ## end of optional

    library(Rcpp)

    cppFunction( "
    DataFrame EvapSimRcpp( NumericVector RR
                         , NumericVector ETmax
                         , const double Smax
                         , const double initialStorage ) {
      int n = RR.size();
      // create empty time-series to fill
      // effective rainfall (i.e. rainfall minus intercepted rainfall)
      NumericVector RReff( n );
      // intercepted rainfall
      NumericVector Rint( n );
      // residual potential evapotranspiration (ie ETmax minus
      // evaporation from interception)
      NumericVector ETres( n, NA_REAL );
      double evap;
      // volume of water in interception storage at start of
      // computation
      double storage = initialStorage;
      for ( int i=0; i<n; i++ ) {
        // compute interception capacity for time step i (maximum
        // interception capacity minus any water intercepted but not
        // evaporated during previous time-step).
        Rint[ i ] = Smax - storage;
        // compute intercepted rainfall: equal to rainfall if smaller
        // than interception capacity, and to interception capacity if
        // larger.
        if ( RR[ i ] < Rint[ i ] ) Rint[ i ] = RR[ i ];
        // compute effective rainfall (rainfall minus intercepted
        // rainfall).
        RReff[ i ] = RR[ i ] - Rint[ i ];
        // update interception storage: initial interception storage +
        // intercepted rainfall.
        storage = storage + Rint[ i ];
        // compute evaporation from interception storage: equal to
        // potential evapotranspiration if the latter is smaller than
        // interception storage, and to interception storage if larger.
        if ( storage > ETmax[ i ] )
          evap = ETmax[ i ];
        else
          evap = storage;
        // compute residual potential evapotranspiration: potential
        // evapotranspiration minus evaporation from interception
        // storage.
        ETres[ i ] = ETmax[ i ] - evap;
        // update interception storage, to be carried over to next
        // time-step: interception storage minus evaporation from
        // interception storage.
        storage = storage - evap;
      }
      DataFrame DF = DataFrame::create( Named( \"int\" ) = Rint
                                      , Named( \"RReff\" ) = RReff
                                      , Named( \"ETres\" ) = ETres );
      return DF;
    }
    " )

    # Assumes your initial variables are already defined
    EvapSimRcpp( RR, ETmax, Smax, 0 )
    ###

[1] http://stackoverflow.com/questions/7153586/can-i-vectorize-a-calculation-which-depends-on-previous-elements

On Sat, 6 Dec 2014, A Duranel wrote:
> It uses a loop on zoo time-series of rainfall and potential
> evapotranspiration. Unfortunately I did not find a way to vectorize it
> and it takes ages to run on long datasets. Could anybody help me to make
> it run faster?
[R] vectorization of rolling function
Hello

I use R to run a simple model of rainfall interception by vegetation: rainfall falls on vegetation, some is retained by the vegetation (part of which can evaporate), the rest falls on the ground (quite crude but very similar to those used in SWAT or MikeSHE, for the hydrologists among you). It uses a loop on zoo time-series of rainfall and potential evapotranspiration. Unfortunately I did not find a way to vectorize it and it takes ages to run on long datasets. Could anybody help me to make it run faster?

    library(zoo)
    set.seed(1)
    # artificial potential evapotranspiration time-series
    ETmax <- zoo(runif(10, min=1, max=6), c(1:10))
    # artificial rainfall time-series
    RR <- zoo(runif(10, min=0, max=6), c(1:10))

    ## create empty time-series to fill
    # effective rainfall (i.e. rainfall minus intercepted rainfall)
    RReff <- zoo(NA, c(1:10))
    # intercepted rainfall
    int <- zoo(NA, c(1:10))
    # residual potential evapotranspiration (ie ETmax minus evaporation
    # from interception)
    ETres <- zoo(NA, c(1:10))

    # define maximum interception storage capacity (maximum volume of
    # rainfall that can be intercepted per time step, provided the
    # interception store is empty at start of time-step)
    Smax <- 3
    # volume of water in interception storage at start of computation
    storage <- 0

    for (i in 1:length(ETmax)) {
      # compute interception capacity for time step i (maximum
      # interception capacity minus any water intercepted but not
      # evaporated during previous time-step).
      int[i] <- Smax - storage
      # compute intercepted rainfall: equal to rainfall if smaller than
      # interception capacity, and to interception capacity if larger.
      if (RR[i] < int[i]) int[i] <- RR[i]
      # compute effective rainfall (rainfall minus intercepted rainfall).
      RReff[i] <- RR[i] - int[i]
      # update interception storage: initial interception storage +
      # intercepted rainfall.
      storage <- storage + coredata(int[i])
      # compute evaporation from interception storage: equal to potential
      # evapotranspiration if the latter is smaller than interception
      # storage, and to interception storage if larger.
      if (storage > coredata(ETmax[i])) evap <- coredata(ETmax[i]) else evap <- storage
      # compute residual potential evapotranspiration: potential
      # evapotranspiration minus evaporation from interception storage.
      ETres[i] <- ETmax[i] - evap
      # update interception storage, to be carried over to next time-step:
      # interception storage minus evaporation from interception storage.
      storage <- storage - evap
    }

Many thanks for your help!

Arnaud
UCL Department of Geography, UK

--
View this message in context: http://r.789695.n4.nabble.com/vectorization-of-rolling-function-tp4700487.html
Sent from the R help mailing list archive at Nabble.com.
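A minimal sketch, in plain R, of the advice given in this thread to run the computation on numeric vectors rather than on zoo objects. The function name `evap_sim` and its argument names are assumptions made for illustration; the loop itself follows the steps of the post, with each `if` written as `min()`. The speed-up quoted in the thread is not reproduced here and would need to be measured on real data.

```r
# Hypothetical helper: same bookkeeping as the loop in the post,
# but on plain numeric vectors with preallocated outputs
evap_sim <- function(RR, ETmax, Smax, initial_storage = 0) {
  n <- length(RR)
  int   <- numeric(n)   # intercepted rainfall
  RReff <- numeric(n)   # effective rainfall
  ETres <- numeric(n)   # residual potential evapotranspiration
  storage <- initial_storage
  for (i in seq_len(n)) {
    # intercepted rainfall is capped by the free interception capacity
    int[i]   <- min(RR[i], Smax - storage)
    RReff[i] <- RR[i] - int[i]
    storage  <- storage + int[i]
    # evaporation is capped by what is currently held in storage
    evap     <- min(ETmax[i], storage)
    ETres[i] <- ETmax[i] - evap
    storage  <- storage - evap
  }
  data.frame(int = int, RReff = RReff, ETres = ETres)
}

set.seed(1)
ETmax <- runif(10, min = 1, max = 6)   # plain vectors stand in for the zoo series
RR    <- runif(10, min = 0, max = 6)
res <- evap_sim(RR, ETmax, Smax = 3)
```

With zoo input, the data would be extracted once before the loop (for example with `as.numeric(coredata(RR))`) and the results wrapped back into zoo objects afterwards, so that no zoo indexing happens inside the loop.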
[R] vectorization
Hi. I saw this example and I cannot begin to figure out how it works. Can anyone give me an idea on this?

    n = 9e6
    df = data.frame(values = rnorm(n),
                    ID = rep(LETTERS[1:3], each = n/3),
                    stringsAsFactors = FALSE)

    head(df)
          values ID
    1 -0.7355823  A
    2 -0.4729925  A
    3 -0.7417259  A
    4  1.7633367  A
    5 -0.3006790  A
    6  0.6785947  A

The idea is to replace all occurrences of A by 'Text for A'. He does this:

    translator_vector = c(A = 'Text for A', B = 'Text for B', C = 'Text for C')

and subsets this vector using df$ID:

    dum_vectorized = translator_vector[df$ID]

It works but I have no idea why.
Thank you.
Re: [R] vectorization
Hi

everything is written in the docs. However, this example is a little tricky. Each df$ID matches the name of an item in translator_vector, and [] selects this matched item.

It is similar to

    x <- sample(1:3, 10, replace = TRUE)
    translator_vector[x]

Regards
Petr

> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-project.org] On Behalf Of Bill
> Sent: Wednesday, January 29, 2014 12:41 PM
> To: r-help@r-project.org
> Subject: [R] vectorization
>
> translator_vector = c(A = 'Text for A', B = 'Text for B', C = 'Text for C')
>
> and subset this vector using df$ID:
>
> dum_vectorized = translator_vector[df$ID]
>
> It works but I have no idea why.
Re: [R] vectorization
On 14-01-29 6:41 AM, Bill wrote:
> The idea is to replace all occurrences of A by 'Text for A'. He does this:
>
> translator_vector = c(A = 'Text for A', B = 'Text for B', C = 'Text for C')
>
> and subset this vector using df$ID:
>
> dum_vectorized = translator_vector[df$ID]
>
> It works but I have no idea why.

He is indexing by name. The translator_vector looks like this:

               A            B            C
    "Text for A" "Text for B" "Text for C"

The first element is named A, the second B, the third C. So translator_vector["A"] is the same as translator_vector[1].

The ID column in your dataframe is a vector of strings to be used as names, so each one pulls out one element from the translator_vector.

Duncan Murdoch
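A minimal sketch of the name-based lookup described above, on a five-element stand-in for `df$ID`. The names `ids`, `by_name` and `by_pos` are assumptions for illustration; `match()` is used only to spell out what indexing with a character vector amounts to.

```r
translator_vector <- c(A = "Text for A", B = "Text for B", C = "Text for C")
ids <- c("A", "C", "C", "B", "A")   # stands in for df$ID

# A character index is matched against the names of the vector,
# one lookup per element of ids, all done in a single call
by_name <- translator_vector[ids]

# The same lookup spelled out: find each id's position, then index by position
by_pos <- translator_vector[match(ids, names(translator_vector))]

# A name that is not present yields NA rather than an error
missing_id <- translator_vector["Z"]

unname(by_name)
```

The result has one element per entry of `ids`, which is why the same call works unchanged on the nine-million-row `df$ID` column.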
Re: [R] vectorization
Oh wow, I guess I get it! Thank you. It is pretty tricky but I saw that it works very fast.
Re: [R] vectorization modifying globals in functions
In current versions of R the apply functions do not gain much (if any) in speed over a well-written for loop (for loops are much more efficient than they used to be). Using global variables could actually slow things down a little for what you are doing: if you use `<<-` then R has to search through multiple environments to find which variable to replace.

In general you should avoid using global variables. It is best to pass all needed variables into a function as arguments, do any modifications inside the function on local copies, then return the modified local copy from the function (you can use a list if you want to return multiple variables).

Since each iteration of your code depends on the previous iteration, vectorizing is not going to help (or even be reasonable). If you want to speed up the code then you might consider a compiled option; see the inline or Rcpp packages (or others).

-- Gregory (Greg) L. Snow Ph.D. 538...@gmail.com
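A sketch of the "pass it in, return the modified copy" style recommended above, applied to the update step from the question. The function name `step`, its `delta` default, and the `moved` flag are assumptions made for illustration; they are not from the thread.

```r
# Hypothetical helper: one update step that receives the state as an
# argument and returns the modified copy instead of touching a global.
step <- function(d, delta = 1) {
  a <- sample.int(length(d), size = 2)
  moved <- d[a[1]] >= delta
  if (moved) {
    d[a[1]] <- d[a[1]] - 1
    d[a[2]] <- d[a[2]] + 1
  }
  # A list lets the function hand back more than one value
  list(d = d, moved = moved)
}

set.seed(1)
d <- rep(10, 10)
for (i in 1:100) {
  res <- step(d)
  d <- res$d          # the caller decides what to keep
}
sum(d)                # each step moves one unit, so the total stays at 100
```

The loop is still a loop, which matches the point that this recurrence cannot be vectorized; the gain is that the function no longer depends on hidden state.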
Re: [R] vectorization modifying globals in functions
On Dec 27, 2012, at 12:38 PM, Sam Steingold wrote:

> I have the following code:
>
> --8<---cut here---start--->8---
> d <- rep(10,10)
> for (i in 1:100) {
>   a <- sample.int(length(d), size = 2)
>   if (d[a[1]] >= 1) {
>     d[a[1]] <- d[a[1]] - 1
>     d[a[2]] <- d[a[2]] + 1
>   }
> }
> --8<---cut here---end--->8---
>
> it does what I want, i.e., modified vector d 100 times.
>
> Now, if I want to repeat this 1e6 times instead of 1e2 times, I want to
> vectorize it for speed, so I do this:

You could get some modest improvement by vectorizing the two lookups, additions, and assignments into one:

  d[a] <- d[a] - c(1,-1)

In a test with 10 iterations, it yields about a 1.693/1.394 - 1 = 21 percent improvement.

> --8<---cut here---start--->8---
> update <- function (i) {
>   a <- sample.int(n.agents, size = 2)
>   if (d[a[1]] >= delta) {
>     d[a[1]] <- d[a[1]] - 1
>     d[a[2]] <- d[a[2]] + 1
>   }
>   entropy(d, unit="log2")

The `unit` seems likely to throw an error since there is no argument for it to match.

> }
> system.time(entropy.history <- sapply(1:1e6,update))
> --8<---cut here---end--->8---
>
> however, the global d is not modified, apparently update modifies the
> local copy.

You could have returned 'd' and the entropy result as a list. But what would be the point of saving 1e6 copies?

> so,
> 1. is there a way for a function to modify a global variable?

So if you replaced it in the global environment, you would only be seeing the result of the last iteration of the loop. What's the use of that?

> 2. how would you vectorize this loop?
>
> thanks!

-- David Winsemius, MD Alameda, CA, USA
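A small check that the single-assignment form suggested above does the same thing as the two separate updates. The fixed index pair `a` is an assumption chosen for reproducibility; in the original it comes from sample.int(), which samples without replacement, so the two indices always differ.

```r
d1 <- d2 <- rep(10, 10)
a <- c(3, 7)                      # assumed index pair, a[1] != a[2]

# Original form: two lookups, two assignments
d1[a[1]] <- d1[a[1]] - 1
d1[a[2]] <- d1[a[2]] + 1

# Suggested form: one vectorized lookup and assignment
d2[a] <- d2[a] - c(1, -1)

identical(d1, d2)
```

The equivalence depends on the two indices being distinct; with a repeated index the vectorized form would assign twice to the same element and only the last value would stick.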
[R] vectorization modifying globals in functions
I have the following code:

--8<---cut here---start--->8---
d <- rep(10,10)
for (i in 1:100) {
  a <- sample.int(length(d), size = 2)
  if (d[a[1]] >= 1) {
    d[a[1]] <- d[a[1]] - 1
    d[a[2]] <- d[a[2]] + 1
  }
}
--8<---cut here---end--->8---

it does what I want, i.e., modified vector d 100 times.

Now, if I want to repeat this 1e6 times instead of 1e2 times, I want to vectorize it for speed, so I do this:

--8<---cut here---start--->8---
update <- function (i) {
  a <- sample.int(n.agents, size = 2)
  if (d[a[1]] >= delta) {
    d[a[1]] <- d[a[1]] - 1
    d[a[2]] <- d[a[2]] + 1
  }
  entropy(d, unit="log2")
}
system.time(entropy.history <- sapply(1:1e6,update))
--8<---cut here---end--->8---

however, the global d is not modified, apparently update modifies the local copy.

so,
1. is there a way for a function to modify a global variable?
2. how would you vectorize this loop?

thanks!

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://honestreporting.com http://pmw.org.il http://www.PetitionOnline.com/tap12009/
A number problem solved with floats turns into 1.9998 problems.
Re: [R] vectorization modifying globals in functions
At Thu, 27 Dec 2012 15:38:08 -0500, Sam Steingold wrote:

> so,
> 1. is there a way for a function to modify a global variable?

Use <<- instead of <-.

> 2. how would you vectorize this loop?

This is hard. Your function has a feedback loop: an iteration depends on the previous iteration's result. A for loop is about as good as you can do in this case. sapply might help a bit, but it is really just a for loop in disguise.

Since sample.int is used to generate indexes, you might try to generate a bunch of indexes, take as many as don't overlap (i.e., collect all orthogonal updates) and do all of those updates at once. If you really need the entropy after every iteration, however, then this won't work for you either.

Neal
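A minimal illustration of the difference between the two assignment operators mentioned above. The function names `bump_local` and `bump_outer` are made up for this sketch.

```r
d <- rep(10, 10)

# `<-` inside a function modifies a local copy; the outer d is untouched
bump_local <- function(i) {
  d[i] <- d[i] + 1
  invisible(NULL)
}

# `<<-` assigns into the enclosing environment where d already exists
bump_outer <- function(i) {
  d[i] <<- d[i] + 1
  invisible(NULL)
}

bump_local(2)
after_local <- d[2]    # still 10

bump_outer(2)
after_outer <- d[2]    # now 11
```

This answers the literal question, but the other replies in the thread argue against relying on it: passing state in and returning it is usually easier to reason about.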
Re: [R] vectorization modifying globals in functions
You can use environments. Have a look at this discussion:
http://stackoverflow.com/questions/7439110/what-is-the-difference-between-parent-frame-and-parent-env-in-r-how-do-they
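One way the environment suggestion above might look in practice, as a sketch only. The names `state` and `update_state` are assumptions; the linked discussion covers parent.frame() and parent.env(), which this example does not need.

```r
# Environments are passed by reference, so a function can modify their
# contents without `<<-` and without returning a copy.
state <- new.env()
state$d <- rep(10, 10)

# Hypothetical helper operating on the shared environment
update_state <- function(env) {
  a <- sample.int(length(env$d), size = 2)
  if (env$d[a[1]] >= 1) {
    env$d[a[1]] <- env$d[a[1]] - 1
    env$d[a[2]] <- env$d[a[2]] + 1
  }
  invisible(env)
}

set.seed(42)
for (i in 1:100) update_state(state)
sum(state$d)   # one unit moves per step, so the total stays at 100
```

Compared with `<<-`, the mutable state is explicit in the function signature, which makes it clearer which object a call can change.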
[R] vectorization condition counting
Hi all,

I am working on a really big dataset and I would like to vectorize a condition in an if statement inside a loop to improve speed. The original loop with the condition is currently written as follows:

  if (sum(as.integer(tags$tag_id == tags$tag_id[i])) == 1 && tags$lgth[i] < 300) {
    tags$stage[i] <- "J"
  }

Do you have some ideas? I was unable to do it correctly.

Thanking you in advance for your help,
Guillaume
Re: [R] vectorization condition counting
Your sum(tag_id==tag_id[i])==1, meaning tag_id[i] is the only entry with its value, may be vectorized by the sneaky idiom

  !(duplicated(tag_id,fromLast=FALSE) | duplicated(tag_id,fromLast=TRUE))

Hence f0() (with your code in a loop) and f1() are equivalent:

  f0 <- function (tags) {
    for (i in seq_len(nrow(tags))) {
      if (sum(tags$tag_id == tags$tag_id[i]) == 1 && tags$lgth[i] < 300) {
        tags$stage[i] <- "J"
      }
    }
    tags
  }
  f1 <- function (tags) {
    needsChanging <- with(tags,
      !(duplicated(tag_id, fromLast = FALSE) | duplicated(tag_id, fromLast = TRUE)) & lgth < 300)
    tags$stage[needsChanging] <- "J"
    tags
  }

E.g.,

  someTags <- data.frame(tag_id = c(1, 2, 2, 3, 4, 5, 6, 6),
                         lgth = 50*(1:8),
                         stage = factor(rep(".",8), levels=c(".","J")))
  all.equal(f0(someTags), f1(someTags))
  [1] TRUE
  f1(someTags)
    tag_id lgth stage
  1      1   50     J
  2      2  100     .
  3      2  150     .
  4      3  200     J
  5      4  250     J
  6      5  300     .
  7      6  350     .
  8      6  400     .

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
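To see why the duplicated() idiom above picks out the values that occur exactly once, it may help to compare it with the per-element count it replaces. This sketch reuses the tag_id and lgth values from the example; the names `is_unique`, `by_count` and `needs_changing` are assumptions.

```r
tag_id <- c(1, 2, 2, 3, 4, 5, 6, 6)
lgth   <- 50 * (1:8)

# duplicated() flags the 2nd, 3rd, ... occurrence of a value; scanning from
# both ends flags every member of a repeated group, so negating the union
# leaves the values that occur exactly once.
is_unique <- !(duplicated(tag_id) | duplicated(tag_id, fromLast = TRUE))

# The loop-style definition: count matches for each element (quadratic work)
by_count <- vapply(tag_id, function(v) sum(tag_id == v) == 1, logical(1))

identical(is_unique, by_count)

# Combined with the length condition, using the vectorized `&`
needs_changing <- is_unique & lgth < 300
which(needs_changing)
```

The count-based version compares every element against every other, which is what makes the original loop slow on a large data set; duplicated() does the same job in a single pass per direction.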
Re: [R] vectorization condition counting
Hi,

This may also help:

  someTags <- data.frame(tag_id = c(1, 2, 2, 3, 4, 5, 6, 6),
                         lgth = 50*(1:8),
                         stage = factor(rep(".",8), levels=c(".","J")))
  f2 <- function(x) {
    needsChanging <- with(x, is.na(match(tag_id, tag_id[duplicated(tag_id)])) & lgth < 300)
    x$stage[needsChanging] <- "J"
    x
  }
  f2(someTags)
  #   tag_id lgth stage
  # 1      1   50     J
  # 2      2  100     .
  # 3      2  150     .
  # 4      3  200     J
  # 5      4  250     J
  # 6      5  300     .
  # 7      6  350     .
  # 8      6  400     .

A.K.
[R] vectorization with subset?
Hello,

I have a data frame (68,000 rows) of scores (V4) for a series of [genomic] coordinate ranges (V2 to V3). I also have a data frame (1.2 million rows) of single [genomic] coordinates. For each genomic coordinate (in coord), I would like to determine the average of all scores whose genomic ranges (in scores) encompass the coordinate (in coord). To accomplish this, I tried:

The function works, but is extremely slow. It would take about 4 days for this to finish for a single data set, and I have 64 data sets.

Why does the rate at which coordinate averages are calculated increase when coord is smaller, but not when scores is smaller? How can I accomplish the same thing more efficiently?

Thanks,
Dan

-- View this message in context: http://r.789695.n4.nabble.com/vectorization-with-subset-tp4635156.html
Sent from the R help mailing list archive at Nabble.com.
Re: [R] vectorization with subset?
On Jul 2, 2012, at 12:15 PM, dlv04c wrote:

> How can I accomplish the same thing more efficiently?

You probably need to start by reading the vignettes for the IRanges package. It's difficult to be sure since you did not show the code for what you were doing currently.

-- David Winsemius, MD West Hartford, CT
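Since the poster's code never reached the list, the following is only a sketch of one possible base R approach to the stated problem; it is not the IRanges route suggested above and not the poster's method. It assumes inclusive ranges with start <= end, and the function name `mean_covering_score` and its arguments are hypothetical (start, end and score would correspond to V2, V3 and V4).

```r
# For each position, average the scores of all ranges [start, end] that
# contain it, without looping over positions.
mean_covering_score <- function(pos, start, end, score) {
  o_start <- order(start)
  o_end   <- order(end)

  # Running score totals in start order and in end order
  cum_by_start <- c(0, cumsum(score[o_start]))
  cum_by_end   <- c(0, cumsum(score[o_end]))

  # How many ranges have start <= pos, and how many have end < pos
  n_started <- findInterval(pos, start[o_start])
  n_ended   <- findInterval(pos, end[o_end], left.open = TRUE)

  # Ranges covering pos = started but not yet ended
  count <- n_started - n_ended
  total <- cum_by_start[n_started + 1] - cum_by_end[n_ended + 1]

  ifelse(count > 0, total / count, NA_real_)
}

# Tiny made-up example
start <- c(1, 5, 10)
end   <- c(8, 12, 20)
score <- c(2, 4, 6)
pos   <- c(0, 3, 6, 9, 15, 25)

fast <- mean_covering_score(pos, start, end, score)

# Brute-force reference: one subset per position
slow <- vapply(pos, function(p) {
  hit <- start <= p & p <= end
  if (any(hit)) mean(score[hit]) else NA_real_
}, numeric(1))

all.equal(fast, slow)
```

The cost is dominated by two sorts and two binary searches rather than one full scan of the ranges per coordinate. Because it subtracts running totals, results can differ from the brute-force mean by floating point rounding when scores vary widely in magnitude.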
Re: [R] vectorization with subset?
The code is in the original post, but here it is again:

Thanks,
Dan

-- View this message in context: http://r.789695.n4.nabble.com/vectorization-with-subset-tp4635156p4635208.html
Sent from the R help mailing list archive at Nabble.com.
Re: [R] vectorization with subset?
On Jul 2, 2012, at 5:16 PM, dlv04c wrote:

> The code is in the original post, but here it is again:

No code here or in original posting to rhelp. You are under the delusion that Nabble is R-help. It is not.

> -- View this message in context:
> http://r.789695.n4.nabble.com/vectorization-with-subset-tp4635156p4635208.html
> Sent from the R help mailing list archive at Nabble.com.

This is the rhelp mailing list. Not a website.

-- David Winsemius, MD West Hartford, CT
[R] Vectorization instead of loops problem
Hello,

I am having problems vectorizing the following (instead of using a for/next/while loop):

I have 2 sequences such as:

  x,   y
  1,  30
  2, -40
  0,  50
  0,  25
  1,  -5
  2, -10
  1,   5
  0,  40
  etc etc

The first sequence (x) takes integer numbers only: 0, 1, 2. The sequence y can be anything...

I want to be able to retrieve (in a list if possible) the 3 last values of the y sequence before a value of 1 is encountered on the x sequence, i.e.:

On line 5 in the above dataset, x is 1, so I need to capture values 25, 50 and -40 of the y sequence.

So the outcome (if a list) should look something like:

  [1],[25,50,-40]
  [2],[-10,-5,25]   # as member #7 of x sequence is 1...
  etc. etc.

Can I do the above avoiding for/next or while loops? I am not sure I can explain it better. Any help/pointer extremely welcome.

Best regards,
Costas
Re: [R] Vectorization instead of loops problem
On 04.12.2011 16:18, Costas Vorlow wrote:

> Can I do the above avoiding for/next or while loops?

One way is (assuming your data is in a data.frame called dat):

  wx <- which(dat$x==1)
  result <- lapply(wx[wx > 3], function(x) dat$y[x - (1:3)])

(where lapply is a loop, implicitly).

Uwe Ligges
Re: [R] Vectorization instead of loops problem
Thanks Uwe.

What happens if these are zoo (or time series) sequences/dataframes? I think your solution would apply as well, no?

Thanks again & best wishes,
Costas
Re: [R] Vectorization instead of loops problem
On Sun, Dec 4, 2011 at 10:18 AM, Costas Vorlow costas.vor...@gmail.com wrote:

> Can I do the above avoiding for/next or while loops? I am not sure I can
> explain it better. Any help/pointer extremely welcome.

Try this. embed(z, 4) places values 1,2,3,4 of vector z in the first row (in reverse order, most recent first), values 2,3,4,5 in the second row and so on, so we want the rows of embed(y, 4) for which the current value of x is 1, i.e. the rows of embed(y, 4) for which embed(x, 4)[,1]==1, except that the first column can be suppressed (-1).

  embed(y, 4)[embed(x, 4)[, 1] == 1, -1]
       [,1] [,2] [,3]
  [1,]   25   50  -40
  [2,]  -10   -5   25

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
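A runnable version of the embed() approach above, using the eight sample rows from the question. The toy call on 1:6 is added only to show the row layout that embed() produces; the name `res` and the `drop = FALSE` argument are additions for this sketch.

```r
x <- c(1, 2, 0, 0, 1, 2, 1, 0)
y <- c(30, -40, 50, 25, -5, -10, 5, 40)

# Each row of embed(z, 4) is a sliding window of 4 values, newest first
embed(1:6, 4)
#      [,1] [,2] [,3] [,4]
# [1,]    4    3    2    1
# [2,]    5    4    3    2
# [3,]    6    5    4    3

# Column 1 of embed(x, 4) is x[4:8], the "current" x for each window.
# Keep windows whose current x is 1 and drop the current y (column 1).
res <- embed(y, 4)[embed(x, 4)[, 1] == 1, -1, drop = FALSE]
res
```

drop = FALSE keeps the result a matrix even when only one window matches. A 1 that appears in the first three positions of x has no full window and is silently skipped, which is the same behaviour as the wx > 3 filter in the lapply version.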
Re: [R] Vectorization instead of loops problem
Costas: (and thanks for giving us your name)

which(x == 1) gives you the indices where x is 1 (up to floating point equality -- you did not specify whether your x values are integers or calculated as floating point, and that certainly makes a difference). You can then use simple indexing to get the y values. No loops needed.

However, let's explore why your question may have been too poorly formed to get the answer you seek:

1. What if the index of the first 1 is 3 or less? -- Do you want to ignore the (less than 3) preceding values or just choose as many as you can?

2. What if, as in your example, several 1's occur in x. Do you want the 3 preceding values for all of them or just the first?

3. If the answer to 2 is all of them, what if several 1's are less than 3 indices apart -- do you want to include the overlapping sets of 3 y's -- or what?

My point is that "etc. etc." is simply inadequate as a coherent or useful problem description in your post. You _must_ be explicit, complete, and concise. This can be hard. Indeed, it may require considerable thought and effort. I have found -- and others have often noted here -- that going through such an exercise itself often reveals a solution. But be that as it may, the Posting Guide is actually an excellent, comprehensive discussion of how to ask good questions in forums like this. Read it. Follow it.

... and to be fair, your post below is, imho, probably above average as posts go, allowing me to focus on specific points that I thought required clarification. Quite a few posts here of late have been so muddled and incoherent that I had no clue what the OP wanted. And it's not English as a second language. I am a language ignoramus and speak only English, so I am happy to tolerate poor grammar and vocabulary from someone for whom English is only one of several languages in which they can communicate. The problem is poor thinking, not poor English.

Best,
Bert

--
Bert Gunter
Genentech Nonclinical Biostatistics
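The three questions above correspond to concrete choices in code. As a sketch, using the sample data from the original post: the helper name `preceding` and the "strict"/"lenient" labels are assumptions, and both policies keep overlapping windows for every 1 in x.

```r
x <- c(1, 2, 0, 0, 1, 2, 1, 0)
y <- c(30, -40, 50, 25, -5, -10, 5, 40)

# Hypothetical helper: up to k values of y before position i, newest first
preceding <- function(i, y, k = 3) {
  idx <- i - seq_len(k)
  y[idx[idx >= 1]]          # discard positions before the start of y
}

wx <- which(x == 1)         # exact comparison; fine for integer-valued x

# Question 1, option A: ignore any 1 with fewer than 3 predecessors
strict <- lapply(wx[wx > 3], preceding, y = y)

# Question 1, option B: keep every 1, returning as many values as exist
lenient <- lapply(wx, preceding, y = y)

str(strict)
str(lenient)
```

With this data the strict version returns the two windows from the question, while the lenient one adds an empty first element for the 1 in position 1. If x were computed in floating point, a tolerance test such as abs(x - 1) < 1e-8 would be safer than x == 1.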
Re: [R] Vectorization instead of loops problem
Dear Bert,

You are right (obviously). Apologies for any inconvenience caused. I thought my problem was simplistic with a very obvious answer which eluded me.

As per your justified questions:

2: The answer is "all", hence:

3: would be to include overlapping sets (I guess), but this does not matter for the time being.

I didn't give it too much thought admittedly... If I got 1 & 2 right I could have modified the code for point 3 (if the answer in 2 != "all"), so I did not consider it when I was formulating my query. However, I can see now why this is confusing.

Anyways, thanks again for the pointers.

BTW, is there a good quick read/guide on vectorization in R that one could recommend? That would minimize my queries at least in the list. :-)

Apologies again and best regards,
Costas
Re: [R] Vectorization instead of loops problem
Inline below.

On Sun, Dec 4, 2011 at 10:29 AM, Costas Vorlow costas.vor...@gmail.com wrote:

> Dear Bert,
>
> You are right (obviously). Apologies for any inconvenience caused. I thought my problem was simplistic, with a very obvious answer which eluded me.
>
> As per your justified questions:
>
> 2: The answer is "all", hence:
>
> 3: would be to include the overlapping sets (I guess), but this does not matter for the time being. I didn't give it too much thought, admittedly... If I got 1 & 2 right I could have modified the code for point 3 (if the answer in 2 != 'all'), so I did not consider it when I was formulating my query. However, I can see now why this is confusing.
>
> Anyways, thanks again for the pointers.
>
> BTW, is there a good quick read/guide on vectorization in R that one could recommend? That would minimize my queries, at least on the list. :-)

Vectorization is a central paradigm in R, so practically all books on the S language discuss it. The R Language Definition manual that ships with R is pretty comprehensive, but V&R's MASS or S Programming books, Patrick Burns's website tutorials (he has several well suited for beginners), John Chambers's Programming with R, etc. are just a few among many. It is impossible for me to be more specific than that.

-- Bert

> Apologies again and best regards,
> Costas

--
Bert Gunter
Genentech Nonclinical Biostatistics
Internal Contact Info: Phone: 467-7374
Website:
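A minimal sketch of the which()-plus-indexing idea from this thread, using the eight data rows of the original question. It assumes the answers given in the follow-up (take all 1's, overlapping sets allowed) and, as an extra assumption not settled in the thread, it skips any 1 that has fewer than three predecessors. The names n_back, idx, pos and res are hypothetical.

```r
# Data rows from the original question
x <- c(1, 2, 0, 0, 1, 2, 1, 0)
y <- c(30, -40, 50, 25, -5, -10, 5, 40)

n_back <- 3                    # how many preceding y values to collect
idx <- which(x == 1)           # positions where x is 1 (x holds whole numbers here)
idx <- idx[idx > n_back]       # assumption: drop 1's with fewer than n_back predecessors

# One index matrix replaces the loop: row k holds idx[k]-1, idx[k]-2, idx[k]-3,
# so the most recent y value comes first in each row.
pos <- outer(idx, seq_len(n_back), "-")
res <- matrix(y[pos], nrow = length(idx),
              dimnames = list(as.character(idx), paste0("lag", seq_len(n_back))))
res
```

Each row of res is one requested set; split(res, row(res)) would turn it into a list if that shape is preferred.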
Re: [R] Vectorization
On Sun, Jan 23, 2011 at 07:29:16PM -0800, eric wrote:

> Is there a way to vectorize this loop or a smarter way to do it?
>
> y
>  [1]  0.003990746 -0.037664639  0.005397999  0.010415496  0.003500676
>  [6]  0.001691775  0.008170774  0.011961998 -0.016879531  0.007284486
> [11] -0.015083581 -0.006645958 -0.013153103  0.028148639 -0.005724317
> [16] -0.027408025  0.014767422 -0.001619691  0.018334730 -0.009747171
>
> x <- numeric(length(y))
> for (i in 1:length(y)) {
>   x[i] <- ifelse(i==1, 10000*(1+y[i]), (1+y[i])*x[i-1])
> }
>
> x
>  [1] 10039.907  9661.758  9713.912  9815.087  9849.447  9866.110  9946.724
>  [8] 10065.706  9895.802  9967.888  9817.536  9752.289  9624.016  9894.919
> [15]  9838.278  9568.630  9709.934  9694.207  9871.948  9775.724
>
> Basically trying to see how the equity of an investment changes after each return period. Start with $10,000 and a series of returns over time. Figure out the equity after each time period (return).

Hello.

The loop computes a cumulative product. The initial value can be applied as a common multiplier. So, z in the following should be equal to x up to machine rounding error.

    y <- c(
     0.003990746, -0.037664639,  0.005397999,  0.010415496,  0.003500676,
     0.001691775,  0.008170774,  0.011961998, -0.016879531,  0.007284486,
    -0.015083581, -0.006645958, -0.013153103,  0.028148639, -0.005724317,
    -0.027408025,  0.014767422, -0.001619691,  0.018334730, -0.009747171)

    x <- numeric(length(y))
    for (i in 1:length(y)) {
      x[i] <- ifelse(i==1, 10000*(1+y[i]), (1+y[i])*x[i-1])
    }

    z <- 10000*cumprod(1 + y)
    max(abs(x - z))
    # [1] 1.818989e-12

Petr Savicky.
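As a small follow-up sketch to the cumprod() answer: the same computation wrapped in a function, next to a Reduce() version that may be useful for recurrences with no ready-made cumulative function. The names equity_curve and start are hypothetical, and only the first three returns from the post are used.

```r
# Equity after each period: start * (1 + r1) * (1 + r2) * ...
equity_curve <- function(returns, start = 10000) {
  start * cumprod(1 + returns)
}

y <- c(0.003990746, -0.037664639, 0.005397999)   # first three returns from the post
z <- equity_curve(y)

# Generic recurrence x[i] = f(x[i-1], y[i]) without an explicit for loop.
# With accumulate = TRUE the initial value is returned first, hence the [-1].
z2 <- Reduce(function(prev, r) prev * (1 + r), y, 10000, accumulate = TRUE)[-1]

all.equal(z, z2)
```

Reduce() still calls an R function once per element, so it is mainly a tidier loop; cumprod() is the version that actually runs in compiled code.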
[R] Vectorization
Is there a way to vectorize this loop or a smarter way to do it?

    y
     [1]  0.003990746 -0.037664639  0.005397999  0.010415496  0.003500676
     [6]  0.001691775  0.008170774  0.011961998 -0.016879531  0.007284486
    [11] -0.015083581 -0.006645958 -0.013153103  0.028148639 -0.005724317
    [16] -0.027408025  0.014767422 -0.001619691  0.018334730 -0.009747171

    x <- numeric(length(y))
    for (i in 1:length(y)) {
      x[i] <- ifelse(i==1, 10000*(1+y[i]), (1+y[i])*x[i-1])
    }

    x
     [1] 10039.907  9661.758  9713.912  9815.087  9849.447  9866.110  9946.724
     [8] 10065.706  9895.802  9967.888  9817.536  9752.289  9624.016  9894.919
    [15]  9838.278  9568.630  9709.934  9694.207  9871.948  9775.724

Basically trying to see how the equity of an investment changes after each return period. Start with $10,000 and a series of returns over time. Figure out the equity after each time period (return).

--
View this message in context: http://r.789695.n4.nabble.com/Vectorization-tp3233340p3233340.html
Re: [R] R - Vectorization and Functional Programming Constructs
On Fri, Jan 21, 2011 at 10:10 PM, Mingo catojo...@gmail.com wrote:

> Hello,
>
> I am new to R (coming from Perl) and have what is, at least at this point, a philosophical question and a request for comment on some basic code. As I understand it, R emphasizes, or at least supports, the functional programming model. I've come across some code that was markedly lacking in for loops, and have been seeing some constructs that relate to functional programming and vectorized code (not that this is at all unique to R, of course). But I'm also new to the concept of vectorizing code. However, since I anticipate dealing with vectors of large sizes, I think that this approach is probably going to serve well in terms of performance.
>
> As an example, I anticipate having vector operations calling for shifting. I'll be shifting vectors to the right (or left) like below while maintaining the length and filling with zeros. Keep in mind I'll ultimately be dealing with vectors of very large length.
>
> x <- c(0,3,2,1,0,0,0)
> vlen <- length(x)
> [1] 7
>
> One solution to accomplish the right shift is to do something like:
>
> x=c(0,x[1:vlen-1])
> x
> [1] 0 0 3 2 1 0 0
>
> This does the trick, though I'm wondering if this is in the spirit of vectorization. I could make a recursive function that would cycle through the whole vector, eventually leaving it full of 0s and thus ending the recursion. Though does this capture the spirit of R programming and vectorizing? Are there more primitive operators closer to the underlying C code that would serve performance interests better?

If x is supposed to represent a time series that you are trying to align, you would likely be better off representing it as an object of one of the time series classes (ts, zoo, xts, timeSeries) and then using lag. That way you will not only have a convenient lag function but all the other functionality that you might need to conveniently handle such objects. lag is written in C in both zoo and xts and might be in timeSeries as well. If your series is regularly spaced, and so applicable to ts, then internally lagging only involves manipulating its tsp attribute, so it would be extremely fast.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
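A minimal sketch of the ts-based suggestion above, assuming the vector from the question is a regularly spaced series with unit time steps. It illustrates that stats::lag() leaves the data alone and only moves the tsp attribute (start, end, frequency); the variable names are hypothetical.

```r
x  <- ts(c(0, 3, 2, 1, 0, 0, 0))   # regularly spaced series observed at times 1..7
xl <- stats::lag(x, k = -1)        # negative k moves the series one step later in time

tsp(x)                             # time base of the original
tsp(xl)                            # same length, start and end moved by one period
identical(as.numeric(x), as.numeric(xl))   # the stored values are untouched

# Binding on a common time axis shows the shift; the gaps are NA, not 0
cbind(x, xl)
```

Note that this aligns series in time rather than padding with zeros, so it fits the "time series alignment" reading of the question, not the literal zero-fill requirement.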
[R] R - Vectorization and Functional Programming Constructs
Hello,

I am new to R (coming from Perl) and have what is, at least at this point, a philosophical question and a request for comment on some basic code. As I understand it, R emphasizes, or at least supports, the functional programming model. I've come across some code that was markedly lacking in for loops, and have been seeing some constructs that relate to functional programming and vectorized code (not that this is at all unique to R, of course). But I'm also new to the concept of vectorizing code. However, since I anticipate dealing with vectors of large sizes, I think that this approach is probably going to serve well in terms of performance.

As an example, I anticipate having vector operations calling for shifting. I'll be shifting vectors to the right (or left) like below while maintaining the length and filling with zeros. Keep in mind I'll ultimately be dealing with vectors of very large length.

    x <- c(0,3,2,1,0,0,0)
    vlen <- length(x)
    [1] 7

One solution to accomplish the right shift is to do something like:

    x=c(0,x[1:vlen-1])
    x
    [1] 0 0 3 2 1 0 0

This does the trick, though I'm wondering if this is in the spirit of vectorization. I could make a recursive function that would cycle through the whole vector, eventually leaving it full of 0s and thus ending the recursion. Though does this capture the spirit of R programming and vectorizing? Are there more primitive operators closer to the underlying C code that would serve performance interests better?
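For the literal zero-fill shift asked about here, a small sketch with plain vector indexing (no recursion, no explicit loop). The helper names shift_right and shift_left and their arguments k and fill are hypothetical, not base R functions.

```r
# Shift right by k places: pad on the left, drop the last k elements
shift_right <- function(v, k = 1L, fill = 0) {
  k <- min(k, length(v))
  c(rep(fill, k), head(v, length(v) - k))
}

# Shift left by k places: drop the first k elements, pad on the right
shift_left <- function(v, k = 1L, fill = 0) {
  k <- min(k, length(v))
  c(tail(v, length(v) - k), rep(fill, k))
}

x <- c(0, 3, 2, 1, 0, 0, 0)
vlen <- length(x)
r1 <- shift_right(x)
l1 <- shift_left(x)

# Operator precedence: ':' binds tighter than '-', so 1:vlen-1 is (1:vlen) - 1,
# i.e. 0, 1, ..., vlen-1. Index 0 is silently dropped, which is why the
# expression in the post still returns the first vlen-1 elements.
same <- identical(x[1:vlen-1], x[1:(vlen-1)])
```

Writing x[1:(vlen-1)] or head(x, -1) makes the intent explicit and avoids relying on the dropped zero index.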
[R] Vectorization of three embedded loops
Dear R-programmer,

I wrote an adapted implementation of the Kennard-Stone algorithm for sample selection of multivariate data (R 2.7.1 on a MacBook Pro, Processor 2.2 GHz Intel Core 2 Duo, Memory 2 GB 667 MHz DDR2 SDRAM). The heart of the script uses three nested loops, which makes it very slow, especially for large datasets. For a data matrix of 1853*1853 and the selection of 556 samples, the computation took more than 24 hours. I did some research on vectorization, but I could not figure out how to do it better/faster. Which ways are there to replace the time-consuming loops?

Here is some information:

    # val.n <- 24;
    # start.b <- matrix(nrow=1812, ncol=20);
    # val is a vector of the row names of 22 extreme samples chosen in an earlier step;
    # euc <- matrix(nrow=1853, ncol=1853); [contains the Euclidean distance calculations]

The following system.time result was for the selection of two samples:

    system.time(KEN.STO(val.n,start.b,val.start,euc))
       user  system elapsed
     25.294  13.262  38.927

The function:

    KEN.STO <- function(val.n,start.b,val,euc){
      for(k in 1:val.n){
        sum.dist <- c();
        for(i in 1:length(start.b[,1])){
          sum <- c();
          for(j in 1:length(val)){
            sum[j] <- euc[rownames(start.b)[i],val[j]]
          }
          sum.dist[i] <- min(sum);
        }
        bla <- rownames(start.b)[which(sum.dist==max(sum.dist))]
        val <- c(val,bla[1]);
        start.b <- start.b[-(which(match(rownames(start.b),val[length(val)])!=NA)),];
        if(length(val)>=val.n)break;
      }
      return(val);
    }

Regards,
Thomas

Dr. Thomas Terhoeven-Urselmans
Post-Doc Fellow
Soil infrared spectroscopy
World Agroforestry Center (ICRAF)
Re: [R] Vectorization of three embedded loops
Hello,

I believe that your bottleneck lies at this piece of code:

    sum <- c();
    for(j in 1:length(val)){
      sum[j] <- euc[rownames(start.b)[i],val[j]]
    }

In order to speed up your code, there are two alternatives:

1) Try to reorder the euc matrix so that the sum vector corresponds to (part of) a row or column of euc.

2) For each i value, create a matrix with the coordinates corresponding to ( rownames(start.b)[i], val[j] ) and index the matrix by this matrix in order to create sum.

This will be easiest if you can reorder euc in a way that makes accessing its elements easy (and then you would be back at (1)).

Creating a variable sum as c() and increasing its size in a loop is one of the easiest ways to uselessly burn your CPU.

Best regards,

Carlos J. Gil Bellosta
http://www.datanalytics.com
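The two alternatives in this reply can be sketched on a toy distance matrix. The data, the sample names "s1" to "s6", and the variables cand, d_loop, d_row and d_pairs are all made up for illustration; only the indexing patterns are the point.

```r
set.seed(1)
pts <- matrix(rnorm(12), nrow = 6, dimnames = list(paste0("s", 1:6), NULL))
euc <- as.matrix(dist(pts))     # 6 x 6 distance matrix carrying row/column names

val  <- c("s1", "s4")           # names of already selected samples (made up)
cand <- "s2"                    # one candidate, standing in for rownames(start.b)[i]

# Inner loop as in the question: grows the vector element by element
d_loop <- c()
for (j in seq_along(val)) d_loop[j] <- euc[cand, val[j]]

# (1) The wanted values are simply part of one row of euc
d_row <- unname(euc[cand, val])

# (2) Index euc with a two-column character matrix of (row name, column name) pairs
d_pairs <- euc[cbind(cand, val)]
```

Both forms fetch all values in one call, so min(d_row) replaces the whole inner loop.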
Re: [R] Vectorization of three embedded loops
You are definitely in Circle 2 of the R Inferno. Growing objects is suboptimal, although your objects are small so this probably isn't taking too much time.

There is no need for the inner-most loop:

    sum.dist[i] <- min(euc[rownames(start.b)[i], val])

Maybe I'm blind, but I don't see where 'k' comes in from the outer-most loop.

Patrick Burns
patr...@burns-stat.com
+44 (0)20 8525 0696
http://www.burns-stat.com
(home of The R Inferno and A Guide for the Unwilling S User)
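Going one step beyond the one-line fix above, here is a hedged sketch of a max-min selection that also drops the loop over candidates. It is not the poster's function: it assumes the candidate pool is every row of euc not yet selected, and the names select_maxmin, n_total, cand and min_d are hypothetical. The idea is to keep, for each candidate, its distance to the nearest selected sample and to update that vector with pmin() after each pick.

```r
set.seed(42)
pts <- matrix(rnorm(40), nrow = 20, dimnames = list(paste0("s", 1:20), NULL))
euc <- as.matrix(dist(pts))            # toy 20 x 20 Euclidean distance matrix

select_maxmin <- function(euc, val, n_total) {
  cand <- setdiff(rownames(euc), val)
  # distance from every candidate to its nearest already selected sample
  min_d <- apply(euc[cand, val, drop = FALSE], 1, min)
  while (length(val) < n_total && length(cand) > 0) {
    pick <- cand[which.max(min_d)]     # first maximum, like bla[1] in the post
    val  <- c(val, pick)
    keep <- cand != pick
    cand <- cand[keep]
    # only the newly added sample can reduce a candidate's nearest distance
    min_d <- pmin(min_d[keep], euc[cand, pick])
  }
  val
}

sel <- select_maxmin(euc, c("s1", "s2"), 8)
```

Each round now costs one vectorized pmin() over the remaining candidates instead of a fresh pass over all candidate/selected pairs.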
Re: [R] Vectorization of three embedded loops
Dear Patrick,

thanks for the very helpful response. I can calculate now 25 times faster.

I use the 'k' from the outer-most loop only indirectly. It gives a maximal number of repetitions of the whole script until the following command applies: 'if(length(val.x.c)>=val.x.c.n)break'. The reason why I use this 'break' instead of the 'for(k in 1:val.x.c.n){' command is that in some other applications of this algorithm more than one sample can be chosen in one round.

Is there another/faster way to avoid this usage of 'k'?

Regards,
Thomas

Dr. Thomas Terhoeven-Urselmans
Post-Doc Fellow
Soil infrared spectroscopy
World Agroforestry Center (ICRAF)
United Nations Avenue, Gigiri
PO Box 30677-00100 Nairobi, Kenya
Ph: 254 20 722 4113 or via USA 1 650 833 6654 ext. 4113
Fax 254 20 722 4001 or via USA 1 650 833 6646
Email: t.urselm...@cgiar.org
Internet: http://worldagroforestrycentre.org
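On the question about 'k': one possible way to state the stopping rule directly is a while loop on the size of the selection, which also copes with rounds that add more than one sample. The sketch below uses a toy stand-in, pick_round, for one selection round; that helper and the pool of letters are made up.

```r
# Toy stand-in for one selection round: adds the next two unused names
pick_round <- function(val, pool) c(val, head(setdiff(pool, val), 2))

pool  <- letters
val   <- c("a", "b")
val.n <- 7

# Loop until enough samples are selected; no counter and no break needed
while (length(val) < val.n) {
  val <- pick_round(val, pool)
}
val
```

This is a readability change rather than a speed-up, since the unused counter costs essentially nothing. If a round can overshoot the target, head(val, val.n) trims the result.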
Re: [R] Vectorization of three embedded loops
Dear Carlos,

thanks for your support. Patrick Burns gave me a hint, which is in the end very similar to your proposal. Now the script is roughly 25 times faster.

Here is the code (I also implemented a vector that does not grow in size, 'summ.dist <- rep(0,val.x.c.n)'):

    KEN.STO <- function(val.n,start.b,val,euc){
      for(k in 1:val.n){
        summ.dist <- rep(0,val.n);
        for(i in 1:length(start.b[,1])){
          summ.dist[i] <- min(euc[rownames(start.b)[i],val]);
        }
        bla <- rownames(start.b)[which(summ.dist==max(summ.dist))]
        val <- c(val,bla[1]);
        start.b <- start.b[-(which(match(rownames(start.b),val[length(val)])!=NA)),];
        if(length(val)>=val.n)break;
      }
      return(val);
    }

Regards,
Thomas

Dr. Thomas Terhoeven-Urselmans
Post-Doc Fellow
Soil infrared spectroscopy
World Agroforestry Center (ICRAF)
United Nations Avenue, Gigiri
PO Box 30677-00100 Nairobi, Kenya
Ph: 254 20 722 4113 or via USA 1 650 833 6654 ext. 4113
Fax 254 20 722 4001 or via USA 1 650 833 6646
Email: t.urselm...@cgiar.org
Internet: http://worldagroforestrycentre.org
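A side note on the row-removal step in the code of this thread, as a sketch on a made-up 4-row matrix. Comparing against NA with != never yields TRUE, and negative indexing with an empty which() result selects no rows at all, so a logical mask built with %in% may be a safer way to drop the sample that was just selected. The object names are hypothetical.

```r
start.b <- matrix(1:12, nrow = 4, dimnames = list(c("s1", "s2", "s3", "s4"), NULL))
val     <- c("s9", "s3")                 # "s3" was selected last

na_cmp <- NA != NA                       # NA: such a test can never be TRUE

# Pitfall: when which() finds nothing, -integer(0) selects zero rows
empty <- start.b[-which(rownames(start.b) == "zz"), , drop = FALSE]
nrow(empty)

# Logical mask: drop the newest selection, keep the matrix shape
newest   <- val[length(val)]
start.b2 <- start.b[!(rownames(start.b) %in% newest), , drop = FALSE]
rownames(start.b2)
```

drop = FALSE keeps start.b2 a matrix even when a single row remains, so rownames() keeps working in later rounds.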
[R] vectorization instead of using loop
Dear all,

I sent this question 2 days ago and got a response from Sarah. Thanks for that. But unfortunately, it did not really solve our problem. The main issue is that we want to use our own (manipulated) covariance matrix in the calculation of the Mahalanobis distance. Does anyone know how to vectorize the code below instead of using a loop (which slows it down)? I'd really appreciate any help on this, thank you all in advance!

Cheers,
Frank

This is what I posted 2 days ago:

We have a data frame x with n people as rows and k variables as columns. Now, for each person (i.e., each row) we want to calculate a distance between him/her and EACH other person in x. In other words, we want to create an n x n matrix with distances (with zeros in the diagonal). However, we do not want to calculate Euclidean distances. We want to calculate Mahalanobis distances, which take into account the covariance among variables.

Below is the piece of code we wrote (covmat in the function below is the variance-covariance matrix among the variables in the data, which has to be fed into the mahalanobis function we are using).

    mahadist = function(x, covmat) {
      dismat = matrix(0,ncol=nrow(x),nrow=nrow(x))
      for (i in 1:nrow(x)) {
        dismat[i,] = mahalanobis(as.matrix(x), as.matrix(x[i,]), covmat)^.5
      }
      return(dismat)
    }

This piece of code works, but it is very slow. We were wondering if it's at all possible to somehow vectorize this function. Any help would be greatly appreciated.

Thanks,
Frank
Re: [R] vectorization instead of using loop
One thing that would speed it up is if you inverted 'covmat' once and then used 'inverted=TRUE' in the call to 'mahalanobis'.

Patrick Burns
[EMAIL PROTECTED]
+44 (0)20 8525 0696
http://www.burns-stat.com
(home of S Poetry and A Guide for the Unwilling S User)
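A minimal sketch of the suggestion above, on made-up data: invert the covariance matrix once with solve() and pass inverted = TRUE to mahalanobis(), so the inversion is not repeated in every iteration. The function name mahadist_inv and the variable icov are hypothetical.

```r
set.seed(1)
x      <- matrix(runif(300), ncol = 3)   # toy data: 100 rows, 3 variables
covmat <- cov(x)                         # stands in for the user's own matrix
icov   <- solve(covmat)                  # invert once, outside the loop

mahadist_inv <- function(x, icov) {
  n <- nrow(x)
  dismat <- matrix(0, n, n)
  for (i in seq_len(n)) {
    # inverted = TRUE tells mahalanobis() that icov is already the inverse
    dismat[i, ] <- sqrt(mahalanobis(x, x[i, ], icov, inverted = TRUE))
  }
  dismat
}

d <- mahadist_inv(x, icov)
```

The loop over rows is still there; this only removes the repeated matrix inversion from inside it.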
Re: [R] vectorization instead of using loop
> I sent this question 2 days ago and got a response from Sarah. Thanks for that. But unfortunately, it did not really solve our problem. The main issue is that we want to use our own (manipulated) covariance matrix in the calculation of the Mahalanobis distance. Does anyone know how to vectorize the code below instead of using a loop (which slows it down)? I'd really appreciate any help on this. Thank you all in advance!
>
> Cheers,
> Frank
>
> This is what I posted 2 days ago:
>
> We have a data frame x with n people as rows and k variables as columns. Now, for each person (i.e., each row) we want to calculate a distance between him/her and EACH other person in x. In other words, we want to create an n x n matrix of distances (with zeros on the diagonal). However, we do not want to calculate Euclidean distances. We want to calculate Mahalanobis distances, which take into account the covariance among variables. Below is the piece of code we wrote (covmat in the function below is the variance-covariance matrix among the variables in the data, which has to be fed into the mahalanobis function we are using).
>
>     mahadist = function(x, covmat) {
>         dismat = matrix(0, ncol=nrow(x), nrow=nrow(x))
>         for (i in 1:nrow(x)) {
>             dismat[i,] = mahalanobis(as.matrix(x), as.matrix(x[i,]), covmat)^.5
>         }
>         return(dismat)
>     }
>
> This piece of code works, but it is very slow. We were wondering if it's at all possible to somehow vectorize this function. Any help would be greatly appreciated.

You can save a substantial amount of time by calling as.matrix before the loop, e.g.

    x <- data.frame(runif(1000), runif(1000), runif(1000))
    covmat <- cov(x)

    mahadist = function(x, covmat) #yours
    {
        dismat = matrix(0, ncol=nrow(x), nrow=nrow(x))
        for (i in 1:nrow(x)) {
            dismat[i,] = mahalanobis(as.matrix(x), as.matrix(x[i,]), covmat)^.5
        }
        return(dismat)
    }

    mahadist2 <- function(x, covmat) #my modification
    {
        n <- nrow(x)
        dismat <- matrix(0, ncol=n, nrow=n)
        matx <- as.matrix(x)
        for (i in 1:n) {
            dismat[i,] <- mahalanobis(matx, matx[i,], covmat)^.5
        }
        dismat
    }

    system.time(mahadist(x, covmat))
    #   user  system elapsed
    #   2.82    0.06    2.95
    system.time(mahadist2(x, covmat))
    #   user  system elapsed
    #   1.39    0.04    1.45

Regards,
Richie.

Mathematical Sciences Unit
HSL
Re: [R] vectorization instead of using loop
Frank said:
> This piece of code works, but it is very slow. We were wondering if it's at all possible to somehow vectorize this function. Any help would be greatly appreciated.

Richie said:
> You can save a substantial time by calling as.matrix before the loop

Patrick said:
> One thing that would speed it up is if you inverted 'covmat' once and then used 'inverted=TRUE' in the call to 'mahalanobis'.

The timings before:

    system.time(mahadist(x, covmat))
    #   user  system elapsed
    #   2.82    0.06    2.95
    system.time(mahadist2(x, covmat))
    #   user  system elapsed
    #   1.39    0.04    1.45

With Patrick's modification, and moving the square root out of the loop:

    mahadist3 <- function(x, covmat) #patrick's modification
    {
        n <- nrow(x)
        dismat <- matrix(0, ncol=n, nrow=n)
        matx <- as.matrix(x)
        icovmat <- chol2inv(chol(covmat))
        for (i in 1:n) {
            dismat[i,] <- mahalanobis(matx, matx[i,], icovmat, inverted=TRUE)
        }
        dismat^.5
    }

    system.time(mahadist3(x, covmat))
    #   user  system elapsed
    #   0.80    0.00    0.85

Not bad - a better than threefold speed up, without worrying about vectorization.

Regards,
Richie.

Mathematical Sciences Unit
HSL
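For completeness, here is one way the loop could be removed altogether. This is a sketch that nobody in the thread posted, and the function names mahadist_vec and mahadist_loop are made up here. It assumes covmat is symmetric positive definite: the data are "whitened" with the Cholesky factor of covmat, after which ordinary Euclidean distances from dist() equal the Mahalanobis distances.

```r
# Pairwise Mahalanobis distances without an explicit loop.
# Assumption: covmat is symmetric positive definite, so chol() succeeds.
mahadist_vec <- function(x, covmat) {
  matx <- as.matrix(x)
  R <- chol(covmat)            # upper triangular, covmat = t(R) %*% R
  z <- matx %*% solve(R)       # rows of z are the whitened observations
  d <- as.matrix(dist(z))      # Euclidean distance in the whitened space
  dimnames(d) <- NULL
  d
}

# Loop version in the spirit of mahadist3 from the thread, kept for comparison
mahadist_loop <- function(x, covmat) {
  matx <- as.matrix(x)
  n <- nrow(matx)
  dismat <- matrix(0, n, n)
  icov <- solve(covmat)
  for (i in 1:n) {
    dismat[i, ] <- mahalanobis(matx, matx[i, ], icov, inverted = TRUE)
  }
  sqrt(dismat)
}

set.seed(42)
x <- data.frame(a = runif(200), b = runif(200), c = runif(200))
covmat <- cov(x)   # any user-supplied covariance matrix could be used instead
agree <- isTRUE(all.equal(mahadist_vec(x, covmat), mahadist_loop(x, covmat)))
```

Because covmat is an ordinary argument, a manipulated covariance matrix can be passed in, which is what the original question asked for.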
[R] vectorization of a loop for mahalanobis distance calculation
Dear all,

We have a data frame x with n people as rows and k variables as columns. Now, for each person (i.e., each row) we want to calculate a distance between him/her and EACH other person in x. In other words, we want to create an n x n matrix of distances (with zeros on the diagonal). However, we do not want to calculate Euclidean distances. We want to calculate Mahalanobis distances, which take into account the covariance among variables. Below is the piece of code we wrote (covmat in the function below is the variance-covariance matrix among the variables in the data, which has to be fed into the mahalanobis function we are using).

    mahadist = function(x, covmat) {
        dismat = matrix(0, ncol=nrow(x), nrow=nrow(x))
        for (i in 1:nrow(x)) {
            dismat[i,] = mahalanobis(as.matrix(x), as.matrix(x[i,]), covmat)^.5
        }
        return(dismat)
    }

This piece of code works, but it is very slow. We were wondering if it's at all possible to somehow vectorize this function. Any help would be greatly appreciated.

Thanks,
Frank
Re: [R] vectorization of a loop for mahalanobis distance calculation
distance() from the ecodist package will calculate Mahalanobis distances. Sarah -- Sarah Goslee http://www.functionaldiversity.org
Re: [R] vectorization of a loop for mahalanobis distance calculation
Hi Frank,

If the way distance() calculates the Mahalanobis distance meets your needs other than the covariance specification, you can tweak that _very_ easily. If you use fix(distance) at the command line, you can edit the source. Change the first line to:

    function (x, method = "euclidean", icov)

and under method 4, change the icov calculation to:

    if(missing(icov)) {
        icov <- solve(cov(x))
    }

Alternatively, here's a simplified distanceM function with everything but the relevant bits deleted. You'll still need to have ecodist loaded.

    distanceM <- function (x, method = "mahalanobis", icov)
    {
        paireddiff <- function(x) {
            N <- nrow(x)
            P <- ncol(x)
            A <- numeric(N * N * P)
            A <- .C("pdiff", as.double(as.vector(t(x))), as.integer(N),
                    as.integer(P), A = as.double(A), PACKAGE = "ecodist")$A
            A <- array(A, dim = c(N, N, P))
            A
        }
        x <- as.matrix(x)
        N <- nrow(x)
        P <- ncol(x)
        if(missing(icov)) {
            icov <- solve(cov(x))
        }
        A <- paireddiff(x)
        A1 <- apply(A, 1, function(z) (z %*% icov %*% t(z)))
        D <- A1[seq(1, N * N, by = (N + 1)), ]
        D <- D[col(D) < row(D)]
        attr(D, "Size") <- N
        attr(D, "Labels") <- rownames(x)
        attr(D, "Diag") <- FALSE
        attr(D, "Upper") <- FALSE
        attr(D, "method") <- METHODS[method]
        class(D) <- "dist"
        D
    }

Sarah

On Tue, Oct 7, 2008 at 1:05 PM, Frank Hedler [EMAIL PROTECTED] wrote:
> Dear all,
> we just realized something. Sarah's distance function - indeed - calculates Mahalanobis distance very well. However, it uses the observed variance-covariance matrix by default. What we actually need (sorry for not stating it clearly in our original request) is to be able to specify which variance-covariance matrix goes into that calculation.
>
> On Tue, Oct 7, 2008 at 12:44 PM, Sarah Goslee [EMAIL PROTECTED] wrote:
>> distance() from the ecodist package will calculate Mahalanobis distances.
>>
>> Sarah

--
Sarah Goslee
http://www.functionaldiversity.org
Re: [R] vectorization of a loop for mahalanobis distance calculation
Dear all,

we just realized something. Sarah's distance function - indeed - calculates Mahalanobis distance very well. However, it uses the observed variance-covariance matrix by default. What we actually need (sorry for not stating it clearly in our original request) is to be able to specify which variance-covariance matrix goes into that calculation.

On Tue, Oct 7, 2008 at 12:44 PM, Sarah Goslee [EMAIL PROTECTED] wrote:
> distance() from the ecodist package will calculate Mahalanobis distances.
>
> Sarah
>
> --
> Sarah Goslee
> http://www.functionaldiversity.org

ORIGINAL request:

Dear all,

We have a data frame x with n people as rows and k variables as columns. Now, for each person (i.e., each row) we want to calculate a distance between him/her and EACH other person in x. In other words, we want to create an n x n matrix of distances (with zeros on the diagonal). However, we do not want to calculate Euclidean distances. We want to calculate Mahalanobis distances, which take into account the covariance among variables. Below is the piece of code we wrote (covmat in the function below is the variance-covariance matrix among the variables in the data, which has to be fed into the mahalanobis function we are using).

    mahadist = function(x, covmat) {
        dismat = matrix(0, ncol=nrow(x), nrow=nrow(x))
        for (i in 1:nrow(x)) {
            dismat[i,] = mahalanobis(as.matrix(x), as.matrix(x[i,]), covmat)^.5
        }
        return(dismat)
    }

This piece of code works, but it is very slow. We were wondering if it's at all possible to somehow vectorize this function. Any help would be greatly appreciated.

Thanks,
Frank
Re: [R] Vectorization of duration of the game in the gambler ruin's problem
Hi Jose,

If you are only interested in the expected duration, the problem can be solved analytically - no simulation is needed.

Let P be the probability of reaching total.capital (and then 1-P is the probability of losing all the money) when starting with initial.capital. This probability P is well known (I do not remember it now but I can derive the formula if you need - let me know).

Let X(i) be the gain at game i and let D be the duration. Let S(n) = X(1)+...+X(n). Since EX(i) = p - (1-p) = 2p-1, S(n) - n*(2p-1) is a martingale, and since D is a stopping time we get that

    E(S(D) - (2p-1)*D) = 0,

so that

    (2p-1)*E(D) = E(S(D)) = P*(total.capital-initial.capital) + (1-P)*(-initial.capital),

and so E(D) can be computed provided that p != 1/2.

If p = 1/2 then S(n) is a martingale and then by Wald's Lemma,

    E(S(D)^2) = E(D)*E(X^2) = E(D).

Since

    E(S(D)^2) = P*(total.capital-initial.capital)^2 + (1-P)*(-initial.capital)^2,

we can compute E(D).

Regards,
Moshe.

--- On Fri, 15/8/08, jose romero [EMAIL PROTECTED] wrote:

> From: jose romero [EMAIL PROTECTED]
> Subject: [R] Vectorization of duration of the game in the gambler ruin's problem
> To: r-help@r-project.org
> Received: Friday, 15 August, 2008, 2:26 PM
>
> Hey fellas:
>
> In the context of the gambler's ruin problem, the following R code obtains the mean duration of the game, in turns:
>
>     # total.capital is a constant, an arbitrary positive integer
>     # initial.capital is a constant, an arbitrary positive integer between, and not including
>     # 0 and total.capital
>     # p is the probability of winning 1$ on each turn
>     # 1-p is the probability of losing 1$
>     # N is a large integer representing the number of times to simulate
>     # dur is a vector containing the simulated game durations
>     T <- total.capital
>     dur <- NULL
>     for (n in 1:N) {
>         x <- initial.capital
>         d <- 0
>         while ((x!=0)&&(x!=T)) {
>             x <- x+sample(c(-1,1),1,replace=TRUE,c(1-p,p))
>             d <- d+1
>         }
>         dur <- c(dur,d)
>     }
>     mean(dur) #returns the mean duration of the game
>
> The problem with this code is that, using the traditional control structures (while, for, etc.), it is rather slow. Does anyone know of a way I could vectorize the while and the for to produce faster code?
>
> And while I'm at it, does anyone know of a discrete-event simulation package in R such as SimPy for Python?
>
> Thanks in advance
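The derivation above can be turned into a few lines of R. This is only a sketch: the function name expected_duration is made up here, and the expression used for P is the classical gambler's ruin probability, which the message refers to but does not write out.

```r
# Expected duration of the game, following the martingale argument above.
# Assumption: P is the classical probability of reaching total.capital,
#   P = (1 - r^i) / (1 - r^N) with r = (1 - p) / p   when p != 1/2,
#   P = i / N                                         when p == 1/2.
expected_duration <- function(initial.capital, total.capital, p) {
  i <- initial.capital
  N <- total.capital
  if (isTRUE(all.equal(p, 0.5))) {
    P <- i / N
    # Wald: E(D) = E(S(D)^2)
    P * (N - i)^2 + (1 - P) * i^2
  } else {
    r <- (1 - p) / p
    P <- (1 - r^i) / (1 - r^N)
    # (2p - 1) * E(D) = E(S(D))
    (P * (N - i) - (1 - P) * i) / (2 * p - 1)
  }
}

fair   <- expected_duration(3, 10, 0.5)   # 21, i.e. 3 * (10 - 3)
unfair <- expected_duration(3, 10, 0.6)
```

For the fair game the expression simplifies to initial.capital * (total.capital - initial.capital).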
[R] Vectorization of duration of the game in the gambler ruin's problem
Hey fellas:

In the context of the gambler's ruin problem, the following R code obtains the mean duration of the game, in turns:

    # total.capital is a constant, an arbitrary positive integer
    # initial.capital is a constant, an arbitrary positive integer between, and not including
    # 0 and total.capital
    # p is the probability of winning 1$ on each turn
    # 1-p is the probability of losing 1$
    # N is a large integer representing the number of times to simulate
    # dur is a vector containing the simulated game durations
    T <- total.capital
    dur <- NULL
    for (n in 1:N) {
        x <- initial.capital
        d <- 0
        while ((x!=0)&&(x!=T)) {
            x <- x+sample(c(-1,1),1,replace=TRUE,c(1-p,p))
            d <- d+1
        }
        dur <- c(dur,d)
    }
    mean(dur) #returns the mean duration of the game

The problem with this code is that, using the traditional control structures (while, for, etc.), it is rather slow. Does anyone know of a way I could vectorize the while and the for to produce faster code?

And while I'm at it, does anyone know of a discrete-event simulation package in R such as SimPy for Python?

Thanks in advance
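One possible way to vectorize the simulation, offered as a sketch rather than as an answer from the thread: run all N games side by side and advance every unfinished game by one turn per pass of the while loop. The loop then runs once per turn of the longest game instead of once per turn of every game. The function name sim_durations is made up for this example.

```r
# Simulate N games at once; returns the vector of game durations.
sim_durations <- function(N, initial.capital, total.capital, p) {
  x <- rep(initial.capital, N)              # current capital in each game
  d <- integer(N)                           # turns played in each game
  active <- x != 0 & x != total.capital     # games still running
  while (any(active)) {
    k <- sum(active)
    step <- sample(c(-1, 1), k, replace = TRUE, prob = c(1 - p, p))
    x[active] <- x[active] + step           # one turn for every running game
    d[active] <- d[active] + 1L
    active <- x != 0 & x != total.capital
  }
  d
}

set.seed(1)
dur <- sim_durations(2000, initial.capital = 3, total.capital = 10, p = 0.5)
mean(dur)   # for a fair game this should land near 3 * (10 - 3) = 21
```

The result vector is allocated once up front, which also avoids the repeated dur <- c(dur, d) growth in the original code.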
Re: [R] Vectorization Problem
Sergey Goriatchev [EMAIL PROTECTED] wrote in news:[EMAIL PROTECTED]:

> I have the code for the bivariate Gaussian copula. It is written with for-loops, it works, but I wonder if there is a way to vectorize the function. I don't see how outer() can be used in this case, but maybe one can use mapply() or Vectorize() in some way? Could anyone help me, please?
>
> ## Density of Gauss Copula

(snipped your code that you didn't like)

When Yan built his copula package, he called the dmvnorm function from Leisch's mvtnorm package:

    dnormalCopula <- function(copula, u) {
        dim <- [EMAIL PROTECTED]
        sigma <- getSigma(copula)
        if (is.vector(u)) u <- matrix(u, ncol = dim)
        x <- qnorm(u)
        val <- dmvnorm(x, sigma = sigma) / apply(x, 1, function(v) prod(dnorm(v)))
        val[apply(u, 1, function(v) any(v <= 0))] <- 0
        val[apply(u, 1, function(v) any(v >= 1))] <- 0
        val
    }

If the mvtnorm package is installed, one looks at the dmvnorm function simply by typing:

    dmvnorm

I did not see any for-loops. After error checking, Leisch's code is:

    distval <- mahalanobis(x, center = mean, cov = sigma)
    logdet <- sum(log(eigen(sigma, symmetric = TRUE, only.values = TRUE)$values))
    logretval <- -(ncol(x) * log(2 * pi) + logdet + distval)/2
    if (log) return(logretval)
    exp(logretval)

--
David Winsemius
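Since the original poster's code was snipped, the following is only a guess at what a loop-free version could look like for the bivariate case. It uses the closed form of the ratio of the bivariate normal density to the product of its margins; the function name dgauss_copula and the grid are made up for the example, and inputs are assumed to lie strictly between 0 and 1.

```r
# Density of the bivariate Gaussian copula with correlation rho,
# vectorized over u and v (both assumed to be in the open interval (0, 1)).
dgauss_copula <- function(u, v, rho) {
  x <- qnorm(u)
  y <- qnorm(v)
  exp((2 * rho * x * y - rho^2 * (x^2 + y^2)) / (2 * (1 - rho^2))) /
    sqrt(1 - rho^2)
}

# Because the function is vectorized, outer() can evaluate it on a grid
grid <- seq(0.1, 0.9, by = 0.1)
dens <- outer(grid, grid, dgauss_copula, rho = 0.5)
```

For more than two dimensions, the dmvnorm-based approach quoted above is the more general route.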
[R] Vectorization/Speed Problem
Hi,

I cannot find a 'vectorized' solution to this 'for loop' kind of problem. Do you see a vectorized, fast-running solution?

Objective: Take the value of X at each timepoint and calculate the corresponding value of Y. Leading 0's and all 1's for X are assigned to Y; otherwise Y is incremented by the number of 0's adjacent to the last 1. The frequency and distribution of X vary widely and may have ~100 repeated 0's or 1's in a vector of 10k timepoints.

Example:

    time  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
    X     0  1  0  1  0  1  0  0  1  1  1  0  0  0  .  .
    Y     0  1  2  1  2  1  2  3  1  1  1  2  3  4  .  .

What I have done: My for() and apply()-related standard solutions are too slow. They are 6 times slower than my prototype, vectorized code which uses cumsum(). However(!)... my results are inaccurate and I can't correct them without introducing a for()! Here is my shot at a vectorized solution, as far as I can take it.

Preliminary Vectorized Code:

    X <- matrix(sample(c(1,0,0,0,0), 500, replace = TRUE), 25, 20, byrow=TRUE)
    colnames(X) <- c(paste("a", 1:20, sep=""))
    noX <- X; noX[X!=0] <- 0; cumX <- noX; cumNoX <- noX; Y1 <- noX; Y2 <- X; Y3 <- X
    for (e in 1:ncol(X)) {
        cumX[,e] <- cumsum(X[,e])
        noX[X[,e] < 1 & cumsum(X[,e]) > 0, e] <- 1
        cumNoX[,e] <- cumsum(noX[,e])
    }
    Y1[cumNoX > 0] <- cumNoX[cumNoX > 0] + 1
    Y2[X == 0 & noX > 0] <- Y1[X == 0 & noX > 0]
    Y3 <- Y2
    Y3[cumX > 1 & noX > 0] <- Y2[cumX > 1 & noX > 0] - cumX[cumX > 1 & noX > 0]
    X; Y3

Your help would be greatly appreciated! I'm stuck.

Thank you,
Tom Johnson
Re: [R] Vectorization/Speed Problem
Let x be the input vector and cx be the cumulative running sum of it. Then seq_along(cx) - match(cx, cx) gives increasing sequences starting at 0, and for those after the leading zeros we start them at 1 by adding cummax(x).

    x <- c(0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0) # input
    cx <- cumsum(x)
    seq_along(cx) - match(cx, cx) + cummax(x)

On Nov 20, 2007 6:42 PM, Tom Johnson [EMAIL PROTECTED] wrote:
> Hi,
>
> I cannot find a 'vectorized' solution to this 'for loop' kind of problem. Do you see a vectorized, fast-running solution?
>
> Objective: Take the value of X at each timepoint and calculate the corresponding value of Y. Leading 0's and all 1's for X are assigned to Y; otherwise Y is incremented by the number of 0's adjacent to the last 1. The frequency and distribution of X vary widely and may have ~100 repeated 0's or 1's in a vector of 10k timepoints.
>
> Example:
>
>     time  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
>     X     0  1  0  1  0  1  0  0  1  1  1  0  0  0  .  .
>     Y     0  1  2  1  2  1  2  3  1  1  1  2  3  4  .  .
>
> What I have done: My for() and apply()-related standard solutions are too slow. They are 6 times slower than my prototype, vectorized code which uses cumsum(). However(!)... my results are inaccurate and I can't correct them without introducing a for()! Here is my shot at a vectorized solution, as far as I can take it.
>
> Preliminary Vectorized Code:
>
>     X <- matrix(sample(c(1,0,0,0,0), 500, replace = TRUE), 25, 20, byrow=TRUE)
>     colnames(X) <- c(paste("a", 1:20, sep=""))
>     noX <- X; noX[X!=0] <- 0; cumX <- noX; cumNoX <- noX; Y1 <- noX; Y2 <- X; Y3 <- X
>     for (e in 1:ncol(X)) {
>         cumX[,e] <- cumsum(X[,e])
>         noX[X[,e] < 1 & cumsum(X[,e]) > 0, e] <- 1
>         cumNoX[,e] <- cumsum(noX[,e])
>     }
>     Y1[cumNoX > 0] <- cumNoX[cumNoX > 0] + 1
>     Y2[X == 0 & noX > 0] <- Y1[X == 0 & noX > 0]
>     Y3 <- Y2
>     Y3[cumX > 1 & noX > 0] <- Y2[cumX > 1 & noX > 0] - cumX[cumX > 1 & noX > 0]
>     X; Y3
>
> Your help would be greatly appreciated! I'm stuck.
>
> Thank you,
> Tom Johnson
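A small variation on the expression above, wrapped in a function so it can be applied to every column of a matrix. This is a sketch, and the name count_since_one is made up. It assumes that all leading zeros should stay 0, which is how the question's "Leading 0's ... are assigned to Y" reads; the one-line expression in the reply gives the same result on the example vector but counts upward when there is more than one leading zero.

```r
# Counter that restarts at 1 on every 1 and keeps counting through the zeros
# that follow; anything before the first 1 stays 0.
count_since_one <- function(x) {
  cx <- cumsum(x)
  # seq_along(cx) - match(cx, cx) + 1 is the position within the current run;
  # multiplying by cummax(x) zeroes out everything before the first 1
  (seq_along(cx) - match(cx, cx) + 1) * cummax(x)
}

x <- c(0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0)
y <- count_since_one(x)   # 0 1 2 1 2 1 2 3 1 1 1 2 3 4

# Column-wise on a 0/1 matrix like the X in the question
set.seed(1)
X <- matrix(sample(c(1, 0, 0, 0, 0), 500, replace = TRUE), 25, 20, byrow = TRUE)
Y <- apply(X, 2, count_since_one)
```

apply() still loops over the 20 columns, but the work within each column is fully vectorized.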