Re: [R] How to make this for() loop memory efficient?
Ray, your solution works and is indeed faster than mine! Unfortunately, it looks like it's still going to take a few days to do 400,000 rows.

Steve, thanks for your help, I'll definitely self-teach plyr and data.table.

- Isaac
Research Assistant
Quantitative Finance Faculty, UTS

--
View this message in context: http://r.789695.n4.nabble.com/How-to-make-this-for-loop-memory-efficient-tp4283594p4284716.html
Sent from the R help mailing list archive at Nabble.com.
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] How to make this for() loop memory efficient?
On 01/11/2012 12:09 AM, iliketurtles wrote:
> Ray, your solution works and is indeed faster than mine! It looks like it's
> going to take a few days to do 400,000 rows, still, which is unfortunate.
>
> Steve, thanks for your help, I'll definitely self-teach plyr and data.table.

I added a column with the first two digits of the module

  data$XX <- substr(L[,2], 1, 2)

then created a data frame that summarized the first module of each call and the length of the phone call

  df <- with(data, data.frame(FirstModule=tapply(XX, `phone calls`, `[[`, 1),
                              Length=tapply(XX, `phone calls`, length)))

then summarized the length of the phone calls associated with each module

  with(df, tapply(Length, FirstModule, mean))

resulting in

    82   84   92   93   94   96   97
  1.00 2.00 1.75 1.67 1.00 1.22 1.67

Martin

--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
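The following is a minimal sketch of Martin's tapply() approach run end to end on the 50-row example from the original post. The data construction is copied from that post; the intermediate names `first_module`, `call_length` and `res` are made up for this illustration, and the `as.vector()` calls are an addition that simply drops the array attributes tapply() returns.

```r
# Rebuild the example data from the original post.
a <- sapply(1:50, function(i) paste("call", i, sep = ""))
development.a  <- seq(1, 40, 3)
development.a2 <- seq(1, 40, 5)
a[development.a]  <- a[development.a + 1]
a[development.a2] <- a[development.a2 + 1]
a[1:2] <- "call2a"; a[3] <- "call3a"; a[4:5] <- "call5a"
a[6:8] <- "call8a"; a[9] <- "call9a"
b <- c(920010,960010,820009,920010,960500,970050,930010,920010,960500,970050,
       930900,870010,840010,960500,920010,970050,930010,960500,920010,970050,
       930010,960010,920010,940010,960010,970010,960500,920010,970050,930010,
       960500,920010,970050,930010,960500,920010,970050,930010,920010,960500,
       970050,930010,920009,960500,970050,930009,940010,960500,960500,960500)
data <- data.frame(a, b, stringsAsFactors = FALSE)
colnames(data) <- c("phone calls", "modules")

# First two digits of every module code.
data$XX <- substr(as.character(data$modules), 1, 2)

# Per call: prefix of the first module, and number of modules in the call.
first_module <- tapply(data$XX, data$`phone calls`, `[[`, 1)
call_length  <- tapply(data$XX, data$`phone calls`, length)

# Mean call length for each first-module prefix.
res <- tapply(as.vector(call_length), as.vector(first_module), mean)
round(res, 2)
```

Both tapply() calls group by the same call IDs, so the two per-call vectors line up element by element. Every step works on whole columns, so the cost grows roughly linearly with the number of rows rather than with its square.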
[R] How to make this for() loop memory efficient?
##I have 2 columns of data. The first column is unique event IDs that represent a phone call made to a customer.
###So, if you see 3 entries together in the first column like follows:
matrix(c("call1a","call1a","call1a"))
##then this means that this particular phone call (the first call that's logged in the data set) was transferred
##between 3 different modules before the call was terminated.
##The second column is a numerical description of the module the call started with and then got transferred to prior to
##call termination. Now, I'll construct a
##representative array of the type of data I'm dealing with (the real data set goes
##on for X00,000s of rows):
##(Ignore how I construct the following array, it’s completely unrelated to how the actual data set was constructed).

a <- sapply(1:50,function(i){paste("call",i,sep="",collapse="")})
development.a <- seq(1,40,3)
development.a2 <- seq(1,40,5)
a[development.a] <- a[development.a+1]
a[development.a2] <- a[development.a2+1]
a[1:2] <- "call2a"; a[3] <- "call3a"; a[4:5] <- "call5a"; a[6:8] <- "call8a"; a[9] <- "call9a"
b <- c(920010,960010,820009,920010,960500,970050,930010,920010,960500,970050,
       930900,870010,840010,960500,920010,970050,930010,960500,920010,970050,
       930010,960010,920010,940010,960010,970010,960500,920010,970050,930010,
       960500,920010,970050,930010,960500,920010,970050,930010,920010,960500,
       970050,930010,920009,960500,970050,930009,940010,960500,960500,960500)
data <- as.data.frame(cbind(a,b))
colnames(data) <- c("phone calls","modules")
dim(data)
print(data[1:10,]) #sample of 10 rows

# Note that in the real data set, data[,2] ranges from 810,000 to 999,999. I've been tasked with the following:
# For each phone call that BEGINS with the module which is denoted by 81 (i.e. of the form 81X,XXX), what is the expected number of modules in these calls?
#Then it's the same question for each module beginning with 82, 83, 84, ... all the way until 99.
#I've created code that I think works for this, but I can't actually run it on the whole data set. I left it for 30 minutes and it only had about
#5% of the task completed (I clicked STOP then checked my output to see if I did it properly, and it seems correct).
#I know the apply() family specializes in vector operations, but I can't figure out how to complete the above question in any way other than
#loops.

L <- data
A <- array(0,dim=c(19,2)); rownames(A) <- seq(81,99,1)
A <- data.frame(A)
for(i in 1:(nrow(L)-1))
{
  if(L[(i+1),1]!=L[i,1])
  {
    A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),1] <-
    {
      A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),1]+length(grep(as.character(L[i+1,1]),L[,1],value=FALSE))
      #aggregate number of modules in the calls that begin with XX (not yet averaged).
    }
    A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),2] <-
    {
      A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),2]+1
    }
  }
}

#If I can get this code to be more memory efficient such that I can do it on a 400,000 row data set, I can do, for example,
A[17,1]/A[17,2]
#and I'll arrive at the mean number of modules per call where the call starts with a module that starts with 97.
A[17,1]
#is 10, which means that, out of every single call that started with a module of 97X,XXX,
#they went through 10 modules in total.
A[17,2]
#is 6, which means that there were 6 calls in total that began with a 97X,XXX module.
#Hence,
A[17,1]/A[17,2]
#is the average number of modules that were executed in all the calls that began with a 97X,XXX module.

- Isaac
Research Assistant
Quantitative Finance Faculty, UTS

--
View this message in context: http://r.789695.n4.nabble.com/How-to-make-this-for-loop-memory-efficient-tp4283594p4283594.html
Sent from the R help mailing list archive at Nabble.com.
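Because the rows of one call sit next to each other, the quantity asked for can be sketched without a loop using base R's rle(). This is only an illustration on seven made-up rows, not code from the thread; the names `calls`, `modules`, `runs`, `len`, `starts`, `prefix`, `total` and `ncalls` are all hypothetical. It assumes each call ID forms exactly one contiguous run: an ID that reappeared later would be treated as a separate call.

```r
# Made-up toy data: the rows of one call are contiguous, as in the post.
calls   <- c("c1", "c1", "c2", "c3", "c3", "c3", "c4")
modules <- c(920010, 960010, 970050, 920010, 960500, 930010, 970010)

runs   <- rle(calls)                    # one run per call
len    <- runs$lengths                  # number of modules in each call
starts <- cumsum(c(1, head(len, -1)))   # row index of each call's first module
prefix <- substr(as.character(modules[starts]), 1, 2)

total  <- tapply(len, prefix, sum)      # plays the role of A[, 1]
ncalls <- tapply(len, prefix, length)   # plays the role of A[, 2]
total / ncalls                          # mean modules per call, by first-module prefix
```

Here the two calls starting with a 92 module have 2 and 3 modules, and the two starting with a 97 module have 1 each, so the result is 2.5 for "92" and 1 for "97".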
Re: [R] How to make this for() loop memory efficient?
On Wed, 11 Jan 2012, iliketurtles wrote:
> ##I have 2 columns of data. The first column is unique event IDs that
> ##represent a phone call made to a customer.
> [...]

I don't see any need for you to use data frames. If you make A and data (not a good use of a variable name) just matrices, you get the same answers at about 10 times the speed (using your example).

Hope this helps,
Ray Brownrigg
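Ray's point about data frames can be checked with a rough timing sketch like the one below. The data are random stand-ins, the helper `read_cells` is hypothetical, and the measured ratio will vary by machine and R version, so no particular speed-up is claimed here.

```r
# Made-up stand-in data: 2000 rows, two character columns.
n  <- 2000
df <- data.frame(id = paste0("call", seq_len(n)),
                 module = as.character(sample(810000:999999, n)),
                 stringsAsFactors = FALSE)
m  <- as.matrix(df)

# Read every cell of column 2 one at a time, the access pattern the loop uses.
read_cells <- function(x) {
  out <- character(nrow(x))
  for (i in seq_len(nrow(x))) out[i] <- x[i, 2]
  out
}

t_df <- system.time(from_df <- read_cells(df))
t_m  <- system.time(from_m  <- read_cells(m))

identical(from_df, from_m)   # same values either way
c(data.frame = t_df[["elapsed"]], matrix = t_m[["elapsed"]])
```

Single-cell indexing on a data frame goes through method dispatch on every access, while a matrix is indexed directly, which is the usual explanation for the gap in a loop like this.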
Re: [R] How to make this for() loop memory efficient?
I'm having a really difficult time understanding what you're trying to get -- copying and pasting your code fails to run, and your question isn't clear, i.e.:

> For each phone call that BEGINS with the module which is denoted by 81 (i.e.
> of the form 81X,XXX), what is the expected number of modules in these calls?

How does one calculate the expected number of modules in this module? What does that even mean?

Anyway, here's some code using your `data` data.frame that calculates the number of unique calls and other statistics on the call id within each module prefix. I'm using both data.table and plyr ... there are no for loops. You will want to do `whatever it is you really want to do` inside the blocks below.

## R code
library(data.table)
data <- transform(data, module.prefix=substring(modules, 1, 2))
## take a look at `data` now

## calculate stuff inside each module.prefix using data.table
## (assumes the call-ID column is named `calls`)
xx <- data.table(data, key="module.prefix")
ans <- xx[, {
  ## the columns of the particular subset of your data.table
  ## are injected into the scope for this expression block
  ## which is where the `calls` variable below comes from
  tabled <- table(as.character(calls))
  list(unique.calls=length(tabled), min=min(tabled),
       median=as.numeric(median(tabled)), max=max(tabled))
  ## you will want to return your own list of stuff
}, by='module.prefix']

## with plyr
library(plyr)
ans <- ddply(data, "module.prefix", function(x) {
  ## `x` is a data.frame that all share the same module.prefix
  ## do whatever you want with it here
  tabled <- table(as.character(x$calls))
  c(unique.calls=length(tabled), min=min(tabled),
    median=median(tabled), max=max(tabled))
})

You'll have to read up on the particulars of data.table and plyr. Both are really powerful packages ... you should get familiar with at least one. plyr is a bit more flexible in some ways. data.table is a bit more strict (cf. the need for `as.numeric(median(tabled))`), but also tends to be (much) faster when working over large datasets.

HTH,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
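The split-apply-combine idea behind the message above can be sketched in base R with split() and sapply(). This is a stand-in on six made-up rows, not the data.table or plyr code itself, and the names `calls`, `modules`, `by_prefix` and `ans` are assumptions for the illustration. It groups every row by its own module prefix, which is a different grouping from "calls by the prefix of their first module"; that is one plausible reason such summaries would not match the output of the original loop.

```r
# Made-up toy data in the thread's layout.
calls   <- c("c1", "c1", "c2", "c3", "c3", "c3")
modules <- c(920010, 960010, 970050, 920010, 960500, 930010)
module.prefix <- substr(as.character(modules), 1, 2)

# Split the call IDs by the prefix of the module on the same row.
by_prefix <- split(calls, module.prefix)

# Summarise each group, then bind the summaries into one matrix.
ans <- t(sapply(by_prefix, function(ids) {
  tabled <- as.vector(table(ids))   # rows per call within this prefix
  c(unique.calls = length(tabled), min = min(tabled),
    median = median(tabled), max = max(tabled))
}))
ans
```

Note that a "96" group appears in `ans` even though no call in the toy data starts with a 96 module, because rows are grouped by their own prefix rather than by the call's first module.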
Re: [R] How to make this for() loop memory efficient?
On Wed, 11 Jan 2012, Ray Brownrigg wrote:
> I don't see any need for you to use data frames. If you make A and data (not
> a good use of a variable name) just matrices, you get the same answers at
> about 10 times the speed (using your example).

Further, you should calculate your rowname, namely:

  paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse="")

only once each loop, instead of 4 times. This saves another 25-30% CPU time. And you can combine the two updates into a single assignment.

So using the code:

L <- as.matrix(data)
A <- array(0, dim=c(19, 2)); rownames(A) <- seq(81, 99, 1)
# A <- data.frame(A)
for(i in 1:(nrow(L)-1)) {
  if(L[(i+1),1]!=L[i,1]) {
    myrow <- paste(strsplit(as.character(L[i+1, 2]), "")[[1]][1:2], sep="", collapse="")
    A[myrow, ] <- A[myrow, ] + c(length(grep(as.character(L[i+1, 1]), L[, 1], value=FALSE)), 1)
  }
}

is 15 times as fast as your original code.

Hope this helps,
Ray Brownrigg
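One further step in the same direction, sketched here as an assumption rather than as part of Ray's answer: the grep() inside the loop rescans the whole ID column for every call, so the work grows roughly with the square of the row count. Counting every ID once up front with table() removes that rescan. The six rows below are made up, and `call_len` and `starts` are hypothetical names. Like grep() in the thread, this assumes all rows sharing an ID belong to one call.

```r
# Made-up toy data: column 1 = call ID, column 2 = module code.
L <- cbind(c("c1", "c1", "c2", "c3", "c3", "c3"),
           c("920010", "960010", "970050", "920010", "960500", "930010"))

# One pass over the ID column gives every call's length (replaces grep()).
call_len <- table(L[, 1])

A <- array(0, dim = c(19, 2),
           dimnames = list(as.character(81:99), c("modules", "calls")))

# Rows where a new call starts. Row 1 is included here; the loop as posted
# only counts a call when it sees the ID change at row i+1, so it never
# counts the very first call in the data.
starts <- c(1, which(L[-1, 1] != L[-nrow(L), 1]) + 1)

for (i in starts) {
  myrow <- substr(L[i, 2], 1, 2)
  A[myrow, ] <- A[myrow, ] + c(call_len[L[i, 1]], 1)
}

A["92", "modules"] / A["92", "calls"]   # mean modules per call starting with 92
```

On this toy data the calls starting with a 92 module have 2 and 3 modules, so the final line gives 2.5. The loop now visits only the first row of each call, and each visit is a constant-time lookup.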
Re: [R] How to make this for() loop memory efficient?
Steve:

I don't understand why you couldn't get the original code working. You just have to notice that one comment overflows its line.

However I couldn't get your code to match the output of the original - almost, but not quite!

Ray

On Wed, 11 Jan 2012, Steve Lianoglou wrote:
> I'm having a really difficult time understanding what you're trying to get
> -- copy and pasting your code is failing to run, and your question isn't
> clear [...]
Re: [R] How to make this for() loop memory efficient?
Let me just reply to myself.

Sorry, it's funny how much I don't get this, but it appears Ray is following you and provides an answer -- scratch my email, it seems to be way off (you should still learn plyr and/or data.table if you haven't yet, tho ;-)

Apologies,
-steve

On Tue, Jan 10, 2012 at 7:18 PM, Steve Lianoglou mailinglist.honey...@gmail.com wrote:
> I'm having a really difficult time understanding what you're trying to get
> -- copy and pasting your code is failing to run, and your question isn't
> clear [...]
Re: [R] How to make this for() loop memory efficient?
Yeah -- just fired off an apology email before this landed in my inbox.

Sometimes I'm better off not trying to help at all -- this was one of those cases ;-) Whatever I was trying to do clearly was going down the wrong trail.

Thankfully, you're on top of it though.

Sorry for the spam,
-steve

On Tue, Jan 10, 2012 at 7:33 PM, Ray Brownrigg ray.brownr...@ecs.vuw.ac.nz wrote:
> Steve:
>
> I don't understand why you couldn't get the original code working. You just
> have to notice that one comment overflows its line.
>
> However I couldn't get your code to match the output of the original -
> almost, but not quite!
>
> Ray