Re: [R] Memory management
Okay thanks, I'm going through the docs now, and I came across this:

   The named field is set and accessed by the SET_NAMED and NAMED
   macros, and take values 0, 1 and 2. R has a 'call by value'
   illusion, so an assignment like

       b <- a

   appears to make a copy of a and refer to it as b. However, if
   neither a nor b are subsequently altered there is no need to copy.
   What really happens is that a new symbol b is bound to the same
   value as a and the named field on the value object is set (in this
   case to 2). When an object is about to be altered, the named field
   is consulted. A value of 2 means that the object must be duplicated
   before being changed.

What does it mean that the new symbol b is bound to the same value as a? Does it mean b holds a pointer to the same object as a? Thanks!!

- y
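One can watch this copy-on-modify behaviour directly. A minimal sketch, assuming an R build with memory profiling enabled (the default in CRAN binary builds), where tracemem() prints a message at the moment the shared object is duplicated:

a <- runif(1e6)
b <- a           # no copy yet: b is bound to the same value object as a
tracemem(a)      # report whenever this object gets duplicated
b[1] <- 0        # b is about to be altered, so a duplicate is made here
                 # (tracemem prints a duplication message)
a[1]             # a is unchanged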
Re: [R] Memory management
I guess I have more reading to do. Are there any websites where I can read up on memory management, or specifically on what happens when we 'pass in' variables, and which strategy is better in which situation? Thanks~

- y
Re: [R] Memory management
Start with the 'R Internals' manual.

R has 'call by value' semantics, but lazy copying (the idea is to make a copy only when an object is changed and there are still references to the original version, but that idea is only partially implemented).

'Which strategy is better in which situation' is difficult. 'S Programming' (see the FAQ) has a lot of accumulated wisdom, though much of it has been superseded by changes to S and R. We keep making changes to reduce copying (another slew of changes is planned for 2.6.0), so this is something that is very hard to keep up with. We can tell you that some things are likely to be bad, and 'S Programming' is a good place to find out about most of those.

On Wed, 11 Apr 2007, yoo wrote:

> I guess I have more reading to do. Are there any websites where I can
> read up on memory management, or specifically on what happens when we
> 'pass in' variables, and which strategy is better in which situation?

--
Brian D. Ripley, Professor of Applied Statistics, University of Oxford
Re: [R] Memory management
Before you go down that road, I would recommend first seeing whether it is really a problem. Premature code optimization is, in my opinion, never a good idea.

Also, reading the Details section of ?attach you will find this:

   The database is not actually attached. Rather, a new environment is
   created on the search path and the elements of a list (including
   columns of a data frame) or objects in a save file or an environment
   are copied into the new environment. If you use <<- or assign to
   assign to an attached database, you only alter the attached copy,
   not the original object. (Normal assignment will place a modified
   version in the user's workspace: see the examples.) For this reason
   attach can lead to confusion.

So in fact it is the attaching that has to do copying, not the other way around.

As for references, perhaps there is a better one, but searching for "pass" in Writing R Extensions I found the following on page 41:

   Some memory allocation is obvious in interpreted code, for example,

       y <- x + 1

   allocates memory for a new vector y. Other memory allocation is less
   obvious and occurs because R is forced to make good on its promise of
   'call-by-value' argument passing. When an argument is passed to a
   function it is not immediately copied. Copying occurs (if necessary)
   only when the argument is modified. This can lead to surprising
   memory use.

Perhaps a better source is section 4.3.3 of the R Language Definition, on Argument Evaluation.

On Apr 11, 2007, at 8:25 AM, yoo wrote:

> I guess I have more reading to do. Are there any websites where I can
> read up on memory management, or specifically on what happens when we
> 'pass in' variables, and which strategy is better in which situation?

Haris Skiadas
Department of Mathematics and Computer Science, Hanover College
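A small runnable sketch of that paragraph from ?attach, with a toy list: modifying an attached variable leaves both the original object and the attached copy alone, and drops the modified version into the workspace instead:

lt <- list(x = 1:5)
attach(lt)
x[1] <- 99          # normal assignment: a modified copy lands in the workspace
lt$x                # 1 2 3 4 5 -- the original list is untouched
get("x", pos = 2)   # 1 2 3 4 5 -- so is the copy in the attached environment
detach(lt)
rm(x)               # clean up the modified copy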
[R] Memory management
Hi all, I'm just curious how memory management works in R... I need to run an optimization that keeps calling the same function with a large set of parameters... so then I start to wonder if it's better if I attach the variables first vs passing them in (coz that involves a lot of copying...). Thus, I do this:

fn3 <- function(x, y, z, a, b, c) {
    sum(x, y, z, a, b, c)
}
fn4 <- function() {
    sum(x, y, z, a, b, c)
}

rdn <- rep(1.1, times = 1e8)

r <- proc.time()
for (i in 1:5) fn3(rdn, rdn, rdn, rdn, rdn, rdn)
time1 <- proc.time() - r
print(time1)

lt <- list(x = rdn, y = rdn, z = rdn, a = rdn, b = rdn, c = rdn)
attach(lt)
r <- proc.time()
for (i in 1:5) fn4()
time2 <- proc.time() - r
print(time2)
detach(lt)

The output is:

[1] 25.691  0.003 25.735  0.000  0.000
[1] 25.822  0.005 25.860  0.000  0.000

It turns out attaching takes longer to run, which is counter-intuitive (unless the search into the pos = 2 environment takes a long time as well). Do you guys know why this is the case?
Re: [R] Memory management
On Tue, 10 Apr 2007, yoo wrote:

> Hi all, I'm just curious how memory management works in R... I need to
> run an optimization that keeps calling the same function with a large
> set of parameters... so then I start to wonder if it's better if I
> attach the variables first vs passing them in (coz that involves a lot
> of copying...)

Your parenthetical comment is wrong: no copying is needed to 'pass in' a variable.

> [timing code and output snipped]
>
> It turns out attaching takes longer to run, which is counter-intuitive
> (unless the search into the pos = 2 environment takes a long time as
> well). Do you guys know why this is the case?

I would not trust timing differences of that nature: they often depend on the state of the system, and in particular of the garbage collector. You should be using system.time() for that reason: it calls the garbage collector immediately before timing.

--
Brian D. Ripley, Professor of Applied Statistics, University of Oxford
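A sketch of the same comparison re-run the suggested way, reusing the names from the post but with the vector shrunk to 1e6 so it finishes quickly; system.time() calls gc() before timing:

fn3 <- function(x, y, z, a, b, c) sum(x, y, z, a, b, c)
fn4 <- function() sum(x, y, z, a, b, c)
rdn <- rep(1.1, times = 1e6)

print(system.time(for (i in 1:5) fn3(rdn, rdn, rdn, rdn, rdn, rdn)))

lt <- list(x = rdn, y = rdn, z = rdn, a = rdn, b = rdn, c = rdn)
attach(lt)
print(system.time(for (i in 1:5) fn4()))
detach(lt)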
Re: [R] memory management question [Broadcast]
I don't see why making copies of the columns you need inside the loop is better memory management. If the data are in a matrix, accessing elements is quite fast. If you're worrying about the speed of that, do what Charles suggests: work with the transpose, so that you are accessing elements in the same column in each iteration of the loop.

Andy

From: Federico Calboli

> Charles C. Berry wrote:
>
>> Whoa! You are accessing one ROW at a time. Either way this will tangle
>> up your cache if you have many rows and columns in your original data.
>> You might do better to do
>>
>>     Y <- t(X)   ### use '<-' !
>>     for (i in whatever) {
>>         ## do something using Y[, i]
>>     }
>
> My question is NOT how to write the fastest code, it is whether dummy
> variables (for lack of a better word) make the memory management
> better, i.e. faster, or not.
>
> Best, Fede
Re: [R] memory management question [Broadcast]
Liaw, Andy wrote:

> I don't see why making copies of the columns you need inside the loop
> is better memory management. If the data are in a matrix, accessing
> elements is quite fast.

As I said, this is pretty academic; I am not looking for how to do something differently. Having said that, let me present this code:

for (i in gp) {
    new[i, 1] = ifelse(srow[i] > 0, new[srow[i], zippo[i]], sav[i])
    new[i, 2] = ifelse(drow[i] > 0, new[drow[i], zappo[i]], sav[i])
}

where gp is a large vector and srow and drow are the dummy variables for:

srow = data[, 2]
drow = data[, 4]

If instead of the dummy variables I access the array directly (and it's a 60 x 6 array), the loop takes 2-3 days -- not sure here, I killed it after 48 hours. If I use dummy variables the code runs in 10 minutes-ish. Comments?

Best, Fede
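As an aside on the loop itself: each ifelse() call above tests a single value, so the scalar if/else form does the same work without the overhead of the vectorized ifelse(). A self-contained sketch with made-up stand-ins for the objects in the post (all shapes and values here are hypothetical):

set.seed(1)
n     <- 1000
new   <- matrix(0, nrow = n, ncol = 2)
sav   <- rnorm(n)
srow  <- sample(0:(n - 1), n)   # 0 meaning "no source row"
drow  <- sample(0:(n - 1), n)
zippo <- sample(1:2, n, replace = TRUE)
zappo <- sample(1:2, n, replace = TRUE)
gp    <- seq_len(n)

for (i in gp) {
    # scalar conditions, so plain if/else is cheaper than ifelse()
    new[i, 1] <- if (srow[i] > 0) new[srow[i], zippo[i]] else sav[i]
    new[i, 2] <- if (drow[i] > 0) new[drow[i], zappo[i]] else sav[i]
}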
Re: [R] memory management question [Broadcast]
On Tue, 20 Feb 2007, Federico Calboli wrote:

> [code snipped] If instead of the dummy variables I access the array
> directly, the loop takes 2-3 days. If I use dummy variables the code
> runs in 10 minutes-ish. Comments?

This is a bit different from your original post (where it appeared that you were manipulating one row of a matrix at a time), but the issue is the same. As suggested in my earlier email, this looks like a caching issue, and this is not peculiar to R. Viz.:

   Most modern CPUs are so fast that for most program workloads the
   locality of reference of memory accesses, and the efficiency of the
   caching and memory transfer between different levels of the
   hierarchy, is the practical limitation on processing speed. As a
   result, the CPU spends much of its time idling, waiting for memory
   I/O to complete.

(from http://en.wikipedia.org/wiki/Memory_hierarchy)

The computation you have is challenging to your cache, and the effect of dropping unused columns of your 'data' object by assigning the columns used to 'srow' and 'drow' has lightened the load.

If you do not know why SAXPY and friends are written as they are, a little bit of study will be rewarded by a much better understanding of these issues. I think Golub and Van Loan's 'Matrix Computations' touches on this (but I do not have my copy close to hand to check).

Charles C. Berry
Dept of Family/Preventive Medicine, UC San Diego
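A quick self-contained sketch of that locality effect: R stores matrices column-major, so summing down a column touches contiguous memory while summing along a row strides across it (the matrix here is made up; absolute timings will vary by machine):

m <- matrix(0, nrow = 10000, ncol = 1000)                  # ~76 Mb of doubles
system.time(for (j in seq_len(ncol(m))) s <- sum(m[, j]))  # contiguous access
system.time(for (i in seq_len(nrow(m))) s <- sum(m[i, ]))  # strided access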
Re: [R] memory management question [Broadcast]
Charles C. Berry wrote:

> As suggested in my earlier email, this looks like a caching issue, and
> this is not peculiar to R. [...] The computation you have is
> challenging to your cache, and the effect of dropping unused columns
> of your 'data' object by assigning the columns used to 'srow' and
> 'drow' has lightened the load.

Thanks for the clarifications. My bottom line is: I prefer dummy variables because they allow me to write cleaner code, with a shorter line for the same instruction, i.e. fewer chances of creeping errors (a + turning into a -, etc.). I've been challenged that that's memory-inefficient, and I wanted the opinion of people with more experience than mine on the matter.

Best, Fede
[R] memory management question
Hi All, I would like to ask the following. I have an array of data in an object, let's say X. I need to use a for loop on the elements of one or more columns of X, and I am having a debate with a colleague about the best memory management. I believe that if I do:

col1 = X[, 1]
col2 = X[, 2]
...
colx = X[, x]

and then

for (i in whatever) {
    ## do something using col1[i], col2[i], ..., colx[i]
}

my memory management is better than doing:

for (i in whatever) {
    ## do something using X[i, 1], X[i, 2], ..., X[i, x]
}

BTW, here I *have to* use a for() loop and no nifty tapply, lapply and family. Any comment is welcome.

Best, Fede
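A runnable sketch of the two patterns under debate, with a made-up matrix and a trivial stand-in for the loop body; on most systems the hoisted-column version wins, because col1[i] is a simple vector lookup while X[i, 1] goes through matrix indexing on every iteration:

X <- matrix(rnorm(1e6), ncol = 2)   # 500000 x 2, made-up data
out <- numeric(nrow(X))

col1 <- X[, 1]; col2 <- X[, 2]      # the hoisted copies, made once
system.time(for (i in seq_len(nrow(X))) out[i] <- col1[i] + col2[i])
system.time(for (i in seq_len(nrow(X))) out[i] <- X[i, 1] + X[i, 2])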
Re: [R] memory management question
On Mon, 19 Feb 2007, Federico Calboli wrote:

> Hi All, I would like to ask the following. I have an array of data in
> an object, let's say X. I need to use a for loop on the elements of
> one or more columns of X, and I am having a debate with a colleague
> about the best memory management.

Yez guys should take this fight out into the parking lot. ;-)

Armed with gc(), system.time(), and whatever memory monitoring tools your OSes provide, you can pound each other with memory usage and timing stats till one of you screams 'uncle' or you both have had enough and decide to shake hands and come back inside.

> I believe that if I do:
>
> col1 = X[, 1]
> col2 = X[, 2]
> ...
> colx = X[, x]
>
> and then loop over col1[i], col2[i], ..., colx[i], my memory
> management is better than looping over X[i, 1], X[i, 2], ..., X[i, x].

Whoa! You are accessing one ROW at a time. Either way this will tangle up your cache if you have many rows and columns in your original data. You might do better to do:

Y <- t(X)   ### use '<-' !
for (i in whatever) {
    ## do something using Y[, i]
}

> BTW, here I *have to* use a for() loop and no nifty tapply, lapply and
> family. Any comment is welcome.

Charles C. Berry
Dept of Family/Preventive Medicine, UC San Diego
Re: [R] memory management question
Charles C. Berry wrote:

> Whoa! You are accessing one ROW at a time. Either way this will tangle
> up your cache if you have many rows and columns in your original data.
> You might do better to work with Y <- t(X) and use Y[, i].

My question is NOT how to write the fastest code; it is whether dummy variables (for lack of a better word) make the memory management better, i.e. faster, or not.

Best, Fede
[R] memory management
Hi All, just a quick (?) question while I wait for my code to run... I'm comparing the identity of the lines of a data frame, doing all possible pairwise comparisons. In doing so I use identical(), but that's by the way. I'm doing a (not so) quick and dirty check, and subsetting the data as data[row.numb, ] and data[a different row, ]. I suspect the problem there is that I load into memory the whole frame data[, ] every time, making the biz quite slow and wasteful. As I'm idly waiting, I thought: had I put every line of data[, ] as an item of a list, then done my pairwise comparisons using the list, would I have had better performance? (Do I win the prize for the most convoluted sentence sent to R-help?) For the pedants: yes, I know I could kill the process and try it myself, but the spirit of the question is, is there a way of dealing with big data *efficiently*?

Best, Fede
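For what it's worth, base R already has a vectorised route for exactly this kind of all-pairs identity check: duplicated() accepts a data frame directly, so the row-by-row identical() loop can often be avoided altogether. A small sketch on a toy data frame (names here are illustrative):

df <- data.frame(a = c(1, 2, 1), b = c("x", "y", "x"))
duplicated(df)                                           # row 3 repeats row 1
which(duplicated(df) | duplicated(df, fromLast = TRUE))  # all rows in a duplicate pair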
Re: [R] memory management
This was asked before. Collapse the data frame into a vector, e.g.

v <- apply(DF, 1, function(x) paste(x, collapse = "_"))

then work with the values of that vector (table, unique, etc.). If your data frame is really large, run this in a DBMS.

-----Original Message-----
From: Federico Calboli
Sent: Monday, October 30, 2006 11:35 AM
To: r-help
Subject: [R] memory management

> [...] is there a way of dealing with big data *efficiently*?
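A small usage sketch of that suggestion on a toy data frame. One assumption to note: the separator "_" must not itself occur in the data, or two distinct rows could collapse to the same string:

DF <- data.frame(a = c(1, 2, 1), b = c("x", "y", "x"))
v <- apply(DF, 1, function(x) paste(x, collapse = "_"))
table(v)        # counts of each distinct row
unique(v)       # the distinct rows, as strings
duplicated(v)   # pairwise identity without any explicit loop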
[R] memory management
All, I've written some functions that use a list and a list of sub-lists, and I'm running into memory problems, even after changing memory.limit(). Does it make any difference to the handling of memory if I use simple vectors and matrices instead of the list and list of sub-lists? I suspect not, but just want to check. Thanks!

Dave
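Representation can in fact matter: every list element is a full R object with its own header, so a list of sub-lists of short vectors carries far more overhead than one matrix holding the same numbers. A quick check one can run with base R's object.size() (exact sizes vary by platform and R version):

m <- matrix(0, nrow = 1000, ncol = 10)                # one contiguous block
l <- lapply(1:1000, function(i) as.list(rep(0, 10)))  # list of sub-lists
object.size(m)   # roughly 8 bytes per element plus one small header
object.size(l)   # several times larger: per-element vectors plus list overhead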
Re: [R] Memory management on Windows (was Size of jpegs/pngs)
I think this is an issue about the amount of graphics memory. You are asking for an image of about 17000*2000*3 bytes = 102Mb, and you need more than that. From the help page:

   Windows imposes limits on the size of bitmaps: these are not
   documented in the SDK and may depend on the version of Windows. It
   seems that 'width' and 'height' are each limited to 2^15-1 and there
   is a 16Mb limit on the total amount of memory in Windows 95/98/ME.

so I do wonder why you are surprised. My laptop appears to be limited to about half your example with a 128Mb graphics card (and lots of other things going on).

On Sun, 2 Oct 2005, [EMAIL PROTECTED] wrote:

> Dear all, I have trouble with setting the size for jpegs and pngs. I
> need to save a dendrogram of 1000 words into a jpeg or png file. On
> one of my computers, the following works just fine:
>
> bb <- agnes(aa, method = "ward")
> jpeg("C:/Temp/test.txt", width = 17000, height = 2000)
> plot(bb)
> dev.off()
>
> On my main computer, however, this doesn't work:
>
> jpeg("C:/Temp/test.txt", width = 17000, height = 2000)
> Error in jpeg("C:/Temp/test.txt", width = 17000, height = 2000) :
>         unable to start device devWindows
> In addition: Warning message:
> Unable to allocate bitmap
>
> This is a Windows XP Pro SP2 system:
>
> > R.version
>          _
> platform i386-pc-mingw32
> arch     i386
> os       mingw32
> system   i386, mingw32
> status
> major    2
> minor    1.1
> year     2005
> month    06
> day      20
> language R
>
> which is started with a shortcut:
>
> C:\rw2011\bin\Rgui.exe --max-mem-size=1500M
>
> I checked the web and the R-help pages, tried out the ppsize option,
> and compared the options settings with those of the machine that works
> (which actually runs R 2.0.1 of 15 Nov 2004), but couldn't come up
> with an explanation. Any idea what I do wrong?

Did you read the help page?

--
Brian D. Ripley, Professor of Applied Statistics, University of Oxford
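The arithmetic behind that figure, assuming 3 bytes per pixel for a 24-bit RGB bitmap:

17000 * 2000 * 3         # 102,000,000 bytes of contiguous bitmap memory
17000 * 2000 * 3 / 10^6  # i.e. about 102 Mb, the figure in the reply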
Re: [R] Memory Management under Linux: Problems to allocate large amounts of data
Dear Prof. Ripley, thank you for your quick answer. You're right in assuming that we run R on a 32bit system. My technician tried to install R on an emulated 64bit Opteron machine, which led to some trouble -- maybe because the Opteron includes a 32bit processor which emulates 64bit (AMD64 x86_64). As you seem to have good experience with running R on a 64bit OS, I feel encouraged to have another try at this.

-----Original Message-----
From: Prof Brian Ripley
Sent: Wednesday, 29 June 2005, 15:18
To: Dubravko Dolic
Cc: r-help@stat.math.ethz.ch
Subject: Re: [R] Memory Management under Linux: Problems to allocate large amounts of data

> Let's assume this is a 32-bit Xeon and a 32-bit OS (there are
> 64-bit-capable Xeons). Then a user process like R gets a 4GB address
> space, 1GB of which is reserved for the kernel. So R has a 3GB address
> space, and it is trying to allocate a 2GB contiguous chunk. Because of
> memory fragmentation that is quite unlikely to succeed. We run 64-bit
> OSes on all our machines with 2GB or more RAM, for this reason.
Re: [R] Memory Management under Linux: Problems to allocate large amounts of data
On Thu, 30 Jun 2005, Dubravko Dolic wrote:

> Thank you for your quick answer. You're right in assuming that we run
> R on a 32bit system. My technician tried to install R on an emulated
> 64bit Opteron machine, which led to some trouble. [...] As you seem to
> have good experience with running R on a 64bit OS, I feel encouraged
> to have another try at this.

It should work out of the box on an Opteron Linux system: it does, for example, on FC3 and SuSE 9.x. Some earlier Linux distros for x86_64 are not fully 64-bit, but we ran R on FC2 (although some packages could not be installed). Trying to build a 32-bit version of R on FC3 does not work for me: the wrong libgcc_s is found. (One might want a 32-bit version for speed on small tasks.)

--
Brian D. Ripley, Professor of Applied Statistics, University of Oxford
Re: [R] Memory Management under Linux: Problems to allocate large amounts of data
Prof Brian Ripley writes:

> On Thu, 30 Jun 2005, Dubravko Dolic wrote:
>
>> My technician tried to install R on an emulated 64bit Opteron
>> machine, which led to some trouble. Maybe because the Opteron
>> includes a 32bit processor which emulates 64bit (AMD64 x86_64).

Er? What is an emulated Opteron machine? Opterons are 64 bit.

> It should work out of the box on an Opteron Linux system: it does, for
> example, on FC3 and SuSE 9.x. Some earlier Linux distros for x86_64
> are not fully 64-bit, but we ran R on FC2 (although some packages
> could not be installed).

On FC4 it is even easier:

yum install R R-devel

gets you a working R 2.1.1 straight away (from Fedora Extras). Only if you want to include hardcore optimized BLAS, or do not like the performance hit of having R as a shared library, do you need to compile at all.

--
Peter Dalgaard, Dept. of Biostatistics, University of Copenhagen
Re: [R] Memory Management under Linux: Problems to allocate large amounts of data
Dear Peter, AMD64 and EM64T (Intel) were designed as 32bit CPUs which are able to address 64bit registers, so they are not pure 64bit systems. This is why they are much cheaper than a real 64bit machine.

-----Original Message-----
From: Peter Dalgaard
Sent: Thursday, 30 June 2005, 11:48
Subject: Re: [R] Memory Management under Linux: Problems to allocate large amounts of data

> Er? What is an emulated Opteron machine? Opterons are 64 bit.
[R] Memory Management under Linux: Problems to allocate large amounts of data
Dear Group, I'm still trying to bring a lot of data into R (see older postings). After solving some troubles with the database I do most of the work in MySQL, but it would still be nice to work on some data using R. For this I can use a dedicated server with Gentoo Linux as OS, hosting only R. This server is a nice machine with two CPUs and 4GB RAM, which should do the job:

Dual Intel XEON 3.06 GHz
4 x 1 GB RAM PC2100 CL2
HP Proliant DL380-G3

I read the R online help on memory issues and the article on garbage collection from R News 1/2001 (Luke Tierney). The FAQ and some newsgroup postings were also very helpful in understanding memory issues when using R.

Now I try to read data from a database. The data I want to read consists of 158902553 rows and one field (column) and is of type bigint(20) in the database. I received the message that R could not allocate the 2048000 Kb (almost 2GB) sized vector. As I have 4GB of RAM I could not imagine why this happened. In my understanding, R under Linux (32bit) should be able to use the full RAM. As there is not much space used by the OS and R as such (free shows the use of approx. 670 MB after dbSendQuery and fetch), there are 3GB to be occupied by R. Is that correct?

After that I started R by setting n/vsize explicitly:

R --min-vsize=10M --max-vsize=3G --min-nsize=500k --max-nsize=100M

> mem.limits()
    nsize     vsize
104857600        NA

and received the same message. A garbage collection delivered the following information:

> gc()
         used (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
Ncells 217234  5.9     500000   13.4               280050   13.4
Vcells  87472  0.7  157650064 1202.8       3072 196695437 1500.7

Now I'm at a loss. Maybe anyone could give me a hint where I should read further, or which information can take me any further.

Dubravko Dolic, Statistical Analyst, Komdat GmbH, München
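A back-of-envelope sketch of the request, using only numbers from the post and assuming the bigint column is fetched into an R double vector (8 bytes per element):

158902553 * 8 / 2^20      # ~1212 Mb just for the final vector
2 * 158902553 * 8 / 2^20  # ~2425 Mb if the fetch needs one transient copy,
                          # each of which must be a single contiguous block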
Re: [R] Memory Management under Linux: Problems to allocate large amounts of data
Let's assume this is a 32-bit Xeon and a 32-bit OS (there are 64-bit-capable Xeons). Then a user process like R gets a 4GB address space, 1GB of which is reserved for the kernel. So R has a 3GB address space, and it is trying to allocate a 2GB contiguous chunk. Because of memory fragmentation that is quite unlikely to succeed. We run 64-bit OSes on all our machines with 2GB or more RAM, for this reason.

On Wed, 29 Jun 2005, Dubravko Dolic wrote:

> [...] As there is not much space used by the OS and R as such (free
> shows the use of approx. 670 MB after dbSendQuery and fetch), there
> are 3GB to be occupied by R. Is that correct?

Not really. The R executable code and the Ncells are already in the address space, and this is a virtual memory OS, so the amount of RAM is not relevant (it would still be a 3GB limit with 12GB of RAM).

--
Brian D. Ripley, Professor of Applied Statistics, University of Oxford