[R] Memory usage
Hi, I have the lines of code below to understand how R manages memory.

> library(pryr)
Warning message:
package 'pryr' was built under R version 3.4.3
> mem_change(x <- 1:1e6)
4.01 MB
> mem_change(y <- x)
976 B
> mem_change(x[100] < NA)
976 B
> mem_change(rm(x))
864 B
> mem_change(rm(y))
-4 MB

I do understand why there is only a 976 B positive change in the third line: y and x both point to the same block of memory that holds 1:1e6. But I don't understand the result below:

> mem_change(rm(x))
864 B

Why did memory consumption increase here, even if only by a small amount, while deleting an object? Any detailed explanation will be appreciated.

Thanks,
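A short sketch of what seems to be going on (assuming pryr is installed; the byte counts in the comments are illustrative, not reproduced from the original session): mem_change() runs gc() before and after evaluating the expression, so tiny positive values mostly reflect the bookkeeping R does while evaluating the call, and rm(x) cannot release the large vector because y still references it.

library(pryr)
x <- 1:1e6
y <- x
address(x); address(y)   # same address: x and y share one block of memory
object_size(x, y)        # shared memory is counted only once (about 4 MB, not 8 MB)
mem_change(rm(x))        # small +/- change: the block is still owned by y
mem_change(rm(y))        # about -4 MB: the last reference is gone, so the block is freed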
Re: [R] Memory usage in prcomp
> On Mar 22, 2016, at 10:00 AM, Martin Maechler wrote:
>
>>>>> Roy Mendelssohn - NOAA Federal on Tue, 22 Mar 2016 07:42:10 -0700 writes:
>>
>> Hi All:
>> I am running prcomp on a very large array, roughly [50, 3650]. The array itself is 16GB. I am running on a Unix machine and am running "top" at the same time and am quite surprised to see that the application memory usage is 76GB. I have "tol" set very high (.8) so that it should only pull out a few components. I am surprised at this memory usage because prcomp uses the SVD if I am not mistaken, and when I take guesses at the size of the SVD matrices they shouldn't be that large. While I can fit this in, for a variety of reasons I would like to reduce the memory footprint. My questions:
>>
>> 1. I am running with "center=FALSE" and "scale=TRUE". Would I save memory if I scaled the data first myself, saved the result, cleared out the workspace, read the scaled data back in and did the prcomp call? Basically, are the intermediate calculations for scaling kept in memory after use?
>>
>> 2. I don't know how prcomp memory usage compares to a direct call to "svd", which allows me to explicitly set how many singular vectors to compute (I only need the first five at most). prcomp is convenient because it does a lot of the other work for me.
>
> For your example, where p := ncol(x) is 3650 but you only want the first 5 PCs, it would be *considerably* more efficient to use svd(..., nv = 5) directly.
>
> So I would take stats:::prcomp.default and modify it correspondingly.
>
> This seems such a useful idea in general that I consider updating the function in R with a new optional 'rank.' argument which you'd set to 5 in your case.
>
> Scrutinizing R's underlying svd() code, however, I now see that there are typically still two other [n x p] matrices created (one in R's La.svd(), one in C code) ... which I think should be unnecessary in this case ... but that would really be another topic (for R-devel, not R-help).
>
> Martin

Thanks. It is easy enough to recode using svd, and I think I will. It gives me a little more control over what the algorithm does.

-Roy
Re: [R] Memory usage in prcomp
>>>>> Roy Mendelssohn - NOAA Federal on Tue, 22 Mar 2016 07:42:10 -0700 writes:

> Hi All:
> I am running prcomp on a very large array, roughly [50, 3650]. The array itself is 16GB. I am running on a Unix machine and am running "top" at the same time and am quite surprised to see that the application memory usage is 76GB. I have "tol" set very high (.8) so that it should only pull out a few components. I am surprised at this memory usage because prcomp uses the SVD if I am not mistaken, and when I take guesses at the size of the SVD matrices they shouldn't be that large. While I can fit this in, for a variety of reasons I would like to reduce the memory footprint. My questions:
>
> 1. I am running with "center=FALSE" and "scale=TRUE". Would I save memory if I scaled the data first myself, saved the result, cleared out the workspace, read the scaled data back in and did the prcomp call? Basically, are the intermediate calculations for scaling kept in memory after use?
>
> 2. I don't know how prcomp memory usage compares to a direct call to "svd", which allows me to explicitly set how many singular vectors to compute (I only need the first five at most). prcomp is convenient because it does a lot of the other work for me.

For your example, where p := ncol(x) is 3650 but you only want the first 5 PCs, it would be *considerably* more efficient to use svd(..., nv = 5) directly.

So I would take stats:::prcomp.default and modify it correspondingly.

This seems such a useful idea in general that I consider updating the function in R with a new optional 'rank.' argument which you'd set to 5 in your case.

Scrutinizing R's underlying svd() code, however, I now see that there are typically still two other [n x p] matrices created (one in R's La.svd(), one in C code) ... which I think should be unnecessary in this case ... but that would really be another topic (for R-devel, not R-help).

Martin
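A minimal sketch of the svd()-based route Martin describes, assuming `x` is the (hypothetical) data matrix and that it should be scaled the way prcomp(center = FALSE, scale. = TRUE) would scale it; this is not Martin's intended code for a future 'rank.' argument, just one way to get the first few components:

k  <- 5
xs <- scale(x, center = FALSE, scale = TRUE)   # same scaling prcomp() applies internally
sv <- svd(xs, nu = 0, nv = k)                  # ask for only the first k right singular vectors
rotation <- sv$v                               # p x k loadings  (prcomp's $rotation)
scores   <- xs %*% sv$v                        # n x k scores    (prcomp's $x)
sdev     <- sv$d / sqrt(nrow(xs) - 1)          # singular values -> component standard deviations

As Martin notes, svd() still allocates some full-size intermediates internally, so this mainly avoids the full-width rotation and score matrices prcomp would otherwise build and return.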
[R] Memory usage in prcomp
Hi All:

I am running prcomp on a very large array, roughly [50, 3650]. The array itself is 16GB. I am running on a Unix machine and am running "top" at the same time, and am quite surprised to see that the application memory usage is 76GB. I have "tol" set very high (.8) so that it should only pull out a few components. I am surprised at this memory usage because prcomp uses the SVD if I am not mistaken, and when I take guesses at the size of the SVD matrices they shouldn't be that large. While I can fit this in, for a variety of reasons I would like to reduce the memory footprint. My questions:

1. I am running with "center=FALSE" and "scale=TRUE". Would I save memory if I scaled the data first myself, saved the result, cleared out the workspace, read the scaled data back in and did the prcomp call? Basically, are the intermediate calculations for scaling kept in memory after use?

2. I don't know how prcomp memory usage compares to a direct call to "svd", which allows me to explicitly set how many singular vectors to compute (I only need the first five at most). prcomp is convenient because it does a lot of the other work for me.

"The contents of this message do not reflect any position of the U.S. Government or NOAA."
Roy Mendelssohn, Supervisory Operations Research Analyst, NOAA/NMFS Environmental Research Division, Southwest Fisheries Science Center, Santa Cruz, CA
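On question 1, a hedged sketch of the scale-first idea (object and file names here are hypothetical, and whether it actually helps depends on how many intermediate copies prcomp's internal scaling would otherwise make):

xs <- scale(x, center = FALSE, scale = TRUE)   # the same scaling prcomp(center = FALSE, scale. = TRUE) applies
saveRDS(xs, "x_scaled.rds")                    # park the scaled copy on disk
rm(x, xs); gc()                                # drop both copies from the workspace
xs <- readRDS("x_scaled.rds")                  # read back only the scaled data
pc <- prcomp(xs, center = FALSE, scale. = FALSE, tol = 0.8)   # tol as in the original post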
[R] Memory usage problem while using nlm function
Hi, I am trying to do nonlinear minimization using the nlm() function, but for a large amount of data it runs out of memory. The code I am using:

f <- function(p, n11, E) {
  sum(-log(p[5] * dnbinom(n11, size = p[1], prob = p[2]/(p[2] + E)) +
           (1 - p[5]) * dnbinom(n11, size = p[3], prob = p[4]/(p[4] + E))))
}
p_out <- nlm(f, p = c(alpha1 = 0.2, beta1 = 0.06, alpha2 = 1.4, beta2 = 1.8, w = 0.1),
             n11 = n11_c, E = E_c)

When the n11_c or E_c vector is too large, it runs out of memory. Please give me some solution for this.
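One possible way to shrink the problem, offered only as a sketch: if many (n11, E) pairs are duplicated (for example when E is rounded), the data can be collapsed to unique pairs and their contributions weighted, so the vectors passed to dnbinom() are much shorter. If E is continuous and pairs rarely repeat, this buys nothing.

agg <- aggregate(list(w = rep(1, length(n11_c))),
                 by = list(n11 = n11_c, E = E_c), FUN = sum)   # count each unique (n11, E) pair
f_w <- function(p, n11, E, w) {
  -sum(w * log(p[5] * dnbinom(n11, size = p[1], prob = p[2]/(p[2] + E)) +
               (1 - p[5]) * dnbinom(n11, size = p[3], prob = p[4]/(p[4] + E))))
}
p_out <- nlm(f_w, p = c(alpha1 = 0.2, beta1 = 0.06, alpha2 = 1.4, beta2 = 1.8, w = 0.1),
             n11 = agg$n11, E = agg$E, w = agg$w)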
[R] memory usage with party::cforest
Is there a way to shrink the size of a RandomForest-class object (an S4 object), so that it requires less memory at run time and less disk space for serialization? On my system the data slot is about 2GB, which is causing problems, and I'd like to see whether predict() works without it.

# example with a much smaller data set (i.e., less than 2GB)
require(party)
data(iris)
cf <- cforest(Species ~ ., data = iris)
str(cf, max.level = 2)
cf@data <- NULL   # this fails

Andrew
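A hedged sketch of how to see why the NULL assignment is rejected and which slots are actually large (these are all base 'methods'/'utils' functions; note that object.size() does not descend into environments, so an environment-valued slot such as @data may look deceptively small):

slotNames(cf)
getSlots(class(cf))   # declared class of each slot; S4 validity rejects a NULL that does not match it
sapply(slotNames(cf), function(s) object.size(slot(cf, s)))   # rough size per slot, in bytes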
Re: [R] Memory usage bar plot
init 148.0 KiB + 26.0 KiB = 174.0 KiB mapping-daemon 152.0 KiB + 25.5 KiB = 177.5 KiB gnome-keyring-daemon 152.0 KiB + 27.5 KiB = 179.5 KiB portmap 164.0 KiB + 18.0 KiB = 182.0 KiB syslogd 168.0 KiB + 24.5 KiB = 192.5 KiB atd 180.0 KiB + 18.5 KiB = 198.5 KiB brcm_iscsiuio 188.0 KiB + 37.0 KiB = 225.0 KiB rpc.statd 208.0 KiB + 24.0 KiB = 232.0 KiB audispd 208.0 KiB + 40.5 KiB = 248.5 KiB hald-runner 244.0 KiB + 23.5 KiB = 267.5 KiB smartd 240.0 KiB + 35.5 KiB = 275.5 KiB hpiod 244.0 KiB + 35.0 KiB = 279.0 KiB hcid 228.0 KiB + 75.0 KiB = 303.0 KiB hald-addon-keyboard (2) 196.0 KiB + 144.0 KiB = 340.0 KiB sh 328.0 KiB + 32.5 KiB = 360.5 KiB gam_server 336.0 KiB + 32.5 KiB = 368.5 KiB xinetd 364.0 KiB + 28.5 KiB = 392.5 KiB auditd 420.0 KiB + 84.0 KiB = 504.0 KiB mingetty (6) 552.0 KiB + 19.5 KiB = 571.5 KiB udevd 532.0 KiB + 56.0 KiB = 588.0 KiB rpc.idmapd 544.0 KiB + 50.5 KiB = 594.5 KiB ssh-agent 612.0 KiB + 29.0 KiB = 641.0 KiB crond 484.0 KiB + 176.0 KiB = 660.0 KiB avahi-daemon (2) 576.0 KiB + 164.0 KiB = 740.0 KiB sftp-server 744.0 KiB + 74.5 KiB = 818.5 KiB automount 756.0 KiB + 186.5 KiB = 942.5 KiB gnome-vfs-daemon 736.0 KiB + 296.0 KiB = 1.0 MiB dbus-daemon (2) 988.0 KiB + 61.5 KiB = 1.0 MiB pcscd 824.0 KiB + 231.5 KiB = 1.0 MiB pam-panel-icon 1.0 MiB + 26.0 KiB = 1.1 MiB nmon 864.0 KiB + 229.5 KiB = 1.1 MiB bt-applet 712.0 KiB + 398.0 KiB = 1.1 MiB nm-system-settings 1.0 MiB + 63.0 KiB = 1.1 MiB nmbd 996.0 KiB + 131.0 KiB = 1.1 MiB bonobo-activation-server 740.0 KiB + 395.5 KiB = 1.1 MiB escd 880.0 KiB + 432.0 KiB = 1.3 MiB bash (2) 1.1 MiB + 212.5 KiB = 1.3 MiB gnome-screensaver 796.0 KiB + 617.5 KiB = 1.4 MiB gdm-rh-security-token-helper 916.0 KiB + 739.5 KiB = 1.6 MiB gdm-binary (2) 1.2 MiB + 387.5 KiB = 1.6 MiB gnome-session 1.4 MiB + 221.0 KiB = 1.6 MiB cupsd 1.3 MiB + 443.5 KiB = 1.8 MiB notification-area-applet 2.1 MiB + 69.0 KiB = 2.2 MiB xfs 1.8 MiB + 545.5 KiB = 2.3 MiB eggcups 2.2 MiB + 86.5 KiB = 2.3 MiB gconfd-2 1.9 MiB + 492.5 KiB = 2.4 MiB gnome-settings-daemon 2.0 MiB + 421.5 KiB = 2.4 MiB gnome-power-manager 1.9 MiB + 569.0 KiB = 2.5 MiB trashapplet 1.7 MiB + 1.0 MiB = 2.7 MiB smbd (2) 2.6 MiB + 365.0 KiB = 2.9 MiB iscsid (2) 2.7 MiB + 349.0 KiB = 3.0 MiB sendmail.sendmail (2) 3.2 MiB + 73.0 KiB = 3.2 MiB hald 2.7 MiB + 649.0 KiB = 3.4 MiB clock-applet 2.5 MiB + 1.4 MiB = 3.9 MiB nm-applet 3.4 MiB + 729.5 KiB = 4.1 MiB metacity 2.8 MiB + 1.4 MiB = 4.2 MiB sshd (4) 3.4 MiB + 853.0 KiB = 4.3 MiB wnck-applet 4.4 MiB + 377.5 KiB = 4.8 MiB Xorg 4.3 MiB + 717.5 KiB = 5.0 MiB mixer_applet2 4.5 MiB + 809.5 KiB = 5.3 MiB gnome-panel 5.3 MiB + 251.5 KiB = 5.6 MiB hpssd.py 4.0 MiB + 3.3 MiB = 7.2 MiB httpd (11) 10.5 MiB + 870.0 KiB = 11.3 MiB gdmgreeter 12.8 MiB + 1.1 MiB = 13.8 MiB Xvnc 13.7 MiB + 515.5 KiB = 14.2 MiB yum-updatesd 16.3 MiB + 1.6 MiB = 17.9 MiB nautilus 20.8 MiB + 1.4 MiB = 22.2 MiB puplet 1.5 GiB + 438.0 KiB = 1.5 GiB java - 1.7 GiB = Thanks, Mohan From: jim holtman jholt...@gmail.com To: mohan.radhakrish...@polarisft.com Cc: R mailing list r-help@r-project.org Date: 08/30/2013 07:14 PM Subject:Re: [R] Memory usage bar plot Here is how to parse the data and put it into groups. Not sure what the 'timing' of each group is since not time information was given. Also not sure is there is an 'MiB' qualifier on the data, but you have the matrix of data which is easy to do with as you want. 
input - readLines(textConnection( + Private + Shared = RAM used Program + + 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd + 108.0 KiB + 12.5 KiB = 120.5 KiB klogd + 124.0 KiB + 17.0 KiB = 141.0 KiB hidd + 116.0 KiB + 30.0 KiB = 146.0 KiB acpid + 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage + 144.0 KiB + 15.0 KiB = 159.0 KiB gpm + 136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check + - + 453.9 MiB + + = + Private + Shared = RAM used Program + + 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd + 108.0 KiB + 12.5 KiB = 120.5 KiB klogd + 124.0 KiB + 17.0 KiB = 141.0 KiB hidd + 116.0 KiB + 30.0 KiB = 146.0 KiB acpid + 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage + 144.0
Re: [R] Memory usage bar plot
brcm_iscsiuio 188.0 KiB + 37.0 KiB = 225.0 KiB rpc.statd 208.0 KiB + 24.0 KiB = 232.0 KiB audispd 208.0 KiB + 40.5 KiB = 248.5 KiB hald-runner 244.0 KiB + 23.5 KiB = 267.5 KiB smartd 240.0 KiB + 35.5 KiB = 275.5 KiB hpiod 244.0 KiB + 35.0 KiB = 279.0 KiB hcid 228.0 KiB + 75.0 KiB = 303.0 KiB hald-addon-keyboard (2) 196.0 KiB + 144.0 KiB = 340.0 KiB sh 328.0 KiB + 32.5 KiB = 360.5 KiB gam_server 336.0 KiB + 32.5 KiB = 368.5 KiB xinetd 364.0 KiB + 28.5 KiB = 392.5 KiB auditd 420.0 KiB + 84.0 KiB = 504.0 KiB mingetty (6) 552.0 KiB + 19.5 KiB = 571.5 KiB udevd 532.0 KiB + 56.0 KiB = 588.0 KiB rpc.idmapd 544.0 KiB + 50.5 KiB = 594.5 KiB ssh-agent 612.0 KiB + 29.0 KiB = 641.0 KiB crond 484.0 KiB + 176.0 KiB = 660.0 KiB avahi-daemon (2) 576.0 KiB + 164.0 KiB = 740.0 KiB sftp-server 744.0 KiB + 74.5 KiB = 818.5 KiB automount 756.0 KiB + 186.5 KiB = 942.5 KiB gnome-vfs-daemon 736.0 KiB + 296.0 KiB = 1.0 MiB dbus-daemon (2) 988.0 KiB + 61.5 KiB = 1.0 MiB pcscd 824.0 KiB + 231.5 KiB = 1.0 MiB pam-panel-icon 1.0 MiB + 26.0 KiB = 1.1 MiB nmon 864.0 KiB + 229.5 KiB = 1.1 MiB bt-applet 712.0 KiB + 398.0 KiB = 1.1 MiB nm-system-settings 1.0 MiB + 63.0 KiB = 1.1 MiB nmbd 996.0 KiB + 131.0 KiB = 1.1 MiB bonobo-activation-server 740.0 KiB + 395.5 KiB = 1.1 MiB escd 880.0 KiB + 432.0 KiB = 1.3 MiB bash (2) 1.1 MiB + 212.5 KiB = 1.3 MiB gnome-screensaver 796.0 KiB + 617.5 KiB = 1.4 MiB gdm-rh-security-token-helper 916.0 KiB + 739.5 KiB = 1.6 MiB gdm-binary (2) 1.2 MiB + 387.5 KiB = 1.6 MiB gnome-session 1.4 MiB + 221.0 KiB = 1.6 MiB cupsd 1.3 MiB + 443.5 KiB = 1.8 MiB notification-area-applet 2.1 MiB + 69.0 KiB = 2.2 MiB xfs 1.8 MiB + 545.5 KiB = 2.3 MiB eggcups 2.2 MiB + 86.5 KiB = 2.3 MiB gconfd-2 1.9 MiB + 492.5 KiB = 2.4 MiB gnome-settings-daemon 2.0 MiB + 421.5 KiB = 2.4 MiB gnome-power-manager 1.9 MiB + 569.0 KiB = 2.5 MiB trashapplet 1.7 MiB + 1.0 MiB = 2.7 MiB smbd (2) 2.6 MiB + 365.0 KiB = 2.9 MiB iscsid (2) 2.7 MiB + 349.0 KiB = 3.0 MiB sendmail.sendmail (2) 3.2 MiB + 73.0 KiB = 3.2 MiB hald 2.7 MiB + 649.0 KiB = 3.4 MiB clock-applet 2.5 MiB + 1.4 MiB = 3.9 MiB nm-applet 3.4 MiB + 729.5 KiB = 4.1 MiB metacity 2.8 MiB + 1.4 MiB = 4.2 MiB sshd (4) 3.4 MiB + 853.0 KiB = 4.3 MiB wnck-applet 4.4 MiB + 377.5 KiB = 4.8 MiB Xorg 4.3 MiB + 717.5 KiB = 5.0 MiB mixer_applet2 4.5 MiB + 809.5 KiB = 5.3 MiB gnome-panel 5.3 MiB + 251.5 KiB = 5.6 MiB hpssd.py 4.0 MiB + 3.3 MiB = 7.2 MiB httpd (11) 10.5 MiB + 870.0 KiB = 11.3 MiB gdmgreeter 12.8 MiB + 1.1 MiB = 13.8 MiB Xvnc 13.7 MiB + 515.5 KiB = 14.2 MiB yum-updatesd 16.3 MiB + 1.6 MiB = 17.9 MiB nautilus 20.8 MiB + 1.4 MiB = 22.2 MiB puplet 1.5 GiB + 438.0 KiB = 1.5 GiB java - 1.7 GiB =)) input1- input input2- str_trim(gsub([=+],,input1)) input3- input2[input2!=] dat1-read.table(text=gsub(\\,+,,,gsub(\\s{2},,,input3)),sep=,,header=FALSE,stringsAsFactors=FALSE,fill=TRUE) dat2- dat1[,3:4] dat3- dat2[dat2[,1]!=,][-1,] lst1-lapply(split(dat3,cumsum(1*grepl(RAM,dat3[,1]))),function(x) {x1-if(length(grep(RAM,x[,1]))0) x[-grep(RAM,x[,1]),] else x; x2- data.frame(read.table(text=x1[,1],sep=,header=FALSE,stringsAsFactors=FALSE),x1[,2],stringsAsFactors=FALSE); colnames(x2)- c(RAM, used, Program);x2}) str(lst1) #List of 2 # $ 0:'data.frame': 79 obs. of 3 variables: # ..$ RAM : num [1:79] 98.5 119.5 139 140.5 144.5 ... # ..$ used : chr [1:79] KiB KiB KiB KiB ... # ..$ Program: chr [1:79] sleep klogd hidd gpm ... # $ 1:'data.frame': 79 obs. of 3 variables: # ..$ RAM : num [1:79] 120 139 140 146 148 ... # ..$ used : chr [1:79] KiB KiB KiB KiB ... 
# ..$ Program: chr [1:79] klogd hidd gpm hald-addon-storage ... lapply(lst1,head) #$`0` # RAM used Program #1 98.5 KiB sleep #2 119.5 KiB klogd #3 139.0 KiB hidd #4 140.5 KiB gpm #5 144.5 KiB hald-addon-storage #6 148.0 KiB acpid # #$`1` # RAM used Program #1 119.5 KiB klogd #2 139.0 KiB hidd #3 140.5 KiB gpm #4 145.5 KiB hald-addon-storage #5 148.0 KiB acpid #6 153.0 KiB dbus-launch A.K. - Original Message - From: mohan.radhakrish...@polarisft.com mohan.radhakrish...@polarisft.com To: jim holtman jholt...@gmail.com Cc: R mailing list r-help@r-project.org Sent: Wednesday, September 4, 2013 6:43 AM Subject: Re: [R] Memory usage bar plot Hi, I have tried the ideas with an actual data set but couldn't pass the parsing
[R] Memory usage bar plot
Hi, I haven't tried the code yet. Is there a way to parse this data using R and create bar plots so that each program's 'RAM used' figures are grouped together? So the 'uuidd' bars will be together. The data will have about 50 sets, so if there are 100 processes each will have about 50 bars. What is the recommended way to graph these big barplots? I am looking for only the 'RAM used' figures.

Thanks, Mohan

Private + Shared = RAM used Program
96.0 KiB + 11.5 KiB = 107.5 KiB uuidd
108.0 KiB + 12.5 KiB = 120.5 KiB klogd
124.0 KiB + 17.0 KiB = 141.0 KiB hidd
116.0 KiB + 30.0 KiB = 146.0 KiB acpid
124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage
144.0 KiB + 15.0 KiB = 159.0 KiB gpm
136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check
---------------------------------
453.9 MiB =

Private + Shared = RAM used Program
96.0 KiB + 11.5 KiB = 107.5 KiB uuidd
108.0 KiB + 12.5 KiB = 120.5 KiB klogd
124.0 KiB + 17.0 KiB = 141.0 KiB hidd
116.0 KiB + 30.0 KiB = 146.0 KiB acpid
124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage
144.0 KiB + 15.0 KiB = 159.0 KiB gpm
136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check
---------------------------------
453.9 MiB =
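For what it's worth, once the data are in a data frame, a grouped bar chart is short in ggplot2. The sketch below assumes a hypothetical data frame `dat` with columns Program, RamKiB and Set (the snapshot number), such as the parsing code elsewhere in this thread could produce; with 100 programs and 50 snapshots the plot will be crowded, as Petr points out in his reply.

library(ggplot2)
ggplot(dat, aes(x = Program, y = RamKiB, fill = factor(Set))) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +                                   # long program names read better on the y axis
  labs(fill = "Snapshot", y = "RAM used (KiB)")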
Re: [R] Memory usage bar plot
Hello, This memory usage should be graphed with time. Are there examples of scatterplots that can clearly show usage vs time ? This is for memory leak detection. Thanks, Mohan From: PIKAL Petr petr.pi...@precheza.cz To: mohan.radhakrish...@polarisft.com mohan.radhakrish...@polarisft.com, r-help@r-project.org r-help@r-project.org Date: 08/30/2013 05:33 PM Subject:RE: [R] Memory usage bar plot Hi For reading data into R you shall look to read.table and similar. For plotting ggplot could handle it. However I wonder if 100 times 50 bars is the way how to present your data. You shall think over what do you want to show to yourself or your audience. Maybe boxplots or scatterplots could be better. Petr -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-bounces@r- project.org] On Behalf Of mohan.radhakrish...@polarisft.com Sent: Friday, August 30, 2013 1:25 PM To: r-help@r-project.org Subject: [R] Memory usage bar plot Hi, I haven't tried the code yet. Is there a way to parse this data using R and create bar plots so that each program's 'RAM used' figures are grouped together. So 'uuidd' bars will be together. The data will have about 50 sets. So if there are 100 processes each will have about 50 bars. What is the recommended way to graph these big barplots ? I am looking for only 'RAM used' figures. Thanks, Mohan Private + Shared = RAM used Program 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd 108.0 KiB + 12.5 KiB = 120.5 KiB klogd 124.0 KiB + 17.0 KiB = 141.0 KiB hidd 116.0 KiB + 30.0 KiB = 146.0 KiB acpid 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage 144.0 KiB + 15.0 KiB = 159.0 KiB gpm 136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check - 453.9 MiB = Private + Shared = RAM used Program 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd 108.0 KiB + 12.5 KiB = 120.5 KiB klogd 124.0 KiB + 17.0 KiB = 141.0 KiB hidd 116.0 KiB + 30.0 KiB = 146.0 KiB acpid 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage 144.0 KiB + 15.0 KiB = 159.0 KiB gpm 136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check -- 453.9 MiB = This e-Mail may contain proprietary and confidential information and is sent for the intended recipient(s) only. If by an addressing or transmission error this mail has been misdirected to you, you are requested to delete this mail immediately. You are also hereby notified that any use, any form of reproduction, dissemination, copying, disclosure, modification, distribution and/or publication of this e- mail message, contents or its attachment other than by its intended recipient/s is strictly prohibited. Visit us at http://www.polarisFT.com [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code. This e-Mail may contain proprietary and confidential information and is sent for the intended recipient(s) only. If by an addressing or transmission error this mail has been misdirected to you, you are requested to delete this mail immediately. You are also hereby notified that any use, any form of reproduction, dissemination, copying, disclosure, modification, distribution and/or publication of this e-mail message, contents or its attachment other than by its intended recipient/s is strictly prohibited. 
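A small sketch of the usage-versus-time view asked about here (for leak detection the interesting signal is a per-process upward trend). It assumes a hypothetical data frame `dat` with columns time (POSIXct), RamKiB and Program:

library(ggplot2)
ggplot(dat, aes(x = time, y = RamKiB, colour = Program)) +
  geom_line() +
  geom_point(size = 0.8) +
  labs(y = "RAM used (KiB)")

# or, in base graphics, one process at a time:
# plot(dat$time, dat$RamKiB, type = "b", xlab = "time", ylab = "RAM used (KiB)")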
Re: [R] Memory usage bar plot
Here is how to parse the data and put it into groups. Not sure what the 'timing' of each group is since not time information was given. Also not sure is there is an 'MiB' qualifier on the data, but you have the matrix of data which is easy to do with as you want. input - readLines(textConnection( + Private + Shared = RAM used Program + + 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd + 108.0 KiB + 12.5 KiB = 120.5 KiB klogd + 124.0 KiB + 17.0 KiB = 141.0 KiB hidd + 116.0 KiB + 30.0 KiB = 146.0 KiB acpid + 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage + 144.0 KiB + 15.0 KiB = 159.0 KiB gpm + 136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check + - + 453.9 MiB + + = + Private + Shared = RAM used Program + + 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd + 108.0 KiB + 12.5 KiB = 120.5 KiB klogd + 124.0 KiB + 17.0 KiB = 141.0 KiB hidd + 116.0 KiB + 30.0 KiB = 146.0 KiB acpid + 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage + 144.0 KiB + 15.0 KiB = 159.0 KiB gpm + 136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check + -- + 453.9 MiB + =)) # keep only the data input - input[grepl('=', input)] # separate into groups grps - split(input, cumsum(grepl(= RAM, input))) # parse the data (not sure if there is also 'MiB') parsed - lapply(grps, function(.grp){ + # parse ignoring first and last lines + .data - sub(.*= ([^ ]+) ([^ ]+)\\s+(.*), \\1 \\2 \\3 + , .grp[2:(length(.grp) - 1L)] + ) + # return matrix + do.call(rbind, strsplit(.data, ' ')) + }) parsed $`1` [,1][,2] [,3] [1,] 107.5 KiB uuidd [2,] 120.5 KiB klogd [3,] 141.0 KiB hidd [4,] 146.0 KiB acpid [5,] 153.5 KiB hald-addon-storage [6,] 159.0 KiB gpm [7,] 162.5 KiB pam_timestamp_check $`2` [,1][,2] [,3] [1,] 107.5 KiB uuidd [2,] 120.5 KiB klogd [3,] 141.0 KiB hidd [4,] 146.0 KiB acpid [5,] 153.5 KiB hald-addon-storage [6,] 159.0 KiB gpm [7,] 162.5 KiB pam_timestamp_check Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it. On Fri, Aug 30, 2013 at 7:24 AM, mohan.radhakrish...@polarisft.com wrote: Hi, I haven't tried the code yet. Is there a way to parse this data using R and create bar plots so that each program's 'RAM used' figures are grouped together. So 'uuidd' bars will be together. The data will have about 50 sets. So if there are 100 processes each will have about 50 bars. What is the recommended way to graph these big barplots ? I am looking for only 'RAM used' figures. Thanks, Mohan Private + Shared = RAM used Program 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd 108.0 KiB + 12.5 KiB = 120.5 KiB klogd 124.0 KiB + 17.0 KiB = 141.0 KiB hidd 116.0 KiB + 30.0 KiB = 146.0 KiB acpid 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage 144.0 KiB + 15.0 KiB = 159.0 KiB gpm 136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check - 453.9 MiB = Private + Shared = RAM used Program 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd 108.0 KiB + 12.5 KiB = 120.5 KiB klogd 124.0 KiB + 17.0 KiB = 141.0 KiB hidd 116.0 KiB + 30.0 KiB = 146.0 KiB acpid 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage 144.0 KiB + 15.0 KiB = 159.0 KiB gpm 136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check -- 453.9 MiB = This e-Mail may contain proprietary and confidential information and is sent for the intended recipient(s) only. If by an addressing or transmission error this mail has been misdirected to you, you are requested to delete this mail immediately. 
Re: [R] Memory usage bar plot
HI, You could also parse the data by: input1- input library(stringr) input2-str_trim(gsub([=+],,input1)) dat1-read.table(text=word(input2[!grepl(---,input2) input2!= !grepl(RAM|MiB,input2)],8,15),sep=,header=FALSE,stringsAsFactors=FALSE) lst1-split(dat1,cumsum(dat1$V3==uuidd)) lst1 #$`1` # V1 V2 V3 #1 107.5 KiB uuidd #2 120.5 KiB klogd #3 141.0 KiB hidd #4 146.0 KiB acpid #5 153.5 KiB hald-addon-storage #6 159.0 KiB gpm #7 162.5 KiB pam_timestamp_check # #$`2` # V1 V2 V3 #8 107.5 KiB uuidd #9 120.5 KiB klogd #10 141.0 KiB hidd #11 146.0 KiB acpid #12 153.5 KiB hald-addon-storage #13 159.0 KiB gpm #14 162.5 KiB pam_timestamp_check A.K. - Original Message - From: jim holtman jholt...@gmail.com To: mohan.radhakrish...@polarisft.com Cc: R mailing list r-help@r-project.org Sent: Friday, August 30, 2013 9:44 AM Subject: Re: [R] Memory usage bar plot Here is how to parse the data and put it into groups. Not sure what the 'timing' of each group is since not time information was given. Also not sure is there is an 'MiB' qualifier on the data, but you have the matrix of data which is easy to do with as you want. input - readLines(textConnection( + Private + Shared = RAM used Program + + 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd + 108.0 KiB + 12.5 KiB = 120.5 KiB klogd + 124.0 KiB + 17.0 KiB = 141.0 KiB hidd + 116.0 KiB + 30.0 KiB = 146.0 KiB acpid + 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage + 144.0 KiB + 15.0 KiB = 159.0 KiB gpm + 136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check + - + 453.9 MiB + + = + Private + Shared = RAM used Program + + 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd + 108.0 KiB + 12.5 KiB = 120.5 KiB klogd + 124.0 KiB + 17.0 KiB = 141.0 KiB hidd + 116.0 KiB + 30.0 KiB = 146.0 KiB acpid + 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage + 144.0 KiB + 15.0 KiB = 159.0 KiB gpm + 136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check + -- + 453.9 MiB + =)) # keep only the data input - input[grepl('=', input)] # separate into groups grps - split(input, cumsum(grepl(= RAM, input))) # parse the data (not sure if there is also 'MiB') parsed - lapply(grps, function(.grp){ + # parse ignoring first and last lines + .data - sub(.*= ([^ ]+) ([^ ]+)\\s+(.*), \\1 \\2 \\3 + , .grp[2:(length(.grp) - 1L)] + ) + # return matrix + do.call(rbind, strsplit(.data, ' ')) + }) parsed $`1` [,1] [,2] [,3] [1,] 107.5 KiB uuidd [2,] 120.5 KiB klogd [3,] 141.0 KiB hidd [4,] 146.0 KiB acpid [5,] 153.5 KiB hald-addon-storage [6,] 159.0 KiB gpm [7,] 162.5 KiB pam_timestamp_check $`2` [,1] [,2] [,3] [1,] 107.5 KiB uuidd [2,] 120.5 KiB klogd [3,] 141.0 KiB hidd [4,] 146.0 KiB acpid [5,] 153.5 KiB hald-addon-storage [6,] 159.0 KiB gpm [7,] 162.5 KiB pam_timestamp_check Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it. On Fri, Aug 30, 2013 at 7:24 AM, mohan.radhakrish...@polarisft.com wrote: Hi, I haven't tried the code yet. Is there a way to parse this data using R and create bar plots so that each program's 'RAM used' figures are grouped together. So 'uuidd' bars will be together. The data will have about 50 sets. So if there are 100 processes each will have about 50 bars. What is the recommended way to graph these big barplots ? I am looking for only 'RAM used' figures. 
Thanks, Mohan Private + Shared = RAM used Program 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd 108.0 KiB + 12.5 KiB = 120.5 KiB klogd 124.0 KiB + 17.0 KiB = 141.0 KiB hidd 116.0 KiB + 30.0 KiB = 146.0 KiB acpid 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage 144.0 KiB + 15.0 KiB = 159.0 KiB gpm 136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check - 453.9 MiB = Private + Shared = RAM used Program 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd 108.0 KiB + 12.5 KiB = 120.5 KiB klogd 124.0 KiB + 17.0 KiB = 141.0 KiB hidd 116.0 KiB + 30.0 KiB = 146.0 KiB acpid 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage 144.0 KiB + 15.0 KiB = 159.0 KiB gpm 136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check
Re: [R] Memory usage bar plot
Hi From: mohan.radhakrish...@polarisft.com [mailto:mohan.radhakrish...@polarisft.com] Sent: Friday, August 30, 2013 3:16 PM To: PIKAL Petr Cc: r-help@r-project.org Subject: RE: [R] Memory usage bar plot Hello, This memory usage should be graphed with time. Are there examples of scatterplots that can clearly show usage vs time ? This is for memory leak detection. Hm, Actually I do not understand what do you want. No data, no code just some vague description. If you have data frame with variables usage and time you can plot plot(time, usage) Regards Petr Thanks, Mohan From:PIKAL Petr petr.pi...@precheza.czmailto:petr.pi...@precheza.cz To: mohan.radhakrish...@polarisft.commailto:mohan.radhakrish...@polarisft.com mohan.radhakrish...@polarisft.commailto:mohan.radhakrish...@polarisft.com, r-help@r-project.orgmailto:r-help@r-project.org r-help@r-project.orgmailto:r-help@r-project.org Date:08/30/2013 05:33 PM Subject:RE: [R] Memory usage bar plot Hi For reading data into R you shall look to read.table and similar. For plotting ggplot could handle it. However I wonder if 100 times 50 bars is the way how to present your data. You shall think over what do you want to show to yourself or your audience. Maybe boxplots or scatterplots could be better. Petr -Original Message- From: r-help-boun...@r-project.orgmailto:r-help-boun...@r-project.org [mailto:r-help-bounces@r- project.org] On Behalf Of mohan.radhakrish...@polarisft.commailto:mohan.radhakrish...@polarisft.com Sent: Friday, August 30, 2013 1:25 PM To: r-help@r-project.orgmailto:r-help@r-project.org Subject: [R] Memory usage bar plot Hi, I haven't tried the code yet. Is there a way to parse this data using R and create bar plots so that each program's 'RAM used' figures are grouped together. So 'uuidd' bars will be together. The data will have about 50 sets. So if there are 100 processes each will have about 50 bars. What is the recommended way to graph these big barplots ? I am looking for only 'RAM used' figures. Thanks, Mohan Private + Shared = RAM used Program 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd 108.0 KiB + 12.5 KiB = 120.5 KiB klogd 124.0 KiB + 17.0 KiB = 141.0 KiB hidd 116.0 KiB + 30.0 KiB = 146.0 KiB acpid 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage 144.0 KiB + 15.0 KiB = 159.0 KiB gpm 136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check - 453.9 MiB = Private + Shared = RAM used Program 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd 108.0 KiB + 12.5 KiB = 120.5 KiB klogd 124.0 KiB + 17.0 KiB = 141.0 KiB hidd 116.0 KiB + 30.0 KiB = 146.0 KiB acpid 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage 144.0 KiB + 15.0 KiB = 159.0 KiB gpm 136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check -- 453.9 MiB = This e-Mail may contain proprietary and confidential information and is sent for the intended recipient(s) only. If by an addressing or transmission error this mail has been misdirected to you, you are requested to delete this mail immediately. You are also hereby notified that any use, any form of reproduction, dissemination, copying, disclosure, modification, distribution and/or publication of this e- mail message, contents or its attachment other than by its intended recipient/s is strictly prohibited. 
Re: [R] Memory usage bar plot
Hi For reading data into R you shall look to read.table and similar. For plotting ggplot could handle it. However I wonder if 100 times 50 bars is the way how to present your data. You shall think over what do you want to show to yourself or your audience. Maybe boxplots or scatterplots could be better. Petr -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-bounces@r- project.org] On Behalf Of mohan.radhakrish...@polarisft.com Sent: Friday, August 30, 2013 1:25 PM To: r-help@r-project.org Subject: [R] Memory usage bar plot Hi, I haven't tried the code yet. Is there a way to parse this data using R and create bar plots so that each program's 'RAM used' figures are grouped together. So 'uuidd' bars will be together. The data will have about 50 sets. So if there are 100 processes each will have about 50 bars. What is the recommended way to graph these big barplots ? I am looking for only 'RAM used' figures. Thanks, Mohan Private + Shared = RAM used Program 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd 108.0 KiB + 12.5 KiB = 120.5 KiB klogd 124.0 KiB + 17.0 KiB = 141.0 KiB hidd 116.0 KiB + 30.0 KiB = 146.0 KiB acpid 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage 144.0 KiB + 15.0 KiB = 159.0 KiB gpm 136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check - 453.9 MiB = Private + Shared = RAM used Program 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd 108.0 KiB + 12.5 KiB = 120.5 KiB klogd 124.0 KiB + 17.0 KiB = 141.0 KiB hidd 116.0 KiB + 30.0 KiB = 146.0 KiB acpid 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage 144.0 KiB + 15.0 KiB = 159.0 KiB gpm 136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check -- 453.9 MiB = This e-Mail may contain proprietary and confidential information and is sent for the intended recipient(s) only. If by an addressing or transmission error this mail has been misdirected to you, you are requested to delete this mail immediately. You are also hereby notified that any use, any form of reproduction, dissemination, copying, disclosure, modification, distribution and/or publication of this e- mail message, contents or its attachment other than by its intended recipient/s is strictly prohibited. Visit us at http://www.polarisFT.com [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Memory usage bar plot
## Here is a plot. The input was parsed with Jim Holtman's code. ## The panel.dumbell is something I devised to show differences. ## Rich input - readLines(textConnection( Private + Shared = RAM used Program 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd 108.0 KiB + 12.5 KiB = 120.5 KiB klogd 124.0 KiB + 17.0 KiB = 141.0 KiB hidd 116.0 KiB + 30.0 KiB = 146.0 KiB acpid 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage 144.0 KiB + 15.0 KiB = 159.0 KiB gpm 136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check - 453.9 MiB = Private + Shared = RAM used Program 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd 108.0 KiB + 12.5 KiB = 120.5 KiB klogd 124.0 KiB + 17.0 KiB = 141.0 KiB hidd 116.0 KiB + 30.0 KiB = 146.0 KiB acpid 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage 144.0 KiB + 15.0 KiB = 159.0 KiB gpm 136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check -- 453.9 MiB =)) # keep only the data input - input[grepl('=', input)] # separate into groups grps - split(input, cumsum(grepl(= RAM, input))) # parse the data (not sure if there is also 'MiB') parsed - lapply(grps, function(.grp){ # parse ignoring first and last lines .data - sub(.*= ([^ ]+) ([^ ]+)\\s+(.*), \\1 \\2 \\3 , .grp[2:(length(.grp) - 1L)] ) # return matrix do.call(rbind, strsplit(.data, ' ')) }) parsed tmp1 - do.call(rbind, lapply(parsed, function(x) data.frame(x))) names(tmp1) - c(RamUsed, units, Program) tmp1$Time - factor(rep(1:2, each=7)) tmp1$RamUsed - as.numeric(tmp1$RamUsed) library(lattice) dotplot(Program ~ RamUsed, groups=Time, data=tmp1) ## this is silly. Let me construct a more interesting example with different values at each time. tmp1$RamUsed[8:14] - tmp1$RamUsed[1:7] + 10*(sample(1:7)) tmp1 dotplot(Program ~ RamUsed, groups=Time, data=tmp1, auto.key=list(title=Time, border=TRUE, columns=2)) panel.dumbell - function(x, y, ..., lwd=1) { n - length(x)/2 panel.segments(x[1:n], as.numeric(y)[n+(1:n)], x[n+(1:n)], as.numeric(y)[n+(1:n)], lwd=lwd) panel.dotplot(x, y, ...) } dotplot(Program ~ RamUsed, groups=Time, data=tmp1, auto.key=list(title=Time, border=TRUE, columns=2), panel=panel.dumbell, par.settings=list(superpose.symbol=list(pch=19)), ) On Fri, Aug 30, 2013 at 9:44 AM, jim holtman jholt...@gmail.com wrote: Here is how to parse the data and put it into groups. Not sure what the 'timing' of each group is since not time information was given. Also not sure is there is an 'MiB' qualifier on the data, but you have the matrix of data which is easy to do with as you want. 
input - readLines(textConnection( + Private + Shared = RAM used Program + + 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd + 108.0 KiB + 12.5 KiB = 120.5 KiB klogd + 124.0 KiB + 17.0 KiB = 141.0 KiB hidd + 116.0 KiB + 30.0 KiB = 146.0 KiB acpid + 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage + 144.0 KiB + 15.0 KiB = 159.0 KiB gpm + 136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check + - + 453.9 MiB + + = + Private + Shared = RAM used Program + + 96.0 KiB + 11.5 KiB = 107.5 KiB uuidd + 108.0 KiB + 12.5 KiB = 120.5 KiB klogd + 124.0 KiB + 17.0 KiB = 141.0 KiB hidd + 116.0 KiB + 30.0 KiB = 146.0 KiB acpid + 124.0 KiB + 29.5 KiB = 153.5 KiB hald-addon-storage + 144.0 KiB + 15.0 KiB = 159.0 KiB gpm + 136.0 KiB + 26.5 KiB = 162.5 KiB pam_timestamp_check + -- + 453.9 MiB + =)) # keep only the data input - input[grepl('=', input)] # separate into groups grps - split(input, cumsum(grepl(= RAM, input))) # parse the data (not sure if there is also 'MiB') parsed - lapply(grps, function(.grp){ + # parse ignoring first and last lines + .data - sub(.*= ([^ ]+) ([^ ]+)\\s+(.*), \\1 \\2 \\3 + , .grp[2:(length(.grp) - 1L)] + ) + # return matrix + do.call(rbind, strsplit(.data, ' ')) + }) parsed $`1` [,1][,2] [,3] [1,] 107.5 KiB uuidd [2,] 120.5 KiB klogd [3,] 141.0 KiB hidd [4,] 146.0 KiB acpid [5,] 153.5 KiB hald-addon-storage [6,] 159.0 KiB gpm [7,] 162.5 KiB pam_timestamp_check $`2` [,1][,2] [,3] [1,] 107.5 KiB uuidd [2,] 120.5 KiB klogd [3,] 141.0 KiB hidd [4,] 146.0 KiB acpid [5,] 153.5 KiB
Re: [R] Memory usage reported by gc() differs from 'top'
Merci beaucoup Milan, thank you very much Martin and Kjetil for your responses. I appreciate the caveat about virtual memory. I gather that besides resident memory and swap space, it may also include memory-mapped files, which don't cost anything. Maybe by pure chance, in my case virtual memory still seems mildly relevant. While there is RAM available, res tracks virt closely, about till this point:

top - 13:32:28 up 208 days, 20:46, 3 users, load average: 1.68, 1.41, 1.17
Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
Cpu(s): 46.5%us, 6.5%sy, 0.0%ni, 6.0%id, 41.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8063744k total, 8012976k used, 50768k free, 464k buffers
Swap: 19543064k total, 3445236k used, 16097828k free, 35096k cached

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
 6210 brech 20  0 7486m 7.2g 7424 R   98 93.2  6:15.74 R

(That's 7.3g virtual memory.) After that, res stays the same, while virt keeps growing. That's an issue because if it uses up all the swap space (a bit beyond the state I showed in my original post), R starts reporting problems, e.g.:

Error in system(command = command, intern = output) :
  cannot popen 'whoami', probable reason 'Cannot allocate memory'

It sounds like a major reason for the discrepancy could be fragmentation, possibly caused by repeated copying. It will take some work to profile memory usage (thanks for your pointers to the tools), get a better picture, and create a minimal reproducible example.

I'm glad you pointed me to /proc/[pid]/maps and smaps; they have a wealth of information. The most interesting entry is [heap]; it's growing rapidly during the run of my code, and accounts for all of res and 98% of virt. The others are less exciting, mostly memory-mapped files (e.g., lib/R/library/MASS/libs/x86_64/MASS.so), and change at most by a few kB of Referenced, or move from Rss to Swap. So clearly my interest is in R's heap.

Again many thanks for all your help!

/Christian
Re: [R] Memory usage reported by gc() differs from 'top'
Le mercredi 17 avril 2013 à 23:17 -0400, Christian Brechbühler a écrit : In help(gc) I read, ...the primary purpose of calling 'gc' is for the report on memory usage. What memory usage does gc() report? And more importantly, which memory uses does it NOT report? Because I see one answer from gc(): used (Mb) gc trigger (Mb) max used (Mb) Ncells 14875922 794.5 21754962 1161.9 17854776 953.6 Vcells 59905567 457.1 84428913 644.2 72715009 554.8 (That's about 1.5g max used, 1.8g trigger.) And a different answer from an OS utility, 'top': PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 6210 brech 20 0 18.2g 7.2g 2612 S 1 93.4 16:26.73 R So the R process is holding on to 18.2g memory, but it only seems to have accout of 1.5g or so. Where is the rest? I tried searching the archives, and found answers like just buy more RAM. Which doesn't exactly answer my question. And come on, 18g is pretty big; sure it doesn't fit in my RAM (only 7.2g are in), but that's beside the point. The huge memory demand is specific to R version 2.15.3 Patched (2013-03-13 r62500) -- Security Blanket. The same test runs without issues under R version 2.15.1 beta (2012-06-11 r59557) -- Roasted Marshmallows. I appreciate any insights you can share into R's memory management, and gc() in particular. /Christian First, completely stop looking at virtual memory: it does not mean much, if anything. What you care about is resident memory. See e.g.: http://serverfault.com/questions/138427/top-what-does-virtual-memory-size-mean-linux-ubuntu Then, there is a limitation with R/Linux: gc() does not reorder objects in memory so that they are all on the same area. This means that while the total size of R objects in memory is 457MB, they are spread all over the RAM, and a single object in a memory page forces the Linux kernel to keep it in RAM. I do not know the exact details, as it seems that Windows does a better job than Linux in that regard. One workaround is to save the session and restart R: objects will be loaded in a more compact fashion. As for the differences between R 2.15.1 and R 2.15.3, maybe there is some more copying that increases memory fragmentation, but the fundamental problem has not changed AFAIK. You can call tracemem() on large objects to see how many times they are being copied. See http://developer.r-project.org/memory-profiling.html My two cents [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
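A tiny illustration of the tracemem() suggestion above (tracemem() needs an R build with memory profiling enabled, which the standard CRAN binaries have): R prints a message every time the traced object is duplicated, which helps locate unexpected copies.

x <- rnorm(1e6)
tracemem(x)       # starts tracing and returns the object's address
y <- x            # no copy yet: x and y share the same vector
y[1] <- 0         # modifying the shared vector forces a duplication, which tracemem reports
untracemem(x)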
Re: [R] Memory usage reported by gc() differs from 'top'
On 04/18/2013 03:18 AM, Milan Bouchet-Valat wrote: Le mercredi 17 avril 2013 à 23:17 -0400, Christian Brechbühler a écrit : In help(gc) I read, ...the primary purpose of calling 'gc' is for the report on memory usage. What memory usage does gc() report? And more importantly, which memory uses does it NOT report? Because I see one answer from gc(): used (Mb) gc trigger (Mb) max used (Mb) Ncells 14875922 794.5 21754962 1161.9 17854776 953.6 Vcells 59905567 457.1 84428913 644.2 72715009 554.8 From the R side of things, this is an (approximate) accounting of memory actually reached by objects in the current session. One possible reason for discrepancy with the OS is that you are using a package that references memory R does not know about (e.g., 'external pointers'), or there is a memory leak in R or a third party package where memory is not returned to the OS. Even if the reason is 'memory fragmentation' as suggested by Milan, it is interesting to understand how that fragmentation arises, either to identify a work-around or more productively to understand and address the underlying problem. So a reasonable avenue is to develop a minimal, reproducible example of how one could arrive at the situation you report. Martin (That's about 1.5g max used, 1.8g trigger.) And a different answer from an OS utility, 'top': PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ COMMAND 6210 brech 20 0 18.2g 7.2g 2612 S1 93.4 16:26.73 R So the R process is holding on to 18.2g memory, but it only seems to have accout of 1.5g or so. Where is the rest? I tried searching the archives, and found answers like just buy more RAM. Which doesn't exactly answer my question. And come on, 18g is pretty big; sure it doesn't fit in my RAM (only 7.2g are in), but that's beside the point. The huge memory demand is specific to R version 2.15.3 Patched (2013-03-13 r62500) -- Security Blanket. The same test runs without issues under R version 2.15.1 beta (2012-06-11 r59557) -- Roasted Marshmallows. I appreciate any insights you can share into R's memory management, and gc() in particular. /Christian First, completely stop looking at virtual memory: it does not mean much, if anything. What you care about is resident memory. See e.g.: http://serverfault.com/questions/138427/top-what-does-virtual-memory-size-mean-linux-ubuntu Then, there is a limitation with R/Linux: gc() does not reorder objects in memory so that they are all on the same area. This means that while the total size of R objects in memory is 457MB, they are spread all over the RAM, and a single object in a memory page forces the Linux kernel to keep it in RAM. I do not know the exact details, as it seems that Windows does a better job than Linux in that regard. One workaround is to save the session and restart R: objects will be loaded in a more compact fashion. As for the differences between R 2.15.1 and R 2.15.3, maybe there is some more copying that increases memory fragmentation, but the fundamental problem has not changed AFAIK. You can call tracemem() on large objects to see how many times they are being copied. See http://developer.r-project.org/memory-profiling.html My two cents [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. 
Re: [R] Memory usage reported by gc() differs from 'top'
On Thursday 18 April 2013 12:18:03 Milan Bouchet-Valat wrote:
> First, completely stop looking at virtual memory: it does not mean much, if anything. What you care about is resident memory. See e.g.:
> http://serverfault.com/questions/138427/top-what-does-virtual-memory-size-mean-linux-ubuntu

I concur. I have lost track of R's internals long ago, but in a previous life analyzing the Apache HTTP server's actual memory use (something that focused on shared RAM, quite different from what you'd probably like to do), I found that if you really need to understand what's going on, you would need to look elsewhere. On Linux, you'll find the details in the /proc/[pid]/maps and /proc/[pid]/smaps pseudo-filesystem files, where [pid] is the process ID, in your example 6210. That's where you really see what's eating your RAM. :-)

Cheers,

Kjetil
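A hedged helper along the lines Kjetil describes (Linux only; it simply pulls the Vm* lines out of /proc/self/status so the numbers can be compared with gc() from inside the running R session):

proc_mem <- function() {
  s    <- readLines("/proc/self/status")
  vals <- s[grepl("^Vm(Size|RSS|Swap):", s)]              # lines look like "VmRSS:  7354321 kB"
  setNames(as.numeric(gsub("[^0-9]", "", vals)) / 1024,   # kB -> MB
           sub(":.*", "", vals))
}
proc_mem()
gc()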
[R] Memory usage reported by gc() differs from 'top'
In help(gc) I read, "...the primary purpose of calling 'gc' is for the report on memory usage." What memory usage does gc() report? And more importantly, which memory uses does it NOT report? Because I see one answer from gc():

            used  (Mb) gc trigger   (Mb) max used  (Mb)
Ncells  14875922 794.5   21754962 1161.9 17854776 953.6
Vcells  59905567 457.1   84428913  644.2 72715009 554.8

(That's about 1.5g max used, 1.8g trigger.) And a different answer from an OS utility, 'top':

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6210 brech 20  0 18.2g 7.2g 2612 S    1 93.4 16:26.73 R

So the R process is holding on to 18.2g of memory, but it only seems to have an account of 1.5g or so. Where is the rest? I tried searching the archives, and found answers like "just buy more RAM", which doesn't exactly answer my question. And come on, 18g is pretty big; sure it doesn't fit in my RAM (only 7.2g are in), but that's beside the point.

The huge memory demand is specific to R version 2.15.3 Patched (2013-03-13 r62500) -- "Security Blanket". The same test runs without issues under R version 2.15.1 beta (2012-06-11 r59557) -- "Roasted Marshmallows".

I appreciate any insights you can share into R's memory management, and gc() in particular.

/Christian
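As a rough way to see what gc() is (and is not) counting, one can compare its report with the sizes of the objects reachable from the workspace; memory the allocator has obtained from the OS but not handed back (the fragmentation discussed elsewhere in this thread) shows up in top but in neither number below. A sketch:

gc()
sizes <- sapply(ls(envir = globalenv()),
                function(nm) object.size(get(nm, envir = globalenv())))
head(sort(sizes, decreasing = TRUE), 10)   # the largest objects, in bytes
sum(sizes) / 2^20                          # total size of named workspace objects, in MB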
Re: [R] Memory usage in R grows considerably while calculating word frequencies
On 25/09/12 01:29, mcelis wrote:
> I am working with some large text files (up to 16 GBytes). I am interested in extracting the words and counting each time each word appears in the text. I have written a very simple R program by following some suggestions and examples I found online.

Just an idea (I have no experience with what you want to do, so it might not work): what about putting the text in a database (sqlite comes to mind) where each word is one entry? Then you could use SQL to query the database, which should need much less memory. In addition, it should make further processing much easier.

Cheers,

Rainer

> If my input file is 1 GByte, I see that R uses up to 11 GBytes of memory when executing the program on a 64-bit system running CentOS 6.3. Why is R using so much memory? Is there a better way to do this that will minimize memory usage? I am very new to R, so I would appreciate some tips on how to improve my program or a better way to do it.
>
> R program:
>
> # Read in the entire file and convert all words in text to lower case
> words.txt <- tolower(scan("text_file", "character", sep="\n"))
> # Extract words
> pattern <- "(\\b[A-Za-z]+\\b)"
> match <- gregexpr(pattern, words.txt)
> words.txt <- regmatches(words.txt, match)
> # Create a vector from the list of words
> words.txt <- unlist(words.txt)
> # Calculate word frequencies
> words.txt <- table(words.txt, dnn="words")
> # Sort by frequency, not alphabetically
> words.txt <- sort(words.txt, decreasing=TRUE)
> # Put into some readable form, Name of word and Number of times it occurs
> words.txt <- paste(names(words.txt), words.txt, sep="\t")
> # Results to a file
> cat("Word\tFREQ", words.txt, file="frequencies", sep="\n")
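A rough sketch of the database idea using RSQLite (the table and file names are made up; the word extraction still has to happen somewhere, and here it is assumed to have produced a character vector `words`, written in chunks with append = TRUE if it is too large to hold at once). The counting is then done by SQL rather than by an in-memory table() call.

library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "words.sqlite")
dbWriteTable(con, "words", data.frame(word = words), overwrite = TRUE)
freq <- dbGetQuery(con,
  "SELECT word, COUNT(*) AS freq FROM words GROUP BY word ORDER BY freq DESC")
dbDisconnect(con)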
Re: [R] Memory usage in R grows considerably while calculating word frequencies
HI, In a text file of 6834 words, I compared your program with a modified program. sapply(strsplit(txt1, ),length) #[1] 6834 #your program system.time({ txt1-tolower(scan(text_file,character,sep=\n)) pattern - (\\b[A-Za-z]+\\b) match - gregexpr(pattern,txt1) words.txt - regmatches(txt1,match) words.txt-unlist(words.txt) words.txt-table(words.txt,dnn=words) words.txt-sort(words.txt,decreasing=TRUE) words.txt-paste(names(words.txt),words.txt,sep=\t) cat(Word\tFREQ,words.txt,file=frequencies,sep=\n) }) # user system elapsed # 0.208 0.000 0.206 #Modified code system.time({ txt1-tolower(scan(text_file,character,sep=\n)) words.txt-sort(table(strsplit(tolower(txt1),\\s)),decreasing=TRUE) words.txt-paste(names(words.txt),words.txt,sep=\t) cat(Word\tFREQ,words.txt,file=frequencies,sep=\n) }) # user system elapsed # 0.016 0.000 0.014 A.K. - Original Message - From: mcelis mce...@lightminersystems.com To: r-help@r-project.org Cc: Sent: Monday, September 24, 2012 7:29 PM Subject: [R] Memory usage in R grows considerably while calculating word frequencies I am working with some large text files (up to 16 GBytes). I am interested in extracting the words and counting each time each word appears in the text. I have written a very simple R program by following some suggestions and examples I found online. If my input file is 1 GByte, I see that R uses up to 11 GBytes of memory when executing the program on a 64-bit system running CentOS 6.3. Why is R using so much memory? Is there a better way to do this that will minimize memory usage. I am very new to R, so I would appreciate some tips on how to improve my program or a better way to do it. R program: # Read in the entire file and convert all words in text to lower case words.txt-tolower(scan(text_file,character,sep=\n)) # Extract words pattern - (\\b[A-Za-z]+\\b) match - gregexpr(pattern,words.txt) words.txt - regmatches(words.txt,match) # Create a vector from the list of words words.txt-unlist(words.txt) # Calculate word frequencies words.txt-table(words.txt,dnn=words) # Sort by frequency, not alphabetically words.txt-sort(words.txt,decreasing=TRUE) # Put into some readable form, Name of word and Number of times it occurs words.txt-paste(names(words.txt),words.txt,sep=\t) # Results to a file cat(Word\tFREQ,words.txt,file=frequencies,sep=\n) -- View this message in context: http://r.789695.n4.nabble.com/Memory-usage-in-R-grows-considerably-while-calculating-word-frequencies-tp4644053.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Memory usage in R grows considerably while calculating word frequencies
HI, In the previous email, I forgot to add unlist(). With four paragraphs, sapply(strsplit(txt1, ),length) #[1] 4850 9072 6400 2071 #Your code: system.time({ txt1-tolower(scan(text_file,character,sep=\n)) pattern - (\\b[A-Za-z]+\\b) match - gregexpr(pattern,txt1) words.txt - regmatches(txt1,match) words.txt-unlist(words.txt) words.txt-table(words.txt,dnn=words) words.txt-sort(words.txt,decreasing=TRUE) words.txt-paste(names(words.txt),words.txt,sep=\t) cat(Word\tFREQ,words.txt,file=frequencies,sep=\n) }) #Read 4 items # user system elapsed # 11.781 0.004 11.799 #Modified code: system.time({ txt1-tolower(scan(text_file,character,sep=\n)) words.txt-sort(table(unlist(strsplit(tolower(txt1),\\s))),decreasing=TRUE) words.txt-paste(names(words.txt),words.txt,sep=\t) cat(Word\tFREQ,words.txt,file=frequencies,sep=\n) }) #Read 4 items #user system elapsed # 0.036 0.008 0.043 A.K. - Original Message - From: mcelis mce...@lightminersystems.com To: r-help@r-project.org Cc: Sent: Monday, September 24, 2012 7:29 PM Subject: [R] Memory usage in R grows considerably while calculating word frequencies I am working with some large text files (up to 16 GBytes). I am interested in extracting the words and counting each time each word appears in the text. I have written a very simple R program by following some suggestions and examples I found online. If my input file is 1 GByte, I see that R uses up to 11 GBytes of memory when executing the program on a 64-bit system running CentOS 6.3. Why is R using so much memory? Is there a better way to do this that will minimize memory usage. I am very new to R, so I would appreciate some tips on how to improve my program or a better way to do it. R program: # Read in the entire file and convert all words in text to lower case words.txt-tolower(scan(text_file,character,sep=\n)) # Extract words pattern - (\\b[A-Za-z]+\\b) match - gregexpr(pattern,words.txt) words.txt - regmatches(words.txt,match) # Create a vector from the list of words words.txt-unlist(words.txt) # Calculate word frequencies words.txt-table(words.txt,dnn=words) # Sort by frequency, not alphabetically words.txt-sort(words.txt,decreasing=TRUE) # Put into some readable form, Name of word and Number of times it occurs words.txt-paste(names(words.txt),words.txt,sep=\t) # Results to a file cat(Word\tFREQ,words.txt,file=frequencies,sep=\n) -- View this message in context: http://r.789695.n4.nabble.com/Memory-usage-in-R-grows-considerably-while-calculating-word-frequencies-tp4644053.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
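For reference, the modified approach above in runnable form (a reconstruction; text_file is assumed to hold the path to the input file and the output file name is a placeholder):

text_file <- "input.txt"
txt1 <- tolower(scan(text_file, "character", sep = "\n"))
words.txt <- sort(table(unlist(strsplit(txt1, "\\s"))), decreasing = TRUE)
words.txt <- paste(names(words.txt), words.txt, sep = "\t")
cat("Word\tFREQ", words.txt, file = "frequencies.txt", sep = "\n")

As the timings show, a single strsplit() over the lines is far cheaper than gregexpr()/regmatches(), at the cost of the cruder definition of a "word" discussed in the next message.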
Re: [R] Memory usage in R grows considerably while calculating word frequencies
arun smartpink...@yahoo.com on Mon, 24 Sep 2012 19:59:35 -0700 writes: HI, In the previous email, I forgot to add unlist(). With four paragraphs, sapply(strsplit(txt1, ),length) #[1] 4850 9072 6400 2071 #Your code: system.time({ txt1-tolower(scan(text_file,character,sep=\n)) pattern - (\\b[A-Za-z]+\\b) match - gregexpr(pattern,txt1) words.txt - regmatches(txt1,match) words.txt-unlist(words.txt) words.txt-table(words.txt,dnn=words) words.txt-sort(words.txt,decreasing=TRUE) words.txt-paste(names(words.txt),words.txt,sep=\t) cat(Word\tFREQ,words.txt,file=frequencies,sep=\n) }) #Read 4 items # user system elapsed # 11.781 0.004 11.799 #Modified code: system.time({ txt1-tolower(scan(text_file,character,sep=\n)) words.txt-sort(table(unlist(strsplit(tolower(txt1),\\s))),decreasing=TRUE) words.txt-paste(names(words.txt),words.txt,sep=\t) cat(Word\tFREQ,words.txt,file=frequencies,sep=\n) }) #Read 4 items #user system elapsed # 0.036 0.008 0.043 A.K. Well, dear A.K., your definition of word is really different, and in my view clearly much too simplistic, compared to what the OP (= original-poster) asked from. E.g., from the above paragraph, your method will get words such as A.K., different, or (= clearly wrongly. Martin Maechler, ETH Zurich - Original Message - From: mcelis mce...@lightminersystems.com To: r-help@r-project.org Cc: Sent: Monday, September 24, 2012 7:29 PM Subject: [R] Memory usage in R grows considerably while calculating word frequencies I am working with some large text files (up to 16 GBytes). I am interested in extracting the words and counting each time each word appears in the text. I have written a very simple R program by following some suggestions and examples I found online. If my input file is 1 GByte, I see that R uses up to 11 GBytes of memory when executing the program on a 64-bit system running CentOS 6.3. Why is R using so much memory? Is there a better way to do this that will minimize memory usage. I am very new to R, so I would appreciate some tips on how to improve my program or a better way to do it. R program: # Read in the entire file and convert all words in text to lower case words.txt-tolower(scan(text_file,character,sep=\n)) # Extract words pattern - (\\b[A-Za-z]+\\b) match - gregexpr(pattern,words.txt) words.txt - regmatches(words.txt,match) # Create a vector from the list of words words.txt-unlist(words.txt) # Calculate word frequencies words.txt-table(words.txt,dnn=words) # Sort by frequency, not alphabetically words.txt-sort(words.txt,decreasing=TRUE) # Put into some readable form, Name of word and Number of times it occurs words.txt-paste(names(words.txt),words.txt,sep=\t) # Results to a file cat(Word\tFREQ,words.txt,file=frequencies,sep=\n) -- View this message in context: http://r.789695.n4.nabble.com/Memory-usage-in-R-grows-considerably-while-calculating-word-frequencies-tp4644053.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. 
Re: [R] Memory usage in R grows considerably while calculating word frequencies
Le lundi 24 septembre 2012 à 16:29 -0700, mcelis a écrit : I am working with some large text files (up to 16 GBytes). I am interested in extracting the words and counting each time each word appears in the text. I have written a very simple R program by following some suggestions and examples I found online. If my input file is 1 GByte, I see that R uses up to 11 GBytes of memory when executing the program on a 64-bit system running CentOS 6.3. Why is R using so much memory? Is there a better way to do this that will minimize memory usage. I am very new to R, so I would appreciate some tips on how to improve my program or a better way to do it. First, I think you should have a look at the tm package by Ingo Feinerer. It will help you to import the texts, optionally run processing steps on it, and then extract the words and create a document-term matrix counting their frequencies. No need to reinvent the wheel. Second, there's nothing wrong with using RAM as long as it's available. If other programs need it, the Linux will reclaim it. There's a problem only if R's memory use does not reduce at that point. Use gc() to check whether the RAM allocated to R is really in use. But tm should improve the efficiency of the computations. My two cents R program: # Read in the entire file and convert all words in text to lower case words.txt-tolower(scan(text_file,character,sep=\n)) # Extract words pattern - (\\b[A-Za-z]+\\b) match - gregexpr(pattern,words.txt) words.txt - regmatches(words.txt,match) # Create a vector from the list of words words.txt-unlist(words.txt) # Calculate word frequencies words.txt-table(words.txt,dnn=words) # Sort by frequency, not alphabetically words.txt-sort(words.txt,decreasing=TRUE) # Put into some readable form, Name of word and Number of times it occurs words.txt-paste(names(words.txt),words.txt,sep=\t) # Results to a file cat(Word\tFREQ,words.txt,file=frequencies,sep=\n) -- View this message in context: http://r.789695.n4.nabble.com/Memory-usage-in-R-grows-considerably-while-calculating-word-frequencies-tp4644053.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
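For reference, a minimal sketch of the tm suggestion, assuming the tm package (and its slam dependency) is installed; the directory path is a placeholder and the function names are those of current tm versions:

library(tm)

corpus <- VCorpus(DirSource("texts/"))                 # one document per file in the directory
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

tdm <- TermDocumentMatrix(corpus)                      # sparse term-by-document counts
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)   # total frequency of each term
head(freq)

Because the term-document matrix is stored in sparse form, the counts stay compact even for large collections.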
Re: [R] Memory usage in R grows considerably while calculating word frequencies
Dear Martin, Thanks for testing the code. You are right. I modified the code: If I test it for a sample text, txt1-Romney A.K. different, (= than other people. Is it? OP's code: pattern - (\\b[A-Za-z]+\\b) match - gregexpr(pattern,txt1) words.txt - regmatches(txt1,match) words.txt-unlist(words.txt) words.txt-table(words.txt,dnn=words) words.txt-sort(words.txt,decreasing=TRUE) words.txt #words # A different Is it K other people Romney # 1 1 1 1 1 1 1 1 # than # 1 #My code: words.txt1-sort(table(gsub(\\W,,unlist(strsplit(tolower(txt1),\\s)))[grepl(\\b\\w+\\b,gsub(\\W,,unlist(strsplit(tolower(txt1),\\s])) # ak different is it other people romney than # 1 1 1 1 1 1 1 1 Here, as you can see, OP's code split A.K. to two words, but my code joins it. I didn't fix it because the concern is to minimize memory usage. I again, tested the new code with text of : sapply(strsplit(txt1, ),length) #[1] 4850 9072 6400 2071 sum(sapply(strsplit(txt1, ),length)) #[1] 22393 : words. #OP's code: system.time({ txt1-tolower(scan(text_file,character,sep=\n)) pattern - (\\b[A-Za-z]+\\b) match - gregexpr(pattern,txt1) words.txt - regmatches(txt1,match) words.txt-unlist(words.txt) words.txt-table(words.txt,dnn=words) words.txt-sort(words.txt,decreasing=TRUE) words.txt-paste(names(words.txt),words.txt,sep=\t) cat(Word\tFREQ,words.txt,file=frequencies,sep=\n) }) #Read 4 items # user system elapsed # 12.056 0.000 12.066 #My code: system.time({ txt1-tolower(scan(text_file,character,sep=\n)) words.txt-sort(table(gsub(\\W,,unlist(strsplit(tolower(txt1),\\s)))[grepl(\\b\\w+\\b,gsub(\\W,,unlist(strsplit(tolower(txt1),\\s]),decreasing=TRUE) words.txt-paste(names(words.txt),words.txt,sep=\t) cat(Word\tFREQ,words.txt,file=frequencies,sep=\n) }) #Read 4 items # user system elapsed # 0.148 0.000 0.150 There is improvement in the speed. Output also looked similar. This code may be still improved. A.K. - Original Message - From: Martin Maechler maech...@stat.math.ethz.ch To: arun smartpink...@yahoo.com Cc: mcelis mce...@lightminersystems.com; R help r-help@r-project.org Sent: Tuesday, September 25, 2012 9:07 AM Subject: Re: [R] Memory usage in R grows considerably while calculating word frequencies arun smartpink...@yahoo.com on Mon, 24 Sep 2012 19:59:35 -0700 writes: HI, In the previous email, I forgot to add unlist(). With four paragraphs, sapply(strsplit(txt1, ),length) #[1] 4850 9072 6400 2071 #Your code: system.time({ txt1-tolower(scan(text_file,character,sep=\n)) pattern - (\\b[A-Za-z]+\\b) match - gregexpr(pattern,txt1) words.txt - regmatches(txt1,match) words.txt-unlist(words.txt) words.txt-table(words.txt,dnn=words) words.txt-sort(words.txt,decreasing=TRUE) words.txt-paste(names(words.txt),words.txt,sep=\t) cat(Word\tFREQ,words.txt,file=frequencies,sep=\n) }) #Read 4 items # user system elapsed # 11.781 0.004 11.799 #Modified code: system.time({ txt1-tolower(scan(text_file,character,sep=\n)) words.txt-sort(table(unlist(strsplit(tolower(txt1),\\s))),decreasing=TRUE) words.txt-paste(names(words.txt),words.txt,sep=\t) cat(Word\tFREQ,words.txt,file=frequencies,sep=\n) }) #Read 4 items #user system elapsed # 0.036 0.008 0.043 A.K. Well, dear A.K., your definition of word is really different, and in my view clearly much too simplistic, compared to what the OP (= original-poster) asked from. E.g., from the above paragraph, your method will get words such as A.K., different, or (= clearly wrongly. 
Martin Maechler, ETH Zurich
[R] Memory usage in R grows considerably while calculating word frequencies
I am working with some large text files (up to 16 GBytes). I am interested in extracting the words and counting each time each word appears in the text. I have written a very simple R program by following some suggestions and examples I found online. If my input file is 1 GByte, I see that R uses up to 11 GBytes of memory when executing the program on a 64-bit system running CentOS 6.3. Why is R using so much memory? Is there a better way to do this that will minimize memory usage. I am very new to R, so I would appreciate some tips on how to improve my program or a better way to do it. R program: # Read in the entire file and convert all words in text to lower case words.txt-tolower(scan(text_file,character,sep=\n)) # Extract words pattern - (\\b[A-Za-z]+\\b) match - gregexpr(pattern,words.txt) words.txt - regmatches(words.txt,match) # Create a vector from the list of words words.txt-unlist(words.txt) # Calculate word frequencies words.txt-table(words.txt,dnn=words) # Sort by frequency, not alphabetically words.txt-sort(words.txt,decreasing=TRUE) # Put into some readable form, Name of word and Number of times it occurs words.txt-paste(names(words.txt),words.txt,sep=\t) # Results to a file cat(Word\tFREQ,words.txt,file=frequencies,sep=\n) -- View this message in context: http://r.789695.n4.nabble.com/Memory-usage-in-R-grows-considerably-while-calculating-word-frequencies-tp4644053.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
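For reference, the program above in runnable form (a reconstruction; text_file is assumed to hold the path to the input file). Note that gregexpr() and regmatches() build a full list of per-line matches alongside the original text, which is part of why the footprint ends up several times the size of the file:

text_file <- "input.txt"
# Read in the entire file and convert all words in the text to lower case
words.txt <- tolower(scan(text_file, "character", sep = "\n"))
# Extract words
pattern <- "(\\b[A-Za-z]+\\b)"
match <- gregexpr(pattern, words.txt)
words.txt <- regmatches(words.txt, match)
# Create a vector from the list of words
words.txt <- unlist(words.txt)
# Calculate word frequencies
words.txt <- table(words.txt, dnn = "words")
# Sort by frequency, not alphabetically
words.txt <- sort(words.txt, decreasing = TRUE)
# Name of word and number of times it occurs, tab separated
words.txt <- paste(names(words.txt), words.txt, sep = "\t")
# Results to a file
cat("Word\tFREQ", words.txt, file = "frequencies.txt", sep = "\n")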
[R] memory usage benefit from anonymous variable constructions.
This is an "I was just wondering" question. When the dataframe package was announced and its author claimed to reduce the number of times a data frame gets copied, I started to wonder whether I should care about this in my projects. Has anybody written a general guide for how to write R code that doesn't needlessly exhaust RAM? In Objective-C, we used to gain some considerable advantages by avoiding declaring objects separately, using anonymous variables instead. The storage was allocated on the stack, I think, and I think there was talk that the numbers might stay 'closer' to the CPU (registers?) for immediate use. Does this provide a benefit in R as well? For example, instead of the way I would usually do this: mf <- model.frame(model) y <- model.response(mf) Here is the anonymous alternative, where mf is never declared: y <- model.response(model.frame(model)) On the face of it, I can imagine this might be better because no permanent object mf is created, and the garbage collector wouldn't be called into play if all the data is local and disappears immediately. But, then again, R is doing lots of stuff under the hood that I've never bothered to learn about. pj -- Paul E. Johnson Professor, Political Science Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
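One way to check the question empirically with base R tools is sketched below (the lm() fit is just a stand-in example). R's copy-on-modify semantics mean the intermediate model frame is not copied merely because it is bound to a name, so the two styles mainly differ in how long the intermediate stays reachable before the garbage collector can reclaim it, not in where it is allocated:

model <- lm(mpg ~ wt + hp, data = mtcars)    # small stand-in model

# Style 1: named intermediate -- 'mf' keeps the model frame reachable
# until it is removed or the enclosing function returns.
mf <- model.frame(model)
y1 <- model.response(mf)
rm(mf); invisible(gc())

# Style 2: anonymous intermediate -- the frame becomes garbage as soon
# as the expression finishes, with no rm() needed.
y2 <- model.response(model.frame(model))

identical(y1, y2)                            # same result either way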
[R] memory usage upon web-query using try function
Dear Community, my program below runs quite slowly and I'm not sure whether the HTTP requests are to blame for this. Also, while running it gradually increases its memory usage enormously, and after the program finishes the memory is not freed. Can someone point out a problem in the code? Sorry for the basic question, but I am totally new to R programming... Many thanks for your time, Cyrus

require(XML)
row <- 0
URL <- "http://de.finance.yahoo.com/lookup?s="
df <- matrix(ncol = 6, nrow = 10)
for (Ticker in 10:20) {
  URLTicker <- paste(URL, Ticker, sep = "")
  query <- try(readHTMLTable(URLTicker, which = 2, header = TRUE,
                             colClasses = c("character", "character", "character",
                                            "character", "character", "character"),
                             stringsAsFactors = FALSE)[1, ], silent = TRUE)
  if (class(query) == "data.frame") {
    row <- row + 1
    df[row, ] <- as.character(query)
  }
}

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
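A hedged sketch of one way to restructure the loop (not a tested fix for the memory growth itself): size the result to the number of tickers actually requested and build it with vapply() rather than growing state inside the loop; the try() handling and the Yahoo URL are as in the original post:

library(XML)

base_url <- "http://de.finance.yahoo.com/lookup?s="
tickers  <- 10:20

fetch_row <- function(ticker) {
  tab <- try(readHTMLTable(paste0(base_url, ticker), which = 2, header = TRUE,
                           stringsAsFactors = FALSE), silent = TRUE)
  if (inherits(tab, "try-error") || nrow(tab) < 1) return(rep(NA_character_, 6))
  as.character(tab[1, ])[1:6]                 # first row, padded/truncated to 6 entries
}

df <- t(vapply(tickers, fetch_row, character(6)))   # one row per ticker, preallocated
gc()                                                # check what R still holds afterwards

If memory still grows after the calls return, whether the XML package is holding on to parsed documents would be the next thing to check.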
Re: [R] Memory usage in read.csv()
Hi Jim Gabor - Apparently, it was most likely a hardware issue (shortly after sending my last e-mail, the computer promptly died). After buying a new system and restoring, the script runs fine. Thanks for your help! On Tue, Jan 19, 2010 at 2:02 PM, jim holtman - jholt...@gmail.com +nabble+miller_2555+9dc9649aca.jholtman#gmail@spamgourmet.com wrote: I read vmstat data in just fine without any problems. Here is an example of how I do it: VMstat - read.table('vmstat.txt', header=TRUE, as.is=TRUE) vmstat.txt looks like this: date time r b w swap free re mf pi po fr de sr intr syscalls cs user sys id 07/27/05 00:13:06 0 0 0 27755440 13051648 20 86 0 0 0 0 0 456 2918 1323 0 1 99 07/27/05 00:13:36 0 0 0 27755280 13051480 11 53 0 0 0 0 0 399 1722 1411 0 1 99 07/27/05 00:14:06 0 0 0 27753952 13051248 18 88 0 0 0 0 0 424 1259 1254 0 1 99 07/27/05 00:14:36 0 0 0 27755304 13051496 17 85 0 0 0 0 0 430 1029 1246 0 1 99 07/27/05 00:15:06 0 0 0 27755064 13051232 41 278 0 1 1 0 0 452 2047 1386 0 1 99 07/27/05 00:15:36 0 0 0 27753824 13040720 125 1039 0 0 0 0 0 664 4097 1901 3 2 95 07/27/05 00:16:06 0 0 0 27754472 13027000 15 91 0 0 0 0 0 432 1160 1273 0 1 99 07/27/05 00:16:36 0 0 0 27754568 13027104 17 85 0 0 0 0 0 416 1058 1271 0 1 99 Have you tried a smaller portion of data? Here is what it took to read in a file with 85K lines: system.time(vmstat - read.table('c:/vmstat.txt', header=TRUE)) user system elapsed 2.01 0.01 2.03 str(vmstat) 'data.frame': 85680 obs. of 20 variables: $ date : Factor w/ 2 levels 07/27/05,07/28/05: 1 1 1 1 1 1 1 1 1 1 ... $ time : Factor w/ 2856 levels 00:00:26,00:00:56,..: 27 29 31 33 35 37 39 41 43 45 ... $ r : int 0 0 0 0 0 0 0 0 0 0 ... $ b : int 0 0 0 0 0 0 0 0 0 0 ... $ w : int 0 0 0 0 0 0 0 0 0 0 ... $ swap : int 27755440 27755280 27753952 27755304 27755064 27753824 27754472 27754568 27754560 27754704 ... $ free : int 13051648 13051480 13051248 13051496 13051232 13040720 13027000 13027104 13027096 13027240 ... $ re : int 20 11 18 17 41 125 15 17 13 12 ... $ mf : int 86 53 88 85 278 1039 91 85 69 51 ... $ pi : int 0 0 0 0 0 0 0 0 0 0 ... $ po : int 0 0 0 0 1 0 0 0 0 1 ... $ fr : int 0 0 0 0 1 0 0 0 0 1 ... $ de : int 0 0 0 0 0 0 0 0 0 0 ... $ sr : int 0 0 0 0 0 0 0 0 0 0 ... $ intr : int 456 399 424 430 452 664 432 416 425 432 ... $ syscalls: int 2918 1722 1259 1029 2047 4097 1160 1058 1198 1727 ... $ cs : int 1323 1411 1254 1246 1386 1901 1273 1271 1268 1477 ... $ user : int 0 0 0 0 0 3 0 0 0 0 ... $ sys : int 1 1 1 1 1 2 1 1 1 1 ... $ id : int 99 99 99 99 99 95 99 99 99 99 ... On Tue, Jan 19, 2010 at 9:25 AM, nabble.30.miller_2...@spamgourmet.com wrote: I'm sure this has gotten some attention before, but I have two CSV files generated from vmstat and free that are roughly 6-8 Mb (about 80,000 lines) each. When I try to use read.csv(), R allocates all available memory (about 4.9 Gb) when loading the files, which is over 300 times the size of the raw data. 
Here are the scripts used to generate the CSV files as well as the R code: Scripts (run for roughly a 24-hour period): vmstat -ant 1 | awk '$0 !~ /(proc|free)/ {FS= ; OFS=,; print strftime(%F %T %Z),$6,$7,$12,$13,$14,$15,$16,$17;}' ~/vmstat_20100118_133845.o; free -ms 1 | awk '$0 ~ /Mem\:/ {FS= ; OFS=,; print strftime(%F %T %Z),$2,$3,$4,$5,$6,$7}' ~/memfree_20100118_140845.o; R code: infile.vms - ~/vmstat_20100118_133845.o; infile.mem - ~/memfree_20100118_140845.o; vms.colnames - c(time,r,b,swpd,free,inact,active,si,so,bi,bo,in,cs,us,sy,id,wa,st); vms.colclass - c(character,rep(integer,length(vms.colnames)-1)); mem.colnames - c(time,total,used,free,shared,buffers,cached); mem.colclass - c(character,rep(integer,length(mem.colnames)-1)); vmsdf - (read.csv(infile.vms,header=FALSE,colClasses=vms.colclass,col.names=vms.colnames)); memdf - (read.csv(infile.mem,header=FALSE,colClasses=mem.colclass,col.names=mem.colnames)); I am running R v2.10.0 on a 64-bit machine with Fedora 10 (Linux version 2.6.27.41-170.2.117.fc10.x86_64 ) with 6Gb of memory. There are no other significant programs running and `rm()` followed by ` gc()` successfully frees the memory (followed by swapins after other programs seek to used previously cached information swapped to disk). I've incorporated the memory-saving suggestions in the `read.csv()` manual page, excluding the limit on the lines read (which shouldn't really be necessary here since we're only talking about 20 Mb of raw data. Any suggestions, or is the read.csv() code known to have memory leak/ overcommit issues? Thanks __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
[R] Memory usage in read.csv()
I'm sure this has gotten some attention before, but I have two CSV files generated from vmstat and free that are roughly 6-8 Mb (about 80,000 lines) each. When I try to use read.csv(), R allocates all available memory (about 4.9 Gb) when loading the files, which is over 300 times the size of the raw data. Here are the scripts used to generate the CSV files as well as the R code: Scripts (run for roughly a 24-hour period): vmstat -ant 1 | awk '$0 !~ /(proc|free)/ {FS= ; OFS=,; print strftime(%F %T %Z),$6,$7,$12,$13,$14,$15,$16,$17;}' ~/vmstat_20100118_133845.o; free -ms 1 | awk '$0 ~ /Mem\:/ {FS= ; OFS=,; print strftime(%F %T %Z),$2,$3,$4,$5,$6,$7}' ~/memfree_20100118_140845.o; R code: infile.vms - ~/vmstat_20100118_133845.o; infile.mem - ~/memfree_20100118_140845.o; vms.colnames - c(time,r,b,swpd,free,inact,active,si,so,bi,bo,in,cs,us,sy,id,wa,st); vms.colclass - c(character,rep(integer,length(vms.colnames)-1)); mem.colnames - c(time,total,used,free,shared,buffers,cached); mem.colclass - c(character,rep(integer,length(mem.colnames)-1)); vmsdf - (read.csv(infile.vms,header=FALSE,colClasses=vms.colclass,col.names=vms.colnames)); memdf - (read.csv(infile.mem,header=FALSE,colClasses=mem.colclass,col.names=mem.colnames)); I am running R v2.10.0 on a 64-bit machine with Fedora 10 (Linux version 2.6.27.41-170.2.117.fc10.x86_64 ) with 6Gb of memory. There are no other significant programs running and `rm()` followed by ` gc()` successfully frees the memory (followed by swapins after other programs seek to used previously cached information swapped to disk). I've incorporated the memory-saving suggestions in the `read.csv()` manual page, excluding the limit on the lines read (which shouldn't really be necessary here since we're only talking about 20 Mb of raw data. Any suggestions, or is the read.csv() code known to have memory leak/ overcommit issues? Thanks __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
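For reference, the R code above in runnable form (a reconstruction of the same calls, with the file names as posted):

infile.vms <- "~/vmstat_20100118_133845.o"
infile.mem <- "~/memfree_20100118_140845.o"

vms.colnames <- c("time","r","b","swpd","free","inact","active","si","so","bi","bo","in","cs","us","sy","id","wa","st")
vms.colclass <- c("character", rep("integer", length(vms.colnames) - 1))
mem.colnames <- c("time","total","used","free","shared","buffers","cached")
mem.colclass <- c("character", rep("integer", length(mem.colnames) - 1))

vmsdf <- read.csv(infile.vms, header = FALSE, colClasses = vms.colclass, col.names = vms.colnames)
memdf <- read.csv(infile.mem, header = FALSE, colClasses = mem.colclass, col.names = mem.colnames)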
Re: [R] Memory usage in read.csv()
I read vmstat data in just fine without any problems. Here is an example of how I do it: VMstat - read.table('vmstat.txt', header=TRUE, as.is=TRUE) vmstat.txt looks like this: date time r b w swap free re mf pi po fr de sr intr syscalls cs user sys id 07/27/05 00:13:06 0 0 0 27755440 13051648 20 86 0 0 0 0 0 456 2918 1323 0 1 99 07/27/05 00:13:36 0 0 0 27755280 13051480 11 53 0 0 0 0 0 399 1722 1411 0 1 99 07/27/05 00:14:06 0 0 0 27753952 13051248 18 88 0 0 0 0 0 424 1259 1254 0 1 99 07/27/05 00:14:36 0 0 0 27755304 13051496 17 85 0 0 0 0 0 430 1029 1246 0 1 99 07/27/05 00:15:06 0 0 0 27755064 13051232 41 278 0 1 1 0 0 452 2047 1386 0 1 99 07/27/05 00:15:36 0 0 0 27753824 13040720 125 1039 0 0 0 0 0 664 4097 1901 3 2 95 07/27/05 00:16:06 0 0 0 27754472 13027000 15 91 0 0 0 0 0 432 1160 1273 0 1 99 07/27/05 00:16:36 0 0 0 27754568 13027104 17 85 0 0 0 0 0 416 1058 1271 0 1 99 Have you tried a smaller portion of data? Here is what it took to read in a file with 85K lines: system.time(vmstat - read.table('c:/vmstat.txt', header=TRUE)) user system elapsed 2.010.012.03 str(vmstat) 'data.frame': 85680 obs. of 20 variables: $ date: Factor w/ 2 levels 07/27/05,07/28/05: 1 1 1 1 1 1 1 1 1 1 ... $ time: Factor w/ 2856 levels 00:00:26,00:00:56,..: 27 29 31 33 35 37 39 41 43 45 ... $ r : int 0 0 0 0 0 0 0 0 0 0 ... $ b : int 0 0 0 0 0 0 0 0 0 0 ... $ w : int 0 0 0 0 0 0 0 0 0 0 ... $ swap: int 27755440 27755280 27753952 27755304 27755064 27753824 27754472 27754568 27754560 27754704 ... $ free: int 13051648 13051480 13051248 13051496 13051232 13040720 13027000 13027104 13027096 13027240 ... $ re : int 20 11 18 17 41 125 15 17 13 12 ... $ mf : int 86 53 88 85 278 1039 91 85 69 51 ... $ pi : int 0 0 0 0 0 0 0 0 0 0 ... $ po : int 0 0 0 0 1 0 0 0 0 1 ... $ fr : int 0 0 0 0 1 0 0 0 0 1 ... $ de : int 0 0 0 0 0 0 0 0 0 0 ... $ sr : int 0 0 0 0 0 0 0 0 0 0 ... $ intr: int 456 399 424 430 452 664 432 416 425 432 ... $ syscalls: int 2918 1722 1259 1029 2047 4097 1160 1058 1198 1727 ... $ cs : int 1323 1411 1254 1246 1386 1901 1273 1271 1268 1477 ... $ user: int 0 0 0 0 0 3 0 0 0 0 ... $ sys : int 1 1 1 1 1 2 1 1 1 1 ... $ id : int 99 99 99 99 99 95 99 99 99 99 ... On Tue, Jan 19, 2010 at 9:25 AM, nabble.30.miller_2...@spamgourmet.com wrote: I'm sure this has gotten some attention before, but I have two CSV files generated from vmstat and free that are roughly 6-8 Mb (about 80,000 lines) each. When I try to use read.csv(), R allocates all available memory (about 4.9 Gb) when loading the files, which is over 300 times the size of the raw data. 
Here are the scripts used to generate the CSV files as well as the R code: Scripts (run for roughly a 24-hour period): vmstat -ant 1 | awk '$0 !~ /(proc|free)/ {FS= ; OFS=,; print strftime(%F %T %Z),$6,$7,$12,$13,$14,$15,$16,$17;}' ~/vmstat_20100118_133845.o; free -ms 1 | awk '$0 ~ /Mem\:/ {FS= ; OFS=,; print strftime(%F %T %Z),$2,$3,$4,$5,$6,$7}' ~/memfree_20100118_140845.o; R code: infile.vms - ~/vmstat_20100118_133845.o; infile.mem - ~/memfree_20100118_140845.o; vms.colnames - c(time,r,b,swpd,free,inact,active,si,so,bi,bo,in,cs,us,sy,id,wa,st); vms.colclass - c(character,rep(integer,length(vms.colnames)-1)); mem.colnames - c(time,total,used,free,shared,buffers,cached); mem.colclass - c(character,rep(integer,length(mem.colnames)-1)); vmsdf - (read.csv(infile.vms,header=FALSE,colClasses=vms.colclass,col.names=vms.colnames)); memdf - (read.csv(infile.mem,header=FALSE,colClasses=mem.colclass,col.names=mem.colnames)); I am running R v2.10.0 on a 64-bit machine with Fedora 10 (Linux version 2.6.27.41-170.2.117.fc10.x86_64 ) with 6Gb of memory. There are no other significant programs running and `rm()` followed by ` gc()` successfully frees the memory (followed by swapins after other programs seek to used previously cached information swapped to disk). I've incorporated the memory-saving suggestions in the `read.csv()` manual page, excluding the limit on the lines read (which shouldn't really be necessary here since we're only talking about 20 Mb of raw data. Any suggestions, or is the read.csv() code known to have memory leak/ overcommit issues? Thanks __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve? __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal,
Re: [R] Memory usage in read.csv()
You could also try read.csv.sql in sqldf. See examples on sqldf home page: http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql On Tue, Jan 19, 2010 at 9:25 AM, nabble.30.miller_2...@spamgourmet.com wrote: I'm sure this has gotten some attention before, but I have two CSV files generated from vmstat and free that are roughly 6-8 Mb (about 80,000 lines) each. When I try to use read.csv(), R allocates all available memory (about 4.9 Gb) when loading the files, which is over 300 times the size of the raw data. Here are the scripts used to generate the CSV files as well as the R code: Scripts (run for roughly a 24-hour period): vmstat -ant 1 | awk '$0 !~ /(proc|free)/ {FS= ; OFS=,; print strftime(%F %T %Z),$6,$7,$12,$13,$14,$15,$16,$17;}' ~/vmstat_20100118_133845.o; free -ms 1 | awk '$0 ~ /Mem\:/ {FS= ; OFS=,; print strftime(%F %T %Z),$2,$3,$4,$5,$6,$7}' ~/memfree_20100118_140845.o; R code: infile.vms - ~/vmstat_20100118_133845.o; infile.mem - ~/memfree_20100118_140845.o; vms.colnames - c(time,r,b,swpd,free,inact,active,si,so,bi,bo,in,cs,us,sy,id,wa,st); vms.colclass - c(character,rep(integer,length(vms.colnames)-1)); mem.colnames - c(time,total,used,free,shared,buffers,cached); mem.colclass - c(character,rep(integer,length(mem.colnames)-1)); vmsdf - (read.csv(infile.vms,header=FALSE,colClasses=vms.colclass,col.names=vms.colnames)); memdf - (read.csv(infile.mem,header=FALSE,colClasses=mem.colclass,col.names=mem.colnames)); I am running R v2.10.0 on a 64-bit machine with Fedora 10 (Linux version 2.6.27.41-170.2.117.fc10.x86_64 ) with 6Gb of memory. There are no other significant programs running and `rm()` followed by ` gc()` successfully frees the memory (followed by swapins after other programs seek to used previously cached information swapped to disk). I've incorporated the memory-saving suggestions in the `read.csv()` manual page, excluding the limit on the lines read (which shouldn't really be necessary here since we're only talking about 20 Mb of raw data. Any suggestions, or is the read.csv() code known to have memory leak/ overcommit issues? Thanks __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
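For reference, a minimal sketch of the sqldf route with one of the files from this thread (assuming the sqldf package is installed):

library(sqldf)

vmsdf <- read.csv.sql("~/vmstat_20100118_133845.o", header = FALSE,
                      sql = "select * from file")

The file is loaded into a temporary SQLite database and only the result of the SQL query is returned to R, so restricting or aggregating rows in the sql= argument keeps the R-side memory footprint small.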
[R] Passing lists and R memory usage growth
Hello, I can't think of an explanation for this memory allocation behaviour and was hoping someone on the list could help out. Setup: -- R version 2.8.1, 32-bit Ubuntu 9.04 Linux, Core 2 Duo with 3GB ram Description: Inside a for loop, I am passing a list to a function. The function accesses various members of the list. I understand that in this situation, the entire list may be duplicated in each function call. That's ok. But the memory given to these duplicates doesn't seem to be recovered by the garbage collector after the function call has ended and more memory is allocated in each iteration. (See output below.) I also tried summing up object.size() for all objects in all environments, and the total is constant about 15 Mbytes at each iteration. But overall memory consumption as reported by gc() (and my operating system) keeps going up to 2 Gbytes and more. Pseudocode: --- # This function and its callees need a 'results' list some.function.1 - function(iter, res, par) { # access res$gamma[[iter-1]], res$beta[[iter-1]] ... } # This function and its callees need a 'results' list some.function.2 - function(iter, res, par) { # access res$gamma[[iter-1]], res$beta[[iter-1]] ... } # Some parameters par - list( ... ) # List storing results. # Only results$gamma[1:3], results$beta[1:3] are used results - list(gamma = list(), beta = list()) for (iter in 1:100) { print(paste(Iteration , iter)) # min(iter, 3) is the most recent slot of results$gamma etc. results$gamma[[min(iter, 3)]] - some.function.1(min(iter, 3), results, par) results$beta[[min(iter, 3)]] - some.function.2(min(iter, 3), results, par) # Delete earlier results if (iter 2) { results$gamma[[1]] - NULL results$beta[[1]] - NULL } # Report on memory usage gc(verbose=TRUE) } Output from an actual run of my program: [1] Iteration 1 Garbage collection 255 = 122+60+73 (level 2) ... 6.1 Mbytes of cons cells used (48%) 232.3 Mbytes of vectors used (69%) [1] Iteration 2 Garbage collection 257 = 123+60+74 (level 2) ... 6.1 Mbytes of cons cells used (48%) 238.3 Mbytes of vectors used (67%) [1] Iteration 3 Garbage collection 258 = 123+60+75 (level 2) ... 6.1 Mbytes of cons cells used (49%) 242.8 Mbytes of vectors used (69%) [1] Iteration 4 Garbage collection 259 = 123+60+76 (level 2) ... 6.2 Mbytes of cons cells used (49%) 247.3 Mbytes of vectors used (66%) [1] Iteration 5 Garbage collection 260 = 123+60+77 (level 2) ... 6.2 Mbytes of cons cells used (50%) 251.8 Mbytes of vectors used (68%) ... Thanks, Rajeev. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Passing lists and R memory usage growth
You need to give reproducible code for a question like this, not pseudocode. And you should consider using a recent version of R, not the relatively ancient 2.8.1 (which was released in late 2008. Duncan Murdoch On 03/10/2009 1:30 PM, Rajeev Ayyagari wrote: Hello, I can't think of an explanation for this memory allocation behaviour and was hoping someone on the list could help out. Setup: -- R version 2.8.1, 32-bit Ubuntu 9.04 Linux, Core 2 Duo with 3GB ram Description: Inside a for loop, I am passing a list to a function. The function accesses various members of the list. I understand that in this situation, the entire list may be duplicated in each function call. That's ok. But the memory given to these duplicates doesn't seem to be recovered by the garbage collector after the function call has ended and more memory is allocated in each iteration. (See output below.) I also tried summing up object.size() for all objects in all environments, and the total is constant about 15 Mbytes at each iteration. But overall memory consumption as reported by gc() (and my operating system) keeps going up to 2 Gbytes and more. Pseudocode: --- # This function and its callees need a 'results' list some.function.1 - function(iter, res, par) { # access res$gamma[[iter-1]], res$beta[[iter-1]] ... } # This function and its callees need a 'results' list some.function.2 - function(iter, res, par) { # access res$gamma[[iter-1]], res$beta[[iter-1]] ... } # Some parameters par - list( ... ) # List storing results. # Only results$gamma[1:3], results$beta[1:3] are used results - list(gamma = list(), beta = list()) for (iter in 1:100) { print(paste(Iteration , iter)) # min(iter, 3) is the most recent slot of results$gamma etc. results$gamma[[min(iter, 3)]] - some.function.1(min(iter, 3), results, par) results$beta[[min(iter, 3)]] - some.function.2(min(iter, 3), results, par) # Delete earlier results if (iter 2) { results$gamma[[1]] - NULL results$beta[[1]] - NULL } # Report on memory usage gc(verbose=TRUE) } Output from an actual run of my program: [1] Iteration 1 Garbage collection 255 = 122+60+73 (level 2) ... 6.1 Mbytes of cons cells used (48%) 232.3 Mbytes of vectors used (69%) [1] Iteration 2 Garbage collection 257 = 123+60+74 (level 2) ... 6.1 Mbytes of cons cells used (48%) 238.3 Mbytes of vectors used (67%) [1] Iteration 3 Garbage collection 258 = 123+60+75 (level 2) ... 6.1 Mbytes of cons cells used (49%) 242.8 Mbytes of vectors used (69%) [1] Iteration 4 Garbage collection 259 = 123+60+76 (level 2) ... 6.2 Mbytes of cons cells used (49%) 247.3 Mbytes of vectors used (66%) [1] Iteration 5 Garbage collection 260 = 123+60+77 (level 2) ... 6.2 Mbytes of cons cells used (50%) 251.8 Mbytes of vectors used (68%) ... Thanks, Rajeev. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Passing lists and R memory usage growth
Duncan: I took your suggestion and upgraded to R 2.9.2, but the problem persists. I am not able to reproduce the problem in a simple case. In my actual code the functions some.function.1() and some.function.2() are quite complicated and call various other functions which also access elements of the list. If I can find a simple way to reproduce it, I will post the code to the list. I know it must be the results list in the pseudocode which is causing the problem because: 1. I tried tracemem() on par and results; results is duplicated several times but par is not. 2. I can eliminate the memory problem completely by rewriting some.function.1() and some.function.2() to accept individual elements of the list as arguments, and passing several list elements like results$gamma[[iter-1]] etc. in the call. (Rather than passing the entire list as a single argument.) This makes the code harder to read but the memory problem is eliminated. Regards Rajeev. On Sat, Oct 3, 2009 at 1:43 PM, Duncan Murdoch murd...@stats.uwo.ca wrote: You need to give reproducible code for a question like this, not pseudocode. And you should consider using a recent version of R, not the relatively ancient 2.8.1 (which was released in late 2008. Duncan Murdoch On 03/10/2009 1:30 PM, Rajeev Ayyagari wrote: Hello, I can't think of an explanation for this memory allocation behaviour and was hoping someone on the list could help out. Setup: -- R version 2.8.1, 32-bit Ubuntu 9.04 Linux, Core 2 Duo with 3GB ram Description: Inside a for loop, I am passing a list to a function. The function accesses various members of the list. I understand that in this situation, the entire list may be duplicated in each function call. That's ok. But the memory given to these duplicates doesn't seem to be recovered by the garbage collector after the function call has ended and more memory is allocated in each iteration. (See output below.) I also tried summing up object.size() for all objects in all environments, and the total is constant about 15 Mbytes at each iteration. But overall memory consumption as reported by gc() (and my operating system) keeps going up to 2 Gbytes and more. Pseudocode: --- # This function and its callees need a 'results' list some.function.1 - function(iter, res, par) { # access res$gamma[[iter-1]], res$beta[[iter-1]] ... } # This function and its callees need a 'results' list some.function.2 - function(iter, res, par) { # access res$gamma[[iter-1]], res$beta[[iter-1]] ... } # Some parameters par - list( ... ) # List storing results. # Only results$gamma[1:3], results$beta[1:3] are used results - list(gamma = list(), beta = list()) for (iter in 1:100) { print(paste(Iteration , iter)) # min(iter, 3) is the most recent slot of results$gamma etc. results$gamma[[min(iter, 3)]] - some.function.1(min(iter, 3), results, par) results$beta[[min(iter, 3)]] - some.function.2(min(iter, 3), results, par) # Delete earlier results if (iter 2) { results$gamma[[1]] - NULL results$beta[[1]] - NULL } # Report on memory usage gc(verbose=TRUE) } Output from an actual run of my program: [1] Iteration 1 Garbage collection 255 = 122+60+73 (level 2) ... 6.1 Mbytes of cons cells used (48%) 232.3 Mbytes of vectors used (69%) [1] Iteration 2 Garbage collection 257 = 123+60+74 (level 2) ... 6.1 Mbytes of cons cells used (48%) 238.3 Mbytes of vectors used (67%) [1] Iteration 3 Garbage collection 258 = 123+60+75 (level 2) ... 
6.1 Mbytes of cons cells used (49%) 242.8 Mbytes of vectors used (69%) [1] Iteration 4 Garbage collection 259 = 123+60+76 (level 2) ... 6.2 Mbytes of cons cells used (49%) 247.3 Mbytes of vectors used (66%) [1] Iteration 5 Garbage collection 260 = 123+60+77 (level 2) ... 6.2 Mbytes of cons cells used (50%) 251.8 Mbytes of vectors used (68%) ... Thanks, Rajeev. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
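A small, self-contained illustration of the behaviour described in this thread, under the copy-on-modify rules of the R versions discussed here (the list contents and functions are made up for the example):

results <- list(gamma = list(rnorm(1e6)), beta = list(rnorm(1e6)))
tracemem(results)                   # prints a message whenever 'results' is duplicated

f <- function(iter, res) mean(res$gamma[[1]])   # read-only use of the whole list

for (iter in 1:3) {
  f(iter, results)                  # passing the whole list marks it as referenced...
  results$gamma[[1]] <- rnorm(1e6)  # ...so this modification forces a duplicate
}

# Passing only the pieces a function needs, as described above, avoids that:
g <- function(iter, gamma1) mean(gamma1)
for (iter in 1:3) {
  g(iter, results$gamma[[1]])
  results$gamma[[1]] <- rnorm(1e6)  # no duplication of the whole 'results' list
}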
Re: [R] R Memory Usage Concerns
On Mon, Sep 14, 2009 at 10:01 PM, Henrik Bengtsson h...@stat.berkeley.edu wrote: As already suggested, you're (much) better off if you specify colClasses, e.g. tab <- read.table("~/20090708.tab", colClasses=c("factor", "double", "double")); Otherwise, R has to load all the data, make a best guess of the column classes, and then coerce (which requires a copy). Thanks Henrik, I tried this as well as a variant that another user sent me privately. When I tell R the colClasses, it does a much better job of allocating memory (ending up with 96M of RSS memory, which isn't great but is definitely acceptable). A couple of notes I made from testing some variants, if anyone else is interested: * giving it an nrows argument doesn't help it allocate less memory (just a guess, but maybe because it's trying the powers-of-two allocation strategy in both cases) * there's no difference in memory usage between telling it a column is numeric vs double * when telling it the types in advance, loading the table is much, much faster Maybe if I gather some more fortitude in the future, I'll poke around at the internals, since I'm still curious where the extra memory is going. Is that just the overhead of allocating a full object for each value (i.e. rather than just a double[] or whatever)? -- Evan Klitzke e...@eklitzke.org :wq __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
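A sketch of the kind of comparison being reported here (same file and column types as in this thread): gc() and object.size() give the R-side picture, while top/ps gives the operating system's view:

cc <- c("factor", "double", "double")

system.time(tab1 <- read.table("~/20090708.tab"))                   # column classes guessed
system.time(tab2 <- read.table("~/20090708.tab", colClasses = cc))  # column classes declared

object.size(tab1)   # the final objects are about the same size either way;
object.size(tab2)   # the difference shows up in temporaries and in peak usage
gc()                # the "max used" column reflects that peak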
Re: [R] R Memory Usage Concerns
On Tue, 15 Sep 2009, Evan Klitzke wrote: On Mon, Sep 14, 2009 at 10:01 PM, Henrik Bengtsson h...@stat.berkeley.edu wrote: As already suggested, you're (much) better off if you specify colClasses, e.g. tab - read.table(~/20090708.tab, colClasses=c(factor, double, double)); Otherwise, R has to load all the data, make a best guess of the column classes, and then coerce (which requires a copy). Thanks Henrik, I tried this as well as a variant that another user sent me privately. When I tell R the colClasses, it does a much better job of allocating memory (ending up with 96M of RSS memory, which isn't great but is definitely acceptable). A couple of notes I made from testing some variants, if anyone else is interested: * giving it an nrows argument doesn't help it allocate less memory (just a guess, but maybe because it's trying the powers-of-two allocation strategy in both cases) * there's no difference in memory usage between telling it a column is numeric vs double Because they are the same type * when telling it the types in advance, loading the table is much, much faster Indeed. Maybe if I gather some more fortitude in the future, I'll poke around at the internals and see where the extra memory is going, since I'm still curious where the extra memory is going. Is that just the overhead of allocating a full object for each value (i.e. rather than just a double[] or whatever)? No, because it doesn't allocate a full object for each value, it does just allocate a double[] plus a constant amount of overhead. R doesn't have scalar types so there isn't even such a thing as an object for a single value, just vectors with a single element. R will use more than the object size for the data set, because it makes temporary copies of things. -thomas Thomas Lumley Assoc. Professor, Biostatistics tlum...@u.washington.eduUniversity of Washington, Seattle __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R Memory Usage Concerns
Hello, I do not know whether my package colbycol may help you. It can help you read files that would not have fitted into memory otherwise. Internally, as the name indicates, data is read into R in a column by column fashion. IO times increase but you need just a fraction of intermediate memory to read the files. Best regards, Carlos J. Gil Bellosta http://www.datanalytics.com On Tue, 2009-09-15 at 00:10 -0700, Evan Klitzke wrote: On Mon, Sep 14, 2009 at 10:01 PM, Henrik Bengtsson h...@stat.berkeley.edu wrote: As already suggested, you're (much) better off if you specify colClasses, e.g. tab - read.table(~/20090708.tab, colClasses=c(factor, double, double)); Otherwise, R has to load all the data, make a best guess of the column classes, and then coerce (which requires a copy). Thanks Henrik, I tried this as well as a variant that another user sent me privately. When I tell R the colClasses, it does a much better job of allocating memory (ending up with 96M of RSS memory, which isn't great but is definitely acceptable). A couple of notes I made from testing some variants, if anyone else is interested: * giving it an nrows argument doesn't help it allocate less memory (just a guess, but maybe because it's trying the powers-of-two allocation strategy in both cases) * there's no difference in memory usage between telling it a column is numeric vs double * when telling it the types in advance, loading the table is much, much faster Maybe if I gather some more fortitude in the future, I'll poke around at the internals and see where the extra memory is going, since I'm still curious where the extra memory is going. Is that just the overhead of allocating a full object for each value (i.e. rather than just a double[] or whatever)? __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] R Memory Usage Concerns
Hello all, To start with, these measurements are on Linux with R 2.9.2 (64-bit build) and Python 2.6 (also 64-bit). I've been investigating R for some log file analysis that I've been doing. I'm coming at this from the angle of a programmer whose primarily worked in Python. As I've been playing around with R, I've noticed that R seems to use a *lot* of memory, especially compared to Python. Here's an example of what I'm talking about. I have a sample data file whose characteristics are like this: [e...@t500 ~]$ ls -lh 20090708.tab -rw-rw-r-- 1 evan evan 63M 2009-07-08 20:56 20090708.tab [e...@t500 ~]$ head 20090708.tab spice 1247036405.04 0.0141088962555 spice 1247036405.01 0.046797990799 spice 1247036405.13 0.0137498378754 spice 1247036404.87 0.0594480037689 spice 1247036405.02 0.0170919895172 topic 1247036404.74 0.512196063995 user_details 1247036404.64 0.242133140564 spice 1247036405.23 0.0408620834351 biz_details 1247036405.04 0.40732884407 spice 1247036405.35 0.0501029491425 [e...@t500 ~]$ wc -l 20090708.tab 1797601 20090708.tab So it's basically a CSV file (actually, space delimited) where all of the lines are three columns, a low-cardinality string, a double, and a double. The file itself is 63M. Python can load all of the data from the file really compactly (source for the script at the bottom of the message): [e...@t500 ~]$ python code/scratch/pymem.py VIRT = 25230, RSS = 860 VIRT = 81142, RSS = 55825 So this shows that my Python process starts out at 860K RSS memory before doing any processing, and ends at 55M of RSS memory. This is pretty good, actually it's better than the size of the file, since a double can be stored more compactly than the textual data stored in the data file. Since I'm new to R I didn't know how to read /proc and so forth, so instead I launched an R repl and used ps to record the RSS memory usage before and after running the following statement: tab - read.table(~/20090708.tab) The numbers I measured were: VIRT = 176820, RSS = 26180 (just after starting the repl) VIRT = 414284, RSS = 263708 (after executing the command) This kind of concerns me. I can understand why R uses more memory at startup, since it's launching a full repl which my Python script wasn't doing. But I would have expected the memory usage to not have grown more like Python did after loading the data. In fact, R ought to be able to use less memory, since the first column is textual and has low cardinality (I think 7 distinct values), so storing it as a factor should be very memory efficient. For the things that I want to use R for, I know I'll be processing much larger datasets, and at the rate that R is consuming memory it may not be possible to fully load the data into memory. I'm concerned that it may not be worth pursuing learning R if it's possible to load the data into memory using something like Python but not R. I don't want to overlook the possibility that I'm overlooking something, since I'm new to the language. Can anyone answer for me: * What is R doing with all of that memory? * Is there something I did wrong? Is there a more memory-efficient way to load this data? * Are there R modules that can store large data-sets in a more memory-efficient way? Can anyone relate their experiences with them? 
For reference, here's the Python script I used to measure Python's memory usage: import os def show_mem(): statm = open('/proc/%d/statm' % os.getpid()).read() print 'VIRT = %s, RSS = %s' % tuple(statm.split(' ')[:2]) def read_data(fname): servlets = [] timestamps = [] elapsed = [] for line in open(fname, 'r'): s, t, e = line.strip().split(' ') servlets.append(s) timestamps.append(float(t)) elapsed.append(float(e)) show_mem() if __name__ == '__main__': show_mem() read_data('/home/evan/20090708.tab') -- Evan Klitzke e...@eklitzke.org :wq __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R Memory Usage Concerns
When you read your file into R, show the structure of the object: str(tab) also the size of the object: object.size(tab) This will tell you what your data looks like and the size taken in R. Also in read.table, use colClasses to define what the format of the data is; may make it faster. You might want to force a garbage collection 'gc()' to see if that frees up any memory. If your input is about 2M lines and it looks like there are three column (alpha, numeric, numeric), I would guess that you will probably have an object.size of about 50MB. This information would help. On Mon, Sep 14, 2009 at 11:11 PM, Evan Klitzke e...@eklitzke.org wrote: Hello all, To start with, these measurements are on Linux with R 2.9.2 (64-bit build) and Python 2.6 (also 64-bit). I've been investigating R for some log file analysis that I've been doing. I'm coming at this from the angle of a programmer whose primarily worked in Python. As I've been playing around with R, I've noticed that R seems to use a *lot* of memory, especially compared to Python. Here's an example of what I'm talking about. I have a sample data file whose characteristics are like this: [e...@t500 ~]$ ls -lh 20090708.tab -rw-rw-r-- 1 evan evan 63M 2009-07-08 20:56 20090708.tab [e...@t500 ~]$ head 20090708.tab spice 1247036405.04 0.0141088962555 spice 1247036405.01 0.046797990799 spice 1247036405.13 0.0137498378754 spice 1247036404.87 0.0594480037689 spice 1247036405.02 0.0170919895172 topic 1247036404.74 0.512196063995 user_details 1247036404.64 0.242133140564 spice 1247036405.23 0.0408620834351 biz_details 1247036405.04 0.40732884407 spice 1247036405.35 0.0501029491425 [e...@t500 ~]$ wc -l 20090708.tab 1797601 20090708.tab So it's basically a CSV file (actually, space delimited) where all of the lines are three columns, a low-cardinality string, a double, and a double. The file itself is 63M. Python can load all of the data from the file really compactly (source for the script at the bottom of the message): [e...@t500 ~]$ python code/scratch/pymem.py VIRT = 25230, RSS = 860 VIRT = 81142, RSS = 55825 So this shows that my Python process starts out at 860K RSS memory before doing any processing, and ends at 55M of RSS memory. This is pretty good, actually it's better than the size of the file, since a double can be stored more compactly than the textual data stored in the data file. Since I'm new to R I didn't know how to read /proc and so forth, so instead I launched an R repl and used ps to record the RSS memory usage before and after running the following statement: tab - read.table(~/20090708.tab) The numbers I measured were: VIRT = 176820, RSS = 26180 (just after starting the repl) VIRT = 414284, RSS = 263708 (after executing the command) This kind of concerns me. I can understand why R uses more memory at startup, since it's launching a full repl which my Python script wasn't doing. But I would have expected the memory usage to not have grown more like Python did after loading the data. In fact, R ought to be able to use less memory, since the first column is textual and has low cardinality (I think 7 distinct values), so storing it as a factor should be very memory efficient. For the things that I want to use R for, I know I'll be processing much larger datasets, and at the rate that R is consuming memory it may not be possible to fully load the data into memory. I'm concerned that it may not be worth pursuing learning R if it's possible to load the data into memory using something like Python but not R. 
I don't want to rule out the possibility that I'm simply overlooking something, since I'm new to the language. Can anyone answer for me:

* What is R doing with all of that memory?
* Is there something I did wrong? Is there a more memory-efficient way to load this data?
* Are there R modules that can store large data-sets in a more memory-efficient way? Can anyone relate their experiences with them?

For reference, here's the Python script I used to measure Python's memory usage: [...]

-- Evan Klitzke e...@eklitzke.org :wq

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

-- Jim Holtman
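A minimal sketch of the diagnostics suggested above, assuming the space-delimited file described in the post (the column classes here are only a guess at what the file contains):

tab <- read.table("~/20090708.tab",
                  colClasses = c("factor", "numeric", "numeric"))
str(tab)                               # column types and a preview of the values
print(object.size(tab), units = "Mb")  # size of the object itself, in megabytes
gc()                                   # force a garbage collection and report R's heap usage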
Re: [R] R Memory Usage Concerns
And, by the way, factors take up _more_ memory than character vectors:

> object.size(sample(c("a", "b"), 1000, replace=TRUE))
4088 bytes
> object.size(factor(sample(c("a", "b"), 1000, replace=TRUE)))
4296 bytes

On Mon, Sep 14, 2009 at 11:35 PM, jim holtman jholt...@gmail.com wrote:
When you read your file into R, show the structure of the object: str(tab) [...]

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R Memory Usage Concerns
On Mon, Sep 14, 2009 at 8:35 PM, jim holtman jholt...@gmail.com wrote:
When you read your file into R, show the structure of the object: ...

Here's the data I get:

> tab <- read.table("~/20090708.tab")
> str(tab)
'data.frame':   1797601 obs. of  3 variables:
 $ V1: Factor w/ 6 levels "biz_details",..: 4 4 4 4 4 5 6 4 1 4 ...
 $ V2: num 1.25e+09 1.25e+09 1.25e+09 1.25e+09 1.25e+09 ...
 $ V3: num 0.0141 0.0468 0.0137 0.0594 0.0171 ...
> object.size(tab)
35953640 bytes
> gc()
          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  119580  6.4    1489330  79.6  2380869 127.2
Vcells 6647905 50.8   17367032 132.5 16871956 128.8

Forcing a GC doesn't seem to free up an appreciable amount of memory (the memory usage reported by ps is about the same), but it's encouraging that the output from object.size shows that the object is small.

I am, however, a little bit skeptical of this: 1797601 * (4 + 8 + 8) = 35952020, which is awfully close to 35953640. My assumption is that the first column is mapped to a 32-bit integer, plus two 8-byte numbers for the doubles, plus a little bit of overhead for whatever structs back the objects and for the mapping of servlet name to its 32-bit representation (i.e. the string -> int mapping used by the factor). This estimate seems like it might be too conservative, since it implies that R allocated exactly as much memory for the lists as there were numbers in the list (typically in an interpreter like this you'd be allocating on order-of-two boundaries, i.e. sizeof(obj) 21; this is how Python lists work internally).

Is it possible that R is counting its memory usage naively, e.g. just adding up the size of all of the constituent objects, rather than the amount of space it actually allocated for those objects?

-- Evan Klitzke e...@eklitzke.org :wq

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
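The back-of-the-envelope figure above can be checked per column; a small sketch (the exact totals differ slightly because each R object carries a fixed header):

n <- 1797601
n * 4                 # factor column: one 32-bit integer code per row
n * 8                 # each numeric column: one 8-byte double per row
n * (4 + 8 + 8)       # 35952020, close to the 35953640 bytes reported for the data frame
# the same breakdown can be read directly off the loaded object:
# sapply(tab, object.size)   # per-column sizes, assuming tab has been read in as above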
Re: [R] R Memory Usage Concerns
On Mon, Sep 14, 2009 at 8:58 PM, Eduardo Leoni leoni...@msu.edu wrote:
And, by the way, factors take up _more_ memory than character vectors. [...]

I think this is just because you picked short strings. If the factor is mapping each string to a native integer type, the strings would have to be longer for you to notice:

> object.size(sample(c("a pretty long string", "another pretty long string"), 1000, replace=TRUE))
8184 bytes
> object.size(factor(sample(c("a pretty long string", "another pretty long string"), 1000, replace=TRUE)))
4560 bytes

-- Evan Klitzke e...@eklitzke.org :wq

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R Memory Usage Concerns
I think this is just because you picked short strings. If the factor is mapping the string to a native integer type, the strings would have to be longer for you to notice: [...]

No, it's probably because you have an older version of R, which doesn't have the global string cache. With the cache, identical strings in a character vector share storage, so the character vector no longer pays for each element separately:

> object.size(sample(c("a pretty long string", "another pretty long string"), 1000, replace=TRUE))
4136 bytes
> object.size(factor(sample(c("a pretty long string", "another pretty long string"), 1000, replace=TRUE)))
4344 bytes

Hadley

-- http://had.co.nz/

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R Memory Usage Concerns
its 32-bit representation. This seems like it might be too conservative for me, since it implies that R allocated exactly as much memory for the lists as there were numbers in the list (e.g. typically in an interpreter like this you'd be allocating on order-of-two boundaries, i.e. sizeof(obj) 21; this is how Python lists internally work). This is not how R vectors work. R data structures tend to be immutable, and so are designed somewhat differently to their python equivalents. Hadley -- http://had.co.nz/ __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
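A small sketch illustrating that point: an R atomic vector is allocated at exactly its requested length plus a fixed header, with no power-of-two padding, and "growing" it actually allocates a new vector and copies rather than extending in place:

object.size(numeric(1000))   # roughly 8000 bytes of data plus a small header
object.size(numeric(2000))   # roughly twice that: no over-allocation
x <- numeric(1000)
x[1001] <- 0                 # extending the vector triggers a fresh allocation and a copy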
Re: [R] R Memory Usage Concerns
As already suggested, you're (much) better off if you specify colClasses, e.g.

tab <- read.table("~/20090708.tab", colClasses=c("factor", "double", "double"));

Otherwise, R has to load all the data, make a best guess at the column classes, and then coerce (which requires a copy).

/Henrik

On Mon, Sep 14, 2009 at 9:26 PM, Evan Klitzke e...@eklitzke.org wrote:
On Mon, Sep 14, 2009 at 8:35 PM, jim holtman jholt...@gmail.com wrote: When you read your file into R, show the structure of the object: ...
Here's the data I get: [...]

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
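A rough way to see the difference is simply to time the two calls; a sketch, assuming the same file as above (per ?read.table, supplying nrows as a mild over-estimate also helps memory usage):

system.time(tab1 <- read.table("~/20090708.tab"))      # type guessing plus coercion copies
system.time(tab2 <- read.table("~/20090708.tab",
                               colClasses = c("factor", "double", "double"),
                               nrows = 1800000))        # classes known up front
identical(dim(tab1), dim(tab2))                         # same result either way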
Re: [R] memory usage grows too fast
Thanks for Peter, William, and Hadley's help. Your code is much more concise than mine. :P

Both William's and Hadley's suggestions are the same. Here is their code:

f <- function(dataMatrix) rowMeans(dataMatrix == "02")

And Peter's code is the following:

apply(yourMatrix, 1, function(x) length(x[x == yourPattern])) / ncol(yourMatrix)

In terms of running time, the first one ran faster than the latter on my dataset (2.5 mins vs. 6.4 mins). The memory consumption of the first one, however, is much higher than the latter (8G vs. ~3G).

Any thoughts? My guess is that rowMeans created extra copies to perform its calculation, but I am not so sure. I am also interested in understanding ways to handle memory issues. Hope someone can shed light on this for me. :)

Best,
Mike

-Original Message-
From: Peter Alspach [mailto:palsp...@hortresearch.co.nz]
Sent: Thursday, May 14, 2009 4:47 PM
To: Ping-Hsun Hsieh
Subject: RE: [R] memory usage grows too fast

Tena koe Mike

If I understand you correctly, you should be able to use something like:

apply(yourMatrix, 1, function(x) length(x[x == yourPattern])) / ncol(yourMatrix)

I see you've divided by nrow(yourMatrix), so perhaps I am missing something.

HTH ...

Peter Alspach

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Ping-Hsun Hsieh
Sent: Friday, 15 May 2009 11:22 a.m.
To: r-help@r-project.org
Subject: [R] memory usage grows too fast

Hi All,

I have a 1000x100 matrix. The calculation I would like to do is actually very simple: for each row, calculate the frequency of a given pattern. [...]

The contents of this e-mail are confidential and may be ...{{dropped:14}}

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] memory usage grows too fast
Hi William,

Thanks for the comments and explanation. It is really good to know the details of rowMeans. I did modify Peter's code from length(x[x=="02"]) to sum(x=="02"), though it improved things by only a few seconds. :)

Best,
Mike

-Original Message-
From: William Dunlap [mailto:wdun...@tibco.com]
Sent: Friday, May 15, 2009 10:09 AM
To: Ping-Hsun Hsieh
Subject: RE: [R] memory usage grows too fast

rowMeans(dataMatrix == "02") must
 (a) make a logical matrix the dimensions of dataMatrix in which to put the result of dataMatrix == "02" (4 bytes/logical element), and
 (b) make a double precision matrix (8 bytes/element) the size of that logical matrix, because rowMeans uses some C code that only works on doubles.

apply(dataMatrix, 1, function(x) length(x[x == "02"]) / ncol(dataMatrix))

never has to make any copies of the entire matrix. It extracts a row at a time, and when it is done with the row, the memory used for working on the row is available for other uses. Note that it would probably be a tad faster if it were changed to

apply(dataMatrix, 1, function(x) sum(x == "02")) / ncol(dataMatrix)

as sum(logicalVector) is the same as length(x[logicalVector]) and there is no need to compute ncol(dataMatrix) more than once.

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

-Original Message-
From: Ping-Hsun Hsieh [mailto:hsi...@ohsu.edu]
Sent: Friday, May 15, 2009 9:58 AM
To: Peter Alspach; William Dunlap; hadley wickham
Cc: r-help@r-project.org
Subject: RE: [R] memory usage grows too fast

Thanks for Peter, William, and Hadley's help. Your code is much more concise than mine. :P [...]

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
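One way to get most of rowMeans()'s speed without materialising a full logical and double copy of the whole matrix is to process the rows in blocks, which bounds the size of the temporaries. This is a sketch of that idea, not something proposed in the thread; the block size and the "02" pattern are only illustrative:

occurrence_rate <- function(dataMatrix, pattern = "02", block = 10000) {
  n <- nrow(dataMatrix)
  out <- numeric(n)
  for (s in seq(1, n, by = block)) {
    idx <- s:min(s + block - 1, n)
    # only a block-sized logical matrix is ever allocated at once
    out[idx] <- rowMeans(dataMatrix[idx, , drop = FALSE] == pattern)
  }
  out
}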
[R] memory usage grows too fast
Hi All,

I have a 1000x100 matrix. The calculation I would like to do is actually very simple: for each row, calculate the frequency of a given pattern. For example, a toy dataset is as follows.

Col1 Col2 Col3 Col4
01   02   02   00   => Freq of "02" is 0.5
02   02   02   01   => Freq of "02" is 0.75
00   02   01   01
...

My code to find the pattern "02" is quite simple:

OccurrenceRate_Fun <- function(dataMatrix)
{
  tmp <- NULL
  tmpMatrix <- apply(dataMatrix, 1, match, "02")
  for (i in 1:ncol(tmpMatrix))
  {
    tmpRate <- table(tmpMatrix[, i])[[1]] / nrow(tmpMatrix)
    tmp <- c(tmp, tmpRate)
  }
  rm(tmpMatrix)
  rm(tmpRate)
  return(tmp)
  gc()
}

The problem is that the memory usage grows very fast and is hard to handle on machines with less RAM. Could anyone please give me some comments on how to reduce the space complexity of this calculation?

Thanks,
Mike

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] memory usage grows too fast
On Thu, May 14, 2009 at 6:21 PM, Ping-Hsun Hsieh hsi...@ohsu.edu wrote:

Hi All, I have a 1000x100 matrix. The calculation I would like to do is actually very simple: for each row, calculate the frequency of a given pattern. [...] The problem is the memory usage grows very fast and is hard to handle on machines with less RAM. Could anyone please give me some comments on how to reduce the space complexity of this calculation?

rowMeans(dataMatrix == "02") ?

Hadley

-- http://had.co.nz/

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
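On the toy data from the original post, that one-liner gives exactly the frequencies asked for; a small check with the matrix typed in by hand:

m <- matrix(c("01", "02", "02", "00",
              "02", "02", "02", "01",
              "00", "02", "01", "01"),
            nrow = 3, byrow = TRUE)
rowMeans(m == "02")   # 0.50 0.75 0.25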
[R] R memory usage and size limits
I have a general question about R's usage or memory and what limits exist on the size of datasets it can deal with. My understanding was that all object in a session are held in memory. This implies that you're limited in the size of datasets that you can process by the amount of memory you've got access to (be it physical or paging). Is this true? Or does R store objects on disk and page them in as parts are needed in the way that SAS does? Are there 64 bit versions of R that can therefore deal with much larger objects? Many thanks. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R memory usage and size limits
Please read ?Memory-limits and the R-admin manual for basic information. On Thu, 5 Feb 2009, Tom Quarendon wrote: I have a general question about R's usage or memory and what limits exist on the size of datasets it can deal with. My understanding was that all object in a session are held in memory. This implies that you're limited in the size of datasets that you can process by the amount of memory you've got access to (be it physical or paging). Is this true? Or does R store objects on disk and page them in as parts are needed in the way that SAS does? That's rather a false dichotomy: paging uses the disk, so the distinction is if R implemented its own virtual memory system or uses the OS's one (the latter). There are also interfaces to DBMSs for use with large datasets: see the R-data manual and also look at the package list in the FAQ. Are there 64 bit versions of R that can therefore deal with much larger objects? Yes, there have been 64-bit versions of R for many years, and they are in routine use on very large problems. Many thanks. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Brian D. Ripley, rip...@stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UKFax: +44 1865 272595 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
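The DBMS route mentioned above can look roughly like the following, using DBI with the RSQLite backend; the file, table, and column names here are made up for illustration:

library(RSQLite)                       # loads DBI as well
con <- dbConnect(SQLite(), dbname = "bigdata.sqlite")
# load the flat file into the database once (this step could also be done in chunks)
dbWriteTable(con, "big", read.csv("mydata.csv"))
# afterwards, pull in only the rows and columns actually needed for each analysis,
# letting the database do the filtering instead of holding everything in R's memory
subset_df <- dbGetQuery(con, "SELECT * FROM big WHERE value > 100")
dbDisconnect(con)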
[R] Problems with R memory usage on Linux
Hello all,

I'm working with a large data-set, and upgraded my RAM to 4GB to help with the memory use. I've got a 32-bit kernel with 64GB memory support compiled in. gnome-system-monitor and free both show the full 4GB as being available.

In R I was doing some processing and I got the following message (when collecting 100 307200*8 dataframes into a single data-frame, for plotting):

Error: cannot allocate vector of size 2.3 Mb

So I checked the R memory usage:

$ ps -C R -o size
   SZ
3102548

I tried removing some objects and running gc(). R then shows much less memory being used:

$ ps -C R -o size
   SZ
2732124

which should give me an extra 300MB in R. I still get the same error about R being unable to allocate another 2.3MB, even though I deleted well over 2.3MB of objects...

Any suggestions as to how to get around this? Is the only way to use all 4GB in R to use a 64-bit kernel?

Thanks all,
B. Bogart

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Problems with R memory usage on Linux
See ?Memory-size On Wed, 15 Oct 2008, B. Bogart wrote: Hello all, I'm working with a large data-set, and upgraded my RAM to 4GB to help with the mem use. I've got a 32bit kernel with 64GB memory support compiled in. gnome-system-monitor and free both show the full 4GB as being available. In R I was doing some processing and I got the following message (when collecting 100 307200*8 dataframes into a single data-frame (for plotting): Error: cannot allocate vector of size 2.3 Mb So I checked the R memory usage: $ ps -C R -o size SZ 3102548 I tried removing some objects and running gc() R then shows much less memory being used: $ ps -C R -o size SZ 2732124 Which should give me an extra 300MB in R. I still get the same error about R being unable to allocate another 2.3MB. I deleted well over 2.3MB of objects... Any suggestions as to get around this? Is the only way to use all 4GB in R to use a 64bit kernel? Thanks all, B. Bogart __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UKFax: +44 1865 272595 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Problems with R memory usage on Linux
Doesn't work.

\misiek

Prof Brian Ripley wrote:
See ?Memory-size
On Wed, 15 Oct 2008, B. Bogart wrote: [...]

[[alternative HTML version deleted]]

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Problems with R memory usage on Linux
Or ?Memory-limits (and the posting guide of course). On Wed, 15 Oct 2008, Prof Brian Ripley wrote: See ?Memory-size On Wed, 15 Oct 2008, B. Bogart wrote: Hello all, I'm working with a large data-set, and upgraded my RAM to 4GB to help with the mem use. I've got a 32bit kernel with 64GB memory support compiled in. gnome-system-monitor and free both show the full 4GB as being available. In R I was doing some processing and I got the following message (when collecting 100 307200*8 dataframes into a single data-frame (for plotting): Error: cannot allocate vector of size 2.3 Mb So I checked the R memory usage: $ ps -C R -o size SZ 3102548 I tried removing some objects and running gc() R then shows much less memory being used: $ ps -C R -o size SZ 2732124 Which should give me an extra 300MB in R. I still get the same error about R being unable to allocate another 2.3MB. I deleted well over 2.3MB of objects... Any suggestions as to get around this? Is the only way to use all 4GB in R to use a 64bit kernel? Thanks all, B. Bogart __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UKFax: +44 1865 272595 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UKFax: +44 1865 272595 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] reducing memory usage WAS: Problems with R memory usage on Linux
Hello,

I have read the R memory pages. I realized after my post that I would not have enough memory to accomplish this task.

The command I'm using to convert the list into a data-frame is as such:

som <- do.call(rbind, somlist)

where som is the dataframe resulting from combining all the dataframes in somlist.

Is there a way I can remove each item from the list and gc() once it has been collected into the som data frame? That way the memory usage should stay about the same, rather than doubling or tripling.

Any other suggestions on reducing memory usage? (I'm already running blackbox and a single terminal to do the job.)

I do have enough memory to store the somlist twice over, but the do.call bails before it's done, so I suppose it uses a workspace, meaning I need more than 2x the space of the somlist to collect it? Is there another function that does the same thing but only uses 2x the size of somlist in memory?

Thanks for your help,

Prof Brian Ripley wrote:
Or ?Memory-limits (and the posting guide of course).
On Wed, 15 Oct 2008, Prof Brian Ripley wrote:
See ?Memory-size
On Wed, 15 Oct 2008, B. Bogart wrote: [...]

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
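One way to do roughly what is asked above, peeling elements off the list as they are consumed so their memory becomes collectable, is sketched below. This is not an answer given in the thread, only an illustration under the assumption that somlist holds the data frames; it trades speed for memory, since repeated rbind() is slow, and pre-allocating the full-size result and filling it block by block is the other common approach:

som <- somlist[[1]]
somlist[[1]] <- NULL                 # drop the element so its memory can be reclaimed
while (length(somlist) > 0) {
  som <- rbind(som, somlist[[1]])    # grow the result by one piece
  somlist[[1]] <- NULL               # release the piece just consumed
  gc()                               # optional: encourage R to return freed pages sooner
}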
[R] Memory usage
Hello,

I have to aggregate a data.frame of 16MB (object size). After some minutes I get the error message "cannot allocate vector of size 64.5MB". My computer has 4GB of physical memory under Windows Vista. I have tested the same command on another computer with the same OS and 2GB RAM: in nearly 2 sec I get the result without problems.

Thanks

buch <- read.delim("Y2006_1.csv", sep=";", as.is=TRUE, header=TRUE, dec=",")
ana01 <- aggregate(buch[, c("VALUELW", "LZLW", "SZLW")],
                   by=data.frame(buch$PRODGRP, buch$LAND1, buch$KUNDE1), sum)

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
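No reply appears in this thread, but for plain group sums a lighter-weight alternative to aggregate() is rowsum(), which does the per-group summing in C with fewer intermediate copies. A sketch, assuming the column names from the post:

grp <- interaction(buch$PRODGRP, buch$LAND1, buch$KUNDE1, drop = TRUE)
ana01 <- rowsum(buch[, c("VALUELW", "LZLW", "SZLW")], group = grp)
# each row of ana01 is one PRODGRP x LAND1 x KUNDE1 combination; the group label
# ends up in the row names rather than in separate columns as aggregate() would give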