Re: [R] Adding column to dataframe
To update on this: I ran the same command on a grid of computers with 32 GB of RAM, and it completed in 15 seconds, compared with ~20 minutes on my desktop. It was simply a RAM issue, as suspected by a few on the list here.

Thanks

--
View this message in context: http://r.789695.n4.nabble.com/Adding-column-to-dataframe-tp2330556p2339076.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Adding column to dataframe
> Provide a 'str' and 'object.size' of the object so that we can see what
> you are working with. My rule of thumb is that no single object should
> take more than 25-30% of memory, since copies may be made. So the reason
> things are taking 20 minutes is that you might be paging. It is always
> good to break the problem into pieces to see what is happening. Read in
> only 25% of the data and time it; then 50%, and so on. In any
> performance-related problem you need to determine where the "knee of
> the curve" is. Never undertake processing the large data file at once;
> start with some pieces and work up so that you know what to expect.
>
> On Wed, Aug 18, 2010 at 9:46 PM, Matt Cooper wrote:
> > 2) My specific problem with this dataset.
> >
> > I am essentially trying to convert a date and add it to a data frame. I
> > imagine any 'data manipulation on a column within dataframe into a new
> > column' will present the same issue, be it as.Date or anything else.
> >
> > I have a dataset, size
> >
> >> dim(morbidity)
> > [1] 1775683 264
> >
> > This was read in from a STATA .dta file. The dates have come in as the
> > number of ms from 1960 so I have the following to convert these to usable
> > dates.
> >
> > as.Date(morbidity$adm_date / (100*10*60*60*24), origin="1960-01-01")
> >
> > when I store this as a vector it is near instant, <5 seconds
> > test <- as.Date(etc)
> > when I place it over itself it takes ~20 minutes
> > morbidity$adm_date <- as.Date(etc)
> > when I place the vector over it (so no computation involved), or place it as
> > a new column it still takes ~20 minutes
> > morbidity$adm_date <- test
> > morbidity$new_col <- test
> > when I tried a cbind to add it that way it took >20 minutes
> > new_morb <- cbind(morbidity,test)
> >
> > Has anyone done something similar or know of a different command that should
> > work faster? I can't get my head around what R is doing, if it can create
> > the vector instantly then the computation is quite simple, I don't
> > understand why then adding it as a column to a dataframe can take that long.
> >
> > R64 bit on mac os x, 2.4 GHz dual core, 8gb ram so more than enough
> > resources.

Thanks Jim, results below.

The object is ~2.66 gig, so I guess there is no way to speed up working with the entire data frame. What I've done in the meantime is remove most of the columns, keeping only those I want to do data manipulations with (approx 40 of the 264), and this is much, much quicker. I will then join the rest back on at the end (I am expecting that to take a while!).

Any other feedback appreciated.

> object.size(morbidity)
2865834800 bytes
> str(morbidity)
'data.frame':   1775683 obs. of  264 variables:
 $ root            : chr  "2G5PQVQH5KYZY" "DDSMVGQEW9YXP" "DDSMVGQEW9YXP" "DDSMVGQEW9YXP" ...
 $ lpnot           : chr  "58GDA44MJSG3P" "4ZAM2XCK332NX" "5KX4FB6NTM831" "8CGXVV2A25C3M" ...
 $ hospital        : int  226 616 633 633 616 631 631 631 616 629 ...
 $ hosp_area       : int  2 1 1 1 1 1 1 1 1 1 ...
 $ hosp_region     : int  NA NA NA NA NA NA NA NA NA NA ...
 $ hosp_type       : int  1 2 2 2 2 2 2 2 2 2 ...
 $ hosp_category   : int  3 2 2 2 2 2 2 2 2 2 ...
 $ hsa             : int  NA NA NA NA NA NA NA NA NA NA ...
 $ adm_date        :Class 'Date'  num [1:1775683] 11079 11084 11534 11869 11051 ...
 $ adm_date_ddmwdob: int  0 0 450 785 0 70 122 125 0 91 ...
 $ sep_date        :Class 'Date'  num [1:1775683] 11089 11089 11534 11869 11057 ...
 $ sep_date_ddmwdob: int  10 5 450 785 6 70 122 136 23 98 ...
 $ adm_time        : int  2345 750 630 630 651 930 930 930 1146 1728 ...
 $ sep_time        : int  630 1715 1014 1013 1951 1630 1630 1000 941 1020 ...
 $ mf_los          : int  10 5 1 1 6 1 1 8 23 7 ...
 $ suburb          : chr  "WARBURTON COMMUNITY" "WESTMINSTER" "WESTMINSTER" "WESTMINSTER" ...
 $ postcode        : int  6431 6061 6061 6061 6150 6160 6160 6160 6016 6016 ...
 $ state           : int  5 5 5 5 5 5 5 5 5 5 ...
 $ loc_code        : chr  "E06001" "" "" "" ...
 $ lga             : int  NA NA NA NA NA NA NA NA NA NA ...
 $ dob_my          : num  1.27e+12 1.27e+12 1.27e+12 1.27e+12 1.27e+12 ...
 $ dob_ddmwdob     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ age             : int  0 0 1 2 0 0 0 0 0 0 ...
 $ age_group       : int  1 1 1 1 1 1 1 1 1 1 ...
 $ sex             : int  1 2 2 2 2 2 2 2 2 2 ...
 $ aborig          : int  1 4 4 4 4 4 4 4 4 4 ...
 $ cob             : int  1105 1105 1100 1101 1105 3 3 3 1105 1105 ...
 $ marital         : int  1 1 1 1 1 1 1 1 1 1 ...
 $ emp_stat        : int  1 1 8 1 1 1 1 1 1 1 ...
 $ interp          : int  2 2 2 2 2 2 2 2 2 2 ...
 $ occup           : int  96 96 NA NA 96 96 NA NA 96 NA ...
 $ src_ref         : int  0 0 NA NA 0 0 NA NA 0 NA ...
 $ pat_epi         : int  2 2 1 1 2 1 1 2 2 2 ...
 $ adm_from        : int  900 900 900 900 900 900 900 900 900 900 ...
 $ spl_adm         : int  25 39 50 50 39 84 25 25 39 25 ...
 $ spl_sep         : int  25 39 50 50 39 84 25 25 21 25 ...
 $ adm_type        : int  1 1 4
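
The column-subset workaround described in this message can be sketched roughly as follows. This is only an illustration on a made-up five-row data frame, not the poster's actual code: the object names `wide`, `work`, `rest` and `rejoined`, and the columns `id`, `x1` and `x2`, are hypothetical. It assumes the rows are neither reordered nor dropped while working on the narrow copy, so a plain cbind is enough to put the pieces back together; if rows could change, a merge on a key column would be needed instead.

```r
# Hypothetical miniature of a wide data frame (all names are made up).
wide <- data.frame(
  id     = 1:5,
  adm_ms = (0:4) * 86400000,        # milliseconds since 1960-01-01
  x1     = letters[1:5],
  x2     = c(2.5, 3.1, 0.4, 9.9, 1.0),
  stringsAsFactors = FALSE
)

# Split into the columns to be manipulated and everything else.
keep <- c("id", "adm_ms")
work <- wide[, keep, drop = FALSE]
rest <- wide[, setdiff(names(wide), keep), drop = FALSE]

# Do the manipulation on the narrow copy only.
work$adm_date <- as.Date(work$adm_ms / 86400000, origin = "1960-01-01")

# Row order is unchanged, so the remaining columns can be bound back on.
rejoined <- cbind(work, rest)

# Compare the footprint of the narrow copy with the full object.
print(object.size(work), units = "Kb")
print(object.size(rejoined), units = "Kb")
```

The idea is simply that each column assignment then touches a much smaller object, which matters when copies of the whole data frame do not fit comfortably in RAM.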
Re: [R] Adding column to dataframe
I think you are probably paging on your system. Turn on your performance metrics and look at it. If the object you are processing is all numeric, it would seem to require about 3.5GB of space (50% of available memory).

Provide a 'str' and 'object.size' of the object so that we can see what you are working with. My rule of thumb is that no single object should take more than 25-30% of memory, since copies may be made. So the reason things are taking 20 minutes is that you might be paging.

It is always good to break the problem into pieces to see what is happening. Read in only 25% of the data and time it; then 50%, and so on. In any performance-related problem you need to determine where the "knee of the curve" is. Never undertake processing the large data file at once; start with some pieces and work up so that you know what to expect.

On Wed, Aug 18, 2010 at 9:46 PM, Matt Cooper wrote:
> Two questions:
> 1) Are there any good R guides/sites with information/techniques for dealing
> with large datasets in R? (Large being ~2 mil rows and ~200 columns)
>
> 2) My specific problem with this dataset.
>
> I am essentially trying to convert a date and add it to a data frame. I
> imagine any 'data manipulation on a column within dataframe into a new
> column' will present the same issue, be it as.Date or anything else.
>
> I have a dataset, size
>
>> dim(morbidity)
> [1] 1775683 264
>
> This was read in from a STATA .dta file. The dates have come in as the
> number of ms from 1960 so I have the following to convert these to usable
> dates.
>
> as.Date(morbidity$adm_date / (100*10*60*60*24), origin="1960-01-01")
>
> when I store this as a vector it is near instant, <5 seconds
> test <- as.Date(etc)
> when I place it over itself it takes ~20 minutes
> morbidity$adm_date <- as.Date(etc)
> when I place the vector over it (so no computation involved), or place it as
> a new column it still takes ~20 minutes
> morbidity$adm_date <- test
> morbidity$new_col <- test
> when I tried a cbind to add it that way it took >20 minutes
> new_morb <- cbind(morbidity,test)
>
> Has anyone done something similar or know of a different command that should
> work faster? I can't get my head around what R is doing, if it can create
> the vector instantly then the computation is quite simple, I don't
> understand why then adding it as a column to a dataframe can take that long.
>
> R64 bit on mac os x, 2.4 GHz dual core, 8gb ram so more than enough
> resources.
>
> Thanks
> Matt

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
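
The two suggestions in this reply (estimate the memory footprint, then time the operation on growing fractions of the data) could be tried along the following lines. This is a rough sketch on synthetic data, not anything run in the thread: the data frame `df`, its columns, and the row count `n` are made-up stand-ins, and the timings on such a small object will not show the paging behaviour itself, only the method for looking for the "knee of the curve".

```r
# Back-of-envelope size if every cell were an 8-byte double,
# using the dimensions reported by dim(morbidity) in the thread.
est_bytes <- 1775683 * 264 * 8
est_gib   <- est_bytes / 1024^3
cat("Estimated size if all numeric:", round(est_gib, 2), "GiB\n")

# Synthetic stand-in for the real data (hypothetical names and size).
set.seed(1)
n  <- 20000
df <- data.frame(id = seq_len(n), adm_ms = runif(n, 0, 1e12))

# Time the same column assignment on 25%, 50%, 75% and 100% of the rows.
fractions <- c(0.25, 0.5, 0.75, 1)
timings <- sapply(fractions, function(f) {
  part <- df[seq_len(floor(f * n)), , drop = FALSE]
  system.time(
    part$adm_date <- as.Date(part$adm_ms / 86400000, origin = "1960-01-01")
  )[["elapsed"]]
})

result <- data.frame(
  fraction = fractions,
  rows     = floor(fractions * n),
  seconds  = timings
)
print(result)
```

If the elapsed time grows roughly in proportion to the number of rows and then jumps sharply at some fraction, that jump is a hint that the machine has started paging at that size.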
[R] Adding column to dataframe
Two questions:

1) Are there any good R guides/sites with information/techniques for dealing with large datasets in R? (Large being ~2 mil rows and ~200 columns)

2) My specific problem with this dataset.

I am essentially trying to convert a date and add it to a data frame. I imagine any 'data manipulation on a column within dataframe into a new column' will present the same issue, be it as.Date or anything else.

I have a dataset, size

> dim(morbidity)
[1] 1775683 264

This was read in from a STATA .dta file. The dates have come in as the number of ms from 1960 so I have the following to convert these to usable dates.

as.Date(morbidity$adm_date / (100*10*60*60*24), origin="1960-01-01")

when I store this as a vector it is near instant, <5 seconds
test <- as.Date(etc)

when I place it over itself it takes ~20 minutes
morbidity$adm_date <- as.Date(etc)

when I place the vector over it (so no computation involved), or place it as a new column it still takes ~20 minutes
morbidity$adm_date <- test
morbidity$new_col <- test

when I tried a cbind to add it that way it took >20 minutes
new_morb <- cbind(morbidity,test)

Has anyone done something similar or know of a different command that should work faster? I can't get my head around what R is doing: if it can create the vector instantly then the computation is quite simple, so I don't understand why adding it as a column to a dataframe can take that long.

R64 bit on mac os x, 2.4 GHz dual core, 8gb ram so more than enough resources.

Thanks
Matt
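
For reference, the date conversion described in this post can be checked on a few hand-picked values. This is a small sketch with made-up inputs; the names `ms_per_day`, `adm_ms` and `adm_date` are introduced here for illustration and are not from the original code. Note that 100*10*60*60*24 is the same number as 1000*60*60*24, the count of milliseconds in a day.

```r
# Milliseconds in one day; identical in value to 100*10*60*60*24.
ms_per_day <- 1000 * 60 * 60 * 24

# Hypothetical inputs: 0, 1 and 366 days after 1960-01-01, in milliseconds.
adm_ms <- c(0, 1, 366) * ms_per_day

# Convert milliseconds to days, then to Date with the 1960 origin.
adm_date <- as.Date(adm_ms / ms_per_day, origin = "1960-01-01")
print(adm_date)
# → "1960-01-01" "1960-01-02" "1961-01-01"
```

The last value lands on 1961-01-01 because 1960 is a leap year with 366 days.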