Re: [R] Adding column to dataframe

2010-08-25 Thread nzcoops

To update on this: I ran the same command on a grid of computers with 32 GB
of RAM, and it completed in 15 seconds, compared to the ~20 minutes on my
desktop.

Simply a RAM issue, as suspected by a few on the list here.

Thanks
-- 
View this message in context: 
http://r.789695.n4.nabble.com/Adding-column-to-dataframe-tp2330556p2339076.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Adding column to dataframe

2010-08-23 Thread Matt Cooper
>
> Provide a 'str' and 'object.size' of the object
> so that we can see what you are working with.  My rule of thumb is
> that no single object should take more than 25-30% of memory since
> copies may be made.  So the reason things are taking 20 minutes is
> that you might be paging.  It is always good to break the problem into
> pieces to see what is happening.  Read in only 25% of the data and
> time it; then 50% and so on.  In any performance-related problem you
> need to determine where the "knee of the curve" is.  Never undertake
> processing the large data file at once; start with some pieces and
> work up so that you know what to expect.
>
> On Wed, Aug 18, 2010 at 9:46 PM, Matt Cooper wrote:
> > 2) My specific problem with this dataset.
> >
> > I am essentially trying to convert a date and add it to a data frame. I
> > imagine any 'data manipulation on a column within dataframe into a new
> > column' will present the same issue, be it as.Date or anything else.
> >
> > I have a dataset, size
> >
> >> dim(morbidity)
> > [1] 1775683 264
> >
> > This was read in from a STATA .dta file. The dates have come in as the
> > number of ms from 1960 so I have the following to convert these to usable
> > dates.
> >
> > as.Date(morbidity$adm_date / (100*10*60*60*24), origin="1960-01-01")
> >
> > when I store this as a vector it is near instant, <5 seconds
> > test <- as.Date(etc)
> > when I place it over itself it takes ~20 minutes
> > morbidity$adm_date <- as.Date(etc)
> > when I place the vector over it (so no computation involved), or place it as
> > a new column it still takes ~20 minutes
> > morbidity$adm_date <- test
> > morbidity$new_col <- test
> > when I tried a cbind to add it that way it took >20 minutes
> > new_morb <- cbind(morbidity,test)
> >
> > Has anyone done something similar or know of a different command that should
> > work faster? I can't get my head around what R is doing, if it can create
> > the vector instantly then the computation is quite simple, I don't
> > understand why then adding it as a column to a dataframe can take that long.
> >
> > R64 bit on mac os x, 2.4 GHz dual core, 8gb ram so more than enough
> > resources.
>

Thanks Jim, results below.

~2.66 GB for the object, so I guess there is no way to speed up working with
that entire data frame. What I've done in the meantime is cut the columns
down to the ones I want to do data manipulations with (approx 40 of
the 264), and this is much, much quicker. I will then join them back on at
the end (I am expecting that to take a while!).

Any other feedback appreciated.
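
The workaround described above (work on a narrow subset of columns, then
join the rest back on) could be sketched roughly as follows. This is only an
illustration on a tiny stand-in data frame: the column names and first few
values are borrowed from the str() output in this message, the dates are
treated as plain day counts, and `stay_days` is a hypothetical derived
column, not something from the real dataset.

```r
# Hypothetical miniature of the wide data frame; names and values are
# borrowed from the str() output, dates treated as plain day counts.
morbidity <- data.frame(
  hospital = c(226L, 616L, 633L),
  adm_date = c(11079, 11084, 11534),
  sep_date = c(11089, 11089, 11534),
  sex      = c(1L, 2L, 2L)
)

# Split into the columns to be manipulated and everything else.
work_cols <- c("adm_date", "sep_date")
work <- morbidity[, work_cols, drop = FALSE]
rest <- morbidity[, setdiff(names(morbidity), work_cols), drop = FALSE]

# Example manipulation on the narrow frame (stay_days is hypothetical).
work$stay_days <- work$sep_date - work$adm_date

# Rows were never reordered or dropped, so a positional cbind() is enough.
rejoined <- cbind(rest, work)
print(rejoined)
```

The positional cbind() is only safe while both pieces keep the original row
order; if either piece is sorted or filtered in between, a keyed merge() on
an identifier column would be the safer choice.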

> object.size(morbidity)
2865834800 bytes
> str(morbidity)
'data.frame': 1775683 obs. of  264 variables:
 $ root            : chr  "2G5PQVQH5KYZY" "DDSMVGQEW9YXP" "DDSMVGQEW9YXP" "DDSMVGQEW9YXP" ...
 $ lpnot           : chr  "58GDA44MJSG3P" "4ZAM2XCK332NX" "5KX4FB6NTM831" "8CGXVV2A25C3M" ...
 $ hospital        : int  226 616 633 633 616 631 631 631 616 629 ...
 $ hosp_area       : int  2 1 1 1 1 1 1 1 1 1 ...
 $ hosp_region     : int  NA NA NA NA NA NA NA NA NA NA ...
 $ hosp_type       : int  1 2 2 2 2 2 2 2 2 2 ...
 $ hosp_category   : int  3 2 2 2 2 2 2 2 2 2 ...
 $ hsa             : int  NA NA NA NA NA NA NA NA NA NA ...
 $ adm_date        :Class 'Date'  num [1:1775683] 11079 11084 11534 11869 11051 ...
 $ adm_date_ddmwdob: int  0 0 450 785 0 70 122 125 0 91 ...
 $ sep_date        :Class 'Date'  num [1:1775683] 11089 11089 11534 11869 11057 ...
 $ sep_date_ddmwdob: int  10 5 450 785 6 70 122 136 23 98 ...
 $ adm_time        : int  2345 750 630 630 651 930 930 930 1146 1728 ...
 $ sep_time        : int  630 1715 1014 1013 1951 1630 1630 1000 941 1020 ...
 $ mf_los          : int  10 5 1 1 6 1 1 8 23 7 ...
 $ suburb          : chr  "WARBURTON COMMUNITY" "WESTMINSTER" "WESTMINSTER" "WESTMINSTER" ...
 $ postcode        : int  6431 6061 6061 6061 6150 6160 6160 6160 6016 6016 ...
 $ state           : int  5 5 5 5 5 5 5 5 5 5 ...
 $ loc_code        : chr  "E06001" "" "" "" ...
 $ lga             : int  NA NA NA NA NA NA NA NA NA NA ...
 $ dob_my          : num  1.27e+12 1.27e+12 1.27e+12 1.27e+12 1.27e+12 ...
 $ dob_ddmwdob     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ age             : int  0 0 1 2 0 0 0 0 0 0 ...
 $ age_group       : int  1 1 1 1 1 1 1 1 1 1 ...
 $ sex             : int  1 2 2 2 2 2 2 2 2 2 ...
 $ aborig          : int  1 4 4 4 4 4 4 4 4 4 ...
 $ cob             : int  1105 1105 1100 1101 1105 3 3 3 1105 1105 ...
 $ marital         : int  1 1 1 1 1 1 1 1 1 1 ...
 $ emp_stat        : int  1 1 8 1 1 1 1 1 1 1 ...
 $ interp          : int  2 2 2 2 2 2 2 2 2 2 ...
 $ occup           : int  96 96 NA NA 96 96 NA NA 96 NA ...
 $ src_ref         : int  0 0 NA NA 0 0 NA NA 0 NA ...
 $ pat_epi         : int  2 2 1 1 2 1 1 2 2 2 ...
 $ adm_from        : int  900 900 900 900 900 900 900 900 900 900 ...
 $ spl_adm         : int  25 39 50 50 39 84 25 25 39 25 ...
 $ spl_sep         : int  25 39 50 50 39 84 25 25 21 25 ...
 $ adm_type        : int  1 1 4

Re: [R] Adding column to dataframe

2010-08-19 Thread jim holtman
I think you are probably paging on your system.  Turn on your
performance metrics and look at it.  If the object you are processing
is all numeric, it would seem to require about 3.5GB of space (roughly
50% of available memory).  Provide a 'str' and 'object.size' of the object
so that we can see what you are working with.  My rule of thumb is
that no single object should take more than 25-30% of memory since
copies may be made.  So the reason things are taking 20 minutes is
that you might be paging.  It is always good to break the problem into
pieces to see what is happening.  Read in only 25% of the data and
time it; then 50% and so on.  In any performance-related problem you
need to determine where the "knee of the curve" is.  Never undertake
processing the large data file at once; start with some pieces and
work up so that you know what to expect.
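
The advice above (estimate the object's size, then time the operation on
growing pieces of the data) might look something like the sketch below. It
runs on a small synthetic all-numeric data frame, not the real file;
`make_df` and the 200000 x 20 dimensions are assumptions made purely for
illustration. The first line reproduces the "about 3.5GB" estimate from the
dimensions quoted in the thread, assuming 8 bytes per numeric value.

```r
# Size estimate if all 1775683 x 264 cells were 8-byte numerics, in GiB.
est_gib <- 1775683 * 264 * 8 / 1024^3
print(round(est_gib, 2))

# Hypothetical stand-in for the real data: a smaller all-numeric data frame.
make_df <- function(n_rows, n_cols = 20) {
  as.data.frame(matrix(runif(n_rows * n_cols), nrow = n_rows))
}
full <- make_df(200000)

# Time a column assignment on 25%, 50%, 75% and 100% of the rows and
# record the object size alongside, to look for the "knee of the curve".
fractions <- c(0.25, 0.50, 0.75, 1.00)
results <- do.call(rbind, lapply(fractions, function(f) {
  part <- full[seq_len(floor(f * nrow(full))), , drop = FALSE]
  new_col <- rnorm(nrow(part))
  size_mb <- as.numeric(object.size(part)) / 1024^2
  elapsed <- system.time(part$new_col <- new_col)[["elapsed"]]
  data.frame(fraction = f, rows = nrow(part),
             size_mb = size_mb, elapsed_s = elapsed)
}))
print(results)
```

On data this small the timings will all be near zero; the point is the
shape of the measurement. On the real file, a sudden jump in elapsed time
between two fractions would suggest the point where paging starts.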

On Wed, Aug 18, 2010 at 9:46 PM, Matt Cooper wrote:
> Two questions:
> 1) Are there any good R guides/sites with information/techniques for dealing
> with large datasets in R? (Large being ~2 mil rows and ~200 columns)
>
> 2) My specific problem with this dataset.
>
> I am essentially trying to convert a date and add it to a data frame. I
> imagine any 'data manipulation on a column within dataframe into a new
> column' will present the same issue, be it as.Date or anything else.
>
> I have a dataset, size
>
>> dim(morbidity)
> [1] 1775683     264
>
> This was read in from a STATA .dta file. The dates have come in as the
> number of ms from 1960 so I have the following to convert these to usable
> dates.
>
> as.Date(morbidity$adm_date / (100*10*60*60*24), origin="1960-01-01")
>
> when I store this as a vector it is near instant, <5 seconds
> test <- as.Date(etc)
> when I place it over itself it takes ~20 minutes
> morbidity$adm_date <- as.Date(etc)
> when I place the vector over it (so no computation involved), or place it as
> a new column it still takes ~20 minutes
> morbidity$adm_date <- test
> morbidity$new_col <- test
> when I tried a cbind to add it that way it took >20 minutes
> new_morb <- cbind(morbidity,test)
>
> Has anyone done something similar or know of a different command that should
> work faster? I can't get my head around what R is doing, if it can create
> the vector instantly then the computation is quite simple, I don't
> understand why then adding it as a column to a dataframe can take that long.
>
> R64 bit on mac os x, 2.4 GHz dual core, 8gb ram so more than enough
> resources.
>
> Thanks
> Matt



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?



[R] Adding column to dataframe

2010-08-18 Thread Matt Cooper
Two questions:
1) Are there any good R guides/sites with information/techniques for dealing
with large datasets in R? (Large being ~2 mil rows and ~200 columns)

2) My specific problem with this dataset.

I am essentially trying to convert a date and add it to a data frame. I
imagine any 'data manipulation on a column within dataframe into a new
column' will present the same issue, be it as.Date or anything else.

I have a dataset, size

> dim(morbidity)
[1] 1775683 264

This was read in from a STATA .dta file. The dates have come in as the
number of ms from 1960 so I have the following to convert these to usable
dates.

as.Date(morbidity$adm_date / (100*10*60*60*24), origin="1960-01-01")
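
For reference, the divisor in the call above, 100*10*60*60*24, evaluates to
86,400,000, which is the number of milliseconds in a day
(1000*60*60*24). A small check of the conversion, using made-up sample
values rather than anything from the real dataset, might look like this:

```r
# Milliseconds per day; the same value as 100*10*60*60*24 in the post.
ms_per_day <- 1000 * 60 * 60 * 24

# Hypothetical sample values: milliseconds elapsed since 1960-01-01.
stata_ms <- c(0, 86400000, 1262304000000)

# Convert to whole days, then to Date using the 1960 origin.
adm_date <- as.Date(stata_ms / ms_per_day, origin = "1960-01-01")
print(adm_date)
# [1] "1960-01-01" "1960-01-02" "2000-01-01"
```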

when I store this as a vector it is near instant, <5 seconds
test <- as.Date(etc)
when I place it over itself it takes ~20 minutes
morbidity$adm_date <- as.Date(etc)
when I place the vector over it (so no computation involved), or place it as
a new column it still takes ~20 minutes
morbidity$adm_date <- test
morbidity$new_col <- test
when I tried a cbind to add it that way it took >20 minutes
new_morb <- cbind(morbidity,test)
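
The comparison above can be made reproducible by wrapping each variant in
system.time(). The sketch below does that on a small synthetic data frame
(100000 rows, two columns, both assumptions made for illustration), so the
absolute numbers will be nothing like the ~20 minutes reported here; it only
shows one way to measure each step separately.

```r
# Hypothetical stand-in data: ms since 1960-01-01 plus one filler column.
n <- 100000
morbidity <- data.frame(
  adm_date = seq(0, by = 86400000, length.out = n),
  other    = runif(n)
)

# 1) Compute the converted vector on its own.
t_vector <- system.time(
  test <- as.Date(morbidity$adm_date / 86400000, origin = "1960-01-01")
)
# 2) Add the vector as a new column.
t_new_col <- system.time(morbidity$new_col <- test)
# 3) Overwrite the existing column with the vector.
t_overwrite <- system.time(morbidity$adm_date <- test)
# 4) Bind the vector on with cbind().
t_cbind <- system.time(new_morb <- cbind(morbidity, test))

# Collect the elapsed (wall-clock) seconds for each variant.
timings <- c(
  vector    = t_vector[["elapsed"]],
  new_col   = t_new_col[["elapsed"]],
  overwrite = t_overwrite[["elapsed"]],
  cbind     = t_cbind[["elapsed"]]
)
print(timings)
```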

Has anyone done something similar, or does anyone know of a different command
that should work faster? I can't get my head around what R is doing: if it can
create the vector almost instantly, then the computation itself is quite
simple, so I don't understand why adding it as a column to a data frame can
take that long.

64-bit R on Mac OS X, 2.4 GHz dual core, 8 GB RAM, so more than enough
resources.

Thanks
Matt
