Re: [R] data manipulation and summaries with few million rows
Factors are you friend here: myData mydate gender mygroup id mygrp.f 1 2012-03-25 F A 1 1 2 2005-05-23 F B 2 2 3 2005-09-08 F B 2 2 4 2005-12-07 F B 2 2 5 2006-02-26 F C 2 3 6 2006-05-13 F C 2 3 7 2006-09-01 F C 2 3 8 2006-12-12 F D 2 4 9 2006-02-19 F D 2 4 10 2006-05-03 F D 2 4 11 2006-04-23 F D 2 4 12 2007-12-08 F D 2 4 13 2011-03-19 F D 2 4 14 2007-12-20 M A 3 1 15 2008-06-15 M A 3 1 16 2008-12-16 M A 3 1 17 2009-06-07 M B 3 2 18 2009-10-09 M B 3 2 19 2010-01-28 M B 3 2 20 2007-06-05 M A 4 1 # change 'mygroup' to a factor so you can use 'diff' to count the changes myData$mygrp.f - as.integer(factor(myData$mygroup)) # count the changes for each 'id' changes - tapply(myData$mygrp.f, myData$id, function(x){ + sum(diff(x) != 0) + }) changes 1 2 3 4 0 2 1 0 On Wed, Aug 24, 2011 at 12:48 PM, Juliet Hannah juliet.han...@gmail.com wrote: I have a data set with about 6 million rows and 50 columns. It is a mixture of dates, factors, and numerics. What I am trying to accomplish can be seen with the following simplified data, which is given as dput output below. head(myData) mydate gender mygroup id 1 2012-03-25 F A 1 2 2005-05-23 F B 2 3 2005-09-08 F B 2 4 2005-12-07 F B 2 5 2006-02-26 F C 2 6 2006-05-13 F C 2 For each id, I want to count the number of changes of the variable 'mygroup' that occur. For example, id=1 has 0 changes because it is observed only once. id=2 has 2 changes (B to C, and C to D). I also need to calculate the total observation time for each id using the variable mydate. In the end, I am trying to have a new data set in which each row has an id, days observed, number of changes, and gender. I made some simple summaries using data.table and plyr, but I'm stuck on this reformatting. Thanks for your help. myData - structure(list(mydate = c(2012-03-25, 2005-05-23, 2005-09-08, 2005-12-07, 2006-02-26, 2006-05-13, 2006-09-01, 2006-12-12, 2006-02-19, 2006-05-03, 2006-04-23, 2007-12-08, 2011-03-19, 2007-12-20, 2008-06-15, 2008-12-16, 2009-06-07, 2009-10-09, 2010-01-28, 2007-06-05), gender = c(F, F, F, F, F, F, F, F, F, F, F, F, F, M, M, M, M, M, M, M), mygroup = c(A, B, B, B, C, C, C, D, D, D, D, D, D, A, A, A, B, B, B, A), id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L)), .Names = c(mydate, gender, mygroup, id), class = data.frame, row.names = c(NA, -20L )) sessionInfo() R version 2.13.1 (2011-07-08) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Jim Holtman Data Munger Guru What is the problem that you are trying to solve? __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] data manipulation and summaries with few million rows
I have a data set with about 6 million rows and 50 columns. It is a mixture of dates, factors, and numerics. What I am trying to accomplish can be seen with the following simplified data, which is given as dput output below. head(myData) mydate gender mygroup id 1 2012-03-25 F A 1 2 2005-05-23 F B 2 3 2005-09-08 F B 2 4 2005-12-07 F B 2 5 2006-02-26 F C 2 6 2006-05-13 F C 2 For each id, I want to count the number of changes of the variable 'mygroup' that occur. For example, id=1 has 0 changes because it is observed only once. id=2 has 2 changes (B to C, and C to D). I also need to calculate the total observation time for each id using the variable mydate. In the end, I am trying to have a new data set in which each row has an id, days observed, number of changes, and gender. I made some simple summaries using data.table and plyr, but I'm stuck on this reformatting. Thanks for your help. myData - structure(list(mydate = c(2012-03-25, 2005-05-23, 2005-09-08, 2005-12-07, 2006-02-26, 2006-05-13, 2006-09-01, 2006-12-12, 2006-02-19, 2006-05-03, 2006-04-23, 2007-12-08, 2011-03-19, 2007-12-20, 2008-06-15, 2008-12-16, 2009-06-07, 2009-10-09, 2010-01-28, 2007-06-05), gender = c(F, F, F, F, F, F, F, F, F, F, F, F, F, M, M, M, M, M, M, M), mygroup = c(A, B, B, B, C, C, C, D, D, D, D, D, D, A, A, A, B, B, B, A), id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L)), .Names = c(mydate, gender, mygroup, id), class = data.frame, row.names = c(NA, -20L )) sessionInfo() R version 2.13.1 (2011-07-08) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] data manipulation and summaries with few million rows
Hi Juliet: Here's a Q D solution: # (1) plyr f - function(d) length(unique(d$mygroup)) - 1 ddply(myData, .(id), f) id V1 1 1 0 2 2 2 3 3 1 4 4 0 # (2) data.table myDT - data.table(myData, key = 'id') myDT[, list(nswitch = length(unique(mygroup)) - 1), by = 'id'] If one can switch back and forth between levels more than once, then the above is clearly not appropriate. A more robust method would be to employ rle() [run length encoding]: g - function(d) length(rle(d$mygroup)$lengths) - 1 ddply(myData, .(id), g)# gives the same answer as above myDT[, list(nswitch = length(rle(mygroup)$lengths) - 1), by = 'id'] # ditto HTH, Dennis On Wed, Aug 24, 2011 at 9:48 AM, Juliet Hannah juliet.han...@gmail.com wrote: I have a data set with about 6 million rows and 50 columns. It is a mixture of dates, factors, and numerics. What I am trying to accomplish can be seen with the following simplified data, which is given as dput output below. head(myData) mydate gender mygroup id 1 2012-03-25 F A 1 2 2005-05-23 F B 2 3 2005-09-08 F B 2 4 2005-12-07 F B 2 5 2006-02-26 F C 2 6 2006-05-13 F C 2 For each id, I want to count the number of changes of the variable 'mygroup' that occur. For example, id=1 has 0 changes because it is observed only once. id=2 has 2 changes (B to C, and C to D). I also need to calculate the total observation time for each id using the variable mydate. In the end, I am trying to have a new data set in which each row has an id, days observed, number of changes, and gender. I made some simple summaries using data.table and plyr, but I'm stuck on this reformatting. Thanks for your help. myData - structure(list(mydate = c(2012-03-25, 2005-05-23, 2005-09-08, 2005-12-07, 2006-02-26, 2006-05-13, 2006-09-01, 2006-12-12, 2006-02-19, 2006-05-03, 2006-04-23, 2007-12-08, 2011-03-19, 2007-12-20, 2008-06-15, 2008-12-16, 2009-06-07, 2009-10-09, 2010-01-28, 2007-06-05), gender = c(F, F, F, F, F, F, F, F, F, F, F, F, F, M, M, M, M, M, M, M), mygroup = c(A, B, B, B, C, C, C, D, D, D, D, D, D, A, A, A, B, B, B, A), id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L)), .Names = c(mydate, gender, mygroup, id), class = data.frame, row.names = c(NA, -20L )) sessionInfo() R version 2.13.1 (2011-07-08) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] data manipulation and summaries with few million rows
Thanks Dennis! I'll check this out. Just to clarify, I need the total number of switches/changes regardless of if that state had occurred in the past. So A-A-B-A, would have 2 changes: A to B and B to A. Thanks again. On Wed, Aug 24, 2011 at 1:28 PM, Dennis Murphy djmu...@gmail.com wrote: Hi Juliet: Here's a Q D solution: # (1) plyr f - function(d) length(unique(d$mygroup)) - 1 ddply(myData, .(id), f) id V1 1 1 0 2 2 2 3 3 1 4 4 0 # (2) data.table myDT - data.table(myData, key = 'id') myDT[, list(nswitch = length(unique(mygroup)) - 1), by = 'id'] If one can switch back and forth between levels more than once, then the above is clearly not appropriate. A more robust method would be to employ rle() [run length encoding]: g - function(d) length(rle(d$mygroup)$lengths) - 1 ddply(myData, .(id), g) # gives the same answer as above myDT[, list(nswitch = length(rle(mygroup)$lengths) - 1), by = 'id'] # ditto HTH, Dennis On Wed, Aug 24, 2011 at 9:48 AM, Juliet Hannah juliet.han...@gmail.com wrote: I have a data set with about 6 million rows and 50 columns. It is a mixture of dates, factors, and numerics. What I am trying to accomplish can be seen with the following simplified data, which is given as dput output below. head(myData) mydate gender mygroup id 1 2012-03-25 F A 1 2 2005-05-23 F B 2 3 2005-09-08 F B 2 4 2005-12-07 F B 2 5 2006-02-26 F C 2 6 2006-05-13 F C 2 For each id, I want to count the number of changes of the variable 'mygroup' that occur. For example, id=1 has 0 changes because it is observed only once. id=2 has 2 changes (B to C, and C to D). I also need to calculate the total observation time for each id using the variable mydate. In the end, I am trying to have a new data set in which each row has an id, days observed, number of changes, and gender. I made some simple summaries using data.table and plyr, but I'm stuck on this reformatting. Thanks for your help. myData - structure(list(mydate = c(2012-03-25, 2005-05-23, 2005-09-08, 2005-12-07, 2006-02-26, 2006-05-13, 2006-09-01, 2006-12-12, 2006-02-19, 2006-05-03, 2006-04-23, 2007-12-08, 2011-03-19, 2007-12-20, 2008-06-15, 2008-12-16, 2009-06-07, 2009-10-09, 2010-01-28, 2007-06-05), gender = c(F, F, F, F, F, F, F, F, F, F, F, F, F, M, M, M, M, M, M, M), mygroup = c(A, B, B, B, C, C, C, D, D, D, D, D, D, A, A, A, B, B, B, A), id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L)), .Names = c(mydate, gender, mygroup, id), class = data.frame, row.names = c(NA, -20L )) sessionInfo() R version 2.13.1 (2011-07-08) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.