Re: [R] data manipulation and summaries with few million rows

2011-08-27 Thread jim holtman
Factors are you friend here:

 myData
   mydate gender mygroup id mygrp.f
1  2012-03-25  F   A  1   1
2  2005-05-23  F   B  2   2
3  2005-09-08  F   B  2   2
4  2005-12-07  F   B  2   2
5  2006-02-26  F   C  2   3
6  2006-05-13  F   C  2   3
7  2006-09-01  F   C  2   3
8  2006-12-12  F   D  2   4
9  2006-02-19  F   D  2   4
10 2006-05-03  F   D  2   4
11 2006-04-23  F   D  2   4
12 2007-12-08  F   D  2   4
13 2011-03-19  F   D  2   4
14 2007-12-20  M   A  3   1
15 2008-06-15  M   A  3   1
16 2008-12-16  M   A  3   1
17 2009-06-07  M   B  3   2
18 2009-10-09  M   B  3   2
19 2010-01-28  M   B  3   2
20 2007-06-05  M   A  4   1
 # change 'mygroup' to a factor so you can use 'diff' to count the changes
 myData$mygrp.f - as.integer(factor(myData$mygroup))
 # count the changes for each 'id'
 changes - tapply(myData$mygrp.f, myData$id, function(x){
+ sum(diff(x) != 0)
+ })


 changes
1 2 3 4
0 2 1 0



On Wed, Aug 24, 2011 at 12:48 PM, Juliet Hannah juliet.han...@gmail.com wrote:
 I have a data set with about 6 million rows and 50 columns. It is a
 mixture of dates, factors, and numerics.

 What I am trying to accomplish can be seen with the following
 simplified data, which is given as dput output below.

 head(myData)
      mydate gender mygroup id
 1 2012-03-25      F       A  1
 2 2005-05-23      F       B  2
 3 2005-09-08      F       B  2
 4 2005-12-07      F       B  2
 5 2006-02-26      F       C  2
 6 2006-05-13      F       C  2

 For each id, I want to count the number of changes of the variable
 'mygroup' that occur. For example, id=1 has 0 changes because it is
 observed only once.  id=2 has 2 changes (B to C, and C to D).  I also
 need to calculate the total observation time for each id using the
 variable mydate.  In the end, I am trying to have a new data set in
 which each row has an id, days observed, number of changes, and
 gender.

 I made some simple summaries using data.table and plyr, but I'm stuck
 on this reformatting.

 Thanks for your help.

 myData - structure(list(mydate = c(2012-03-25, 2005-05-23, 2005-09-08,
 2005-12-07, 2006-02-26, 2006-05-13, 2006-09-01, 2006-12-12,
 2006-02-19, 2006-05-03, 2006-04-23, 2007-12-08, 2011-03-19,
 2007-12-20, 2008-06-15, 2008-12-16, 2009-06-07, 2009-10-09,
 2010-01-28, 2007-06-05), gender = c(F, F, F, F, F,
 F, F, F, F, F, F, F, F, M, M, M, M, M,
 M, M), mygroup = c(A, B, B, B, C, C, C, D,
 D, D, D, D, D, A, A, A, B, B, B, A),
    id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
    3L, 3L, 3L, 3L, 3L, 3L, 4L)), .Names = c(mydate, gender,
 mygroup, id), class = data.frame, row.names = c(NA, -20L
 ))

 sessionInfo()
 R version 2.13.1 (2011-07-08)
 Platform: x86_64-unknown-linux-gnu (64-bit)

 locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] data manipulation and summaries with few million rows

2011-08-24 Thread Juliet Hannah
I have a data set with about 6 million rows and 50 columns. It is a
mixture of dates, factors, and numerics.

What I am trying to accomplish can be seen with the following
simplified data, which is given as dput output below.

 head(myData)
  mydate gender mygroup id
1 2012-03-25  F   A  1
2 2005-05-23  F   B  2
3 2005-09-08  F   B  2
4 2005-12-07  F   B  2
5 2006-02-26  F   C  2
6 2006-05-13  F   C  2

For each id, I want to count the number of changes of the variable
'mygroup' that occur. For example, id=1 has 0 changes because it is
observed only once.  id=2 has 2 changes (B to C, and C to D).  I also
need to calculate the total observation time for each id using the
variable mydate.  In the end, I am trying to have a new data set in
which each row has an id, days observed, number of changes, and
gender.

I made some simple summaries using data.table and plyr, but I'm stuck
on this reformatting.

Thanks for your help.

myData - structure(list(mydate = c(2012-03-25, 2005-05-23, 2005-09-08,
2005-12-07, 2006-02-26, 2006-05-13, 2006-09-01, 2006-12-12,
2006-02-19, 2006-05-03, 2006-04-23, 2007-12-08, 2011-03-19,
2007-12-20, 2008-06-15, 2008-12-16, 2009-06-07, 2009-10-09,
2010-01-28, 2007-06-05), gender = c(F, F, F, F, F,
F, F, F, F, F, F, F, F, M, M, M, M, M,
M, M), mygroup = c(A, B, B, B, C, C, C, D,
D, D, D, D, D, A, A, A, B, B, B, A),
id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 4L)), .Names = c(mydate, gender,
mygroup, id), class = data.frame, row.names = c(NA, -20L
))

 sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C  LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
 [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] data manipulation and summaries with few million rows

2011-08-24 Thread Dennis Murphy
Hi Juliet:

Here's a Q  D solution:

# (1) plyr
 f - function(d) length(unique(d$mygroup)) - 1
 ddply(myData, .(id), f)
  id V1
1  1  0
2  2  2
3  3  1
4  4  0

# (2) data.table

myDT - data.table(myData, key = 'id')
myDT[, list(nswitch = length(unique(mygroup)) - 1), by = 'id']

If one can switch back and forth between levels more than once, then
the above is clearly not appropriate. A more robust method would be to
employ rle() [run length encoding]:

g - function(d) length(rle(d$mygroup)$lengths) - 1
ddply(myData, .(id), g)# gives the same answer as above
myDT[, list(nswitch = length(rle(mygroup)$lengths) - 1), by = 'id']   # ditto


HTH,
Dennis

On Wed, Aug 24, 2011 at 9:48 AM, Juliet Hannah juliet.han...@gmail.com wrote:
 I have a data set with about 6 million rows and 50 columns. It is a
 mixture of dates, factors, and numerics.

 What I am trying to accomplish can be seen with the following
 simplified data, which is given as dput output below.

 head(myData)
      mydate gender mygroup id
 1 2012-03-25      F       A  1
 2 2005-05-23      F       B  2
 3 2005-09-08      F       B  2
 4 2005-12-07      F       B  2
 5 2006-02-26      F       C  2
 6 2006-05-13      F       C  2

 For each id, I want to count the number of changes of the variable
 'mygroup' that occur. For example, id=1 has 0 changes because it is
 observed only once.  id=2 has 2 changes (B to C, and C to D).  I also
 need to calculate the total observation time for each id using the
 variable mydate.  In the end, I am trying to have a new data set in
 which each row has an id, days observed, number of changes, and
 gender.

 I made some simple summaries using data.table and plyr, but I'm stuck
 on this reformatting.

 Thanks for your help.

 myData - structure(list(mydate = c(2012-03-25, 2005-05-23, 2005-09-08,
 2005-12-07, 2006-02-26, 2006-05-13, 2006-09-01, 2006-12-12,
 2006-02-19, 2006-05-03, 2006-04-23, 2007-12-08, 2011-03-19,
 2007-12-20, 2008-06-15, 2008-12-16, 2009-06-07, 2009-10-09,
 2010-01-28, 2007-06-05), gender = c(F, F, F, F, F,
 F, F, F, F, F, F, F, F, M, M, M, M, M,
 M, M), mygroup = c(A, B, B, B, C, C, C, D,
 D, D, D, D, D, A, A, A, B, B, B, A),
    id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
    3L, 3L, 3L, 3L, 3L, 3L, 4L)), .Names = c(mydate, gender,
 mygroup, id), class = data.frame, row.names = c(NA, -20L
 ))

 sessionInfo()
 R version 2.13.1 (2011-07-08)
 Platform: x86_64-unknown-linux-gnu (64-bit)

 locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] data manipulation and summaries with few million rows

2011-08-24 Thread Juliet Hannah
Thanks Dennis! I'll check this out.

Just to clarify, I need the total number of switches/changes
regardless of if that state
had occurred in the past. So A-A-B-A, would have 2 changes: A to B and B to A.

Thanks again.


On Wed, Aug 24, 2011 at 1:28 PM, Dennis Murphy djmu...@gmail.com wrote:
 Hi Juliet:

 Here's a Q  D solution:

 # (1) plyr
 f - function(d) length(unique(d$mygroup)) - 1
 ddply(myData, .(id), f)
  id V1
 1  1  0
 2  2  2
 3  3  1
 4  4  0

 # (2) data.table

 myDT - data.table(myData, key = 'id')
 myDT[, list(nswitch = length(unique(mygroup)) - 1), by = 'id']

 If one can switch back and forth between levels more than once, then
 the above is clearly not appropriate. A more robust method would be to
 employ rle() [run length encoding]:

 g - function(d) length(rle(d$mygroup)$lengths) - 1
 ddply(myData, .(id), g)    # gives the same answer as above
 myDT[, list(nswitch = length(rle(mygroup)$lengths) - 1), by = 'id']   # ditto


 HTH,
 Dennis

 On Wed, Aug 24, 2011 at 9:48 AM, Juliet Hannah juliet.han...@gmail.com 
 wrote:
 I have a data set with about 6 million rows and 50 columns. It is a
 mixture of dates, factors, and numerics.

 What I am trying to accomplish can be seen with the following
 simplified data, which is given as dput output below.

 head(myData)
      mydate gender mygroup id
 1 2012-03-25      F       A  1
 2 2005-05-23      F       B  2
 3 2005-09-08      F       B  2
 4 2005-12-07      F       B  2
 5 2006-02-26      F       C  2
 6 2006-05-13      F       C  2

 For each id, I want to count the number of changes of the variable
 'mygroup' that occur. For example, id=1 has 0 changes because it is
 observed only once.  id=2 has 2 changes (B to C, and C to D).  I also
 need to calculate the total observation time for each id using the
 variable mydate.  In the end, I am trying to have a new data set in
 which each row has an id, days observed, number of changes, and
 gender.

 I made some simple summaries using data.table and plyr, but I'm stuck
 on this reformatting.

 Thanks for your help.

 myData - structure(list(mydate = c(2012-03-25, 2005-05-23, 2005-09-08,
 2005-12-07, 2006-02-26, 2006-05-13, 2006-09-01, 2006-12-12,
 2006-02-19, 2006-05-03, 2006-04-23, 2007-12-08, 2011-03-19,
 2007-12-20, 2008-06-15, 2008-12-16, 2009-06-07, 2009-10-09,
 2010-01-28, 2007-06-05), gender = c(F, F, F, F, F,
 F, F, F, F, F, F, F, F, M, M, M, M, M,
 M, M), mygroup = c(A, B, B, B, C, C, C, D,
 D, D, D, D, D, A, A, A, B, B, B, A),
    id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
    3L, 3L, 3L, 3L, 3L, 3L, 4L)), .Names = c(mydate, gender,
 mygroup, id), class = data.frame, row.names = c(NA, -20L
 ))

 sessionInfo()
 R version 2.13.1 (2011-07-08)
 Platform: x86_64-unknown-linux-gnu (64-bit)

 locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.