[R] selecting rows with more than x occurrences in a given column (data type is names)

2007-03-13 Thread Mike Jasper
Despite a long search on the archives, I couldn't find how to do this.
Thanks in advance for what is likely a simple issue.

I have a data set where the first column is name (i.e., 'Joe Smith',
'Jane Doe', etc). The following columns are data associated with that
person. I have many people with multiple rows. What I want is to get a
new data frame out with only the people who have more than x
occurrences in the first column.

Here's what I've done, that's not working:

Let's call my old data.frame all.data

table(all.data$names) > 10

I get a list of names and TRUE/FALSE values. I then want to make a
list of the TRUEs and pass that to some subset type command like

dup.names=table(all.data$names) > 10

new.data=(all.data[all.data$names==dup.names,])

That's not working because the dimensions are wrong (I think). But
even when I tried to do part of it manually (to troubleshoot) like
this

dup.names=c('Joe Smith','Jane Doe','etc')

I got warnings and it didn't work correctly. There must be a simple
way to do this that I'm just not seeing. Thanks.

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] selecting rows with more than x occurrences in a given column (data type is names)

2007-03-13 Thread Stephen Tucker
This isn't pretty, but should work:

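x <- 10 # number of occurrences
y <- split(all.data, f = all.data$names)  # one data frame per name
z <- y[unlist(lapply(y, nrow)) > x]       # keep groups with more than x rows
newdata <- vector()
for( k in z ) {
  newdata <- rbind(newdata, k)            # stack the selected groups
}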

Basically I split your data frame into groups by name (into a list), then
selected elements in the list for which the number of rows (number of
occurrences) was > x, then concatenated rows from the selected elements to an
initially empty vector. Probably there is a more elegant way to do this but I
can't think of it at the moment...
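
The same split-then-recombine idea can be sketched more compactly with do.call(), which binds all selected groups in one call instead of growing the result inside a loop. The toy data frame and the helper name keep_frequent are made up for illustration; only the column name `names` comes from the original post.

```r
# Toy data (hypothetical): "Joe" appears 3 times, "Jane" 2, "Ann" 1.
all.data <- data.frame(names = c("Joe", "Jane", "Joe", "Ann", "Jane", "Joe"),
                       score = 1:6)

# Hypothetical helper: keep only rows whose name occurs more than x times.
keep_frequent <- function(d, x) {
  pieces <- split(d, d$names)                         # one data frame per name
  big    <- pieces[vapply(pieces, nrow, integer(1)) > x]  # groups with > x rows
  if (length(big) == 0) return(d[0, , drop = FALSE])  # nothing qualifies
  out <- do.call(rbind, big)                          # stack the groups
  rownames(out) <- NULL                               # drop group-prefixed row names
  out
}

newdata <- keep_frequent(all.data, 2)   # only the three "Joe" rows remain
```

Note that the rows come back grouped by name rather than in their original order.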

You are correct in that the conditional statement using '==' cannot test
vectors of mismatched dimensions.
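
Strictly speaking, '==' does not stop on mismatched lengths: it recycles the shorter vector and compares position by position (with a warning when the longer length is not a multiple of the shorter one), which is probably why the manual attempt "didn't work correctly". %in% asks the intended question instead: is each element a member of the set? A small sketch with made-up names:

```r
nms       <- c("Joe Smith", "Jane Doe", "Joe Smith", "Ann Lee", "Jane Doe", "Ann Lee")
dup.names <- c("Joe Smith", "Jane Doe")

# '==' recycles dup.names along nms and compares position by position,
# so a match is found only where the names happen to line up.
by_equal <- nms == dup.names   # TRUE TRUE TRUE FALSE FALSE FALSE

# '%in%' tests each element of nms for membership in dup.names.
by_in <- nms %in% dup.names    # TRUE TRUE TRUE FALSE TRUE FALSE
```

The fifth element ("Jane Doe") is missed by '==' because it is compared against the recycled "Joe Smith".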


Re: [R] selecting rows with more than x occurrences in a given column (data type is names)

2007-03-13 Thread Dimitris Rizopoulos
try this:

set.seed(123)
all.data <- data.frame(name = sample(c("Joe", "Elen", "Jane", "Mike"),
                                     8, TRUE),
                       x = rnorm(8), y = runif(8))
##
tab.nams <- table(all.data$name)
nams <- names(tab.nams[tab.nams >= 2])
all.data[all.data$name %in% nams, ]
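
Since the original question asked for names with strictly more than x occurrences, the table()/%in% approach can be sketched with the cutoff as a parameter. The function name rows_with_more_than and the toy data below are hypothetical, not part of the thread.

```r
# Hypothetical helper: keep rows whose value in column `col` occurs
# strictly more than x times.
rows_with_more_than <- function(d, col, x) {
  counts <- table(d[[col]])              # occurrences per name
  keep   <- names(counts)[counts > x]    # names above the cutoff
  d[d[[col]] %in% keep, , drop = FALSE]  # original row order is preserved
}

toy <- data.frame(name = c("Joe", "Elen", "Joe", "Jane", "Elen", "Joe"),
                  x    = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))

r1 <- rows_with_more_than(toy, "name", 1)  # Joe (3 rows) and Elen (2 rows)
r2 <- rows_with_more_than(toy, "name", 2)  # Joe only
```

Unlike a split-and-rbind approach, this keeps the rows in their original order.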


I hope it helps.

Best,
Dimitris


Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven

Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/(0)16/336899
Fax: +32/(0)16/337015
Web: http://med.kuleuven.be/biostat/
 http://www.student.kuleuven.be/~m0390867/dimitris.htm


Re: [R] selecting rows with more than x occurrences in a given column (data type is names)

2007-03-13 Thread Marc Schwartz
Something like this should work:

  NewDF <- subset(all.data, names %in% unique(names[duplicated(names)]))

See ?duplicated, ?unique and ?"%in%" for more information.
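
For reference, duplicated() flags the second and later occurrences of a value, so this expression keeps every name that appears at least twice. That corresponds to the special case x = 1 rather than an arbitrary cutoff. A small sketch with made-up data:

```r
nms <- c("Joe", "Jane", "Joe", "Ann", "Joe", "Jane")

dup_flags <- duplicated(nms)               # FALSE FALSE TRUE FALSE TRUE TRUE
repeated  <- unique(nms[dup_flags])        # "Joe" "Jane"

toy   <- data.frame(names = nms, value = seq_along(nms))
NewDF <- subset(toy, names %in% repeated)  # drops only the single "Ann" row
```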

HTH,

Marc Schwartz


Re: [R] selecting rows with more than x occurrences in a given column (data type is names)

2007-03-13 Thread Chuck Cleland
  Does this help?

df <- data.frame(PERSON = rep(c("John", "Tom", "Sara", "Mary"),
                              c(5, 4, 5, 4)),
                 Y = runif(18))

subset(df, PERSON %in% names(which(table(PERSON) >= 5)))

-- 
Chuck Cleland, Ph.D.
NDRI, Inc.
71 West 23rd Street, 8th floor
New York, NY 10010
tel: (212) 845-4495 (Tu, Th)
tel: (732) 512-0171 (M, W, F)
fax: (917) 438-0894


Re: [R] selecting rows with more than x occurrences in a given column (data type is names)

2007-03-13 Thread Marc Schwartz
Ack...sorry about that.  I misread the query as for any duplicated
occurrences. The solution provided by Dimitris is correct.

Marc


Re: [R] selecting rows with more than x occurrences in a given column (data type is names)

2007-03-13 Thread Mike Jasper
Thanks to all of you who got me the answer. The key I was missing was
%in%. Had never seen it before.

best.