[R] selecting rows with more than x occurrences in a given column (data type is names)
Despite a long search on the archives, I couldn't find how to do this. Thanks in advance for what is likely a simple issue.

I have a data set where the first column is name (e.g., 'Joe Smith', 'Jane Doe', etc). The following columns are data associated with that person. I have many people with multiple rows. What I want is to get a new data frame out with only the people who have more than x occurrences in the first column.

Here's what I've done, that's not working. Let's call my old data.frame all.data.

    table(all.data$names) > 10

I get a list of names and TRUE/FALSE values. I then want to make a list of the TRUEs and pass that to some subset type command like

    dup.names = table(all.data$names) > 10
    new.data = (all.data[all.data$names == dup.names, ])

That's not working because the dimensions are wrong (I think). But even when I tried to do part of it manually (to troubleshoot) like this

    dup.names = c('Joe Smith', 'Jane Doe', 'etc')

I got warnings and it didn't work correctly. There must be a simple way to do this that I'm just not seeing.

Thanks.

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] selecting rows with more than x occurrences in a given column (data type is names)
This isn't pretty, but should work:

    x <- 10  # number of occurrences
    y <- split(all.data, f = all.data$names)
    z <- y[unlist(lapply(y, nrow)) > x]
    newdata <- vector()
    for (k in z) {
      newdata <- rbind(newdata, k)
    }

Basically I split your data frame into groups by name (into a list), then selected the elements of the list for which the number of rows (number of occurrences) was greater than x, then appended the rows from the selected elements to an initially empty vector. Probably there is a more elegant way to do this but I can't think of it at the moment...

You are correct that '==' is the problem: it cannot properly test vectors of mismatched lengths.
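To illustrate the point about '==' above, here is a minimal sketch on made-up data (the names and the score column are hypothetical, not from the original post). As far as I can tell, '==' does not fail outright on vectors of different lengths; it recycles the shorter one and compares position by position, which is why the result looks wrong and a warning appears when the lengths are not multiples of each other. %in% instead tests membership for each row.

```r
# Hypothetical toy data; the names and values are illustrative only.
all.data <- data.frame(
  names = c("Joe Smith", "Jane Doe", "Joe Smith", "Ann Lee", "Jane Doe"),
  score = 1:5,
  stringsAsFactors = FALSE
)
dup.names <- c("Joe Smith", "Jane Doe")

# `==` recycles dup.names along the column and compares element by element:
# row 1 vs "Joe Smith", row 2 vs "Jane Doe", row 3 vs "Joe Smith", and so on.
# Because 5 is not a multiple of 2, R also emits a warning (suppressed here).
eq.mask <- suppressWarnings(all.data$names == dup.names)

# `%in%` asks, for each row, whether its name appears anywhere in dup.names.
in.mask <- all.data$names %in% dup.names

print(eq.mask)
print(in.mask)
```

The two masks differ on the last row: "Jane Doe" is in dup.names, but recycling happens to line it up against "Joe Smith".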
Re: [R] selecting rows with more than x occurrences in a given column (data type is names)
try this:

    set.seed(123)
    all.data <- data.frame(name = sample(c("Joe", "Elen", "Jane", "Mike"), 8, TRUE),
                           x = rnorm(8), y = runif(8))
    ##
    # count the occurrences of each name
    tab.nams <- table(all.data$name)
    # names that occur at least twice
    nams <- names(tab.nams[tab.nams >= 2])
    # keep only the rows belonging to those names
    all.data[all.data$name %in% nams, ]

I hope it helps.

Best,
Dimitris

Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven
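For a reusable version of the table/%in% idea with the strict "more than x" rule from the original question, something along these lines might do. This is only a sketch: the function name keep.frequent, its arguments, and the toy data are hypothetical and not from the thread.

```r
# Hypothetical helper: keep the rows whose value in `column` occurs more than
# `x` times. Function and argument names are illustrative, not from the thread.
keep.frequent <- function(data, column, x) {
  counts <- table(data[[column]])          # occurrences of each distinct value
  keep <- names(counts)[counts > x]        # values seen more than x times
  data[data[[column]] %in% keep, , drop = FALSE]
}

# Made-up example data.
all.data <- data.frame(
  name = c("Joe", "Jane", "Joe", "Mike", "Joe", "Jane"),
  value = c(10, 20, 30, 40, 50, 60)
)

more.than.one <- keep.frequent(all.data, "name", 1)  # Joe (3 rows) and Jane (2 rows)
more.than.two <- keep.frequent(all.data, "name", 2)  # Joe only
print(more.than.one)
print(more.than.two)
```

Using counts > x rather than counts >= x matches "more than x occurrences"; switch the operator if "at least x" is what you want.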
Re: [R] selecting rows with more than x occurrences in a given column (data type is names)
On Tue, 2007-03-13 at 10:38 -0400, Mike Jasper wrote:
> What I want is to get a new data frame out with only the people who
> have more than x occurrences in the first column.

Something like this should work:

    NewDF <- subset(all.data, names %in% unique(names[duplicated(names)]))

See ?duplicated, ?unique and ?"%in%" for more information.

HTH,

Marc Schwartz
Re: [R] selecting rows with more than x occurrences in a given column (data type is names)
Does this help?

    df <- data.frame(PERSON = rep(c("John", "Tom", "Sara", "Mary"), c(5, 4, 5, 4)),
                     Y = runif(18))

    # keep the people with at least 5 rows
    subset(df, PERSON %in% names(which(table(PERSON) >= 5)))

--
Chuck Cleland, Ph.D.
NDRI, Inc.
Re: [R] selecting rows with more than x occurrences in a given column (data type is names)
On Tue, 2007-03-13 at 10:32 -0500, Marc Schwartz wrote:
> Something like this should work:
>
>     NewDF <- subset(all.data, names %in% unique(names[duplicated(names)]))

Ack...sorry about that. I misread the query as asking for any duplicated occurrences. The solution provided by Dimitris is correct.

Marc
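A small sketch of the difference between the two readings, on a made-up vector of names (the data and variable names here are hypothetical): duplicated() can only tell you that a name occurs at least twice, while table() gives counts that can be compared against any threshold.

```r
# Made-up vector of names for illustration.
names.col <- c("Joe", "Joe", "Joe", "Jane", "Jane", "Mike")

# duplicated() flags the second and later occurrences, so this finds every
# name that appears at least twice, regardless of any larger threshold.
any.dups <- unique(names.col[duplicated(names.col)])

# table() gives the actual counts, so any threshold can be applied.
counts <- table(names.col)
more.than.two <- names(counts)[counts > 2]

print(any.dups)
print(more.than.two)
```

Here "Jane" shows up in any.dups but not in more.than.two, since she has only two rows.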
Re: [R] selecting rows with more than x occurrences in a given column (data type is names)
Thanks to all of you who got me the answer. The key I was missing was %in%. I had never seen it before.

Best.

On 3/13/07, Dimitris Rizopoulos wrote:
> tab.nams <- table(all.data$name)
> nams <- names(tab.nams[tab.nams >= 2])
> all.data[all.data$name %in% nams, ]