Re: [R] sampling dataframe based upon number of record occurrences

2015-03-04 Thread David L Carlson
I'm not sure I understand, but I think you have a large data frame with records 
and you want to construct a sample of that data frame that includes no more 
than 3 records for each IDbyYear combination? You say there are 5589 unique 
combinations and your code uses a data frame called fitting_set. Assuming this 
is the data frame you are describing, your code will select all of the lines 
since fitting_set$IDbyYear[i] is always a vector of length 1.
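
As a minimal sketch of the point about length (the vector ids below is made up and only stands in for fitting_set$IDbyYear): indexing with [i] returns a single element, so its length is 1 whatever the value is. Counting how often each value occurs needs something like table() instead.

```r
# Hypothetical stand-in for fitting_set$IDbyYear
ids <- c(42.24, 42.24, 42.24, 42.24, 45.32, 45.32)

# A single element always has length 1, regardless of its value
len_one <- length(ids[1] > 3)

# Count the number of records for each distinct value instead
counts <- table(ids)

# Names of the values that occur more than 3 times
big <- names(counts)[counts > 3]
```

Here the comparison inside length() tests the value itself, not the number of records that share it, which is why the original loop never sees a group size.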

We need a reproducible example. The best way for you to give us that would be 
to copy the result of dput(head(fitting_set, 10)). It would look something like 
this plus the 6 other columns you mention except that I've added dta <- in 
front of structure() to create a data frame:

dta <- structure(list(IDbyYear = c(42.24, 42.24, 42.24, 42.24, 42.24, 
42.24, 45.32, 45.32, 45.36, 45.4, 45.4), SiteID = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("A-Airport", 
"A-Bark Corral East"), class = "factor"), Year = c(2006L, 2006L, 
2006L, 2006L, 2006L, 2006L, 2008L, 2008L, 2009L, 2010L, 2010L
)), .Names = c("IDbyYear", "SiteID", "Year"), class = "data.frame",
row.names = c(NA, -11L))

Now create a list of data frames, one for each IDbyYear:

dta.list <- split(dta, dta$IDbyYear)

Now a function that will select 3 rows or all of them if there are fewer:

smp <- function(dframe) {
  ind <- seq_len(nrow(dframe))
  dframe[sample(ind, ifelse(length(ind) > 2, 3, length(ind))), ]
}

Now take the samples and combine them into a single data frame:

sample <- do.call(rbind, lapply(dta.list, smp))
sample
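
The same idea can also be sketched as a single helper. This is only an illustration under stated assumptions: cap_sample and its arguments are hypothetical names, and the small data frame mirrors the example rows from this thread. It samples row indices within each group and keeps small groups untouched.

```r
# Example rows mirroring the data in this thread
dta <- data.frame(
  IDbyYear = c(rep(42.24, 6), 45.32, 45.32, 45.36, 45.40, 45.40),
  SiteID   = c(rep("A-Airport", 6), rep("A-Bark Corral East", 5)),
  Year     = c(rep(2006L, 6), 2008L, 2008L, 2009L, 2010L, 2010L)
)

# Hypothetical helper: keep at most n rows per value of the grouping column
cap_sample <- function(df, group, n = 3) {
  # Row indices of df, grouped by the values in the grouping column
  idx_by_group <- split(seq_len(nrow(df)), df[[group]])
  rows <- unlist(lapply(idx_by_group, function(idx) {
    # Sample positions within idx; groups of n or fewer rows are kept whole
    if (length(idx) > n) idx[sample.int(length(idx), n)] else idx
  }), use.names = FALSE)
  df[sort(rows), , drop = FALSE]
}

capped <- cap_sample(dta, "IDbyYear")
```

Sampling positions with sample.int() rather than calling sample() on the indices directly avoids a known pitfall: sample(x, ...) treats a single number x >= 1 as the range 1:x, which matters when a group holds just one row index.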

-
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] sampling dataframe based upon number of record occurrences

2015-03-04 Thread JS Huang
Here is an implementation with a function named getSample. The data were
modified slightly so that they can be read as a table.

> fitting.set
   IDbyYear             SiteID Year
1     42.24          A-Airport 2006
2     42.24          A-Airport 2006
3     42.24          A-Airport 2006
4     42.24          A-Airport 2006
5     42.24          A-Airport 2006
6     42.24          A-Airport 2006
7     45.32 A-Bark.Corral.East 2008
8     45.32 A-Bark.Corral.East 2008
9     45.36 A-Bark.Corral.East 2009
10    45.40 A-Bark.Corral.East 2010
11    45.40 A-Bark.Corral.East 2010
> getSample
function(x)
{
  sites <- unique(x$SiteID)
  years <- unique(x$Year)
  result <- data.frame()
  x$ID <- seq(1,nrow(x))
  for (i in 1:length(sites))
  {
    for (j in 1:length(years))
    {
      if (nrow(x[as.character(x$SiteID)==as.character(sites[i]) &
                 x$Year==years[j],]) > 3)
      {
        sampledID <- sample(x[as.character(x$SiteID)==as.character(sites[i]) &
                              x$Year==years[j],]$ID,3,replace=FALSE)
        for (k in 1:length(sampledID))
        {
          result <- rbind(result,x[x$ID==sampledID[k],-4])
        }
      }
    }
  }
  names(result) <- c("IDbyYear","SiteID","Year")
  rownames(result) <- NULL
  return(result)
}
> getSample(fitting.set)
  IDbyYear    SiteID Year
1    42.24 A-Airport 2006
2    42.24 A-Airport 2006
3    42.24 A-Airport 2006




--
View this message in context: 
http://r.789695.n4.nabble.com/sampling-dataframe-based-upon-number-of-record-occurrences-tp4704144p4704154.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] sampling dataframe based upon number of record occurrences

2015-03-04 Thread Curtis Burkhalter
That worked great, thanks so much David!


-- 
Curtis Burkhalter

https://sites.google.com/site/curtisburkhalter/


Re: [R] sampling dataframe based upon number of record occurrences

2015-03-04 Thread JS Huang
Since you indicated there are six more columns in the data.frame, getSample
is modified below to handle them.

> getSample
function(x)
{
  sites <- unique(x$SiteID)
  years <- unique(x$Year)
  result <- data.frame()
  x$ID <- seq(1,nrow(x))
  for (i in 1:length(sites))
  {
    for (j in 1:length(years))
    {
      if (nrow(x[as.character(x$SiteID)==as.character(sites[i]) &
                 x$Year==years[j],]) > 3)
      {
        sampledID <- sample(x[as.character(x$SiteID)==as.character(sites[i]) &
                              x$Year==years[j],]$ID,3,replace=FALSE)
        for (k in 1:length(sampledID))
        {
          result <- rbind(result,x[x$ID==sampledID[k],-ncol(x)])
        }
      }
    }
  }
  names(result) <- names(x)[-ncol(x)]
  rownames(result) <- NULL
  return(result)
}
> getSample(fitting.set)
  IDbyYear    SiteID Year
1    42.24 A-Airport 2006
2    42.24 A-Airport 2006
3    42.24 A-Airport 2006




--
View this message in context: 
http://r.789695.n4.nabble.com/sampling-dataframe-based-upon-number-of-record-occurrences-tp4704144p4704155.html
Sent from the R help mailing list archive at Nabble.com.



[R] sampling dataframe based upon number of record occurrences

2015-03-03 Thread Curtis Burkhalter
Hello everyone,

I'm having trouble performing a task that is probably very simple, but I
can't seem to figure out how to get my code to work. What I want to do is
use the sample function to pick records within a data frame, but only if
a column attribute value is repeated more than 3 times. If you look at
the data below, I have created a unique attribute value that corresponds to
every site by year combination (i.e. IDxYear). You can see that the
site called A-Airport was sampled 6 times in 2006, and A-Bark Corral
East was sampled twice in 2008. What I want to do is randomly select 3
records for A-Airport in 2006 from the existing 6 records, but for A-Bark
Corral East in 2008 I just want to leave the records as they currently
are.

I've used the following code to try to accomplish this, but like I said I
can't get it to work, so I'm clearly doing something wrong. If you could
check out the code and provide any suggestions, that would be great. Note
that there are 5589 unique IDxYear combinations, which is why that number
is in the code. If any further clarification is needed, let me know.

boom=data.frame()
for (i in 1:5589){

boom[i,]=ifelse(length(fitting_set$IDbyYear[i]>3),fitting_set[sample(nrow(fitting_set),3),],fitting_set)

}
boom


IDbyYear   SiteID               Year   (6 other column attributes)
42.24      A-Airport            2006
42.24      A-Airport            2006
42.24      A-Airport            2006
42.24      A-Airport            2006
42.24      A-Airport            2006
42.24      A-Airport            2006
45.32      A-Bark Corral East   2008
45.32      A-Bark Corral East   2008
45.36      A-Bark Corral East   2009
45.40      A-Bark Corral East   2010
45.40      A-Bark Corral East   2010

 Thanks


-- 
Curtis Burkhalter

https://sites.google.com/site/curtisburkhalter/



Re: [R] Sampling dataframe

2009-11-28 Thread Juliet Hannah
Here are some options that may help you out. First,
let's put the data in a format that can be cut-and-pasted
into R.

myData <- read.table(textConnection("var1 var2 var3
1     1    1    1
2     3    1    2
3     8    1    3
4     6    1    4
5    10    1    5
6     2    2    1
7     4    2    2
8     6    2    3
9     8    2    4
10   10    2    5"),header=TRUE,row.names=1)
closeAllConnections()

or

use dput

myData <- structure(list(var1 = c(1L, 3L, 8L, 6L, 10L, 2L, 4L, 6L, 8L,
10L), var2 = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), var3 = c(1L,
2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L)), .Names = c("var1", "var2",
"var3"), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10"))


# select rows where var2 equals 1

select_v2 <- myData[myData$var2==1,]

# sample two rows of select_v2

sampled_v2 <- select_v2[sample(1:nrow(select_v2),2),]

# select rows of var3 not equal to 1

select_v3 <- myData[myData$var3 !=1,]

# ?rbind may also come in useful.
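
Putting those pieces together, one possible sketch of the whole task follows. The object names (group1, sampled, rest, result) are illustrative only, and set.seed() is there just to make the sketch repeatable. It samples two rows where var2 is 1, keeps the var2 == 2 rows whose var3 was not drawn, and stacks the two parts with rbind().

```r
myData <- data.frame(
  var1 = c(1L, 3L, 8L, 6L, 10L, 2L, 4L, 6L, 8L, 10L),
  var2 = rep(1:2, each = 5),
  var3 = rep(1:5, times = 2)
)

set.seed(42)  # only to make the sketch repeatable

# Sample two rows from the var2 == 1 group
group1  <- myData[myData$var2 == 1, ]
sampled <- group1[sample(nrow(group1), 2), ]

# Keep the var2 == 2 rows whose var3 value was not drawn above
group2 <- myData[myData$var2 == 2, ]
rest   <- group2[!(group2$var3 %in% sampled$var3), ]

# Combine the sampled rows with the remaining rows
result <- rbind(sampled, rest)
```

The %in% test does the exclusion step: it flags every var3 value that already appears in the sample, and negating it keeps only the others.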



[R] Sampling dataframe

2009-11-25 Thread Ronaldo Reis Júnior
Hi,

I have a table like that:

> datatest
   var1 var2 var3
1     1    1    1
2     3    1    2
3     8    1    3
4     6    1    4
5    10    1    5
6     2    2    1
7     4    2    2
8     6    2    3
9     8    2    4
10   10    2    5

I need to create another table based on that with the rules:

take a random sample by var2==1 (2 sample rows for example):

   var1 var2 var3
1     1    1    1
4     6    1    4

In this random sample I get the values 1 and 4 in var3. Now I need to
complete the table with the var2==2 rows whose var3 values were not selected
in the var2==1 sample.

The resulting table is:
   var1 var2 var3
1     1    1    1
4     6    1    4
7     4    2    2
8     6    2    3
10   10    2    5

The values 1 and 4 of var3 are not present in the var2==2 rows.

I tried several options without success. Taking a random sample is easy, but I
can't select the other rows while excluding the randomly selected values.

Any help?

Thanks
Ronaldo


-- 
17th law - Your advisor wants you to become famous,
  so that he can, finally, become famous.

  --Herman, I. P. 2007. Following the law. NATURE, Vol 445, p. 228.
--
 Prof. Ronaldo Reis Júnior
|  .''`. UNIMONTES/DBG/Lab. Ecologia Comportamental e Computacional
| : :'  : Campus Universitário Prof. Darcy Ribeiro, Vila Mauricéia
| `. `'` CP: 126, CEP: 39401-089, Montes Claros - MG - Brasil
|   `- Phone: (38) 3229-8192 | ronaldo.r...@unimontes.br | chrys...@gmail.com
| http://www.ppgcb.unimontes.br/lecc | ICQ#: 5692561 | LinuxUser#: 205366
--
Please DO NOT SEND Word or PowerPoint files
Prefer sending PDF, text, OpenOffice (ODF), HTML, or RTF.
