Re: [R] Question About Repeat Random Sampling from a Data Frame

2009-12-22 Thread Adam Carr
Thanks to both of you for the comments and suggestions. Over the next couple of 
days I plan to work through my simple problem using the help offered in this 
forum.





From: David Winsemius dwinsem...@comcast.net
To: Bert Gunter gunter.ber...@gene.com

Sent: Mon, December 21, 2009 2:31:26 PM
Subject: Re: [R] Question About Repeat Random Sampling from a Data Frame


On Dec 21, 2009, at 1:01 PM, Bert Gunter wrote:

 Didn't read this thread in detail, so the following suggestion may just be
 nonsense... (caveat emptor), but:
 
 To sample from a data frame or matrix, sample from the row indices and then
 extract what you want from the sampled rows. Or sample directly from
 individual columns if that suffices. In general,
 
 ?sample
 
 on appropriate indices of object in question.
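
A minimal sketch of the row-index approach described above. The toy data frame and its column names are made up for illustration and are not the poster's data.

```r
# Hypothetical toy data frame standing in for the real data set
set.seed(1)
df <- data.frame(x = rnorm(20), y = runif(20),
                 flag = rep(c("Y", "N"), each = 10),
                 stringsAsFactors = FALSE)

# Sample row indices, then extract the sampled rows
idx <- sample(nrow(df), size = 5)   # without replacement by default
sub <- df[idx, ]                    # still a data frame, all columns kept

# Or sample directly from a single column if that suffices
x_draw <- sample(df$x, size = 5, replace = TRUE)
```

Indexing with df[idx, ] keeps every column and its type, so the character column stays character.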
 
 Bert Gunter
 Genentech Nonclinical Biostatistics
 
 
 -Original Message-
 From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
 Behalf Of Adam Carr
 Sent: Monday, December 21, 2009 9:53 AM
 To: David Winsemius
 Cc: r-help@r-project.org
 Subject: Re: [R] Question About Repeat Random Sampling from a Data Frame
 
 Good Afternoon Dr. Winsemius:
 
 You ask some very good questions and make excellent points; my responses are
 below. I've tried to extract your questions and provide answers just to
 reduce the clutter.
 
 1. You might want to provide statistical justification for the otherwise
 puzzling sampling strategy.
 
 I assume you mean my overall process of random sampling from a large data
 set. The data set is comprised of observations collected over four years.
 Although the basis for sampling would make a good four-frame Dilbert cartoon
 if it could be condensed enough, my answer begins with the unfortunate truth
 that there is a great divide between the technical and marketing groups at
 the business where I am employed. Many powerful marketing executives, some
 with technical backgrounds, feel that there is something fundamentally wrong
 with the manufacturing process because the data generated over the long term
 is not approximately normally distributed. My task was to examine this set
 of data, trying to keep the representation of Y, N and F approximately equal
 in the sample when compared to the large set, to determine if any subset
 exhibits the holy grail-like normal distribution characteristics. I don't
 feel that this is statistical justification, but it is the reason why I am
 doing this.
 
 2. It would help if you explained what you are attempting here in ordinary
 English. There are 10 elements in mysamples, each of which is a 100 x 5
 dataframe, and mat is just one 100 x 5 matrix, which you seem to be
 referencing incorrectly, given the fact that it has two, rather than one,
 dimension. Furthermore, those dataframes may not be of a uniform class,
 since you said you had a character variable. Do you really want these all in a
 character type matrix, which would be what is likely to happen given R's
 requirement that matrix elements be of only one class? What you say above
 suggests not.
 
 It seems from your response that I incorrectly assumed that a list is not
 the same as a data frame. I started down this path after reading the
 questions and answers to a similar problem where the r-help responder
 suggested a two step process and said that the list must be converted to
 another form in order to be available for analysis.

A data.frame is a special type of list. You can also make lists of dataframes 
(just as you can make lists of lists), which I thought the first portion of 
your code would have done:

mysamples <- list()
for (i in 1:10) {
  mysamples[[i]] <- dataset[sample(1:1637, 100,
    prob = c(rep(163.7/1637, 513), rep(245.5/1637, 197), rep(1227.8/1637, 927)),
    replace = TRUE), ]
}

Each element in that list would have been a subset of your larger data.frame 
and would itself have been a data.frame.
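
As a rough illustration of that point, the sketch below builds a list of samples with lapply() instead of a for loop and checks what each element is. The data frame here is a small invented stand-in, and the sizes (10 samples of 8 rows) are arbitrary.

```r
# Hypothetical stand-in for the real 1637-row data set
set.seed(42)
dataset <- data.frame(tensile = rnorm(30, 60, 5),
                      hardness = rnorm(30, 80, 3),
                      flag = rep(c("F", "N", "Y"), each = 10),
                      stringsAsFactors = FALSE)

# Ten samples of 8 rows each, stored as a list of data frames
mysamples <- lapply(1:10, function(i) {
  dataset[sample(nrow(dataset), 8, replace = TRUE), ]
})

is.list(dataset)                  # a data frame is itself a list of columns
sapply(mysamples, is.data.frame)  # every element is still a data frame
class(mysamples[[1]]$flag)        # column types are preserved
```

No conversion step is needed before analysis: mysamples[[i]] can be passed straight to functions that accept a data frame.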


 
 And you are absolutely correct that I do not want each sample in a character
 type matrix.
 
 In plain English, I hope, I am simply trying to iterate the process of
 removing random samples from the large data set, and then saving these
 samples in a format that is available for simple analysis. For example, if I
 remove five hundred mysample sets, each of which is composed of a 100 x 5
 sample of the large data set I am interested in determining the skewness,
 kurtosis, mean and standard deviation of each of the four numeric variables
 in each of the five hundred mysample sets.

So make a small dataframe with variables (columns) of the same type as in your 
real data, maybe 25-30 rows in extent (not length, since for a dataframe, 
the length() function returns the number of columns).
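
A small test object along those lines might look like the sketch below. The column names follow the variables described in the thread, but the values are simulated and the means and standard deviations are arbitrary assumptions.

```r
# Simulated 30-row test data frame: four numeric columns plus one character column
set.seed(123)
test_df <- data.frame(
  tensile    = rnorm(30, mean = 60, sd = 5),
  yield      = rnorm(30, mean = 45, sd = 4),
  elongation = rnorm(30, mean = 20, sd = 2),
  hardness   = rnorm(30, mean = 80, sd = 3),
  flag       = sample(c("Y", "N", "F"), 30, replace = TRUE),
  stringsAsFactors = FALSE
)

nrow(test_df)            # number of rows (the "extent")
length(test_df)          # number of columns, not rows
dim(test_df)             # rows, then columns
sapply(test_df, class)   # class of each column
```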
 
 3. Sorting out such problems is best done with smaller test objects. I was
 surprised to see...type character.
 
 I agree. I began to do this with a small test data set but it was late last
 evening and I realized that I should ask for help before proceeding on what
 I thought

Re: [R] Question About Repeat Random Sampling from a Data Frame

2009-12-21 Thread Gustaf Rydevik
On Mon, Dec 21, 2009 at 4:12 PM, Adam Carr adamlc...@yahoo.com wrote:
 Good Morning:

 I've read many, many posts on the r-help system and I feel compelled to 
 quickly admit that I am relatively new to R. I do have several reference 
 books around me, but I cannot count myself among the fortunate who seem to 
 have strong programming intuition.

 I have a data set consisting of 1637 observations of five variables: tensile 
 strength, yield strength, elongation, hardness and a character indicator with 
 three levels: (Y)es, (N)o, and (F)ail.

 My objective is to randomly sample various subsets from this data set and 
 then evaluate these subsets using simple parameters, among them tests for 
 normality, shape and skewness. The data set is ordered by the character 
 variable prior to sampling, and the samples are weighted to mirror 
 representation in an overall, physical process.

 I am sampling the data set using this code:

 sample <- dataset[sample(1:1637, 500,
   prob = c(rep(163.7/1637, 513), rep(245.5/1637, 197), rep(1227.8/1637, 927)),
   replace = TRUE), ]
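
One thing worth checking about the weighting: sample() treats prob as a weight per row and normalizes it, so the share of a group in the sample depends on both its weight and its number of rows. The sketch below shows one way to get fixed group shares, assuming that is the goal. The group sizes, the target shares, and which flag gets which share are all hypothetical.

```r
set.seed(7)
# Hypothetical data ordered by flag, with unequal group sizes
toy <- data.frame(value = rnorm(200),
                  flag = rep(c("F", "N", "Y"), times = c(50, 20, 130)),
                  stringsAsFactors = FALSE)

# Assumed target shares of each group in the sample
target <- c(F = 0.10, N = 0.15, Y = 0.75)

# Per-row weight = group share divided by group size
n_g <- table(toy$flag)
w <- target[toy$flag] / as.numeric(n_g[toy$flag])

idx <- sample(nrow(toy), 5000, replace = TRUE, prob = w)
shares <- prop.table(table(toy$flag[idx]))
round(shares, 2)   # should land close to the target shares
```

Dividing by the group size is what makes the weights sum to the target share within each group.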

 What I would like to do is iterate this process to create many (say 500 or 
 more) sampled sets of n=500 and then evaluate each set for the parameters of 
 interest. I would actually be evaluating each variable within each subset for 
 my characteristic of interest. I am familiar with sampling and saving single 
 columns of data to do this sort of thing, but I am not sure how to accomplish 
 this with a multiple-variable data set.

 For example, I am currently iterating this using a clunky process:

 mysamples <- list()
 for (i in 1:10) {
   mysamples[[i]] <- dataset[sample(1:1637, 100,
     prob = c(rep(163.7/1637, 513), rep(245.5/1637, 197), rep(1227.8/1637, 927)),
     replace = TRUE), ]
 }

 But this leaves me with the additional task of defining each mysample[i] 
 iteration and converting it to a form on which I can apply a standard 
 statistical test like mean() or skewness() to the variable columns within 
 each subset. I have attempted to iteratively convert these lists using this 
 code:

 mat <- matrix(nrow=100,ncol=5)
 for (i in 1:length(mysamples))
 {mat[i] <- do.call('rbind',mysamples[i])}

 but running the code generates the error message: number of items to replace 
 is not a multiple of replacement length. I have tried unsuccessfully, by 
 reading many, many helpful r-help emails on this error, to understand my 
 probably obvious mistake.

 Based on the small amount that I think I know about R it seems to me that 
 sampling the data frame and containing the samples in a list is likely a 
 pretty inefficient way to do this task. Any help that any of you could 
 provide to assist me in iteratively sampling the data frame, and storing the 
 samples in a form on which I can apply other statistical tests would be 
 greatly appreciated.

 Thank you very much for taking the time to consider my questions.

 Adam




That's pretty much how I tend to do those things. What you seem to be
missing is the ?apply family:

mysamples.means <- lapply(mysamples, function(x) mean(x[, 1]))
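
The same idea can be extended to several statistics and all numeric columns at once. This is only a sketch under some assumptions: the data are simulated, and skew() and kurt() are simple moment-based helpers written here for illustration (packages such as e1071 or moments offer their own versions, which may use different definitions).

```r
# Illustrative moment-based helpers (not from any package)
skew <- function(x) { m <- mean(x); mean((x - m)^3) / mean((x - m)^2)^1.5 }
kurt <- function(x) { m <- mean(x); mean((x - m)^4) / mean((x - m)^2)^2 }

# Simulated stand-in for the real data set
set.seed(3)
dataset <- data.frame(tensile = rnorm(300, 60, 5), yield = rnorm(300, 45, 4),
                      elongation = rnorm(300, 20, 2), hardness = rnorm(300, 80, 3),
                      flag = sample(c("Y", "N", "F"), 300, replace = TRUE),
                      stringsAsFactors = FALSE)
mysamples <- lapply(1:10, function(i) {
  dataset[sample(nrow(dataset), 100, replace = TRUE), ]
})

# Summarise every numeric column of one sample
num_cols <- sapply(dataset, is.numeric)
describe <- function(d) {
  sapply(d[num_cols], function(x) {
    c(mean = mean(x), sd = sd(x), skewness = skew(x), kurtosis = kurt(x))
  })
}

stats <- lapply(mysamples, describe)   # one 4 x 4 matrix per sample
stats[[1]]

# One statistic for one column across all samples, as a plain vector
tensile_means <- sapply(mysamples, function(d) mean(d$tensile))
```

lapply() keeps the results in a list, while sapply() simplifies to a vector or matrix when the pieces have equal length.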


Hope that gets you on your way. If you want more help, I'd suggest
including an example data set in your follow-up messages.

/Gustaf

-- 
Gustaf Rydevik, M.Sci.
tel: +46(0)703 051 451
address:Essingetorget 40,112 66 Stockholm, SE
skype:gustaf_rydevik

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Question About Repeat Random Sampling from a Data Frame

2009-12-21 Thread David Winsemius


On Dec 21, 2009, at 10:12 AM, Adam Carr wrote:


Good Morning:

I've read many, many posts on the r-help system and I feel compelled
to quickly admit that I am relatively new to R. I do have several
reference books around me, but I cannot count myself among the
fortunate who seem to have strong programming intuition.


I have a data set consisting of 1637 observations of five variables:  
tensile strength, yield strength, elongation, hardness and a  
character indicator with three levels: (Y)es, (N)o, and (F)ail.


My objective is to randomly sample various subsets from this data  
set and then evaluate these subsets using simple parameters, among
them tests for normality, shape and skewness. The data set is
ordered by the character variable prior to sampling, and the samples  
are weighted to mirror representation in an overall, physical process.


I am sampling the data set using this code:

sample <- dataset[sample(1:1637, 500,
  prob = c(rep(163.7/1637, 513), rep(245.5/1637, 197), rep(1227.8/1637, 927)),
  replace = TRUE), ]


What I would like to do is iterate this process to create many (say  
500 or more) sampled sets of n=500 and then evaluate each set for  
the parameters of interest. I would actually be evaluating each  
variable within each subset for my characteristic of interest. I am  
familiar with sampling and saving single columns of data to do this  
sort of thing, but I am not sure how to accomplish this with a  
multiple-variable data set.


For example, I am currently iterating this using a clunky process:

mysamples <- list()
for (i in 1:10) {
  mysamples[[i]] <- dataset[sample(1:1637, 100,
    prob = c(rep(163.7/1637, 513), rep(245.5/1637, 197), rep(1227.8/1637, 927)),
    replace = TRUE), ]
}



Using lists to store intermediate results is not considered clunky in  
R. (You might want to provide statistical justification for the  
otherwise puzzling sampling strategy.)


But this leaves me with the additional task of defining each  
mysample[i] iteration and converting it to a form on which I can  
apply a standard statistical test like mean() or skewness() to the  
variable columns within each subset. I have attempted to iteratively  
convert these lists using this code:


mat <- matrix(nrow=100,ncol=5)
for (i in 1:length(mysamples))
{mat[i] <- do.call('rbind',mysamples[i])}


It would help if you explained what you are attempting here in  
ordinary English. There are 10 elements in mysamples, each of which is  
a 100 x 5 dataframe, and mat is just one 100 x 5 matrix, which you  
seem to be referencing incorrectly, given the fact that it has two,  
rather than one, dimension. Furthermore, those dataframes may not be  
of a uniform class, since you said you had a character variable. Do you
really want these all in a character type matrix, which would be what
is likely to happen given R's requirement that matrix elements be of
only one class? What you say above suggests not.




but running the code generates the error message: number of items to  
replace is not a multiple of replacement length.


Because of the way you are referencing the matrix, probably. If you  
wanted a 10 x 100 x 5 array, then create an array. In R, as far as I  
can tell anyway, matrices are necessarily of 2 dimensions. Tables and  
arrays can be of higher dimension.
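
For what it is worth, mat[i] with a single index addresses one cell of the matrix, which is why assigning a whole 100 x 5 object to it fails. A rough sketch of the array alternative follows. It assumes only the four numeric variables are stored (an array holds a single type), uses simulated values, and puts the sample index in the third dimension rather than the first.

```r
set.seed(11)
# Simulated numeric part of the data: 200 rows, 4 variables
num <- matrix(rnorm(200 * 4), ncol = 4,
              dimnames = list(NULL, c("tensile", "yield", "elongation", "hardness")))

n_samples <- 10
n_rows <- 100

# rows x variables x samples
arr <- array(NA_real_, dim = c(n_rows, ncol(num), n_samples),
             dimnames = list(NULL, colnames(num), NULL))

for (i in seq_len(n_samples)) {
  # fill one 100 x 4 slice per sample; note the two empty indices
  arr[, , i] <- num[sample(nrow(num), n_rows, replace = TRUE), ]
}

# Mean of each variable within each sample: a 4 x 10 matrix
col_means <- apply(arr, c(2, 3), mean)
```

A list of data frames is usually simpler when a character column has to travel with the numeric ones.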


I have tried unsuccessfully, by reading many, many helpful r-help  
emails on this error, to understand my probably obvious mistake.


Sorting out such problems is best done with smaller test objects. I  
was surprised to see that you thought it was necessary to convert  
dataframes to matrices in order to calculate descriptive statistics.  
Nothing could be farther from the truth. Furthermore, if for some
other more valid reason you wanted a list of matrices, there is a  
perfectly good function that will convert a dataframe to a matrix,  
data.matrix(), remembering of course that if there is a single  
character variable in the dataframe, that the entire matrix will be of  
type character.
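
A quick check on a toy data frame may help when choosing a conversion function. As far as I can tell, in current versions of R it is as.matrix() that yields a character matrix when any column is character, while data.matrix() returns a numeric matrix by turning character columns into integer codes. The data below are invented for illustration.

```r
df <- data.frame(tensile = c(61.2, 58.9, 63.4),
                 flag = c("Y", "N", "F"),
                 stringsAsFactors = FALSE)

m_chr <- as.matrix(df)     # everything coerced to character
m_num <- data.matrix(df)   # numeric; flag becomes integer codes
m_sub <- as.matrix(df[sapply(df, is.numeric)])   # numeric columns only

typeof(m_chr)
typeof(m_num)
dim(m_sub)
```

Dropping the non-numeric columns before converting, as in m_sub, avoids both the character coercion and the arbitrary integer codes.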


Based on the small amount that I think I know about R it seems to me  
that sampling the data frame and containing the samples in a list is  
likely a pretty inefficient way to do this task. Any help that any  
of you could provide to assist me in iteratively sampling the data  
frame, and storing the samples in a form on which I can apply other  
statistical tests would be greatly appreciated.


Thank you very much for taking the time to consider my questions.

--

David Winsemius, MD
Heritage Laboratories
West Hartford, CT



Re: [R] Question About Repeat Random Sampling from a Data Frame

2009-12-21 Thread Adam Carr
Good Afternoon Dr. Winsemius:

You ask some very good questions and make excellent points; my responses are 
below. I've tried to extract your questions and provide answers just to reduce 
the clutter.

1. You might want to provide statistical justification for the otherwise 
puzzling sampling strategy. 

I assume you mean my overall process of random sampling from a large data set. 
The data set is comprised of observations collected over four years. Although 
the basis for sampling would make a good four-frame Dilbert cartoon if it could 
be condensed enough, my answer begins with the unfortunate truth that there is 
a great divide between the technical and marketing groups at the business where 
I am employed. Many powerful marketing executives, some with technical 
backgrounds, feel that there is something fundamentally wrong with the 
manufacturing process because the data generated over the long term is not 
approximately normally distributed. My task was to examine this set of data, 
trying to keep the representation of Y, N and F approximately equal in the 
sample when compared to the large set, to determine if any subset exhibits the 
holy grail-like normal distribution characteristics. I don't feel that this is 
statistical justification, but it is the reason why I am doing this.

2. It would help if you explained what you are attempting here in ordinary 
English. There are 10 elements in mysamples, each of which is a 100 x 5 
dataframe, and mat is just one 100 x 5 matrix, which you seem to be referencing 
incorrectly, given the fact that it has two, rather than one, dimension. 
Furthermore, those dataframes may not be of a uniform class, since you said you 
had a character variable. Do you really want these all in a character type 
matrix, which would be what is likely to happen given R's requirement that 
matrix elements be of only one class? What you say above suggests not.

It seems from your response that I incorrectly assumed that a list is not the 
same as a data frame. I started down this path after reading the questions and 
answers to a similar problem where the r-help responder suggested a two step 
process and said that the list must be converted to another form in order to be 
available for analysis. 

And you are absolutely correct that I do not want each sample in a character 
type matrix. 

In plain English, I hope, I am simply trying to iterate the process of removing 
random samples from the large data set, and then saving these samples in a 
format that is available for simple analysis. For example, if I remove five 
hundred mysample sets, each of which is composed of a 100 x 5 sample of the 
large data set I am interested in determining the skewness, kurtosis, mean and 
standard deviation of each of the four numeric variables in each of the five 
hundred mysample sets.

3. Sorting out such problems is best done with smaller test objects. I was 
surprised to see...type character.

I agree. I began to do this with a small test data set but it was late last 
evening and I realized that I should ask for help before proceeding on what I 
thought might be incorrect assumptions. I clearly misunderstood that a list 
needed to be converted to a data frame in order to be available for analysis. 

Thank you for taking the time to respond. The discussion and suggestions are 
very helpful. 

Adam






From: David Winsemius dwinsem...@comcast.net

Cc: r-help@r-project.org
Sent: Mon, December 21, 2009 11:23:43 AM
Subject: Re: [R] Question About Repeat Random Sampling from a Data Frame


On Dec 21, 2009, at 10:12 AM, Adam Carr wrote:

 Good Morning:
 
 I've read many, many posts on the r-help system and I feel compelled to 
 quickly admit that I am relatively new to R. I do have several reference 
 books around me, but I cannot count myself among the fortunate who seem to 
 have strong programming intuition.
 
 I have a data set consisting of 1637 observations of five variables: tensile 
 strength, yield strength, elongation, hardness and a character indicator with 
 three levels: (Y)es, (N)o, and (F)ail.
 
 My objective is to randomly sample various subsets from this data set and 
 then evaluate these subsets using simple parameters, among them tests for 
 normality, shape and skewness. The data set is ordered by the character 
 variable prior to sampling, and the samples are weighted to mirror 
 representation in an overall, physical process.
 
 I am sampling the data set using this code:
 
 sample <- dataset[sample(1:1637, 500,
   prob = c(rep(163.7/1637, 513), rep(245.5/1637, 197), rep(1227.8/1637, 927)),
   replace = TRUE), ]
 
 What I would like to do is iterate this process to create many (say 500 or 
 more) sampled sets of n=500 and then evaluate each set for the parameters of 
 interest. I would actually be evaluating each variable within each subset for 
 my characteristic of interest. I am familiar with sampling and saving single 
 columns of data to do this sort