Re: [R] how to control the sampling to make each sample unique
Urania Sun wrote:

> I have a dataset of 10000 records which I want to use to compare two prediction models. I split the records into a test dataset (size = ntest) and a training dataset (size = ntrain). Then I run the two models.
>
> Now I want to shuffle the data and rerun the models. I want many shuffles.
>
> I know that the command sample((1:10000), ntrain) can pick ntrain numbers from 1 to 10000. Then I just use these rows as the training dataset.
>
> But how can I make sure each run of sample() produces different results? I want the output to be unique each time. I tested sample() and found that it usually produces different combinations. But can I control it somehow? Is there a better way to write this?
>
> Thank you,

You could keep the numbers not picked yet in a vector, use this vector with sample(), and remove the picked numbers from it iteratively. Something like the following (not fully tested):

    index <- 1:10000
    for ( blah-blah-blah ) {
        train.index <- sample(index, ntrain)
        index <- index[!index %in% train.index]   # drop the training rows just picked
        test.index <- sample(index, ntest)
        index <- index[!index %in% test.index]    # drop the test rows just picked
    }

--
View this message in context: http://www.nabble.com/how-to-control-the-sampling-to-make-each-sample-unique-tf3719058.html#a10410229
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
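A runnable sketch of the remove-as-you-go idea in the reply above may help. The sizes are assumptions: the thread only mentions 10000 records and, later, 4000 training rows, so ntest = 1000 and nruns = 2 are hypothetical values chosen so that the pool is not exhausted early. Note that this scheme makes every run disjoint from every other run, so it can supply at most floor(n / (ntrain + ntest)) runs.

```r
# Assumed sizes; only n and ntrain appear in the thread, the rest is hypothetical.
n      <- 10000
ntrain <- 4000
ntest  <- 1000
nruns  <- 2                       # floor(n / (ntrain + ntest)) runs fit at most

set.seed(1)                       # only so the sketch is reproducible
index  <- seq_len(n)              # rows not picked yet
splits <- vector("list", nruns)

for (i in seq_len(nruns)) {
  train.index <- sample(index, ntrain)
  index       <- setdiff(index, train.index)   # remove the rows just used
  test.index  <- sample(index, ntest)
  index       <- setdiff(index, test.index)
  splits[[i]] <- list(train = train.index, test = test.index)
}
```

After the loop, splits[[i]]$train and splits[[i]]$test hold the row numbers for run i, and no row is used twice across all runs.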
Re: [R] how to control the sampling to make each sample unique
I think you're asking a design question about a Monte Carlo simulation. You have a population (size 10,000) from which you're defining an empirical distribution, and you're sampling from this to create pairs of training and test samples.

You need to ensure that each specific pair of training and test samples is disjoint, meaning no observations in common. Normally, you wouldn't want to make the different training samples disjoint, if that's what you meant by them being unique. Or were you using it to mean identical?

Regards
Rory Martin

From: HelponR suncertain_at_gmail.com
Date: Wed, 09 May 2007 17:28:19

> [...] But how can I make sure each run of sample() produces different results? I want the output to be unique each time. [...]
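The design described above (each training/test pair disjoint, while different runs are free to overlap) could be sketched as follows. This is one possible reading, not the poster's code: ntest = 2000, nruns = 5, and the helper name one_split are hypothetical, and splitting a single random permutation is just one convenient way to guarantee that a pair shares no rows.

```r
n      <- 10000   # population size mentioned in the thread
ntrain <- 4000    # training size used later in the thread
ntest  <- 2000    # hypothetical; the thread never states ntest
nruns  <- 5       # hypothetical number of shuffles

set.seed(42)      # optional: fixes the sequence of shuffles

# Hypothetical helper: one random permutation, cut into train and test parts.
one_split <- function(n, ntrain, ntest) {
  perm <- sample.int(n)                        # random permutation of 1..n
  list(train = perm[seq_len(ntrain)],
       test  = perm[ntrain + seq_len(ntest)])  # the next ntest positions
}

splits <- replicate(nruns, one_split(n, ntrain, ntest), simplify = FALSE)
```

Each element of splits is an independent shuffle, so two runs will usually share some rows, which is what a Monte Carlo comparison of two models normally wants.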
Re: [R] how to control the sampling to make each sample unique
Yeah, I want to get all unique combinations of choosing ntest from ntotal, for example, choosing 4000 training data from 10,000 total data.

Suppose they are sequenced as 1:10000. One obvious combination is 1:4000.

Then I run sample((1:10000), 4000) and it may output 4000 numbers: 1, 3, 5, ..., 7999. Then I run it again, and it may output another 4000 numbers: 2, 4, 6, ..., 8000.

I know the number of such unique combinations is "choose 4000 from 10,000" (I forgot how to denote this). Anyway, I remember that choosing m from n is computed as

    T = n! / (m! (n-m)!)

where ! is factorial.

My concern is: when will the sample output start to repeat? For example, maybe the next time I run it, the output will be the same as the first time: 1, 2, 3, ..., 4000. That's not what I want. I hope to get T different (unique) combinations in T runs. It is fine if it starts to repeat after T runs.

I know sample() may already work this way, but I am not sure.

Thank you!

On 5/10/07, Rory Martin [EMAIL PROTECTED] wrote:

> [...] Normally, you wouldn't want to make the different training samples disjoint, if that's what you meant by them being unique. Or were you using it to mean identical?
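The count the poster is describing is the binomial coefficient, which R provides as choose(n, m), with lchoose(n, m) giving its natural logarithm. A small sketch, assuming the thread's figures n = 10000 and m = 4000, shows why waiting for T runs is not practical: the number is far beyond the range of a double, so the sketch works on the log scale.

```r
n <- 10000
m <- 4000

# The count itself is too large for double precision, so use its logarithm.
log10_T <- lchoose(n, m) / log(10)   # log10 of the number of distinct subsets

# A case small enough to check by hand: 5! / (2! * 3!) = 10
small <- choose(5, 2)
```

By a rough Stirling estimate log10_T comes out near 2921, that is, T has on the order of 2900 decimal digits, so a repeat among any feasible number of random draws is extremely unlikely.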
Re: [R] how to control the sampling to make each sample unique
I know. But I am curious about how sample() works.

Take a small case: choosing 1 digit from 0, 1 has only two combinations. It is easy to see that the following can happen consecutively:

    sample(c(0,1), 1)
    [1] 0
    sample(c(0,1), 1)
    [1] 0

That means the output did not exhaust all unique combinations before repeating. So I am concerned about how to control this. What I would like to see after the control is:

    sample(c(0,1), 1)
    [1] 0
    sample(c(0,1), 1)
    [1] 1
    sample(c(0,1), 1)
    [1] 0

I don't think that is possible.

Anyway, one way I can think of to control this is to record all output in files and check each new output; if it repeats any of the previous files, do not use it. That is kind of clumsy: each new combination has to be compared with all previous combinations. First I sort the sequence, then take the difference, square it, and sum it. If the result is 0, a repetition has happened.

Thanks all.

On 5/10/07, Rory Martin [EMAIL PROTECTED] wrote:

> sample(1:10000, 4000) returns a =random= sample of 4000 integers from [1, 10000]. It is exceedingly unlikely you will generate exactly the same set of 4000 integers. And if it did happen, it wouldn't make the slightest difference to your results.
>
> Rory
>
> ----- Original Message -----
> From: HelponR [EMAIL PROTECTED]
> To: Rory Martin [EMAIL PROTECTED]
> Cc: r-help@stat.math.ethz.ch
> Sent: Thursday, May 10, 2007 4:47 PM
> Subject: Re: [R] how to control the sampling to make each sample unique
>
>> Yeah, I want to get all unique combinations of choosing ntest from ntotal. [...]
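The bookkeeping described in this message (sort each draw, then compare it with every earlier draw) can be done in memory rather than with files. The following is a minimal sketch under stated assumptions: the function name unique_samples is hypothetical, each sorted draw is turned into a text key, and a draw is rejected if its key has been seen before. It illustrates the rejection idea only; for 4000 out of 10000 a rejection would essentially never trigger.

```r
# Hypothetical helper: draw `nruns` distinct subsets of size m from 1..n,
# rejecting any subset that has already been drawn.
unique_samples <- function(n, m, nruns) {
  if (nruns > choose(n, m)) stop("fewer than nruns distinct subsets exist")
  seen <- character(0)             # keys of the subsets accepted so far
  out  <- vector("list", nruns)
  i <- 0
  while (i < nruns) {
    s   <- sort(sample.int(n, m))  # sorting makes the order of a draw irrelevant
    key <- paste(s, collapse = ",")
    if (!(key %in% seen)) {        # accept only subsets not seen before
      seen <- c(seen, key)
      i <- i + 1
      out[[i]] <- s
    }
  }
  out
}

set.seed(7)
res <- unique_samples(n = 5, m = 2, nruns = 10)   # all choose(5, 2) = 10 subsets
```

For the two-value example in the message, unique_samples(n = 2, m = 1, nruns = 2) returns both possible values exactly once, in random order, which is the "no repeat before all combinations are used" behaviour asked for.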
[R] how to control the sampling to make each sample unique
I have a dataset of 10000 records which I want to use to compare two prediction models. I split the records into a test dataset (size = ntest) and a training dataset (size = ntrain). Then I run the two models.

Now I want to shuffle the data and rerun the models. I want many shuffles.

I know that the command sample((1:10000), ntrain) can pick ntrain numbers from 1 to 10000. Then I just use these rows as the training dataset.

But how can I make sure each run of sample() produces different results? I want the output to be unique each time. I tested sample() and found that it usually produces different combinations. But can I control it somehow? Is there a better way to write this?

Thank you,
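On the "can I control it" part of the question: what base R lets you control is the random number stream, through set.seed(). A short sketch, assuming the sizes used elsewhere in the thread and an arbitrary, hypothetical seed value of 123, shows that successive calls give new draws while resetting the seed replays the same ones.

```r
n      <- 10000
ntrain <- 4000

set.seed(123)                           # hypothetical seed value
first  <- sample(seq_len(n), ntrain)
second <- sample(seq_len(n), ntrain)    # a fresh draw from the same stream

set.seed(123)                           # resetting the seed replays the stream
replay <- sample(seq_len(n), ntrain)

same_as_first <- identical(first, replay)   # TRUE: same seed, same draw
test.index    <- setdiff(seq_len(n), first) # rows left over for testing
```

Setting the seed once at the top of a script makes the whole sequence of shuffles reproducible without making the individual shuffles equal to each other.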