Re: [R] how to control the sampling to make each sample unique

2007-05-10 Thread Vladimir Eremeev


Urania Sun wrote:
 
 I have a dataset of 10000 records which I want to use to compare two
 prediction models.
 
 I split the records into a test dataset (size = ntest) and a training
 dataset (size = ntrain). Then I run the two models.
 
 Now I want to shuffle the data and rerun the models. I want many shuffles.
 
 I know that the following command
 
 sample(1:10000, ntrain)
 
 can pick ntrain numbers from 1 to 10000. Then I just use these rows as the
 training dataset.
 
 But how can I make sure each run of sample() produces different results? I
 want the output to be unique each time.
 I tested sample() and found that it usually produces different
 combinations. But can I control it somehow? Is there a better way to write
 this?
 
 Thank you,
 
 

You could keep the numbers not yet picked in a vector, pass this vector to
sample(), and remove the picked numbers from it on each iteration.

Something like the following (not fully tested)

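index <- 1:10000
ntrain <- 4000   # example sizes; nruns * (ntrain + ntest) must not exceed 10000
ntest  <- 1000
nruns  <- 2

for (i in seq_len(nruns)) {
  train.index <- sample(index, ntrain)
  index <- index[!index %in% train.index]
  test.index <- sample(index, ntest)
  index <- index[!index %in% test.index]
  # fit and evaluate both models on train.index / test.index here
}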
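
An equivalent way to get the same kind of non-overlapping runs, sketched here
under assumed sizes (n, ntrain, ntest are illustrative, not from the original
post), is to shuffle the row numbers once and cut the permutation into blocks:

```r
# Shuffle the row numbers once, then cut the permutation into
# non-overlapping train/test blocks. Sizes are illustrative assumptions.
n      <- 10000
ntrain <- 4000
ntest  <- 1000
nruns  <- n %/% (ntrain + ntest)   # how many disjoint runs fit into n rows

perm <- sample(n)                  # random permutation of 1..n

splits <- lapply(seq_len(nruns), function(i) {
  offset <- (i - 1) * (ntrain + ntest)
  list(
    train = perm[offset + seq_len(ntrain)],
    test  = perm[offset + ntrain + seq_len(ntest)]
  )
})
```

Either way, no row is used twice across runs, so the number of runs is capped
by how many blocks of ntrain + ntest rows fit into the data.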

-- 
View this message in context: 
http://www.nabble.com/how-to-control-the-sampling-to-make-each-sample-unique-tf3719058.html#a10410229
Sent from the R help mailing list archive at Nabble.com.

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to control the sampling to make each sample unique

2007-05-10 Thread Rory Martin
I think you're asking a design question about a Monte Carlo simulation.  You
have a population (size 10,000) from which you're defining an empirical
distribution, and you're sampling from this to create pairs of training and
test samples.

You need to ensure that each specific pair of training and test samples is
disjoint, meaning no observations in common.  Normally, you wouldn't want to
make the different training samples disjoint, if that's what you meant by
them being unique.  Or were you using it to mean identical?

Regards
Rory Martin
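
A minimal sketch of the design described above, with assumed sizes (ntrain,
ntest and nruns are illustrative, and make_split() is a hypothetical helper):
each run draws its training and test rows in one call to sample(), so the pair
is disjoint, while different runs are drawn independently.

```r
# Repeated random splitting, as a sketch.
# n, ntrain, ntest and nruns are illustrative values.
n      <- 10000
ntrain <- 4000
ntest  <- 1000
nruns  <- 5

make_split <- function(n, ntrain, ntest) {
  # One draw without replacement, so train and test cannot share a row.
  picked <- sample(n, ntrain + ntest)
  list(train = picked[seq_len(ntrain)],
       test  = picked[ntrain + seq_len(ntest)])
}

splits <- replicate(nruns, make_split(n, ntrain, ntest), simplify = FALSE)

# Within each run the two index sets have no rows in common ...
overlap_within <- sapply(splits, function(s) length(intersect(s$train, s$test)))
# ... while training sets from different runs are free to overlap.
overlap_across <- length(intersect(splits[[1]]$train, splits[[2]]$train))
```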


 From: HelponR suncertain_at_gmail.com Date: Wed, 09 May 2007 17:28:19

 I have a dataset of 10000 records which I want to use to compare two
 prediction models.

 I split the records into a test dataset (size = ntest) and a training
 dataset (size = ntrain). Then I run the two models.

 Now I want to shuffle the data and rerun the models. I want many shuffles.

 I know that the following command

 sample(1:10000, ntrain)

 can pick ntrain numbers from 1 to 10000. Then I just use these rows as the
 training dataset.

 But how can I make sure each run of sample() produces different results? I
 want the output to be unique each time. I tested sample() and found that it
 usually produces different combinations. But can I control it somehow? Is
 there a better way to write this?



Re: [R] how to control the sampling to make each sample unique

2007-05-10 Thread HelponR
Yeah, I want to get all unique combinations of choosing ntest from ntotal.

For example, choosing 4000 training records from 10,000 records in total.

Suppose they are numbered 1:10000.

One obvious combination is 1:4000.

Then I run

sample(1:10000, 4000)

it may output 4000 numbers:

1, 3, 5, ..., 7999

Then I run again,

it may output another 4000 numbers:

2, 4, 6, ..., 8000

I know the number of such unique combinations is

Choose 4000 from 10,000

(I forgot how to denote this.)

Anyway, I remember that choosing m from n is computed as
T = n! / (m! (n-m)!)

where ! is the factorial.
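
In R this count is choose(n, m). For n = 10000 and m = 4000 the value has close
to three thousand decimal digits, far beyond what a double can hold, so the
sketch below works on the log scale with lchoose() instead:

```r
# Number of ways to choose m items from n, on a log scale.
n <- 10000
m <- 4000

log10_T <- lchoose(n, m) / log(10)   # log10 of choose(n, m)
digits  <- floor(log10_T) + 1        # number of decimal digits in T

# Sanity check of the formula on a case small enough to compute directly.
small_ok <- choose(5, 2) == factorial(5) / (factorial(2) * factorial(3))
```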


My concern is:
when will the sample output start to repeat?

For example, maybe the next time I run it, the output will be the same as
the first time:
1, 2, 3, ..., 4000
That's not what I want.

I hope to get T different (unique) combinations in T runs. It is fine if it
starts to repeat after T runs.

I know sample() may already work this way, but I am not sure.
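
sample() draws each time independently of earlier calls, so it does not
guarantee T distinct results in T runs. What can be quantified is how unlikely
a repeat is. Assuming every subset is equally likely, the chance that any two
of k independent draws coincide is at most k(k-1)/(2T). A rough sketch, where
k is an assumed number of shuffles:

```r
# Upper bound on P(at least two of k independent draws coincide):
#   P <= k * (k - 1) / (2 * T),  with T = choose(n, m)
# computed in log10 because T itself overflows a double.
n <- 10000
m <- 4000
k <- 1e6                                   # number of shuffles (assumed)

log10_T     <- lchoose(n, m) / log(10)
log10_bound <- log10(k) + log10(k - 1) - log10(2) - log10_T
log10_bound   # a large negative number: the bound is astronomically small
```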


Thank you!






Re: [R] how to control the sampling to make each sample unique

2007-05-10 Thread HelponR
I know. But I am curious about how sample() works.

Take a small case: choose 1 digit from 0, 1.
There are only two combinations. It is easy to check that the following can
happen consecutively:

> sample(c(0, 1), 1)
[1] 0
> sample(c(0, 1), 1)
[1] 0

That means the output did not exhaust all unique combinations before
repeating.

So I am concerned about how to control this. What I would like to see after
the control is:
> sample(c(0, 1), 1)
[1] 0
> sample(c(0, 1), 1)
[1] 1
> sample(c(0, 1), 1)
[1] 0
I don't think that is possible. Anyway, the only way to control it that I
can think of is to record every output in a file and check each new output
against the files: if it repeats any previous output, do not use it.
That is kind of clumsy. For each new combination, I have to compare it with
all previous combinations.

To compare two outputs, I first sort both sequences, then take their
difference, square it, and sum it. If the result is 0, a repetition has
occurred.
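
A rough sketch of that bookkeeping, kept in memory rather than in files.
same_set(), draw_unique() and the seen vector are hypothetical names used only
for illustration. Each accepted draw is stored as one string of its sorted row
numbers, so a new draw needs a single lookup instead of a pairwise comparison.

```r
# The sort / difference / square / sum comparison of two index vectors.
same_set <- function(a, b) {
  length(a) == length(b) && sum((sort(a) - sort(b))^2) == 0
}

# Registry of draws already used: one string of sorted indices per draw.
seen <- character(0)

draw_unique <- function(n, m) {
  # Loops forever once all choose(n, m) subsets have been used.
  repeat {
    idx <- sample(n, m)
    key <- paste(sort(idx), collapse = ",")
    if (!key %in% seen) {
      seen <<- c(seen, key)
      return(idx)
    }
  }
}

# Toy usage: choose(4, 2) is 6, so six calls must return six distinct sets.
draws <- replicate(6, draw_unique(4, 2), simplify = FALSE)
```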


Thanks all.



On 5/10/07, Rory Martin [EMAIL PROTECTED] wrote:

 sample(1:10000, 4000) returns a *random* sample of 4000
 integers from [1, 10000].  It is exceedingly unlikely
 that you will generate exactly the same set of 4000 integers twice.
 And if it did happen, it wouldn't make the slightest
 difference to your results.

 Rory
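
If "control" is taken to mean being able to reproduce a run, set.seed() is the
usual tool. A small sketch (the seed value is arbitrary):

```r
# Same seed, same sequence of draws; a different seed gives a different sequence.
set.seed(20070510)               # arbitrary seed value
first  <- sample(1:10000, 4000)

set.seed(20070510)
second <- sample(1:10000, 4000)

reproduced <- identical(first, second)   # the run can be reproduced exactly
repeats    <- anyDuplicated(first)       # 0 when no row is picked twice in one draw
```

Note that fixing the seed makes a sequence of shuffles repeatable; it does not
by itself rule out two shuffles in the sequence being the same.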





[R] how to control the sampling to make each sample unique

2007-05-09 Thread HelponR
I have a dataset of 10000 records which I want to use to compare two
prediction models.

I split the records into a test dataset (size = ntest) and a training
dataset (size = ntrain). Then I run the two models.

Now I want to shuffle the data and rerun the models. I want many shuffles.

I know that the following command

sample(1:10000, ntrain)

can pick ntrain numbers from 1 to 10000. Then I just use these rows as the
training dataset.

But how can I make sure each run of sample() produces different results? I
want the output to be unique each time.
I tested sample() and found that it usually produces different combinations.
But can I control it somehow? Is there a better way to write this?

Thank you,
