Re: [R] Binning question (binning rows of a data.frame according to a variable)

2006-03-20 Thread Dan Bolser
hadley wickham wrote:
Thing is, for one reason or another, the number of marbles per bag may
systematically vary with age too. However, I am not interested in the
number of marbles per bag, so I would like to group the students into 8
groups such that each group has the same total number of marbles. (Each
group having a different sized age range, none the less ordered by age).
 
 
 This sounds very much like a bin-packing problem
 (http://en.wikipedia.org/wiki/Bin_packing_problem), which is NP-hard. 
 The wikipedia page mentions some heuristics you may want to look into.
 
 Hadley

Man, I hate NP-hard problems!

Thanks for the link :)

Dan.

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Binning question (binning rows of a data.frame according to a variable)

2006-03-20 Thread Dan Bolser
Gabor Grothendieck wrote:
 On 3/19/06, Dan Bolser [EMAIL PROTECTED] wrote:
 
Gabor Grothendieck wrote:

On 3/18/06, Dan Bolser [EMAIL PROTECTED] wrote:


Gabor Grothendieck wrote:


If you are just looking for something simple that may be good enough
then assign the largest one to group 1, the second largest to group 2,
..., the 8th largest to group 8 and then start over again with group 1
and so on.

# test data
set.seed(1)
x - sample(100, 100, rep = TRUE)

xs - sort(x)
g - gl(8, 1, length(xs)) # 8 groups

# so that g contains the groups that correspond to xs.

tapply(xs, g, sum)   # 659 671 687 701 612 622 629 646



That is a fairly neat way of getting groups with a good 'approximate
same size', however, in general I would like to be able to order my data
in any way, and still cut it into equal 'size' groups (like quantiles
for rows, but for row variable totals instead).


Do you mean you want g to be in the original order of x?

No. What I mean is that I want to order x by any particular variable in
my data.frame, then group over x such that each group has roughly the
same sum.

I get the feeling I have missed a very simple trick.
 
 
 
 Suggest providing a short self contained reproducible example including
 input and desired output and a detailed explanation.

Does my subsequent post answer your question in this regard? Seems as 
though 'optimality' is not possible with any reasonable approach, 
however, the ordering criteria may mean we can get optimal solutions.

Dan.

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Binning question (binning rows of a data.frame according to a variable)

2006-03-20 Thread Adaikalavan Ramasamy
Are you saying that your data might look like this ?

 set.seed(1)  # For reproducibility only - remove this
 mydf - data.frame( age=round(runif(100, min=5, max=65), digits=1),
 nred=rpois(100, lambda=10), 
 nblue=rpois(100, lambda=5), 
 ngreen=rpois(100, lambda=15) )
 mydf$total - rowSums( mydf[ , c(nred, nblue, ngreen)] )

 head(mydf)
age nred nblue ngreen total
 1 20.9   11 7 1533
 2 27.38 2 1828
 3 39.4   11 4  823
 4 59.56 5  819
 5 17.1   10 3 1629
 6 58.9   11 5 1430


If so, then try this :

 mydf  - mydf[order(mydf$age), ]  ## re-order by age
 mydf$cumtotal - cumsum(mydf$total)   ## cummulative total

 brk.pts   - seq(from=0, to=sum(mydf$total), len=9)
 mydf$grp  - cut( mydf$cumtotal , brk.pts, labels=F )

 age nred nblue ngreen total cumtotal grp
 27  5.89 5  822   22   1
 47  6.46 5 1324   46   1
 92  8.58 4 1830   76   1
 10  8.7   12 5  825  101   1
 55  9.2   10 7 1330  131   1
 69 10.19 3 1830  161   1


So here your 'grp' column is what you really want. Just to check 

 tapply( mydf$total, mydf$grp, sum )
   1   2   3   4   5   6   7   8 
 352 363 372 387 358 377 377 370 

 sapply( tapply( mydf$age, mydf$grp, range ), c )
 12345678
 [1,]  5.8 17.1 24.5 29.0 34.6 44.6 51.2 56.7
 [2,] 16.2 24.0 28.4 33.9 44.1 51.0 55.4 64.5

The last command says that your youngest student in group 1 is aged 5.8
and oldest is aged 16.2.


Taking this one step further, you can calculate the proportion of the
red, green and blue for each of the 8 groups.

 props - mydf[ , c(nred, nblue, ngreen)]/mydf$total # proportions
 apply( props, 2, function(v) tapply( v, mydf$grp, mean ) )
nred nbluengreen
 1 0.3459898 0.1776441 0.4763661
 2 0.3280712 0.1730796 0.4988492
 3 0.3061429 0.1748149 0.5190422
 4 0.3759380 0.2084694 0.4155926
 5 0.3548805 0.1587353 0.4863842
 6 0.3106835 0.1829349 0.5063816
 7 0.3525933 0.1599737 0.4874330
 8 0.3133796 0.1795567 0.5070637

Hope this of some use.

Regards, Adai



On Sun, 2006-03-19 at 18:58 +, Dan Bolser wrote:
 Adaikalavan Ramasamy wrote:
  Do you by any chance want to sample from each group equally to get an
  equal representation matrix ? 
 
 No.
 
 I want to make groups of equal sizes, where size isn't simply number of 
 rows (allowing a simple 'gl'), but a sum of the variable.
 
 Thanks for the code though, it looks useful.
 
 
 
 Here is an analogy for what I want to do (in case it helps).
 
 A group of students have some bags of marbles - The marbles have 
 different colours. Each student has one bag, but can have between 5 and 
 50 marbles per bag with any given strange distribution you like. I line 
 the students up by age, and want to see if there is any systematic 
 difference between the number of each color of marble by age (older 
 students may find primary colours less 'cool').
 
 Because the statistics of each individual student are bad (like the 
 proportion of each color per student -- has a high variance) I first put 
 all the students into 8 groups (for example).
 
 Thing is, for one reason or another, the number of marbles per bag may 
 systematically vary with age too. However, I am not interested in the 
 number of marbles per bag, so I would like to group the students into 8 
 groups such that each group has the same total number of marbles. (Each 
 group having a different sized age range, none the less ordered by age).
 
 Then I can look at the proportion (or count) of colours in each group, 
 and I can compare the groups or any trend accross the groups.
 
 Does that make sense?
 
 Cheers,
 Dan.
 
 
 
 
 
 
  Here is an example of the input :
  
   mydf - data.frame( value=1:100, value2=rnorm(100),
   grp=rep( LETTERS[1:4], c(35, 15, 30, 20) ) )
  
  which has 35 observations from A, 15 from B, 30 from C and 20 from D.
  
  
  And here is a function that I wrote:
  
   sample.by.group - function(df, grp, k, replace=FALSE){
  
 if(length(k)==1){ k - rep(k, length(unique(grp))) }
  
 if(!replace  any(k  table(grp)))
   stop( paste(Cannot take a sample larger than the population when
   'replace = FALSE'.\n, Please specify a value greater than,
   min(table(grp)), or use 'replace = TRUE'.\n) )
  

 ind   - model.matrix( ~ -1 + grp )
 w.mat - list(NULL)
 
 for(i in 1:ncol(ind)){
   w.mat[[i]] - sample( which( ind[,i]==1 ), k[i], replace=replace )
 }

 out - df[ unlist(w.mat), ]
 return(out)
   }
  
  
  And here are some examples of how to use it :
   
  mydf - mydf[ sample(1:nrow(mydf)), ]   # scramble it for fun
  
  
  out1 - sample.by.group(mydf, mydf$grp, k=10 )
  table( out1$grp )
  
   out2 - sample.by.group(mydf, mydf$grp, k=50, 

Re: [R] Binning question (binning rows of a data.frame according to a variable)

2006-03-20 Thread Dan Bolser
Adaikalavan Ramasamy wrote:
 Are you saying that your data might look like this ?
 
  set.seed(1)  # For reproducibility only - remove this
  mydf - data.frame( age=round(runif(100, min=5, max=65), digits=1),
  nred=rpois(100, lambda=10), 
  nblue=rpois(100, lambda=5), 
  ngreen=rpois(100, lambda=15) )
  mydf$total - rowSums( mydf[ , c(nred, nblue, ngreen)] )
 
  head(mydf)
 age nred nblue ngreen total
  1 20.9   11 7 1533
  2 27.38 2 1828
  3 39.4   11 4  823
  4 59.56 5  819
  5 17.1   10 3 1629
  6 58.9   11 5 1430
 
 
 If so, then try this :
 
  mydf  - mydf[order(mydf$age), ]  ## re-order by age
  mydf$cumtotal - cumsum(mydf$total)   ## cummulative total
 
  brk.pts   - seq(from=0, to=sum(mydf$total), len=9)
  mydf$grp  - cut( mydf$cumtotal , brk.pts, labels=F )
 
  age nred nblue ngreen total cumtotal grp
  27  5.89 5  822   22   1
  47  6.46 5 1324   46   1
  92  8.58 4 1830   76   1
  10  8.7   12 5  825  101   1
  55  9.2   10 7 1330  131   1
  69 10.19 3 1830  161   1
 
 
 So here your 'grp' column is what you really want. Just to check 
 
  tapply( mydf$total, mydf$grp, sum )
1   2   3   4   5   6   7   8 
  352 363 372 387 358 377 377 370 
 
  sapply( tapply( mydf$age, mydf$grp, range ), c )
  12345678
  [1,]  5.8 17.1 24.5 29.0 34.6 44.6 51.2 56.7
  [2,] 16.2 24.0 28.4 33.9 44.1 51.0 55.4 64.5
 
 The last command says that your youngest student in group 1 is aged 5.8
 and oldest is aged 16.2.
 
 
 Taking this one step further, you can calculate the proportion of the
 red, green and blue for each of the 8 groups.
 
  props - mydf[ , c(nred, nblue, ngreen)]/mydf$total # proportions
  apply( props, 2, function(v) tapply( v, mydf$grp, mean ) )
 nred nbluengreen
  1 0.3459898 0.1776441 0.4763661
  2 0.3280712 0.1730796 0.4988492
  3 0.3061429 0.1748149 0.5190422
  4 0.3759380 0.2084694 0.4155926
  5 0.3548805 0.1587353 0.4863842
  6 0.3106835 0.1829349 0.5063816
  7 0.3525933 0.1599737 0.4874330
  8 0.3133796 0.1795567 0.5070637
 
 Hope this of some use.

Yes, this is very useful! I have just one remaining question, above you 
take the mean of the group proportion...

apply( props, 2, function(v) tapply( v, mydf$grp, mean ) )


instead of explicitly recalculating the proportion for the group (what I 
couldn't script real good) ...

rbind(
   colSums(mydf[ mydf$grp==1, c(nred, nblue, ngreen)])/
  sum (mydf[ mydf$grp==1, c(nred, nblue, ngreen)]),
   ...
   colSums(mydf[ mydf$grp==8, c(nred, nblue, ngreen)])/
  sum (mydf[ mydf$grp==8, c(nred, nblue, ngreen)])
  )


Giving (from the same seed)...

   nred nbluengreen
[1,] 0.3465909 0.1704545 0.4829545
[2,] 0.3250689 0.1735537 0.5013774
[3,] 0.3064516 0.1774194 0.5161290
[4,] 0.3746770 0.2067183 0.4186047
[5,] 0.3519553 0.1564246 0.4916201
[6,] 0.3103448 0.1830239 0.5066313
[7,] 0.3501326 0.1644562 0.4854111
[8,] 0.3081081 0.1837838 0.5081081


Which is *slightly* different from the 'mean' approach.

  round(former-latter,4)
  nred   nblue  ngreen
1 -0.0006  0.0072 -0.0066
2  0.0030 -0.0005 -0.0025
3 -0.0003 -0.0026  0.0029
4  0.0013  0.0018 -0.0030
5  0.0029  0.0023 -0.0052
6  0.0003 -0.0001 -0.0002
7  0.0025 -0.0045  0.0020
8  0.0053 -0.0042 -0.0010


I know this less a question about R, and more a question about general 
stats, but why did you choose the former and not the latter method? Is 
one wrong and one right? Or did the former better fit the situation as 
described?

Thanks for any insight into your decision, as this is something that has 
always puzzled me.

Thanks for the beautifully clear examples!


Dan.

 
 Regards, Adai
 
 
 
 On Sun, 2006-03-19 at 18:58 +, Dan Bolser wrote:
 
Adaikalavan Ramasamy wrote:

Do you by any chance want to sample from each group equally to get an
equal representation matrix ? 

No.

I want to make groups of equal sizes, where size isn't simply number of 
rows (allowing a simple 'gl'), but a sum of the variable.

Thanks for the code though, it looks useful.



Here is an analogy for what I want to do (in case it helps).

A group of students have some bags of marbles - The marbles have 
different colours. Each student has one bag, but can have between 5 and 
50 marbles per bag with any given strange distribution you like. I line 
the students up by age, and want to see if there is any systematic 
difference between the number of each color of marble by age (older 
students may find primary colours less 'cool').

Because the statistics of each individual student are bad (like the 
proportion of each color per student -- has a high variance) I first put 
all the students into 8 groups (for example).

Thing is, for one reason or 

Re: [R] Binning question (binning rows of a data.frame according to a variable)

2006-03-20 Thread Adaikalavan Ramasamy
Lets say there are 10 students in the first group and denote x1 as (say)
the number of red balls for student 1 and s1 the total balls. Then I was
calculating the average the proportion ( x1/s1 + x2/s2 + ... + x10/s10 )
and you were calculating the average number of events (x1+x2
+...+x10)/(s1+s2+...+s10).

On second thoughts I think it is much better to calculate the a weighted
average of the proportions. The weights should reflect the variance of
the estimate of the proportions.



( w1*x1/s1 + w2*x2/s2 + ... + w10*x10/s10 )



On Mon, 2006-03-20 at 15:11 +, Dan Bolser wrote:
 Adaikalavan Ramasamy wrote:
  Are you saying that your data might look like this ?
  
   set.seed(1)  # For reproducibility only - remove this
   mydf - data.frame( age=round(runif(100, min=5, max=65), digits=1),
   nred=rpois(100, lambda=10), 
   nblue=rpois(100, lambda=5), 
   ngreen=rpois(100, lambda=15) )
   mydf$total - rowSums( mydf[ , c(nred, nblue, ngreen)] )
  
   head(mydf)
  age nred nblue ngreen total
   1 20.9   11 7 1533
   2 27.38 2 1828
   3 39.4   11 4  823
   4 59.56 5  819
   5 17.1   10 3 1629
   6 58.9   11 5 1430
  
  
  If so, then try this :
  
   mydf  - mydf[order(mydf$age), ]  ## re-order by age
   mydf$cumtotal - cumsum(mydf$total)   ## cummulative total
  
   brk.pts   - seq(from=0, to=sum(mydf$total), len=9)
   mydf$grp  - cut( mydf$cumtotal , brk.pts, labels=F )
  
   age nred nblue ngreen total cumtotal grp
   27  5.89 5  822   22   1
   47  6.46 5 1324   46   1
   92  8.58 4 1830   76   1
   10  8.7   12 5  825  101   1
   55  9.2   10 7 1330  131   1
   69 10.19 3 1830  161   1
  
  
  So here your 'grp' column is what you really want. Just to check 
  
   tapply( mydf$total, mydf$grp, sum )
 1   2   3   4   5   6   7   8 
   352 363 372 387 358 377 377 370 
  
   sapply( tapply( mydf$age, mydf$grp, range ), c )
   12345678
   [1,]  5.8 17.1 24.5 29.0 34.6 44.6 51.2 56.7
   [2,] 16.2 24.0 28.4 33.9 44.1 51.0 55.4 64.5
  
  The last command says that your youngest student in group 1 is aged 5.8
  and oldest is aged 16.2.
  
  
  Taking this one step further, you can calculate the proportion of the
  red, green and blue for each of the 8 groups.
  
   props - mydf[ , c(nred, nblue, ngreen)]/mydf$total # proportions
   apply( props, 2, function(v) tapply( v, mydf$grp, mean ) )
  nred nbluengreen
   1 0.3459898 0.1776441 0.4763661
   2 0.3280712 0.1730796 0.4988492
   3 0.3061429 0.1748149 0.5190422
   4 0.3759380 0.2084694 0.4155926
   5 0.3548805 0.1587353 0.4863842
   6 0.3106835 0.1829349 0.5063816
   7 0.3525933 0.1599737 0.4874330
   8 0.3133796 0.1795567 0.5070637
  
  Hope this of some use.
 
 Yes, this is very useful! I have just one remaining question, above you 
 take the mean of the group proportion...
 
 apply( props, 2, function(v) tapply( v, mydf$grp, mean ) )
 
 
 instead of explicitly recalculating the proportion for the group (what I 
 couldn't script real good) ...
 
 rbind(
colSums(mydf[ mydf$grp==1, c(nred, nblue, ngreen)])/
   sum (mydf[ mydf$grp==1, c(nred, nblue, ngreen)]),
...
colSums(mydf[ mydf$grp==8, c(nred, nblue, ngreen)])/
   sum (mydf[ mydf$grp==8, c(nred, nblue, ngreen)])
   )
 
 
 Giving (from the same seed)...
 
nred nbluengreen
 [1,] 0.3465909 0.1704545 0.4829545
 [2,] 0.3250689 0.1735537 0.5013774
 [3,] 0.3064516 0.1774194 0.5161290
 [4,] 0.3746770 0.2067183 0.4186047
 [5,] 0.3519553 0.1564246 0.4916201
 [6,] 0.3103448 0.1830239 0.5066313
 [7,] 0.3501326 0.1644562 0.4854111
 [8,] 0.3081081 0.1837838 0.5081081
 
 
 Which is *slightly* different from the 'mean' approach.
 
   round(former-latter,4)
   nred   nblue  ngreen
 1 -0.0006  0.0072 -0.0066
 2  0.0030 -0.0005 -0.0025
 3 -0.0003 -0.0026  0.0029
 4  0.0013  0.0018 -0.0030
 5  0.0029  0.0023 -0.0052
 6  0.0003 -0.0001 -0.0002
 7  0.0025 -0.0045  0.0020
 8  0.0053 -0.0042 -0.0010
 
 
 I know this less a question about R, and more a question about general 
 stats, but why did you choose the former and not the latter method? Is 
 one wrong and one right? Or did the former better fit the situation as 
 described?
 
 Thanks for any insight into your decision, as this is something that has 
 always puzzled me.
 
 Thanks for the beautifully clear examples!
 
 
 Dan.
 
  
  Regards, Adai
  
  
  
  On Sun, 2006-03-19 at 18:58 +, Dan Bolser wrote:
  
 Adaikalavan Ramasamy wrote:
 
 Do you by any chance want to sample from each group equally to get an
 equal representation matrix ? 
 
 No.
 
 I want to make groups of equal sizes, where size isn't simply number of 
 rows (allowing a simple 'gl'), but a sum of the 

Re: [R] Binning question (binning rows of a data.frame according to a variable)

2006-03-20 Thread Adaikalavan Ramasamy
[[ Please ignore the last email which was sent incomplete ]]

Lets say there are 10 students in the first group and denote x1 as (say)
the number of red balls for student 1 and s1 the total balls. Then I was
calculating the average the proportion ( x1/s1 + x2/s2 + ... + x10/s10 )
and you were calculating the average number of events (x1+x2
+...+x10)/(s1+s2+...+s10).

It is just by chance that your calculation and mine agrees. When the
numbers are highly unbalanced, you may get very different results.



On second thoughts I think it is much better to calculate the a weighted
average of the proportions. The weights should reflect the variance of
the estimate of the proportions. Assuming that your outcome of interest
is proportions, the summary effect size might look something like 
 
  p_hat = ( w1*p1 + w2*p2+ ... + w10*p10 ) 
 
  where p1 = x1/s1 and w1=1/var(p1).

You should be able to obtain the standard errors for this estimate. 
Using this you can build a confidence interval and see if they overlap 
with proportion of reds in other groups. 



There is a big field called meta-analysis that deals with this kind of 
issue. You might want to read up more about this area. However I am not 
too familiar with the meta-analysis of proportion

Perhaps someone on the mailing list can advise you if this approach is
appropriate for your situation and perhaps even some references.


Regards, Adai

SNIP

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Binning question (binning rows of a data.frame according to a variable)

2006-03-19 Thread Dan Bolser
Gabor Grothendieck wrote:
 On 3/18/06, Dan Bolser [EMAIL PROTECTED] wrote:
 
Gabor Grothendieck wrote:

If you are just looking for something simple that may be good enough
then assign the largest one to group 1, the second largest to group 2,
..., the 8th largest to group 8 and then start over again with group 1
and so on.

# test data
set.seed(1)
x - sample(100, 100, rep = TRUE)

xs - sort(x)
g - gl(8, 1, length(xs)) # 8 groups

# so that g contains the groups that correspond to xs.

tapply(xs, g, sum)   # 659 671 687 701 612 622 629 646



That is a fairly neat way of getting groups with a good 'approximate
same size', however, in general I would like to be able to order my data
in any way, and still cut it into equal 'size' groups (like quantiles
for rows, but for row variable totals instead).
 
 
 Do you mean you want g to be in the original order of x?  

No. What I mean is that I want to order x by any particular variable in 
my data.frame, then group over x such that each group has roughly the 
same sum.

I get the feeling I have missed a very simple trick.












 order(x)
 is the permutation which sorts x and order(order(x)) is its inverse
 permutation so apply that to the gl expression:
 
 x - c(10, 4, 15, 2, 20, 13)
 g - gl(2, 1, length(x))[order(order(x))]
 
 # check it
 identical(tapply(sort(x), gl(2, 1, length(x)), sum), tapply(x, g, sum))
 
 
Seems it should be possible without an explicit loop (and some more
'refinement' of the final group sizes), but I can't work it out.





On 3/17/06, Dan Bolser [EMAIL PROTECTED] wrote:


Dan Bolser wrote:


Hi,

I have tuples of data in rows of a data.frame, each column is a variable
for the 'items' (one per row).

One of the variables is the 'size' of the item (row).

I would like to cut my data.frame into groups such that each group has
the same *total size*. So, assuming that we order by size, some groups
should have several small items while other groups have a few large
items. All the groups should have approximately the same total size.

I have tried various combinations of cut, quantile, and ecdf, and I just
can't work out how to do this!

Any help is greatly appreciated!

All the best,
Dan.


Perhaps there is a cleaver way, but I just wrote this in despiration...


my.groups - 8

my.total -
 sum(my.res.1$TOT)   ## The 'size' variable in my data.frame

my.approx.size -
 my.total/
 my.groups

my.j - 1
my.roll - 0
my.factor - numeric()

for(i in sort(my.res.1$TOT)){

 my.roll -
   my.roll + i

 if (my.roll  my.approx.size * my.j)
   my.j - my.j + 1

 my.factor -
   append(my.factor,my.j)
}

my.factor -
 as.factor(my.factor)



Then...


tapply(my.factor,my.factor,length)

 1   2   3   4   5   6   7   8
152  62  45  34  25  21  14   8


And...


tapply(sort(my.res.1$TOT),my.factor,sum)

  12345678
2880 2848 2912 2893 2832 2906 2776 3029



Which isn't bad.


__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Binning question (binning rows of a data.frame according to a variable)

2006-03-19 Thread Dan Bolser
Adaikalavan Ramasamy wrote:
 Do you by any chance want to sample from each group equally to get an
 equal representation matrix ? 

No.

I want to make groups of equal sizes, where size isn't simply number of 
rows (allowing a simple 'gl'), but a sum of the variable.

Thanks for the code though, it looks useful.



Here is an analogy for what I want to do (in case it helps).

A group of students have some bags of marbles - The marbles have 
different colours. Each student has one bag, but can have between 5 and 
50 marbles per bag with any given strange distribution you like. I line 
the students up by age, and want to see if there is any systematic 
difference between the number of each color of marble by age (older 
students may find primary colours less 'cool').

Because the statistics of each individual student are bad (like the 
proportion of each color per student -- has a high variance) I first put 
all the students into 8 groups (for example).

Thing is, for one reason or another, the number of marbles per bag may 
systematically vary with age too. However, I am not interested in the 
number of marbles per bag, so I would like to group the students into 8 
groups such that each group has the same total number of marbles. (Each 
group having a different sized age range, none the less ordered by age).

Then I can look at the proportion (or count) of colours in each group, 
and I can compare the groups or any trend accross the groups.

Does that make sense?

Cheers,
Dan.






 Here is an example of the input :
 
  mydf - data.frame( value=1:100, value2=rnorm(100),
  grp=rep( LETTERS[1:4], c(35, 15, 30, 20) ) )
 
 which has 35 observations from A, 15 from B, 30 from C and 20 from D.
 
 
 And here is a function that I wrote:
 
  sample.by.group - function(df, grp, k, replace=FALSE){
 
if(length(k)==1){ k - rep(k, length(unique(grp))) }
 
if(!replace  any(k  table(grp)))
  stop( paste(Cannot take a sample larger than the population when
  'replace = FALSE'.\n, Please specify a value greater than,
  min(table(grp)), or use 'replace = TRUE'.\n) )
 
   
ind   - model.matrix( ~ -1 + grp )
w.mat - list(NULL)

for(i in 1:ncol(ind)){
  w.mat[[i]] - sample( which( ind[,i]==1 ), k[i], replace=replace )
}
   
out - df[ unlist(w.mat), ]
return(out)
  }
 
 
 And here are some examples of how to use it :
  
 mydf - mydf[ sample(1:nrow(mydf)), ]   # scramble it for fun
 
 
 out1 - sample.by.group(mydf, mydf$grp, k=10 )
 table( out1$grp )
 
  out2 - sample.by.group(mydf, mydf$grp, k=50, replace=T) # ie bootstrap
  table( out2$grp )
 
 and you can even do bootstrapping or sampling with weights via:
 
  out3 - sample.by.group(mydf, mydf$grp, k=c(20, 20, 30, 30), replace=T)
  table( out3$grp )
 
 
 Regards, Adai
 
 
 
 On Fri, 2006-03-17 at 16:01 +, Dan Bolser wrote:
 
Hi,

I have tuples of data in rows of a data.frame, each column is a variable 
for the 'items' (one per row).

One of the variables is the 'size' of the item (row).

I would like to cut my data.frame into groups such that each group has 
the same *total size*. So, assuming that we order by size, some groups 
should have several small items while other groups have a few large 
items. All the groups should have approximately the same total size.

I have tried various combinations of cut, quantile, and ecdf, and I just 
can't work out how to do this!

Any help is greatly appreciated!

All the best,
Dan.

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

 


__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Binning question (binning rows of a data.frame according to a variable)

2006-03-19 Thread Gabor Grothendieck
On 3/19/06, Dan Bolser [EMAIL PROTECTED] wrote:
 Gabor Grothendieck wrote:
  On 3/18/06, Dan Bolser [EMAIL PROTECTED] wrote:
 
 Gabor Grothendieck wrote:
 
 If you are just looking for something simple that may be good enough
 then assign the largest one to group 1, the second largest to group 2,
 ..., the 8th largest to group 8 and then start over again with group 1
 and so on.
 
 # test data
 set.seed(1)
 x - sample(100, 100, rep = TRUE)
 
 xs - sort(x)
 g - gl(8, 1, length(xs)) # 8 groups
 
 # so that g contains the groups that correspond to xs.
 
 tapply(xs, g, sum)   # 659 671 687 701 612 622 629 646
 
 
 
 That is a fairly neat way of getting groups with a good 'approximate
 same size', however, in general I would like to be able to order my data
 in any way, and still cut it into equal 'size' groups (like quantiles
 for rows, but for row variable totals instead).
 
 
  Do you mean you want g to be in the original order of x?

 No. What I mean is that I want to order x by any particular variable in
 my data.frame, then group over x such that each group has roughly the
 same sum.

 I get the feeling I have missed a very simple trick.


Suggest providing a short self contained reproducible example including
input and desired output and a detailed explanation.

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Binning question (binning rows of a data.frame according to a variable)

2006-03-19 Thread hadley wickham
 Thing is, for one reason or another, the number of marbles per bag may
 systematically vary with age too. However, I am not interested in the
 number of marbles per bag, so I would like to group the students into 8
 groups such that each group has the same total number of marbles. (Each
 group having a different sized age range, none the less ordered by age).

This sounds very much like a bin-packing problem
(http://en.wikipedia.org/wiki/Bin_packing_problem), which is NP-hard. 
The wikipedia page mentions some heuristics you may want to look into.

Hadley

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Binning question (binning rows of a data.frame according to a variable)

2006-03-18 Thread Dan Bolser
Gabor Grothendieck wrote:
 If you are just looking for something simple that may be good enough
 then assign the largest one to group 1, the second largest to group 2,
 ..., the 8th largest to group 8 and then start over again with group 1
 and so on.
 
 # test data
 set.seed(1)
 x - sample(100, 100, rep = TRUE)
 
 xs - sort(x)
 g - gl(8, 1, length(xs)) # 8 groups
 
 # so that g contains the groups that correspond to xs.
 
 tapply(xs, g, sum)   # 659 671 687 701 612 622 629 646
 


That is a fairly neat way of getting groups with a good 'approximate 
same size', however, in general I would like to be able to order my data 
in any way, and still cut it into equal 'size' groups (like quantiles 
for rows, but for row variable totals instead).

Seems it should be possible without an explicit loop (and some more 
'refinement' of the final group sizes), but I can't work it out.




 
 On 3/17/06, Dan Bolser [EMAIL PROTECTED] wrote:
 
Dan Bolser wrote:

Hi,

I have tuples of data in rows of a data.frame, each column is a variable
for the 'items' (one per row).

One of the variables is the 'size' of the item (row).

I would like to cut my data.frame into groups such that each group has
the same *total size*. So, assuming that we order by size, some groups
should have several small items while other groups have a few large
items. All the groups should have approximately the same total size.

I have tried various combinations of cut, quantile, and ecdf, and I just
can't work out how to do this!

Any help is greatly appreciated!

All the best,
Dan.


Perhaps there is a cleaver way, but I just wrote this in despiration...


my.groups - 8

my.total -
  sum(my.res.1$TOT)   ## The 'size' variable in my data.frame

my.approx.size -
  my.total/
  my.groups

my.j - 1
my.roll - 0
my.factor - numeric()

for(i in sort(my.res.1$TOT)){

  my.roll -
my.roll + i

  if (my.roll  my.approx.size * my.j)
my.j - my.j + 1

  my.factor -
append(my.factor,my.j)
}

my.factor -
  as.factor(my.factor)



Then...

  tapply(my.factor,my.factor,length)
  1   2   3   4   5   6   7   8
152  62  45  34  25  21  14   8


And...

  tapply(sort(my.res.1$TOT),my.factor,sum)
   12345678
2880 2848 2912 2893 2832 2906 2776 3029
 



Which isn't bad.













__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Binning question (binning rows of a data.frame according to a variable)

2006-03-18 Thread Gabor Grothendieck
On 3/18/06, Dan Bolser [EMAIL PROTECTED] wrote:
 Gabor Grothendieck wrote:
  If you are just looking for something simple that may be good enough
  then assign the largest one to group 1, the second largest to group 2,
  ..., the 8th largest to group 8 and then start over again with group 1
  and so on.
 
  # test data
  set.seed(1)
  x - sample(100, 100, rep = TRUE)
 
  xs - sort(x)
  g - gl(8, 1, length(xs)) # 8 groups
 
  # so that g contains the groups that correspond to xs.
 
  tapply(xs, g, sum)   # 659 671 687 701 612 622 629 646
 


 That is a fairly neat way of getting groups with a good 'approximate
 same size', however, in general I would like to be able to order my data
 in any way, and still cut it into equal 'size' groups (like quantiles
 for rows, but for row variable totals instead).

Do you mean you want g to be in the original order of x?  order(x)
is the permutation which sorts x and order(order(x)) is its inverse
permutation so apply that to the gl expression:

x - c(10, 4, 15, 2, 20, 13)
g - gl(2, 1, length(x))[order(order(x))]

# check it
identical(tapply(sort(x), gl(2, 1, length(x)), sum), tapply(x, g, sum))


 Seems it should be possible without an explicit loop (and some more
 'refinement' of the final group sizes), but I can't work it out.




 
  On 3/17/06, Dan Bolser [EMAIL PROTECTED] wrote:
 
 Dan Bolser wrote:
 
 Hi,
 
 I have tuples of data in rows of a data.frame, each column is a variable
 for the 'items' (one per row).
 
 One of the variables is the 'size' of the item (row).
 
 I would like to cut my data.frame into groups such that each group has
 the same *total size*. So, assuming that we order by size, some groups
 should have several small items while other groups have a few large
 items. All the groups should have approximately the same total size.
 
 I have tried various combinations of cut, quantile, and ecdf, and I just
 can't work out how to do this!
 
 Any help is greatly appreciated!
 
 All the best,
 Dan.
 
 
 Perhaps there is a cleaver way, but I just wrote this in despiration...
 
 
 my.groups - 8
 
 my.total -
   sum(my.res.1$TOT)   ## The 'size' variable in my data.frame
 
 my.approx.size -
   my.total/
   my.groups
 
 my.j - 1
 my.roll - 0
 my.factor - numeric()
 
 for(i in sort(my.res.1$TOT)){
 
   my.roll -
 my.roll + i
 
   if (my.roll  my.approx.size * my.j)
 my.j - my.j + 1
 
   my.factor -
 append(my.factor,my.j)
 }
 
 my.factor -
   as.factor(my.factor)
 
 
 
 Then...
 
   tapply(my.factor,my.factor,length)
   1   2   3   4   5   6   7   8
 152  62  45  34  25  21  14   8
 
 
 And...
 
   tapply(sort(my.res.1$TOT),my.factor,sum)
12345678
 2880 2848 2912 2893 2832 2906 2776 3029
  
 
 
 
 Which isn't bad.
 

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Binning question (binning rows of a data.frame according to a variable)

2006-03-18 Thread Adaikalavan Ramasamy
Do you by any chance want to sample from each group equally to get an
equal representation matrix ? Here is an example of the input :

 mydf - data.frame( value=1:100, value2=rnorm(100),
 grp=rep( LETTERS[1:4], c(35, 15, 30, 20) ) )

which has 35 observations from A, 15 from B, 30 from C and 20 from D.


And here is a function that I wrote:

 sample.by.group - function(df, grp, k, replace=FALSE){

   if(length(k)==1){ k - rep(k, length(unique(grp))) }

   if(!replace  any(k  table(grp)))
 stop( paste(Cannot take a sample larger than the population when
 'replace = FALSE'.\n, Please specify a value greater than,
 min(table(grp)), or use 'replace = TRUE'.\n) )

  
   ind   - model.matrix( ~ -1 + grp )
   w.mat - list(NULL)
   
   for(i in 1:ncol(ind)){
 w.mat[[i]] - sample( which( ind[,i]==1 ), k[i], replace=replace )
   }
  
   out - df[ unlist(w.mat), ]
   return(out)
 }


And here are some examples of how to use it :
 
mydf - mydf[ sample(1:nrow(mydf)), ]   # scramble it for fun


out1 - sample.by.group(mydf, mydf$grp, k=10 )
table( out1$grp )

 out2 - sample.by.group(mydf, mydf$grp, k=50, replace=T) # ie bootstrap
 table( out2$grp )

and you can even do bootstrapping or sampling with weights via:

 out3 - sample.by.group(mydf, mydf$grp, k=c(20, 20, 30, 30), replace=T)
 table( out3$grp )


Regards, Adai



On Fri, 2006-03-17 at 16:01 +, Dan Bolser wrote:
 Hi,
 
 I have tuples of data in rows of a data.frame, each column is a variable 
 for the 'items' (one per row).
 
 One of the variables is the 'size' of the item (row).
 
 I would like to cut my data.frame into groups such that each group has 
 the same *total size*. So, assuming that we order by size, some groups 
 should have several small items while other groups have a few large 
 items. All the groups should have approximately the same total size.
 
 I have tried various combinations of cut, quantile, and ecdf, and I just 
 can't work out how to do this!
 
 Any help is greatly appreciated!
 
 All the best,
 Dan.
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] Binning question (binning rows of a data.frame according to a variable)

2006-03-17 Thread Dan Bolser
Hi,

I have tuples of data in rows of a data.frame, each column is a variable 
for the 'items' (one per row).

One of the variables is the 'size' of the item (row).

I would like to cut my data.frame into groups such that each group has 
the same *total size*. So, assuming that we order by size, some groups 
should have several small items while other groups have a few large 
items. All the groups should have approximately the same total size.

I have tried various combinations of cut, quantile, and ecdf, and I just 
can't work out how to do this!

Any help is greatly appreciated!

All the best,
Dan.

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Binning question (binning rows of a data.frame according to a variable)

2006-03-17 Thread Dan Bolser
Dan Bolser wrote:
 Hi,
 
 I have tuples of data in rows of a data.frame, each column is a variable 
 for the 'items' (one per row).
 
 One of the variables is the 'size' of the item (row).
 
 I would like to cut my data.frame into groups such that each group has 
 the same *total size*. So, assuming that we order by size, some groups 
 should have several small items while other groups have a few large 
 items. All the groups should have approximately the same total size.
 
 I have tried various combinations of cut, quantile, and ecdf, and I just 
 can't work out how to do this!
 
 Any help is greatly appreciated!
 
 All the best,
 Dan.
 

Perhaps there is a cleaver way, but I just wrote this in despiration...


my.groups - 8

my.total -
   sum(my.res.1$TOT)   ## The 'size' variable in my data.frame

my.approx.size -
   my.total/
   my.groups

my.j - 1
my.roll - 0
my.factor - numeric()

for(i in sort(my.res.1$TOT)){

   my.roll -
 my.roll + i

   if (my.roll  my.approx.size * my.j)
 my.j - my.j + 1

   my.factor -
 append(my.factor,my.j)
}

my.factor -
   as.factor(my.factor)



Then...

  tapply(my.factor,my.factor,length)
   1   2   3   4   5   6   7   8
152  62  45  34  25  21  14   8


And...

  tapply(sort(my.res.1$TOT),my.factor,sum)
12345678
2880 2848 2912 2893 2832 2906 2776 3029
 



Which isn't bad.












 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Binning question (binning rows of a data.frame according to a variable)

2006-03-17 Thread Gabor Grothendieck
If you are just looking for something simple that may be good enough
then assign the largest one to group 1, the second largest to group 2,
..., the 8th largest to group 8 and then start over again with group 1
and so on.

# test data
set.seed(1)
x - sample(100, 100, rep = TRUE)

xs - sort(x)
g - gl(8, 1, length(xs)) # 8 groups

# so that g contains the groups that correspond to xs.

tapply(xs, g, sum)   # 659 671 687 701 612 622 629 646


On 3/17/06, Dan Bolser [EMAIL PROTECTED] wrote:
 Dan Bolser wrote:
  Hi,
 
  I have tuples of data in rows of a data.frame, each column is a variable
  for the 'items' (one per row).
 
  One of the variables is the 'size' of the item (row).
 
  I would like to cut my data.frame into groups such that each group has
  the same *total size*. So, assuming that we order by size, some groups
  should have several small items while other groups have a few large
  items. All the groups should have approximately the same total size.
 
  I have tried various combinations of cut, quantile, and ecdf, and I just
  can't work out how to do this!
 
  Any help is greatly appreciated!
 
  All the best,
  Dan.
 

 Perhaps there is a cleaver way, but I just wrote this in despiration...


 my.groups - 8

 my.total -
   sum(my.res.1$TOT)   ## The 'size' variable in my data.frame

 my.approx.size -
   my.total/
   my.groups

 my.j - 1
 my.roll - 0
 my.factor - numeric()

 for(i in sort(my.res.1$TOT)){

   my.roll -
 my.roll + i

   if (my.roll  my.approx.size * my.j)
 my.j - my.j + 1

   my.factor -
 append(my.factor,my.j)
 }

 my.factor -
   as.factor(my.factor)



 Then...

   tapply(my.factor,my.factor,length)
   1   2   3   4   5   6   7   8
 152  62  45  34  25  21  14   8


 And...

   tapply(sort(my.res.1$TOT),my.factor,sum)
12345678
 2880 2848 2912 2893 2832 2906 2776 3029
  



 Which isn't bad.












  __
  R-help@stat.math.ethz.ch mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide! 
  http://www.R-project.org/posting-guide.html

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html