Re: [R] Binning question (binning rows of a data.frame according to a variable)
hadley wickham wrote: Thing is, for one reason or another, the number of marbles per bag may systematically vary with age too. However, I am not interested in the number of marbles per bag, so I would like to group the students into 8 groups such that each group has the same total number of marbles. (Each group having a different sized age range, none the less ordered by age). This sounds very much like a bin-packing problem (http://en.wikipedia.org/wiki/Bin_packing_problem), which is NP-hard. The wikipedia page mentions some heuristics you may want to look into. Hadley Man, I hate NP-hard problems! Thanks for the link :) Dan. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Binning question (binning rows of a data.frame according to a variable)
Gabor Grothendieck wrote: On 3/19/06, Dan Bolser [EMAIL PROTECTED] wrote: Gabor Grothendieck wrote: On 3/18/06, Dan Bolser [EMAIL PROTECTED] wrote: Gabor Grothendieck wrote: If you are just looking for something simple that may be good enough then assign the largest one to group 1, the second largest to group 2, ..., the 8th largest to group 8 and then start over again with group 1 and so on. # test data set.seed(1) x - sample(100, 100, rep = TRUE) xs - sort(x) g - gl(8, 1, length(xs)) # 8 groups # so that g contains the groups that correspond to xs. tapply(xs, g, sum) # 659 671 687 701 612 622 629 646 That is a fairly neat way of getting groups with a good 'approximate same size', however, in general I would like to be able to order my data in any way, and still cut it into equal 'size' groups (like quantiles for rows, but for row variable totals instead). Do you mean you want g to be in the original order of x? No. What I mean is that I want to order x by any particular variable in my data.frame, then group over x such that each group has roughly the same sum. I get the feeling I have missed a very simple trick. Suggest providing a short self contained reproducible example including input and desired output and a detailed explanation. Does my subsequent post answer your question in this regard? Seems as though 'optimality' is not possible with any reasonable approach, however, the ordering criteria may mean we can get optimal solutions. Dan. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Binning question (binning rows of a data.frame according to a variable)
Are you saying that your data might look like this ? set.seed(1) # For reproducibility only - remove this mydf - data.frame( age=round(runif(100, min=5, max=65), digits=1), nred=rpois(100, lambda=10), nblue=rpois(100, lambda=5), ngreen=rpois(100, lambda=15) ) mydf$total - rowSums( mydf[ , c(nred, nblue, ngreen)] ) head(mydf) age nred nblue ngreen total 1 20.9 11 7 1533 2 27.38 2 1828 3 39.4 11 4 823 4 59.56 5 819 5 17.1 10 3 1629 6 58.9 11 5 1430 If so, then try this : mydf - mydf[order(mydf$age), ] ## re-order by age mydf$cumtotal - cumsum(mydf$total) ## cummulative total brk.pts - seq(from=0, to=sum(mydf$total), len=9) mydf$grp - cut( mydf$cumtotal , brk.pts, labels=F ) age nred nblue ngreen total cumtotal grp 27 5.89 5 822 22 1 47 6.46 5 1324 46 1 92 8.58 4 1830 76 1 10 8.7 12 5 825 101 1 55 9.2 10 7 1330 131 1 69 10.19 3 1830 161 1 So here your 'grp' column is what you really want. Just to check tapply( mydf$total, mydf$grp, sum ) 1 2 3 4 5 6 7 8 352 363 372 387 358 377 377 370 sapply( tapply( mydf$age, mydf$grp, range ), c ) 12345678 [1,] 5.8 17.1 24.5 29.0 34.6 44.6 51.2 56.7 [2,] 16.2 24.0 28.4 33.9 44.1 51.0 55.4 64.5 The last command says that your youngest student in group 1 is aged 5.8 and oldest is aged 16.2. Taking this one step further, you can calculate the proportion of the red, green and blue for each of the 8 groups. props - mydf[ , c(nred, nblue, ngreen)]/mydf$total # proportions apply( props, 2, function(v) tapply( v, mydf$grp, mean ) ) nred nbluengreen 1 0.3459898 0.1776441 0.4763661 2 0.3280712 0.1730796 0.4988492 3 0.3061429 0.1748149 0.5190422 4 0.3759380 0.2084694 0.4155926 5 0.3548805 0.1587353 0.4863842 6 0.3106835 0.1829349 0.5063816 7 0.3525933 0.1599737 0.4874330 8 0.3133796 0.1795567 0.5070637 Hope this of some use. Regards, Adai On Sun, 2006-03-19 at 18:58 +, Dan Bolser wrote: Adaikalavan Ramasamy wrote: Do you by any chance want to sample from each group equally to get an equal representation matrix ? No. I want to make groups of equal sizes, where size isn't simply number of rows (allowing a simple 'gl'), but a sum of the variable. Thanks for the code though, it looks useful. Here is an analogy for what I want to do (in case it helps). A group of students have some bags of marbles - The marbles have different colours. Each student has one bag, but can have between 5 and 50 marbles per bag with any given strange distribution you like. I line the students up by age, and want to see if there is any systematic difference between the number of each color of marble by age (older students may find primary colours less 'cool'). Because the statistics of each individual student are bad (like the proportion of each color per student -- has a high variance) I first put all the students into 8 groups (for example). Thing is, for one reason or another, the number of marbles per bag may systematically vary with age too. However, I am not interested in the number of marbles per bag, so I would like to group the students into 8 groups such that each group has the same total number of marbles. (Each group having a different sized age range, none the less ordered by age). Then I can look at the proportion (or count) of colours in each group, and I can compare the groups or any trend accross the groups. Does that make sense? Cheers, Dan. Here is an example of the input : mydf - data.frame( value=1:100, value2=rnorm(100), grp=rep( LETTERS[1:4], c(35, 15, 30, 20) ) ) which has 35 observations from A, 15 from B, 30 from C and 20 from D. And here is a function that I wrote: sample.by.group - function(df, grp, k, replace=FALSE){ if(length(k)==1){ k - rep(k, length(unique(grp))) } if(!replace any(k table(grp))) stop( paste(Cannot take a sample larger than the population when 'replace = FALSE'.\n, Please specify a value greater than, min(table(grp)), or use 'replace = TRUE'.\n) ) ind - model.matrix( ~ -1 + grp ) w.mat - list(NULL) for(i in 1:ncol(ind)){ w.mat[[i]] - sample( which( ind[,i]==1 ), k[i], replace=replace ) } out - df[ unlist(w.mat), ] return(out) } And here are some examples of how to use it : mydf - mydf[ sample(1:nrow(mydf)), ] # scramble it for fun out1 - sample.by.group(mydf, mydf$grp, k=10 ) table( out1$grp ) out2 - sample.by.group(mydf, mydf$grp, k=50,
Re: [R] Binning question (binning rows of a data.frame according to a variable)
Adaikalavan Ramasamy wrote: Are you saying that your data might look like this ? set.seed(1) # For reproducibility only - remove this mydf - data.frame( age=round(runif(100, min=5, max=65), digits=1), nred=rpois(100, lambda=10), nblue=rpois(100, lambda=5), ngreen=rpois(100, lambda=15) ) mydf$total - rowSums( mydf[ , c(nred, nblue, ngreen)] ) head(mydf) age nred nblue ngreen total 1 20.9 11 7 1533 2 27.38 2 1828 3 39.4 11 4 823 4 59.56 5 819 5 17.1 10 3 1629 6 58.9 11 5 1430 If so, then try this : mydf - mydf[order(mydf$age), ] ## re-order by age mydf$cumtotal - cumsum(mydf$total) ## cummulative total brk.pts - seq(from=0, to=sum(mydf$total), len=9) mydf$grp - cut( mydf$cumtotal , brk.pts, labels=F ) age nred nblue ngreen total cumtotal grp 27 5.89 5 822 22 1 47 6.46 5 1324 46 1 92 8.58 4 1830 76 1 10 8.7 12 5 825 101 1 55 9.2 10 7 1330 131 1 69 10.19 3 1830 161 1 So here your 'grp' column is what you really want. Just to check tapply( mydf$total, mydf$grp, sum ) 1 2 3 4 5 6 7 8 352 363 372 387 358 377 377 370 sapply( tapply( mydf$age, mydf$grp, range ), c ) 12345678 [1,] 5.8 17.1 24.5 29.0 34.6 44.6 51.2 56.7 [2,] 16.2 24.0 28.4 33.9 44.1 51.0 55.4 64.5 The last command says that your youngest student in group 1 is aged 5.8 and oldest is aged 16.2. Taking this one step further, you can calculate the proportion of the red, green and blue for each of the 8 groups. props - mydf[ , c(nred, nblue, ngreen)]/mydf$total # proportions apply( props, 2, function(v) tapply( v, mydf$grp, mean ) ) nred nbluengreen 1 0.3459898 0.1776441 0.4763661 2 0.3280712 0.1730796 0.4988492 3 0.3061429 0.1748149 0.5190422 4 0.3759380 0.2084694 0.4155926 5 0.3548805 0.1587353 0.4863842 6 0.3106835 0.1829349 0.5063816 7 0.3525933 0.1599737 0.4874330 8 0.3133796 0.1795567 0.5070637 Hope this of some use. Yes, this is very useful! I have just one remaining question, above you take the mean of the group proportion... apply( props, 2, function(v) tapply( v, mydf$grp, mean ) ) instead of explicitly recalculating the proportion for the group (what I couldn't script real good) ... rbind( colSums(mydf[ mydf$grp==1, c(nred, nblue, ngreen)])/ sum (mydf[ mydf$grp==1, c(nred, nblue, ngreen)]), ... colSums(mydf[ mydf$grp==8, c(nred, nblue, ngreen)])/ sum (mydf[ mydf$grp==8, c(nred, nblue, ngreen)]) ) Giving (from the same seed)... nred nbluengreen [1,] 0.3465909 0.1704545 0.4829545 [2,] 0.3250689 0.1735537 0.5013774 [3,] 0.3064516 0.1774194 0.5161290 [4,] 0.3746770 0.2067183 0.4186047 [5,] 0.3519553 0.1564246 0.4916201 [6,] 0.3103448 0.1830239 0.5066313 [7,] 0.3501326 0.1644562 0.4854111 [8,] 0.3081081 0.1837838 0.5081081 Which is *slightly* different from the 'mean' approach. round(former-latter,4) nred nblue ngreen 1 -0.0006 0.0072 -0.0066 2 0.0030 -0.0005 -0.0025 3 -0.0003 -0.0026 0.0029 4 0.0013 0.0018 -0.0030 5 0.0029 0.0023 -0.0052 6 0.0003 -0.0001 -0.0002 7 0.0025 -0.0045 0.0020 8 0.0053 -0.0042 -0.0010 I know this less a question about R, and more a question about general stats, but why did you choose the former and not the latter method? Is one wrong and one right? Or did the former better fit the situation as described? Thanks for any insight into your decision, as this is something that has always puzzled me. Thanks for the beautifully clear examples! Dan. Regards, Adai On Sun, 2006-03-19 at 18:58 +, Dan Bolser wrote: Adaikalavan Ramasamy wrote: Do you by any chance want to sample from each group equally to get an equal representation matrix ? No. I want to make groups of equal sizes, where size isn't simply number of rows (allowing a simple 'gl'), but a sum of the variable. Thanks for the code though, it looks useful. Here is an analogy for what I want to do (in case it helps). A group of students have some bags of marbles - The marbles have different colours. Each student has one bag, but can have between 5 and 50 marbles per bag with any given strange distribution you like. I line the students up by age, and want to see if there is any systematic difference between the number of each color of marble by age (older students may find primary colours less 'cool'). Because the statistics of each individual student are bad (like the proportion of each color per student -- has a high variance) I first put all the students into 8 groups (for example). Thing is, for one reason or
Re: [R] Binning question (binning rows of a data.frame according to a variable)
Lets say there are 10 students in the first group and denote x1 as (say) the number of red balls for student 1 and s1 the total balls. Then I was calculating the average the proportion ( x1/s1 + x2/s2 + ... + x10/s10 ) and you were calculating the average number of events (x1+x2 +...+x10)/(s1+s2+...+s10). On second thoughts I think it is much better to calculate the a weighted average of the proportions. The weights should reflect the variance of the estimate of the proportions. ( w1*x1/s1 + w2*x2/s2 + ... + w10*x10/s10 ) On Mon, 2006-03-20 at 15:11 +, Dan Bolser wrote: Adaikalavan Ramasamy wrote: Are you saying that your data might look like this ? set.seed(1) # For reproducibility only - remove this mydf - data.frame( age=round(runif(100, min=5, max=65), digits=1), nred=rpois(100, lambda=10), nblue=rpois(100, lambda=5), ngreen=rpois(100, lambda=15) ) mydf$total - rowSums( mydf[ , c(nred, nblue, ngreen)] ) head(mydf) age nred nblue ngreen total 1 20.9 11 7 1533 2 27.38 2 1828 3 39.4 11 4 823 4 59.56 5 819 5 17.1 10 3 1629 6 58.9 11 5 1430 If so, then try this : mydf - mydf[order(mydf$age), ] ## re-order by age mydf$cumtotal - cumsum(mydf$total) ## cummulative total brk.pts - seq(from=0, to=sum(mydf$total), len=9) mydf$grp - cut( mydf$cumtotal , brk.pts, labels=F ) age nred nblue ngreen total cumtotal grp 27 5.89 5 822 22 1 47 6.46 5 1324 46 1 92 8.58 4 1830 76 1 10 8.7 12 5 825 101 1 55 9.2 10 7 1330 131 1 69 10.19 3 1830 161 1 So here your 'grp' column is what you really want. Just to check tapply( mydf$total, mydf$grp, sum ) 1 2 3 4 5 6 7 8 352 363 372 387 358 377 377 370 sapply( tapply( mydf$age, mydf$grp, range ), c ) 12345678 [1,] 5.8 17.1 24.5 29.0 34.6 44.6 51.2 56.7 [2,] 16.2 24.0 28.4 33.9 44.1 51.0 55.4 64.5 The last command says that your youngest student in group 1 is aged 5.8 and oldest is aged 16.2. Taking this one step further, you can calculate the proportion of the red, green and blue for each of the 8 groups. props - mydf[ , c(nred, nblue, ngreen)]/mydf$total # proportions apply( props, 2, function(v) tapply( v, mydf$grp, mean ) ) nred nbluengreen 1 0.3459898 0.1776441 0.4763661 2 0.3280712 0.1730796 0.4988492 3 0.3061429 0.1748149 0.5190422 4 0.3759380 0.2084694 0.4155926 5 0.3548805 0.1587353 0.4863842 6 0.3106835 0.1829349 0.5063816 7 0.3525933 0.1599737 0.4874330 8 0.3133796 0.1795567 0.5070637 Hope this of some use. Yes, this is very useful! I have just one remaining question, above you take the mean of the group proportion... apply( props, 2, function(v) tapply( v, mydf$grp, mean ) ) instead of explicitly recalculating the proportion for the group (what I couldn't script real good) ... rbind( colSums(mydf[ mydf$grp==1, c(nred, nblue, ngreen)])/ sum (mydf[ mydf$grp==1, c(nred, nblue, ngreen)]), ... colSums(mydf[ mydf$grp==8, c(nred, nblue, ngreen)])/ sum (mydf[ mydf$grp==8, c(nred, nblue, ngreen)]) ) Giving (from the same seed)... nred nbluengreen [1,] 0.3465909 0.1704545 0.4829545 [2,] 0.3250689 0.1735537 0.5013774 [3,] 0.3064516 0.1774194 0.5161290 [4,] 0.3746770 0.2067183 0.4186047 [5,] 0.3519553 0.1564246 0.4916201 [6,] 0.3103448 0.1830239 0.5066313 [7,] 0.3501326 0.1644562 0.4854111 [8,] 0.3081081 0.1837838 0.5081081 Which is *slightly* different from the 'mean' approach. round(former-latter,4) nred nblue ngreen 1 -0.0006 0.0072 -0.0066 2 0.0030 -0.0005 -0.0025 3 -0.0003 -0.0026 0.0029 4 0.0013 0.0018 -0.0030 5 0.0029 0.0023 -0.0052 6 0.0003 -0.0001 -0.0002 7 0.0025 -0.0045 0.0020 8 0.0053 -0.0042 -0.0010 I know this less a question about R, and more a question about general stats, but why did you choose the former and not the latter method? Is one wrong and one right? Or did the former better fit the situation as described? Thanks for any insight into your decision, as this is something that has always puzzled me. Thanks for the beautifully clear examples! Dan. Regards, Adai On Sun, 2006-03-19 at 18:58 +, Dan Bolser wrote: Adaikalavan Ramasamy wrote: Do you by any chance want to sample from each group equally to get an equal representation matrix ? No. I want to make groups of equal sizes, where size isn't simply number of rows (allowing a simple 'gl'), but a sum of the
Re: [R] Binning question (binning rows of a data.frame according to a variable)
[[ Please ignore the last email which was sent incomplete ]] Lets say there are 10 students in the first group and denote x1 as (say) the number of red balls for student 1 and s1 the total balls. Then I was calculating the average the proportion ( x1/s1 + x2/s2 + ... + x10/s10 ) and you were calculating the average number of events (x1+x2 +...+x10)/(s1+s2+...+s10). It is just by chance that your calculation and mine agrees. When the numbers are highly unbalanced, you may get very different results. On second thoughts I think it is much better to calculate the a weighted average of the proportions. The weights should reflect the variance of the estimate of the proportions. Assuming that your outcome of interest is proportions, the summary effect size might look something like p_hat = ( w1*p1 + w2*p2+ ... + w10*p10 ) where p1 = x1/s1 and w1=1/var(p1). You should be able to obtain the standard errors for this estimate. Using this you can build a confidence interval and see if they overlap with proportion of reds in other groups. There is a big field called meta-analysis that deals with this kind of issue. You might want to read up more about this area. However I am not too familiar with the meta-analysis of proportion Perhaps someone on the mailing list can advise you if this approach is appropriate for your situation and perhaps even some references. Regards, Adai SNIP __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Binning question (binning rows of a data.frame according to a variable)
Gabor Grothendieck wrote: On 3/18/06, Dan Bolser [EMAIL PROTECTED] wrote: Gabor Grothendieck wrote: If you are just looking for something simple that may be good enough then assign the largest one to group 1, the second largest to group 2, ..., the 8th largest to group 8 and then start over again with group 1 and so on. # test data set.seed(1) x - sample(100, 100, rep = TRUE) xs - sort(x) g - gl(8, 1, length(xs)) # 8 groups # so that g contains the groups that correspond to xs. tapply(xs, g, sum) # 659 671 687 701 612 622 629 646 That is a fairly neat way of getting groups with a good 'approximate same size', however, in general I would like to be able to order my data in any way, and still cut it into equal 'size' groups (like quantiles for rows, but for row variable totals instead). Do you mean you want g to be in the original order of x? No. What I mean is that I want to order x by any particular variable in my data.frame, then group over x such that each group has roughly the same sum. I get the feeling I have missed a very simple trick. order(x) is the permutation which sorts x and order(order(x)) is its inverse permutation so apply that to the gl expression: x - c(10, 4, 15, 2, 20, 13) g - gl(2, 1, length(x))[order(order(x))] # check it identical(tapply(sort(x), gl(2, 1, length(x)), sum), tapply(x, g, sum)) Seems it should be possible without an explicit loop (and some more 'refinement' of the final group sizes), but I can't work it out. On 3/17/06, Dan Bolser [EMAIL PROTECTED] wrote: Dan Bolser wrote: Hi, I have tuples of data in rows of a data.frame, each column is a variable for the 'items' (one per row). One of the variables is the 'size' of the item (row). I would like to cut my data.frame into groups such that each group has the same *total size*. So, assuming that we order by size, some groups should have several small items while other groups have a few large items. All the groups should have approximately the same total size. I have tried various combinations of cut, quantile, and ecdf, and I just can't work out how to do this! Any help is greatly appreciated! All the best, Dan. Perhaps there is a cleaver way, but I just wrote this in despiration... my.groups - 8 my.total - sum(my.res.1$TOT) ## The 'size' variable in my data.frame my.approx.size - my.total/ my.groups my.j - 1 my.roll - 0 my.factor - numeric() for(i in sort(my.res.1$TOT)){ my.roll - my.roll + i if (my.roll my.approx.size * my.j) my.j - my.j + 1 my.factor - append(my.factor,my.j) } my.factor - as.factor(my.factor) Then... tapply(my.factor,my.factor,length) 1 2 3 4 5 6 7 8 152 62 45 34 25 21 14 8 And... tapply(sort(my.res.1$TOT),my.factor,sum) 12345678 2880 2848 2912 2893 2832 2906 2776 3029 Which isn't bad. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Binning question (binning rows of a data.frame according to a variable)
Adaikalavan Ramasamy wrote: Do you by any chance want to sample from each group equally to get an equal representation matrix ? No. I want to make groups of equal sizes, where size isn't simply number of rows (allowing a simple 'gl'), but a sum of the variable. Thanks for the code though, it looks useful. Here is an analogy for what I want to do (in case it helps). A group of students have some bags of marbles - The marbles have different colours. Each student has one bag, but can have between 5 and 50 marbles per bag with any given strange distribution you like. I line the students up by age, and want to see if there is any systematic difference between the number of each color of marble by age (older students may find primary colours less 'cool'). Because the statistics of each individual student are bad (like the proportion of each color per student -- has a high variance) I first put all the students into 8 groups (for example). Thing is, for one reason or another, the number of marbles per bag may systematically vary with age too. However, I am not interested in the number of marbles per bag, so I would like to group the students into 8 groups such that each group has the same total number of marbles. (Each group having a different sized age range, none the less ordered by age). Then I can look at the proportion (or count) of colours in each group, and I can compare the groups or any trend accross the groups. Does that make sense? Cheers, Dan. Here is an example of the input : mydf - data.frame( value=1:100, value2=rnorm(100), grp=rep( LETTERS[1:4], c(35, 15, 30, 20) ) ) which has 35 observations from A, 15 from B, 30 from C and 20 from D. And here is a function that I wrote: sample.by.group - function(df, grp, k, replace=FALSE){ if(length(k)==1){ k - rep(k, length(unique(grp))) } if(!replace any(k table(grp))) stop( paste(Cannot take a sample larger than the population when 'replace = FALSE'.\n, Please specify a value greater than, min(table(grp)), or use 'replace = TRUE'.\n) ) ind - model.matrix( ~ -1 + grp ) w.mat - list(NULL) for(i in 1:ncol(ind)){ w.mat[[i]] - sample( which( ind[,i]==1 ), k[i], replace=replace ) } out - df[ unlist(w.mat), ] return(out) } And here are some examples of how to use it : mydf - mydf[ sample(1:nrow(mydf)), ] # scramble it for fun out1 - sample.by.group(mydf, mydf$grp, k=10 ) table( out1$grp ) out2 - sample.by.group(mydf, mydf$grp, k=50, replace=T) # ie bootstrap table( out2$grp ) and you can even do bootstrapping or sampling with weights via: out3 - sample.by.group(mydf, mydf$grp, k=c(20, 20, 30, 30), replace=T) table( out3$grp ) Regards, Adai On Fri, 2006-03-17 at 16:01 +, Dan Bolser wrote: Hi, I have tuples of data in rows of a data.frame, each column is a variable for the 'items' (one per row). One of the variables is the 'size' of the item (row). I would like to cut my data.frame into groups such that each group has the same *total size*. So, assuming that we order by size, some groups should have several small items while other groups have a few large items. All the groups should have approximately the same total size. I have tried various combinations of cut, quantile, and ecdf, and I just can't work out how to do this! Any help is greatly appreciated! All the best, Dan. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Binning question (binning rows of a data.frame according to a variable)
On 3/19/06, Dan Bolser [EMAIL PROTECTED] wrote: Gabor Grothendieck wrote: On 3/18/06, Dan Bolser [EMAIL PROTECTED] wrote: Gabor Grothendieck wrote: If you are just looking for something simple that may be good enough then assign the largest one to group 1, the second largest to group 2, ..., the 8th largest to group 8 and then start over again with group 1 and so on. # test data set.seed(1) x - sample(100, 100, rep = TRUE) xs - sort(x) g - gl(8, 1, length(xs)) # 8 groups # so that g contains the groups that correspond to xs. tapply(xs, g, sum) # 659 671 687 701 612 622 629 646 That is a fairly neat way of getting groups with a good 'approximate same size', however, in general I would like to be able to order my data in any way, and still cut it into equal 'size' groups (like quantiles for rows, but for row variable totals instead). Do you mean you want g to be in the original order of x? No. What I mean is that I want to order x by any particular variable in my data.frame, then group over x such that each group has roughly the same sum. I get the feeling I have missed a very simple trick. Suggest providing a short self contained reproducible example including input and desired output and a detailed explanation. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Binning question (binning rows of a data.frame according to a variable)
Thing is, for one reason or another, the number of marbles per bag may systematically vary with age too. However, I am not interested in the number of marbles per bag, so I would like to group the students into 8 groups such that each group has the same total number of marbles. (Each group having a different sized age range, none the less ordered by age). This sounds very much like a bin-packing problem (http://en.wikipedia.org/wiki/Bin_packing_problem), which is NP-hard. The wikipedia page mentions some heuristics you may want to look into. Hadley __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Binning question (binning rows of a data.frame according to a variable)
Gabor Grothendieck wrote: If you are just looking for something simple that may be good enough then assign the largest one to group 1, the second largest to group 2, ..., the 8th largest to group 8 and then start over again with group 1 and so on. # test data set.seed(1) x - sample(100, 100, rep = TRUE) xs - sort(x) g - gl(8, 1, length(xs)) # 8 groups # so that g contains the groups that correspond to xs. tapply(xs, g, sum) # 659 671 687 701 612 622 629 646 That is a fairly neat way of getting groups with a good 'approximate same size', however, in general I would like to be able to order my data in any way, and still cut it into equal 'size' groups (like quantiles for rows, but for row variable totals instead). Seems it should be possible without an explicit loop (and some more 'refinement' of the final group sizes), but I can't work it out. On 3/17/06, Dan Bolser [EMAIL PROTECTED] wrote: Dan Bolser wrote: Hi, I have tuples of data in rows of a data.frame, each column is a variable for the 'items' (one per row). One of the variables is the 'size' of the item (row). I would like to cut my data.frame into groups such that each group has the same *total size*. So, assuming that we order by size, some groups should have several small items while other groups have a few large items. All the groups should have approximately the same total size. I have tried various combinations of cut, quantile, and ecdf, and I just can't work out how to do this! Any help is greatly appreciated! All the best, Dan. Perhaps there is a cleaver way, but I just wrote this in despiration... my.groups - 8 my.total - sum(my.res.1$TOT) ## The 'size' variable in my data.frame my.approx.size - my.total/ my.groups my.j - 1 my.roll - 0 my.factor - numeric() for(i in sort(my.res.1$TOT)){ my.roll - my.roll + i if (my.roll my.approx.size * my.j) my.j - my.j + 1 my.factor - append(my.factor,my.j) } my.factor - as.factor(my.factor) Then... tapply(my.factor,my.factor,length) 1 2 3 4 5 6 7 8 152 62 45 34 25 21 14 8 And... tapply(sort(my.res.1$TOT),my.factor,sum) 12345678 2880 2848 2912 2893 2832 2906 2776 3029 Which isn't bad. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Binning question (binning rows of a data.frame according to a variable)
On 3/18/06, Dan Bolser [EMAIL PROTECTED] wrote: Gabor Grothendieck wrote: If you are just looking for something simple that may be good enough then assign the largest one to group 1, the second largest to group 2, ..., the 8th largest to group 8 and then start over again with group 1 and so on. # test data set.seed(1) x - sample(100, 100, rep = TRUE) xs - sort(x) g - gl(8, 1, length(xs)) # 8 groups # so that g contains the groups that correspond to xs. tapply(xs, g, sum) # 659 671 687 701 612 622 629 646 That is a fairly neat way of getting groups with a good 'approximate same size', however, in general I would like to be able to order my data in any way, and still cut it into equal 'size' groups (like quantiles for rows, but for row variable totals instead). Do you mean you want g to be in the original order of x? order(x) is the permutation which sorts x and order(order(x)) is its inverse permutation so apply that to the gl expression: x - c(10, 4, 15, 2, 20, 13) g - gl(2, 1, length(x))[order(order(x))] # check it identical(tapply(sort(x), gl(2, 1, length(x)), sum), tapply(x, g, sum)) Seems it should be possible without an explicit loop (and some more 'refinement' of the final group sizes), but I can't work it out. On 3/17/06, Dan Bolser [EMAIL PROTECTED] wrote: Dan Bolser wrote: Hi, I have tuples of data in rows of a data.frame, each column is a variable for the 'items' (one per row). One of the variables is the 'size' of the item (row). I would like to cut my data.frame into groups such that each group has the same *total size*. So, assuming that we order by size, some groups should have several small items while other groups have a few large items. All the groups should have approximately the same total size. I have tried various combinations of cut, quantile, and ecdf, and I just can't work out how to do this! Any help is greatly appreciated! All the best, Dan. Perhaps there is a cleaver way, but I just wrote this in despiration... my.groups - 8 my.total - sum(my.res.1$TOT) ## The 'size' variable in my data.frame my.approx.size - my.total/ my.groups my.j - 1 my.roll - 0 my.factor - numeric() for(i in sort(my.res.1$TOT)){ my.roll - my.roll + i if (my.roll my.approx.size * my.j) my.j - my.j + 1 my.factor - append(my.factor,my.j) } my.factor - as.factor(my.factor) Then... tapply(my.factor,my.factor,length) 1 2 3 4 5 6 7 8 152 62 45 34 25 21 14 8 And... tapply(sort(my.res.1$TOT),my.factor,sum) 12345678 2880 2848 2912 2893 2832 2906 2776 3029 Which isn't bad. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Binning question (binning rows of a data.frame according to a variable)
Do you by any chance want to sample from each group equally to get an equal representation matrix ? Here is an example of the input : mydf - data.frame( value=1:100, value2=rnorm(100), grp=rep( LETTERS[1:4], c(35, 15, 30, 20) ) ) which has 35 observations from A, 15 from B, 30 from C and 20 from D. And here is a function that I wrote: sample.by.group - function(df, grp, k, replace=FALSE){ if(length(k)==1){ k - rep(k, length(unique(grp))) } if(!replace any(k table(grp))) stop( paste(Cannot take a sample larger than the population when 'replace = FALSE'.\n, Please specify a value greater than, min(table(grp)), or use 'replace = TRUE'.\n) ) ind - model.matrix( ~ -1 + grp ) w.mat - list(NULL) for(i in 1:ncol(ind)){ w.mat[[i]] - sample( which( ind[,i]==1 ), k[i], replace=replace ) } out - df[ unlist(w.mat), ] return(out) } And here are some examples of how to use it : mydf - mydf[ sample(1:nrow(mydf)), ] # scramble it for fun out1 - sample.by.group(mydf, mydf$grp, k=10 ) table( out1$grp ) out2 - sample.by.group(mydf, mydf$grp, k=50, replace=T) # ie bootstrap table( out2$grp ) and you can even do bootstrapping or sampling with weights via: out3 - sample.by.group(mydf, mydf$grp, k=c(20, 20, 30, 30), replace=T) table( out3$grp ) Regards, Adai On Fri, 2006-03-17 at 16:01 +, Dan Bolser wrote: Hi, I have tuples of data in rows of a data.frame, each column is a variable for the 'items' (one per row). One of the variables is the 'size' of the item (row). I would like to cut my data.frame into groups such that each group has the same *total size*. So, assuming that we order by size, some groups should have several small items while other groups have a few large items. All the groups should have approximately the same total size. I have tried various combinations of cut, quantile, and ecdf, and I just can't work out how to do this! Any help is greatly appreciated! All the best, Dan. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] Binning question (binning rows of a data.frame according to a variable)
Hi, I have tuples of data in rows of a data.frame, each column is a variable for the 'items' (one per row). One of the variables is the 'size' of the item (row). I would like to cut my data.frame into groups such that each group has the same *total size*. So, assuming that we order by size, some groups should have several small items while other groups have a few large items. All the groups should have approximately the same total size. I have tried various combinations of cut, quantile, and ecdf, and I just can't work out how to do this! Any help is greatly appreciated! All the best, Dan. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Binning question (binning rows of a data.frame according to a variable)
Dan Bolser wrote: Hi, I have tuples of data in rows of a data.frame, each column is a variable for the 'items' (one per row). One of the variables is the 'size' of the item (row). I would like to cut my data.frame into groups such that each group has the same *total size*. So, assuming that we order by size, some groups should have several small items while other groups have a few large items. All the groups should have approximately the same total size. I have tried various combinations of cut, quantile, and ecdf, and I just can't work out how to do this! Any help is greatly appreciated! All the best, Dan. Perhaps there is a cleaver way, but I just wrote this in despiration... my.groups - 8 my.total - sum(my.res.1$TOT) ## The 'size' variable in my data.frame my.approx.size - my.total/ my.groups my.j - 1 my.roll - 0 my.factor - numeric() for(i in sort(my.res.1$TOT)){ my.roll - my.roll + i if (my.roll my.approx.size * my.j) my.j - my.j + 1 my.factor - append(my.factor,my.j) } my.factor - as.factor(my.factor) Then... tapply(my.factor,my.factor,length) 1 2 3 4 5 6 7 8 152 62 45 34 25 21 14 8 And... tapply(sort(my.res.1$TOT),my.factor,sum) 12345678 2880 2848 2912 2893 2832 2906 2776 3029 Which isn't bad. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Binning question (binning rows of a data.frame according to a variable)
If you are just looking for something simple that may be good enough then assign the largest one to group 1, the second largest to group 2, ..., the 8th largest to group 8 and then start over again with group 1 and so on. # test data set.seed(1) x - sample(100, 100, rep = TRUE) xs - sort(x) g - gl(8, 1, length(xs)) # 8 groups # so that g contains the groups that correspond to xs. tapply(xs, g, sum) # 659 671 687 701 612 622 629 646 On 3/17/06, Dan Bolser [EMAIL PROTECTED] wrote: Dan Bolser wrote: Hi, I have tuples of data in rows of a data.frame, each column is a variable for the 'items' (one per row). One of the variables is the 'size' of the item (row). I would like to cut my data.frame into groups such that each group has the same *total size*. So, assuming that we order by size, some groups should have several small items while other groups have a few large items. All the groups should have approximately the same total size. I have tried various combinations of cut, quantile, and ecdf, and I just can't work out how to do this! Any help is greatly appreciated! All the best, Dan. Perhaps there is a cleaver way, but I just wrote this in despiration... my.groups - 8 my.total - sum(my.res.1$TOT) ## The 'size' variable in my data.frame my.approx.size - my.total/ my.groups my.j - 1 my.roll - 0 my.factor - numeric() for(i in sort(my.res.1$TOT)){ my.roll - my.roll + i if (my.roll my.approx.size * my.j) my.j - my.j + 1 my.factor - append(my.factor,my.j) } my.factor - as.factor(my.factor) Then... tapply(my.factor,my.factor,length) 1 2 3 4 5 6 7 8 152 62 45 34 25 21 14 8 And... tapply(sort(my.res.1$TOT),my.factor,sum) 12345678 2880 2848 2912 2893 2832 2906 2776 3029 Which isn't bad. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html