Re: [R] Question: how to obtain the clusters of genes (basically the ones in the row dendrograms) from an object obtained by heatmap.2 function
Thanks, Michael, for your help. At least it's good to know that there is no function which does what I wanted. I will definitely try to code something that fulfills my requirements. Regards, S.

On Fri, Sep 17, 2010 at 2:32 AM, Michael Bedward michael.bedw...@gmail.com wrote:

Hello Sunny, Defining groups at a given height can be done with the cut function (see ?cut.dendrogram). Unfortunately, cut doesn't give you the option of specifying a number of clusters rather than a height, in the way that cutree does for hclust objects. Perhaps someone else here knows of a package that includes a function to do this (I did a quick sos search but didn't find anything obvious). Otherwise you could examine the cut function and try to hack a version that takes the number of groups as an argument. You can see the function's code with the command getS3method("cut", "dendrogram"). Sorry I can't help more. Being a simple soul, I find hclust objects so much easier to deal with! Michael

On 17 September 2010 11:30, Sunny Srivastava research.b...@gmail.com wrote:

Hello R-Helpers, I have a question about extracting the clusters of genes after we make the heatmap (say ht4) using the heatmap.2 function. Basically, I want to get the clusters which are shown as the row dendrogram in the heatmap. I understand that ht4$rowDendrogram is a dendrogram object and it contains details of all the nodes and branches, but let's say I want to know the number of clusters and the genes in each cluster if I terminated the tree (dendrogram) at a particular height. Also, if I know that I want 12 clusters, how do I know the height at which I should terminate the tree (or branching structure)? I am sorry if I am not clear. Please let me know if you need any further clarifications. Thanks in advance for your help. Best Regards, S. PS: I had posted this question on the Bioconductor mailing list, but no one responded with an answer to my problem (some suggested using another package). Probably this is more related to R, so I am reposting here. Below is the dump of the matrix row.scaled.fc2 and the object ht4 which was obtained by using the heatmap.2 function.
--- Dump of the matrix row.scaled.fc2 (dimension 47 x 4) for which I need to get the clusters ---

row.scaled.fc2 <- structure(c(1.28370586906697, 0.045882176168387, -1.36708804887146, 0.521861081643557, 0.931917365795697, -1.26811754842825, -0.72130612803134, 0.997233560332114, 1.10914280037357, 0.906822512746599, 0.124305385892705, 0.243716750638903, -0.81506628597585, 0.9281945227055, -1.02514155647985, -0.0148828263869010, 0.610771143828774, -1.31512789127346, -1.03419747081316, -1.37364737258546, 1.25426184614502, -0.901983912371582, 1.39208493297302, 1.46330419386939, 1.46904838309704, 1.33893188130515, 1.19407808189758, 1.2218547353343, 1.19698274357976, 1.18155526998177, 0.841732283108634, 0.747807260442244, 0.714318042078153, -1.33532716080095, -0.313607205847584, 0.355541486307312, -0.116351310506438, 0.77912190137299, 1.19372966187956, -1.46614749631243, 1.05871763558761, -0.943184299406566, 1.03714731356991, 1.25047276064487, 0.851530489918317, 0.97326112450597, 0.776853817614179, 0.254354524536168, 1.31978778177031, 1.03174081073449, 1.03284070831524, 0.653353551741362, -0.215733545477378, -0.966047927590969, 0.652368565446036, 0.536560120952493, 0.807139899513123, 1.26763097889282, 1.28335333872251, 1.45704025225707, 0.57691754078049, 1.07113369815538, 0.610158458070122, -0.762088920575592, 1.00819322156949, 1.14148232415467, 0.297815716619546, 0.143195107796418, -0.0065855621849476, 0.062650188298147, -0.177601977084224, -0.437288024655434, 0.178377570495840, 0.447251122498145, 0.400521563178456, 0.441487949431983, 0.46509369129, 0.754248218272813, 0.657576754588525, 0.832332574891687, -0.194585070239614, 1.09572866565514, 1.04256940502540, 0.583290457043162, 0.947182223637108, 0.453501818870319, 0.362539212141846, 0.64658837487362, 0.778492522245523, 0.406650195058153, -0.113538768459753, 0.596257630693658, 0.652082611403661, 0.731202922578465, -0.540351240198989, -0.280636135117373, 0.0957282195118376, -0.301771114678491, -0.319287162711085,
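A possible workaround for the question above, sketched under the assumption that heatmap.2 was called with its default clustering functions (distfun = dist, hclustfun = hclust): build the row clustering yourself so that cutree() is available, then hand the same dendrogram to heatmap.2 via Rowv. The object and matrix names come from the thread; the height and k values are illustrative only.

library(gplots)                              # provides heatmap.2

hc  <- hclust(dist(row.scaled.fc2))          # same defaults heatmap.2 uses
ht4 <- heatmap.2(row.scaled.fc2, Rowv = as.dendrogram(hc))

groups.h <- cutree(hc, h = 1.5)              # clusters formed below height 1.5
groups.k <- cutree(hc, k = 12)               # exactly 12 clusters
split(rownames(row.scaled.fc2), groups.k)    # genes in each of the 12 clusters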
Re: [R] Question: how to obtain the clusters of genes (basically the ones in the row dendrograms) from an object obtained by heatmap.2 function
Cool. If you get some code working and don't mind sharing it, please post it here. Michael

On 17 September 2010 16:49, Sunny Srivastava research.b...@gmail.com wrote:

Thanks, Michael, for your help. At least it's good to know that there is no function which does what I wanted. I will definitely try to code something that fulfills my requirements. Regards, S.
Re: [R] question about converting a matrix to a dataframe
Use as.data.frame instead. It does what you want it to do.

newdata.df <- as.data.frame(stocks1[1:5, 2:5])

Cheers, Michael

-----Original Message----- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Leigh E. Lommen Sent: 16. september 2010 17:18 To: r-help@r-project.org Subject: [R] question about converting a matrix to a dataframe

First I have a matrix called stocks1:

class(stocks1)
[1] "matrix"

Here are the first 5 rows of the last 4 columns:

stocks1[1:5, 2:5]
      [,1]  [,2]  [,3]  [,4]
[1,] 80.73 31.95  25.4 25.69
[2,] 83.66 31.95 27.12  25.2
[3,] 83.27 32.93 28.74 26.29
[4,]  83.9 34.07 29.77  26.6
[5,] 82.74 35.18 30.24 27.41

Now, why can't I convert this into a data frame? It automatically converts the columns into 1 long row:

newdata.df <- data.frame(stocks1[1:5, 2:5])
newdata.df
   X1.1  X1.2  X1.3 X1.4  X1.5  X1.6  X1.7  X1.8  X1.9 X1.10 X1.11 X1.12
1 80.73 83.66 83.27 83.9 82.74 31.95 31.95 32.93 34.07 35.18  25.4 27.12
  X1.13 X1.14 X1.15 X1.16 X1.17 X1.18 X1.19 X1.20
1 28.74 29.77 30.24 25.69  25.2 26.29  26.6 27.41

Regards, Leigh
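A runnable illustration of the suggested fix, with a stand-in matrix since the poster's stocks1 itself is not available:

# stand-in for the poster's price matrix
stocks1 <- matrix(round(runif(25, 25, 85), 2), nrow = 5, ncol = 5)

newdata.df <- as.data.frame(stocks1[1:5, 2:5])
str(newdata.df)   # a 5 x 4 data frame, one column per matrix column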
Re: [R] question about converting a matrix to a dataframe
Hi, Well, it works for me:

x <- matrix(1:20, nrow=5, ncol=4)
data.frame(x[1:5, 2:4])
  X1 X2 X3
1  6 11 16
2  7 12 17
3  8 13 18
4  9 14 19
5 10 15 20

Maybe try with as.data.frame(), or set the drop argument to FALSE:

data.frame(x[1:5, 2:4, drop=FALSE])

Not sure why it doesn't work for you. Check the output of str(stocks1). HTH, Ivan

--
Ivan CALANDRA
PhD Student
University of Hamburg
Biozentrum Grindel und Zoologisches Museum
Abt. Säugetiere
Martin-Luther-King-Platz 3
D-20146 Hamburg, GERMANY
+49(0)40 42838 6231
ivan.calan...@uni-hamburg.de
http://www.for771.uni-bonn.de
http://webapp5.rrz.uni-hamburg.de/mammals/eng/mitarbeiter.php
Re: [R] question on optim
Thanks, Ravi and Nash. I will read about the new package and may use it once I am familiar with it. I may bother both of you when I have questions; thanks for that in advance. Nan, from Montreal

Hi Nan, You can take a look at the optimx package on CRAN. John Nash and I wrote this package to help lay and sophisticated users alike. This package unifies various optimization algorithms in R for smooth, box-constrained optimization. It has features for checking objective function, gradient (and Hessian) specifications. It checks for potential problems due to poor scaling, and it checks feasibility of starting values. It provides diagnostics (KKT conditions) on whether or not a local optimum has been located. It also allows the user to run various optimization algorithms in one simple call, which is essentially identical to an optim call. This feature can be especially useful for developers to benchmark different algorithms and choose the best one for their class of problems. http://cran.r-project.org/web/packages/optimx/index.html Ravi.
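A minimal sketch of an optimx call, hedged: the argument names follow the optim-like interface described above and may differ across package versions; the objective function is a toy, not from the thread.

library(optimx)

# toy objective with its minimum at (1, 2, 3)
fn <- function(p) sum((p - c(1, 2, 3))^2)

# run several optimizers in one call and compare the results
res <- optimx(par = rep(0, 3), fn = fn,
              method = c("Nelder-Mead", "BFGS"))
res   # one row of results per method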
Re: [R] question on staRt package
On 14.09.2010 12:25, Paulo Teles wrote:

Dear Sirs, I have been using the package staRt but it has disappeared from the latest R versions. I emailed the author but he never replied. Is it possible to let me know if that package has been removed, or if it has been replaced by another, or what happened? The latest version is 1.1.12. I would like to know if I can keep using it or not. If it has been permanently removed, I should look for an alternative. Thank you very much. Paulo Teles

I cannot remember the reason why this has been archived. If it was not due to a maintainer's request to do so, then probably it was because the package no longer passed checks OK and the maintainer was unresponsive. In any case, you can get the old source package from the archives and install and check it yourself. Since the package is GPL-2 according to the DESCRIPTION file, you could take over maintainership and submit a new version to CRAN. Best, Uwe Ligges
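A hedged sketch of Uwe's suggestion to fetch the archived source package: the URL follows the usual CRAN archive layout, and 1.1.12 is the version mentioned above. Installing from source requires build tools on Windows.

url <- "https://cran.r-project.org/src/contrib/Archive/staRt/staRt_1.1.12.tar.gz"
download.file(url, destfile = "staRt_1.1.12.tar.gz")
install.packages("staRt_1.1.12.tar.gz", repos = NULL, type = "source")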
Re: [R] Question: Form a new list with the index replicated equal to the number of elements in that index
Sunny - I don't think mapply is needed:

lapply(1:length(mylist), function(x) rep(x, length(mylist[[x]])))
[[1]]
[1] 1 1 1

[[2]]
[1] 2

[[3]]
[1] 3 3

- Phil Spector, Statistical Computing Facility, Department of Statistics, UC Berkeley, spec...@stat.berkeley.edu

On Mon, 13 Sep 2010, Sunny Srivastava wrote:

Dear R-Helpers, I have a list l1 like:

l1[[1]]
[1] "a" "b" "c"
l1[[2]]
[1] "d"
l1[[3]]
[1] "e" "f"

I want an output res like:

res[[1]]
[1] 1 1 1
res[[2]]
[1] 2
res[[3]]
[1] 3 3

Essentially, I want to replicate each index equal to the number of elements present in that index. Below is what I do to accomplish this:

l1 <- list(c("a", "b", "c"), "d", c("e", "f"))
res <- mapply(rep, seq_along(l1), times = lapply(l1, length))

Is there a more elegant way of doing this (possibly using one (l/m)apply function)? Thanks in advance for the help. Best Regards, S.
Re: [R] Question: Form a new list with the index replicated equal to the number of elements in that index
Try this:

relist(rep(1:length(l1), sapply(l1, length)), l1)

-- Henrique Dallazuanna Curitiba-Paraná-Brasil 25° 25' 40" S 49° 16' 22" O
Re: [R] Question: Form a new list with the index replicated equal to the number of elements in that index
Thank you very much, list!

On Mon, Sep 13, 2010 at 5:04 PM, Phil Spector spec...@stat.berkeley.edu wrote:

Sunny - I don't think mapply is needed: lapply(1:length(mylist), function(x) rep(x, length(mylist[[x]])))
Re: [R] Question: Form a new list with the index replicated equal to the number of elements in that index
On 09/13/2010 08:41 PM, Sunny Srivastava wrote:

Is there a more elegant way of doing this (possibly using one (l/m)apply function)?

Don't know about elegance, but how about

lapply(seq_along(l1), function(i) rep(i, length(l1[[i]])))

-- Peter Dalgaard Center for Statistics, Copenhagen Business School Phone: (+45)38153501 Email: pd@cbs.dk Priv: pda...@gmail.com
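For reference, the two one-liners from this thread in consolidated, runnable form (quotes restored; in R >= 3.2.0, lengths(l1) can replace sapply(l1, length)):

l1 <- list(c("a", "b", "c"), "d", c("e", "f"))

lapply(seq_along(l1), function(i) rep(i, length(l1[[i]])))
relist(rep(seq_along(l1), sapply(l1, length)), l1)
# both return list(c(1, 1, 1), 2, c(3, 3))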
Re: [R] question on optim
Dear R-list members, I am using R 2.11.1 on Windows XP. When I try to install the package optimx through the GUI menu Packages / Install packages, this package does not appear in the list that opens up (chosen from the Austria CRAN site). The package is listed on Austria's CRAN web page, but today (8 September 2010) it does not show in the list obtained through the menu. Thank you. Paulo Barata (Rio de Janeiro - Brazil)
Re: [R] question on optim
The Windows build is not on CRAN at the moment. It might be there in a few days. In the meanwhile, you can get the Windows binary from the optimizer project on R-Forge: https://r-forge.r-project.org/R/?group_id=395 Ravi.

-----Original Message----- From: Paulo Barata [mailto:pbar...@infolink.com.br] Sent: Wednesday, September 08, 2010 4:26 PM To: Ravi Varadhan Cc: 'Hey Sky'; 'Ben Bolker'; r-h...@stat.math.ethz.ch Subject: Re: [R] question on optim
Re: [R] question on "optim"
Hey Sky heyskywalker at yahoo.com writes:

I do not know how to describe my question. I am a new user of R and wrote the following code for a dynamic labor economics model, using optim to get optimizations and parameter values. The code does not work due to the equation:

wden[,i] <- dnorm((1-regw[,i])/w[5])/w[5]

where w[5] is one of the parameters (together with vector a, b and other elements in vector w) that need to be estimated. If I delete the w[5] from the above equation, that is:

wden[,i] <- dnorm(1-regw[,i])

optim will give me the estimated parameters.

Thank you for the reproducible example! The problem is that you are setting the initial value of w[5] to zero, and then trying to divide by it ... I find that

guess <- rep(0, times=npar)
guess[16] <- 1
system.time(r1 <- optim(guess, myfunc1, data=mydata,
                        method="BFGS", hessian=TRUE,
                        control=list(trace=TRUE)))

seems to work OK (I have no idea if the answers are sensible or not ...). If you're going to be doing a lot of this it might be wise to see if you can specify the gradient of your objective function for R -- it will speed up and stabilize the fitting considerably.

By the way, you should be careful with this function: if we try this with Nelder-Mead instead, it appears to head to a set of parameters that lead to some sort of singularity in the objective function:

system.time(r2 <- optim(guess, myfunc1, data=mydata,
                        method="Nelder-Mead", hessian=FALSE,
                        control=list(trace=TRUE, maxit=5000)))
## still thinks it hasn't converged, but the objective function is
## much smaller

## plot a 'slice' through objective space where 0 corresponds to the
## fit-1 parameters and 1 corresponds to the fit-2 parameters;
## adapted from emdbook::calcslice
slicep <- seq(-0.1, 1.1, length = 400)
slicepars <- t(sapply(slicep, function(x) (1 - x) * r1$par + x * r2$par))
v <- apply(slicepars, 1, myfunc1)
plot(slicep, v, type = "l")

Ideally, you should be able to look at the parameters of fit #2 and figure out (a) what the result means in terms of labor economics and (b) how to keep the objective function from going there, or at least identifying when it does. Ben Bolker
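Ben's suggestion to supply a gradient, illustrated with the Rosenbrock example from ?optim. This is a stand-in only; the gradient of myfunc1 itself would have to be derived by hand.

fr  <- function(x) (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2   # objective
grr <- function(x) c(-2 * (1 - x[1]) - 400 * x[1] * (x[2] - x[1]^2),
                     200 * (x[2] - x[1]^2))                 # analytic gradient
optim(c(-1.2, 1), fr, grr, method = "BFGS")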
Re: [R] question on optim
Sorry, there is a typo in the following code: there is no w[5]*acwrk[,i] term in the regw equation. The right one should be as follows:

regw[,i] <- w[1] + w[2]*eta[m] + exp(w[3]+w[4]*eta[m])*actr[,i]

----- Original Message ----- From: Hey Sky heyskywal...@yahoo.com To: R r-help@r-project.org Sent: Tue, September 7, 2010 10:38:55 AM Subject: [R] question on optim

Hey, R users, I do not know how to describe my question. I am a new user of R and wrote the following code for a dynamic labor economics model, using optim to get optimizations and parameter values. The code does not work due to the equation:

wden[,i] <- dnorm((1-regw[,i])/w[5])/w[5]

where w[5] is one of the parameters (together with vector a, b and other elements in vector w) that need to be estimated. If I delete the w[5] from the above equation, that is:

wden[,i] <- dnorm(1-regw[,i])

optim will give me the estimated parameters. But the parameter w[5] is a key one for me. Now my question is: what reason leads the first equation not to work, and how can I correct it to make it work? I hope I made the question clear, and thanks for any suggestion. Nan from Montreal

# the function
myfunc1 <- function(par, data) {
  # define the parameter matrices used in the following part
  vbar2 <- matrix(0, n, nt)
  vbar3 <- matrix(0, n, nt)
  v8    <- matrix(0, n, nt)
  regw  <- matrix(0, n, nt)
  wden  <- matrix(0, n, nt)
  lia   <- matrix(0, n, nt)
  ccl   <- matrix(1, n, ns)
  eta   <- c(0, 0)

  # set up the parts for the log-likelihood
  q1  <- exp(par[1])
  pr1 <- q1/(1+q1)
  pr2 <- 1 - pr1
  eta[2] <- par[2]
  a <- par[3:6]
  b <- par[7:11]
  w <- par[12:npar]

  for (m in 1:ns) {
    for (i in 1:nt) {
      regw[,i] <- w[1] + w[2]*eta[m] + exp(w[3]+w[4]*eta[m])*actr[,i] + w[5]*acwrk[,i]
      vbar2[,i] = a[1] + eta[m] + regw[,i]*a[2] + acwrk[,i]*a[3] + actr[,i]*a[4]
      vbar3[,i] = b[1] + b[2]*eta[m] + regw[,i]*b[3] + acwrk[,i]*b[4] + actr[,i]*b[5]
      v8[,i] = 1 + exp(vbar2[,i]) + exp(vbar3[,i])
      for (j in 1:n) {
        if (home[j,i]==1) lia[j,i] = 1/v8[j,i]
        if (wrk[j,i]==1)  lia[j,i] = exp(vbar2[j,i])/v8[j,i]
        if (tr[j,i]==1)   lia[j,i] = exp(vbar3[j,i])/v8[j,i]
      }
      wden[,i] <- dnorm((1-regw[,i])/w[5])/w[5]
      ccl[,m] <- lia[,i]*ccl[,m]*wden[,i]
    }
  }
  func <- pr1*ccl[,1] + pr2*ccl[,2]
  f <- sum(log(func))
  return(-f)
}

#***
# main program: generate random data and get the optimization
#***
nt <- 16    # number of periods
ns <- 2     # number of person types
n  <- 50    # number of people
npar <- 17  # number of parameters

tr    <- matrix(0, n, nt)
wrk   <- matrix(0, n, nt)
home  <- matrix(0, n, nt)
actr  <- matrix(0, n, nt)
acwrk <- matrix(0, n, nt)

for (i in 1:nt) {
  tr[,i]   <- round(runif(n))
  home[,i] <- round(runif(n))
}
for (i in 1:nt) {
  for (k in 1:n) {
    if (tr[k,i]==1 && home[k,i]==1) home[k,i] = 0
    wrk[k,i] <- 1 - tr[k,i] - home[k,i]
  }
}
actr[,1]  <- tr[,1]
acwrk[,1] <- wrk[,1]
for (j in 2:nt) {
  actr[,j]  <- actr[,j-1] + tr[,j]
  acwrk[,j] <- acwrk[,j-1] + wrk[,j]
}

mydata <- cbind(tr, wrk, home, actr, acwrk)
guess <- rep(0, times=npar)
system.time(r1 <- optim(guess, myfunc1, data=mydata, method="BFGS", hessian=T))
r1$par
Re: [R] question on optim
Hi, I do not see how `data' is used in your objective function. The objective function is not even evaluable at the initial guess:

myfunc1(guess, mydata)
[1] NaN

I also think that some of the parameters may have to be constrained, for example, par[1] > 0. At a minimum, make sure that the objective function is correctly specified before we can start tackling other issues. Ravi.

Ravi Varadhan, Ph.D. Assistant Professor, Division of Geriatric Medicine and Gerontology School of Medicine Johns Hopkins University Ph. (410) 502-2619 email: rvarad...@jhmi.edu
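One standard way to handle the kind of positivity constraint Ravi mentions without a constrained optimizer is to optimize over the log of the parameter, which keeps it positive by construction. A hedged, self-contained toy follows (a simple normal likelihood, not the poster's model; in myfunc1, w[5] plays the role of sigma):

set.seed(1)
x <- rnorm(100, mean = 2, sd = 3)

nll <- function(par) {
  mu    <- par[1]
  sigma <- exp(par[2])                    # estimate log(sd); sd > 0 always
  -sum(dnorm(x, mu, sigma, log = TRUE))
}

fit <- optim(c(0, 0), nll, method = "BFGS")
c(mu = fit$par[1], sd = exp(fit$par[2]))  # close to (2, 3)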
Re: [R] question on optim
Thanks, Ben. After reading your email, I realized the initial value of w[5]=0 was a stupid mistake, and I have changed it. But I am sorry I cannot reproduce the result (convergence) that you get. The error message is "non-finite finite-difference value [12]". Any suggestion about it? And could you please recommend some R books on optimization, with tips for setting up gradients and other things, or common mistakes? Thanks, Nan
Re: [R] question on optim
On 10-09-07 02:48 PM, Hey Sky wrote:

Thanks, Ben. After reading your email, I realized the initial value of w[5]=0 was a stupid mistake, and I have changed it. But I am sorry I cannot reproduce the result (convergence) that you get. The error message is "non-finite finite-difference value [12]". Any suggestion about it?

When I 'fix' the objective function as you specified in your second message, I get into trouble too.

And could you please recommend some R books on optimization, with tips for setting up gradients and other things, or common mistakes? Thanks, Nan

I'm afraid I don't know of a great reference, although pp. 340-341 of http://www.math.mcmaster.ca/~bolker/emdbook/book.pdf do give some basic trouble-shooting suggestions (nothing about gradients, though, but the example in ?optim does use gradients). I would say that my best _general_ advice is to understand what all the parameters of your model mean: what are reasonable starting values and reasonable ranges? Use control=list(parscale=...) to tell R the approximate expected order of magnitude of each parameter. You may be able to keep the model from getting into trouble by using method="L-BFGS-B" and bounding the parameters within a reasonable range. For more general troubleshooting: try different optimization methods (in particular, Nelder-Mead doesn't need to compute finite differences); trap non-finite values (NA, NaN, Inf, -Inf) that occur in the function and report them and/or stop the function (see ?browser) and/or replace them with large finite values; and use control=list(trace=TRUE) or add cat() statements to your function to see where the optimizer is trying to go. Good luck, Ben Bolker
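A sketch combining the safeguards Ben lists. The wrapper and the bound/parscale values are illustrative assumptions, not from the thread; par[16] is w[5], the scale parameter in the model above, which must stay positive.

# trap non-finite objective values instead of letting optim see NaN/Inf
safe_fn <- function(par, data) {
  val <- myfunc1(par, data)
  if (!is.finite(val)) return(1e10)       # large finite penalty
  val
}

lower <- rep(-10, 17); lower[16] <- 1e-4  # keep the scale parameter positive
upper <- rep(10, 17)

r <- optim(guess, safe_fn, data = mydata, method = "L-BFGS-B",
           lower = lower, upper = upper,
           control = list(trace = TRUE, parscale = rep(1, 17)))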
Re: [R] Question regarding significance of a covariate in a coxme survival model
David Winsemius wrote:

That is different than my understanding of AIC. I thought that the AIC and BIC both took as input the difference in -2LL and then adjusted those differences for the differences in number of degrees of freedom.

David! Your words make sense to me now. Sorry for the lapse. A very smart professor took the time out to school me. I misunderstood the output from coxme. I see now that it gives the LL for the NULL model and for the model I have specified, and the AIC output is the difference between the full model and the NULL. So the numbers all make sense to me, and in fact the p-values are in agreement, in that they decrease as the specified model improves over the NULL. I am using only the AIC values and Akaike weights in my analyses, so the p-values are not my basis for reaching a conclusion. I was distressed over the seeming disagreement between the AIC, the p-values, my results using lmer (ignoring that the data were censored), and bar graphs illustrating a clear difference, when coxme was telling me otherwise. To spell it out for others like me that need to see the numbers add up...

Given this output from coxme:

                      NULL  Integrated  Penalized
Log-likelihood   -119.8470   -112.1598  -108.1663

                  Chisq    df          p   AIC   BIC
Integrated loglik 15.37  2.00 0.00045863 11.37  8.05
Penalized loglik  23.36  7.06 0.00153710  9.25 -2.49

-2*(LL) + 2*df = AIC

NULL:       -2*(-119.8470)            = 239.694
Integrated: -2*(-112.1598) + (2*2)    = 228.3196
Penalized:  -2*(-108.1663) + (2*7.06) = 230.4526

Subtract the integrated model's AIC from the NULL model's AIC and you get the stated AIC for the integrated model in the output (same for the penalized model):

239.694 - 228.3196 = 11.3744

So the larger (more positive) the AIC in the coxme output, the better that model does compared to the NULL model. Incidentally, we see that the p-value decreases with an increase in the coxme AIC, so there is no disagreement. Thank you, very smart professor! -Teresa Iglesias, Davis, CA
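The arithmetic above, checked in R (values copied from the coxme output):

ll.null <- -119.8470; ll.int <- -112.1598; ll.pen <- -108.1663

aic.null <- -2 * ll.null              # 239.694
aic.int  <- -2 * ll.int + 2 * 2.00    # 228.3196
aic.pen  <- -2 * ll.pen + 2 * 7.06    # 230.4526

aic.null - aic.int                    # 11.3744, the AIC coxme reports
aic.null - aic.pen                    # 9.2414, matching 9.25 up to rounding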
Re: [R] Question regarding significance of a covariate in a coxme survival
On 2010-08-30, C. Peng button...@hotmail.com wrote:

What statistical measure(s) tend to answer ALL(?) questions of practical interest?

None. All I had said was that significance testing doesn't really answer any questions of practical interest. Unfortunately, that doesn't mean there's something that answers all such questions. In the regression case brought up by the original poster (that was Cox regression, but the principle is the same), prediction-oriented measures such as AIC or cross-validation directly address the question of interest, if that question is predictive ability. In general, one should at the very least form confidence intervals instead of significance tests. In cases in which tests are demanded, e.g. by medical journals, replace "instead of" by "in addition to". Norm
Re: [R] Question regarding significance of a covariate in a coxme survival
What statistical measure(s) tend to answer ALL(?) questions of practical interest?
Re: [R] Question regarding significance of a covariate in a coxme survival model
The likelihood ratio test is more reliable when one model is nested in the other. This is true for your case. AIC/SBC are usually used when two models are not in a hierarchical (nested) structure. Please also note that any decision based on AIC/SBC scores is fairly subjective, since no sampling distribution can be used to make a rigorous decision. Regarding the magnitudes of the log-likelihood versus the AIC/SBC, I would say the author must have used a modified version in coxme(), since several different modified AIC/SBC scores are in use in practice. My suggestion would be to use the LR test for your case:

For the integrated likelihood:
LL.small.model = -467.3549 (including lifedxm)
LL.large.model = -471.3333 (excluding lifedxm)
DF.diff = 3 - 1 = 2
LR: -2*(-471.3333 + 467.3549) = 7.9568
p-value: 1 - pchisq(7.9568, 2) = 0.01871556

For the penalized likelihood:
LPL.small.model = -435.2096 (including lifedxm)
LPL.large.model = -436.0478 (excluding lifedxm)
DF.diff = 3 - 1 = 2
PLR: -2*(-436.0478 + 435.2096) = 1.6764
p-value: 1 - pchisq(1.6764, 2) = 0.4324883

The two likelihood methods produce different results; which one you should use depends on which likelihood makes more sense to you (or which likelihood is better). HTH
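The same test as runnable R (variable names follow the post; the second log-likelihood is the value consistent with the LR statistic quoted above):

LL.small.model <- -467.3549   # integrated loglik, including lifedxm
LL.large.model <- -471.3333   # integrated loglik, excluding lifedxm

LR <- -2 * (LL.large.model - LL.small.model)  # 7.9568
1 - pchisq(LR, df = 2)                        # 0.0187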
Re: [R] Question regarding significance of a covariate in a coxme survival model
My suggestion: If you compare model 1 or model 2 with model 0, the (penalized) likelihood ratio test is valid. If you compare model 2 with model 3, the (penalized) likelihood ratio test is invalid; you may want to use AIC/SBC to make a (subjective) decision.
Re: [R] Question regarding significance of a covariate in a coxme survival
Using a p-value to make any kind of decision is questionable to begin with, and especially unreliable in choosing covariates in regression. Old studies, e.g. by Walls and Weeks and by Bendel and Afifi, have shown that if predictive ability is the criterion of interest and one wishes to use p-values for deciding whether to include a covariate, one should set the p-value bar very high, at 0.25 or even 0.40. By contrast, methods such as AIC are aimed at avoiding overfitting, by penalizing models with large numbers of covariates. The same goes for Mallows' Cp, cross-validation, etc. So the p-value and AIC are answering quite different questions, and thus should not be expected to give the same or even similar results. But, worse than that, many point out that p-values tend not to be answering ANY question of practical interest. It's a shame that the use of p-values is so entrenched. I can expand on this, with references, if there is interest. Norm Matloff, Professor of Computer Science (formerly Statistics), University of California, Davis
Re: [R] Question regarding significance of a covariate in a coxme survival model
Christopher David Desjardins desja004 at umn.edu writes:

Hi, I am running a Cox mixed-effects hazard model using the library coxme. I am trying to model time to onset (age_sym1) of thought problems (e.g. hearing voices) (sym1). As I have siblings in my dataset, I have decided to account for this by including a random effect for family (famid). My covariate of interest is Mother's diagnosis, where 0 is bipolar, 1 is control, and 2 is major depression. I am trying to fit the following model:

thorp1 <- coxme(Surv(age_sym1, sym1) ~ lifedxm + (1 | famid), data = bip.surv)

which provides the following output:

thorp1
Cox mixed-effects model fit by maximum likelihood
Data: bip.surv
events, n = 99, 189 (3 observations deleted due to missingness)
Iterations = 10 54

                      NULL  Integrated  Penalized
Log-likelihood   -479.0372   -467.3549  -435.2096

                  Chisq    df          p    AIC     BIC
Integrated loglik 23.36  3.00 3.3897e-05  17.36    9.58
Penalized loglik  87.66 47.27 3.2374e-04  -6.88 -129.54

Model: Surv(age_sym1, sym1) ~ lifedxm + (1 | famid)
Fixed coefficients
                     coef exp(coef)  se(coef)     z      p
lifedxmCONTROL -1.0576257 0.3472794 0.3872527 -2.73 0.0063
lifedxmMAJOR   -0.6329957 0.5309987 0.3460847 -1.83 0.0670

Random effects
Group Variable  Std Dev Variance
famid Intercept 0.980   0.9757033

So it appears that there is a difference in hazard rates associated with Mother's diagnosis between BIPOLAR and CONTROL. Just to be safe, I fit a model without this covariate present and decided to compare the models with AIC and see if fit decreased with this covariate not in the model:

thorp1.n <- coxme(Surv(age_sym1, sym1) ~ (1 | famid), data = bip.surv)
thorp1.n
Cox mixed-effects model fit by maximum likelihood
Data: bip.surv
events, n = 99, 189 (3 observations deleted due to missingness)
Iterations = 10 54

                      NULL  Integrated  Penalized
Log-likelihood   -479.0372   -471.3333  -436.0478

                  Chisq    df          p    AIC     BIC
Integrated loglik 15.41  1.00  8.663e-05  13.41   10.81
Penalized loglik  85.98 49.48 0.00099915 -12.97 -141.37

Model: Surv(age_sym1, sym1) ~ (1 | famid)
Random effects
Group Variable  Std Dev  Variance
famid Intercept 1.050949 1.104495

The problem is that the AIC for the model w/o lifedxm is 13.41 and for the model w/ lifedxm is 17.36. So fit actually improved without lifedxm in the model, but really the fit is indistinguishable if I use Burnham & Anderson's (2002) criteria. So my conundrum is this: the AIC and the p-values appear to disagree. The p-value would suggest that lifedxm should be included in the model, and the AIC says it doesn't improve fit. In general, I tend to favor the AIC (I usually work with large sample sizes and multilevel models), but I am just beginning with survival models, and I am curious whether a survival-model guru might suggest if lifedxm needs to be in the model or not based on my results here. Would people generally use the AIC in this situation? Also, I can't run anova() on these models because of the random effect. I am happy to provide the data if necessary. Please cc me on a reply. Thanks, Chris

Hi Chris, Did you get an answer to why the p-value and AIC score disagreed? I am getting the same results with my own data. They're not small disagreements either. The AIC score is telling me the opposite of what the p-value and the parameter coef are saying. The p-value and the coef for the predictor variable are in agreement. I've also noticed in some published papers with tables containing p-values and AIC scores that the chisq p-value and AIC score are in opposition too. This makes me think I'm missing something and that this all actually makes sense.
However, given that AIC = -2 log L + 2K (where K is the number of parameters), and seeing as how the log-likelihood scores below are in the hundreds, shouldn't the AIC score be in the hundreds as well??

model0 (intercept-only model)
                     NULL Integrated Penalized
Log-likelihood  -119.8470  -117.9749 -113.1215

                  Chisq   df        p   AIC    BIC
Integrated loglik  3.74 1.00 0.052989  1.74   0.08
Penalized loglik  13.45 7.06 0.063794 -0.68 -12.43

Random effects
Group Variable  Std Dev   Variance
Site  Intercept 0.6399200 0.4094977

model1 (before vs after)
                     NULL Integrated Penalized
Log-likelihood  -119.8470  -112.1598 -108.1663

                  Chisq   df p AIC BIC
Integrated loglik 15.37 2.00
Re: [R] Question regarding significance of a covariate in a coxme survival model
On Aug 27, 2010, at 4:32 PM, Teresa Iglesias wrote: [...]
This makes me think I'm missing something and that this all actually makes sense. However, given that AIC = -2 log L + 2K (where K is the number of parameters), and seeing as how the log-likelihood scores below are in the hundreds, shouldn't the AIC score be in the hundreds as well?? That is different from my understanding of AIC. I thought that the AIC and BIC both took as input the difference in -2LL and then adjusted those differences for the differences in number of degrees of freedom. That perspective would inform one that the absolute value of the deviance (= -2LL) is immaterial and only differences are subject to interpretation. The p-values are calculated as Wald tests, which are basically first-order approximations and have lower credence than model-level comparisons. The OP also seemed to have ignored the fact that the penalized estimates of AIC and BIC went in the opposite direction from what he might have hoped. --
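[A minimal arithmetic check, using only the numbers printed in the thread, of the reading above: coxme's printed AIC and BIC come from the chi-square against the NULL model (chisq - 2*df and chisq - df*log(events)), not from the absolute -2*logLik, which is why they are not in the hundreds:]

ll.null <- -479.0372               # NULL log-likelihood from the printout
ll.int  <- -467.3549               # integrated log-likelihood, model with lifedxm
chisq   <- 2 * (ll.int - ll.null)  # 23.36, the printed Chisq
chisq - 2 * 3                      # 17.36, the printed "AIC" (df = 3)
chisq - 3 * log(99)                # 9.58, the printed "BIC" (99 events)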
Re: [R] question
I'm not sure I understand exactly what you're asking, but look at the truncated normal distribution. On Aug 20, 2010, at 5:13 PM, solafah bh wrote: Hello, I want to know how I can sample from the upper and lower tail of a normal distribution, in two cases: if I know the upper and lower bounds of the distribution, and if I do not. Regards
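[One standard recipe, not from the thread, for the known-bound case: inverse-CDF sampling draws from a normal restricted to a tail by sampling a uniform on the corresponding range of the CDF. The cutoff a below is a hypothetical example value:]

a <- 1.96                        # hypothetical tail boundary
u <- runif(1000, pnorm(a), 1)    # uniforms on the upper part of the CDF
x <- qnorm(u)                    # draws from N(0,1) conditioned on x > a
range(x)                         # all values exceed a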
Re: [R] question about unwanted addresses in contact list
Scott Compton wrote: Hi, It looks like about 200+ addresses from people who post on this list have been added to my contact list without my permission. Has this happened to other people? I use Mac OS. This is quite disconcerting and suggests a virus floating around on this distribution list. Very likely, your unstated mailer collects all addresses it 'sees' and puts them in its address book. If you don't like this behavior, you may be able to turn it off; but that depends on your particular mail program, and not on R.
Re: [R] question about unwanted addresses in contact list
It doesn't happen with other lists that I belong to, just the R lists. But I will check. Thanks for the tip. Best, Scott On Aug 18, 2010, at 12:23 PM, Erik Iverson wrote: [...]
Re: [R] question about unwanted addresses in contact list
The suggestion that a virus is floating around on this distribution list is baseless. All source code distributed on the list is embedded in messages in plaintext -- i.e. there is no distribution of binary files as part of the mailing list. How would a virus propagate under these circumstances? Your complaint is almost certainly the result of your email client and its particular configuration. * Sent from my BlackBerry handheld device -Original Message- From: Scott Compton scomp...@duke.edu Sender: r-help-boun...@r-project.org Date: Wed, 18 Aug 2010 12:32:15 To: Erik Iverson Cc: r-help@r-project.org Subject: Re: [R] question about unwanted addresses in contact list [...]
Re: [R] question about bayesian model selection for quantile regression
Hi All: the package MuMIn can be used to select the model based on AIC or AICc. The code is as follows:

data(Cement)
lm1 <- lm(y ~ ., data = Cement)
dd <- dredge(lm1, rank = "AIC")
print(dd)

If I want to select the model by BIC, what code do I need to use? And when selecting the best model based on AIC, what are the differences between the function dredge in package MuMIn and the function stepAIC in package MASS? Many thanks, best wishes, XIN LI
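[A minimal sketch of one likely answer, assuming dredge's rank argument accepts other information-criterion functions; BIC here is the base-R function:]

library(MuMIn)
data(Cement)
options(na.action = "na.fail")     # newer MuMIn versions insist on this before dredging
lm1 <- lm(y ~ ., data = Cement)
dd.bic <- dredge(lm1, rank = "BIC")  # rank submodels by BIC instead of the default
print(dd.bic)

[As for the second question: dredge evaluates every submodel of the global model (all-subsets selection), while stepAIC performs a greedy stepwise search, so it need not visit, or find, the overall best-scoring model.]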
Re: [R] question on contour function
Duncan and David, thank you so much. You are right. We can use

z1 <- outer(x, y, function(x, y) x^2 + 3*y^2)

rather than

xy <- meshgrid(x, y)
z2 <- xy$x^2 + 3*xy$y^2

to get the right answer. I ran this code on my computer and found that z2 is the transpose of z1. So I guess that, in order to obtain the expected result, there are at least two ways:

x <- seq(-1, 1, 0.1)
y <- seq(-1, 1, 0.1)
z <- outer(x, y, FUN = function(x, y) x^2 + 3*y^2)
contour(x, y, z, col = "blue", xlab = "x", ylab = "y")

or

require(RTOMO)
x <- seq(-1, 1, 0.1)
y <- seq(-1, 1, 0.1)
xy <- meshgrid(x, y)
z <- xy$x^2 + 3*xy$y^2
z <- t(z)
contour(x, y, z, col = "blue", xlab = "x", ylab = "y")

Of course, the first method is better since it only uses a base function. David Lee On 12 August 2010 01:54, David Winsemius dwinsem...@comcast.net wrote: [...]
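[A quick check of the transpose claim, building the meshgrid-style matrices directly with outer(), per the meshgrid definition quoted later in the thread:]

x  <- seq(-1, 1, 0.1)
y  <- seq(-1, 1, 0.1)
z1 <- outer(x, y, function(x, y) x^2 + 3*y^2)
gx <- outer(y * 0, x, "+")   # meshgrid x-matrix: x varies along columns
gy <- outer(y, x * 0, "+")   # meshgrid y-matrix: y varies along rows
z2 <- gx^2 + 3 * gy^2
all.equal(z2, t(z1))         # TRUE: z2 is the transpose of z1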
Re: [R] question about bayesian model selection for quantile regression
Hello: I am trying to use the function bms to select the best model, with the biggest posterior probability, for different quantiles. I have one return and 16 variables. However, I can only do the selection for the linear model; how do I change the code for a quantile:

bma1 = bms(rq(y ~ ., tau = 0.25, data = EQT), burn = 1000, iter = 3000, mcmc = "bd")
Error in mpparam[[1]] : subscript out of bounds

The code for the linear model is successful:

bma1 = bms(EQT, burn = 1000, iter = 3000, mcmc = "bd")

where EQT is my data.

beta.draws.bma(bma1[1:5])

How do I use Monte Carlo to select the best model for a quantile? Or which other function should be used? Best wishes, XIN LI
Re: [R] question on contour function
On 11/08/2010 11:16 AM, ba ba wrote: Dear All, I tried to plot contour lines using the R function contour, but got results which are not expected.

require(RTOMO)
x <- seq(-1, 1, 0.1)
y <- seq(-1, 1, 0.1)
xy <- meshgrid(x, y)
z <- xy$x^2 + 3*xy$y^2
contour(x, y, z, col = "blue", xlab = "x", ylab = "y")

The above code gave me the contour graph for z = 3*x^2 + y^2 rather than z = x^2 + 3*y^2. Does anyone know the reason? Thank you very much in advance. A problem in RTOMO, or a misuse of it? If I do the calculation of z as

z <- outer(x, y, function(x, y) x^2 + 3*y^2)

I get the expected result. Duncan Murdoch
Re: [R] question on contour function
On Aug 11, 2010, at 11:16 AM, ba ba wrote: Dear All, I tried to plot contour lines using the R function contour, but got results which are not expected.

require(RTOMO)
x <- seq(-1, 1, 0.1)
y <- seq(-1, 1, 0.1)
xy <- meshgrid(x, y)
z <- xy$x^2 + 3*xy$y^2
contour(x, y, z, col = "blue", xlab = "x", ylab = "y")

The above code gave me the contour graph for z = 3*x^2 + y^2 rather than z = x^2 + 3*y^2. Does anyone know the reason? Because contour was expecting a matrix of z values for z, and you gave it a list created by a function you did not understand?

meshgrid
function (a, b)
{
    return(list(x = outer(b * 0, a, FUN = "+"), y = outer(b, a * 0, FUN = "+")))
}

Instead, use the base function outer():

x <- seq(-1, 1, 0.1)
y <- seq(-1, 1, 0.1)
xy <- outer(x, y, FUN = function(x, y) x^2 + 3*y^2)
contour(x, y, xy, col = "blue", xlab = "x", ylab = "y")

-- David Winsemius, MD West Hartford, CT
Re: [R] question about SVM in e1071
Jack, sorry for the late answer. I agree that my last post is misleading. Here is a new try: Increasing the value of C (...) forces the creation of a more accurate model that may not generalise well. (Try to imagine the feature space with the two mapped sets very far from each other.) A model that fits the training data better is obtained by adding more SVs (until we get a convex hull of the data); this is done by reducing the soft margin, i.e. decreasing C (and again, that may not generalise well; maybe you can write a program with cross-validation, as sketched below). Here is another question: is the complexity of the boundary determined by the number of total SVs (bounded SVs + free SVs) or free SVs only? What do you mean by complexity of the boundary? Regards, Pau 2010/7/28 Jack Luo jluo.rh...@gmail.com Pau, Sorry for getting back to you on this again. I am getting confused about your interpretation of 3). It is obvious from your code that increasing C results in a smaller number of SVs; this seems to contradict your interpretation: Increasing the value of C (...) forces the creation of a more accurate model. A more accurate model is obtained by adding more SVs. In addition, I got to know that the number of SVs increases as C decreases because there are many bounded SVs (whose alpha = C; remember 0 < alpha <= C); those SVs with alpha smaller than C are called free SVs. Here is another question: is the complexity of the boundary determined by the number of total SVs (bounded SVs + free SVs) or free SVs only? Thanks a bunch, -Jack On Thu, Jul 15, 2010 at 4:17 AM, Pau Carrio Gaspar paucar...@gmail.com wrote: Hi Jack, on 1) and 2): they are telling you the same thing. I recommend you read the first sections of the article; it is very well written and clear. There you will read about duality. On 3): I interpret the scatter plot so: Increasing the value of C (...) forces the creation of a more accurate model. A more accurate model is obtained by adding more SVs (until we get a convex hull of the data). Hope it helps. Regards, Pau 2010/7/14 Jack Luo jluo.rh...@gmail.com Pau, Thanks a lot for your email, I found it very helpful. Please see below for my reply, thanks. -Jack On Wed, Jul 14, 2010 at 10:36 AM, Pau Carrio Gaspar paucar...@gmail.com wrote: Hello Jack, 1) Why did you think that larger C is more prone to overfitting than smaller C? There is a statement at the link http://www.dtreg.com/svm.htm: To allow some flexibility in separating the categories, SVM models have a cost parameter, C, that controls the trade-off between allowing training errors and forcing rigid margins. It creates a soft margin that permits some misclassifications. Increasing the value of C increases the cost of misclassifying points and forces the creation of a more accurate model that may not generalize well. My understanding is that this means larger C may not generalize well (prone to overfitting). 2) If you look at the formulation of the quadratic programming problem, you will see that C rules the error of the cutting plane (and overfitting). Therefore, for high C you allow the cutting plane to cut the set worse, so the SVM needs fewer points to build it. A proper explanation is in Kristin P. Bennett and Colin Campbell, Support Vector Machines: Hype or Hallelujah?, SIGKDD Explorations, 2, 2, 2000, 1-13. http://www.idi.ntnu.no/emner/it3704/lectures/papers/Bennett_2000_Support.pdf Could you be more specific about this? I don't quite understand.
3) You might find these plots useful:

library(e1071)
m1 <- matrix(c(0,0,0,1,1,2,1,2,3,2,3,3,0,1,2,3,0,1,2,3,
               1,2,3,2,3,3,0,0,0,1,1,2,4,4,4,4,0,1,2,3,
               1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,1,1,1,1,1,1,-1,-1), ncol = 3)
Y <- m1[, 3]
X <- m1[, 1:2]
df <- data.frame(X, Y)
par(mfcol = c(4, 2))
for (cost in c(1e-3, 1e-2, 1e-1, 1e0, 1e+1, 1e+2, 1e+3)) {
  # cost <- 1
  model.svm <- svm(Y ~ ., data = df, type = "C-classification",
                   kernel = "linear", cost = cost, scale = FALSE)
  # print(model.svm$SV)
  plot(x = 0, ylim = c(0, 5), xlim = c(0, 3),
       main = paste("cost:", cost, "#SV:", nrow(model.svm$SV)))
  points(m1[m1[, 3] > 0, 1], m1[m1[, 3] > 0, 2], pch = 3, col = "green")
  points(m1[m1[, 3] < 0, 1], m1[m1[, 3] < 0, 2], pch = 4, col = "blue")
  points(model.svm$SV[, 1], model.svm$SV[, 2], pch = 18, col = "red")
}

Thanks a lot for the code, I really appreciate it. I've run it, but I am not sure how I should interpret the scatter plot, although it is obvious that the number of SVs decreases as cost increases. Regards, Pau 2010/7/14 Jack Luo jluo.rh...@gmail.com Hi, I have a question about the parameter C (cost) in the svm function in e1071. I thought larger C was more prone to overfitting than smaller C, and hence leads to more support vectors. However, using the Wisconsin breast cancer example on the link:
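[Where Pau suggests cross-validation, a minimal sketch with e1071's tune.svm(), assuming the X, Y objects from the plotting code above; the cost grid is arbitrary:]

library(e1071)
set.seed(1)
cv <- tune.svm(X, factor(Y), kernel = "linear",
               cost = 10^(-3:3))   # 10-fold cross-validation over the cost grid
summary(cv)                        # CV error for each cost value
cv$best.parameters                 # the cost with the lowest CV error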
Re: [R] question!!!!
Hi, I made a design matrix X (in a linear regression model). Is there any method to normalize X? You can normalize a matrix column-wise like this:

# m is a matrix
apply(m, 2, function(x) x / max(x))

Or normalize row-wise like this:

t(apply(m, 1, function(x) x / max(x)))

I'm sure there are more elegant ways to do it, but those work for me. Michael
Re: [R] Question regarding S4 objects and reading in excel data with RODBC
On 04/08/2010 5:38 AM, Sander wrote: L.S. I am trying to get data from an Excel sheet using the RODBC library, quite like the example on page 34 of the 'Data Manipulation with R' book written by Phil Spector. It seems very straightforward, yet I get the same error every time I try it, with different Excel sheets and after tweaking the code a bit. This is what I do:

library(RODBC)
sheet <- 'X:\\example.xls'
con <- odbcConnectExcel(sheet)
tbls <- sqlTables(con)
tbls$TABLE_NAME
qry <- paste("SELECT * FROM '", tbls@TABLE_NAME[1], "'", sep = "")

The error I get is:

Error in paste("SELECT * FROM '", tbls@TABLE_NAME[1], "'", sep = "") :
  trying to get slot "TABLE_NAME" from an object (class "data.frame") that is not an S4 object

tbls is not an S4 object, so you can't use the @ notation with it. In some older versions of R, @ would extract an attribute (because that's where S4 slots were stored), but current versions don't allow this. Without testing, I would guess you can replace tbls@TABLE_NAME[1] with attr(tbls, "TABLE_NAME")[1], but it's possible the name is wrong. Duncan Murdoch Does anyone know what I'm doing wrong here? I have to admit that I don't know much about SQL. Best regards, Sander van Kuijk
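[Since tbls is a plain data frame, ordinary $ extraction is the simpler fix. A sketch, assuming the con and tbls objects above, and assuming the usual bracket quoting that Excel table names need in ODBC queries:]

qry <- paste("SELECT * FROM [", tbls$TABLE_NAME[1], "]", sep = "")
dat <- sqlQuery(con, qry)   # run the query over the ODBC connection
head(dat)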
Re: [R] question!!!!
?scale -- Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare greg.s...@imail.org 801.408.8111 -Original Message- From: r-help-boun...@r-project.org On Behalf Of leepama Sent: Tuesday, August 03, 2010 10:55 PM To: r-help@r-project.org Subject: [R] question I made a design matrix X (in a linear regression model). Is there any method to normalize X?
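[A quick sketch of what ?scale points to: by default, scale() centers each column to mean 0 and rescales it to unit standard deviation:]

X  <- matrix(rnorm(20), nrow = 5)   # toy design matrix
Xs <- scale(X)                      # center = TRUE, scale = TRUE by default
round(colMeans(Xs), 12)             # essentially 0 in every column
apply(Xs, 2, sd)                    # 1 in every column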
Re: [R] question!!
?replicate ?apply ?sapply Nikhil Kaza Asst. Professor, City and Regional Planning University of North Carolina nikhil.l...@gmail.com On Aug 1, 2010, at 2:42 AM, leepama wrote: hi!! I made many codes during these days.. I study Statistics. Please, one more question!!

library(MASS)                      # for mvrnorm
ex1 <- function(n, p, rho) {
  muvec1 <- rep(0, p)              # zero mean vector (was zeros(1, p))
  A <- diag(p)                     # identity to start (was eye(p))
  for (i in 1:p) {
    for (j in 1:p) {
      A[i, j] <- rho^abs(i - j)    # AR(1)-type correlation
    }
  }
  X <- mvrnorm(n, muvec1, A)       # draw once, after A is fully filled
  return(X)
}

This code generates a design matrix in a linear regression model, but it generates only one data set. Is there any method to generate several data sets (design matrices) in R? Please give me help..
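[Following the ?replicate pointer, a minimal sketch generating several design matrices at once, assuming the ex1() function above with MASS loaded:]

Xlist <- replicate(100, ex1(50, 5, 0.5), simplify = FALSE)
length(Xlist)     # 100 independent data sets
dim(Xlist[[1]])   # each one is 50 x 5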
Re: [R] question about SVM in e1071
Pau, Sorry for getting back to you on this again. I am getting confused about your interpretation of 3). It is obvious from your code that increasing C results in a smaller number of SVs; this seems to contradict your interpretation: Increasing the value of C (...) forces the creation of a more accurate model. A more accurate model is obtained by adding more SVs. In addition, I got to know that the number of SVs increases as C decreases because there are many bounded SVs (whose alpha = C; remember 0 < alpha <= C); those SVs with alpha smaller than C are called free SVs. Here is another question: is the complexity of the boundary determined by the number of total SVs (bounded SVs + free SVs) or free SVs only? Thanks a bunch, -Jack On Thu, Jul 15, 2010 at 4:17 AM, Pau Carrio Gaspar paucar...@gmail.com wrote: [...] 2010/7/14 Jack Luo jluo.rh...@gmail.com Hi, I have a question about the parameter C (cost) in the svm function in e1071. I thought larger C was more prone to overfitting than smaller C, and hence leads to more support vectors. However, using the Wisconsin breast cancer example on the link: http://planatscher.net/svmtut/svmtut.html I found that the largest cost has the fewest support vectors, which is contrary to what I thought. Please see the scripts below. Am I misunderstanding something here? Thanks a bunch, -Jack

model1 <- svm(data[bctrain, ], classes[bctrain], kernel = "linear", cost = 0.01)
model2 <- svm(data[bctrain, ], classes[bctrain], kernel = "linear", cost = 1)
model3 <- svm(data[bctrain, ], classes[bctrain], kernel = "linear", cost = 100)
model1

Call:
svm.default(x = data[bctrain, ], y = classes[bctrain], kernel = "linear", cost = 0.01)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  linear
       cost:  0.01
      gamma:  0.111

Number of Support Vectors:  99

model2

Call:
svm.default(x = data[bctrain, ], y = classes[bctrain], kernel = "linear", cost = 1)

Parameters:
   SVM-Type:
Re: [R] Question regarding panel data diagnostic
Dear Lexi, Thanks a lot for your prompt answers. The issue I'm confronted with is the following: I have panel data with N = 17, T = 5 (annual observations) and wanted to check for stationarity to avoid a spurious regression, but the question is: do I have the right to do so? Is it statistically correct? If not, is there an alternative method to verify that our regression is correct? Thanks again, Ama == Dear Ama, I copy my reply to the list, in case someone needs it. Spurious regression occurs when correlation between time series variables results from their common trends - the variables tend to move together over some cycle. However, it may be difficult to decipher whether or not the variables in your model have significant trends; also, trends differ (see Enders 1995, Time Series Econometrics). So to deal with this, you must perform formal integration tests. If the variables have a unit root (i.e., are non-stationary), then you cannot model the variables in their levels. You must transform them by appropriate differencing. Then you can model using a dynamic model or an error-correction model (ECM), if the variables are cointegrated. Use of an ECM makes sense only if the time span of your data is long enough - it is a long-run concept. Long enough depends on the phenomenon under study. If theory suggests that equilibrium could occur within the time span of your data (17 years in your case - this is long enough in most cases), then the concepts of cointegration and ECM are relevant. Hope this helps. Lexi -Original Message- From: r-help-boun...@r-project.org On Behalf Of Setlhare Lekgatlhamang Sent: Saturday, July 24, 2010 1:01 PM To: amatoallah ouchen; r-help@r-project.org Subject: Re: [R] Question regarding panel data diagnostic Let me correct an omission in my response below. The last sentence should read: But if the data are < 10 quarterly or monthly values, these techniques are not relevant. Cheers, Lexi -Original Message- From: r-help-boun...@r-project.org On Behalf Of Setlhare Lekgatlhamang Sent: Saturday, July 24, 2010 12:54 PM To: amatoallah ouchen; r-help@r-project.org Subject: Re: [R] Question regarding panel data diagnostic My thought is this: It depends on what you have in the panel. Are your data cross-section data observed over ten years for, say, 3 countries (or regions within the same country)? If so, yes, you can test integration properties (what people usually call unit root tests) and then test for cointegration. But if the data are quarterly or monthly, these techniques are not relevant. Hope this helps. Lexi -Original Message- From: r-help-boun...@r-project.org On Behalf Of amatoallah ouchen Sent: Friday, July 23, 2010 12:18 AM To: r-help@r-project.org Subject: [R] Question regarding panel data diagnostic Good day R-listers, I'm currently working on a panel data analysis (N = 17, T = 5); in order to check for the spurious regression problem, I have to test for stationarity, but I've read somewhere that I needn't test for it as my T < 10. What do you think? If yes, is there any other test I have to perform in such a case (a kind of cointegration test for small T)? Any hint would be highly appreciated. Ama.
Re: [R] Question regarding panel data diagnostic
Oops, I misread your email in respect of the number of years you have for your data. Anyways, my comments still hold. Lexi -Original Message- From: r-help-boun...@r-project.org On Behalf Of Setlhare Lekgatlhamang Sent: Monday, July 26, 2010 8:59 AM To: amatoallah ouchen; r-help@r-project.org Subject: Re: [R] Question regarding panel data diagnostic [...]
Re: [R] Question regarding panel data diagnostic
My thought is this: It depends on what you have in the panel. Are your data cross-section data observed over ten years for, say, 3 countries (or regions within the same country)? If so, yes, you can test integration properties (what people usually call unit root tests) and then test for cointegration. But if the data are quarterly or monthly, these techniques are not relevant. Hope this helps. Lexi -Original Message- From: r-help-boun...@r-project.org On Behalf Of amatoallah ouchen Sent: Friday, July 23, 2010 12:18 AM To: r-help@r-project.org Subject: [R] Question regarding panel data diagnostic Good day R-listers, I'm currently working on a panel data analysis (N = 17, T = 5); in order to check for the spurious regression problem, I have to test for stationarity, but I've read somewhere that I needn't test for it as my T < 10. What do you think? If yes, is there any other test I have to perform in such a case (a kind of cointegration test for small T)? Any hint would be highly appreciated. Ama.
Re: [R] Question regarding panel data diagnostic
Let me correct an omission in my response below. The last sentence should read: But if the data are < 10 quarterly or monthly values, these techniques are not relevant. Cheers, Lexi -Original Message- From: r-help-boun...@r-project.org On Behalf Of Setlhare Lekgatlhamang Sent: Saturday, July 24, 2010 12:54 PM To: amatoallah ouchen; r-help@r-project.org Subject: Re: [R] Question regarding panel data diagnostic [...]
Re: [R] Question about a perceived irregularity in R syntax
Nordlund, Dan (DSHS/RDA) wrote: -Original Message- From: r-help-boun...@r-project.org On Behalf Of Peter Dalgaard Sent: Thursday, July 22, 2010 3:13 PM To: Pat Schmitz Cc: r-help@r-project.org Subject: Re: [R] Question about a perceived irregularity in R syntax Pat Schmitz wrote: Both vector queries can select the values from the data.frame as written; however, in the first form, assigning a value to the selected numbers fails. Can you explain the reason this fails?

dat <- data.frame(index = 1:10, Value = c(1:4, NA, 6, NA, 8:10))
dat$Value[dat$Value == "NA"] <- 1   # Why does this fail to work,
dat$Value[dat$Value %in% NA] <- 1   # while this does work?
# Particularly when str() results in an equivalent class
dat <- data.frame(index = 1:10, Value = c(1:4, NA, 6, NA, 8:10))
str(dat$Value[dat$Value %in% NA])
str(dat$Value[dat$Value == "NA"])

1. NA and "NA" are very different things. 2. Check out is.na() and its help page. I also would have suggested is.na to do the replacement. What surprised me was that dat$Value[dat$Value %in% NA] <- 1 actually worked. I guess I always assumed that if

NA == NA
[1] NA

then an attempt to compare NA to elements in a vector would also return NA, but not so:

NA %in% c(1, NA, 3)
[1] TRUE

Learned something new today. I suspect that's not intentional, though I'm not sure it should be fixed. According to the usual convention the result should be a logical NA. Duncan Murdoch
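[For completeness, the replacement both replies point to, using is.na() on the poster's data:]

dat <- data.frame(index = 1:10, Value = c(1:4, NA, 6, NA, 8:10))
dat$Value[is.na(dat$Value)] <- 1   # is.na() is the reliable way to find NAs
dat$Value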
Re: [R] Question about a perceived irregularity in R syntax
On 23/07/2010 7:14 AM, Duncan Murdoch wrote: [...] I suspect that's not intentional, though I'm not sure it should be fixed. According to the usual convention the result should be a logical NA. Oops, not true. The behaviour is clearly documented in ?match: Exactly what matches what is to some extent a matter of definition. For all types, ‘NA’ matches ‘NA’ and no other value. For real and complex values, ‘NaN’ values are regarded as matching any other ‘NaN’ value, but not matching ‘NA’. Thanks to Brian Ripley (the author of that paragraph) for pointing this out to me. Not sure how I missed it on my first reading, but the fact that it preceded my morning coffee might be a contributing factor. Duncan Murdoch
Re: [R] Question about a perceived irregularity in R syntax
Pat Schmitz wrote: Both vector queries can select the values from the data.frame as written; however, in the first form, assigning a value to the selected numbers fails. Can you explain the reason this fails?

dat <- data.frame(index = 1:10, Value = c(1:4, NA, 6, NA, 8:10))
dat$Value[dat$Value == "NA"] <- 1   # Why does this fail to work,
dat$Value[dat$Value %in% NA] <- 1   # while this does work?
# Particularly when str() results in an equivalent class
dat <- data.frame(index = 1:10, Value = c(1:4, NA, 6, NA, 8:10))
str(dat$Value[dat$Value %in% NA])
str(dat$Value[dat$Value == "NA"])

1. NA and "NA" are very different things. 2. Check out is.na() and its help page. -- Peter Dalgaard Center for Statistics, Copenhagen Business School Phone: (+45)38153501 Email: pd@cbs.dk Priv: pda...@gmail.com
Re: [R] Question about a perceived irregularity in R syntax
-Original Message- From: r-help-boun...@r-project.org On Behalf Of Peter Dalgaard Sent: Thursday, July 22, 2010 3:13 PM To: Pat Schmitz Cc: r-help@r-project.org Subject: Re: [R] Question about a perceived irregularity in R syntax [...] 1. NA and "NA" are very different things. 2. Check out is.na() and its help page. I also would have suggested is.na to do the replacement. What surprised me was that dat$Value[dat$Value %in% NA] <- 1 actually worked. I guess I always assumed that if

NA == NA
[1] NA

then an attempt to compare NA to elements in a vector would also return NA, but not so:

NA %in% c(1, NA, 3)
[1] TRUE

Learned something new today, Dan Daniel J. Nordlund Washington State Department of Social and Health Services Planning, Performance, and Accountability Research and Data Analysis Division Olympia, WA 98504-5204
Re: [R] Question about allocMatrix error message
Moohwan - It appears that you are trying to calculate a 100,000 by 100,000 matrix when all you want are the diagonal elements. Here's one approach that might work:

s = t(dev) %*% dev / (nr - 1)
sinv = solve(s)
part = sinv %*% t(dev)
t2 = dev[, 1] * part[1, ] + dev[, 2] * part[2, ]

- Phil Spector Statistical Computing Facility Department of Statistics UC Berkeley spec...@stat.berkeley.edu On Wed, 21 Jul 2010, Moohwan Kim wrote: Dear R family, I faced a technical problem in R coding.

s = t(dev) %*% dev / (nr - 1)   # dev (100,000 by 2) stands for deviations from the mean
sinv = solve(s)
t2 = diag(dev %*% sinv %*% t(dev))

I got an error message at the t2 statement:

Error in diag(dev %*% si %*% t(dev)) : allocMatrix: too many elements specified

Please let me know if there is a way to overcome this problem. best, moohwan
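[The same trick in vectorized form, which works for any number of columns; a sketch with simulated stand-in data, using the identity diag(A %*% B %*% t(A)) = rowSums((A %*% B) * A):]

set.seed(1)
dev <- matrix(rnorm(2e5), ncol = 2)        # stand-in for the 100,000 by 2 matrix
nr  <- nrow(dev)
s   <- t(dev) %*% dev / (nr - 1)
t2  <- rowSums((dev %*% solve(s)) * dev)   # diagonal only; no huge matrix is formed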
Re: [R] Question from day2 beginner
Thanks for the explanation!
Re: [R] question about sign of probit estimates
On Jul 20, 2010, at 2:39 AM, Nita Umashankar wrote: Hello, I am getting some results from my probit estimation in R that are in the opposite direction of what I hypothesized. In SAS, the default is the probability that y = 0 (instead of 1), so one needs to type the word 'descending' to get P(y = 1). Is the same true for R? Is the default set to P(0)? Why not just try it?

y <- c(1, 0, 0)
glm(y ~ 1, binomial(probit))

Call:  glm(formula = y ~ 1, family = binomial(probit))

Coefficients:
(Intercept)
    -0.4307

The intercept is less than zero, so it's probit(P(1)) == qnorm(1/3). (BTW, I don't think what SAS does is a well-defined concept. There are three procedures that will do probit regression - fortunately PROC CATMOD will not... - and they each have more than one way of specifying the response variable.) Thank you in advance. Nita Umashankar -- Peter Dalgaard Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Email: pd@cbs.dk Priv: pda...@gmail.com
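[A one-line check of the arithmetic in Peter's reply:]

qnorm(1/3)   # -0.4307273, matching the fitted intercept: R models P(y = 1)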
Re: [R] Question from day2 beginner
Try: ar(logrvar, aic=TRUE)$resid
The problem you are running into is that 'resid' is a generic function, and the 'ar' function returns an object without a 'resid' method of its own; the default method looks for a 'residuals' component, while 'ar' stores its residuals in the 'resid' component. Where you used the 'summary' command you might instead use 'str'. The 'hints' document mentioned in my signature might help you with such questions. Welcome to the R world.
On 18/07/2010 18:16, maxcheese wrote: Hello, I just started learning R and have a very basic question: when I try an ar model and extract the residuals, it returns NULL. Could someone explain what's going on here? Does that mean the model is null? But it seems the residual has length = number of observations of the original series according to the model summary. I am learning it by myself and have no one to ask here, so please allow me to ask this very basic question... Thank you very much.
summary(ar(logrvar, aic=TRUE))
             Length Class  Mode
order             1 -none- numeric
ar               40 -none- numeric
var.pred          1 -none- numeric
x.mean            1 -none- numeric
aic              44 -none- numeric
n.used            1 -none- numeric
order.max         1 -none- numeric
partialacf       43 -none- numeric
resid         20731 -none- numeric
method            1 -none- character
series            1 -none- character
frequency         1 -none- numeric
call              3 -none- call
asy.var.coef   1600 -none- numeric
resid(ar(logrvar, aic=TRUE))
NULL
-- Patrick Burns pbu...@pburns.seanet.com http://www.burns-stat.com (home of 'Some hints for the R beginner' and 'The R Inferno')
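To see the behaviour without the poster's data ('logrvar' isn't available), any univariate series will do; LakeHuron is used here purely as a stand-in:

fit <- ar(LakeHuron, aic = TRUE)
class(fit)       # "ar" -- resid() falls through to residuals.default()
resid(fit)       # NULL, because the object stores its residuals under $resid
head(fit$resid)  # the actual residuals (leading NAs, one per AR order)
str(fit)         # more informative than summary() for list-like objects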
Re: [R] Question about KLdiv and large datasets
Is the 'eps' argument part of KLdiv (I was not able to find it in the help pages) or part of a general environment (such as the graphics parameters in 'par')? I am asking so that I can read about what it actually does, to resolve the question you already raised about its reliability... Ralf
On Fri, Jul 16, 2010 at 10:41 AM, Peter Ehlers ehl...@ucalgary.ca wrote: On 2010-07-16 7:56, Ralf B wrote: Hi all, when running KL on a small data set, everything is fine:
require(flexmix)
n <- 20
a <- rnorm(n)
b <- rnorm(n)
mydata <- cbind(a,b)
KLdiv(mydata)
however, when this dataset increases
require(flexmix)
n <- 1000
a <- rnorm(n)
b <- rnorm(n)
mydata <- cbind(a,b)
KLdiv(mydata)
KL seems to be not defined. Can somebody explain what is going on? Thanks, Ralf
Ralf, You can adjust the 'eps=' argument. But I don't know what this will do to the reliability of the results. KLdiv(mydata, eps = 1e-7) -Peter Ehlers
Re: [R] Question about KLdiv and large datasets
I just answered this but realize that I did so off-list. So, for completeness, here's what I said: I think I see the problem. From ?KLdiv, you're getting the modeltools help page. What you need is the flexmix help page for KLdiv. Just get to the flexmix index page (you can do ?flexmix and then click on the 'Index' link at the bottom). Then select KLdiv-method. Here you will find the eps argument described. Its default value is eps=10^-4 and the description says: eps: Probabilities below this threshold are replaced by this threshold for numerical stability. It's this comment that makes me doubt the reliability for very small eps values. Cheers, Peter
On 2010-07-18 20:50, Ralf B wrote: [snip...]
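If you want to inspect the method (and its eps default) directly rather than hunt through the help index, something like this should work, assuming the flexmix matrix method described above:

library(flexmix)
showMethods("KLdiv")          # lists the signatures KLdiv is defined for
getMethod("KLdiv", "matrix")  # prints the method body, including the eps = 10^-4 default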
Re: [R] question about string handling....
hey, guys, all these methods work perfectly. thank you!!
Re: [R] question regarding varImpPlot results vs. model$importance data on package RandomForest
Use the source, Luke. varImpPlot calls randomForest:::importance.randomForest (yeah, that is three colons), and reading about the scale= parameter in help(importance, package=randomForest) should enlighten you. For the impatient, try
varImpPlot(mtcars.rf, scale=FALSE)
Hope this helps a little. Allan
On 14/07/10 00:46, Mike Williamson wrote: Hi everyone, I have another Random Forest package question: - my (presumably incorrect) understanding of the varImpPlot is that it should plot the %IncMSE and IncNodePurity exactly as can be found in the importance section of the model results. - However, the plot does not, in fact, match the importance section of the random forest model. E.g., if you use the example given in ?randomForest, you will see the plot showing the highest few %IncMSE values around 17 or 18%. But if you look at the $importance, it is 9.7, 9.4, 7.7, and 7.3. Perhaps more importantly, the plot will show wt as the highest %IncMSE, then disp, then cyl, then hp; whereas the $importance will show wt, then disp, then hp, then cyl. And the ratios look somewhat different, too. Here is the code for that example:
set.seed(4543)
data(mtcars)
mtcars.rf <- randomForest(mpg ~ ., data=mtcars, ntree=1000, keep.forest=FALSE, importance=TRUE)
varImpPlot(mtcars.rf)
I am using version 2.11.1 of R and version 4.5-35 of randomForest. I don't really care or need for the varImpPlot to work just right. But I am not sure which is accurate: the varImpPlot or the $importance section. Which should I trust more, especially when they disagree appreciably? Thanks! Mike
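In other words, with scale = FALSE both the plot and the printed importances agree with the $importance slot; re-running the example cited in the thread:

library(randomForest)
set.seed(4543)
data(mtcars)
mtcars.rf <- randomForest(mpg ~ ., data = mtcars, ntree = 1000,
                          keep.forest = FALSE, importance = TRUE)
importance(mtcars.rf, scale = FALSE)  # matches mtcars.rf$importance (unscaled)
varImpPlot(mtcars.rf, scale = FALSE)  # the plot now agrees with $importance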
Re: [R] question about SVM in e1071
Hi Jack, on 1) and 2): they are telling you the same thing. I recommend you read the first sections of the article; it is very well written and clear. There you will read about duality. On 3): I interpret the scatter plot like this: "Increasing the value of C (...) forces the creation of a more accurate model." A more accurate model is obtained by adding more SVs (until we get a convex hull of the data). Hope it helps. Regards Pau
2010/7/14 Jack Luo jluo.rh...@gmail.com Pau, Thanks a lot for your email, I found it very helpful. Please see below for my reply, thanks. -Jack
On Wed, Jul 14, 2010 at 10:36 AM, Pau Carrio Gaspar paucar...@gmail.com wrote: Hello Jack, 1) why did you think that larger C is more prone to overfitting than smaller C?
*There is a statement in the link http://www.dtreg.com/svm.htm : "To allow some flexibility in separating the categories, SVM models have a cost parameter, C, that controls the trade off between allowing training errors and forcing rigid margins. It creates a soft margin that permits some misclassifications. Increasing the value of C increases the cost of misclassifying points and forces the creation of a more accurate model that may not generalize well." My understanding is that this means larger C may not generalize well (prone to overfitting).*
2) if you look at the formulation of the quadratic programming problem you will see that C rules the error of the cutting plane (and overfitting). Therefore for high C you allow the cutting plane to cut the set worse, so the SVM needs fewer points to build it. A proper explanation is in Kristin P. Bennett and Colin Campbell, Support Vector Machines: Hype or Hallelujah?, SIGKDD Explorations, 2, 2, 2000, 1-13. http://www.idi.ntnu.no/emner/it3704/lectures/papers/Bennett_2000_Support.pdf
*Could you be more specific about this? I don't quite understand.*
3) you might find these plots useful:
library(e1071)
m1 <- matrix( c( 0,0,0,1,1,2, 1, 2,3,2,3, 3, 0, 1,2,3,0, 1, 2, 3, 1,2,3,2,3,3, 0, 0,0,1, 1, 2, 4, 4,4,4, 0, 1, 2, 3, 1,1,1,1,1,1,-1,-1, -1,-1,-1,-1, 1 ,1,1,1, 1, 1,-1,-1 ), ncol = 3 )
Y <- m1[,3]
X <- m1[,1:2]
df <- data.frame( X , Y )
par(mfcol=c(4,2))
for( cost in c( 1e-3 ,1e-2 ,1e-1, 1e0, 1e+1, 1e+2 ,1e+3)) {
  # cost <- 1
  model.svm <- svm( Y ~ . , data = df , type = "C-classification", kernel = "linear", cost = cost, scale = FALSE )
  # print(model.svm$SV)
  plot(x=0, ylim=c(0,5), xlim=c(0,3), main = paste("cost:", cost, "#SV:", nrow(model.svm$SV)))
  points(m1[m1[,3] > 0, 1], m1[m1[,3] > 0, 2], pch=3, col="green")
  points(m1[m1[,3] < 0, 1], m1[m1[,3] < 0, 2], pch=4, col="blue")
  points(model.svm$SV[,1], model.svm$SV[,2], pch=18, col="red")
}
*Thanks a lot for the code, I really appreciate it. I've run it, but I am not sure how I should interpret the scatter plot, although it is obvious that the number of SVs decreases as cost increases.*
Regards Pau
2010/7/14 Jack Luo jluo.rh...@gmail.com Hi, I have a question about the parameter C (cost) in the svm function in e1071. I thought larger C is more prone to overfitting than smaller C, and hence leads to more support vectors. However, using the Wisconsin breast cancer example at the link http://planatscher.net/svmtut/svmtut.html I found that the largest cost has the fewest support vectors, which is contrary to what I thought. Please see the scripts below: am I misunderstanding something here? Thanks a bunch, -Jack
model1 <- svm(databctrain, classesbctrain, kernel = "linear", cost = 0.01)
model2 <- svm(databctrain, classesbctrain, kernel = "linear", cost = 1)
model3 <- svm(databctrain, classesbctrain, kernel = "linear", cost = 100)
model1
Call: svm.default(x = databctrain, y = classesbctrain, kernel = "linear", cost = 0.01)
Parameters: SVM-Type: C-classification SVM-Kernel: linear cost: 0.01 gamma: 0.111
Number of Support Vectors: 99
model2
Call: svm.default(x = databctrain, y = classesbctrain, kernel = "linear", cost = 1)
Parameters: SVM-Type: C-classification SVM-Kernel: linear cost: 1 gamma: 0.111
Number of Support Vectors: 46
model3
Call: svm.default(x = databctrain, y = classesbctrain, kernel = "linear", cost = 100)
Parameters: SVM-Type: C-classification SVM-Kernel: linear cost: 100 gamma: 0.111
Number of Support Vectors: 44
Re: [R] question about string handling....
On Wed, Jul 14, 2010 at 2:21 PM, karena dr.jz...@gmail.com wrote: Hi, I have a data.frame as following: var1 var2 1 ab_c_(ok) 2 okf789(db)_c 3 jojfiod(90).gt 4 ij_(78)__op 5 (iojfodjfo)_ab what I want is to create a new variable called var3. the value of var3 is the content in the parentheses. so var3 would be: var3 ok db 90 78 iojfodjfo
Here are several alternatives. The gsub solution matches everything up to the "(" as well as everything after the ")" and replaces each with nothing. The strsplit solution splits each string into three fields - everything before the "(", everything within the "()", and everything after the ")" - and then picks off the second. The strapply solution matches everything from "(" to ")" and returns everything between them. The code below works whether DF$var2 is factor or character, but if you know it's character you can drop the as.character in #2 and #3.
# 1
gsub(".*[(]|[)].*", "", DF$var2)
# 2
sapply(strsplit(as.character(DF$var2), "[()]"), "[", 2)
# 3
library(gsubfn)
strapply(as.character(DF$var2), "[(](.*)[)]", simplify = TRUE)
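For a self-contained run, here is the data frame rebuilt from the question plus the first two alternatives (the third additionally needs the gsubfn package):

DF <- data.frame(var1 = 1:5,
                 var2 = c("ab_c_(ok)", "okf789(db)_c", "jojfiod(90).gt",
                          "ij_(78)__op", "(iojfodjfo)_ab"))
gsub(".*[(]|[)].*", "", DF$var2)                         # "ok" "db" "90" "78" "iojfodjfo"
sapply(strsplit(as.character(DF$var2), "[()]"), "[", 2)  # same result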
Re: [R] question about string handling....
Try this:
text <- 'var1 var2
1 ab_c_(ok)
2 okf789(db)_c
3 jojfiod(90).gt
4 ij_(78)__op
5 (iojfodjfo)_ab'
df <- read.table(textConnection(text), head = TRUE, sep = " ", quote = "")
df$var3 <- gsub("(.*\\()(.*)(\\).*)", "\\2", df$var2)
- A R learner.
Re: [R] question about string handling....
Another option could be:
df$var3 <- gsub(".*\\((.*)\\).*", "\\1", df$var2)
On Wed, Jul 14, 2010 at 3:21 PM, karena dr.jz...@gmail.com wrote: Hi, I have a data.frame as following: var1 var2 1 ab_c_(ok) 2 okf789(db)_c 3 jojfiod(90).gt 4 ij_(78)__op 5 (iojfodjfo)_ab what I want is to create a new variable called var3. the value of var3 is the content in the parentheses. so var3 would be: var3 ok db 90 78 iojfodjfo how to do this? thanks, karena
-- Henrique Dallazuanna Curitiba-Paraná-Brasil 25° 25' 40" S 49° 16' 22" O
Re: [R] Question about food sampling analysis
Dear Sarah, [snip...] I know that samples within each facility cannot be treated as independent, so I need an approach that accounts for (1) clustering within facilities and You could just use lm() with some planning. The data from within a specific facility can be fit with a model to generate parameters that are compared between facilities. Not too practical though - assuming the 57 production facilities each have their own analytical lab, you'll have 57 different fits to get your parameters from to use in your between-facility test. Questions about dependent data are fairly common, so it should be relatively straightforward to get a solution and/or an idea for a suitable package from the archives. (2) the different number of samples taken at each facility. It's a waste of time to worry about that. You'll be comparing aggregate values between groups, and you'll have too few data points within a group to detect within-facility effects... [snip...] Sincerely, KeithC.
Re: [R] Question about food sampling analysis
Hi Sarah, We regularly undertake work in the food sector and have developed many custom-built solutions. To be more specific, the statistics we employ is that of sensory analysis, and we regularly use the SensoMineR package in R. Regards, Richard Weeks Mangosolutions data analysis that delivers Mail: rwe...@mango-solutions.com T: +44 (0)1249 767700 F: +44 (0)1249 767707 M: +44 (0)7500 040365 Unit 2 Greenways Business Park Bellinger Close Chippenham Wilts SN15 1BN UK
-Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Sarah Henderson Sent: 12 July 2010 22:42 To: R List Subject: [R] Question about food sampling analysis
Greetings to all, and my apologies for a question that is mostly about statistics and secondarily about R. I have just started a new job that (this week, apparently) requires statistical knowledge beyond my training (as an epidemiologist). The problem: - We have 57 food production facilities in three categories - Samples of 4-6 different foods were tested for listeria at each facility - I need to describe the presence of listeria in food (1) overall and (2) by facility category. I know that samples within each facility cannot be treated as independent, so I need an approach that accounts for (1) clustering within facilities and (2) the different number of samples taken at each facility. If someone could kindly point me towards the right type of analysis for this and/or its associated R functions/packages, I would greatly appreciate it. Many thanks, Sarah
Re: [R] Question about food sampling analysis
Greetings to all -- As always, I am grateful for the help of the R community. The generosity of other R users never ceases to impress me. Just to recap the responses I have had to this question, in case they can help anyone else down the line: Robert LaBudde suggested Applied Survey Data Analysis by Heeringa et al. as a go-to reference. Ruben Roa suggested glm() with the binomial family for modeling Listeria presence/absence, glm() with the Gamma family for modeling Listeria concentrations, or the betareg() function of the betareg package for modeling Listeria proportions. Jim Lemon suggested the lme family of functions in the lme4 package, and Modern Applied Statistics with S as a good reference (which, thankfully, I have in my library). Richard Weeks replied to the whole group suggesting sensory analysis with the SensoMineR package. Thanks again for your time, everyone. Sarah On Tue, Jul 13, 2010 at 3:07 AM, Richard Weeks rwe...@mango-solutions.com wrote: [snip...]
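To make the mixed-model suggestion concrete, here is a hypothetical sketch (the column names and the simulated data are made up for illustration, not from Sarah's dataset): a facility-level random intercept handles the clustering, and the unequal number of samples per facility needs no special treatment:

library(lme4)
set.seed(1)
n_per <- sample(4:6, 57, replace = TRUE)                  # 4-6 samples per facility
dat   <- data.frame(facility = factor(rep(1:57, times = n_per)))
cat57 <- factor(sample(c("A", "B", "C"), 57, replace = TRUE))
dat$category <- cat57[as.integer(dat$facility)]           # one category per facility
dat$listeria <- rbinom(nrow(dat), 1, 0.2)                 # 0/1 test result per sample

fit <- glmer(listeria ~ category + (1 | facility), family = binomial, data = dat)
summary(fit)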
Re: [R] question about lpSolve package
On Jul 6, 2010, at 3:32 PM, Xiaoxi Gao wrote: Hello R users, I have two quick questions while using the lpSolve package for linear programming. (1) the result contains both characters and numbers, e.g., "Success: the objective function is 40.5", but I only need the number; can I store only the number? ?lp ?lp.object # or just use str() (2) How do I set boundaries for variables? e.g., all variables are positive. It appears you need to take time to re-read the help page and work through the examples. -- David Winsemius, MD West Hartford, CT
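For example (a sketch, not from the original post): lp() already constrains all decision variables to be non-negative by default, and the bare number lives in the $objval component of the returned lp object:

library(lpSolve)
# maximize 3x + 2y  s.t.  x + y <= 4,  x + 3y <= 6,  x, y >= 0 (the default)
res <- lp(direction = "max", objective.in = c(3, 2),
          const.mat = rbind(c(1, 1), c(1, 3)),
          const.dir = c("<=", "<="), const.rhs = c(4, 6))
res$objval    # 12 -- just the number, no "Success: ..." text
res$solution  # 4 0 -- the optimal x and y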
Re: [R] question concerning VGAM
>>>>> Martin Spindler martin.spind...@gmx.de on Mon, 5 Jul 2010 07:48:42 +0200 writes: Hello everyone, using the VGAM package and the following code
library(VGAM)
bp1 <- vglm(cbind(daten$anzahl_b, daten$deckung_b) ~ ., binom2.rho, data=daten1)
summary(bp1)
coef(bp1, matrix=TRUE)
produced this error message: Error in object$coefficients : $ operator not defined for this S4 class I am a bit confused because some days ago this error message did not show up and everything was fine. Thank you very much in advance for your help. Best, Martin
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help --- PLEASE do read the posting guide http://www.R-project.org/posting-guide.html --- and provide commented, minimal, self-contained, reproducible code.
Hmm, and which part of the two lines above did you not understand? example(vglm) already contains uses of coef() which do work fine; so it must be you, or your setup, which breaks things. Martin Maechler, ETH Zurich
Re: [R] question regarding panel data analysis
Hello On Thu, Jul 1, 2010 at 1:12 AM, amatoallah ouchen at.ouc...@gmail.com wrote: serious issue for me. I'm currently running a panel data analysis i've used the plm package to perform the Tests of poolability as results intercepts and coefficients are assumed different. so my The above is not very clear and you should probably give us more information, such as minimal code and results (see the posting guide). Otherwise please refrain from sending the same message 4 times to the list. Regards Liviu
Re: [R] Question on WLS (gls vs lm)
Isn't that exactly what you would expect when using a _generalized_ least squares fit compared to a normal least squares fit? GLS is not the same as WLS. http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_least_squares_generalized.htm Cheers Joris
On Thu, Jun 24, 2010 at 9:16 AM, Stats Wolf stats.w...@gmail.com wrote: Hi all, I understand that gls() uses generalized least squares, but I thought that maybe the optimum weights from gls might be used as weights in lm (as shown below), but apparently this is not the case. See:
library(nlme)
f1 <- gls(Petal.Width ~ Species / Petal.Length, data = iris, weights = varIdent(form = ~ 1 | Species))
aa <- attributes(summary(f1)$modelStruct$varStruct)$weights
f2 <- lm(Petal.Width ~ Species / Petal.Length, data = iris, weights = aa)
summary(f1)$tTable; summary(f2)
So, the two models with the very same weights do differ (in terms of standard errors). Could you please explain why? Are these different types of weights? Many thanks in advance, Stats Wolf
-- Joris Meys Statistical consultant Ghent University Faculty of Bioscience Engineering Department of Applied mathematics, biometrics and process control tel : +32 9 264 59 87 joris.m...@ugent.be
Re: [R] Question on WLS (gls vs lm)
Thanks for the reply. Yes, they do differ, but doesn't gls() with the weights argument (correlation being unchanged) make the special version of GLS, as this sentence from the page you provided says: "The method leading to this result is called Generalized Least Squares estimation (GLS), of which WLS is just a special case"? Best, Stats Wolf On Thu, Jun 24, 2010 at 12:49 PM, Joris Meys jorism...@gmail.com wrote: [snip...]
Re: [R] Question on WLS (gls vs lm)
Indeed, WLS is a special case of GLS, where the error covariance matrix is a diagonal matrix. OLS is a special case of GLS, where the error is considered homoscedastic and all weights are equal to 1. And I now realize that varIdent() indeed makes a diagonal covariance matrix, so the results should in fact be the same. Sorry for missing that one. A closer inspection shows that the results don't differ too much. The fitting method differs between the two functions: lm.wfit uses the QR decomposition, whereas gls() uses restricted maximum likelihood. In Asymptopia, they should give the same result. Cheers Joris On Thu, Jun 24, 2010 at 12:54 PM, Stats Wolf stats.w...@gmail.com wrote: [snip...]
Re: [R] Question on WLS (gls vs lm)
Indeed, they should give the same results, and hence I was worried to see that the results were not the same. Suffice it to look at the standard errors and p-values: they do differ, and the differences are not really that small. Thanks, Stats Wolf On Thu, Jun 24, 2010 at 2:39 PM, Joris Meys jorism...@gmail.com wrote: [snip...]
Re: [R] Question on WLS (gls vs lm)
The weights in 'aa' are the inverse standard deviations. But you want to use the inverse variances as the weights:
aa <- (attributes(summary(f1)$modelStruct$varStruct)$weights)^2
And then the results are essentially identical. Best,
-- Wolfgang Viechtbauer http://www.wvbauer.com/ Department of Methodology and Statistics Tel: +31 (0)43 388-2277 School for Public Health and Primary Care Maastricht University, P.O. Box 616, 6200 MD Maastricht, The Netherlands
Original Message From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Stats Wolf Sent: Thursday, June 24, 2010 15:00 To: Joris Meys Cc: r-help@r-project.org Subject: Re: [R] Question on WLS (gls vs lm) [snip...]
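Putting Wolfgang's fix back into the original example (squaring the varIdent weights so lm() receives inverse variances):

library(nlme)
f1 <- gls(Petal.Width ~ Species / Petal.Length, data = iris,
          weights = varIdent(form = ~ 1 | Species))
aa <- (attributes(summary(f1)$modelStruct$varStruct)$weights)^2
f2 <- lm(Petal.Width ~ Species / Petal.Length, data = iris, weights = aa)
summary(f1)$tTable; summary(f2)  # standard errors now essentially identical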
Re: [R] Question on WLS (gls vs lm)
Thanks! I was getting confused as well; Wolf really had a point. I have to admit that this is all a bit counterintuitive. I would expect those weights to be the inverse of the variances, as GLS uses the inverse of the variance-covariance matrix. I finally found the needed information in the help file ?nlme::varWeights, after you pointed out the problem and after strolling through quite a part of the help files... Does this mean that the GLS algorithm uses 1/sd as weights (can't imagine that...), or is this one of those things in R that just are like that? Cheers Joris On Thu, Jun 24, 2010 at 3:20 PM, Viechtbauer Wolfgang (STAT) wolfgang.viechtba...@stat.unimaas.nl wrote: [snip...]
Re: [R] Question on WLS (gls vs lm)
On Thu, Jun 24, 2010 at 9:20 AM, Viechtbauer Wolfgang (STAT) wolfgang.viechtba...@stat.unimaas.nl wrote: The weights in 'aa' are the inverse standard deviations. But you want to use the inverse variances as the weights: aa <- (attributes(summary(f1)$modelStruct$varStruct)$weights)^2 And then the results are essentially identical.
We might now ask how we might have found Wolfgang's answer via calculation. Let's redo the gls calculation of variance from scratch by iteratively re-weighted least squares (just one iteration here) and compare that to the gls 'aa' calculated by the original poster:
# estimate beta
fm <- lm(Petal.Width ~ Species / Petal.Length, iris)
# estimate variance
v <- fitted(lm(resid(fm)^2 ~ Species, iris))
v <- v/v[1]
# compare to aa from original poster
lm(log(aa) ~ log(v))
The last line gives: (Intercept) log(v) -4.212e-07 -5.000e-01 which suggests: aa = 1/sqrt(variance)
Re: [R] question about a program
use Rprof to determine where your function is spending time. What is the problem you are trying to solve? Sent from my iPhone.
On Jun 23, 2010, at 5:21, li li hannah@gmail.com wrote: Dear all, I have the following program for a multiple comparison procedure. There are two functions for the two steps. The first step is to calculate the critical values, while the second step is the actual procedure [see below: program with two functions]. This works fine. However, I want to put them into one function for the convenience of later use [see below: program with one function]. Somehow the big function works extremely slowly. For example I chose
m <- 10
rho <- 0.1
k <- 2
alpha <- 0.05
pvaluessort <- sort(1-pnorm(rnorm(10)))
Is there anything that I did wrong? Thank you! Hannah
## Program with two functions
## first step
library(mvtnorm)
cc_f <- function(m, rho, k, alpha) {
  cc_z <- numeric(m)
  var <- matrix(c(1,rho,rho,1), nrow=2, ncol=2, byrow=T)
  for (i in 1:m){
    if (i <= k) {cc_z[i] <- qmvnorm((k*(k-1))/(m*(m-1))*alpha, tail="upper", sigma=var)$quantile}
    else {cc_z[i] <- qmvnorm((k*(k-1))/((m-i+k)*(m-i+k-1))*alpha, tail="upper", sigma=var)$quantile}
  }
  cc <- 1-pnorm(cc_z)
  return(cc)
}
## second step
pair_kFWER=function(m,crit_cons,pvaluessort){
  k=0
  while((k < m) && (crit_cons[m-k] < pvaluessort[m-k])){
    k=k+1
  }
  return(m-k)
}
## Program with one function
pair_kFWER=function(m,alpha,rho,k,pvaluessort){
  library(mvtnorm)
  cc_z <- numeric(m)
  var <- matrix(c(1,rho,rho,1), nrow=2, ncol=2, byrow=T)
  for (i in 1:m){
    if (i <= k) {cc_z[i] <- qmvnorm((k*(k-1))/(m*(m-1))*alpha, tail="upper", sigma=var)$quantile}
    else {cc_z[i] <- qmvnorm((k*(k-1))/((m-i+k)*(m-i+k-1))*alpha, tail="upper", sigma=var)$quantile}
  }
  crit_cons <- 1-pnorm(cc_z)
  k=0
  while((k < m) && (crit_cons[m-k] < pvaluessort[m-k])){
    k=k+1
  }
  return(m-k)
}
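A minimal sketch of the Rprof() suggestion (the file name is arbitrary):

Rprof("pair_kFWER.out")
pair_kFWER(m, alpha, rho, k, pvaluessort)  # the one-function version above
Rprof(NULL)
summaryRprof("pair_kFWER.out")$by.self     # per-function timings; qmvnorm should dominate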
Re: [R] question about a program
Most of the computation time is in the function qmvnorm. You can win a little bit by optimizing the code, but the gain is relatively small. You can also decrease the interval used to evaluate qmvnorm to win some speed there. As you look for the upper tail, there is no need to evaluate the function for negative numbers. E.g.:
pair_kFWER2=function(m,alpha,rho,k,pvaluessort){
  library(mvtnorm)
  cc_z <- numeric(m)
  Var <- matrix(c(1,rho,rho,1), nrow=2, ncol=2, byrow=T)
  lpart <- 1:k       # first part
  hpart <- (k+1):m   # second part
  # make first part of the cc_z
  cc_z[lpart] <- qmvnorm((k*(k-1))/(m*(m-1))*alpha, tail="upper", interval=c(0,4), sigma=Var)$quantile
  # make second part of the cc_z
  cc_z[hpart] <- sapply(hpart, function(i){
    qmvnorm((k*(k-1))/((m-i+k)*(m-i+k-1))*alpha, interval=c(0,4), tail="upper", sigma=Var)$quantile
  })
  crit_cons <- 1-pnorm(cc_z)
  # tests whether values of crit_cons[i] are smaller than values pvaluessort[i]
  # and returns a vector which says at which positions that does not occur.
  # tail takes the last position. I guess that's what you wanted
  out <- tail(which(!(crit_cons < pvaluessort)), 1)
  return(out)
}
system.time(replicate(100, pair_kFWER(m, alpha, rho, k, pvaluessort)))
user system elapsed
5.97 0.04 6.04
system.time(replicate(100, pair_kFWER2(m, alpha, rho, k, pvaluessort)))
user system elapsed
4.43 0.00 4.44
Check whether the function works correctly. It gives the same result as the other one in my tests. Only in the case where your function returns 0, the modified one returns integer(0). Cheers Joris
On Wed, Jun 23, 2010 at 2:21 PM, li li hannah@gmail.com wrote: [snip...]
--
Joris Meys
Statistical consultant
Ghent University, Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control
tel: +32 9 264 59 87, joris.m...@ugent.be
Disclaimer: http://helpdesk.ugent.be/e-maildisclaimer.php
Re: [R] question about boosting(Adaboosting. M1)
Changbin, The weights don't have to sum to one. These are the weights of the trees in the bag, used to combine them into the final fit, and if I'm not mistaken they are expressed as the logit of the error for the respective trees. If you use a method, be sure you understand it; if you don't understand those weights, I reckon there is a whole lot more about the method you don't understand. A good start is running and analyzing the example in the help file. Those weights don't sum to one either, but the authors don't seem to mind. There are references given in the help files, and you should understand what they say before you apply the algorithm. You're not going to handle a chainsaw without reading the manual carefully either.

Cheers
Joris

On Sat, Jun 19, 2010 at 11:35 PM, Changbin Du changb...@gmail.com wrote:

HI, Guys, I am trying to use the AdaBoost.M1 algorithm to integrate three models. I found that the sum of the weights for the models is not equal to one. How do I deal with this? Thanks, any response or suggestions are appreciated!

-- Sincerely, Changbin

--
Joris Meys
Statistical consultant
Ghent University, Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control
tel: +32 9 264 59 87, joris.m...@ugent.be
Disclaimer: http://helpdesk.ugent.be/e-maildisclaimer.php
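To see concretely why the weights need not sum to one, here is a small sketch (not tied to any particular package) computing AdaBoost.M1 voting weights from made-up per-model error rates; each weight is log((1 - err)/err), the logit of the error mentioned above:

err <- c(0.10, 0.25, 0.40)    # hypothetical training errors of the three models
w <- log((1 - err) / err)     # AdaBoost.M1 voting weight per model
w                             # 2.197 1.099 0.405
sum(w)                        # 3.701 -- nowhere near 1; only relative size matters
w / sum(w)                    # rescale yourself if you want proportions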
Re: [R] question in R
See below. On Fri, Jun 18, 2010 at 7:11 PM, li li hannah@gmail.com wrote:

Dear all, I am trying to calculate certain critical values from the bivariate normal distribution (please see the code below).

m <- 10
rho <- 0.1
k <- 2
alpha <- 0.05
## calculate critical constants
cc_z <- numeric(m)
var <- matrix(c(1, rho, rho, 1), nrow = 2, ncol = 2, byrow = TRUE)
for (i in 1:m) {
  if (i <= k) {
    cc_z[i] <- qmvnorm((k*(k-1))/(m*(m-1))*alpha, tail = "upper",
                       sigma = var)$quantile
  } else {
    cc_z[i] <- qmvnorm((k*(k-1))/((m-i+k)*(m-i+k-1))*alpha, tail = "upper",
                       sigma = var)$quantile
  }
}

After the critical constants cc_z are calculated, I wanted to check whether they are correct.

## check whether cc_z is correct
pmvnorm(lower = c(cc_z[1], cc_z[1]), upper = Inf, sigma = var) - (k*(k-1))/(n*(n-1))

Shouldn't this be

pmvnorm(lower = c(cc_z[1], cc_z[1]), upper = Inf, sigma = var) - (k*(k-1))/(m*(m-1))*alpha
[1] -5.87e-09
attr(,"error")
[1] 1e-15
attr(,"msg")
[1] "Normal Completion"

This still gives a bit of an error, but you have to take into account that the underlying algorithms use randomized quasi-Monte Carlo methods, and that floating point issues can play a role here as well. So it looks to me like your calculations are correct.

Cheers
Joris

--
Joris Meys
Statistical consultant
Ghent University, Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control
tel: +32 9 264 59 87, joris.m...@ugent.be
Disclaimer: http://helpdesk.ugent.be/e-maildisclaimer.php
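The same check can be run for all m constants at once. A sketch, reusing m, k, alpha, var and cc_z from the code above; each difference should vanish up to Monte Carlo error:

library(mvtnorm)
# target upper-tail probability for each critical constant
target <- sapply(1:m, function(i)
  if (i <= k) (k*(k-1))/(m*(m-1))*alpha else (k*(k-1))/((m-i+k)*(m-i+k-1))*alpha)
# probability actually achieved by each computed constant
achieved <- sapply(cc_z, function(z)
  pmvnorm(lower = c(z, z), upper = Inf, sigma = var))
round(achieved - target, 8)   # all entries should be ~0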
Re: [R] Question regarding print
cat(out, '\n')

On Thu, Jun 17, 2010 at 10:19 AM, Adolf STIPS adolf.st...@jrc.ec.europa.eu wrote:

Hi, Does anybody know how to get output from print without the leading [1]? (Or must I use cat/write?)

out <- "r15"
print(out, quote = FALSE)
[1] r15

And I definitely do not want the leading [1], as I want to construct a table from this. Ciao, Adolf

Adolf Stips (new email: adolf.st...@jrc.ec.europa.eu), Global Environment Monitoring unit, CEC Joint Research Centre, TP 272, I-21027 Ispra, Italy. Tel: +39-0332-789876, Fax: +39-0332-789034. "I know that I do not know, but even that not for sure!" The views expressed are purely those of the writer and may not in any circumstances be regarded as stating an official position of the European Commission.

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
Re: [R] Question regarding print
The cat function is probably the best approach, but if you really feel the need to use print, then you can just assign blank names (it will then be a named vector, and thus slower in heavy calculations, but the printing is different). Try something like:

names(x) <- rep('', length(x))
print(x)   # or just: x

This of course leads to the question of how to prevent the blank line(s). The best answer there is: use the cat function (or some other specialized function to print your objects, which will probably use cat itself).

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Adolf STIPS
Sent: Thursday, June 17, 2010 8:20 AM
To: r-help@r-project.org
Subject: [R] Question regarding print

[original question quoted in full above]
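For building plain-text tables, a few index-free alternatives in one sketch (the object names and values are made up):

out <- "r15"
cat(out, "\n")        # no leading [1]; the newline is explicit
writeLines(out)       # convenient for printing whole lines
x <- c(a1 = 1.5, a2 = 2.25)
cat(sprintf("%-4s %6.2f\n", names(x), x), sep = "")   # formatted table rows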
Re: [R] Question
Two possibilities: rescale your random vector, or resample to get numbers within the range. But neither of these solutions will give you a true exponential distribution. I am not aware of truncated exponential distributions being available in R, but somebody else might know more about that.

# possibility I : rescaling
rsample <- rexp(5)
lim <- 0.8
rsample <- rsample * lim / max(rsample)
rsample

# possibility II : resampling
rsample <- rexp(5)
while (sum(rsample > lim) > 0) {
  rsample <- ifelse(rsample > lim, rexp(length(rsample)), rsample)
}
rsample

Cheers
Joris

On Wed, Jun 16, 2010 at 12:00 PM, Assieh Rashidi assiehrash...@yahoo.com wrote:

Dear Mr., I have a question about writing a program for Gibbs sampling. If I want to generate data from an Exponential distribution, but the range of X is restricted, how can I do it? Regards, A.Rashidi

--
Joris Meys
Statistical consultant
Ghent University, Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control
tel: +32 9 264 59 87, joris.m...@ugent.be
Disclaimer: http://helpdesk.ugent.be/e-maildisclaimer.php
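A compact side-by-side sketch of the two possibilities (same lim as above; the indexed redraw replaces only the out-of-range values, equivalent in spirit to the ifelse version). Note the difference: rescaling pins the sample maximum to exactly lim, while resampling is rejection sampling and keeps the exponential shape below lim:

set.seed(42)
lim <- 0.8
r1 <- rexp(5)
r1 <- r1 * lim / max(r1)     # possibility I: max(r1) is always exactly lim
r2 <- rexp(5)
while (any(r2 > lim)) r2[r2 > lim] <- rexp(sum(r2 > lim))   # possibility II
r1
r2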
Re: [R] Question about user define function
On 15/06/10 21:39, GL wrote:

Have the following function that is called by the statement below. Trying to return the two dataframes, but instead get one large list including both tables.

ReadInputDataFrames <- function() {
  dbs.this <- read.delim("this.txt", header = TRUE, sep = "\t", quote = "\"", dec = ".")
  dbs.that <- read.delim("that.txt", header = TRUE, sep = "\t", quote = "\"", dec = ".")
  c(this = dbs.this, patdb = dbs.that)
}

If you really want to return two dataframes, then

return(list(this = dbs.this, that = dbs.that))

More likely, you want to return all the data in one dataframe. If they have the same structure (columns and column names), then you want

return(rbind(dbs.this, dbs.that))

If you want something else, provide an example. Hope this helps a little. Allan
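A usage sketch for the list-returning version (the file names are the placeholders from the question):

frames <- ReadInputDataFrames()   # now a list holding two data frames
str(frames)                       # $this and $that, each kept intact
head(frames$this)                 # access each data frame by name
head(frames$that)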
Re: [R] Question
Since there is a simple closed form for the truncated exponential CDF, you can use inverse transform sampling. I believe this is quite common in survival analysis methods. The first step is to write an R function that computes the inverse CDF of the truncated exponential, say

itexp <- function(u, m, t) { -log(1 - u*(1 - exp(-t*m)))/m }

where u is the quantile, m is the rate, and t is the level of truncation. Next, we draw from the truncated exponential with something like

rtexp <- function(n, m, t) { itexp(runif(n), m, t) }

Check it out with

texp <- rtexp(1000, 1, pi)
hist(texp)
summary(texp)

Matt Shotwell
Graduate Student
Div. Biostatistics and Epidemiology
Medical University of South Carolina

On Wed, 2010-06-16 at 11:11 -0400, Joris Meys wrote: [reply and original question quoted in full above]
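A quick sanity check of the sampler (a sketch): the mean of an Exp(m) variate truncated to [0, t] has the closed form 1/m - t*exp(-m*t)/(1 - exp(-m*t)), which the sample mean should approach.

set.seed(1)
m <- 1; t <- pi
texp <- rtexp(1e5, m, t)
mean(texp)                            # empirical mean
1/m - t*exp(-m*t)/(1 - exp(-m*t))     # theoretical truncated mean, ~0.858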
Re: [R] Question about user define function
On Jun 15, 2010, at 4:39 PM, GL wrote:

Have the following function that is called by the statement below. Trying to return the two dataframes, but instead get one large list including both tables.

ReadInputDataFrames <- function() {
  dbs.this <- read.delim("this.txt", header = TRUE, sep = "\t", quote = "\"", dec = ".")
  dbs.that <- read.delim("that.txt", header = TRUE, sep = "\t", quote = "\"", dec = ".")
  ## Possible that you want to cbind() them rather than c() them
  cbind(this = dbs.this, patdb = dbs.that)
}

Which would require that they have the same number of rows (and, if you rbind() instead, that the column names match up). ?cbind

Called by:

c <- ReadInputDataFrames()

Results: str(c) yields a list of 106 items: $this.variable1..53, $that.variable1..53.

David Winsemius, MD
West Hartford, CT
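The behavior GL observed comes from c() on data frames: data frames are lists of columns, so c() concatenates all the columns into one plain list. A tiny sketch with built-in data:

a <- head(iris[, 1:2]); b <- head(iris[, 3:4])
length(c(this = a, patdb = b))      # 4: one flat list of columns ($this.Sepal.Length, ...)
class(cbind(this = a, patdb = b))   # "data.frame": the columns kept side by side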
Re: [R] question about mean
apply(iris[, -5], 2, tapply, iris$Species, mean)

On Wed, Jun 9, 2010 at 3:43 PM, SH.Chou cls3...@gmail.com wrote:

Hi there: I have a question about computing mean values from a data.frame. Take the iris data for example. If I have a data.frame looking like the following:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
...

There are three different species in this table. I want to make a table with the mean value for each species, as follows:

                Sepal.Length Sepal.Width Petal.Length Petal.Width
mean.setosa            5.006       3.428        1.462       0.246
mean.versicolor        5.936       2.770        4.260       1.326
mean.virginica         6.588       2.974        5.552       2.026

Is there any shorter syntax that can do it? I mean shorter than the code I wrote, as follows:

attach(iris)
mean.setosa     <- mean(iris[Species == "setosa", 1:4])
mean.versicolor <- mean(iris[Species == "versicolor", 1:4])
mean.virginica  <- mean(iris[Species == "virginica", 1:4])
data.mean <- rbind(mean.setosa, mean.versicolor, mean.virginica)
detach(iris)

Thanks a million!!!

--
Shih-Hsiung, Chou
System Administrator / PhD Student
Department of Industrial Manufacturing and Systems Engineering
Kansas State University
Re: [R] question about mean
Here is an alternative:

with(iris, rowsum(iris[, -5], Species) / table(Species))

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Peter Langfelder
Sent: Thursday, 10 June 2010 12:27 PM
To: SH.Chou
Cc: r-help@r-project.org
Subject: Re: [R] question about mean

[Peter Langfelder's reply and the original question quoted in full above]
Re: [R] question about mean
One possibility is

aggregate(iris[, -5], list(iris[, 5]), mean)
     Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

- Phil Spector
Statistical Computing Facility
Department of Statistics
UC Berkeley
spec...@stat.berkeley.edu

On Wed, 9 Jun 2010, SH.Chou wrote: [original question quoted in full above]
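Two more equivalent one-liners, for comparison (sketches on the same iris data):

aggregate(. ~ Species, data = iris, FUN = mean)     # formula interface to aggregate
sapply(split(iris[, -5], iris$Species), colMeans)   # same numbers, transposed layout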
Re: [R] Question about avoid the for loop
Hi Carrie, Here are two options:

# Option 1
d <- data.frame(x, t)
y <- with(d, ifelse(t == 0, rbinom(2, 1, 0.2), rbinom(3, 1, 0.8)))
y

# Option 2 -- more general case, e.g. when you do not know
# how many 0's and 1's you have within each stratum
spd <- with(d, split(d, x))
do.call(c, lapply(spd, function(comp)
  with(comp, ifelse(t == 0,
                    rbinom(sum(t == 0), 1, 0.2),
                    rbinom(sum(t != 0), 1, 0.8)))))

HTH, Jorge

On Thu, Jun 3, 2010 at 11:49 AM, Carrie Li wrote:

Dear R-helpers, I would like to generate a binary random variable within a stratum's stratum. Here is a simple example.

## x is the first-level stratum index; here I have 3 strata.
x <- c(rep(1,5), rep(2,5), rep(3,5))
## within x, there is a second stratum indexed by t=0 and t=1
t <- rep(c(0,0,1,1,1), 3)
## within stratum i, for t=0 and t=1 respectively, I generate the
## binomial random variable and save it in y
y <- rep(NA, length(x))
for (i in 1:3) {
  y[(x == i) & (t == 0)] <- rbinom(2, 1, 0.2)
  y[(x == i) & (t == 1)] <- rbinom(3, 1, 0.8)
}

My question: is there any way to avoid the for loop, since the first level has thousands of strata? (Within each x stratum, t only has 2 levels, 0 and 1.) Thanks for any help! Carrie
Re: [R] Question about avoid the for loop
Thanks, Jorge! Just one more question that I don't get even after checking the help. For option 1, why does ifelse, using just with(d, ...), work on the strata indexed by x automatically? Since in with() we didn't specify that the strata are indexed by x, what if you have another categorical variable in the data?

Thanks again!

On Thu, Jun 3, 2010 at 12:21 PM, Jorge Ivan Velez jorgeivanve...@gmail.com wrote: [options 1 and 2 quoted in full above]
Re: [R] Question about avoid the for loop
Hi Jorge, I found a problem. To check whether the answer is random, I changed the code as follows:

d <- data.frame(x, t)
y <- with(d, ifelse(t == 0, rbinom(2, 1, 0.5), rnorm(3)))
cbind(x, t, y)
      x t         y
 [1,] 1 0 0.0000000
 [2,] 1 0 0.0000000
 [3,] 1 1 0.8920037
 [4,] 1 1 1.6695435
 [5,] 1 1 0.8289429
 [6,] 2 0 0.0000000
 [7,] 2 0 0.0000000
 [8,] 2 1 0.8289429
 [9,] 2 1 0.8920037
[10,] 2 1 1.6695435
[11,] 3 0 0.0000000
[12,] 3 0 0.0000000
[13,] 3 1 1.6695435
[14,] 3 1 0.8289429
[15,] 3 1 0.8920037

So, notice that the normal random variable takes only 3 values, and likewise for the binary part: it is 0 0 1 1 1 for every stratum. Is there a way to make them completely random and different over the strata? Thank you again! Carrie

On Thu, Jun 3, 2010 at 7:24 PM, Carrie Li carrieands...@gmail.com wrote: [previous messages quoted in full above]
Re: [R] Question about avoid the for loop
Hi Carrie, Option 1 works just fine in this case because you have the same number of 0's and 1's within each stratum. If that were not the case, option 1 would not work; that's why I provided a second option.

Best, Jorge

On Thu, Jun 3, 2010 at 7:24 PM, Carrie Li wrote: [message quoted in full above]
Re: [R] Question about avoid the for loop
Yes, in my case here each stratum has the same number of 0's and 1's. But I want y to be randomly generated within each stratum, so y should differ across the strata (at least for the rnorm part we should see much clearer randomness). I hope what I am asking here is clear to you. Using option 2 as you provided previously, the output is the same cbind(x, t, y) table as above: the rnorm part takes only the 3 values 1.6695435, 0.8289429 and 0.8920037 in every stratum.

On Thu, Jun 3, 2010 at 8:16 PM, Jorge Ivan Velez jorgeivanve...@gmail.com wrote: [previous messages quoted in full above]
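What Carrie is seeing is ifelse() recycling: ifelse(test, yes, no) evaluates yes and no once each, recycles them to length(test), and picks elementwise, which is why the same three rnorm() values repeat in every stratum. A fully vectorized sketch that avoids this, using the x and t from the example and the probabilities from the original question:

p <- ifelse(t == 0, 0.2, 0.8)   # per-observation success probability
y <- rbinom(length(t), 1, p)    # rbinom() is vectorized over prob:
                                # one independent draw per observation
cbind(x, t, y)                  # draws now differ across strata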