[R] R: Optimization
Thanks for the help. I thought about taking the logarithms of the expression, but to cope with the problem of zero values, both in yield and in the environmental variables, another transformation is needed (I simply add one to the yield value and to the environmental variable values). I followed the suggestion kindly provided by Spencer Graves, but I ran into the following problems, which probably stem from my limited understanding:

1) I don't understand why the no-constant model needs to be tested (it provides a better fit). Can I simply assume a no-constant model?

2) Here comes the big problem. I don't understand what is done with the models fit.0 and fit.1. In both cases I have to leave one variable out of the linear model, but I don't understand why this tests the constraint that all the coefficients should sum to 1. Moreover, it is not possible to apply anova(fit0, fit.0) and anova(fit1, fit.1), because the models have different response variables.

In conclusion, I don't actually know how to proceed, or whether this kind of analysis is possible in R. Any help and any further explanation would be much appreciated.

Below I report part of the data frame I'm using for the analysis. The environmental variables (Salinity, Hydrodynamism, Sediment, Oxygen, Chlorophyll and Bathymetry) have been transformed using suitability functions (and are thus bounded between 0 and 1), while Yield is in kg/m2.

   Sedi  Sal Bathy Chl Hydro  Oxy Yield
1  1.00 1.00  0.51   1  0.17 0.75   1.5
2  0.50 0.95  1.00   1  0.09 0.94   0.4
3  0.50 1.00  0.17   1  0.44 0.90   1.8
4  1.00 0.98  1.00   1  0.10 0.89   4.5
5  0.13 0.84  0.73   1  0.16 0.84   0.4
6  0.50 0.90  0.91   1  0.22 0.84   0.4
7  0.13 0.75  1.00   1  0.14 0.86   0.2
8  0.13 0.84  0.75   1  0.10 0.83   0.3
9  0.13 0.78  0.97   1  0.06 0.84   0.5
10 0.13 0.87  0.70   1  0.45 0.85   1.0
11 1.00 0.77  1.00   1  0.19 0.86   1.5
12 1.00 0.94  0.81   1  0.47 0.86   3.0
13 1.00 0.93  1.00   1  0.45 0.89   2.5
14 0.50 1.00  1.00   1  0.54 0.84   4.0
15 0.50 1.00  1.00   1  0.25 0.88   2.2
16 1.00 1.00  0.56   1  0.25 0.90   5.0
17 1.00 0.90  0.56   1  0.40 0.90   1.5
18 0.50 0.97  1.00   1  0.22 0.95   1.0
19 0.54 0.96  1.00   1  0.18 0.91   0.3
20 1.00 0.97  0.33   1  0.39 0.90   3.0

And here are the results of the models fitted so far.

summary(fit0)

Call:
lm(formula = log(Yield + 1) ~ log(Sal + 1) + log(Bathy + 1) +
    log(Chl + 1) + log(Hydro + 1) + log(Oxy + 1) + log(Sedi + 1) - 1,
    data = data.df)

Residuals:
     Min       1Q   Median       3Q      Max
-1.05485 -0.23759  0.01331  0.18692  1.23803

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
log(Sal + 1)    5.07193    1.00584   5.042 1.98e-06 ***
log(Bathy + 1)  0.05804    0.20684   0.281 0.779561
log(Chl + 1)   -8.53941    2.11720  -4.033 0.000106 ***
log(Hydro + 1)  1.73835    0.28815   6.033 2.56e-08 ***
log(Oxy + 1)    4.19951    1.98459   2.116 0.036750 *
log(Sedi + 1)   1.15953    0.14807   7.831 4.51e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3735 on 103 degrees of freedom
Multiple R-Squared: 0.9148,     Adjusted R-squared: 0.9098
F-statistic: 184.2 on 6 and 103 DF,  p-value: < 2.2e-16

summary(fit1)

Call:
lm(formula = log(Yield + 1) ~ log(Sal + 1) + log(Bathy + 1) +
    log(Chl + 1) + log(Hydro + 1) + log(Oxy + 1) + log(Sedi + 1),
    data = data.df)

Residuals:
      Min        1Q    Median        3Q       Max
-1.057766 -0.227720  0.006146  0.192200  1.222543

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    -4.49400    4.76603  -0.943   0.3479
log(Sal + 1)    5.13499    1.00861   5.091 1.63e-06 ***
log(Bathy + 1)  0.06685    0.20716   0.323   0.7476
log(Chl + 1)   -2.48909    6.75718  -0.368   0.7134
log(Hydro + 1)  1.70674    0.29025   5.880 5.22e-08 ***
log(Oxy + 1)    4.63758    2.03928   2.274   0.0251 *
log(Sedi + 1)   1.14005    0.14959   7.621 1.34e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3737 on 102 degrees of freedom
Multiple R-Squared: 0.7589,     Adjusted R-squared: 0.7447
F-statistic: 53.51 on 6 and 102 DF,  p-value: < 2.2e-16

summary(fit.0)

Call:
lm(formula = log((Yield + 1)/(Sedi + 1)) ~ log((Sal + 1)/(Sedi + 1)) +
    log((Bathy + 1)/(Sedi + 1)) + log((Chl + 1)/(Sedi + 1)) +
    log((Hydro + 1)/(Sedi + 1)) + log((Oxy + 1)/(Sedi + 1)) - 1,
    data = subset(data.df))

Residuals:
    Min      1Q  Median      3Q     Max
-1.5762 -0.2591  0.1065  0.5023  1.4533

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
log((Sal + 1)/(Sedi + 1))     3.0170     1.5304   1.971  0.05134 .
log((Bathy + 1)/(Sedi + 1))  -0.3982     0.3139  -1.268  0.20748
log((Chl + 1)/(Sedi + 1))     8.8137     2.3941   3.681  0.00037 ***
log((Hydro + 1)/(Sedi + 1))   0.3309     0.4066   0.814  0.41760
log((Oxy + 1)/(Sedi + 1))   -12.5242     2.1881  -5.724 1.01e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
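[Editorial note: the algebra connecting fit0 and fit.0 -- my reading of the earlier suggestion, so treat it as a sketch rather than the poster's own derivation -- is the following. The unconstrained log-scale model is

log(Yield + 1) = b1*log(Sal + 1) + b2*log(Bathy + 1) + b3*log(Chl + 1)
                 + b4*log(Hydro + 1) + b5*log(Oxy + 1) + b6*log(Sedi + 1) + e

Imposing the constraint b1 + b2 + b3 + b4 + b5 + b6 = 1 is the same as setting b6 = 1 - (b1 + b2 + b3 + b4 + b5). Substituting this and moving log(Sedi + 1) to the left-hand side gives

log((Yield + 1)/(Sedi + 1)) = b1*log((Sal + 1)/(Sedi + 1)) + b2*log((Bathy + 1)/(Sedi + 1))
                              + b3*log((Chl + 1)/(Sedi + 1)) + b4*log((Hydro + 1)/(Sedi + 1))
                              + b5*log((Oxy + 1)/(Sedi + 1)) + e

which is exactly the fit.0 call above: leaving one variable out and dividing everything through by it is what imposes the sum-to-one constraint.]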
Re: [R] R: Optimization
Comments in line.

Simone Vincenzi wrote:

> Thanks for the help. I thought about taking the logarithms of the
> expression, but to cope with the problem of zero values, both in yield
> and in the environmental variables, another transformation is needed
> (I simply add one to the yield value and to the environmental variable
> values).

Do you have 0 yield in any cases where the values of all the environmental variables were non-zero? I didn't see any 0's in the data you have below. I would not worry about 0's unless they actually occur. If you have a larger data set with some 0's, then I suggest you consider the cases where the 0's actually occur.

* If every 0 yield occurs when at least one of the environmental variables is 0, you can safely delete all cases with 0 yield, because they provide zero information for estimating any of the parameters in your model. I would rerun the model without the added 1's, as they make no sense to me.

* Otherwise, you need to evaluate the noise in the model. In other words, the equation you wrote will not fit exactly. The standard regression model (linear or nonlinear) assumes that y = f(X, b) + e, where X here is a vector of A, B, ..., b is a vector of the parameters to be estimated, b0, b1, ..., in the model I wrote, and the errors e are normally distributed with constant variance. If you have cases where the yield is zero but none of the X's are, you have problems with this assumption no matter what.

You could use nls with the model as you specified it. However, to get starting values for nls, I suggest you run lm on the logarithms, adding 1 if you like, as you suggested.

> I followed the suggestion kindly provided by Spencer Graves, but I
> found the following problems, which probably stem from my
> understanding:
> 1) I don't understand why the no-constant model needs to be tested
> (it provides a better fit). Can I simply assume a no-constant model?

I was assuming yield would be a number between 0 and 1, similar to A, B, ... . If your inputs A, B, ... are all between 0 and 1 but yield is not, I suggest you carefully examine the source for that equation, because it makes no physical sense to me. Your regression results below report that the intercept is not significantly different from 0. However, I would suspect that might be just an accident: if you change units from kg/m2 to psi or something else, you should get a statistically significant intercept.

* If yield is between 0 and 1, then the yield equation you wrote makes physical sense. And then it makes sense to test whether b0 = 0, because that is a test for whether there are other environmental variables that impact yield that are not in the model.

* Similarly, the comparisons of fit0 with fit.0 and of fit1 with fit.1 are designed to test for other environmental variable(s) not in the model that affect yield but are somewhat correlated with the variables you already have.

For more, I suggest you consult a statistician. There must be several at Uni Parma.

Hope this helps.
Spencer Graves
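[Editorial note: one way to carry out the fit0/fit.0 comparison even though anova() refuses (the responses differ) is to compute the F test by hand from the residual sums of squares. fit.0 is fit0 with a single linear constraint imposed, and its response differs from fit0's only by subtracting the known quantity log(Sedi + 1), so the residual sums of squares remain directly comparable. A minimal sketch, assuming fit0 and fit.0 have been fit as in the previous message; the object names come from that message, but the code itself is illustration, not part of the original reply:

rss.u  <- deviance(fit0)      # unconstrained residual sum of squares
rss.c  <- deviance(fit.0)     # constrained (sum-to-one) residual sum of squares
df.u   <- df.residual(fit0)
F.stat <- (rss.c - rss.u) / (rss.u / df.u)   # one constraint => 1 numerator df
p.val  <- pf(F.stat, 1, df.u, lower.tail = FALSE)

A small p.val indicates the sum-to-one constraint is inconsistent with the data; the same recipe applies to fit1 versus fit.1, with the intercept retained in both.]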
[R] R Optimization 101 for R Newcomers: An extended example
To new R programmers: This long-winded note gives an example of optimizing in R that should only be of interest to newcomers to the language. Others should ignore it. My hope is that it might help illuminate some basic notions of code improvement, looping, and vectorization in R. However, I welcome comments from anyone that enlarge, correct, or improve it. I remain a novice R programmer. For the curious, the hardware/opsys details are a 1.5 GHz Windows 2000 machine running R 1.9.0.

The task is this: I have a data set, mydat, consisting of values >= 0, and a fixed constant, K, that is much larger than any of the values. I want to simulate the distribution of the number of bootstrap draws (random sampling with replacement) from mydat needed to make the sum of the values drawn just exceed K. That is, repeatedly bootstrap sample from mydat until the sum of the results exceeds K. Suppose this requires 7 draws; then 7 is a random sample of size 1 from the distribution I'm interested in. I want to repeat this procedure a large number, nsim, of times to estimate the distribution. Before reading on, I invite R novices to consider how -- or better yet, write code -- to do this.

Approach 1 (naive/straightforward): Perform the simulation exactly as described, using a nested loop. The outer loop goes from 1 to nsim to record the number of draws needed. The inner loop repeatedly draws a sample of size 1 from mydat and accumulates and checks the sum until it exceeds K, returning the number of draws this took to the outer loop. This is easy to program using a while loop for the inner loop, and it takes about a short look's time out my window (a few seconds) to run for modest nsim (5-10K). But surely one can do better...

Approach 2: Essentially the same, except all the sampling is done at once. Each execution of the inner loop in approach 1 required R to call the sample() function to get a bootstrap sample of size 1. This is very inefficient, because of the high overhead of making that function call at each iteration. It can be avoided by calling sample() only once -- or at most a small number of times -- to get one large sample, and then using indexing to work through it. This turns out to be a little more complicated than approach 1, because you first have to figure out how large the large sample should be, and then you have to be a little careful about setting up your indexing. But it's not very difficult (especially for any C programmer who can manipulate pointers), and random sampling is so fast that it's no problem to generate a sample WAYYY bigger than you need (5 or 10 times as large, even) just to be safe. Or generate a not-so-big version and check at each outer-loop iteration whether you need to generate more. Doing it this way -- with indexing, not function calls, in the inner loop -- gave a considerable improvement (5- to 10-fold). It now took about one quick glance out the window (less than a second) to generate the distribution. This was more than adequate for my needs under any conceivable situation. But still, self-respecting R programmers are supposed to eschew looping, and here was this ugly(?) loop within a loop! So let us persevere; sketches of both approaches follow.
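[Editorial note: a minimal sketch of both looping approaches. The function and variable names (nboot.naive, nboot.indexed, big, j) are mine, not the original poster's, and the sizing of the pre-generated sample simply follows the "way bigger than you need" advice above:

# Approach 1: nested loops, one sample() call per draw.
nboot.naive <- function(mydat, K, nsim) {
  counts <- numeric(nsim)
  for (i in 1:nsim) {
    total <- 0
    ndraws <- 0
    while (total <= K) {                 # accumulate until the sum exceeds K
      total <- total + sample(mydat, 1, replace = TRUE)
      ndraws <- ndraws + 1
    }
    counts[i] <- ndraws
  }
  counts
}

# Approach 2: one big sample() call up front, then index through it.
nboot.indexed <- function(mydat, K, nsim) {
  counts <- numeric(nsim)
  # about K/mean(mydat) draws are needed per replicate; be profligate (10x).
  # (No length check here; the post suggests adding one to be safe.)
  big <- sample(mydat, 10 * nsim * ceiling(K / mean(mydat)), replace = TRUE)
  j <- 1                                 # pointer into the big sample
  for (i in 1:nsim) {
    total <- 0
    ndraws <- 0
    while (total <= K) {                 # indexing, not a function call
      total <- total + big[j]
      j <- j + 1
      ndraws <- ndraws + 1
    }
    counts[i] <- ndraws
  }
  counts
}
]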
Approach 3: So the question is -- how to remove that inner loop? Again, I invite novice R programmers to think about that before continuing on...

The key here is that the inner loop is doing a very simple calculation: just accumulating a sum. Forsooth! -- the answer fairly leaps out: R already has a cumsum() function that does this (looping in the underlying C code), so one only has to figure out how to make use of it. The essential idea is to take, from the large bootstrap sample, a small sample that is guaranteed to be bigger than one needs to total more than K. Again, one can be profligate: if I need about 10 mydat values on average to sum to K, and I sample 30 or 40 or some such number, I'm sure to have enough (and can always add a check, or use try(), to make sure, if I am faint of heart). If smallsamp is this sample of the next 30 or 40 values from my large bootstrap sample, then the following code vectorizes the inner loop:

count <- sum(cumsum(smallsamp) <= K) + 1

Note a trick here: the <= K produces a vector of logical TRUEs for all cumsum values <= K. These are silently treated as numeric 1's when added by sum(). A moment's thought will make clear why 1 must be added at the end. Sure enough, this vectorization reduced the time about another twofold -- to an eyeblink -- for my modest nsim sizes.

Approach 4: The last stage, of course, is to get rid of the outer loop, which fills the vector for the distribution one element at a time, again another R no-no. One way to do this -- the ONLY way I could think of (so a challenge to smarter R programmers) -- is to make use of apply(), which basically always can remove loops. Unfortunately, this is an illusion, because what the apply() functions actually do is hide the looping rather than remove it.
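[Editorial note: for completeness, a sketch that puts approach 3 together with the pre-generated sample of approach 2. The names (nboot.cumsum, chunk) and the default chunk size are my own choices, and the stopifnot() guard is the "check for the faint of heart" mentioned above:

nboot.cumsum <- function(mydat, K, nsim, chunk = 40) {
  counts <- numeric(nsim)
  # one big bootstrap sample, generously sized (see approach 2)
  big <- sample(mydat, (nsim + 1) * chunk, replace = TRUE)
  j <- 1                                  # pointer into the big sample
  for (i in 1:nsim) {
    smallsamp <- big[j:(j + chunk - 1)]   # next chunk of draws
    stopifnot(sum(smallsamp) > K)         # chunk was big enough?
    # vectorized inner loop: draws needed for the running sum to exceed K
    counts[i] <- sum(cumsum(smallsamp) <= K) + 1
    j <- j + counts[i]                    # advance past the values consumed
  }
  counts
}

# usage: table(nboot.cumsum(mydat, K, 5000)) estimates the distribution
]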