[R] R: Optimization

2006-09-05 Thread Simone Vincenzi
Thanks for the help.
I thought about taking the logarithms of the expression, but to cope with
the problem of zero values in both the yield and the environmental
variables another transformation is needed (I simply add one to the yield
value and to the values of the environmental variables). I followed the
suggestion kindly provided by Spencer Graves, but I ran into the
following problems, which probably stem from my own understanding:
1) I don't understand why it is necessary to test the no-constant model
(which provides a better fit). Can I simply assume a no-constant model?
2) Here comes the big problem. I don't understand what is done with the
models fit.0 and fit.1. In both cases I have to leave one variable out of
the linear model, but I don't understand why this tests the constraint
that all the coefficients should sum to 1. And it is not possible to
apply anova(fit0, fit.0) or anova(fit1, fit.1) because the models have
different response variables.

In conclusion, I don't actually know how to proceed, or whether this
kind of analysis is possible in R.
Any help and any further explanation would be much appreciated. Below I
report part of the data frame I'm using for the analysis. The
environmental variables (Salinity, Hydrodynamism, Sediment, Oxygen,
Chlorophyll and Bathymetry) have been transformed using suitability
functions (and are thus bounded between 0 and 1), while Yield is in
kg/m2.

   Sedi  Sal Bathy Chl Hydro  Oxy Yield
1  1.00 1.00  0.51   1  0.17 0.75   1.5
2  0.50 0.95  1.00   1  0.09 0.94   0.4
3  0.50 1.00  0.17   1  0.44 0.90   1.8
4  1.00 0.98  1.00   1  0.10 0.89   4.5
5  0.13 0.84  0.73   1  0.16 0.84   0.4
6  0.50 0.90  0.91   1  0.22 0.84   0.4
7  0.13 0.75  1.00   1  0.14 0.86   0.2
8  0.13 0.84  0.75   1  0.10 0.83   0.3
9  0.13 0.78  0.97   1  0.06 0.84   0.5
10 0.13 0.87  0.70   1  0.45 0.85   1.0
11 1.00 0.77  1.00   1  0.19 0.86   1.5
12 1.00 0.94  0.81   1  0.47 0.86   3.0
13 1.00 0.93  1.00   1  0.45 0.89   2.5
14 0.50 1.00  1.00   1  0.54 0.84   4.0
15 0.50 1.00  1.00   1  0.25 0.88   2.2
16 1.00 1.00  0.56   1  0.25 0.90   5.0
17 1.00 0.90  0.56   1  0.40 0.90   1.5
18 0.50 0.97  1.00   1  0.22 0.95   1.0
19 0.54 0.96  1.00   1  0.18 0.91   0.3
20 1.00 0.97  0.33   1  0.39 0.90   3.0


And here are the results of the models fitted so far.

summary(fit0)

Call:
lm(formula = log(Yield + 1) ~ log(Sal + 1) + log(Bathy + 1) + log(Chl + 
1) + log(Hydro + 1) + log(Oxy + 1) + log(Sedi + 1) - 1, data = data.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.05485 -0.23759  0.01331  0.18692  1.23803 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
log(Sal + 1)     5.07193    1.00584   5.042 1.98e-06 ***
log(Bathy + 1)   0.05804    0.20684   0.281 0.779561    
log(Chl + 1)    -8.53941    2.11720  -4.033 0.000106 ***
log(Hydro + 1)   1.73835    0.28815   6.033 2.56e-08 ***
log(Oxy + 1)     4.19951    1.98459   2.116 0.036750 *  
log(Sedi + 1)    1.15953    0.14807   7.831 4.51e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.3735 on 103 degrees of freedom
Multiple R-Squared: 0.9148, Adjusted R-squared: 0.9098 
F-statistic: 184.2 on 6 and 103 DF,  p-value: < 2.2e-16 

 summary(fit1)

Call:
lm(formula = log(Yield + 1) ~ log(Sal + 1) + log(Bathy + 1) + log(Chl + 
1) + log(Hydro + 1) + log(Oxy + 1) + log(Sedi + 1), data = (data.df))

Residuals:
      Min        1Q    Median        3Q       Max 
-1.057766 -0.227720  0.006146  0.192200  1.222543 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -4.49400    4.76603  -0.943   0.3479    
log(Sal + 1)     5.13499    1.00861   5.091 1.63e-06 ***
log(Bathy + 1)   0.06685    0.20716   0.323   0.7476    
log(Chl + 1)    -2.48909    6.75718  -0.368   0.7134    
log(Hydro + 1)   1.70674    0.29025   5.880 5.22e-08 ***
log(Oxy + 1)     4.63758    2.03928   2.274   0.0251 *  
log(Sedi + 1)    1.14005    0.14959   7.621 1.34e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.3737 on 102 degrees of freedom
Multiple R-Squared: 0.7589, Adjusted R-squared: 0.7447 
F-statistic: 53.51 on 6 and 102 DF,  p-value: < 2.2e-16 

 summary(fit.0)

Call:
lm(formula = log((Yield + 1)/(Sedi + 1)) ~ log((Sal + 1)/(Sedi + 
1)) + log((Bathy + 1)/(Sedi + 1)) + log((Chl + 1)/(Sedi + 
1)) + log((Hydro + 1)/(Sedi + 1)) + log((Oxy + 1)/(Sedi + 
1)) - 1, data = subset(data.df))

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5762 -0.2591  0.1065  0.5023  1.4533 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
log((Sal + 1)/(Sedi + 1))      3.0170     1.5304   1.971  0.05134 .  
log((Bathy + 1)/(Sedi + 1))   -0.3982     0.3139  -1.268  0.20748    
log((Chl + 1)/(Sedi + 1))      8.8137     2.3941   3.681  0.00037 ***
log((Hydro + 1)/(Sedi + 1))    0.3309     0.4066   0.814  0.41760    
log((Oxy + 1)/(Sedi + 1))    -12.5242     2.1881  -5.724 1.01e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


Re: [R] R: Optimization

2006-09-05 Thread Spencer Graves
comments in line 

Simone Vincenzi wrote:
 Thanks for the help.
 I thought about taking the logarithms of the expression, but to cope
 with the problem of zero values in both the yield and the environmental
 variables another transformation is needed (I simply add one to the
 yield value and to the values of the environmental variables). 
Do you have 0 yield in any cases where the values of all the 
environmental variables were non-zero?   I didn't see any 0's in the 
data you have below.  I would not worry about 0's unless they actually 
occur.  If you have a larger data set with some 0's then I suggest you 
consider the cases where the 0's actually occur. 

  * If every 0 yield occurs when at least one of the environmental 
variables is 0, you can safely delete all cases with 0 yield, because 
they provide zero information to estimate any of the parameters in your 
model.  I would rerun the model without the 1's, as they make no sense 
to me. 

  * Otherwise, you need to evaluate the noise in the model.  In 
other words, the equation you wrote will not fit exactly.  The standard 
regression model (linear or nonlinear) assumes that y = f(X, b) + e, 
where X here is a vector of A, B, ..., b is a vector of the parameters 
to be estimated, b0, b1, ..., in the model I wrote, and the errors e 
are normally distributed with constant variance.  If you have cases 
where the yield is zero but none of the X's are, you have problems with 
this assumption no matter what.  You could use nls with the model as 
you specified it.  However, to get starting values for nls I suggest 
you run lm on the logarithms, adding 1 if you like, as you suggested. 
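
A minimal sketch of that last step (my illustration, not part of the
original reply; only two of the six variables are shown, and the
multiplicative model Yield + 1 = exp(b0)*(Sal+1)^b1*(Oxy+1)^b2 with
additive error on Yield is one possible reading):

## lm on the logs gives rough starting values for nls
fit.log <- lm(log(Yield + 1) ~ log(Sal + 1) + log(Oxy + 1),
              data = data.df)
st <- coef(fit.log)
## nls then fits Yield = exp(b0)*(Sal+1)^b1*(Oxy+1)^b2 - 1 directly
fit.nls <- nls(Yield ~ exp(b0) * (Sal + 1)^b1 * (Oxy + 1)^b2 - 1,
               data = data.df,
               start = list(b0 = st[[1]], b1 = st[[2]], b2 = st[[3]]))
summary(fit.nls)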
 I followed the suggestion
 kindly provided by Spencer Graves, but I ran into the following
 problems, which probably stem from my own understanding:
 1) I don't understand why it is necessary to test the no-constant model
 (which provides a better fit). Can I simply assume a no-constant
 model? 
   
I was assuming yield would be a number between 0 and 1, similar to A, B, 
... .  If your inputs A, B, ... are all between 0 and 1 but yield is not, 
I suggest you carefully examine the source of that equation, because it 
makes no physical sense to me.  Your regression results below 
report that the intercept is not significantly different from 0.  
However, I would suspect that might be just an accident.  If you change 
units from kg/m2 to psi or something else, you should get a 
statistically significant intercept. 

  * If yield is between 0 and 1, then the yield equation you wrote 
makes physical sense.  And then it makes sense to test whether b0 = 0, 
because that is a test for whether there are other environmental 
variables that impact yield but are not in the model. 

  * Similarly, the comparisons of fit0 with fit.0 and of fit1 with 
fit.1 are designed to test for other environmental variable(s) not in 
the model that affect yield but are somewhat correlated with the 
variables you already have.  One way to make those comparisons with 
anova() is sketched below. 
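
(A sketch of my own, not code from the thread: rewrite the unconstrained
model with the same response as fit.0 by adding log(Sedi + 1) back as a
regressor. Its coefficient equals (b1 + ... + b6) - 1, so dropping it
and comparing with anova() tests the sum-to-one constraint, and both
fits now share one response, which is what anova() requires.)

fit.u <- lm(log((Yield + 1)/(Sedi + 1)) ~ log((Sal + 1)/(Sedi + 1)) +
              log((Bathy + 1)/(Sedi + 1)) + log((Chl + 1)/(Sedi + 1)) +
              log((Hydro + 1)/(Sedi + 1)) + log((Oxy + 1)/(Sedi + 1)) +
              log(Sedi + 1) - 1, data = data.df)
fit.c <- update(fit.u, . ~ . - log(Sedi + 1))  # constrained fit (= fit.0)
anova(fit.c, fit.u)  # valid: both fits have the same response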

  For more, I suggest you consult a statistician.  There must be 
several at Uni Parma. 

  Hope this helps. 
  Spencer Graves

[rest of quoted message and data snipped]

[R] R Optimization 101 for R Newcomers: An extended example

2004-05-19 Thread Berton Gunter
To new R-Programmers:

This long-winded note gives an example of optimizing in R that should
only be of interest to newcomers to the language. Others should ignore.
My hope is that it might help illuminate some basic notions of code
improvement, looping, and vectorization in R. However, I welcome
comments from anyone that enlarge, correct, or improve it. I remain a
novice R programmer.

For the curious, the hardware/opsys details are a 1.5 GHz Windows 2000
machine running R 1.9.0.

The task is this: I have a data set, mydat, consisting of values >= 0,
and a fixed constant, K, that is much larger than any of the values. I
want to simulate the distribution of the number of bootstrap draws
(random sampling with replacement) from mydat that is needed to make
the sum of the values drawn just exceed K. That is, repeatedly bootstrap
sample from mydat until the sum of the results exceeds K. Suppose this
requires 7 draws. Then 7 is a random sample of size 1 from the
distribution I'm interested in. I want to repeat this procedure a
large number, nsim, of times to estimate the distribution.

Before reading on, I invite R novices to consider how -- or better yet
write code -- to do this.

Approach 1 (Naive/ straightforward): Perform the simulation exactly as
described by using a nested loop. The outer loop goes from 1 to nsim to
record the number of draws needed. The inner loop repeatedly draws a
sample of size 1 from mydat and accumulates and checks the sum until it
exceeds K, returning the number of times it took to do this to the outer
loop.

This is easy to program up using a while loop for the inner loop and
takes about a short look's time out my window (a few seconds) to run for
modest nsim (5-10K). But surely one can do better...
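
A minimal reconstruction of Approach 1 (mine, not the original poster's
code; mydat and K below are placeholders):

set.seed(1)
mydat <- rexp(100)     # placeholder data, values >= 0
K     <- 20            # constant much larger than any single value
nsim  <- 5000
draws <- integer(nsim)
for (i in 1:nsim) {
  total <- 0
  n <- 0
  while (total <= K) { # inner loop: one sample() call per draw (slow)
    total <- total + sample(mydat, 1, replace = TRUE)
    n <- n + 1
  }
  draws[i] <- n
}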

Approach 2: This was essentially the same, except that all the sampling is done
at once. That is, each execution of the inner loop in (1) required R
to call the sample() function to get a bootstrap sample of size 1.
This is very inefficient, because there is a high overhead of making
that function call at each iteration. It can be avoided by calling
sample() only once -- or at most a small number of times -- to get a
large sample and then using indexing to work one's way through the large
sample. This turns out to be a little more complicated than approach 1
because you first have to figure out how large the large sample should be
and then have to be a little careful about setting up your indexing. But
it's not very difficult (especially for any C programmer who can
manipulate pointers), and random sampling is so fast that it's no
problem to generate a sample WAYYY bigger than you need (5 or 10 times
as large, even) just to be safe. Or generate a not so big version and
just check at each outer loop iteration whether you need to generate
some more.
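
A sketch of Approach 2 (again my reconstruction, reusing mydat, K, nsim
and draws from the sketch above); the sample size is deliberately
profligate:

big <- sample(mydat, size = 10 * nsim * ceiling(K / mean(mydat)),
              replace = TRUE)
ptr <- 1                       # index into the big sample
for (i in 1:nsim) {
  total <- 0
  n <- 0
  while (total <= K) {
    total <- total + big[ptr]  # indexing, not a function call
    ptr <- ptr + 1
    n <- n + 1
  }
  draws[i] <- n
}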

Doing it this way -- now using indexing, not function calls in the inner
loop -- made a considerable improvement (5 - 10 fold). It now took about
one quick glance out the window time to generate the distribution (less
than a second). This was more than adequate for my needs under any
conceivable situation. But still, self-respecting R programmers are
supposed to eschew looping, and here was this ugly(?) loop within a
loop!  So let us persevere.

Approach 3: So the question is -- how to remove that inner loop? Again,
I invite novice R programmers to think about that before continuing
on...

The key here is that the inner loop is doing a very simple calculation:
just accumulating a sum.  Forsooth! -- the answer fairly leaps out: R
already has a cumsum() function that does this (looping in the
underlying C code), so one only has to figure out how to make use of it.

The essential idea is to take a chunk from the large bootstrap sample
of mydat that is guaranteed to be bigger than what one needs to total
up to more than K. Again, one can be profligate: if I need about 10
mydat values on average to sum up to K, then sampling 30 or 40 or some
such thing is sure to be enough (and one can always add a check or use
try() to make sure, if faint of heart). If smallsamp is this sample of
the next 30 or 40 values from my large bootstrap sample, then the
following line vectorizes the inner loop:

count <- sum(cumsum(smallsamp) < K) + 1

Note a trick here: the < K comparison produces a vector of logical
TRUEs for all cumsum values less than K. These are silently treated as
numeric 1's when added up by sum(). A moment's thought will make clear
why 1 must be added at the end.
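
Putting that into the simulation (a sketch reusing big, ptr, K, nsim and
draws from above; the chunk size of 40 is an arbitrary "surely enough"
guess):

ptr <- 1
for (i in 1:nsim) {
  smallsamp <- big[ptr:(ptr + 39)]         # next 40 values
  count <- sum(cumsum(smallsamp) < K) + 1  # vectorized inner loop
  draws[i] <- count
  ptr <- ptr + count                       # advance past values used
}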

Sure enough, this vectorization reduced the time about another twofold
-- to an eyeblink -- for my modest nsim sizes.

Approach 4: The last stage, of course, is to get rid of the outer loop,
which fills the vector for the distribution one element at a time, again
another R no-no. One way to do this -- the ONLY way I could think of (so
a challenge to smarter R programmers) -- is to make use of apply(),
which basically always can remove loops. Unfortunately, this is an
illusion, because what the apply() functions actually do is hide the
looping rather than truly remove it.
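
A sketch of where this was probably heading (my guess, not the original
text): build all the chunks at once as a matrix and let apply() traverse
the columns, which hides the outer loop rather than removing it.

m <- matrix(sample(mydat, 40 * nsim, replace = TRUE), nrow = 40)
draws <- colSums(apply(m, 2, cumsum) < K) + 1  # one count per column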