Re: Curve fitting

2008-10-19 Thread Chris MacRaild
On Sat, Oct 18, 2008 at 8:20 AM, Edward d'Auvergne
<[EMAIL PROTECTED]> wrote:
> Hi,
>
> Before you sent this message, I was talking to Ben Frank (a PhD
> student in Griesinger's lab) about this exact problem - baseplane RMSD
> noise to volume error.  The formula of Nicholson et al., 1992 you
> mentioned makes perfect sense as that's what we came up with too.
> Volume integration over a given area is the sum of the heights of all
> the discrete points in the frequency domain spectrum within that box.
> So the error of a single point is the same as that of the peak height.
>  We just have n*m points within this box.  And since variances add, not
> standard deviations, the variance (sigma^2) of the volume is:
>
> sigma_vol^2 = sigma_i^2 * n * m,
>
> where sigma_vol is the standard deviation of the volume, sigma_i is
> the standard deviation of a single point assumed to be equal to the
> RMSD of the baseplane noise, and n and m are the dimensions of the
> box.  Taking the square root of this gives the Nicholson et al.
> formula:
>
> sigma_vol = sigma_i * sqrt(n*m).
>
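
In code, this propagation is a one-liner; a minimal sketch (the function
name and signature are illustrative only, not relax code):

    import math

    def volume_error(baseplane_rmsd, n, m):
        # Assumes independent noise at each of the n*m points in the
        # integration box, each with sigma equal to the baseplane RMSD.
        return baseplane_rmsd * math.sqrt(n * m)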

This is the strategy I have used to try to get precision estimates
from peak volumes. As I said earlier, in my hands it does not perform
well. Uncertainties from this method will systematically overestimate
the precision of strong peaks and underestimate the precision of weak
ones as compared to estimates from duplicate spectra (or perhaps it's
the other way around, I don't remember). This may not be evident for
proteins like ubiquitin, where virtually all amides give uniformly
strong peaks in the HSQC, but for proteins with more varied relaxation
behaviour this can be a major issue. It's important to keep in mind
just how much signal processing goes on between a raw fid (in which
the noise in adjacent points is independent and uncorrelated) and the
spectrum that we integrate (in which, apparently, noise in adjacent
points is not always independent and uncorrelated).
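
This noise correlation is easy to demonstrate numerically. A synthetic 1D
illustration (not from any of the programs discussed): zero-filling alone
is enough to correlate adjacent frequency-domain points, so the variance
of a box sum stops being the number of points times the single-point
variance, which is exactly the assumption behind sigma_i * sqrt(n*m).

    import numpy as np

    rng = np.random.default_rng(0)
    n_fid, zf, m, trials = 256, 4, 8, 5000

    sums_plain, pts_plain, sums_zf, pts_zf = [], [], [], []
    for _ in range(trials):
        # Raw fid: independent complex Gaussian noise.
        fid = rng.normal(size=n_fid) + 1j * rng.normal(size=n_fid)
        spec = np.fft.fft(fid).real                 # no zero-filling
        spec_zf = np.fft.fft(fid, zf * n_fid).real  # 4x zero-filling
        sums_plain.append(spec[:m].sum())
        pts_plain.append(spec[0])
        sums_zf.append(spec_zf[:m].sum())
        pts_zf.append(spec_zf[0])

    # Ratio var(box sum) / (m * var(single point)): ~1.0 without
    # zero-filling, clearly away from 1.0 once points are correlated.
    print(np.var(sums_plain) / (m * np.var(pts_plain)))
    print(np.var(sums_zf) / (m * np.var(pts_zf)))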

Even apart from this issue, I have always found peak heights to give
better results for fitting relaxation data. Heights would be expected
to be less sensitive to all sorts of experimental complications, like
imperfect baselines, peak overlap, and phase errors. In my hands this
robustness always seems to outweigh the greater precision afforded by
peak volumes.

> However I doubt if many people in the field use this volume
> integration method.  I know that this is only one method of 3 in
> Sparky (http://www.cgl.ucsf.edu/home/sparky/manual/peaks.html#Integration).
>  The program peakint
> (http://hugin.ethz.ch/wuthrich/software/xeasy/xeasy_m15.html) doesn't
> use all points in the box.  And Cara does things differently again
> (http://www.cara.ethz.ch/Wiki/Integration).  So even if the user has
> the baseplane RMSD measured, I have no idea if they can work out how
> many points in the spectra were used in the integration.  And if a
> Gaussian or Lorentzian fit is used, then this formula definitely
> cannot be used.  Maybe support for the different methods of the
> different programs can be added to relax one by one, or maybe relax
> could ask the user which type of fit was used and how many points
> were integrated, but this is going to be difficult to do.
>
> Regards,
>
> Edward
>
>
> On Fri, Oct 17, 2008 at 6:15 PM, Sébastien Morin
> <[EMAIL PROTECTED]> wrote:
>> Hi all,
>>
>> Here is a reference some people in our lab have been using to extract
>> errors on NOE calculated from volumes.
>>
>> Nicholson, Kay, Baldisseri, Arango, Young, Bax, and Torchia (1992)
>> Biochemistry, 31: 5253-5263.
>>
>> And here is the interesting part...
>>
>> ==
>> To estimate the error in NOE values, the standard deviation in baseline
>> noise (sigma) was determined for each spectrum and expanded into a standard
>> deviation in volume given by
>>
>>    Delta_V = sigma * sqrt(n*k)
>>
>> where n is the number of points in the f1 dimension and k is the number
>> of points in the f2 dimension of a typical peak at the baseline. The
>> error associated with the NOE value for peak j is then given by
>>
>> error_j = (V_A_j / V_B_j) * sqrt[(Delta_V_A / V_A_j)^2 + (Delta_V_B /
>> V_B_j)^2]
>>
>> where V_A and V_B denote the volume of peak j in the presence and
>> absence of NOE enhancement, respectively, and Delta_V_A and Delta_V_B
>> denote the standard deviations in volume for the spectra recorded in the
>> presence and absence of NOE enhancement, respectively.
>> ==
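
For concreteness, the two quoted formulas translate directly into Python;
a minimal sketch with illustrative names:

    import math

    def noe_error(v_a, v_b, dv_a, dv_b):
        # NOE ratio V_A/V_B and its error by standard propagation
        # (Nicholson et al., 1992); dv_a and dv_b are the Delta_V
        # values from sigma * sqrt(n*k) above.
        noe = v_a / v_b
        return abs(noe) * math.sqrt((dv_a / v_a)**2 + (dv_b / v_b)**2)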
>>
>> I'm not an expert in statistics, so I won't judge the value of this
>> approach. However, this is an approach some people have been using...
>>
>> Ciao !
>>
>>
>> Séb
>>
>>
>>
>>
>> Edward d'Auvergne wrote:
>>> Oh, I forgot about the std error formula.  Is that where the sqrt(2)
>>> comes from?  Doh, that would be silly.  Then I know someone who would
>>> require sqrt(3) for the NOE spectra!  Is that really what Palmer
>>> meant, that std error is the same as "the standard deviation of the
>>> differences between t

Re: Relax_fit.py problem

2008-10-19 Thread Chris MacRaild
On Fri, Oct 17, 2008 at 7:51 PM, Edward d'Auvergne <[EMAIL PROTECTED]> wrote:
> On Fri, Oct 17, 2008 at 12:53 AM, Chris MacRaild <[EMAIL PROTECTED]> wrote:
>> On Thu, Oct 16, 2008 at 6:57 PM, Edward d'Auvergne
>> <[EMAIL PROTECTED]> wrote:
>>> On Thu, Oct 16, 2008 at 1:09 AM, Chris MacRaild <[EMAIL PROTECTED]> wrote:
>
> Well, the Jackknife technique
> (http://en.wikipedia.org/wiki/Resampling_(statistics)#Jackknife) does
> something like this.  It uses the errors present inside the collected
> data to estimate the parameter errors.  It's not great, but it is
> useful when errors cannot be measured.  You can also use the
> covariance matrix from the optimisation space to estimate errors.
> Both are rough and approximate, and in convoluted spaces (the
> diffusion tensor space and the double motion model-free models of
> Clore et al., 1990) they are known to have problems.  Monte Carlo
> simulations perform much better in complex spaces.
>
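
As a concrete illustration of the jackknife idea applied to an
exponential relaxation fit, a leave-one-out sketch (scipy-based,
hypothetical helper names, not relax's implementation):

    import numpy as np
    from scipy.optimize import curve_fit

    def expdec(t, i0, r):
        return i0 * np.exp(-r * t)

    def jackknife_rate_error(heights, times):
        # Refit with each (height, time) point left out in turn, then
        # apply the jackknife variance formula to the fitted rates.
        h = np.asarray(heights, float)
        t = np.asarray(times, float)
        n = len(h)
        rates = []
        for i in range(n):
            # p0 is an illustrative starting guess for (I0, R).
            popt, _ = curve_fit(expdec, np.delete(t, i), np.delete(h, i),
                                p0=(h.max(), 1.0))
            rates.append(popt[1])
        rates = np.asarray(rates)
        # Jackknife standard error: sqrt((n-1)/n * sum((r_i - mean)^2)).
        return np.sqrt((n - 1) / n * np.sum((rates - rates.mean())**2))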

 I have used (and extensively tested) Bootstrap resampling for this
 problem. In my hands it works very well provided the data quality is
 high (which of course it must be if the resulting values are to be of
 any use in model-free analysis). In other words, it gives errors
 indistinguishable from those derived by Monte Carlo based on duplicate
 spectra. Bootstrapping, like the Jackknife, does not depend on an
 estimate of peak height uncertainty. Its success presumably reflects
 the smooth and simple optimisation space involved in an exponential
 fit to good data - I fully expect it to fail if applied to the complex
 spaces of model-free optimisation.
>>>
>>> If someone would like bootstrapping for a certain technique, this
>>> could be added to relax without too much problem by duplicating the
>>> Monte Carlo code and making slight modifications.  Implementing the
>>> Jackknife or the covariance matrix for error propagation would be
>>> more complex, and of questionable value.  Anyway, if it's not absolutely
>>> necessary I will concentrate my efforts on getting Gary Thompson's
>>> multi processor code functional (to run relax on clusters, grids, or
>>> multi-cpu systems - see
>>> https://mail.gna.org/public/relax-devel/2006-04/msg00023.html).  And
>>> the BMRB and CCPN integration (CCPN at
>>> https://mail.gna.org/public/relax-devel/2007-11/msg00037.html
>>> continued at https://mail.gna.org/public/relax-devel/2007-12/msg0.html,
>>> and BMRB at https://mail.gna.org/public/relax-devel/2008-07/msg00057.html).
>>>
>>> One question I have about the bootstrapping you used, Chris: how did
>>> you get the variances of the Gaussian distributions used to generate
>>> the bootstrapped samples?  The bootstrapping method I know for error
>>> analysis is very similar to Monte Carlo simulations.  For Monte Carlo
>>> simulations you have (sketched in code after the list):
>>>
>>> 1)  Fit the original data set to get the fitted parameter set (this
>>> uses the original error set).
>>> 2)  Generate the back calculated data set from the fitted parameter set.
>>> 3)  Randomise n times, assuming a Gaussian distribution, the back
>>> calculated data set using original error set.
>>> 4)  Fit the n Monte Carlo data sets as in 1).
>>> 5)  The values of 1) and standard deviation of 4) give the final
>>> parameter values.
>>>
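
A minimal sketch of these five steps for the two-parameter exponential
fit, assuming Gaussian peak-height errors (illustrative scipy code, not
relax's actual implementation):

    import numpy as np
    from scipy.optimize import curve_fit

    def expdec(t, i0, r):
        return i0 * np.exp(-r * t)

    def monte_carlo_rate_error(heights, times, sigma, n_sim=500):
        h = np.asarray(heights, float)
        t = np.asarray(times, float)
        # 1) Fit the original data set.
        popt, _ = curve_fit(expdec, t, h, p0=(h.max(), 1.0))
        # 2) Back calculate the data from the fitted parameters.
        back = expdec(t, *popt)
        rng = np.random.default_rng()
        rates = []
        for _ in range(n_sim):
            # 3) Randomise the back calculated data with the original errors.
            synth = rng.normal(back, sigma)
            # 4) Fit each Monte Carlo data set as in 1).
            p, _ = curve_fit(expdec, t, synth, p0=popt)
            rates.append(p[1])
        # 5) Value from 1), error from the standard deviation of 4).
        return popt[1], np.std(rates)
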
>>> The bootstrapping technique for error analysis I am familiar with is:
>>>
>>> 1)  Fit the original data set to get the fitted parameter set (this
>>> uses the original error set).
>>> 2)  N/A.
>>> 3)  Randomise n times, assuming a Gaussian distribution, the original
>>> data set using original error set.
>>> 4)  Fit the n bootstrapped data sets as in 1).
>>> 5)  The values of 1) and standard deviation of 4) give the final
>>> parameter values.
>>>
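
Relative to the Monte Carlo sketch above, only step 3 would change,
randomising the measured heights rather than the back calculated ones
(hypothetically, reusing the same names):

    # 3) Randomise the original data using the original errors.
    synth = rng.normal(h, sigma)
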
>>> Is this how you implemented it?
>>>
>>
>> No. That is quite different to what I understand bootstrapping to be.
>> The bootstrap as I have implemented it is much more closely related to
>> the Jackknife, and requires no estimate of the precision of each peak
>> height. See http://en.wikipedia.org/wiki/Bootstrap_(statistics) for an
>> overview of the various Bootstrap approaches. The work I have done has
>> been exclusively with the most basic form. So, having fit the original
>> dataset, I resample the data by random selection with replacement from
>> the original dataset.
>> eg:
>> If we have peak height, relaxation time pairs stored as
>>
>> data = [(i0,t0), (i1,t1), ... ]
>>
>> we resample (your step 3 above) with:
>>
>> import random
>> new_data = [random.choice(data) for i in range(len(data))]
>>
>> The effect is that each resampled dataset has the same number of
>> datapoints as the original, but with a fraction of points missing and
>> replaced with duplicates of other points. It is so simple that it's
>> remarkable that it works, but it does seem to for this purpose at
>> least.
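
Putting this resampling together with the exponential fit, a complete
error estimate might look like the following sketch (scipy-based
illustration with hypothetical names, not relax code):

    import random
    import numpy as np
    from scipy.optimize import curve_fit

    def expdec(t, i0, r):
        return i0 * np.exp(-r * t)

    def bootstrap_rate_error(heights, times, n_boot=500):
        # Basic bootstrap: resample (height, time) pairs with
        # replacement, refit, and take the spread of the fitted
        # rates as the error.
        data = list(zip(heights, times))
        rates = []
        for _ in range(n_boot):
            sample = [random.choice(data) for _ in range(len(data))]
            h, t = map(np.array, zip(*sample))
            try:
                popt, _ = curve_fit(expdec, t, h, p0=(max(heights), 1.0))
            except RuntimeError:
                continue  # a degenerate resample can fail to converge
            rates.append(popt[1])
        return np.std(rates)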
>
> Ah, now I remember this bootstrapping method!  This is exactly as
> described in the Numerical Recep