Selecting a test set that minimizes Rfree is so wrong on so many levels.
Unless, of course, the only thing I know about Rfree is that it is the
magic number that I need to make small by any means necessary.
By using a simple genetic algorithm, I managed to get Rfree for a
well-refined model as
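To see why test-set cherry-picking gives a meaninglessly low Rfree, here is a toy sketch in plain Python (synthetic amplitudes, no crystallographic library; the genetic algorithm itself is not reproduced): picking the 5% of reflections the model already fits best yields a far lower R value than an honestly random test set, without the model improving at all.

```python
import random

# Toy illustration of test-set cherry-picking (synthetic data throughout;
# this is NOT the genetic algorithm from the message, just the bias it exploits).
def r_factor(obs, calc, idx):
    # Simple R-factor-like agreement over a chosen set of reflection indices.
    return sum(abs(obs[i] - calc[i]) for i in idx) / sum(obs[i] for i in idx)

random.seed(0)
n = 5000
f_obs = [random.uniform(10.0, 100.0) for _ in range(n)]
# A model that fits each reflection with ~10% random error:
f_calc = [f * (1.0 + random.gauss(0.0, 0.1)) for f in f_obs]

# Honest test set: 5% of reflections chosen at random.
honest = random.sample(range(n), n // 20)
# "Optimized" test set: the 5% the model already fits best.
best_fit = sorted(range(n), key=lambda i: abs(f_obs[i] - f_calc[i]) / f_obs[i])
cheat = best_fit[: n // 20]

print(r_factor(f_obs, f_calc, honest))  # reflects the real ~10% error level
print(r_factor(f_obs, f_calc, cheat))   # much lower, purely from selection bias
```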
Dear Nicholas,
for a data set with 5132 unique reflections you should flag 10.5% for
Rfree, otherwise you might as well drop Rfree completely and use the
whole data set for refinement. At least this is how I understand Axel
Brunger's article about
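For illustration, a minimal sketch (plain Python, with an assumed helper name; refinement programs normally assign these flags for you) of flagging a random fraction of the 5132 unique reflections mentioned above for Rfree:

```python
import random

# Minimal sketch of random free-flag assignment (plain Python, helper name
# is an assumption; refinement programs normally do this for you).
def assign_free_flags(n_reflections, fraction=0.10, seed=42):
    rng = random.Random(seed)
    test = set(rng.sample(range(n_reflections), round(n_reflections * fraction)))
    # True marks a free (test) reflection, False a working reflection.
    return [i in test for i in range(n_reflections)]

flags = assign_free_flags(5132, fraction=0.10)
print(sum(flags), "of", len(flags), "reflections flagged for Rfree")
```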
Dear Gerard, Tom and Bernhard,
Thank you for highlighting the IUCr Diffraction Data Deposition Working
Group and Forum.
Dear Colleagues,
I am travelling at present and apologise for not replying sooner to
the CCP4bb, and also have intermittent email access until later
this week when I return
I think that we are using the test set for many things:
1. Determining and communicating to others whether our overall procedure
is overfitting the data.
2. Identifying the optimal overall procedure in cases where very different
options are being considered (e.g., should I use TLS).
3.
Yes, Rsleep seems to be just the right thing to use for this:
Separating model optimization and model validation in statistical
cross-validation as applied to crystallography
G. J. Kleywegt
Acta Cryst. (2007). D63, 939-940
Practically, it would mean that we split 10% of test reflections into 5%
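A hedged sketch of that three-way split, with illustrative names and fractions (the Rsleep paper's own protocol may differ in detail): a "free" set monitored during refinement and a "sleep" set looked at only once, at the end, for validation.

```python
import random

# Hedged sketch of an Rsleep-style three-way split (names and fractions are
# illustrative): "free" reflections are monitored during refinement, "sleep"
# reflections are reserved for a single final validation pass.
def three_way_split(n, free_frac=0.05, sleep_frac=0.05, seed=1):
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    n_free = round(n * free_frac)
    n_sleep = round(n * sleep_frac)
    free, sleep = set(order[:n_free]), set(order[n_free:n_free + n_sleep])
    return ["free" if i in free else "sleep" if i in sleep else "work"
            for i in range(n)]

labels = three_way_split(5132)
print(labels.count("work"), labels.count("free"), labels.count("sleep"))
```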
On Sat, 2011-10-15 at 11:48 +0300, Nicholas M Glykos wrote:
For structures with a small number of reflections, the statistical
noise in the 5% sets can be very significant indeed. We have seen
differences between Rfree values obtained from different sets reaching
up to 4%.
This
Hi,
On Sun, Oct 16, 2011 at 7:48 PM, Ed Pozharski epozh...@umaryland.edu wrote:
On Sat, 2011-10-15 at 11:48 +0300, Nicholas M Glykos wrote:
For structures with a small number of reflections, the statistical
noise in the 5% sets can be very significant indeed. We have seen
Dear Ethan, List,
Surely someone must have done this! But I can't recall ever reading
an analysis of such a refinement protocol.
Does anyone know of relevant reports in the literature?
Total statistical cross validation is indeed what we should be doing, but
for large structures the
For structures with a small number of reflections, the statistical noise
in the 5% sets can be very significant indeed. We have seen differences
between Rfree values obtained from different sets reaching up to 4%.
This is very intriguing indeed!
Is there something specific in these
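The scatter Nicholas describes is easy to reproduce with a toy simulation (synthetic amplitudes, plain Python): the same model scored against different random 5% test sets of a small data set gives noticeably different R values.

```python
import random

# Toy simulation (synthetic data) of Rfree scatter between different random
# 5% test sets when the data set is small.
def r_factor(obs, calc, idx):
    return sum(abs(obs[i] - calc[i]) for i in idx) / sum(obs[i] for i in idx)

rng = random.Random(7)
n = 2000  # few reflections -> each 5% set holds only 100
f_obs = [rng.uniform(10.0, 100.0) for _ in range(n)]
f_calc = [f * (1.0 + rng.gauss(0.0, 0.2)) for f in f_obs]

rfrees = [r_factor(f_obs, f_calc, rng.sample(range(n), n // 20))
          for _ in range(50)]
print(min(rfrees), max(rfrees))  # a spread of several percent between sets
```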
This is a follow up (or a digression) to James' comparison of the test set
to missing reflections. I also heard this issue mentioned before but was
always too lazy to actually pursue it.
So.
The role of the test set is to prevent overfitting. Let's say I have
the final model and I monitored the Rfree
On Fri, Oct 14, 2011 at 12:52 PM, Ed Pozharski epozh...@umaryland.edu wrote:
The second question is practical. Let's say I want to deposit the
results of the refinement against the full dataset as my final model.
Should I not report the Rfree and instead insert a remark explaining the
Hi Ed,
Sorry, I don't quite understand your reasoning for how the structure
is rendered useless if one refined it with all data.
Would your argument also apply to all the structures that were refined
before R-free existed?
Quyen
You should enter the statistics for the model and data that you
Recent experience indicates that the PDB is checking these statistics very
closely for new depositions. The checks made by the PDB are intended to
prevent accidents and oversights made by honest people from creeping into the
database. Getting away with something seems to imply some intention
Regarding refinement against all reflections: the main goal of our work is
to provide the best possible representation of the experimental data in the
form of the structure model. Once the structure building and refinement
process is finished, keeping the Rfree set separate does not make sense any
On Fri, Oct 14, 2011 at 1:20 PM, Quyen Hoang qqho...@gmail.com wrote:
Sorry, I don't quite understand your reasoning for how the structure is
rendered useless if one refined it with all data.
Useless was too strong a word (it's Friday, sorry). I guess simulated
annealing can address the
I still don't understand how a structure model refined with all data
would negatively affect the determination and/or refinement of an
isomorphous structure using a different data set (even without doing
SA first).
Quyen
On Oct 14, 2011, at 4:35 PM, Nat Echols wrote:
On Fri, Oct 14, 2011
Let's say you have two isomorphous crystals of two different
protein-ligand complexes. Same protein, different ligand, same crystal
form. Conventionally you'd keep the same free set reflections (hkl
values) between the two datasets to reduce biasing. However if the
first model had been refined
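That convention can be sketched as follows (data structures invented for illustration, not taken from any particular program): carry the free flags over to the isomorphous data set by matching (h, k, l) indices, so the same reflections stay free in both refinements.

```python
# Sketch of carrying free flags over to an isomorphous data set by matching
# (h, k, l) indices (data structures invented for illustration).
def transfer_free_flags(old_flags, new_hkls, default=False):
    """old_flags: dict (h, k, l) -> True if free; new_hkls: new data set's indices."""
    # Reflections absent from the old set fall back to the working set.
    return {hkl: old_flags.get(hkl, default) for hkl in new_hkls}

old = {(1, 0, 0): True, (0, 1, 0): False, (0, 0, 1): True}
new = [(1, 0, 0), (0, 1, 0), (2, 0, 0)]
print(transfer_free_flags(old, new))  # (2, 0, 0) defaults to the working set
```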
Recently we (I mean WE, the community) frequently refine structures at around
1 Angstrom resolution.
This is not what Rfree was invented for. It was invented to get by with
3.0-2.8 Angstrom data in times when people did not possess facilities good
enough to look at the electron density maps….
On Fri, 2011-10-14 at 13:07 -0700, Nat Echols wrote:
You should enter the statistics for the model and data that you
actually deposit, not statistics for some other model that you might
have had at one point but which the PDB will never see.
If you read my post carefully, you'll see that I
We have obligations that extend beyond simply presenting a best model.
In an ideal world, the PDB would accept two coordinate sets and two sets of
statistics, one for the last step where the cross-validation set was valid, and
a final model refined against all the data. Until there is a
Thanks for the clear explanation. I understood that.
But I was trying to understand how this would negatively affect the
initial model to render it useless or less useful.
In the scenario that you presented, I would expect a better result
(better model) if the initial model was refined with
On Friday, October 14, 2011 02:45:08 pm Ed Pozharski wrote:
On Fri, 2011-10-14 at 13:07 -0700, Nat Echols wrote:
The benefit of including those extra 5% of data is always minimal
And so, probably, is the benefit of excluding it when all the steps that
require cross-validation have already
I just tried refining a finished structure turning off the FreeR set, in
Refmac, and I have to say I can barely see any difference between the two sets
of coordinates.
From this n=1 trial, I can't see that it improves the model significantly, nor
that it ruins the model irretrievably for
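One way to quantify "barely any difference" between two refined models is the RMSD over matched atoms; a self-contained sketch with invented coordinates (a real comparison would match atoms by residue and name):

```python
import math

# All-atom RMSD between two refined models (coordinates invented for
# illustration; real comparisons would match atoms by residue and name).
def rmsd(coords_a, coords_b):
    assert len(coords_a) == len(coords_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

model_with_free = [(1.00, 2.00, 3.00), (4.00, 5.00, 6.00)]
model_all_data = [(1.01, 2.00, 3.00), (4.00, 5.02, 6.00)]
print(round(rmsd(model_with_free, model_all_data), 4))  # tiny shift, in Angstrom
```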
For those who have strong opinions on what data should be deposited...
The IUCR is just starting a serious discussion of this subject. Two
committees, the Data Deposition Working Group, led by John Helliwell,
and the Commission on Biological Macromolecules (chaired by Xiao-Dong Su)
are working on
Dear Tom,
I am not sure that I feel happy with your invitation that views on such
crucial matters as these deposition issues be communicated to you off-list.
It would seem much healthier if these views were aired out within the BB.
Again!, some will say ... but the difference is that there
I may be missing something, or someone could point out why I am wrong, as
I am curious: with a highly redundant dataset, wouldn't the difference from
refining the final model against the full dataset be small, given the
random selection of reflections for Rfree?
Dear Gerard,
I'm very happy for the discussion to be on the CCP4 list (or on the IUCR
forums, or both). I was only trying to not create too much traffic.
All the best,
Tom T
Dear Tom,
I am not sure that I feel happy with your invitation that views on such
crucial matters as these
Now it would be interesting to refine this structure to convergence,
with the original free set. If I understood correctly Ian Tickle has
done essentially this, and the Free R returns essentially to its
original value: the minimum arrived at is independent of starting
point, perhaps within
Each R-free flag corresponds to a particular HKL index. Redundancy refers to the
number of times a reflection corresponding to a given HKL index is observed.
The final structure factor of a given HKL can be thought of as an average of
these redundant observations.
Related to your question,
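The distinction drawn above can be sketched as follows (unweighted averaging for simplicity; real merging weights observations, e.g. by sigma):

```python
from collections import defaultdict

# Unweighted merging sketch: redundancy = several observations of the same
# (h, k, l); merging averages them into one unique reflection, and the
# free-R flag then applies to that unique HKL. Real merging weights by sigma.
def merge_observations(observations):
    groups = defaultdict(list)
    for hkl, value in observations:  # observations: ((h, k, l), measurement)
        groups[hkl].append(value)
    return {hkl: sum(vals) / len(vals) for hkl, vals in groups.items()}

obs = [((1, 0, 0), 100.0), ((1, 0, 0), 104.0), ((0, 1, 0), 50.0)]
print(merge_observations(obs))  # three observations merge to two unique HKLs
```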
On Fri, 2011-10-14 at 23:41 +0100, Phil Evans wrote:
I just tried refining a finished structure turning off the FreeR
set, in Refmac, and I have to say I can barely see any difference
between the two sets of coordinates.
The amplitude of the shift, I presume, depends on the resolution and data
Hi,
yes, shifts depend on resolution indeed. See pages 75-77 here:
http://www.phenix-online.org/presentations/latest/pavel_refinement_general.pdf
Pavel
On Fri, Oct 14, 2011 at 7:34 PM, Ed Pozharski epozh...@umaryland.edu wrote:
On Fri, 2011-10-14 at 23:41 +0100, Phil Evans wrote:
I just