Re: [ccp4bb] Meaning of a pdb entry

2021-06-02 Thread Marcin Wojdyr
Dear Gergely,

>
> Thank you for these examples! It is reassuring to see that multiple 
> crystallographic models do not break validation for example. I assume only 
> the validation of the first model and first reflection file is shown. I can 
> imagine that it is still a substantial change and may require an extended 
> description to make such depositions fully functional. I will ask the PDB, 
> but this made me optimistic. It would be easy to implement Tim's method if it 
> works for deposition.

I think it won't work.

(1) One entry can contain an ensemble of many models. 4PTH has 250
models and each atom has occupancy 0.004. It's similar to alternative
locations, but taken to extremes.

(2) One entry can contain a single model and multiple datasets. 5RKZ
has 1500+ datasets from crystals soaked with different compounds. You
can concatenate reflection mmCIF files, but only the first block is
used for validation. I suppose all the models were similar, so one can
obtain them by refining the single deposited model to each dataset.
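
To illustrate point (2): a concatenated reflection file is a sequence of data blocks, each opened by a `data_` header, so picking "the first block" only needs the file split at those headers. Below is a minimal Python sketch, assuming every header starts a line and that no multi-line text field contains a line beginning with `data_`. The block names and cell values are invented for the example, and a real file is better read with a proper CIF parser.

```python
import re

# Hypothetical concatenated reflection file with two data blocks.
# Block names and cell values are made up for this example.
CONCATENATED = """\
data_r0001sf
_cell.length_a 50.0
data_r0002sf
_cell.length_a 50.3
"""

def split_blocks(text):
    """Return a list of (block_name, block_text) in file order."""
    # A data block header is 'data_' followed by the block name at the
    # start of a line; the keyword is case-insensitive in CIF.
    headers = list(
        re.finditer(r"^data_(\S+)", text, flags=re.MULTILINE | re.IGNORECASE)
    )
    blocks = []
    for i, match in enumerate(headers):
        # A block runs until the next header or the end of the file.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        blocks.append((match.group(1), text[match.start():end]))
    return blocks

blocks = split_blocks(CONCATENATED)
first_name, first_text = blocks[0]
print(len(blocks), "blocks; first is", first_name)
```

Everything after the first header-to-header span is ignored by a consumer that reads only `blocks[0]`, which is the behaviour described above for validation.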

I don't think you can meaningfully deposit multiple datasets with
corresponding models in a single entry. (2) is probably the closest to
what you want.

Best wishes,
Marcin



To unsubscribe from the CCP4BB list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB=1

This message was issued to members of www.jiscmail.ac.uk/CCP4BB, a mailing list 
hosted by www.jiscmail.ac.uk, terms & conditions are available at 
https://www.jiscmail.ac.uk/policyandsecurity/


Re: [ccp4bb] Meaning of a pdb entry

2021-06-02 Thread Gergely Katona
Dear Ethan,

This is an interesting discussion. I agree that the word uncertainty covers very 
broad concepts, so I will try to narrow down what I mean. 
My starting point is that a reflection file contains point estimates of 
diffraction intensities or structure factor amplitudes. Refinement and model 
building result in a point estimate of a structural model. These estimates can 
be calculated to arbitrary precision. Variation in these parameters is 
expected from sampling: sampling of experimental data and sampling of 
(pseudo)random events in refinement and model building algorithms. I ignore the 
role of human influence and bias for now. Error models with different 
assumptions may help to quantify the expected variation, but if I want to 
verify these error models, or just have an alternative way of quantifying 
uncertainty, I have to go back to sampling. 
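
One way to read "going back to sampling" is to resample a set of replicate estimates and look at the spread of the recomputed value. The Python sketch below uses a bootstrap for that. The bootstrap is an assumption chosen for illustration, not a method named in this thread, and the eight parameter values are invented.

```python
import random
import statistics

def bootstrap_sd(values, n_resamples=2000, seed=0):
    """Standard deviation of the mean over bootstrap resamples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(n_resamples):
        # Draw len(values) items with replacement and recompute the mean.
        resample = [rng.choice(values) for _ in values]
        means.append(statistics.fmean(resample))
    return statistics.stdev(means)

# Hypothetical refined values of one model parameter from eight crystals.
estimates = [21.3, 19.8, 22.1, 20.5, 20.9, 21.7, 19.5, 20.2]

point_estimate = statistics.fmean(estimates)
spread = bootstrap_sd(estimates)
print(f"{point_estimate:.2f} +/- {spread:.2f}")
```

The point estimate can be printed to any number of digits, but only the resampled spread says how many of them carry information.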

1) This is definitely the mean position for this atom in this crystal but there 
is uncertainty in how much individual instances in different crystal unit cells 
within the lattice deviate from this mean.

Ultimately, this is the category "unknown" for me: I cannot narrow down the 
atomic positions towards a single point with sampling alone or with any of the 
experimental methods that I am aware of. The best I can achieve is to improve 
the accuracy and precision of the parameters of the model that describes the 
distribution of atomic positions.

2) This is a best-effort description of the position of a ligand atom.  However 
it is uncertain what fraction of the unit cells contain the ligand at this 
position, or at all.

My uncertainty is tied to my model, but I can choose different models of course. 
I cannot tell the fraction of unit cells containing the ligand if it is not 
part of my model, and I cannot estimate the uncertainty of this parameter by 
sampling. Should I face such a question, I might take the average B-factors 
of the ligand that is part of my model from different samples and compare 
them to a set of control measurements. The control model should also 
contain parameters for the ligand, otherwise I cannot perform a comparison. 
Clearly, this could be misinterpreted by someone else as a determined location 
of the ligand in the control group, because the crystallographic models in the 
PDB traditionally do not represent a tool for asking questions, but the best 
effort determination of the "true" structure or ensemble. 

I could define a dependent model parameter which integrates the omit electron 
density in a certain region of the ASU and I can compare the control and soaked 
group of crystals. This is a dependent/deterministic parameter, because its 
variation will entirely depend on the variation of reflection data and the 
variation of the non-omitted atom positions/parameters. These types of deterministic 
model parameters are not traditionally part of a crystallographic model in a 
PDB entry. Again, this could be misinterpreted as a lack of ligand atoms/lack 
of ligand in the treated group.
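
A dependent parameter of this kind could be computed as in the Python sketch below, which sums the map values inside a sphere and multiplies by the voxel volume. It assumes a simple orthogonal grid with equal spacing along all three axes, which is a simplification of a real crystallographic map; the function name, the grid and the sphere are hypothetical.

```python
import math

def integrate_density(grid, spacing, centre, radius):
    """Integrate density over grid points within `radius` of `centre`.

    grid    -- nested list grid[i][j][k] of density values (e/A^3)
    spacing -- grid spacing in A (assumed equal along x, y and z)
    centre  -- (x, y, z) of the sphere centre in A
    radius  -- sphere radius in A
    """
    voxel_volume = spacing ** 3
    total = 0.0
    for i, plane in enumerate(grid):
        for j, row in enumerate(plane):
            for k, rho in enumerate(row):
                point = (i * spacing, j * spacing, k * spacing)
                if math.dist(point, centre) <= radius:
                    total += rho * voxel_volume
    return total

# Toy 5x5x5 map with a constant density of 0.5 e/A^3 on a 1 A grid.
toy_map = [[[0.5 for _ in range(5)] for _ in range(5)] for _ in range(5)]
integral = integrate_density(toy_map, 1.0, centre=(2.0, 2.0, 2.0), radius=1.0)
print(integral)
```

Because the value is a deterministic function of the map, repeating the calculation for each crystal in the control and soaked groups gives two samples of the same quantity that can be compared directly.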
  
The purpose of a PDB entry has evolved over time from a single type of 
crystallographic model to include multiple NMR models, different models, 
different methods, experimental data, validation etc. I expect that this nearly 
imperceptible evolution will continue in different directions at different 
speeds. If what I try to achieve now deviates from the current perception of 
the purpose of a PDB entry then of course I have to find other means to fulfill 
my obligation to make my data open to access. Fortunately, the data I plan to 
archive is not related to the determination of ligand occupancy and perhaps 
more in line with the current purpose of the PDB.

3) It is likely that this sidechain/loop/subunit is present in different 
conformations in different copies of the unit cell.

This is again the unknown category that I cannot address by sampling. The model 
can be open to grow (non-parametric), if the refinement is coupled to automated 
rebuilding. I have to define my question differently, for example ask how many 
conformations or water molecules were built in the control and treated group. 
Interpretation may vary, but I may have sufficient evidence of significantly 
different models in the different groups.
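
As an illustration of such a group comparison, the Python sketch below runs a two-sided permutation test on the number of water molecules built per crystal. The permutation test is an assumption chosen for the example, not a procedure proposed in this thread, and the counts are invented.

```python
import random
import statistics

def permutation_p_value(control, treated, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(statistics.fmean(treated) - statistics.fmean(control))
    pooled = list(control) + list(treated)
    n_control = len(control)
    hits = 0
    for _ in range(n_perm):
        # Reassign group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(
            statistics.fmean(pooled[n_control:])
            - statistics.fmean(pooled[:n_control])
        )
        if diff >= observed:
            hits += 1
    # The add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical number of waters built by the same automated protocol.
control_waters = [212, 208, 215, 210, 209, 214]
treated_waters = [198, 201, 195, 203, 199, 197]

p = permutation_p_value(control_waters, treated_waters)
print(f"p = {p:.4f}")
```

The test makes no claim about which waters are "real"; it only asks whether the rebuilding protocol produces different counts in the two groups more often than relabelling the crystals at random would.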

If the variation is better represented in pdb entries, machine learning 
algorithms can also achieve better predictions, less biased towards an 
arbitrary model sample.

4) The coordinates of this specific atom/residue/conformation are well 
supported by the data for this particular crystal.
But it might be somewhere else in the next crystal from the same 
crystallization drop, or in a crystal from a different crystallization buffer, 
or at another temperature, or in solution, or in the presence of a ligand, etc.

I am interested in representing the type of variation I cannot control and when 
designing the experiment it is in my best interest to limit the variation of 
experimental conditions between the 

Re: [ccp4bb] Meaning of a pdb entry

2021-06-02 Thread Gergely Katona
Dear Marcin,

Thank you for these examples! It is reassuring to see that multiple 
crystallographic models do not break validation for example. I assume only the 
validation of the first model and first reflection file is shown. I can imagine 
that it is still a substantial change and may require an extended description 
to make such depositions fully functional. I will ask the PDB, but this made me 
optimistic. It would be easy to implement Tim's method if it works for 
deposition.

Best wishes,

Gergely






Re: [ccp4bb] Meaning of a pdb entry

2021-06-01 Thread Marcin Wojdyr
Dear Gergely,

For authoritative advice you'd need to ask the PDB. Below is my take.

> I am not sure if it is possible to use MODEL-ENDMDL loops in pdb or mmcif 
> format for storing multiple crystallographic models.

It's possible, there are a few examples such as 2VTU. They represent
"ensemble refinement" of a crystal structure. So it's one refinement
of one dataset, but with multiple models.

> I assume it is already possible to store multiple structure factor files (for 
> refinement, for phasing, different crystals etc) under the same entry.

yes

> In my mind, it would be a small step to associate different data sets 
> distinguished by crystal ID or data block with a particular model number, but 
> maybe it is not that simple.
>

It'd be a substantial change, but indeed, the changes in the file
format would not be that big. As you wrote, model IDs would need to be
associated with crystal IDs. And probably other associations would be
needed, such as unit cell with crystal.
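
The associations mentioned here can be pictured as two small tables, sketched below in Python. All class and field names are hypothetical and do not correspond to existing mmCIF categories; the unit cells and block names are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Crystal:
    crystal_id: str
    unit_cell: tuple  # (a, b, c, alpha, beta, gamma)
    data_block: str   # name of the reflection data block for this crystal

@dataclass(frozen=True)
class ModelLink:
    model_number: int  # serial number of the coordinate model
    crystal_id: str    # crystal whose data the model was refined against

# Hypothetical entry with two crystals and one model per crystal.
crystals = {
    "xtal1": Crystal("xtal1", (50.0, 60.0, 70.0, 90.0, 90.0, 90.0), "r0001sf"),
    "xtal2": Crystal("xtal2", (50.3, 60.1, 69.8, 90.0, 90.0, 90.0), "r0002sf"),
}
links = [ModelLink(1, "xtal1"), ModelLink(2, "xtal2")]

def crystal_for_model(model_number):
    """Look up the crystal (unit cell and data block) behind a model."""
    for link in links:
        if link.model_number == model_number:
            return crystals[link.crystal_id]
    raise KeyError(model_number)

print(crystal_for_model(2).data_block)
```

The sketch also shows why the unit cell has to move: with one cell per crystal, it can no longer be a single entry-level value.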

> I do not want to create multiple pdb entries just to provide evidence for the 
> robustness/reproducibility of crystals and crystallographic models. I would 
> rather use different pdb entries for different sampling intentions: for 
> example entry 1 contains all the control crystals, entry 2 contains all the 
> crystals subjected to treatment A, etc.

I think it's similar to PanDDA depositions, but I don't know what
the current best practice is.
Initially, one Deposition Group would have hundreds of PDB entries
(each with a single dataset). But later on I've seen entries with a
single model and hundreds of datasets. I haven't looked into it
closely, so perhaps someone else can advise.

Concatenating multiple mmCIF files (with coordinates) would produce a
syntactically valid file, but I don't think that such a file would get
through the deposition process.

Marcin





Re: [ccp4bb] Meaning of a pdb entry

2021-06-01 Thread Ethan A Merritt
On Monday, 31 May 2021 10:53:46 PDT Gergely Katona wrote:
> Dear Ethan,
> 
> Thank you for your comments! I started a new thread, it was unfortunate that 
> I brought this up in a discussion about B-factors. I really wanted to discuss 
> something that is model agnostic and how to represent uncertainty by 
> sampling. I consider an ensemble model with multiple partial occupancy 
> molecules is still one model. 

Gergely,

Your questions touch on topics that are far too broad to address
satisfactorily on the bulletin board.  I will offer a few thoughts.

- The PDB format has supposedly been deprecated in favor of mmcif,
but let's disregard that.

- A PDB file can contain any number of models. Each is introduced
by a record with "MODEL " in columns 1-6.  The documentation said
  "The MODEL record specifies the model serial number when multiple
   structures are presented in a single coordinate entry, as is often
   the case with structures determined by NMR."
Note that it mentions NMR as an example but does not limit use
of multiple model sections to NMR experiments.

- The PDB format allows [requires?] a header record with
"EXPDTA" in columns 1-6.  This is used to identify whether the 
model coordinates in the file are supported by X-ray data, NMR data,
theoretical calculation, fiber diffraction, etc.
I don't know how long the list grew to be.

In the context of your question, this EXPDTA information is important.
For example my earlier comment that ensemble models are not
statistically justified was specifically with regard to modeling
X-ray crystal diffraction data.  Generating an ensemble to describe,
say, snapshots of an MD simulation is an entirely different story.
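
The two record types discussed above can be picked out by their fixed columns, as in this minimal Python sketch. The PDB fragment is invented for the example, and the code assumes the serial number of a MODEL record sits in columns 11-14 and the EXPDTA technique starts in column 11, as in the version 3 format description.

```python
# Hypothetical two-model PDB fragment; coordinates are made up.
PDB_TEXT = """\
EXPDTA    X-RAY DIFFRACTION
MODEL        1
ATOM      1  CA  ALA A   1      11.104  13.207   2.100  0.50 20.00           C
ENDMDL
MODEL        2
ATOM      1  CA  ALA A   1      11.300  13.100   2.250  0.50 20.00           C
ENDMDL
END
"""

def summarize(pdb_text):
    """Return (experimental methods, model serial numbers) of a PDB text."""
    methods = []
    model_serials = []
    for line in pdb_text.splitlines():
        record = line[:6]  # the record name occupies columns 1-6
        if record == "EXPDTA":
            methods.append(line[10:].strip())
        elif record == "MODEL ":
            model_serials.append(int(line[10:14]))
    return methods, model_serials

methods, model_serials = summarize(PDB_TEXT)
print(methods, model_serials)
```

Nothing in the parsing ties the number of MODEL sections to the EXPDTA value, which matches the observation that multiple models are not limited to NMR entries.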

- "Uncertainty" is pretty vague.
Just sticking with crystal structures, it could mean:

1) This is definitely the mean position for this atom in this
crystal but there is uncertainty in how much individual instances
in different crystal unit cells within the lattice deviate from
this mean.

2) This is a best-effort description of the position of a ligand
atom.  However it is uncertain what fraction of the unit cells
contain the ligand at this position, or at all.

3) It is likely that this sidechain/loop/subunit is present in
different conformations in different copies of the unit cell.

4) The coordinates of this specific atom/residue/conformation
are well supported by the data for this particular crystal.
But it might be somewhere else in the next crystal from the
same crystallization drop, or in a crystal from a different
crystallization buffer, or at another temperature, or in
solution, or in the presence of a ligand, etc.
 
best

Ethan



Re: [ccp4bb] Meaning of a pdb entry

2021-05-31 Thread Tim Gruene
Dear Gergely,

you can concatenate (mm)CIF files without violating the grammar. Thus,
if you want to deposit multiple models at once, just run 'cat 1.mmcif
2.mmcif 3.mmcif > allmy.mmcif' for deposition. This works for CIF, and
is accepted e.g. by the IUCr journals.
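
A Python equivalent of the 'cat' command is sketched below, with one added check: it refuses to join files whose data block names collide. That check rests on the assumption that block names must be unique within one CIF file and compare case-insensitively; the function name and the sample texts are hypothetical.

```python
import re

def concatenate_cif(texts):
    """Join CIF texts after checking that data block names do not collide."""
    seen = set()
    for text in texts:
        names = re.findall(
            r"^data_(\S+)", text, flags=re.MULTILINE | re.IGNORECASE
        )
        for name in names:
            key = name.lower()  # assumed case-insensitive comparison
            if key in seen:
                raise ValueError(f"duplicate data block name: {name}")
            seen.add(key)
    # End every part with a newline so the next header starts a line.
    return "".join(t if t.endswith("\n") else t + "\n" for t in texts)

# Two hypothetical single-block files.
combined = concatenate_cif(["data_model1\n_x 1", "data_model2\n_x 2\n"])
print(combined)
```

Files that were all written with the same default block name would need renaming before a plain 'cat' gives a usable result.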

Best,
Tim


[ccp4bb] Meaning of a pdb entry

2021-05-31 Thread Gergely Katona
Dear Ethan,

Thank you for your comments! I started a new thread, it was unfortunate that I 
brought this up in a discussion about B-factors. I really wanted to discuss 
something that is model agnostic and how to represent uncertainty by sampling. 
I consider an ensemble model with multiple partial-occupancy molecules to still be 
one model. 

I am not sure if it is possible to use MODEL-ENDMDL loops in pdb or mmcif 
format for storing multiple crystallographic models. I assume it is already 
possible to store multiple structure factor files (for refinement, for phasing, 
different crystals etc) under the same entry. In my mind, it would be a small 
step to associate different data sets distinguished by crystal ID or data block 
with a particular model number, but maybe it is not that simple. 

I do not want to create multiple pdb entries just to provide evidence for the 
robustness/reproducibility of crystals and crystallographic models. I would 
rather use different pdb entries for different sampling intentions: for example 
entry 1 contains all the control crystals, entry 2 contains all the crystals 
subjected to treatment A, etc. These would otherwise share identical data 
reduction and refinement protocols and most of the metadata. I am afraid I do 
not know how the PDB and associated services work internally, but I hope someone 
here can provide guidance.

Best wishes,

Gergely


Gergely Katona, Professor, Chairman of the Chemistry Program Council
Department of Chemistry and Molecular Biology, University of Gothenburg
Box 462, 40530 Göteborg, Sweden
Tel: +46-31-786-3959 / M: +46-70-912-3309 / Fax: +46-31-786-3910
Web: http://katonalab.eu, Email: gergely.kat...@gu.se

-Original Message-
From: CCP4 bulletin board  On Behalf Of Ethan A Merritt
Sent: 29 May, 2021 19:16
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] AW: [ccp4bb] AW: [ccp4bb] (R)MS

On Saturday, 29 May 2021 02:12:16 PDT Gergely Katona wrote:
[...snip...]
 I think the assumption of independent variations per atoms is too strong in 
many cases and does not give an accurate picture of uncertainty.
[...snip...]


Gergely, you are revisiting a line of thought that historically led to the 
introduction of more global treatments of atomic displacement.
These have distinct statistical and interpretational advantages.

Several approaches have been tried over the past 40 years or so.
The one that has proved most successful is the use of TLS
(Translation/Libration/Screw) models of bulk displacement to supplement or 
replace per-atom descriptions.  As you say, a per-atom treatment is often too 
strong and is not statistically justified by the experimental data.  I explored 
this with specific examples in

   "To B or not to B?" [Acta Cryst. 2012, D68, 468-477]
http://skuld.bmsc.washington.edu/~tlsmd/references.html

An NMR-style approach that constructs and refines multiple discrete models has 
been re-invented several times. These treatments are generally called 
"ensemble models".  IMHO they are statistically unjustified and strictly worse 
than treatments based on higher level descriptions such as TLS or normal-mode 
analysis.
X-ray data is qualitatively different from NMR data, and optimal treatment of 
uncertainty must take this into account.

best regards

Ethan


> Hi,
> 
> It is enough to have Ų as unit to express uncertainty in 3D, but one can 
> express it with a single number only in a very specific case when the atom is 
> isotropic. Few atoms have a naturally isotropic distribution around their 
> mean position in very high resolution protein crystal structures. The 
> anisotropic atoms can be described by a 3x3 matrix, where each row and column 
> is associated with the uncertainty in a specific spatial direction. The 
> matrix elements are the product of the uncertainty in these directions. The 
> diagonal elements will be the square of uncertainty in the same direction and 
> they should be always positive, the off-diagonal combination of directions 
> are covariances (+,0 or -). In the end, every element will have a unit 
> distance*distance and the matrix will be symmetric. We cannot just take the 
> square root of the matrix elements and expect something meaningful, if for no 
> other reason the problem with negative covariances. To calculate the square 
> root on the matrix itself one has to diagonalize it first. The height of a 
> person in your example  sounds easy to define, but the mathematical formalism 
> will not decide that for me. I can also define height as the longest chord of 
> a person or the maximum elevation of a car mechanic under a car.  Through 
> diagonalization one can at least extract some interesting, intuitive, 
> principal directions. The final product, the sqrt(matrix), is not more 
> intuitive to me. To convert it to something intuitive I would have to 
> diagonalize the square-rooted matrix again. So shall we make an exception for the 
> special, isotropic description? Or