Re: [ccp4bb] Meaning of a pdb entry
Dear Gergely,

> Thank you for these examples! It is reassuring to see that multiple crystallographic models do not break validation for example. I assume only the validation of the first model and first reflection file is shown. I can imagine that it is still a substantial change and may require an extended description to make such depositions fully functional. I will ask the PDB, but this made me optimistic. It would be easy to implement Tim's method if it works for deposition.

I think it won't work.

(1) One entry can contain an ensemble of many models. 4PTH has 250 models and each atom has occupancy 0.004. It's similar to alternative locations, but taken to extremes.

(2) One entry can contain a single model and multiple datasets. 5RKZ has 1500+ datasets from crystals soaked with different compounds. You can concatenate reflection mmCIF files, but only the first block is used for validation. I suppose all the models were similar, so one can obtain them by refining the single deposited model to each dataset.

I don't think you can meaningfully deposit multiple datasets with corresponding models in a single entry. (2) is probably the closest to what you want.

Best wishes,
Marcin

To unsubscribe from the CCP4BB list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB=1

This message was issued to members of www.jiscmail.ac.uk/CCP4BB, a mailing list hosted by www.jiscmail.ac.uk, terms & conditions are available at https://www.jiscmail.ac.uk/policyandsecurity/
Re: [ccp4bb] Meaning of a pdb entry
Dear Ethan,

This is an interesting discussion. I agree that the word uncertainty covers very broad concepts, so I will try to narrow down what I mean. My starting point is that a reflection file contains point estimates of diffraction intensities or structure factor amplitudes. Refinement and model building result in a point estimate of a structural model. These estimates can be calculated to arbitrary precision. The variation in these parameters is expected to come from sampling: sampling of experimental data and sampling of (pseudo)random events in refinement and model building algorithms. I ignore the role of human influence and bias for now. Error models with different assumptions may help to quantify the expected variation, but if I want to verify these error models, or just have an alternative way of quantifying uncertainty, I have to go back to sampling.

> 1) This is definitely the mean position for this atom in this crystal but there is uncertainty in how much individual instances in different crystal unit cells within the lattice deviate from this mean.

Ultimately, this is the category "unknown" for me: I cannot narrow down the atomic positions towards a single point with sampling alone or with any of the experimental methods that I am aware of. The best I can achieve is to improve the accuracy and precision of the parameters of the distribution that describes the atomic positions.

> 2) This is a best-effort description of the position of a ligand atom. However it is uncertain what fraction of the unit cells contain the ligand at this position, or at all.

My uncertainty is tied to my model, but I can of course choose different models. I cannot tell the fraction of unit cells containing the ligand if it is not part of my model, and I cannot estimate the uncertainty of this parameter by sampling. Should I face such a question, I might compare the average B-factors of the ligand that is part of my model across different samples with those from a set of control measurements. The control model should also contain parameters for the ligand, otherwise I cannot perform a comparison. Clearly, this could be misinterpreted by someone else as a determined location of the ligand in the control group, because the crystallographic models in the PDB traditionally do not represent a tool for asking questions, but rather the best-effort determination of the "true" structure or ensemble.

I could define a dependent model parameter that integrates the omit electron density in a certain region of the ASU, and then compare the control and soaked groups of crystals. This is a dependent/deterministic parameter, because its variation depends entirely on the variation of the reflection data and the variation of the non-omitted atom positions/parameters. These types of deterministic model parameters are not traditionally part of a crystallographic model in a PDB entry. Again, this could be misinterpreted as a lack of ligand atoms/lack of ligand in the treated group.

The purpose of a PDB entry has evolved over time from a single type of crystallographic model to include multiple NMR models, different models, different methods, experimental data, validation etc. I expect that this nearly imperceptible evolution will continue in different directions at different speeds. If what I try to achieve now deviates from the current perception of the purpose of a PDB entry, then of course I have to find other means to fulfill my obligation to make my data openly accessible. Fortunately, the data I plan to archive is not related to the determination of ligand occupancy and is perhaps more in line with the current purpose of the PDB.

> 3) It is likely that this sidechain/loop/subunit is present in different conformations in different copies of the unit cell.

This is again the unknown category that I cannot address by sampling. The model can be open to growth (non-parametric) if the refinement is coupled to automated rebuilding. I have to define my question differently, for example by asking how many conformations or water molecules were built in the control and treated groups. Interpretation may vary, but I may have sufficient evidence of significantly different models in the different groups. If the variation is better represented in PDB entries, machine learning algorithms can also achieve better predictions, less biased towards an arbitrary model sample.

> 4) The coordinates of this specific atom/residue/conformation are well supported by the data for this particular crystal. But it might be somewhere else in the next crystal from the same crystallization drop, or in a crystal from a different crystallization buffer, or at another temperature, or in solution, or in the presence of a ligand, etc.

I am interested in representing the type of variation I cannot control, and when designing the experiment it is in my best interest to limit the variation of experimental conditions between the
Re: [ccp4bb] Meaning of a pdb entry
Dear Marcin,

Thank you for these examples! It is reassuring to see that multiple crystallographic models do not break validation, for example. I assume only the validation of the first model and first reflection file is shown. I can imagine that it is still a substantial change and may require an extended description to make such depositions fully functional. I will ask the PDB, but this made me optimistic. It would be easy to implement Tim's method if it works for deposition.

Best wishes,

Gergely

-Original Message-
From: Marcin Wojdyr
Sent: 1 June 2021 21:23
To: Gergely Katona
Cc: CCP4BB@jiscmail.ac.uk
Subject: Re: [ccp4bb] Meaning of a pdb entry

Dear Gergely,

For authoritative advice you'd need to ask the PDB. Below is my take.

> I am not sure if it is possible to use MODEL-ENDMDL loops in pdb or mmcif format for storing multiple crystallographic models.

It's possible, there are a few examples such as 2VTU. They represent "ensemble refinement" of a crystal structure. So it's one refinement of one dataset, but with multiple models.

> I assume it is already possible to store multiple structure factor files (for refinement, for phasing, different crystals etc) under the same entry.

yes

> In my mind, it would be a small step to associate different data sets distinguished by crystal ID or data block with a particular model number, but maybe it is not that simple.

It'd be a substantial change, but indeed, the changes in the file format would not be that big. As you wrote, model IDs would need to be associated with crystal IDs. And probably other associations would be needed, such as unit cell with crystal.

> I do not want to create multiple pdb entries just to provide evidence for the robustness/reproducibility of crystals and crystallographic models. I would rather use different pdb entries for different sampling intentions: for example entry 1 contains all the control crystals, entry 2 contains all the crystals subjected to treatment A, etc.

I think it's similar to PanDDA depositions, but I don't know what the current best practice is. Initially, one Deposition Group would have hundreds of PDB entries (each with a single dataset). But later on I've seen entries with a single model and hundreds of datasets. I haven't looked into it closely, so perhaps someone else can advise.

Concatenating multiple mmCIF files (with coordinates) would produce a syntactically valid file, but I don't think that such a file would get through the deposition process.

Marcin
Re: [ccp4bb] Meaning of a pdb entry
Dear Gergely,

For authoritative advice you'd need to ask the PDB. Below is my take.

> I am not sure if it is possible to use MODEL-ENDMDL loops in pdb or mmcif format for storing multiple crystallographic models.

It's possible, there are a few examples such as 2VTU. They represent "ensemble refinement" of a crystal structure. So it's one refinement of one dataset, but with multiple models.

> I assume it is already possible to store multiple structure factor files (for refinement, for phasing, different crystals etc) under the same entry.

yes

> In my mind, it would be a small step to associate different data sets distinguished by crystal ID or data block with a particular model number, but maybe it is not that simple.

It'd be a substantial change, but indeed, the changes in the file format would not be that big. As you wrote, model IDs would need to be associated with crystal IDs. And probably other associations would be needed, such as unit cell with crystal.

> I do not want to create multiple pdb entries just to provide evidence for the robustness/reproducibility of crystals and crystallographic models. I would rather use different pdb entries for different sampling intentions: for example entry 1 contains all the control crystals, entry 2 contains all the crystals subjected to treatment A, etc.

I think it's similar to PanDDA depositions, but I don't know what the current best practice is. Initially, one Deposition Group would have hundreds of PDB entries (each with a single dataset). But later on I've seen entries with a single model and hundreds of datasets. I haven't looked into it closely, so perhaps someone else can advise.

Concatenating multiple mmCIF files (with coordinates) would produce a syntactically valid file, but I don't think that such a file would get through the deposition process.

Marcin
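
The associations mentioned in this message (model ID with crystal ID, unit cell with crystal) can be pictured as a small lookup structure. The following Python sketch only illustrates the idea: the class and field names (`Crystal`, `ModelLink`, `data_block`) and every value are hypothetical and do not correspond to any existing PDB or mmCIF category.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crystal:
    crystal_id: str   # hypothetical identifier
    data_block: str   # name of the reflection data block for this crystal
    unit_cell: tuple  # (a, b, c, alpha, beta, gamma), placeholder values


@dataclass(frozen=True)
class ModelLink:
    model_number: int  # MODEL serial number in the coordinate file
    crystal_id: str    # crystal whose data this model was refined against


# Invented example records, not taken from any deposition.
crystals = {
    "xtal1": Crystal("xtal1", "r_xtal1sf", (50.0, 60.0, 70.0, 90.0, 90.0, 90.0)),
    "xtal2": Crystal("xtal2", "r_xtal2sf", (50.2, 60.1, 69.8, 90.0, 90.0, 90.0)),
}
links = [ModelLink(1, "xtal1"), ModelLink(2, "xtal2")]


def crystal_for_model(model_number):
    """Return the Crystal linked to a model number, or raise KeyError."""
    for link in links:
        if link.model_number == model_number:
            return crystals[link.crystal_id]
    raise KeyError(f"no crystal linked to model {model_number}")


print(crystal_for_model(2).data_block)
```

Keeping the unit cell on the crystal record rather than on the model reflects the point that each crystal may have slightly different cell parameters; whether the archive would organise it this way is an open question for the PDB.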
Re: [ccp4bb] Meaning of a pdb entry
On Monday, 31 May 2021 10:53:46 PDT Gergely Katona wrote:

> Dear Ethan,
>
> Thank you for your comments! I started a new thread, it was unfortunate that I brought this up in a discussion about B-factors. I really wanted to discuss something that is model agnostic and how to represent uncertainty by sampling. I consider an ensemble model with multiple partial occupancy molecules is still one model.

Gergely,

Your questions touch on topics that are far too broad to address satisfactorily on the bulletin board. I will offer a few thoughts.

- The PDB format has supposedly been deprecated in favor of mmcif, but let's disregard that.

- A PDB file can contain any number of models. Each is introduced by a record with "MODEL " in columns 1-6. The documentation said "The MODEL record specifies the model serial number when multiple structures are presented in a single coordinate entry, as is often the case with structures determined by NMR." Note that it mentions NMR as an example but does not limit use of multiple model sections to NMR experiments.

- The PDB format allows [requires?] a header record with "EXPDTA" in columns 1-6. This is used to identify whether the model coordinates in the file are supported by X-ray data, NMR data, theoretical calculation, fiber diffraction, etc. I don't know how long the list grew to be. In the context of your question, this EXPDTA information is important. For example my earlier comment that ensemble models are not statistically justified was specifically with regard to modeling X-ray crystal diffraction data. Generating an ensemble to describe, say, snapshots of an MD simulation is an entirely different story.

- "Uncertainty" is pretty vague. Just sticking with crystal structures, it could mean:

  1) This is definitely the mean position for this atom in this crystal but there is uncertainty in how much individual instances in different crystal unit cells within the lattice deviate from this mean.

  2) This is a best-effort description of the position of a ligand atom. However it is uncertain what fraction of the unit cells contain the ligand at this position, or at all.

  3) It is likely that this sidechain/loop/subunit is present in different conformations in different copies of the unit cell.

  4) The coordinates of this specific atom/residue/conformation are well supported by the data for this particular crystal. But it might be somewhere else in the next crystal from the same crystallization drop, or in a crystal from a different crystallization buffer, or at another temperature, or in solution, or in the presence of a ligand, etc.

best

Ethan

> I am not sure if it is possible to use MODEL-ENDMDL loops in pdb or mmcif format for storing multiple crystallographic models. I assume it is already possible to store multiple structure factor files (for refinement, for phasing, different crystals etc) under the same entry. In my mind, it would be a small step to associate different data sets distinguished by crystal ID or data block with a particular model number, but maybe it is not that simple.
>
> I do not want to create multiple pdb entries just to provide evidence for the robustness/reproducibility of crystals and crystallographic models. I would rather use different pdb entries for different sampling intentions: for example entry 1 contains all the control crystals, entry 2 contains all the crystals subjected to treatment A, etc. These would otherwise share identical data reduction and refinement protocols and most of the metadata. I am afraid I do not know how the PDB and associated services work internally, but I hope someone here can provide guidance.
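
The MODEL and EXPDTA records described in this message can be read by fixed-column slicing. Below is a minimal Python sketch, assuming the legacy PDB layout in which the record name occupies columns 1-6 and the EXPDTA technique text starts at column 11. The function name `summarize_pdb` and the three-model example text (atoms omitted) are made up for illustration and are not a real entry.

```python
def summarize_pdb(pdb_text):
    """Count MODEL records and collect EXPDTA text from legacy PDB-format text."""
    n_models = 0
    techniques = []
    for line in pdb_text.splitlines():
        record = line[:6]  # columns 1-6 hold the record name
        if record == "MODEL ":
            n_models += 1
        elif record == "EXPDTA":
            # Assumed layout: technique text begins at column 11 (index 10).
            techniques.append(line[10:].strip())
    return n_models, techniques


# Placeholder file content: one EXPDTA line and three empty MODEL sections.
example = "\n".join([
    "EXPDTA    X-RAY DIFFRACTION",
    "MODEL        1",
    "ENDMDL",
    "MODEL        2",
    "ENDMDL",
    "MODEL        3",
    "ENDMDL",
    "END",
])

n_models, techniques = summarize_pdb(example)
print(n_models, techniques)
```

Nothing in the record syntax ties the number of MODEL sections to the EXPDTA value, which matches the observation that multi-model files are not limited to NMR.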
Re: [ccp4bb] Meaning of a pdb entry
Dear Gergely,

you can concatenate (mm)CIF files without violating the grammar. Thus, if you want to deposit multiple models at once, just run 'cat 1.mmcif 2.mmcif 3.mmcif > allmy.mmcif' for deposition. This works for CIF, and is accepted e.g. by the IUCR journals.

Best,
Tim

On Mon, 31 May 2021 17:53:46 + Gergely Katona wrote:

> Dear Ethan,
>
> Thank you for your comments! I started a new thread, it was unfortunate that I brought this up in a discussion about B-factors. I really wanted to discuss something that is model agnostic and how to represent uncertainty by sampling. I consider an ensemble model with multiple partial occupancy molecules is still one model.
>
> I am not sure if it is possible to use MODEL-ENDMDL loops in pdb or mmcif format for storing multiple crystallographic models. I assume it is already possible to store multiple structure factor files (for refinement, for phasing, different crystals etc) under the same entry. In my mind, it would be a small step to associate different data sets distinguished by crystal ID or data block with a particular model number, but maybe it is not that simple.
>
> I do not want to create multiple pdb entries just to provide evidence for the robustness/reproducibility of crystals and crystallographic models. I would rather use different pdb entries for different sampling intentions: for example entry 1 contains all the control crystals, entry 2 contains all the crystals subjected to treatment A, etc. These would otherwise share identical data reduction and refinement protocols and most of the metadata. I am afraid I do not know how the PDB and associated services work internally, but I hope someone here can provide guidance.
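
The concatenation suggested in this message is syntactically harmless because every mmCIF file opens its own `data_` block, so the combined file is simply one document with several blocks. The Python sketch below illustrates that, and shows what a consumer that reads only the first block would see. It is a simplified line scan with made-up block names and cell values; a real workflow should rely on a proper CIF parser rather than this helper.

```python
def split_data_blocks(cif_text):
    """Split CIF text into (block_name, lines) pairs at each 'data_' header.

    Simplified scan: it ignores the corner case of a line starting with
    'data_' inside a multi-line text field, which a real parser handles.
    """
    blocks = []
    for line in cif_text.splitlines():
        if line.lower().startswith("data_"):
            blocks.append((line[5:].strip(), []))
        elif blocks:
            blocks[-1][1].append(line)
    return blocks


# Hypothetical stand-ins for three files (names and values invented).
files = [
    "data_crystal1\n_cell.length_a 50.0\n",
    "data_crystal2\n_cell.length_a 50.2\n",
    "data_crystal3\n_cell.length_a 49.9\n",
]

combined = "".join(files)  # same effect as `cat` on the three files
blocks = split_data_blocks(combined)
block_names = [name for name, _ in blocks]
first_block_only = blocks[0]  # all that a first-block-only reader would use

print(block_names)
print(first_block_only[0])
```

Whether the deposition system accepts such a multi-block coordinate file is a separate question that only the PDB can answer.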
[ccp4bb] Meaning of a pdb entry
Dear Ethan,

Thank you for your comments! I started a new thread, it was unfortunate that I brought this up in a discussion about B-factors. I really wanted to discuss something that is model agnostic and how to represent uncertainty by sampling. I consider an ensemble model with multiple partial occupancy molecules is still one model.

I am not sure if it is possible to use MODEL-ENDMDL loops in pdb or mmcif format for storing multiple crystallographic models. I assume it is already possible to store multiple structure factor files (for refinement, for phasing, different crystals etc) under the same entry. In my mind, it would be a small step to associate different data sets distinguished by crystal ID or data block with a particular model number, but maybe it is not that simple.

I do not want to create multiple pdb entries just to provide evidence for the robustness/reproducibility of crystals and crystallographic models. I would rather use different pdb entries for different sampling intentions: for example entry 1 contains all the control crystals, entry 2 contains all the crystals subjected to treatment A, etc. These would otherwise share identical data reduction and refinement protocols and most of the metadata. I am afraid I do not know how the PDB and associated services work internally, but I hope someone here can provide guidance.

Best wishes,

Gergely

Gergely Katona, Professor, Chairman of the Chemistry Program Council
Department of Chemistry and Molecular Biology, University of Gothenburg
Box 462, 40530 Göteborg, Sweden
Tel: +46-31-786-3959 / M: +46-70-912-3309 / Fax: +46-31-786-3910
Web: http://katonalab.eu, Email: gergely.kat...@gu.se

-Original Message-
From: CCP4 bulletin board On Behalf Of Ethan A Merritt
Sent: 29 May, 2021 19:16
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] AW: [ccp4bb] AW: [ccp4bb] (R)MS

On Saturday, 29 May 2021 02:12:16 PDT Gergely Katona wrote:

[...snip...]
I think the assumption of independent variations per atom is too strong in many cases and does not give an accurate picture of uncertainty.
[...snip...]

Gergely, you are revisiting a line of thought that historically led to the introduction of more global treatments of atomic displacement. These have distinct statistical and interpretational advantages.

Several approaches have been tried over the past 40 years or so. The one that has proved most successful is the use of TLS (Translation/Libration/Screw) models of bulk displacement to supplement or replace per-atom descriptions. As you say, a per-atom treatment is often too strong and is not statistically justified by the experimental data. I explored this with specific examples in

"To B or not to B?" [Acta Cryst. 2012, D68, 468-477]
http://skuld.bmsc.washington.edu/~tlsmd/references.html

An NMR-style approach that constructs and refines multiple discrete models has been re-invented several times. These treatments are generally called "ensemble models". IMHO they are statistically unjustified and strictly worse than treatments based on higher level descriptions such as TLS or normal-mode analysis. X-ray data is qualitatively different from NMR data, and optimal treatment of uncertainty must take this into account.

best regards

Ethan

> Hi,
>
> It is enough to have Ų as unit to express uncertainty in 3D, but one can express it with a single number only in a very specific case when the atom is isotropic. Few atoms have a naturally isotropic distribution around their mean position in very high resolution protein crystal structures. The anisotropic atoms can be described by a 3x3 matrix, where each row and column is associated with the uncertainty in a specific spatial direction. The matrix elements are the product of the uncertainty in these directions. The diagonal elements will be the square of uncertainty in the same direction and they should be always positive, the off-diagonal combinations of directions are covariances (+, 0 or -). In the end, every element will have a unit distance*distance and the matrix will be symmetric. We cannot just take the square root of the matrix elements and expect something meaningful, if for no other reason than the problem with negative covariances. To calculate the square root of the matrix itself one has to diagonalize it first. The height of a person in your example sounds easy to define, but the mathematical formalism will not decide that for me. I can also define height as the longest chord of a person or the maximum elevation of a car mechanic under a car. Through diagonalization one can at least extract some interesting, intuitive, principal directions. The final product, the sqrt(matrix), is not more intuitive to me. To convert it to something intuitive I would have to diagonalize the square-rooted matrix again. So shall we make an exception for the special, isotropic description? Or
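
The point in the quoted message about diagonalizing before taking a square root can be sketched in plain Python. The 3x3 matrix `U` below is an invented symmetric, positive-definite example in Ų with one negative covariance; it is not taken from any structure. The Jacobi rotation routine is just one convenient way to diagonalize a small symmetric matrix and is an assumption of this sketch, not a method named in the thread.

```python
import math


def matmul(x, y):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(x):
    return [[x[j][i] for j in range(3)] for i in range(3)]


def diagonalize(u, sweeps=30):
    """Jacobi diagonalization of a symmetric 3x3 matrix.

    Returns (eigenvalues, v) with u = v * diag(eigenvalues) * v^T; the
    columns of v are the principal directions.
    """
    a = [row[:] for row in u]
    v = [[float(i == j) for j in range(3)] for i in range(3)]
    for _ in range(sweeps):
        for p, q in ((0, 1), (0, 2), (1, 2)):
            if abs(a[p][q]) < 1e-15:
                continue
            diff = a[q][q] - a[p][p]
            # Angle that zeroes a[p][q]: tan(2*theta) = 2*a_pq / (a_qq - a_pp)
            theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
            c, s = math.cos(theta), math.sin(theta)
            j = [[float(i == k) for k in range(3)] for i in range(3)]
            j[p][p], j[q][q], j[p][q], j[q][p] = c, c, s, -s
            a = matmul(transpose(j), matmul(a, j))
            v = matmul(v, j)
    return [a[i][i] for i in range(3)], v


def matrix_sqrt(u):
    """Square root of a symmetric positive-definite matrix via its eigenvalues."""
    eigenvalues, v = diagonalize(u)
    d = [[math.sqrt(eigenvalues[i]) if i == j else 0.0 for j in range(3)]
         for i in range(3)]
    return matmul(v, matmul(d, transpose(v)))


# Hypothetical anisotropic displacement matrix in Å^2 (values invented).
U = [[0.04, -0.01, 0.00],
     [-0.01, 0.09, 0.02],
     [0.00, 0.02, 0.05]]

try:
    math.sqrt(U[0][1])  # element-wise square root of a negative covariance
except ValueError:
    print("element-wise square root is undefined for the negative covariance")

S = matrix_sqrt(U)  # elements of S carry units of Å
print(S)
```

Multiplying `S` by itself recovers `U`, which is the sense in which `S` is the square root of the matrix; the square roots of the eigenvalues are the displacement amplitudes along the principal directions.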