Re: [ccp4bb] a challenge
Hi James,

The datasets frac0.80.mtz to frac1.00.mtz are challenging to solve using SAD phasing. However, these datasets can be solved easily using other experimental phasing methods. Instead of the anomalous signal we can use the isomorphous signal alone, for example with the RIP or SIR phasing method, because the intensities differ between the datasets owing to the different scattering of S and Se. Since frac0.80.mtz contains 20% selenium, that is sufficient to solve the structure against frac1.00.mtz. It seems the structure can be solved with as little as 10% selenium content (frac0.90.mtz vs frac1.00.mtz), and the substructure can be solved easily.

This is not surprising: the pair of datasets is quite isomorphous. We phase all reflections (centric and non-centric), whereas with anomalous phasing we can phase non-centric reflections only. In fact, Single Isomorphous Replacement was the first phasing technique. The method has been extended by Ravelli et al., with some modification, through the introduction of X-ray or UV RIP phasing.

I tried the RIP (SIR) phasing protocol of Auto-Rickshaw using frac0.90.mtz as 'before' and frac1.00.mtz as 'after'. Auto-Rickshaw used SHELXC/D/E and ARP/wARP/REFMAC5 to get a partially refined model (Rfree below 30%).

Cheers
Santosh

Santosh Panjikar, Ph.D.
Scientist
Australian Synchrotron
800 Blackburn Road
Clayton VIC 3168
Australia
Ph: +61-4-67770851
Re: [ccp4bb] a challenge
Dear James,

James Holton wrote:
> I actually chose 3dko because it is a kinase (with a ligand), and therefore an interesting candidate for a molecular replacement score. I have not set this up yet, but I think the question is: if you look for PDB entries that contain the word "kinase" and try to molecular-replace all of them into the 3dko dataset, what fraction of them will work? I think that fraction would make a good score for a given molecular replacement pipeline.

At the recent CCP4 Study Weekend in Nottingham, Giovanna Scapin from Merck gave a talk on MR during which she reflected upon their attempts from some time ago to troubleshoot a recalcitrant MR case of a kinase by searching with hundreds of models derived from all kinase structures known at that time. However, I am not quite sure whether they published these results anywhere (at least I could not fish out a relevant reference).

Along these lines, 'Wide Search MR' (Stokes-Rees and Sliz (2010) PNAS 107: 21476-21481; www.sbgrid.org) may also provide some options for establishing such benchmarking or MR 'scores'.

Best regards
Savvas
Re: [ccp4bb] a challenge
Dear Santosh,

I think that it is a bit more complicated. SIR generally provides a stronger phasing signal than SAD, and can be better for phasing, provided that:

(a) the native and derivative are sufficiently isomorphous, AND
(b) the heavy atom substructure is itself chiral.

For some space groups one site is enough to generate a chiral substructure, but for others, e.g. P21, more than one site is necessary. Otherwise the first map will be a double image consisting of two overlapping positive images, and density modification will not in general be able to untangle them. SAD also gives a double image in such cases, but then instead of two positive images we have one negative and one positive image, and the simplest form of density modification (setting negative density to zero) will break the pseudosymmetry.

One can also break such pseudosymmetry by using SIRAS or RIPAS instead of SIR or RIP, even if the anomalous signal alone is not sufficient to phase the structure. If MAD doesn't work and one happens to have a native (Met) dataset as well as SeMet, one should always consider analyzing the data as SIRAS. Whether this is better than SAD on the SeMet data alone will depend primarily on how isomorphous the two datasets are.

Best wishes,
George
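George's double-image argument above can be illustrated with a toy one-dimensional calculation. This is only a sketch under a strong idealisation that is not stated in the thread: for a centrosymmetric substructure at the origin, SIR is assumed to recover only the real part of each structure factor and SAD only the imaginary part. The density values and the helper names `dft` and `inverse_dft` are made up for the illustration.

```python
import cmath
import math


def dft(rho):
    """Structure factors F[h] of a 1D density sampled on len(rho) grid points."""
    n = len(rho)
    return [sum(rho[x] * cmath.exp(2j * math.pi * h * x / n) for x in range(n))
            for h in range(n)]


def inverse_dft(coeffs):
    """Real-space map from map coefficients (real part, rounded to hide float noise)."""
    n = len(coeffs)
    values = []
    for x in range(n):
        total = sum(coeffs[h] * cmath.exp(-2j * math.pi * h * x / n) for h in range(n))
        values.append(round((total / n).real, 9) + 0.0)  # "+ 0.0" turns -0.0 into 0.0
    return values


# Toy asymmetric, non-negative "structure" (hypothetical values).
rho = [0.0, 3.0, 1.0, 2.0, 0.0, 0.0, 0.0, 0.0]
n = len(rho)
rho_inverted = [rho[(-x) % n] for x in range(n)]  # the structure inverted through the origin

F = dft(rho)

# Assumed SIR limit: only the real (cosine) part of each F is determined.
sir_map = inverse_dft([complex(f.real, 0.0) for f in F])
# Assumed SAD limit: only the imaginary (sine) part of each F is determined.
sad_map = inverse_dft([complex(0.0, f.imag) for f in F])
# Simplest density modification: set negative density to zero.
sad_flattened = [max(v, 0.0) for v in sad_map]

print("SIR map      :", sir_map)        # structure plus its inverse, both positive
print("SAD map      :", sad_map)        # structure minus its inverse
print("SAD flattened:", sad_flattened)  # half-weight copy of the structure alone
```

In this toy case the SIR-like map is the average of the structure and its inverse, so nothing distinguishes the two images, whereas the SAD-like map carries the inverse image with negative sign and truncating negative density removes it. The clean separation relies on the two images not overlapping in this example.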
Re: [ccp4bb] a challenge
Santosh,

Although I appreciate your ingenuity, and I agree that SIRAS is an excellent idea in the real world if you have only partial Se occupancy, I'm afraid I think it is cheating to use more than one of the challenge datasets at a time. The scenario I wanted to test is the all-too-common "we only had that one good crystal" situation.

Then again, I do think it is interesting to ask how low the Se incorporation can go before SIRAS fails, even under the idyllic perfect-isomorphism situation here. I have now put up 1% increments between frac0.90.mtz and frac1.00.mtz. Do you think you/Auto-Rickshaw can solve it with frac0.99.mtz vs frac1.00.mtz?

If you'd like to test in the presence of non-isomorphism, I'd recommend using the simulated radiation-damaged dataset here:
http://bl831.als.lbl.gov/~jamesh/workshop2/decaying.mtz
as the derivative. It is about 18% different from frac0.00.mtz (100% Se, but badly decayed).

Thanks for all the great ideas!

-James Holton
MAD Scientist
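James's reply above describes decaying.mtz as "about 18% different" from frac0.00.mtz without saying how that number was computed. One common way to quantify non-isomorphism is the fractional isomorphous difference, sum(| |F_deriv| - |F_native| |) / sum(|F_native|) over common reflections. The sketch below assumes that definition; the function name and the amplitudes are hypothetical, and reading real MTZ files and scaling the two datasets together are left out.

```python
def fractional_iso_difference(f_native, f_deriv):
    """Fractional isomorphous difference over reflections present in both sets.

    Both arguments map Miller indices (h, k, l) to structure-factor amplitudes
    that are assumed to be on a common scale already.
    """
    common = f_native.keys() & f_deriv.keys()
    if not common:
        raise ValueError("no common reflections")
    numerator = sum(abs(f_deriv[hkl] - f_native[hkl]) for hkl in common)
    denominator = sum(f_native[hkl] for hkl in common)
    return numerator / denominator


# Made-up amplitudes, for illustration only.
native = {(1, 0, 0): 100.0, (0, 1, 0): 50.0, (0, 0, 1): 50.0}
deriv = {(1, 0, 0): 110.0, (0, 1, 0): 40.0, (0, 0, 1): 55.0, (1, 1, 1): 20.0}

print(fractional_iso_difference(native, deriv))  # 0.125, i.e. 12.5% different
```

Reflections measured in only one dataset, such as (1, 1, 1) here, are ignored.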
Re: [ccp4bb] a challenge
I am absolutely delighted at the response I have gotten to my little John Henry Challenge! Three people already have managed to do the impossible. Congratulations to George Sheldrick, Pavol Skubak and Raj Pannu for finding ways to improve the phases over the ones I originally obtained (using the default settings of mlphare and dm) and build their way out of it. This is quite useful information! At least it is to me. Nevertheless, I do think Frances Reyes has a point. This was meant to be a map interpretation challenge, and not a SAD-phasing challenge. I appreciate that the two are linked, but the reason I did not initially provide the anomalous data is because I thought it would be too much to ask people to re-do all the phasing, etc. Yes, there do appear to be ways to improve the maps beyond the particular way I phased them, but no matter how good your phasing program is, there will always be a level of anomalous signal that will lead to phases that are off enough to make building the model impossible. Basically, once the map gets bad enough that just as many wrong atoms get built in as right atoms, then there is no escape. However, I think human beings should still have an advantage when it comes to pattern recognition, and I remain curious to see if an insightful crystallographer can tip that balance in the right direction. I am also still curious to see if tweaking some setting on some automated building program will do that too. So, my original question remains: are automated building programs better than humans? Any human? I therefore declare the John Henry Challenge still open. But yes, improving the phases can tip the balance too, and the accuracy of the anomalous differences will ultimately affect the accuracy of the phases, and so on. This is a much broader challenge. And I think the best way to frame it is with the question: How low can the anomalous signal be before any conceivable approach fails? 
and perhaps: What is the best procedure to use for weak anomalous signal?

For those who are interested in joining George, Pavol, Raj and others in this new challenge, the full spectrum of difficulty from trivial (100% Se incorporation) to a complete waste of time (0% Se, 100% S) is here:
http://bl831.als.lbl.gov/~jamesh/challenge/occ_scan/

The impossible.mtz for the John Henry Map Interpretation Challenge was derived from frac0.79.mtz and possible.mtz from frac0.78.mtz. These simulated 21% and 22% Se incorporation into Met side chains (respectively). It has now been shown that both of these can be solved automatically if you do the phasing right. But what about frac0.80.mtz? Or frac0.90.mtz?

At least on this one coordinate of Se incorporation, the prowess of a particular approach can be given a score. For example, a score of 0.78 means that the indicated procedure could solve the frac0.78.mtz dataset, but not the frac0.79.mtz dataset. Based on the reports I have gotten back so far, the difficulty score lineup is:

score  method
0.86   xds, xscale, right sites, crank2 (Pavol Skubak)
0.78   xds, xscale, right sites, mlphare, dm, phenix.autobuild using 20 models (James Holton)
0.75   xds, xscale, right sites, mlphare, dm, buccaneer/refmac/dm (James Holton)
0.71   xds, xscale, right sites, mlphare, dm, ARP/wARP 7.3 (James Holton)
0.51   xds, xscale, right sites, mlphare, dm, ARP/wARP 6.1.1 (James Holton)

Note that all of these attempts cheated on the sites. Finding the sites seems to be harder than solving the structure once you've got them.
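The scoring rule described above can be written down in a few lines. This is a minimal sketch, not James's actual bookkeeping: the function names are hypothetical, the example outcomes are invented, and the Se percentage assumes that the frac value in each file name is the sulfur fraction (so frac1.00 is 100% S and frac0.00 is 100% Se).

```python
from typing import Dict, Optional


def se_percent(frac: float) -> float:
    """Percent Se incorporation, assuming frac is the sulfur fraction."""
    return round((1.0 - frac) * 100.0, 6)


def difficulty_score(results: Dict[float, bool]) -> Optional[float]:
    """Hardest dataset solved before the first failure.

    `results` maps a dataset's frac value to True (solved) or False (not solved).
    Datasets are walked from easiest (lowest frac) to hardest (highest frac),
    and the walk stops at the first one that could not be solved.
    """
    score = None
    for frac in sorted(results):
        if not results[frac]:
            break
        score = frac
    return score


# Invented outcomes for one hypothetical pipeline.
outcomes = {0.76: True, 0.77: True, 0.78: True, 0.79: False, 0.80: False}

score = difficulty_score(outcomes)
print(score, se_percent(score))  # 0.78 22.0
```

Stopping at the first failure matches the "could solve frac0.78.mtz, but not frac0.79.mtz" wording; a lucky success on a harder dataset after a failure would not raise the score under this reading.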
The corresponding lineup for finding the sites is:

score  method
0.82   cheating: xds, xscale, right phases, anomalous difference Fourier (James Holton)
0.79   xds, xscale, shelxc/d/e 3.5A NTRY=1 (George Sheldrick)
0.74   xds, autorickshaw (Santosh Panjikar)
0.65   xds, xscale, phenix.hyss --search=full (James Holton)
0.60   xds, xscale, shelxc/d with NTRY=100 (James Holton)

where again the score is the dataset where the heavy atom site constellation found is close enough to the right one to move forward. This transition, like the model-building one, is remarkably sharp, particularly if you let each step run for a lot of cycles. The graph for model-building is here:
http://bl831.als.lbl.gov/~jamesh/challenge/build_CC_vs_frac.png

Note how the final map quality is pretty much independent of the initial map quality, up to the point where it all goes wrong. I think this again is an example of the solution needing to be at least half right before it can be improved. But perhaps someone can prove me wrong on that one?

For those who want the unmerged data, I have all the XDS_ASCII.HKL files here:
http://bl831.als.lbl.gov/~jamesh/challenge/occ_scan/XDS_ASCII.tgz

If you'd like to go all the way back to the images, you can get them from here:
http://bl831.als.lbl.gov/~jamesh/workshop2/
The badsignal dataset is what produced frac1.00.mtz, and goodsignal produced frac0.00.mtz. You can generate anything in
Re: [ccp4bb] a challenge
On Jan 14, 2013, at 3:13, James Holton jmhol...@lbl.gov wrote:
> What is the best procedure to use for weak anomalous signal?

That opens up a can of worms, which I'm happy to jump into. We had very good success in the years 2003-2009 using SHELX for finding sites (sometimes more than 1 trials) and then force-feeding them to SHARP for phase improvement. We should also say that most of the time, particularly in the more difficult cases, XDS made the difference in detectable anomalous signal. And no, we still have not published this. By "we" I mean Marc Robien and myself during our SGPP times.

Jürgen

..
Jürgen Bosch
Johns Hopkins Bloomberg School of Public Health
Department of Biochemistry & Molecular Biology
Johns Hopkins Malaria Research Institute
615 North Wolfe Street, W8708
Baltimore, MD 21205
Phone: +1-410-614-4742
Lab: +1-410-614-4894
Fax: +1-410-955-3655
http://lupo.jhsph.edu
Re: [ccp4bb] a challenge
Hello James and all other contributors,

I admit to not having read all contributions to this thread. I understand the John Henry Challenge as asking whether there is an 'automated way of producing a model from impossible.mtz'. From looking at it, and without having gone all the way to a PDB file, my feeling is that one could build a model without too much effort using the baton mode in e.g. Coot.

I guess this is not what you (and this thread) mean by 'automated', which leaves the impression that crystallographers have become quite spoiled children, for this notion undervalues how much effort and ingenuity the authors of programs like Coot, O, MIFit, FRODO, etc. have invested. Compared with how models were prepared before these algorithms had been implemented, there is a lot of automation even in looking at the skeleton of a map!

Cheers,
Tim
Re: [ccp4bb] a challenge
On Mon, Jan 14, 2013 at 11:18 AM, Tim Gruene t...@shelx.uni-ac.gwdg.de wrote:
> I admit not having read all contributions to this thread. I understand the John Henry Challenge as whether there is an 'automated way of producing a model from impossible.mtz'. From looking at it and without having gone all the way to a PDB-file my feeling is one could without too much effort from the baton mode in e.g. coot.

This should be even more possible if one also uses existing knowledge about the expected structure of the protein: a kinase domain is quite distinctive. So, James, how much external information from homologous structures are we allowed to use? Running Phaser would certainly be cheating, but if I take (for instance) a 25% identical kinase structure, manually align it to the map and/or a partial model, and use that as a guide to manually rebuild the target model, does that meet the terms of the challenge?

-Nat
Re: [ccp4bb] a challenge
I actually chose 3dko because it is a kinase (with a ligand), and therefore an interesting candidate for a molecular replacement score. I have not set this up yet, but I think the question is: if you look for PDB entries that contain the word "kinase" and try to molecular-replace all of them into the 3dko dataset, what fraction of them will work? I think that fraction would make a good score for a given molecular replacement pipeline.

But if you want to bootstrap S-SAD phasing with a homolog, then I'd say it's definitely cheating if you use a homolog close enough to build your way out of the resulting density without any anomalous information at all. Perhaps the fairest way to do this would be to make a 2-dimensional score: the frac of the dataset you used, plus the BLAST2 E-value of the model you started with vs the 3dko sequence?

-James Holton
MAD Scientist
Re: [ccp4bb] a challenge
I have now looked at James's two challenges to see what I could learn from them, and will try to give enough details so that less experienced readers of this list can repeat what I did and apply the experience thereby gained to solving their own structures.

For those who are not interested in the details, the bottom line is that SHELXC/D/E can solve both 'possible' and 'impossible' almost routinely, starting by finding the substructure, without using any information derived from the known structure. It should be emphasised that this does not produce a fully refined structure, but the resulting poly-Ala trace of about 70% of the structure and 'free lunch' maps showing many side-chains would be a good starting point for programs (such as Buccaneer or wARP) that dock a known sequence and complete the structure. My students would of course be expected to complete the map interpretation themselves using the excellent facilities available in Coot; that is always very educational!

I used the current SHELX beta-test programs that will shortly be released as the official versions. First I used Tim Gruene's mtz2sca to convert James's mtz files into a format that SHELX can read, and then ran SHELXC from the command line to make the files possible.hkl (native intensity data), possible_fa.hkl (h k l FA and phase shift alpha) and possible_fa.ins (input file to run SHELXD), and the same for 'impossible'. Alternatively I could have used Thomas Schneider's hkl2map GUI to call SHELXC/D/E.

I looked at the d/sig row to see where to cut the resolution for finding the heavy atoms and decided on 3.5A (SHEL 999 3.5). If I had been able to input unmerged data to SHELXC, e.g. as XDS_ASCII.HKL which is always unmerged, I would also have obtained a CC1/2 value that would also indicate where to cut the resolution. 3.5A corresponded to d/sig of about 1.0, which is still rather low, but cutting at even lower resolution tends to give less accurate substructures.
To compensate for this optimistic choice for the rather weak anomalous data, I increased the number of trials (NTRY) to 1. These are the two most critical parameters for SHELXD, and as it turns out, for the whole structure solution. However before running the multi-CPU version of SHELXD, since the PDB file of the refined structure was available, I ran AnoDe to use the PDB file and anomalous data in possible_fa.hkl to check the substructure. This told me that for both 'possible' and 'impossible' it should be possible to find 12 well-defined sites, and also that the original impossible.mtz was inconsistently indexed. AnoDe also outputs a list of heavy atoms in SHELX format that can be input directly into SHELXE for density modification and tracing. However that would be cheating because AnoDe reads the final PDB file to calculate the anomalous density, and I was trying to solve the structure without assuming the answer, even indirectly. In general a substructure calculated in this way by AnoDe is always much more accurate and complete than one found ab initio from the anomalous data. The best SHELXD solutions had CC 34.6 and CCweak 15.0 for 'possible' and 28.4/13.2 for 'impossible'. I always tell people to aim for at least 30/15, so maybe I should have done more than 1 tries for 'impossible' but my wife was getting impatient (I had promised her that we could go for a walk in the snow) so I accepted it. I looked at the peaklist from SHELXD pretending not to know that there should be 12 sites. There was a bit of a gap in peak height (0.53/0.42) between peaks 11 and 12 for 'possible' and (0.53/0.45) between peaks 10 and 11 for 'impossible', so for SHELXE I used -h11 and -h10 respectively. However I also used the new -z option that refines the substructure before starting on the phasing, and as it turns out that increased the number of heavy atoms to 12 in both cases, and as it happens all 12 were correct in both cases.
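For readers who want to script the two checks described above (the 30/15 rule of thumb for CC/CCweak and picking the number of sites from the gap in the peak list), here is a minimal Python sketch. It is not part of SHELXD: the function names are made up for illustration, and the peak list is hypothetical except for the 0.53/0.42 pair at peaks 11 and 12, which is taken from the message. "Largest drop between consecutive peaks" is only one possible way to formalise "a bit of a gap".

```python
def sites_before_largest_gap(peak_heights):
    """Count the peaks that come before the largest drop in a sorted peak list.

    peak_heights must be sorted from highest to lowest, as in a SHELXD
    peak list. This is a heuristic sketch, not what SHELXD or SHELXE do.
    """
    if len(peak_heights) < 2:
        return len(peak_heights)
    drops = [
        peak_heights[i] - peak_heights[i + 1]
        for i in range(len(peak_heights) - 1)
    ]
    # index of the largest drop, converted to a count of peaks kept
    return drops.index(max(drops)) + 1


def meets_rule_of_thumb(cc, cc_weak, cc_min=30.0, cc_weak_min=15.0):
    """Apply the 'aim for at least 30/15' rule quoted in the message."""
    return cc >= cc_min and cc_weak >= cc_weak_min


# HYPOTHETICAL peak heights; only 0.53 (peak 11) and 0.42 (peak 12)
# come from the message, the rest are invented for the example.
hypothetical_peaks = [
    1.00, 0.95, 0.91, 0.88, 0.84, 0.79, 0.74,
    0.69, 0.63, 0.58, 0.53, 0.42, 0.39, 0.36,
]

if __name__ == "__main__":
    print("sites to pass to -h:", sites_before_largest_gap(hypothetical_peaks))
    print("'possible' passes 30/15:", meets_rule_of_thumb(34.6, 15.0))
    print("'impossible' passes 30/15:", meets_rule_of_thumb(28.4, 13.2))
```

With the invented list the gap falls after peak 11, matching the -h11 choice for 'possible'; on a real peak list the gap may be much less clear-cut, so the number is worth checking by eye.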
I started shelxe with: shelxe possible possible_fa -s0.55 -a30 -h11 -z -q -e1 and similarly for 'impossible'. I was expecting problems so I did 30 cycles of autotracing; normally 3 would be enough. I just guessed the solvent content (-s0.55); maybe that could be fine-tuned. For SHELXE, there is a remarkably consistent rule that if the CC for the trace against the native data gets above 25%, the structure is solved. For 'possible' this happened after 25 tracing cycles, and the final 'free lunch' map (-e1) was indeed convincing. However 'impossible' only reached a CC of 17%, and although the map did not look completely wrong, I would not have been able to interpret it. So I changed one default parameter (-m30), increasing the number of density modification cycles to compensate for the poor starting phases, and ran the job again. CC reached 25% after 16 cycles and produced an excellent map and trace. Almost certainly, 'possible' would also benefit from the change, but it was solved anyway. As Tom
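The SHELXE runs described in this message can be written down as a small Python helper that assembles the command line and applies the 25% CC rule. This is only a sketch: the flags (-s, -a, -h, -z, -q, -e1, -m) and their values are the ones quoted in the message, but the helper names are hypothetical, the file names impossible / impossible_fa are inferred from "similarly for 'impossible'", and placing -m30 at the end of the line is an assumption.

```python
import shlex


def shelxe_command(native, substructure, solvent=0.55, trace_cycles=30,
                   n_heavy=11, dm_cycles=None):
    """Build an argument list like the shelxe command quoted in the message.

    dm_cycles=None leaves the number of density modification cycles at
    the program default; dm_cycles=30 adds the -m30 used for 'impossible'.
    """
    args = [
        "shelxe", native, substructure,
        f"-s{solvent}",       # guessed solvent content
        f"-a{trace_cycles}",  # autotracing cycles
        f"-h{n_heavy}",       # number of heavy atoms taken from SHELXD
        "-z",                 # refine the substructure before phasing
        "-q",
        "-e1",                # 'free lunch' map
    ]
    if dm_cycles is not None:
        args.append(f"-m{dm_cycles}")  # assumed position on the line
    return args


def trace_looks_solved(cc_percent, threshold=25.0):
    """Rule from the message: trace CC against native data above 25%."""
    return cc_percent > threshold


if __name__ == "__main__":
    print(shlex.join(shelxe_command("possible", "possible_fa")))
    print(shlex.join(shelxe_command("impossible", "impossible_fa",
                                    n_heavy=10, dm_cycles=30)))
    # To actually run it (requires shelxe on the PATH):
    #   import subprocess
    #   subprocess.run(shelxe_command("possible", "possible_fa"), check=True)
```

Keeping the command as a list rather than a single string avoids shell quoting problems if it is later passed to subprocess.run.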
Re: [ccp4bb] a challenge
Ok, I'll bite. "I dare anyone who considers themself an expert macromolecular crystallographer to find a way to build out of this map." I put emphasis on "this map". "Short of actually cheating (see below), there doesn't seem to be any automated way to arrive at a solved structure from these phases." I put emphasis on "these phases". I think the real challenge (and one that makes for an excellent macromolecular crystallographer) is how well one can interpret a map with poor phases. That being said, I think a recalculation of the map using any other information besides the map itself should not be allowed. PS. I'd like to see what the pre-DM phases look like. There's a huge chunk of the protein that is completely flattened out in impossible.mtz. F (In reply to James Holton, Jan 12, 2013, 1:50 PM.)
Re: [ccp4bb] a challenge
"I think the real challenge (and one that makes for an excellent macromolecular crystallographer) is how well one can interpret a map with poor phases." Let me disagree ... An excellent macromolecular crystallographer is one who, given some crystals, can derive the best strategy to collect data, process the data optimally, derive phases using all available information, build a model and refine it in such a way that it best explains both data and geometrical expectations, and do these as efficiently as possible. Efficiency may suggest using one automated suite or another - or indeed may best be achieved by manual labor - be it in the map or in data collection strategy or refinement or another step: and here I am ignoring the art of transforming hair-needle-crystalline-like-dingbits to a diffracting crystal. One who can interpret a map with poor phases can be either a genius in 3d orientation - or a not necessarily too intelligent nor experienced but determined student who can drink and breathe this map for a few weeks in a row until a solution is in place. Neither would make an excellent macromolecular crystallographer by necessity. Tassos
Re: [ccp4bb] a challenge
I agree with Tassos, and btw think that this crystallographer should be able to go back into the lab and optimize the present crystal conditions to get better crystals, in particular when he or she realizes that the scientific question they set out to investigate cannot be answered by analyzing the final structure with the available data quality. Preben (In reply to Anastassis Perrakis, 1/13/13 8:52 PM.)
Re: [ccp4bb] a challenge
Since the discussion for crystallographers is fired up, I want to put on record that I totally agree with Tassos about the profile of a crystallographer. If you take away the crystals, then a crystallographer is no longer a crystallographer. Demetres (In reply to Anastassis Perrakis, 13/1/2013 9:52 PM.) -- Dr. Demetres D. Leonidas Associate Professor of Biochemistry Department of Biochemistry Biotechnology University of Thessaly 26 Ploutonos Str. 41221 Larissa, Greece Tel. +302410 565278 Tel. +302410 565297 (Lab) Fax. +302410 565290 E-mail: ddleoni...@bio.uth.gr http://www.bio.uth.gr
Re: [ccp4bb] a challenge
Dear James, your challenge in its current form ignores an important source of information for model building that is available for your simulated data - namely, it does not allow the use of anomalous phase information in the model building. In difficult cases on the edge of success such as this one, this typically makes the difference between building and not building. If you can make the F+/F- and Se substructure available, we can test whether this is indeed the case. However, while I expect this would push the challenge further significantly, most likely you would be able to decrease the Se incorporation of your simulated data further to such levels that the anomalous signal is again no longer sufficient to build the structure. And most likely, there would again exist an edge where a small decrease in the Se incorporation would lead from a model built to no model built. Best regards, -- Pavol Skubak Biophysical Structural Chemistry Gorlaeus Laboratories Einsteinweg 55 Leiden University LEIDEN 2333CC the Netherlands tel: 0031715274414 web: http://bsc.lic.leidenuniv.nl/people/skubak-0
Re: [ccp4bb] a challenge
Dear James, I agree with Pavol that your example is not very realistic. In practice one would start from the heavy atom positions. As well as providing starting phases, they are useful in other ways. For example, shelxe (and probably most other tracing programs) adds them to a 'no-go' map so it knows where NOT to trace the main-chain. Best wishes, George -- Prof. George M. Sheldrick FRS Dept. Structural Chemistry, University of Goettingen, Tammannstr. 4, D37077 Goettingen, Germany Tel. +49-551-39-3021 or -3068 Fax. +49-551-39-22582
Re: [ccp4bb] a challenge
Fair enough! I have just now added DANO and I(+)/I(-) to the files. I'll be very interested to see what you can come up with! For the record, the phases therein came from running mlphare with default parameters but exactly the correct heavy-atom constellation (all the sulfur atoms in 3dko), and then running dm with default parameters. Yes, there are other ways to run mlphare and dm that give better phases, but I was only able to determine those parameters by cheating (comparing the resulting map to the right answer), so I don't think it is fair to use those maps. I have had a few questions about what is cheating and what is not cheating. I don't have a problem with the use of sequence information because that actually is something that you realistically would know about your protein when you sat down to collect data. The sequence of this molecule is that of 3dko: http://bl831.als.lbl.gov/~jamesh/challenge/seq.pir I also don't have a problem with anyone actually using an automation program to _help_ them solve the impossible dataset as long as they can explain what they did. Simply putting the above sequence into BALBES would, of course, be cheating! I suppose one could try eliminating 3dko and its homologs from the BALBES search, but that, in and of itself, is perhaps relevant to the challenge: what is the most distant homolog that still allows you to solve the structure? That, I think, is also a stringent test of model-building skill. I have already tried ARP/wARP, phenix.autobuild and buccaneer/refmac. With default parameters, all of these programs fail on both the possible and impossible datasets. It was only with some substantial tweaking that I found a way to get phenix.autobuild to crack the possible dataset (using 20 models in parallel). I have not yet found a way to get any automation program to build its way out of the impossible dataset. Personally, I think that the breakthrough might be something like what Tom Terwilliger mentioned. If you build a good enough starting set of atoms, then I think an automation program should be able to take you the rest of the way. If that is the case, then it means people like Tom who develop such programs for us might be able to use that insight to improve the software, and that is something that will benefit all of us. Or, it is entirely possible that I'm just not running the current software properly! If so, I'd love it if someone who knows better (such as their developers) could enlighten me. -James Holton MAD Scientist (In reply to Pavol Skubak, 1/12/2013 3:07 AM.)
Re: [ccp4bb] a challenge
Fair enough! The heavy atom positions are simply the S atoms in 3dko. There are 22 of them. Also, in this case the Met side chains (12 of those) are 32% occupied with Se. The other 68% is sulfur. I think it is realistic that one could know the extent of Se incorporation ahead of time from something like mass spec (especially if you knew it could make-or-break your structure determination). However, I don't think it is realistic that you would know where they are before running shelx. -James Holton MAD Scientist (In reply to George Sheldrick, 1/12/2013 7:46 AM.)
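As a small worked example of the 32% Se / 68% S occupancy quoted in this message, the following Python sketch computes the occupancy-weighted electron count at a mixed Met site. It only counts electrons (Z = 16 for S, Z = 34 for Se); it says nothing about the anomalous terms f' and f'', which depend on the wavelength and are not given here. The function name is made up for illustration.

```python
# Atomic numbers, i.e. electrons per neutral atom
ELECTRONS = {"S": 16, "Se": 34}


def mean_electrons_at_site(se_fraction):
    """Occupancy-weighted electron count for a site shared by Se and S."""
    if not 0.0 <= se_fraction <= 1.0:
        raise ValueError("se_fraction must be between 0 and 1")
    return (se_fraction * ELECTRONS["Se"]
            + (1.0 - se_fraction) * ELECTRONS["S"])


if __name__ == "__main__":
    mixed = mean_electrons_at_site(0.32)
    print(f"mixed Met site: {mixed:.2f} electrons")
    print(f"excess over a pure S site: {mixed - ELECTRONS['S']:.2f} electrons")
```

At 32% Se the site averages 0.32 x 34 + 0.68 x 16 = 21.76 electrons, that is 5.76 more than a pure sulfur site.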
Re: [ccp4bb] a challenge
Woops! sorry folks. I made a mistake with the I(+)/I(-) entry. They had the wrong axis convention relative to 3dko and the F in the same file. Sorry about that. The files on the website now should be right. http://bl831.als.lbl.gov/~jamesh/challenge/possible.mtz http://bl831.als.lbl.gov/~jamesh/challenge/impossible.mtz md5 sums: c4bdb32a08c884884229e8080228d166 impossible.mtz caf05437132841b595be1c0dc1151123 possible.mtz -James Holton MAD Scientist (In reply to James Holton, 1/12/2013 8:25 AM.)
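Since two versions of the files were on the website within a few hours, it is worth checking a download against the md5 sums posted with the corrected files. A minimal Python sketch using only the standard library follows; the two checksums are copied from the message, while the function names and the assumption that the files sit in the current directory are mine.

```python
import hashlib

# md5 sums as posted with the corrected files
EXPECTED_MD5 = {
    "impossible.mtz": "c4bdb32a08c884884229e8080228d166",
    "possible.mtz": "caf05437132841b595be1c0dc1151123",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """True if the file's md5 equals the expected value (case-insensitive)."""
    return md5_of_file(path) == expected_hex.lower()


if __name__ == "__main__":
    # Assumes both files were downloaded into the current directory.
    for name, expected in EXPECTED_MD5.items():
        try:
            status = "OK" if matches_expected(name, expected) else "MISMATCH"
        except FileNotFoundError:
            status = "not found"
        print(f"{name}: {status}")
```

A mismatch most likely means an earlier copy of the file (before the axis convention was corrected) or an incomplete download.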
Re: [ccp4bb] a challenge
James, I had in fact just come to the conclusion that the indexing was consistent with 3dko for 'possible' but not for 'impossible', which I suppose was logical. George -- Prof. George M. Sheldrick FRS Dept. Structural Chemistry, University of Goettingen, Tammannstr. 4, D37077 Goettingen, Germany Tel. +49-551-39-3021 or -3068 Fax. +49-551-39-22582
Re: [ccp4bb] a challenge
I admit that made impossible more difficult to solve than possible, but not in the way I had intended! Again, sorry about that. It is corrected now. The change in indexing arises because I am processing the simulated images with a default run of XDS and as you know the autoindexing picks an indexing convention at random. I flipped it back at the time, but when I just now went back to get the I(+)/I(-) I went just one step too far. Once again, sorry. It was not my intention to waste anyone's time! -James Holton MAD Scientist On 1/12/2013 2:09 PM, George Sheldrick wrote: James, I had in fact just come to the conclusion that the indexing was consistent with 3dko for 'possible' but not for 'impossible', which I suppose was logical. George Woops! sorry folks. I made a mistake with the I(+)/I(-) entry. They had the wrong axis convention relative to 3dko and the F in the same file. Sorry about that. The files on the website now should be right. http://bl831.als.lbl.gov/~jamesh/challenge/possible.mtz http://bl831.als.lbl.gov/~jamesh/challenge/impossible.mtz md5 sums: c4bdb32a08c884884229e8080228d166 impossible.mtz caf05437132841b595be1c0dc1151123 possible.mtz -James Holton MAD Scientist On 1/12/2013 8:25 AM, James Holton wrote: Fair enough! I have just now added DANO and I(+)/I(-) to the files. I'll be very interested to see what you can come up with! For the record, the phases therein came from running mlphare with default parameters but exactly the correct heavy-atom constellation (all the sulfur atoms in 3dko), and then running dm with default parameters. Yes, there are other ways to run mlphare and dm that give better phases, but I was only able to determine those parameters by cheating (comparing the resulting map to the right answer), so I don't think it is fair to use those maps. I have had a few questions about what is cheating and what is not cheating. 
I don't have a problem with the use of sequence information because that actually is something that you realistically would know about your protein when you sat down to collect data. The sequence of this molecule is that of 3dko: http://bl831.als.lbl.gov/~jamesh/challenge/seq.pir I also don't have a problem with anyone actually using an automation program to _help_ them solve the impossible dataset as long as they can explain what they did. Simply putting the above sequence into BALBES would, of course, be cheating! I suppose one could try eliminating 3dko and its homologs from the BALBES search, but that, in and of itself, is perhaps relevant to the challenge: what is the most distance homolog that still allows you to solve the structure?. That, I think, is also a stringent test of model-building skill. I have already tried ARP/wARP, phenix.autobuild and buccaneer/refmac. With default parameters, all of these programs fail on both the possible and impossible datasets. It was only with some substantial tweaking that I found a way to get phenix.autobuild to crack the possible dataset (using 20 models in parallel). I have not yet found a way to get any automation program to build its way out of the impossible dataset. Personally, I think that the breakthrough might be something like what Tom Terwilliger mentioned. If you build a good enough starting set of atoms, then I think an automation program should be able to take you the rest of the way. If that is the case, then it means people like Tom who develop such programs for us might be able to use that insight to improve the software, and that is something that will benefit all of us. Or, it is entirely possible that I'm just not running the current software properly! If so, I'd love it if someone who knows better (such as their developers) could enlighten me. 
-James Holton
MAD Scientist

On 1/12/2013 3:07 AM, Pavol Skubak wrote:

Dear James,

your challenge in its current form ignores an important source of information for model building that is available for your simulated data: it does not allow anomalous phase information to be used in the model building. In difficult cases on the edge of success, such as this one, this typically makes the difference between building and not building. If you can make the F+/F- and Se substructure available, we can test whether this is indeed the case.

However, while I expect this would push the challenge significantly further, you would most likely be able to decrease the Se incorporation of your simulated data to a level where the anomalous signal is again no longer sufficient to build the structure. And most likely there would again be an edge where a small decrease in the Se incorporation would lead from a model built to no model built.

Best regards,

--
Pavol Skubak
Biophysical Structural Chemistry
Gorlaeus Laboratories
Einsteinweg 55
Leiden University
LEIDEN 2333CC
the Netherlands
tel: 0031715274414
web:
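Since the files on the website were replaced once already, it may be worth checking a download against the md5 sums James posted before spending time on it. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes the two .mtz files have already been saved under their original names in the working directory.

```python
import hashlib

# md5 sums as posted in this thread for the corrected files.
EXPECTED_MD5 = {
    "impossible.mtz": "c4bdb32a08c884884229e8080228d166",
    "possible.mtz": "caf05437132841b595be1c0dc1151123",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_download(path, expected):
    """True if the file at `path` has the expected md5 sum."""
    return md5_of_file(path) == expected.lower()
```

For example, `check_download("impossible.mtz", EXPECTED_MD5["impossible.mtz"])` should return True for the corrected file; a False result would suggest a stale copy from before the indexing fix, or a damaged download.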
Re: [ccp4bb] a challenge
I can build from the impossible.mtz data in the following two steps:

1. getting the Se substructure from an anomalous difference map constructed from impossible.mtz
2. running combined model building using the substructure from step 1 and starting from the impossible.mtz map

Only impossible.mtz and the sequence (which is probably not really necessary) are used in this solution. It is not a fully automatic solution: step 2 (model building combined with density modification and phasing via a recently developed multivariate SAD function) was performed automatically using CRANK (which calls Buccaneer, REFMAC and Parrot), while step 1 was done manually using CCP4 tools (cfft and peakmax).

Compared to the deposited model, 96% of the mainchain is (correctly) built and 92% is (correctly) docked, and the R factor is 21%. Clearly, the (relatively) weak anomalous signal is the only limitation in this case. However, the model building procedure did not struggle too much: I expect it would still work if the Se incorporation were decreased somewhat further (as long as the substructure can be obtained in some way).

Of course, this is not a pure solution in the sense that I started from impossible.mtz rather than from scratch, i.e. from the data only. Obtaining the substructure from scratch might be more difficult.

Pavol

On Sat, Jan 12, 2013 at 10:50 PM, James Holton jmhol...@lbl.gov wrote:

Whoops! Sorry folks. I made a mistake with the I(+)/I(-) entry. They had the wrong axis convention relative to 3dko and the F in the same file. Sorry about that. The files on the website should now be right.

http://bl831.als.lbl.gov/~jamesh/challenge/possible.mtz
http://bl831.als.lbl.gov/~jamesh/challenge/impossible.mtz

md5 sums:
c4bdb32a08c884884229e8080228d166  impossible.mtz
caf05437132841b595be1c0dc1151123  possible.mtz

-James Holton
MAD Scientist
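Step 1 of Pavol's procedure relies on an anomalous difference map. As a rough illustration only, and not a description of what cfft does internally, the conventional Fourier term for such a map combines the anomalous difference DANO = F(+) - F(-) with the current phase shifted by -90 degrees. The function name and the example numbers below are hypothetical; real use would read F(+), F(-) and the phases from the MTZ file and hand the result to a map calculation program.

```python
def anomalous_map_term(f_plus, f_minus, phi_deg):
    """Amplitude and phase (degrees) of one anomalous difference map term.

    Assumes the conventional recipe: amplitude F(+) - F(-), phase phi - 90.
    A negative difference is expressed as a positive amplitude with the
    phase shifted by a further 180 degrees.
    """
    dano = f_plus - f_minus
    phase = phi_deg - 90.0
    if dano < 0.0:
        dano = -dano
        phase += 180.0
    return dano, phase % 360.0


# Hypothetical reflections: (F(+), F(-), current phase in degrees).
toy_reflections = [
    (105.0, 100.0, 120.0),
    (100.0, 105.0, 120.0),
    (50.0, 50.0, 45.0),
]

for f_plus, f_minus, phi in toy_reflections:
    amplitude, phase = anomalous_map_term(f_plus, f_minus, phi)
    print(f"F+={f_plus} F-={f_minus} phi={phi} -> |dF|={amplitude} phase={phase}")
```

Peaks in a map built from such terms are what a peak search (peakmax in Pavol's case) would then pick as candidate Se sites. The quality of those peaks depends directly on the quality of the starting phases, which is why starting from impossible.mtz is not the same as starting from the data alone.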
Re: [ccp4bb] a challenge
Hi James,

As an aside (as your point is looking for a John Henry, not investigating automated model-building), I would point out that it is not at all uncommon to find cases where a very small difference in starting parameters or starting phases leads to a very different final result in automated model-building. I suspect that this comes from the discrete nature of model-building: an atom goes either here or there, and every time you put something in you have branched the search. Then, when this model is used in calculating a map, you get a new map that depends on the exact branching, so that small starting perturbations can become amplified.

As you have found a way to automatically build possible.mtz, I would expect that some small change in parameters or software would solve the impossible one too (not that one could necessarily find this change easily).

All the best,
Tom T

On Jan 11, 2013, at 12:13 PM, James Holton wrote:

I have a challenge for all those expert model-builders out there: can you beat the machine?

It seems these days that everything is automated, and the only decision left for a crystallographer to make is which automation package to use. But has crystallography really been solved? Is looking at maps now no more interesting than playing chess, or any of the other once noble pursuits of human beings that we no longer see as challenging because someone built a machine that can do the job better than any of us?

I think not. But I need your help to prove it.

Specifically, the phases in this file:

http://bl831.als.lbl.gov/~jamesh/challenge/possible.mtz

when fed with the right set of parameters into the best model-building package I have available to me, actually do converge to the correct structure, with nice low R/Rfree.
However, THIS file:

http://bl831.als.lbl.gov/~jamesh/challenge/impossible.mtz

contains the same amplitudes but very slightly different phases from those in possible.mtz above, and this file invariably leads to abysmal failure of every model-building package I have tried. Short of cheating (i.e. using molecular replacement with the right answer: 3dko), I don't think there is any automated way to arrive at a solved structure from impossible.mtz.

What is interesting about this is how remarkably similar these two maps are. In fact, the correlation coefficient between them is 0.92. And yet, one can be solved automatically, and the other can't. More details can be found on the web page:

http://bl831.als.lbl.gov/~jamesh/challenge/

But, my question for the CCP4BB is: Are there any John Henrys left out there who can still beat the machine? Anyone?

-James Holton
MAD Scientist
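To make the 0.92 figure a little more concrete: when two maps share the same amplitudes and differ only in their phases, Parseval's theorem reduces the map correlation coefficient to an F-squared-weighted mean of the cosine of the phase difference. The sketch below illustrates that relation and is not necessarily how James computed his number. It assumes unweighted amplitudes (no figure-of-merit weighting), that F(000) is excluded, and that every listed reflection stands for the same number of symmetry mates; the function name and toy values are hypothetical.

```python
import math


def map_correlation(amplitudes, phases1_deg, phases2_deg):
    """Correlation of two maps with identical amplitudes but different phases.

    Computes sum(F^2 * cos(phi1 - phi2)) / sum(F^2) over the reflections,
    ignoring F(000) and any difference in reflection multiplicity.
    """
    numerator = 0.0
    denominator = 0.0
    for f, phi1, phi2 in zip(amplitudes, phases1_deg, phases2_deg):
        weight = f * f
        numerator += weight * math.cos(math.radians(phi1 - phi2))
        denominator += weight
    return numerator / denominator


# Toy data: three reflections, the second set of phases slightly perturbed.
amplitudes = [120.0, 80.0, 45.0]
phases_a = [10.0, 200.0, 330.0]
phases_b = [25.0, 180.0, 300.0]
print(round(map_correlation(amplitudes, phases_a, phases_b), 3))
```

Because strong reflections dominate the sum, two maps can correlate well overall while still differing in the weaker terms that carry much of the fine detail, which is one way to read how a correlation of 0.92 can separate a buildable map from one that defeats every program.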