Re: [ccp4bb] a challenge

2013-01-12 Thread Pavol Skubak
Dear James,

your challenge in its current form ignores an important source
of information for model building that is available for your
simulated data - namely, it does not allow the use of anomalous
phase information in the model building. In difficult cases on
the edge of success such as this one, this typically makes
the difference between building and not building.

If you can make the F+/F- and Se substructure available, we
can test whether this is indeed the case. However, while I
expect this would push the challenge further significantly,
most likely you would be able to decrease the Se incorporation
of your simulated data further to such levels that the anomalous
signal is again no longer sufficient to build the structure. And
most likely, there would again exist an edge where a small
decrease in the Se incorporation would lead from a model built
to no model built.

Best regards,

-- 
Pavol Skubak
Biophysical Structural Chemistry
Gorlaeus Laboratories
Einsteinweg 55
Leiden University
LEIDEN  2333CC
the Netherlands
tel: 0031715274414
web: http://bsc.lic.leidenuniv.nl/people/skubak-0


Re: [ccp4bb] a challenge

2013-01-12 Thread George Sheldrick

Dear James,

I agree with Pavol that your example is not very realistic. In practice
one would start from the heavy atom positions. As well as providing
starting phases, they are useful in other ways. For example, shelxe
(and probably most other tracing programs) adds them to a 'no-go'
map so it knows where NOT to trace the main-chain.

Best wishes, George





--
Prof. George M. Sheldrick FRS
Dept. Structural Chemistry,
University of Goettingen,
Tammannstr. 4,
D37077 Goettingen, Germany
Tel. +49-551-39-3021 or -3068
Fax. +49-551-39-22582




Re: [ccp4bb] a challenge

2013-01-12 Thread James Holton


Fair enough!

I have just now added DANO  and I(+)/I(-) to the files.  I'll be very 
interested to see what you can come up with!  For the record, the phases 
therein came from running mlphare with default parameters but exactly 
the correct heavy-atom constellation (all the sulfur atoms in 3dko), and 
then running dm with default parameters.


Yes, there are other ways to run mlphare and dm that give better phases, 
but I was only able to determine those parameters by cheating 
(comparing the resulting map to the right answer), so I don't think it 
is fair to use those maps.


I have had a few questions about what is cheating and what is not 
cheating.  I don't have a problem with the use of sequence information 
because that actually is something that you realistically would know 
about your protein when you sat down to collect data.  The sequence of 
this molecule is that of 3dko:

http://bl831.als.lbl.gov/~jamesh/challenge/seq.pir

  I also don't have a problem with anyone actually using an automation 
program to _help_ them solve the impossible dataset as long as they 
can explain what they did.  Simply putting the above sequence into 
BALBES would, of course, be cheating!  I suppose one could try 
eliminating 3dko and its homologs from the BALBES search, but that, in 
and of itself, is perhaps relevant to the challenge: what is the most 
distant homolog that still allows you to solve the structure?  That, 
I think, is also a stringent test of model-building skill.


  I have already tried ARP/wARP, phenix.autobuild and 
buccaneer/refmac.  With default parameters, all of these programs fail 
on both the possible and impossible datasets.  It was only with some 
substantial tweaking that I found a way to get phenix.autobuild to crack 
the possible dataset (using 20 models in parallel).  I have not yet 
found a way to get any automation program to build its way out of the 
impossible dataset. Personally, I think that the breakthrough might be 
something like what Tom Terwilliger mentioned.  If you build a good 
enough starting set of atoms, then I think an automation program should 
be able to take you the rest of the way.  If that is the case, then it 
means people like Tom who develop such programs for us might be able to 
use that insight to improve the software, and that is something that 
will benefit all of us.


Or, it is entirely possible that I'm just not running the current 
software properly!  If so, I'd love it if someone who knows better (such 
as their developers) could enlighten me.


-James Holton
MAD Scientist





Re: [ccp4bb] a challenge

2013-01-12 Thread James Holton


Fair enough!

The heavy atom positions are simply the S atoms in 3dko.  There are 22 
of them.  Also, in this case the Met side chains (12 of those) are 32% 
occupied with Se.  The other 68% is sulfur.   I think it is realistic 
that one could know the extent of Se incorporation ahead of time from 
something like mass spec (especially if you knew it could make-or-break 
your structure determination).  However, I don't think it is realistic 
that you would know where they are before running shelx.
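To put a rough number on what 32% occupancy does to the anomalous signal, here is a minimal Python sketch of the usual Hendrickson-style estimate of the Bijvoet ratio. It assumes the signal scales linearly with site occupancy and ignores the small contribution from the sulfur atoms. The f'' value and the protein atom count are placeholders chosen for illustration, not values taken from 3dko or from the simulation.

```python
import math


def bijvoet_ratio(n_sites, occupancy, f_double_prime, n_protein_atoms, z_eff=6.7):
    """Rough expected <|F+ - F-|>/<|F|> at zero scattering angle.

    Each partially occupied site contributes (occupancy * f'')^2 to the
    mean-square anomalous structure factor, so the ratio is linear in occupancy.
    z_eff is the conventional average scattering power of a protein atom.
    """
    effective_sites = n_sites * occupancy ** 2
    return (math.sqrt(2.0) * math.sqrt(effective_sites / n_protein_atoms)
            * f_double_prime / z_eff)


# Placeholder inputs (assumptions, not values from the thread):
F_DOUBLE_PRIME_SE = 6.0   # electrons, near the Se K-edge peak
N_PROTEIN_ATOMS = 2000    # hypothetical number of non-hydrogen atoms

full = bijvoet_ratio(12, 1.00, F_DOUBLE_PRIME_SE, N_PROTEIN_ATOMS)
partial = bijvoet_ratio(12, 0.32, F_DOUBLE_PRIME_SE, N_PROTEIN_ATOMS)
print(f"12 Met sites, 100% Se: {full:.1%}")
print(f"12 Met sites,  32% Se: {partial:.1%}")
```

Whatever the placeholder values, the ratio of the two results is the occupancy itself, so in this approximation 32% incorporation leaves about a third of the fully substituted signal.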


-James Holton
MAD Scientist






Re: [ccp4bb] a challenge

2013-01-12 Thread James Holton


Whoops!  Sorry, folks.  I made a mistake with the I(+)/I(-) entries. They 
had the wrong axis convention relative to 3dko and the F in the same 
file.  Sorry about that.


The files on the website now should be right.
http://bl831.als.lbl.gov/~jamesh/challenge/possible.mtz
http://bl831.als.lbl.gov/~jamesh/challenge/impossible.mtz

md5 sums:
c4bdb32a08c884884229e8080228d166  impossible.mtz
caf05437132841b595be1c0dc1151123  possible.mtz
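
Since the files were replaced in place, it is worth confirming that a local copy matches the checksums above. A minimal Python sketch, assuming both files were downloaded into the current directory (the helper names are illustrative):

```python
import hashlib
from pathlib import Path

# Checksums exactly as posted in this message.
EXPECTED = {
    "impossible.mtz": "c4bdb32a08c884884229e8080228d166",
    "possible.mtz": "caf05437132841b595be1c0dc1151123",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory="."):
    """Map each expected file name to True if present with a matching MD5."""
    results = {}
    for name, expected in EXPECTED.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in check_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

A mismatch most likely means the copy predates the corrected upload and should be downloaded again.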

-James Holton
MAD Scientist







Re: [ccp4bb] a challenge

2013-01-12 Thread George Sheldrick

James,

I had in fact just come to the conclusion that the indexing was 
consistent with 3dko for 'possible' but not for 'impossible',
which I suppose was logical.

George








--
Prof. George M. Sheldrick FRS
Dept. Structural Chemistry,
University of Goettingen,
Tammannstr. 4,
D37077 Goettingen, Germany
Tel. +49-551-39-3021 or -3068
Fax. +49-551-39-22582




Re: [ccp4bb] a challenge

2013-01-12 Thread James Holton


I admit that made 'impossible' more difficult to solve than 'possible', 
but not in the way I had intended!  Again, sorry about that.  It is 
corrected now.


The change in indexing arises because I am processing the simulated 
images with a default run of XDS and, as you know, the autoindexing picks 
an indexing convention at random.  I flipped it back at the time, but 
when I just now went back to get the I(+)/I(-) I went just one step too 
far.
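
An indexing mismatch of this kind can be checked numerically: apply each candidate reindexing operator to one dataset and keep the one that correlates best with a reference. The sketch below is plain Python with made-up intensities. The operator k,h,-l is only an example, since the valid alternatives depend on the space group, which is not stated in this thread.

```python
import math


def reindex(hkl, operator):
    """Apply a 3x3 integer reindexing matrix to one Miller index."""
    return tuple(sum(row[i] * hkl[i] for i in range(3)) for row in operator)


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when it is undefined."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def best_indexing(reference, data, operators):
    """Return (name, cc) for the operator that best maps data onto reference."""
    scores = {}
    for name, op in operators.items():
        mapped = {reindex(hkl, op): value for hkl, value in data.items()}
        common = sorted(set(mapped) & set(reference))
        scores[name] = pearson([reference[h] for h in common],
                               [mapped[h] for h in common])
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical candidate operators (the second is its own inverse).
OPERATORS = {
    "h,k,l": ((1, 0, 0), (0, 1, 0), (0, 0, 1)),
    "k,h,-l": ((0, 1, 0), (1, 0, 0), (0, 0, -1)),
}

# Made-up reference intensities, and the same data in the other convention.
reference = {(1, 2, 3): 10.0, (2, 1, -3): 50.0, (3, 1, 2): 20.0,
             (1, 3, -2): 5.0, (2, 2, 1): 7.0, (2, 2, -1): 40.0}
data = {reindex(hkl, OPERATORS["k,h,-l"]): i for hkl, i in reference.items()}

print(best_indexing(reference, data, OPERATORS))
```

With real data the reference would be amplitudes or intensities calculated from a model or taken from another crystal, and the correlation for the wrong convention is typically near zero rather than negative.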


Once again, sorry.  It was not my intention to waste anyone's time!

-James Holton
MAD Scientist


Re: [ccp4bb] a challenge

2013-01-12 Thread Pavol Skubak
I can build from the impossible.mtz data in the following two steps:

1. getting the Se substructure from an anomalous difference map
constructed from impossible.mtz

2. running combined model building using the substructure
from step 1 and starting from the impossible.mtz map

Only impossible.mtz and the sequence (which is probably not
really necessary) are used in this solution.

It is not a fully automatic solution: step 2 (model building
combined with density modification and phasing via a recently
developed multivariate SAD function) was performed
automatically using CRANK (which calls Buccaneer, REFMAC
and Parrot), while step 1 was done manually using CCP4 tools
(cfft and peakmax).
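
Step 1 amounts to a Fourier synthesis with the Bijvoet difference as amplitude and the current phase retarded by 90 degrees. How cfft was invoked is not given here, so the following Python sketch only illustrates how such coefficients are conventionally formed. The function name and the input layout are assumptions, and the sign convention should be checked against the program actually used.

```python
def anomalous_map_coefficients(reflections):
    """Form anomalous difference Fourier coefficients.

    reflections: iterable of (hkl, f_plus, f_minus, phase_deg), where
    phase_deg is the current protein phase estimate in degrees.
    Returns a list of (hkl, amplitude, phase_deg) with
    amplitude = |F+ - F-| and phase = phi - 90 degrees
    (plus 180 degrees when the difference is negative).
    Reflections with a missing Friedel mate (None) are skipped.
    """
    coefficients = []
    for hkl, f_plus, f_minus, phase_deg in reflections:
        if f_plus is None or f_minus is None:
            continue
        dano = f_plus - f_minus
        phase = phase_deg - 90.0
        if dano < 0:
            # Keep the amplitude positive by flipping the phase.
            dano, phase = -dano, phase + 180.0
        coefficients.append((hkl, dano, phase % 360.0))
    return coefficients


# Made-up reflections for illustration only.
example = [
    ((1, 0, 0), 105.0, 100.0, 120.0),
    ((0, 1, 0), 90.0, 100.0, 30.0),
    ((0, 0, 1), 50.0, None, 10.0),
]
for hkl, amplitude, phase in anomalous_map_coefficients(example):
    print(hkl, amplitude, phase)
```

The strongest peaks of the resulting map are the candidate anomalous scatterer positions, which is the role peakmax plays in the description above.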

Compared to the deposited model, 96% of the main chain is
(correctly) built and 92% is (correctly) docked, and the R factor
is 21% - clearly, the (relatively) weak anomalous signal is the
only limitation in this case. However, the model building
procedure did not struggle too much - I expect it would still
work if the Se incorporation were decreased somewhat further
(as long as the substructure can be obtained in some way).

Of course, this is not a pure solution in the sense that
I started from impossible.mtz rather than from scratch, i.e.
from the data only. Obtaining the substructure from scratch
might be more difficult.

Pavol


 On 1/12/2013 3:07 AM, Pavol Skubak wrote:


  Dear James,

  your challenge in its current form ignores an important source
 of information for model building that is available for your
 simulated data - namely, it does not allow to use anomalous
 phase information in the model building. In difficult cases on
 the edge of success such as this one, this typically makes
 the difference between building and not building.

  If you can make the F+/F- and Se substructure available, we
 can test whether this is the