Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-31 Thread Anastassis Perrakis

Dear all,

The discussion about keeping primary data, and what level of data can
be considered 'primary', has - rather unsurprisingly - come up in
areas other than structural biology as well.
An example is next-generation sequencing. A full dataset is a few
terabytes, but post-processing reduces it to sub-Gb size. However, the
post-processed data, as in our case,
have suffered the inadequacy of computational reduction ... At least,
our institute has decided to back up the primary data
in triplicate. For that reason our facility bought
three -80 freezers - one on site in the basement, one at the top floor,
and one off-site - and they keep the DNA to be sequenced. A sequencing
run is already sub-1k$ and it will not become
more expensive. So, if it's important, do it again. It's cheaper and it's
better.


At first sight, that does not apply to MX. Or does it?

So, maybe the question is not "To archive or not to archive" but "What
to archive".


(similarly, it never crossed my mind if I should "be or not be" - I
always wondered "what to be")


A.


On Oct 30, 2011, at 11:59, Kay Diederichs wrote:


At 20:59, Jrh wrote:
...

So:-  Universities are now establishing their own institutional
repositories, driven largely by the Open Access demands of funders. For
these to host raw datasets that underpin publications is a reasonable
role in my view, and indeed they already have this category in the
University of Manchester eScholar system, for example.  I am set to
explore locally here whether they would accommodate all our Lab's raw
X-ray image datasets per annum that underpin our published crystal
structures.

It would be helpful if readers of this CCP4bb could kindly also
explore with their own universities whether they have such an
institutional repository and whether raw datasets could be accommodated.
Please do email me off-list with this information if you prefer, but
within the CCP4bb is also good.



Dear John,

I'm pretty sure that no consistent policy exists for providing an
institutional repository for the deposition of scientific data at
German universities, Max Planck institutes or Helmholtz
institutions - at least I have never heard of anything like this. More
specifically, our University of Konstanz certainly does not have the
infrastructure to provide this.


I don't think that Germany is the only exception to any rule
about the availability of institutional repositories. Rather, I'm
almost amazed that British and American institutions seem to support this.


Thus I suggest not focusing exclusively on official institutional
repositories, but exploring alternatives: distributed filestores
like Google's BigTable, Bittorrent or others might be just as
suitable - check out http://en.wikipedia.org/wiki/Distributed_data_store
. I guess that any crystallographic lab could easily sacrifice/
donate a TB of storage for the purposes of this project in 2011 (and
maybe 2 TB in 2012, 3 in 2013, ...), but clearly the work needed
to set this up should be kept as low as possible (a bittorrent
daemon seems simple enough).
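
As a concrete illustration of how little per-dataset machinery such a
scheme would need, here is a minimal Python sketch (the file names are
hypothetical, and this is not an official CCP4 or PDB tool): bundle a
directory of images into a tarball and record a SHA-256 checksum, so
anyone retrieving the bundle from a distributed store can verify it.

import hashlib
import tarfile
from pathlib import Path

def bundle_dataset(image_dir: str, out_tar: str) -> str:
    """Tar+gzip image_dir and return the SHA-256 digest of the tarball."""
    with tarfile.open(out_tar, "w:gz") as tar:
        tar.add(image_dir, arcname=Path(image_dir).name)
    digest = hashlib.sha256()
    with open(out_tar, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    # hypothetical dataset directory; publish the digest alongside the
    # torrent/magnet link so downloads can be verified
    print("sha256:", bundle_dataset("lysozyme_run1_images",
                                    "lysozyme_run1.tar.gz"))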


Just my 2 cents,

Kay






Please don't print this e-mail unless you really need to
Anastassis (Tassos) Perrakis, Principal Investigator / Staff Member
Department of Biochemistry (B8)
Netherlands Cancer Institute,
Dept. B8, 1066 CX Amsterdam, The Netherlands
Tel: +31 20 512 1951 Fax: +31 20 512 1954 Mobile / SMS: +31 6 28 597791






Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-31 Thread Gerard Bricogne
Dear Tassos,

 It is unclear whether this thread will be able to resolve your deep
existential concerns about "what to be", but you do introduce a couple of
interesting points: (1) raw data archiving in areas (of biology) other than
structural biology, and (2) archiving the samples rather than the verbose
data that may have been extracted from them.

 Concerning (1), I am grateful to Peter Keller here in my group for
pointing me towards the Trace Archive of DNA sequences in mid-August, when
we were reviewing for the n-th time the issue of raw data deposition under
discussion in this thread, and its advantages over keeping only the derived
data extracted from the raw data. He found an example, at

http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=retrieve&val=12345&dopt=trace&size=1&retrieve=Submit

You can check the "Quality Score" box below the trace, and this will refresh
the display to give a visual estimate of the reliability of the sequence.
There is clearly a problem around position 210 that would not have been
adequately dealt with by just retaining the most probable sequence. In this
context, it has been found worthwhile to preserve the raw data, to make it
possible to audit derived data against them. This is at least a very
simple example of what you were referring to when you wrote about the
"inadequacy of computational reduction". In the MX context, this is rather
similar to the contamination of integrated intensities by spots from
parasitic lattices (which would still affect unmerged intensities, by the
way - so upgrading the PDB structure factor file to unmerged data would
take care of over-merging, but not of that contamination).
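
 (As a toy illustration of that kind of audit - made-up data, not the
actual Trace Archive interface - one can flag the positions of a called
sequence whose per-base Phred quality scores fall below a threshold,
which is exactly what becomes impossible if only the most probable
sequence is retained:

def audit_base_calls(sequence, phred_scores, min_phred=20):
    """Return (position, base, score) for low-confidence base calls.

    Phred quality Q relates to the error probability p of a call by
    Q = -10*log10(p), so Q = 20 means a 1% chance the call is wrong.
    """
    return [(i, base, q)
            for i, (base, q) in enumerate(zip(sequence, phred_scores))
            if q < min_phred]

seq = "ACGTTGCA"                         # made-up base calls
quals = [38, 40, 35, 12, 39, 8, 37, 36]  # made-up Phred scores
for pos, base, q in audit_base_calls(seq, quals):
    print("position %d: call '%s' has Phred %d - re-check the trace"
          % (pos, base, q))

Something like the problem around position 210 in the example above would
show up here as a low-quality call.)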

 Concerning (2), I greatly doubt there would be an equivalent for MX: few
people would have spare crystals to put to one side for a future repeat of a
diffraction experiment (except in the case of lysozyme/insulin/thaumatin!).
I can remember an esteemed colleague arguing 4-5 years ago that if you want
to improve a deposited structure, you could simply repeat the work from
scratch - a sensible position from the philosophical point of view (science
being the art of the repeatable), but far less sensible under conditions of
limited resources, and given also the difficulties of reproducing crystals.
The real-life situation is more a "Carpe diem" one: archive what you have,
as you may never see it again! Otherwise one would easily get drawn into the
same kind of unrealistic expectations as people who get themselves frozen in
liquid N2, with their blood replaced by DMSO, hoping to be brought back to
life some day in the future ;-) .


 With best wishes,
 
  Gerard.

--
On Mon, Oct 31, 2011 at 11:37:47AM +0100, Anastassis Perrakis wrote:
> [...]

[ccp4bb] 2 senior positions at AstraZeneca

2011-10-31 Thread Pauptit, Richard A
Two senior structural biology positions available at AstraZeneca UK. One
is associate director of the crystallography/crystallization group
(recently mentioned in ccp4bb and still open) and has reference RD237,
the other is one position up and has just become available, the director
of protein structure and biophysics with reference RD300. You can find
the jobs directly with the links below, or search for the reference
numbers on www.astrazeneca.com under careers. Both should also appear in
Nature jobs at some point soon.



http://jobs.astrazeneca.com/jobs/3370-associate-director-protein-structure-crystallography

http://jobs.astrazeneca.com/jobs/3443-director-of-structure-and-biophysics



Many thanks



Richard Pauptit

Discovery Sciences, Alderley Park

AstraZeneca UK






Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-31 Thread Martin Kollmar
Still, after hundreds (?) of emails on this topic, I haven't seen any
convincing argument in favor of archiving data. The only convincing
arguments are against, and they come from Gerard K and Tassos.


Why?
The question is not what to archive, but still why we should archive all
the data.


Because software developers need more data? Should we make all this
effort and incur all these costs because 10 developers worldwide need the
data for ALL protein structures? Do they really need so much data -
wouldn't it be enough to build a repository of maybe 1000 datasets for
development?


Does anyone really believe that our view of the actual problem, the
function of the proteins, will change with the analysis of whatever
scattering is still in the images but not used by today's software?
Crystal structures are static snapshots, obtained under artificial
conditions. In solution (still the physiological state) the proteins might
look different - not much, but at least far more dynamic. Does it therefore
matter whether we know some sidechain positions better (in the crystal
structure) when re-analysing the data? In turn, is our current software
so bad that we would expect strong differences when
re-analysing the data? No. And if the structures change upon reanalysis
(more or less), who re-interprets the structures and re-writes the papers?


There are many, many cases where researchers re-did structures (or did
structures closely related to already available ones, such as mutants,
structures from closely related species, etc.), even after 10 years. I
guess they used the latest software in each case, and thus
incorporated all the software development of those 10 years. And are the
structures really different (beyond the introduced changes, mutations,
etc.)? Different because of the software used?


The comparison with next-generation sequencing data is useful here, but
only in the sense Tassos explained. Well, of course not every position
in the genomic sequence is fixed. Therefore it is sometimes useful to
look at the original data (the traces, as Gerard B pointed out). But we
already know that every single organism is different (especially
eukaryotes), and therefore it is absolutely enough to store the
computationally reduced and merged data. If one needs better,
position-specific data, sequencing and comparing individual strains becomes
necessary, as in the ENCODE project, the sequencing of about 100
Saccharomyces strains, the sequencing of 1000 Arabidopsis strains, etc.
Discussions about single positions are useless if they are not
statistically relevant. They need to be analysed in the context of
populations, large cohorts of patients, etc. If we want personalized
medicine adapted to personal genomes, we would also need personal sets
of protein structures, which we cannot provide yet. Therefore, storing
the DNA in the freezer is better and cheaper than storing all the
raw sequencing data. Do you think a reviewer re-sequences, or
re-assembles, or re-annotates a genome, even if access to the raw reads
were available? If you trust those data, why don't we trust our
structure factors? Do you trust electron microscopy images, or movies of
GFP-tagged proteins? Do you think what is presented for a single cell or a
few visible cells is also found in all cells?


And now, how many of you (if not everybody) use structures from yeast,
Drosophila, mouse, etc. as MODELS for human proteins? If we stick to this
thinking, who would care about potential minor changes in the structures
upon re-analysis (and, in the light of this discussion, arguing about
specific genomic sequence positions becomes unimportant as well)?


Is any of the archived data useful without manual evaluation upon
archiving? This is especially relevant for structures not solved yet. Do
the images belong with the structure factors? If only images are
available, where is the corresponding protein sequence? Has it been
sequenced? What was in the buffer/crystallization condition? What
was used during protein purification? What was the intention of the
crystallization - e.g. a certain functional state that the protein was
forced into by artificial conditions, etc.? Who wants to evaluate all
that, and how? The question is not whether we could do it. We could,
but wouldn't it advance science far more if we spent the time and
money on new projects rather than on evaluation, administration, etc.?


Be honest: how many of you have really, and completely, reanalysed your
own data, deposited 10 years ago, with the latest
software? What changes did you find? Did you have to rewrite the
discussions in your former publications? Do you think that the changes
justify the effort and cost of worldwide archiving of all data?


Well, for all these points there are always single cases (some have been
mentioned in earlier emails) where these things matter or mattered. But
does this really justify all the future

Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-31 Thread Robert Esnouf
Dear All,

As someone who recently left crystallography for sequencing, I 
should modify Tassos's point...

"A full data-set is a few terabytes, but post-processing 
reduces it to sub-Gb size."

My experience from HiSeqs is that "full" here means the 
base calls - equivalent to the unmerged HKLs - hardly raw 
data. NGS (short-read) sequencing is an imaging technique, and 
the images are more like 100TB for a 15-day run on a single 
flow cell. The raw base calls are about 5TB. The compressed, 
mapped data (a BAM file, for a human genome at 30x coverage) is 
about 120GB. It is only a variant call file (VCF, the differences 
from a stated human reference genome) that is sub-Gb, and these 
files are - unsurprisingly - unsuited to detailed statistical 
analysis. Also, $1k is not yet an economic cost...
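
(For what it's worth, the reduction factors implied by those 
rough figures can be tallied in a few lines of Python - the 
numbers below are just the ones quoted above, with "sub-Gb" 
taken as roughly 0.8 GB:

stages = [
    ("raw images", 100 * 1024),          # sizes in GB
    ("base calls", 5 * 1024),
    ("mapped BAM, 30x human", 120),
    ("variant calls, VCF", 0.8),
]
for (hi_name, hi_gb), (lo_name, lo_gb) in zip(stages, stages[1:]):
    print("%s -> %s: about %.0fx smaller"
          % (hi_name, lo_name, hi_gb / lo_gb))

Each processing step discards one to two orders of magnitude 
of data.)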

The DNA information capacity in a single human body dwarfs the 
entire world's disk capacity, so storing DNA is a no-brainer 
here. Sequencing groups are making very hard-nosed economic 
decisions about what to store - indeed it is a research topic 
in itself - but the scale of the problem is very much 
bigger.

My tuppence ha'penny is that depositing raw images along 
with everything else in the PDB is a nice idea but would have 
little impact on science (human/animal/plant health or 
understanding of biology).

1) If confined to structures in the PDB, the images would just 
be the ones giving the final best data - hence the ones least 
likely to have been problematic. I'd be more interested in 
SFs/maps for looking at ligand-binding etc...

2) Unless this were done before paper acceptance they would be 
of little use to referees seeking to review important 
structural papers. I'd like to see PDB validation reports 
(which could include automated data processing, perhaps culled 
from synchrotron sites, SFs and/or maps) made available to 
referees in advance of publication. This would be enabled by 
deposition, but could be achieved in other ways.

3) The datasets of interest to methods developers are unlikely 
to be the ones deposited. They should be in contact with 
synchrotron archives directly. Processing multiple lattices is 
a case in point here.

4) Remember the average consumer of a PDB file is not a 
crystallographer - more likely a graduate student in a 
clinical lab. For him/her, things like occupancies and 
B-factors are far more serious concerns... I'm not trivializing 
the issue, but importance is always relative. Are there 
outsiders on the panel to keep perspective?

Robert


--

Dr. Robert Esnouf,
University Research Lecturer, ex-crystallographer
and Head of Research Computing,
Wellcome Trust Centre for Human Genetics,
Roosevelt Drive, Oxford OX3 7BN, UK

Emails: rob...@strubi.ox.ac.uk   Tel: (+44) - 1865 - 287783
and rob...@esnouf.comFax: (+44) - 1865 - 287547


 Original message 
Date: Mon, 31 Oct 2011 11:37:47 +0100
From: CCP4 bulletin board CCP4BB@JISCMAIL.AC.UK (on behalf 
of Anastassis Perrakis a.perra...@nki.nl)
Subject: Re: [ccp4bb] To archive or not to archive, that's 
the question!  
To: CCP4BB@JISCMAIL.AC.UK
> [...]

Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-31 Thread Oganesyan, Vaheh

I was hesitant to add my opinion so far because I'm more used to listening 
to this forum than telling others what I think.
Why and what to deposit are absolutely interconnected. Once you decide why 
you want to do it, you will probably know what the best format will be, and 
vice versa.
Whether this deposition of raw images will or will not help us understand 
the biology better in future, I'm not sure.
But storing those difficult datasets to help future software development 
sounds really far-fetched. It assumes that in the future crystallographers 
will never grow crystals that deliver difficult datasets. If that is the 
case, and in 10-20-30 years the next generation is growing much better 
crystals, then they won't need such software development.
If that is not the case, and once in a while (or more often) they get 
something out of the ordinary, then software developers will take those 
datasets and develop whatever they need to handle such cases.

Am I missing a point of discussion here?

Regards,

 Vaheh




-Original Message-
From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Robert 
Esnouf
Sent: Monday, October 31, 2011 10:31 AM
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] To archive or not to archive, that's the question!
> [...]

Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-31 Thread David Waterman
I have no doubt there are software developers out there who have spent
years building up their own personal collections of 'interesting' datasets,
file formats, and various oddities that they take with them wherever they
go, and consider this collection to be precious. Despite the fact that many
bad datasets are collected daily at beamlines the world over, it is amazing
how difficult it can be to find what you want when there is no open, single
point-of-access repository to search. Simply asking the crystallographers
and beamline scientists doesn't work: they are too busy doing their own
jobs.

-- David


On 31 October 2011 15:18, Oganesyan, Vaheh oganesy...@medimmune.com wrote:
> [...]

Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-31 Thread Clemens Vonrhein
Dear Vaheh,

On Mon, Oct 31, 2011 at 03:18:07PM +, Oganesyan, Vaheh wrote:
 But to store those difficult datasets to help the future software
 development sounds really farfetched.

As far as I can see, the general plan is that this would be a second stage
(deposit all datasets) - the first stage would be the datasets related
directly to a given PDB entry.

 This assumes that in the future crystallographers will never grow
 crystals that will deliver difficult datasets.

Oh sure they will. And lots of those datasets will be available to
developers ... being thrown a difficult problem under pressure is a
very good way to get ideas, think outside the box, etc. However,
developing solid algorithms is better done in a less hectic
environment, with a large collection of similar problems (changing only
one parameter at a time) on which to test a new method.

 If that is the case and in 10-20-30 years next generation will be
 growing much better crystals then they don't need such a software
 development.

They'll grow better crystals for the type of project we're currently
struggling with, sure. But we'll still get poor crystals for projects we
don't even attempt or tackle right now.

Software development is a slow process, often working on a different
timescale from the typical structure-solution project (obviously there
are exceptions). So planning ahead for that time will prepare us.

And yes, it will have an impact on the biology then. It's not just the
here and now (and the next grant, the next high-profile paper) that we
should be thinking about.

 Am I missing a point of discussion here?

One small point maybe: there are very few developers out there - but a
very large number of users who benefit from what they have
done. Often the work is not very visible ("It's just pressing a button
or two ... so it must be trivial!") - which is a good thing: it has to
be simple, robust, automatic and usable.

I think if a large enough number of developers consider deposited
images a very useful resource for their future development (and
therefore of future benefit to a large number of users), it should be
seriously considered, even if some of the advertised benefits have to
be taken on trust.

Past developments in data processing have had a big impact on a lot of
projects - high-profile or just the standard PhD-student nightmare -
with often small returns for the developers in terms of publications,
grants or even citations (main paper or supplementary material).

So maybe, in the spirit of the festive season, it is time to consider
giving a little bit back? What is there to lose? Another 20 minutes of
additional deposition work for the user, in return for maybe/hopefully
saving a whole project 5 years down the line? Not a bad investment, it
seems to me ...

Cheers

Clemens

-- 

***
* Clemens Vonrhein, Ph.D. vonrhein AT GlobalPhasing DOT com
*
*  Global Phasing Ltd.
*  Sheraton House, Castle Park 
*  Cambridge CB3 0AX, UK
*--
* BUSTER Development Group  (http://www.globalphasing.com)
***


[ccp4bb] Archiving Images for PDB Depositions

2011-10-31 Thread Jacob Keller
Dear Crystallographers,

I am sending this to try to start a thread which addresses only the
specific issue of whether to archive, at least as a start, the images
corresponding to PDB-deposited structures. I believe there could be a
real consensus about the low cost and usefulness of this degree of
archiving, but the discussion keeps swinging around to all levels of
archiving, obfuscating who's for what and for what reason. What about
this level, alone? All of the accompanying info is already entered
into the PDB, so there would be no additional costs on that score.
There could just be a simple link, added to the "download files"
pulldown, which could say "go to image archive", or something along
those lines. Images would be pre-zipped, maybe even tarred, and people
could just download from there. What's so bad?

The benefits are that sometimes there are structures for which the
resolution cutoffs might be unreasonable, or perhaps there is some
potential radiation damage in the later frames that might be
deleterious to interpretation, or perhaps there are ugly features in
the images which are invisible or obscure in the statistics.

In any case, it seems to me that this step would be pretty painless,
as it is merely an extension of the current system--just add a link to
the pulldown menu!

Best Regards,

Jacob Keller

-- 
***
Jacob Pearson Keller
Northwestern University
Medical Scientist Training Program
email: j-kell...@northwestern.edu
***


[ccp4bb] atomic scattering factors in REFMAC

2011-10-31 Thread Ivan Shabalin
Dear Refmac users,

I noticed that if I refine a structure containing SeMet, the Se atoms usually 
have big negative (red) peaks in the difference map and high B-factors. As I 
understand from diffraction theory and from some discussions on CCP4bb, 
that may be because in REFMAC the atomic scattering factors are internally 
coded for copper radiation (CuKa).
I tried the keyword "anomalous wavelength 0.9683" and found that with it I 
got different values of the coefficient c for Se, Mn, and P, as shown in the 
REFMAC log file:

loop_
 _atom_type_symbol
 _atom_type_scat_Cromer_Mann_a1
 _atom_type_scat_Cromer_Mann_b1
 _atom_type_scat_Cromer_Mann_a2
 _atom_type_scat_Cromer_Mann_b2
 _atom_type_scat_Cromer_Mann_a3
 _atom_type_scat_Cromer_Mann_b3
 _atom_type_scat_Cromer_Mann_a4
 _atom_type_scat_Cromer_Mann_b4
 _atom_type_scat_Cromer_Mann_c

  N   12.2126   0.0057   3.1322   9.8933   2.0125  28.9975   1.1663   0.5826  -11.5290
  C    2.3100  20.8439   1.0200  10.2075   1.5886   0.5687   0.8650  51.6512    0.2156
  H    0.4930  10.5109   0.3229  26.1257   0.1402   3.1424   0.0408  57.7997    0.0030
  O    3.0485  13.2771   2.2868   5.7011   1.5463   0.3239   0.8670  32.9089    0.2508
  SE  17.0006   2.4098   5.8196   0.2726   3.9731  15.2372   4.3543  43.8163   -1.0329
  MN  11.2819   5.3409   7.3573   0.3432   3.0193  17.8674   2.2441  83.7543    1.3834
  P    6.4345   1.9067   4.1791  27.1570   1.7800   0.5260   1.4908  68.1645    1.2650

As a result, the red peaks around Se are significantly lower, the Se 
B-factors are a bit smaller (e.g. 25.6 vs 23.1), and Rf is lowered by a bit 
more than 0.1% with the same input files.

That looks pretty good. Still, I want to ask your opinion on the following:

1) Is this the proper way to specify atomic scattering factors? I found this 
keyword in the REFMAC documentation under the topic "Simultaneous SAD 
experimental phasing and refinement", and I'm not sure whether I change 
something else when I specify it. I don't have separate F+, F- and 
corresponding SIGF+, SIGF- in my mtz, so SAD experimental phasing should 
not be triggered.
2) Do you think it is safe to specify this keyword for every structure under 
refinement? Can it have drawbacks (other than a wrong wavelength)?
As I understand it, the theoretical Cromer-Mann curve can differ from the 
experimental one, but it is still better than not changing the scattering 
factors at all.
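
(For anyone who wants to check the numbers: the log entries above are the 
coefficients of the four-Gaussian Cromer-Mann approximation 
f(s) = sum_i a_i * exp(-b_i * s^2) + c, with s = sin(theta)/lambda in 1/A. 
A small Python sketch - my reading of the table, not REFMAC code - using 
the Se row:

import math

SE_A = (17.0006, 5.8196, 3.9731, 4.3543)
SE_B = (2.4098, 0.2726, 15.2372, 43.8163)
SE_C = -1.0329  # with "anomalous wavelength", f' appears folded into c

def cromer_mann(s, a=SE_A, b=SE_B, c=SE_C):
    """Scattering factor in electrons at s = sin(theta)/lambda (1/A)."""
    return sum(ai * math.exp(-bi * s * s) for ai, bi in zip(a, b)) + c

for s in (0.0, 0.2, 0.5):
    print("s = %.1f : f = %6.2f e" % (s, cromer_mann(s)))

At s = 0 this gives about 30.1 e instead of the 34 electrons of neutral Se, 
i.e. roughly Z + f' with f' of about -4 at this wavelength - which would 
presumably explain the smaller red peaks.)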

Thank you very much!!

With best regards,
Ivan Shabalin, Ph.D.
Research Associate, University of Virginia
4-224 Jordan Hall, 1340 Jefferson Park Ave.
Charlottesville, VA 22908


Re: [ccp4bb] Archiving Images for PDB Depositions

2011-10-31 Thread Adrian Goldman
I have no problem with this idea as an opt-in. However, I loathe being forced 
to do things - for my own good or anyone else's. But unless I have read the 
tenor of this discussion completely wrongly, opt-in is precisely what is not 
being proposed.

Adrian Goldman 

Sent from my iPhone

On 31 Oct 2011, at 18:02, Jacob Keller j-kell...@fsm.northwestern.edu wrote:

> [...]
 


Re: [ccp4bb] Archiving Images for PDB Depositions

2011-10-31 Thread Jacob Keller
Pilot phase, opt-in--eventually, mandatory? Like structure factors?

Jacob



On Mon, Oct 31, 2011 at 11:29 AM, Adrian Goldman
adrian.gold...@helsinki.fi wrote:
> [...]





-- 
***
Jacob Pearson Keller
Northwestern University
Medical Scientist Training Program
email: j-kell...@northwestern.edu
***


Re: [ccp4bb] Archiving Images for PDB Depositions

2011-10-31 Thread Frank von Delft
"Loathe being forced to do things"? You mean, like being forced to use 
programs developed by others at no cost to yourself?


I'm in a bit of a time-warp here - how exactly do users think our 
current suite of software got to be as astonishingly good as it is?  10 
years ago people (non-developers) were saying exactly the same things - 
yet almost every talk on phasing and auto-building that I've heard ends 
up acknowledging the JCSG datasets.


Must have been a waste of time then, I suppose.

phx.




On 31/10/2011 16:29, Adrian Goldman wrote:

> [...]



Re: [ccp4bb] Archiving Images for PDB Depositions

2011-10-31 Thread Clemens Vonrhein
Dear Adrian,

On Mon, Oct 31, 2011 at 06:29:50PM +0200, Adrian Goldman wrote:
 I have no problem with this idea as an opt-in. However I loathe being forced 
 to do things - for my own good or anyone else's. But unless I read the tenor 
 of this discussion completely wrongly, opt-in is precisely what is not being 
 proposed.  

I understood it slightly differently - see Gerard Bricogne's points in

  https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1110&L=CCP4BB&F=&S=&P=363135

which sound very much like an opt-in? Such a starting point would be
very similar to what we had with initial PDB submission (optional for
publication) and then structure factor deposition.

Cheers

Clemens

-- 

***
* Clemens Vonrhein, Ph.D. vonrhein AT GlobalPhasing DOT com
*
*  Global Phasing Ltd.
*  Sheraton House, Castle Park 
*  Cambridge CB3 0AX, UK
*--
* BUSTER Development Group  (http://www.globalphasing.com)
***


Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-31 Thread Martin Kollmar
The point is that science is not stamp collecting. Therefore the first 
question should always be "Why". If you start with "What", the discussion 
immediately switches to technical issues - how many TB or PB, how much 
$/EUR, how much manpower - and all the intense discussion can be blown 
away by a single "Why". Nothing is free. But if it would help science and 
mankind, nobody would hesitate to spend millions of $/EUR.


Supporting software development / software developers is a different 
question. If that had been the first question asked, the answer would 
never have been to archive all datasets worldwide / for all deposited 
structures, but rather: how could we, the community, build up a 
resource with different kinds of problems (e.g. space groups, twinning, 
overlapping lattices, etc.)?


I still haven't got an answer to "Why".

Best regards,
Martin



On 31.10.2011 16:18, Oganesyan, Vaheh wrote:
> [...]

Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-31 Thread Gerard Bricogne
Dear Martin,

 Thank you for this very clear message about your views on this topic.
There is nothing like well-articulated dissenting views to force a real
assessment of the initial arguments, and you have certainly provided that.

 As your presentation is modular, I will interleave my comments with
your text, if you don't mind.

--
> Still, after hundreds (?) of emails on this topic, I haven't seen any
> convincing argument in favor of archiving data. The only convincing
> arguments are against, and they come from Gerard K and Tassos.
>
> Why?
> The question is not what to archive, but still why we should archive all
> the data.
>
> Because software developers need more data? Should we make all this
> effort and incur all these costs because 10 developers worldwide need
> the data for ALL protein structures? Do they really need so much data -
> wouldn't it be enough to build a repository of maybe 1000 datasets for
> development?

 A first impression is that your remark rather looks down on those "10
developers worldwide", a view not out of keeping with that of structural
biologists who have moved away from ground-level crystallography and view
the latter as a "mature technique" - a euphemism for saying that no further
improvements are likely, nor even necessary. As Clemens Vonrhein has just
written, it may be the very success of those developers that has given the
benefit of what software can do to users who don't have the faintest idea of
what it does, how it does it, or what its limitations are and how
to overcome them - and who therefore take it for granted.

 Another side of the "mature technique" kiss of death is the underlying
assumption that the demands placed on crystallographic methods are
themselves static, and nothing could be more misleading. We get caught time
and again by rushed shifts in technology, without proper precautions in case
the first adaptations of the old methods do not perform as well as they
might later. Let me quote an example: 3x3 CCD detectors. It was too quickly
and hurriedly assumed that, after correcting the images recorded on these
instruments for geometric distortions and flat-field response, one would get
images that could be processed as if they came from image plates (or film).
This turned out to be a mistake: corner effects were later diagnosed, which
were partially correctable by a position-dependent modulation factor,
applied for instance by XDS in response to the problem. Unfortunately, that
correction is not just detector-dependent and applicable to all datasets
recorded on a given detector, as it is related to a spatial variation in
the point-spread function - so you really need to reprocess each set of
images to determine the necessary corrections. The tragic thing is that, for
a typical resolution limit and detector distance, these corners cut into the
quick of your strongest secondary-structure-defining data. If you have kept
your images, you can try to recover from that; otherwise, you are stuck
with what can be seriously sub-optimal data. Imagine what this can do to SAD
anomalous differences when Bijvoet pairs fall on detector positions where
these corner effects are vastly different ... .
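
 (To make the idea concrete - a toy sketch only, not the actual XDS
correction: the repair amounts to dividing each pixel by a smooth,
position-dependent sensitivity map, here an invented one that dips
towards the corners of each detector tile:

import numpy as np

ny, nx = 512, 512
yy, xx = np.mgrid[0:ny, 0:nx]
ty, tx = (yy % 256) / 255.0, (xx % 256) / 255.0   # position within a tile
sensitivity = 1.0 - 0.15 * (2*ty - 1)**2 * (2*tx - 1)**2  # dips at corners

raw = np.random.poisson(100.0, (ny, nx)).astype(float)    # fake image
corrected = raw / sensitivity
print("largest correction factor: %.2f" % (1.0 / sensitivity).max())

The point being that such a map can only be determined and applied if the
images themselves are still available.)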

 Another example is that of the recent use of numerous microcrystals,
each giving a very small amount of data, to assemble datasets for solving
GPCR structures. The methods for doing this - indexing and integrating
such thin slices of data and getting the overall scaling to
behave - are still very rough. It would be pure insanity to throw these
images away and not count on better algorithms coming along to improve
the final data extractible from them.

--
> Does anyone really believe that our view of the actual problem, the
> function of the proteins, will change with the analysis of whatever
> scattering is still in the images but not used by today's software?
> Crystal structures are static snapshots, obtained under artificial
> conditions. In solution (still the physiological state) they might look
> different - not much, but at least far more dynamic. Does it therefore
> matter whether we know some sidechain positions better (in the crystal
> structure) when re-analysing the data? In turn, is our current software
> so bad that we would expect strong differences when re-analysing the
> data? No. And if the structures change upon reanalysis (more or less),
> who re-interprets the structures and re-writes the papers?

 I think that, rather than asking rhetorical questions about people's
beliefs regarding such a general question, one needs testimonies about
real-life situations. We have helped a great many academic groups in the
last 15 years: in every case, they ended up feeling really overjoyed that
they had kept their images when they had, and immensely regretful when they
hadn't. I noticed, for example, that your last PDB entry, 1LKX (2002), does
not have structure factor data associated with it. It is therefore impossible

Re: [ccp4bb] But...what's on my tyrosine?

2011-10-31 Thread Ivan Shabalin
I would just model it with one water, especially if the resolution is worse 
than 1.8 A (I don't think you have better, based on the map).
Only if the resolution were high and the R-factors low would I worry about 
this peak.
Regards,
Ivan


Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-31 Thread Kelly Daughtry
I believe that archiving the original images for published datasets could be
very useful if linked to the PDB.
I have downloaded SFs from the PDB to use for re-refinement of a
published model (when I thought the electron density maps were
misinterpreted) and personally had a different interpretation of the density
(ion vs small ligand). With that in mind, re-processing from the original
images could be useful for catching mistakes in processing (especially if a
high R-factor or low I/sigma is reported), albeit a small percentage of the
time.

As for difficult data sets, problematic cases, etc., I can see the
importance of their availability, given the preceding arguments.
They seem most useful to software developers. In that case, I would
suggest that software developers publicly request our difficult-to-process
images, or create their own repository; then they can store and use the
data as they like. I would happily upload a few data sets.
(Just a suggestion)
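A sketch of the sort of metadata such an upload might carry, so developers
could find the problem cases they care about (the field names are purely
illustrative, not any existing standard):

import json

# Hypothetical manifest for a deposited "difficult" dataset.
manifest = {
    "depositor": "depositor@example.org",
    "detector": "ADSC Q315",
    "wavelength_A": 0.9793,
    "n_images": 360,
    "problems": ["overlapping lattices", "twinning", "anisotropy"],
    "related_pdb_id": None,          # unpublished / not yet deposited
}
print(json.dumps(manifest, indent=2))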

Best Wishes,
Kelly Daughtry

***
Kelly Daughtry, Ph.D.
Post-Doctoral Fellow, Raetz Lab
Biochemistry Department
Duke University
Alex H. Sands, Jr. Building
303 Research Drive
RM 250
Durham, NC 27710
P: 919-684-5178
***


On Mon, Oct 31, 2011 at 12:01 PM, Martin Kollmar m...@nmr.mpibpc.mpg.de wrote:

 The point is that science is not collecting stamps. Therefore the first
 question should always be "Why". If you start with "What", the discussion
 immediately switches to technical issues: how many TB or PB, how many $/€,
 how much manpower. And all that intense discussion will be blown away by a
 single "Why". Nothing is for free. But if it would help science and
 mankind, nobody would hesitate to spend millions of $/€.

 Supporting software development / software developers is a different
 question. If that had been the first question asked, the answer would
 never have been to archive all datasets worldwide for all deposited
 structures, but rather: how could we, the community, build up a resource
 covering different kinds of problems (e.g. space groups, twinning,
 overlapping lattices, etc.)?

 I still haven't got an answer to "Why".

 Best regards,
 Martin



 Am 31.10.2011 16:18, schrieb Oganesyan, Vaheh:


 I was hesitant to add my opinion so far because I'm more used to
 listening to this forum than to telling others what I think.
 Why and what to deposit are absolutely interconnected. Once you decide
 why you want to do it, you will probably know the best format, and *vice
 versa*.
 Whether this deposition of raw images will or will not help future
 understanding of the biology, I'm not sure.
 But storing those difficult datasets to help future software development
 sounds really far-fetched. It assumes that future crystallographers will
 never grow crystals that deliver difficult datasets. If that is the case,
 and in 10-20-30 years the next generation is growing much better crystals,
 then they won't need such software development.
 If that is not the case, and once in a while (or more often) they get
 something out of the ordinary, then software developers will take those
 datasets and develop whatever they need to handle such cases.

 Am I missing a point of discussion here?

 Regards,

   Vaheh




 -Original Message-
 From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK]
 On Behalf Of Robert Esnouf
 Sent: Monday, October 31, 2011 10:31 AM
 To: CCP4BB@JISCMAIL.AC.UK
 Subject: Re: [ccp4bb] To archive or not to archive, that's the question!

 Dear All,

 As someone who recently left crystallography for sequencing, I
 should modify Tassos's point...

 "A full data-set is a few terabytes, but post-processing
 reduces it to sub-Gb size."

 My experience from HiSeqs is that "full" here means the
 base calls - equivalent to the unmerged HKLs - hardly raw
 data. NGS (short-read) sequencing is an imaging technique, and
 the images are more like 100TB for a 15-day run on a single
 flow cell. The raw base calls are about 5TB. The compressed,
 mapped data (a BAM file, for a human genome at 30x coverage) is
 about 120GB. It is only a variant call file (VCF, differences
 from a stated human reference genome) that is sub-Gb, and these
 files are - unsurprisingly - unsuited to detailed statistical
 analysis. Also, $1k is not yet an economic cost...
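 For concreteness, the reduction factors implied by those figures - a
 quick back-of-the-envelope in Python, order-of-magnitude only:

images_TB    = 100.0     # raw images, 15-day run, one flow cell
basecalls_TB = 5.0       # raw base calls
bam_GB       = 120.0     # mapped 30x human genome (BAM)
vcf_GB       = 1.0       # variant calls (VCF), sub-Gb

print(images_TB / basecalls_TB)      # ~20x  images     -> base calls
print(basecalls_TB * 1000 / bam_GB)  # ~42x  base calls -> BAM
print(bam_GB / vcf_GB)               # ~120x BAM        -> VCF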

 The DNA information capacity in a single human body dwarfs the
 entire world disk capacity, so storing DNA is a no brainer
 here. Sequencing groups are making very hard-nosed economic
 decisions about what to store - indeed it is a source of
 research in itself - but the scale of the problem is very much
 bigger.

 My tuppence ha'penny is that depositing raw images along
 with everything else in the PDB is a nice idea but would have
 little impact on science (human/animal/plant health or
 understanding of biology).

 1) If confined to structures in the 

Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-31 Thread Gerard Bricogne
Dear Martin,

 First of all I would like to say that I regret having made my earlier
remark about 1LKX, and I apologise if you read it as a personal one - I
just saw it as an example of a dataset it might have been useful to revisit
if the data had been available in any form. I am sure that there are many
skeletons in many cupboards, including my own :-) .

 Otherwise, as the discussion does seem to be refocusing on the very
initial proposal in gestation within the IUCr's DDDWG - i.e. voluntary
involvement of depositors and of synchrotrons, so that questions of
logistics and cost can be answered in the light of empirical evidence - it
seems your "Why" question is the only one this proposal leaves unanswered.

 In this respect I wonder how you view the two examples I gave in my
reply to your previous message, namely the corner-effects problem and the
re-development of methods for collating data from numerous small, poorly
diffracting crystals, as was done in the recent solution of GPCR
structures. There also remains the example I have cited from the beginning,
namely the integration of images displaying several overlapping lattices.


 With best wishes,
 
  Gerard.


[ccp4bb] Postdoctoral position, bacterial membrane protein

2011-10-31 Thread Mark A Saper
Postdoctoral Position Available
Department of Biological Chemistry, University of Michigan 
Position open immediately for a highly motivated postdoctoral researcher to 
investigate the function and structure of an intrinsic membrane protein from 
pathogenic bacteria involved in capsule polysaccharide export. The researcher 
will develop an expression system to purify sufficient protein for 
crystallization and solid-state NMR experiments, as well as investigate 
function at the cellular level. See our recent paper 
(http://dx.doi.org/10.1021/bi101869h), which describes another protein from 
the same operon. A doctoral degree in biochemistry, microbiology, or a 
related field is required, preferably with experience in membrane proteins. 
Please email a CV, a short description of your research accomplishments and 
career plans, and the names of three references to Prof. Mark A. Saper, 
sa...@umich.edu.


_
Mark A. Saper, Ph.D.
Associate Professor of Biological Chemistry
University of Michigan

Biophysics, 3040 Chemistry Building
930 N University Ave
Ann Arbor MI  48109-1055 U.S.A.

sa...@umich.edu phone (734) 764-3353 fax (734) 764-3323
http://www.biochem.med.umich.edu/?q=saper   http://www.strucbio.umich.edu/