Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-31 Thread Anastassis Perrakis

Dear all,

The discussion about keeping primary data, and what level of data can
be considered 'primary', has - rather unsurprisingly - come up also in
areas other than structural biology.
An example is next-generation sequencing. A full dataset is a few
terabytes, but post-processing reduces it to sub-Gb size. However, the
post-processed data, as in our case, have suffered the "inadequacy of
computational reduction" ... At least our institute has decided to back
up the primary data in triplicate. For that reason our facility bought
three -80 freezers - one on site in the basement, one at the top floor,
and one off-site - and they keep the DNA to be sequenced. A sequencing
run is already sub-1k$ and it will not become more expensive. So, if
it's important, do it again. It's cheaper and it's better.


At first sight, that does not apply to MX. Or does it?

So, maybe the question is not "To archive or not to archive" but "What
to archive".


(similarly, it never crossed my mind whether I should be or not be - I
always wondered what to be)


A.


On Oct 30, 2011, at 11:59, Kay Diederichs wrote:


At 20:59, Jrh wrote:
...

So: universities are now establishing their own institutional
repositories, driven largely by the Open Access demands of funders. For
these to host raw datasets that underpin publications is a reasonable
role in my view, and indeed they already have this category in the
University of Manchester eScholar system, for example. I am set to
explore locally here whether they would accommodate all our lab's raw
X-ray image datasets per annum that underpin our published crystal
structures.

It would be helpful if readers of this CCP4bb could kindly also
explore with their own universities whether they have such an
institutional repository and whether raw data sets could be
accommodated. Please do email me off-list with this information if you
prefer, but within the CCP4bb is also good.



Dear John,

I'm pretty sure that there exists no consistent policy to provide an
institutional repository for deposition of scientific data at German
universities or Max-Planck institutes or Helmholtz institutions; at
least I have never heard of anything like this. More specifically, our
University of Konstanz certainly does not have the infrastructure to
provide this.

I don't think that Germany is the only exception to any rule of
availability of institutional repositories. Rather, I'm almost amazed
that British and American institutions seem to support this.


Thus I suggest not focusing exclusively on official institutional
repositories, but exploring alternatives: distributed filestores like
Google's BigTable, Bittorrent or others might be just as suitable -
check out http://en.wikipedia.org/wiki/Distributed_data_store .
I guess that any crystallographic lab could easily sacrifice/donate a
TB of storage for the purposes of this project in 2011 (and maybe 2 TB
in 2012, 3 in 2013, ...), but clearly the level of work to set this up
should be kept as low as possible (a bittorrent daemon seems simple
enough).
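
To make that concrete: a minimal sketch of the first step such a daemon
would need - packaging one dataset directory as a torrent - using the
Python bindings of libtorrent (libtorrent-rasterbar). The directory name
and tracker URL below are hypothetical placeholders, not an agreed
convention:

  # Sketch: package one raw-image dataset for federated seeding.
  # Assumes the python bindings of libtorrent (libtorrent-rasterbar);
  # the dataset directory and tracker URL are hypothetical placeholders.
  import libtorrent as lt

  fs = lt.file_storage()
  lt.add_files(fs, "raw_images/dataset_001")       # directory of diffraction images
  t = lt.create_torrent(fs)
  t.add_tracker("udp://tracker.example.org:6969")  # hypothetical tracker
  t.set_creator("raw-data archiving pilot")
  lt.set_piece_hashes(t, "raw_images")             # hash pieces, relative to parent dir
  with open("dataset_001.torrent", "wb") as f:
      f.write(lt.bencode(t.generate()))            # write the .torrent metadata file

Each participating lab would then seed the resulting .torrent files with
any standard bittorrent client.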


Just my 2 cents,

Kay






Anastassis (Tassos) Perrakis, Principal Investigator / Staff Member
Department of Biochemistry (B8)
Netherlands Cancer Institute,
Dept. B8, 1066 CX Amsterdam, The Netherlands
Tel: +31 20 512 1951 Fax: +31 20 512 1954 Mobile / SMS: +31 6 28 597791






Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-31 Thread Gerard Bricogne
Dear Tassos,

 It is unclear whether this thread will be able to resolve your deep 
existential concerns about what to be, but you do introduce a couple of
interesting points: (1) raw data archiving in areas (of biology) other than
structural biology, and (2) archiving the samples rather than the verbose
data that may have been extracted from them.

 Concerning (1), I am grateful to Peter Keller here in my group for
pointing me towards the Trace Archive of DNA sequences in mid-August,
when we were reviewing for the n-th time the issue of raw data
deposition under discussion in this thread, and its advantages over
keeping only the derived data extracted from the raw. He found an
example at

http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=retrieve&val=12345&dopt=trace&size=1&retrieve=Submit

You can check the "Quality Score" box below the trace, and this will
refresh the display to give a visual estimate of the reliability of the
sequence. There is clearly a problem around position 210 that would not
have been adequately dealt with by just retaining the most probable
sequence. In this context, it has been found worthwhile to preserve the
raw data, to make it possible to audit derived data against them. This
is at least a very simple example of what you were referring to when
you wrote about the "inadequacy of computational reduction". In the MX
context, this is rather similar to the contamination of integrated
intensities by spots from parasitic lattices (which would still affect
unmerged intensities, by the way - so upgrading the PDB structure
factor file to unmerged data would take care of over-merging, but not
of that contamination).

 Concerning (2), I greatly doubt there would be an equivalent for MX:
few people would have spare crystals to put to one side for a future
repeat of a diffraction experiment (except in the case of
lysozyme/insulin/thaumatin!). I can remember an esteemed colleague
arguing 4-5 years ago that if you want to improve a deposited
structure, you could simply repeat the work from scratch - a sensible
position from the philosophical point of view (science being the art of
the repeatable), but far less sensible under conditions of limited
resources, and given also the difficulties of reproducing crystals. The
real-life situation is more a "Carpe diem" one: archive what you have,
as you may never see it again! Otherwise one would easily get drawn
into the same kind of unrealistic expectations as people who get
themselves frozen in liquid N2, with their blood replaced by DMSO,
hoping to be brought back to life some day in the future ;-) .


 With best wishes,
 
  Gerard.


Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-31 Thread Martin Kollmar
Still, after hundreds (?) of emails on this topic, I haven't seen any
convincing argument in favor of archiving data. The only convincing
arguments are against, and they are from Gerard K and Tassos.


Why?
The question is not what to archive, but still why we should archive
all the data.


Because software developers need more data? Should we make all these
efforts and incur these costs because 10 developers worldwide need the
data for ALL protein structures? Do they really need so much data;
wouldn't it be enough to build a repository of maybe 1000 datasets for
development?


Does anyone really believe that our view of the actual problem, the
function of the proteins, changes with the analysis of whatever
scattering is still in the images but not used by today's software?
Crystal structures are static snapshots, obtained under artificial
conditions. In solution (still the physiological state) they might look
different - not much, but at least far more dynamic. Does it therefore
matter whether we know some side-chain positions better (in the crystal
structure) when re-analysing the data? In turn, are our current
software programs so bad that we would expect strong differences when
re-analysing the data? No. And if the structures change upon reanalysis
(more or less), who re-interprets the structures, who re-writes the
papers?


There are many, many cases where researchers re-did structures (or did
structures closely related to already available ones - mutants,
structures of closely related species, etc.), also after 10 years. I
guess they used the latest software in these different cases, thus
incorporating all the software development of those 10 years. And are
the structures really different (beyond the introduced changes,
mutations, etc.)? Different because of the software used?


The comparison with next-generation sequencing data is useful here, but
only in the sense Tassos explained. Well, of course not every position
in the genomic sequence is fixed. Therefore it is sometimes useful to
look at the original data (the traces, as Gerard B pointed out). But we
already know that every single organism is different (especially
eukaryotes), and therefore it is absolutely enough to store the
computationally reduced and merged data. If one needs better,
position-specific data, sequencing and comparing single species becomes
necessary, as in the ENCODE project, the sequencing of about 100
Saccharomyces strains, the sequencing of 1000 Arabidopsis strains, etc.
Discussions about single positions are useless if they are not
statistically relevant; they need to be analysed in the context of
populations, large cohorts of patients, etc. If we want personalized
medicine adapted to personal genomes, we would also need personal sets
of protein structures, which we cannot provide yet. Therefore, storing
the DNA in the freezer is better and cheaper than storing all the
sequencing raw data. Do you think a reviewer re-sequences, or
re-assembles, or re-annotates a genome, even if access to the raw reads
were available? If you trust these data, why don't we trust our
structure factors? Do you trust electron microscopy images, movies of
GFP-tagged proteins? Do you think what is presented for a single or a
few visible cells is also found in all cells?


And now, how many of you (if not everybody) use structures from yeast,
Drosophila, mouse, etc. as MODELS for human proteins? If we stick to
this thinking, who would care about potential minor changes in the
structures upon re-analysis (and in the light of this discussion,
arguing about specific genomic sequence positions becomes unimportant
as well)?


Is any of the archived data useful without manual evaluation upon
archiving? This is especially relevant for structures not solved yet.
Do the images belong to the structure factors? If only images are
available, where is the corresponding protein sequence, has it been
sequenced, what was in the buffer/crystallization condition, what was
used during protein purification, what was the intention of the
crystallization - e.g. a certain functional state that the protein was
forced into by artificial conditions, etc. etc.? Who wants to evaluate
that, and how? The question is not whether we could do it. We could do
it, but wouldn't it advance science far more if we spent the time and
money on new projects rather than on evaluation, administration, etc.?


Be honest: how many of you have really, and completely, reanalysed your
own data, deposited 10 years ago, with the latest software? What
changes did you find? Did you have to re-write the discussions in your
former publications? Do you think that the changes justify the efforts
and costs of worldwide archiving of all data?


Well, there are always single cases (and some have been mentioned in
earlier emails) where these things matter or mattered. But does this
really justify all the future

Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-31 Thread Robert Esnouf
Dear All,

As someone who recently left crystallography for sequencing, I 
should modify Tassos's point...

"A full dataset is a few terabytes, but post-processing 
reduces it to sub-Gb size."

My experience from HiSeqs is that "full" here means the 
base calls - equivalent to the unmerged HKLs - hardly raw 
data. NGS (short-read) sequencing is an imaging technique, and 
the images are more like 100 TB for a 15-day run on a single 
flow cell. The raw base calls are about 5 TB. The compressed, 
mapped data (a BAM file, for a human genome at 30x coverage) are 
about 120 GB. It is only a variant call file (VCF, the differences 
from a stated human reference genome) that is sub-Gb, and these 
files are - unsurprisingly - unsuited to detailed statistical 
analysis. Also, $1k is not yet an economic cost...
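
(Purely to illustrate the scale of that reduction, here is a quick
Python sketch of the stage-to-stage factors, using the approximate
sizes quoted above and taking "sub-Gb" as ~1 GB - illustrative
arithmetic only:)

  # Approximate pipeline sizes quoted above, in bytes (illustrative only).
  stages = [
      ("images (15-day run)",    100e12),  # ~100 TB
      ("raw base calls",           5e12),  # ~5 TB
      ("mapped BAM (30x human)", 120e9),   # ~120 GB
      ("variant calls (VCF)",      1e9),   # sub-Gb, taken as ~1 GB
  ]
  for (name, size), (next_name, next_size) in zip(stages, stages[1:]):
      print(f"{name} -> {next_name}: {size / next_size:.0f}x smaller")
  # images (15-day run) -> raw base calls: 20x smaller
  # raw base calls -> mapped BAM (30x human): 42x smaller
  # mapped BAM (30x human) -> variant calls (VCF): 120x smaller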

The DNA information capacity in a single human body dwarfs the 
entire world's disk capacity, so storing DNA is a no-brainer 
here. Sequencing groups are making very hard-nosed economic 
decisions about what to store - indeed it is a source of 
research in itself - but the scale of the problem is very much 
bigger.

My tuppence ha'penny is that depositing raw images along 
with everything else in the PDB is a nice idea but would have 
little impact on science (human/animal/plant health or 
understanding of biology).

1) If confined to structures in the PDB, the images would just 
be the ones giving the final best data - hence the ones least 
likely to have been problematic. I'd be more interested in 
SFs/maps for looking at ligand-binding etc...

2) Unless this were done before paper acceptance they would be 
of little use to referees seeking to review important 
structural papers. I'd like to see PDB validation reports 
(which could include automated data processing, perhaps culled 
from synchrotron sites, SFs and/or maps) made available to 
referees in advance of publication. This would be enabled by 
deposition, but could be achieved in other ways.

3) The datasets of interest to methods developers are unlikely 
to be the ones deposited. They should be in contact with 
synchrotron archives directly. Processing multiple lattices is 
a case in point here.

4) Remember the average consumer of a PDB file is not a 
crystallographer. More likely to be a graduate student in a 
clinical lab. For him/her things like occupancies and B-
factors are far more serious concerns... I'm not trivializing 
the issue, but importance is always relative. Are there 
outsiders on the panel to keep perspective?

Robert


--

Dr. Robert Esnouf,
University Research Lecturer, ex-crystallographer
and Head of Research Computing,
Wellcome Trust Centre for Human Genetics,
Roosevelt Drive, Oxford OX3 7BN, UK

Emails: rob...@strubi.ox.ac.uk   Tel: (+44) - 1865 - 287783
and rob...@esnouf.comFax: (+44) - 1865 - 287547



Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-31 Thread Oganesyan, Vaheh

I was hesitant to add my opinion so far because I'm more used to
listening to this forum than to telling others what I think.
Why and what to deposit are absolutely interconnected. Once you decide
why you want to do it, then you will probably know what the best format
will be, and vice versa.
Whether this deposition of raw images will or will not help in future
understanding of the biology, I'm not sure.
But to store those difficult datasets to help future software
development sounds really far-fetched. This assumes that in the future
crystallographers will never grow crystals that deliver difficult
datasets. If that is the case, and in 10-20-30 years the next
generation will be growing much better crystals, then they won't need
such software development.
If that is not the case, and once in a while (or more often) they will
get something out of the ordinary, then software developers will take
those datasets and develop whatever they need to develop to handle such
cases.

Am I missing a point of discussion here?

Regards,

 Vaheh








Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-31 Thread Clemens Vonrhein
Dear Vaheh,

On Mon, Oct 31, 2011 at 03:18:07PM +0000, Oganesyan, Vaheh wrote:
 But to store those difficult datasets to help future software
 development sounds really far-fetched.

As far as I understand the general plan, that would be a second stage
(deposit all datasets) - the first stage would be the datasets related
directly to a given PDB entry.

 This assumes that in the future crystallographers will never grow
 crystals that deliver difficult datasets.

Oh sure they will. And lots of those datasets will be available to
developers ... being thrown a difficult problem under pressure is a
very good way to get ideas, think outside the box, etc. However,
developing solid algorithms is better done in a less hectic
environment, with a large collection of similar problems (changing only
one parameter at a time) on which to test a new method.

 If that is the case, and in 10-20-30 years the next generation will
 be growing much better crystals, then they won't need such software
 development.

They'll grow better crystals for the type of project we're currently
struggling with, sure. But we'll still get poor crystals for projects
we don't even attempt or tackle right now.

Software development is a slow process, often working on a different
timescale than the typical structure-solution project (obviously there
are exceptions). So planning ahead for that time will prepare us.

And yes, it will have an impact on the biology then. It's not just the
here and now (and the next grant, the next high-profile paper) that we
should be thinking about.

 Am I missing a point of discussion here?

One small point maybe: there are very few developers out there - but a
very large number of users who benefit from what they have done. Often
the work is not very visible ("It's just pressing a button or two ...
so it must be trivial!") - which is a good thing: it has to be simple,
robust, automatic and usable.

I think if a large enough number of developers consider depositing
images a very useful resource for their future development (and
therefore a future benefit to a large number of users), it should be
seriously considered, even if some of the advertised benefits have to
be taken on trust.

Past developments in data processing have had a big impact on a lot of
projects - high-profile or just the standard PhD-student nightmare -
with often small return for the developers in terms of publications,
grants or even citations (main paper or supplementary material).

So maybe in the spirit of the festive season it is time to consider
giving a little bit back? What is there to lose? Another 20 minutes of
additional deposition work for the user, in return for maybe/hopefully
saving a whole project 5 years down the line? Not a bad investment, it
seems to me ...

Cheers

Clemens

-- 

***
* Clemens Vonrhein, Ph.D. vonrhein AT GlobalPhasing DOT com
*
*  Global Phasing Ltd.
*  Sheraton House, Castle Park 
*  Cambridge CB3 0AX, UK
*--
* BUSTER Development Group  (http://www.globalphasing.com)
***




Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-31 Thread Gerard Bricogne
Dear Martin,

 Thank you for this very clear message about your views on this topic.
There is nothing like well-articulated dissenting views to force a real
assessment of the initial arguments, and you have certainly provided
that.

 As your presentation is modular, I will interleave my comments with
your text, if you don't mind.

--
 Still, after hundreds (?) of emails on this topic, I haven't seen any
 convincing argument in favor of archiving data. The only convincing
 arguments are against, and they are from Gerard K and Tassos.

 Why?
 The question is not what to archive, but still why we should archive
 all the data.

 Because software developers need more data? Should we make all these
 efforts and incur these costs because 10 developers worldwide need the
 data for ALL protein structures? Do they really need so much data;
 wouldn't it be enough to build a repository of maybe 1000 datasets for
 development?

 A first impression is that your remark rather looks down on those 10
developers worldwide, a view not out of keeping with that of structural
biologists who have moved away from ground-level crystallography and
view the latter as a "mature technique" - a euphemism for saying that
no further improvements are likely or even necessary. As Clemens
Vonrhein has just written, it may be the very success of those
developers that has brought the benefit of what software can do to
users who don't have the faintest idea of what it does, nor of how it
does it, nor of what its limitations are and how to overcome them - and
who therefore take it for granted.

 Another side of the "mature technique" kiss of death is the underlying
assumption that the demands placed on crystallographic methods are
themselves static, and nothing could be more misleading. We get caught
time and again by rushed shifts in technology, without proper
precautions in case the first adaptations of the old methods do not
perform as well as they might later. Let me quote an example: 3x3 CCD
detectors. It was too quickly and hurriedly assumed that, after
correcting the images recorded on these instruments for geometric
distortions and flat-field response, one would get images that could be
processed as if they came from image plates (or film). This turned out
to be a mistake: corner effects were later diagnosed that were
partially correctable by a position-dependent modulation factor,
applied for instance by XDS in response to the problem. Unfortunately,
that correction is not just detector-dependent and applicable to all
datasets recorded on a given detector, as it is related to a spatial
variation in the point-spread function - so you really need to
reprocess each set of images to determine the necessary corrections.
The tragic thing is that for a typical resolution limit and detector
distance, these corners cut into the quick of your strongest
secondary-structure-defining data. If you have kept your images, you
can try to recover from that; otherwise, you are stuck with what can be
seriously sub-optimal data. Imagine what this can do to SAD anomalous
differences when Bijvoet pairs fall on detector positions where these
corner effects are vastly different ... .

 Another example is that of the recent use of numerous microcrystals,
each giving a very small amount of data, to assemble datasets for
solving GPCR structures. The methods for doing this - for getting the
indexing and integration of such thin slices of data, and getting the
overall scaling to behave - are still very rough. It would be pure
insanity to throw these images away and not count on better algorithms
coming along to improve the final data extractible from them.

--
 Does anyone really believe that our view of the actual problem, the
 function of the proteins, changes with the analysis of whatever
 scattering is still in the images but not used by today's software?
 Crystal structures are static snapshots, obtained under artificial
 conditions. In solution (still the physiological state) they might
 look different - not much, but at least far more dynamic. Does it
 therefore matter whether we know some side-chain positions better (in
 the crystal structure) when re-analysing the data? In turn, are our
 current software programs so bad that we would expect strong
 differences when re-analysing the data? No. And if the structures
 change upon reanalysis (more or less), who re-interprets the
 structures, who re-writes the papers?

 I think that, rather than asking rhetorical questions about people's
beliefs regarding such a general question, one needs testimonies about
real-life situations. We have helped a great many academic groups in
the last 15 years: in every case, they ended up feeling really
overjoyed that they had kept their images when they had, and immensely
regretful when they hadn't. I noticed, for example, that your last PDB
entry, 1LKX (2002), does not have structure factor data associated with
it. It is therefore impossible



Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-29 Thread Jrh
Dear Gerard K,
Many thanks indeed for this.
Like Gerard Bricogne, you also indicate that the decentralised location 
option is 'quite simple and very cheap in terms of centralised cost'. 
The SR facilities worldwide, I hope, can surely follow the lead taken 
by Diamond Light Source and PaN, the European Consortium of SR and 
Neutron Facilities: keep their data archives, and also assist authors 
with the DOI registration process for those datasets that result in 
publication. Linking to these DOIs from the PDB, for example, is, as 
you confirm, straightforward.

Gerard B's pressing of the above approach via the 'Pilot project' 
within the IUCr DDD WG's various discussions, with a nicely detailed 
plan, brought home to me the merit of the above approach for the even 
greater challenge of raw data archiving for chemical crystallography, 
both in terms of the number of datasets and because the SR facilities' 
role there is much smaller. IUCr Journals also note the challenge of 
moving large quantities of data around, i.e. if the Journals were to 
try to host everything for chemical crystallography, thus becoming 
'the centre' for these datasets.

So: universities are now establishing their own institutional 
repositories, driven largely by the Open Access demands of funders. For 
these to host raw datasets that underpin publications is a reasonable 
role in my view, and indeed they already have this category in the 
University of Manchester eScholar system, for example. I am set to 
explore locally here whether they would accommodate all our lab's raw 
X-ray image datasets per annum that underpin our published crystal 
structures.

It would be helpful if readers of this CCP4bb could kindly also explore 
with their own universities whether they have such an institutional 
repository and whether raw data sets could be accommodated. Please do 
email me off-list with this information if you prefer, but within the 
CCP4bb is also good.

Such an approach involving institutional repositories would also work, 
of course, for the 25% of MX structures that come from non-SR datasets.

All the best for a splendid PDB40 Event.

Greetings,
John
Prof John R Helliwell DSc 
 
 

On 28 Oct 2011, at 22:02, Gerard DVD Kleywegt ger...@xray.bmc.uu.se wrote:

 Hi all,
 
 It appears that during my time here at Cold Spring Harbor, I have missed a 
 small debate on CCP4BB (in which my name has been used in vain to boot).
 
 I have not yet had time to read all the contributions, but would like to make 
 a few points that hopefully contribute to the discussion and keep it with two 
 feet on Earth (as opposed to La La Land where the people live who think that 
 image archiving can be done on a shoestring budget... more about this in a 
 bit).
 
 Note: all of this is on personal title, i.e. not official wwPDB gospel. Oh, 
 and sorry for the new subject line, but this way I can track the replies more 
 easily.
 
 It seems to me that there are a number of issues that need to be separated:
 
 (1) the case for/against storing raw data
 (2) implementation and resources
 (3) funding
 (4) location
 
 I will say a few things about each of these issues in turn:
 
 ---
 
 (1) Arguments in favour and against the concept of storing raw image data, as 
 well as possible alternative solutions that could address some of the issues 
 at lower cost or complexity.
 
 I realise that my views carry a weight=1.0 just like everybody else's, and 
 many of the arguments and counter-arguments have already been made, so I will 
 not add to these at this stage.
 
 ---
 
 (2) Implementation details and required resources.
 
 If the community should decide that archiving raw data would be 
 scientifically useful, then it has to decide how best to do it. This will 
 determine the level of resources required to do it. Questions include:
 
 - what should be archived? (See Jim H's list from (a) to (z) or so.) An 
 initial plan would perhaps aim for the images associated with the data used 
 in the final refinement of deposited structures.
 
 - how much data are we talking about per dataset/structure/year?
 
 - should it be stored close to the source (i.e., responsibility and costs for 
 depositors or synchrotrons) or centrally (i.e., costs for some central 
 resource)? If it is going to be stored centrally, the cost will be 
substantial. For example, at the EBI - the European Bioinformatics Institute - 
 we have 15 PB of storage. We pay about 1500 GBP (~2300 USD) per TB of storage 
 (not the kind you buy at Dixons or Radio Shack, obviously). For stored data, 
 we have a data-duplication factor of ~8, i.e. every file is stored 8 times 
 (at three data centres, plus back-ups, plus a data-duplication centre, plus 
 unreleased versus public versions of the archive). (Note - this is only for 
 the EBI/PDBe! RCSB and PDBj will have to acquire storage as well.) Moreover, 
 disks have to be housed in a building (not free!), with cooling, security 
 measures, security staff, 

Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-29 Thread Herbert J. Bernstein

One important issue to address is how to deal with the perceived
reliability issues of the federated model, and how to start to approach
the higher reliability of the centralized model described by Gerard K,
but without incurring what seem at present to be unacceptable costs.
One answer comes from the approach followed in communications systems.
If the probability of data loss in each communication subsystem is,
say, 1/1000, then the probability of data loss in two independent
copies of the same lossy system is only 1/1,000,000. We could apply
that lesson to the federated data image archive model by asking each
institution to partner with a second independent, and hopefully
geographically distant, institution, with an agreement for each to host
copies of the other's images. If we restrict that duplication protocol,
at least at first, to those images strongly related to an actual
publication/PDB deposition, the incremental cost of greatly improved
reliability would be very low, with no disruption of the basic
federated approach being suggested.
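
A minimal sketch of that reliability arithmetic (hypothetical Python;
the key assumption is that the partner repositories fail independently,
which is what the geographic separation is meant to make plausible):

  # Probability that ALL replicas of a dataset are lost, if each
  # repository loses it independently with probability p (hypothetical).
  def loss_probability(p: float, n_replicas: int) -> float:
      return p ** n_replicas

  for n in (1, 2, 3):
      print(f"{n} replica(s): {loss_probability(1e-3, n):.0e}")
  # 1 replica(s): 1e-03
  # 2 replica(s): 1e-06
  # 3 replica(s): 1e-09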

Please note that I am not suggesting that institutional repositories
will have 1/1000 data loss rates, but they will certainly have some
data loss rate, and this modest change in the proposal would help to
greatly lower the impact of that data loss rate and allow us to go
forward with greater confidence.

Regards,
  Herbert


At 7:53 AM +0100 10/29/11, Jrh wrote:

Dear Gerard K,
Many thanks indeed for this.
Like Gerard Bricogne you also indicate that the location option
being the decentralised one is 'quite simple and very cheap in terms
of centralised cost'. The SR Facilities worldwide can, I hope, surely
follow the lead taken by Diamond Light Source and PaN, the European
Consortium of SR and Neutron Facilities, and keep their data
archives and also assist authors with the doi registration process
for those datasets that result in publication. Linking to these dois
from the PDB, for example, is, as you confirm, straightforward.


Gerard B's pressing of the above approach via the 'Pilot project'
within the IUCr DDD WG's various discussions, with a nicely detailed
plan, brought home to me the merit of the above approach for the
even greater challenge of raw data archiving for chemical
crystallography, both in terms of the number of datasets and the SR
Facilities' much smaller role. IUCr Journals also note the
challenge of moving large quantities of data around, i.e. if the
Journals were to try to host everything for chemical
crystallography, they would thus become 'the centre' for these
datasets.


So:-  Universities are now establishing their own institutional 
repositories, driven largely by Open Access demands of funders. For 
these to host raw datasets that underpin publications is a 
reasonable role in my view and indeed they already have this 
category in the University of Manchester eScholar system, for 
example.  I am set to explore locally here whether they would 
accommodate all our Lab's raw X-ray image datasets per annum that 
underpin our published crystal structures.


It would be helpful if readers of this CCP4bb could kindly also 
explore with their own universities if they have such an 
institutional repository and if raw data sets could be accommodated. 
Please do email me off list with this information if you prefer but 
within the CCP4bb is also good.


Such an approach involving institutional repositories would also 
work of course for the 25% of MX structures that are based on non-SR 
datasets.


All the best for a splendid PDB40 Event.

Greetings,
John
Prof John R Helliwell DSc




Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-29 Thread Jrh
Dear Herbert,
I imagine it likely that e.g. the Univ Manchester eScholar system will have in 
place duplicate storage for the reasons you outline below. However, for it to be 
geographically distant is, to my reckoning, less likely, though still possible. I 
will add that further query to my first query to my eScholar user support re 
dataset sizes and doi registration.
Greetings,
John
Prof John R Helliwell DSc 
 
 


Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-29 Thread Herbert J. Bernstein

Dear John,

  Most sound institutional data repositories use some form of
off-site backup.  However, not all of them do, and the
standards of reliability vary.  The advantages of an explicit
partnering system are both practical and psychological.  The
practical part is the major improvement in reliability --
even if we start at 6 nines, 12 nines is better.  The
psychological part is that members of the community can
feel reassured that reliability has indeed been improved to
levels at which they can focus on other, more scientific
issues, instead of the question of reliability.

  Regards,
Herbert

=
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

 +1-631-244-3035
 y...@dowling.edu
=


[ccp4bb] To archive or not to archive, that's the question!

2011-10-28 Thread Gerard DVD Kleywegt

Hi all,

It appears that during my time here at Cold Spring Harbor, I have missed a 
small debate on CCP4BB (in which my name has been used in vain to boot).


I have not yet had time to read all the contributions, but would like to make 
a few points that hopefully contribute to the discussion and keep it with two 
feet on Earth (as opposed to La La Land where the people live who think that 
image archiving can be done on a shoestring budget... more about this in a 
bit).


Note: all of this is in a personal capacity, i.e. not official wwPDB gospel. Oh, 
and sorry for the new subject line, but this way I can track the replies more 
easily.


It seems to me that there are a number of issues that need to be separated:

(1) the case for/against storing raw data
(2) implementation and resources
(3) funding
(4) location

I will say a few things about each of these issues in turn:

---

(1) Arguments in favour and against the concept of storing raw image data, as 
well as possible alternative solutions that could address some of the issues 
at lower cost or complexity.


I realise that my views carry a weight=1.0 just like everybody else's, and 
many of the arguments and counter-arguments have already been made, so I will 
not add to these at this stage.


---

(2) Implementation details and required resources.

If the community should decide that archiving raw data would be scientifically 
useful, then it has to decide how best to do it. This will determine the level 
of resources required to do it. Questions include:


- what should be archived? (See Jim H's list from (a) to (z) or so.) An 
initial plan would perhaps aim for the images associated with the data used in 
the final refinement of deposited structures.


- how much data are we talking about per dataset/structure/year?

- should it be stored close to the source (i.e., responsibility and costs for 
depositors or synchrotrons) or centrally (i.e., costs for some central 
resource)? If it is going to be stored centrally, the cost will be 
substantial. For example, at the EBI -the European Bioinformatics Institute- 
we have 15 PB of storage. We pay about 1500 GBP (~2300 USD) per TB of storage 
(not the kind you buy at Dixons or Radio Shack, obviously). For stored data, 
we have a data-duplication factor of ~8, i.e. every file is stored 8 times (at 
three data centres, plus back-ups, plus a data-duplication centre, plus 
unreleased versus public versions of the archive). (Note - this is only for 
the EBI/PDBe! RCSB and PDBj will have to acquire storage as well.) Moreover, 
disks have to be housed in a building (not free!), with cooling, security 
measures, security staff, maintenance staff, electricity (substantial cost!), 
rental of a 1-10 Gb/s connection, etc. All hardware has a life-cycle of three 
years (barring failures) and then needs to be replaced (at lower cost, but 
still not free).
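
To put rough numbers on the above (a back-of-the-envelope Python sketch;
the deposition rate and per-dataset size are assumptions for illustration,
while the ~8x duplication and ~1500 GBP/TB figures are taken from the
preceding paragraph):

  # Hypothetical yearly cost of centrally archiving deposited images.
  datasets_per_year = 10_000   # order of magnitude discussed in this thread
  gb_per_dataset = 5           # assumed average raw-image volume per entry
  duplication = 8              # every file stored ~8 times, as above
  gbp_per_tb = 1500            # managed-storage cost quoted above
  stored_tb = datasets_per_year * gb_per_dataset * duplication / 1024
  print(round(stored_tb), "TB stored,", round(stored_tb * gbp_per_tb), "GBP/year")
  # -> 391 TB stored, 585938 GBP/year (before buildings, power, staff, ...)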


- if the data is going to be stored centrally, how will it get there? Using 
ftp will probably not be feasible.
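
A minimal sketch of the bandwidth problem (both numbers are assumptions:
a 10 GB raw dataset and a sustained 100 Mbit/s path to the archive):

  # Time to upload one dataset, and a year's worth of depositions.
  gb = 10                    # assumed dataset size
  mbit_per_s = 100           # assumed sustained bandwidth
  seconds = gb * 8 * 1000 / mbit_per_s        # 10 GB is 80,000 Mbit
  print(seconds / 60, "minutes per dataset")          # ~13 minutes
  print(seconds * 10_000 / 86_400, "days for 10,000") # ~93 days, non-stop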


- if it is not stored centrally, how will long-term data availability be 
enforced? (Otherwise I could have my data on a public server until my paper 
comes out in print, and then remove it.)


- what level of annotation will be required? There is no point in having 
zillions of files lying around if you don't know which 
structure/crystal/sample they belong to, at what wavelength they were 
recorded, if they were used in refinement or not, etc.
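
For illustration, a minimal per-dataset record covering the points just
listed might look like this (every field name here is invented, not a
wwPDB or PDBe schema):

  # Sketch of the minimum annotation that would make an archived
  # dataset findable and interpretable later.
  annotation = {
      "pdb_id": "1abc",                  # placeholder entry
      "crystal": "crystal 2, sample 7",  # which physical sample
      "wavelength_angstrom": 0.9795,     # as recorded at the beamline
      "n_images": 720,
      "used_in_refinement": True,        # or False for unused sweeps
      "collected": "2011-10-28",
  }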


- an issue that has not been raised yet, I think: who is going to validate 
that the images actually correspond to the structure factor amplitudes or 
intensities that were used in the refinement? This means that the data will 
have to be indexed, integrated, scaled, merged, etc. and finally compared to 
the deposited Fobs or Iobs. This will have to be done for *10,000 data sets a 
year*... And I can already imagine the arguments that will follow between 
depositors and re-processors about what software to use, what resolution 
cut-off, what outlier-rejection criteria, etc. How will conflicts and 
discrepancies be resolved? This could well end up taking a day of working time 
per data set, i.e. with 200 working days per year, one would need 50 *new* 
staff for this task alone. For comparison: worldwide, there is currently a 
*total* of ~25 annotators working for the wwPDB partners...
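
The staffing estimate follows directly (a sketch; the one-day-per-dataset
effort is the assumption made above):

  # 10,000 datasets/year at ~1 working day each, 200 working days/year:
  datasets_per_year = 10_000
  days_per_dataset = 1
  working_days_per_year = 200
  print(datasets_per_year * days_per_dataset / working_days_per_year,
        "new full-time staff needed")   # -> 50.0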


Not many of you know that (about 10 years ago) I spent probably an entire year 
of my life sorting out the mess that was the PDB structure factor files 
pre-EDS... We were apparently the first people to ever look at the tens of 
thousands of structure factor files and try to use all of them to calculate 
maps for the EDS server. (If there were others who attempted this before us, 
they had probably run away screaming.) This went well for many files, but 
there were many, many files that had problems. There were dozens of different 
kinds of issues: non-CIF files, CIF files with wrong headers, Is instead of 
Fs, 

Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-28 Thread Colin Nave
Gerard
I said in INCREASING order of influence/power, i.e. you are in first place.

The joke comes from:
"I used to think if there was reincarnation, I wanted to come back as the 
President or the Pope or a .400 baseball hitter. But now I want to come back as 
the bond market. You can intimidate everyone."
--James Carville, Clinton campaign strategist

Thanks for the comprehensive reply
 Regards
   Colin



Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-28 Thread Gerard DVD Kleywegt

Gerard
I said in INCREASING order of influence/power, i.e. you are in first place.


Ooo! *Now* it makes sense! :-)

--Gerard





Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-28 Thread Ethan Merritt
On Friday, October 28, 2011 02:02:46 pm Gerard DVD Kleywegt wrote:
  I'm a tad disappointed to be only in fourth place, Colin! 
  What has the Pope ever done for crystallography?

   http://covers.openlibrary.org/b/id/5923051-L.jpg

-- 
Ethan A Merritt
Biomolecular Structure Center,  K-428 Health Sciences Bldg
University of Washington, Seattle 98195-7742


Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-28 Thread Gerard DVD Kleywegt

On Friday, October 28, 2011 02:02:46 pm Gerard DVD Kleywegt wrote:

 I'm a tad disappointed to be only in fourth place, Colin!
 What has the Pope ever done for crystallography?


  http://covers.openlibrary.org/b/id/5923051-L.jpg


Fock'n'Pope! Great find, Ethan! So maybe he deserves fourth place after all.

--Gerard

**
   Gerard J. Kleywegt

  http://xray.bmc.uu.se/gerard   mailto:ger...@xray.bmc.uu.se
**
   The opinions in this message are fictional.  Any similarity
   to actual opinions, living or dead, is purely coincidental.
**
   Little known gastromathematical curiosity: let z be the
   radius and a the thickness of a pizza. Then the volume
of that pizza is equal to pi*z*z*a !
**


Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-28 Thread Gerard Bricogne
Dear Gerard,

 I think that a major achievement of this online debate will have been
to actually get you to carry out a constructive analysis (an impressive one,
I will be the first to say) of this question, instead of dismissing it right
away. It is almost as great an achievement as getting the Pope to undergo
psychoanalysis! (I am thinking here of the movie Habemus Papam.)

 It is very useful to have the facts and figures you mention for the
costs of full PDB officialdom for the storage of raw data. I think one could
describe the first stage towards that, in the form I have been mentioning as
the IUCr DDDWG pilot project, as first trying to see how to stop those raw
images from disappearing, pending the mobilisation of more resources towards
eventually putting them up in five-star accommodation (if they are thought
to be earning their keep). I am again hopeful that anticipated difficulties
at the five-star stage (with today's cost estimates) will not stop us from
trying to do what is possible today in this pilot project, and I also hope
that enough synchrotrons and depositors will volunteer to take part in it.

 The extra logistical load of checking that submitted raw image sets do
correspond to the deposited structure should be something that can be pushed
down towards the synchrotron sources, as was mentioned for the proper
book-keeping of metadata, as part of keeping tidy records linking user
project databases to datasets, and towards enhancements in data processing
and structure determination pipelines to keep track of all stages of the
derivation of the deposited results from the raw data. Not trivial, but not
insuperable, and fully in the direction of more automation and more
associated record keeping. This is just to say that it need not all land on
the PDB's shoulders in an initially amorphous state.


 In any case, thank you for devoting so much time and attention to this
nuts-and-bolts discussion when there are so many tempting forms of high
octane entertainment around!


 With best wishes,
 
Gerard (B.)


Re: [ccp4bb] To archive or not to archive, that's the question!

2011-10-28 Thread Herbert J. Bernstein

As the poster who mentioned the $1000 - $3000 per terabyte per year
figure, I should point out that the figure originated not from La La
Land but from an NSF RDLM workshop in Princeton last summer.  Certainly
the actual costs may be higher or lower depending on
economies/diseconomies of scale and the required ancillary tasks to be
performed.  The base figure itself seems consistent with the GBP 1500
figure cited for EBI.

That aside, the list presented seems very useful to the discussion.
I would suggest adding to it the need to try to resolve the
complex intellectual property issues involved.  This might be
a good time to try to get a consensus of the scientific community
of what approach to IP law would best serve our interests going
forward.  The current situation seems a bit messy.

Regards,
  Herbert

=
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

 +1-631-244-3035
 y...@dowling.edu
=
