Re: [ccp4bb] To archive or not to archive, that's the question!
Dear all,

The discussion about keeping primary data, and what level of data can be considered 'primary', has - rather unsurprisingly - come up in areas other than structural biology as well. An example is next-generation sequencing. A full dataset is a few terabytes, but post-processing reduces it to sub-Gb size. However, the post-processed data, as in our case, have suffered the inadequacy of computational reduction ... At least our institute has decided to back up the primary data in triplicate. For that reason our facility bought three -80 freezers - one on site in the basement, one at the top floor, and one off-site - and they keep the DNA to be sequenced. A sequencing run is already sub-1k$ and it will not become more expensive. So, if it's important, do it again. It's cheaper and it's better. At first sight, that does not apply to MX. Or does it?

So, maybe the question is not 'To archive or not to archive' but 'What to archive'. (Similarly, it never crossed my mind whether I should be or not be - I always wondered what to be.)

A.

On Oct 30, 2011, at 11:59, Kay Diederichs wrote:

At 20:59, Jrh wrote: ... So: Universities are now establishing their own institutional repositories, driven largely by the Open Access demands of funders. For these to host raw datasets that underpin publications is a reasonable role in my view, and indeed they already have this category in the University of Manchester eScholar system, for example. I am set to explore locally whether they would accommodate all our lab's raw X-ray image datasets per annum that underpin our published crystal structures. It would be helpful if readers of this CCP4bb could kindly also explore with their own universities whether they have such an institutional repository and whether raw datasets could be accommodated. Please do email me off-list with this information if you prefer, but within the CCP4bb is also good.

Dear John,

I'm pretty sure that there exists no consistent policy to provide an institutional repository for deposition of scientific data at German universities or Max Planck institutes or Helmholtz institutions; at least I have never heard of anything like this. More specifically, our University of Konstanz certainly does not have the infrastructure to provide this. I don't think that Germany is the only exception to any rule of availability of institutional repositories. Rather, I'm almost amazed that British and American institutions seem to support this. Thus I suggest not focusing exclusively on official institutional repositories, but exploring alternatives: distributed filestores like Google's BigTable, BitTorrent or others might be just as suitable - check out http://en.wikipedia.org/wiki/Distributed_data_store . I guess that any crystallographic lab could easily sacrifice/donate a TB of storage for the purposes of this project in 2011 (and maybe 2 TB in 2012, 3 in 2013, ...), but clearly the level of work to set this up should be kept as low as possible (a BitTorrent daemon seems simple enough).

Just my 2 cents,

Kay

Anastassis (Tassos) Perrakis, Principal Investigator / Staff Member
Department of Biochemistry (B8)
Netherlands Cancer Institute, Dept. B8, 1066 CX Amsterdam, The Netherlands
Tel: +31 20 512 1951 Fax: +31 20 512 1954 Mobile / SMS: +31 6 28 597791
Re: [ccp4bb] To archive or not to archive, that's the question!
Dear Tassos,

It is unclear whether this thread will be able to resolve your deep existential concerns about what to be, but you do introduce a couple of interesting points: (1) raw-data archiving in areas (of biology) other than structural biology, and (2) archiving the samples rather than the verbose data that may have been extracted from them.

Concerning (1), I am grateful to Peter Keller here in my group for pointing me - in mid-August, when we were for the n-th time reviewing the issue of raw-data deposition under discussion in this thread, and its advantages over only keeping derived data - towards the Trace Archive of DNA sequences. He found an example at

http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=retrieve&val=12345&dopt=trace&size=1&retrieve=Submit

You can check the Quality Score box below the trace, and this will refresh the display to give a visual estimate of the reliability of the sequence. There is clearly a problem around position 210 that would not have been adequately dealt with by just retaining the most probable sequence. In this context, it has been found worthwhile to preserve the raw data, to make it possible to audit derived data against them. This is at least a very simple example of what you were referring to when you wrote about the inadequacy of computational reduction. In the MX context, this is rather similar to the contamination of integrated intensities by spots from parasitic lattices (which would still affect unmerged intensities, by the way - so upgrading the PDB structure-factor file to unmerged data would take care of over-merging, but not of that contamination).

Concerning (2), I greatly doubt there would be an equivalent for MX: few people would have spare crystals to put to one side for a future repeat of a diffraction experiment (except in the case of lysozyme/insulin/thaumatin!). I can remember an esteemed colleague arguing 4-5 years ago that if you want to improve a deposited structure, you could simply repeat the work from scratch - a sensible position from the philosophical point of view (science being the art of the repeatable), but far less sensible under conditions of limited resources, and given also the difficulties of reproducing crystals. The real-life situation is more a carpe diem one: archive what you have, as you may never see it again! Otherwise one would easily get drawn into the same kind of unrealistic expectations as people who get themselves frozen in liquid N2, with their blood replaced by DMSO, hoping to be brought back to life some day in the future ;-) .

With best wishes,

Gerard.
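(To make the 'audit' point concrete: below is a minimal sketch, in Python, of flagging low-confidence stretches from per-base quality scores. Everything in it - the helper function, the threshold of 20, the toy scores with a dip near position 210 - is illustrative and hypothetical, not taken from the actual Trace Archive record.)

```python
# Minimal sketch: flag low-confidence stretches in a sequencing trace
# from per-base Phred-style quality scores. The scores, threshold and
# helper are illustrative, not taken from the actual Trace Archive.

def low_confidence_regions(quality_scores, threshold=20, min_run=3):
    """Return (start, end) index pairs where quality stays below threshold."""
    regions, start = [], None
    for i, q in enumerate(quality_scores):
        if q < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                regions.append((start, i))
            start = None
    if start is not None and len(quality_scores) - start >= min_run:
        regions.append((start, len(quality_scores)))
    return regions

# Toy trace: confident calls everywhere except a dip near position 210.
scores = [40] * 205 + [8, 6, 7, 9, 10] + [40] * 90
print(low_confidence_regions(scores))   # -> [(205, 210)]
```

The point is simply that the flagged region is invisible in the reduced "most probable sequence" alone - you need the retained per-base evidence to see it.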
Re: [ccp4bb] To archive or not to archive, that's the question!
Still, after hundreds (?) of emails on this topic, I haven't seen any convincing argument in favour of archiving data. The only convincing arguments are against, and are from Gerard K and Tassos. Why? The question is not what to archive, but still why we should archive all the data. Because software developers need more data? Should we mount all these efforts and costs because 10 developers worldwide need the data for ALL protein structures? Do they really need so much data; wouldn't it be enough to build a repository of maybe 1000 datasets for development?

Does anyone really believe that our view of the actual problem - the function of the proteins - changes with the analysis of whatever scattering is still in the images but not used by today's software? Crystal structures are static snapshots, obtained under artificial conditions. In solution (still the physiological state) they might look different - not much, but at least far more dynamic. Does it therefore matter whether we know some side-chain positions better (in the crystal structure) when re-analysing the data? In turn, are our current software programs so bad that we would expect strong differences when re-analysing the data? No. And if the structures change upon re-analysis (more or less), who re-interprets the structures and re-writes the papers? There are many, many cases where researchers re-did structures (or did structures closely related to already available ones, such as mutants or structures from closely related species), also after 10 years. I guess they used the latest software in each case, and thus incorporated all the software development of those 10 years. And are the structures really different (beyond the introduced changes, mutations, etc.)? Different because of the software used?

The comparison with next-generation sequencing data is useful here, but only in the sense Tassos explained. Of course not every position in the genomic sequence is fixed. Therefore it is sometimes useful to look at the original data (the traces, as Gerard B pointed out). But we already know that every single organism is different (especially eukaryotes), and therefore it is absolutely enough to store the computationally reduced and merged data. If one needs better, position-specific data, sequencing and comparing individual strains becomes necessary, as in the ENCODE project, the sequencing of about 100 Saccharomyces strains, the sequencing of 1000 Arabidopsis strains, etc. Discussions about single positions are useless if they are not statistically relevant; they need to be analysed in the context of populations, large cohorts of patients, etc. If we need personalized medicine adapted to personal genomes, we would also need personal sets of protein structures, which we cannot yet provide. Therefore, storing the DNA in the freezer is better and cheaper than storing all the sequencing raw data. Do you think a reviewer re-sequences, re-assembles, or re-annotates a genome, even if access to the raw reads were available? If you trust those data, why don't we trust our structure factors? Do you trust electron-microscopy images, or movies of GFP-tagged proteins? Do you think that what is presented for a single cell, or a few visible cells, is also found in all cells? And now, how many of you (if not everybody) use structures from yeast, Drosophila, mouse, etc. as MODELS for human proteins?

If we stick to this way of thinking, who would care about potential minor changes in the structures upon re-analysis (and in the light of this discussion, arguing about specific genomic sequence positions becomes unimportant as well)? Is any of the archived data useful without manual evaluation upon archiving? This is especially relevant for structures not yet solved. Do the images belong with the structure factors? If only images are available, where is the corresponding protein sequence; has it been sequenced; what was in the buffer/crystallization condition; what was used during protein purification; what was the intention of the crystallization - e.g. a certain functional state that the protein was forced into by artificial conditions - etc., etc.? Who wants to evaluate all that, and how? The question is not whether we could do it. We could, but wouldn't it advance science far more if we spent the time and money on new projects rather than on evaluation, administration, etc.?

Be honest: how many of you have really, and completely, re-analysed your own data, deposited 10 years ago, with the latest software? What changes did you find? Did you have to re-write the discussions in your former publications? Do you think that the changes justify the efforts and costs of worldwide archiving of all data? Well, for all such cases there are always single instances (some have been mentioned in earlier emails) where these things matter or mattered. But does this really justify all the future effort and cost?
Re: [ccp4bb] To archive or not to archive, that's the question!
Dear All,

As someone who recently left crystallography for sequencing, I should modify Tassos's point...

> A full data-set is a few terabytes, but post-processing reduces it to sub-Gb size.

My experience from HiSeqs is that 'full' here means the base calls - equivalent to the unmerged HKLs - hardly raw data. NGS (short-read) sequencing is an imaging technique, and the images are more like 100 TB for a 15-day run on a single flow cell. The raw base calls are about 5 TB. The compressed, mapped data (a BAM file, for a human genome at 30x coverage) are about 120 GB. It is only a variant call file (VCF, the differences from a stated human reference genome) that is sub-Gb - and these files are, unsurprisingly, unsuited to detailed statistical analysis. Also, $1k is not yet an economic cost...

The DNA information capacity in a single human body dwarfs the entire world disk capacity, so storing DNA is a no-brainer here. Sequencing groups are making very hard-nosed economic decisions about what to store - indeed it is a source of research in itself - but the scale of the problem is very much bigger.

My tuppence ha'penny is that depositing raw images along with everything else in the PDB is a nice idea but would have little impact on science (human/animal/plant health or understanding of biology).

1) If confined to structures in the PDB, the images would just be the ones giving the final best data - hence the ones least likely to have been problematic. I'd be more interested in SFs/maps for looking at ligand binding etc...

2) Unless this were done before paper acceptance, they would be of little use to referees seeking to review important structural papers. I'd like to see PDB validation reports (which could include automated data processing, perhaps culled from synchrotron sites, SFs and/or maps) made available to referees in advance of publication. This would be enabled by deposition, but could be achieved in other ways.

3) The datasets of interest to methods developers are unlikely to be the ones deposited. Developers should be in contact with synchrotron archives directly. Processing multiple lattices is a case in point here.

4) Remember that the average consumer of a PDB file is not a crystallographer - more likely a graduate student in a clinical lab. For him/her, things like occupancies and B-factors are far more serious concerns...

I'm not trivializing the issue, but importance is always relative. Are there outsiders on the panel to keep perspective?

Robert

--
Dr. Robert Esnouf, University Research Lecturer, ex-crystallographer and Head of Research Computing, Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford OX3 7BN, UK
Emails: rob...@strubi.ox.ac.uk and rob...@esnouf.com
Tel: (+44) 1865 287783, Fax: (+44) 1865 287547
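(For a sense of the reduction ladder Robert quotes, the arithmetic below uses his figures, with the sub-Gb VCF assumed to be ~0.5 GB purely for illustration:)

```python
# Reduction factors along the NGS pipeline, using the sizes quoted
# above (in bytes). The sub-Gb VCF is assumed ~0.5 GB for illustration.
stages = [
    ("raw images (15-day flow cell)", 100e12),  # ~100 TB
    ("base calls",                      5e12),  # ~5 TB
    ("mapped BAM (human, 30x)",        120e9),  # ~120 GB
    ("variant calls (VCF)",            0.5e9),  # sub-Gb, assumed 0.5 GB
]
for (prev_name, prev_size), (name, size) in zip(stages, stages[1:]):
    print(f"{prev_name} -> {name}: {prev_size / size:,.0f}x smaller")
# raw images -> base calls: 20x; base calls -> BAM: ~42x; BAM -> VCF: 240x
```

Each stage discards roughly one to two orders of magnitude, which is exactly why the choice of what counts as 'primary' matters so much.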
Re: [ccp4bb] To archive or not to archive, that's the question!
I was hesitant to add my opinion so far because I'm more used to listening to this forum than telling others what I think.

Why and what to deposit are absolutely interconnected. Once you decide why you want to do it, you will probably know what the best format will be, and vice versa. Whether deposition of raw images will or will not help future understanding of the biology, I'm not sure. But storing those difficult datasets to help future software development sounds really far-fetched. It assumes that future crystallographers will never grow crystals that deliver difficult datasets. If that is the case, and in 10-20-30 years the next generation will be growing much better crystals, then they won't need such software development. If that is not the case, and once in a while (or more often) they will be getting something out of the ordinary, then software developers will take those datasets and develop whatever they need to handle such cases.

Am I missing a point of the discussion here?

Regards, Vaheh
Re: [ccp4bb] To archive or not to archive, that's the question!
Dear Vaheh,

On Mon, Oct 31, 2011 at 03:18:07PM +0000, Oganesyan, Vaheh wrote:

> But storing those difficult datasets to help future software development sounds really far-fetched.

As far as I can see the general plan, that would be a second stage (deposit all datasets) - the first one would be the datasets related directly to a given PDB entry.

> It assumes that future crystallographers will never grow crystals that deliver difficult datasets.

Oh, sure they will. And lots of those datasets will be available to developers ... being thrown a difficult problem under pressure is a very good way to get ideas, think outside the box, etc. However, developing solid algorithms is better done in a less hectic environment, with a large collection of similar problems (changing only one parameter at a time) on which to test a new method.

> If that is the case, and in 10-20-30 years the next generation will be growing much better crystals, then they won't need such software development.

They'll grow better crystals for the type of project we're currently struggling with, sure. But we'll still get poor crystals for projects we don't even attempt or tackle right now. Software development is a slow process, often working on a different timescale from the typical structure-solution project (obviously there are exceptions). So planning ahead for that time will prepare us. And yes, it will have an impact on the biology then. It's not just the here and now (and the next grant, the next high-profile paper) we should be thinking about.

> Am I missing a point of the discussion here?

One small point maybe: there are very few developers out there - but a very large number of users who benefit from what they have done. Often the work is not very visible ("It's just pressing a button or two ... so it must be trivial!") - which is a good thing: it has to be simple, robust, automatic and usable. I think if a large enough number of developers consider depositing images a very useful resource for their future development (and therefore of future benefit to a large number of users), it should be seriously considered, even if some of the advertised benefits have to be taken on trust. Past developments in data processing have had a big impact on a lot of projects - high-profile or just the standard PhD-student nightmare - with often small returns for the developers in terms of publications, grants or even citations (main paper or supplementary material).

So maybe, in the spirit of the festive season, it is time to consider giving a little bit back. What is there to lose? Another 20 minutes of additional deposition work for the user, in return for maybe/hopefully saving a whole project 5 years down the line? Not a bad investment, it seems to me ...

Cheers

Clemens

--
***
* Clemens Vonrhein, Ph.D.     vonrhein AT GlobalPhasing DOT com
* Global Phasing Ltd.
* Sheraton House, Castle Park
* Cambridge CB3 0AX, UK
* BUSTER Development Group (http://www.globalphasing.com)
***
Re: [ccp4bb] To archive or not to archive, that's the question!
Dear Martin,

Thank you for this very clear message about your views on this topic. There is nothing like well-articulated dissenting views to force a real assessment of the initial arguments, and you have certainly provided that. As your presentation is modular, I will interleave my comments with your text, if you don't mind.

> Still, after hundreds (?) of emails on this topic, I haven't seen any convincing argument in favour of archiving data. [...] Should we mount all these efforts and costs because 10 developers worldwide need the data for ALL protein structures? Do they really need so much data; wouldn't it be enough to build a repository of maybe 1000 datasets for development?

A first impression is that your remark rather looks down on those "10 developers worldwide" - a view not out of keeping with that of structural biologists who have moved away from ground-level crystallography and view the latter as a "mature technique", a euphemism for saying that no further improvements are likely, nor even necessary. As Clemens Vonrhein has just written, it may be the very success of those developers that has given the benefit of what software can do to users who don't have the faintest idea of what it does, how it does it, or what its limitations are and how to overcome them - and who therefore take it for granted.

Another side of the "mature technique" kiss of death is the underlying assumption that the demands placed on crystallographic methods are themselves static, and nothing could be more misleading. We get caught time and again by rushed shifts in technology, without proper precautions in case the first adaptations of the old methods do not perform as well as they might later. Let me quote an example: 3x3 CCD detectors. It was too quickly and hurriedly assumed that, after correcting the images recorded on these instruments for geometric distortions and flat-field response, one would get images that could be processed as if they came from image plates (or film). This turned out to be a mistake: corner effects were later diagnosed that were partially correctable by a position-dependent modulation factor, applied for instance by XDS in response to the problem. Unfortunately, that correction is not simply detector-dependent and applicable to all datasets recorded on a given detector, as it is related to a spatial variation in the point-spread function - so you really need to reprocess each set of images to determine the necessary corrections. The tragic thing is that for a typical resolution limit and detector distance, these corners cut into the quick of your strongest secondary-structure-defining data. If you have kept your images, you can try to recover from that; otherwise, you are stuck with what can be seriously sub-optimal data. Imagine what this can do to SAD anomalous differences when Bijvoet pairs fall on detector positions where these corner effects are vastly different ...

Another example is that of the recent use of numerous microcrystals, each giving a very small amount of data, to assemble datasets for solving GPCR structures. The methods for doing this - for getting the indexing and integration of such thin slices of data, and getting the overall scaling to behave - are still very rough. It would be pure insanity to throw these images away and not count on better algorithms coming along to improve the final data extractable from them.

> Does anyone really believe that our view of the actual problem, the function of the proteins, changes with the analysis of whatever scattering is still in the images but not used by today's software? [...] And if the structures change upon re-analysis (more or less), who re-interprets the structures and re-writes the papers?

I think that, rather than asking rhetorical questions about people's beliefs regarding such a general question, one needs testimonies about real-life situations. We have helped a great many academic groups in the last 15 years: in every case, they ended up feeling really overjoyed that they had kept their images when they had, and immensely regretful when they hadn't. I noticed, for example, that your last PDB entry, 1LKX (2002), does not have structure-factor data associated with it. It is therefore impossible to check the model against the data from which it was derived.
Re: [ccp4bb] To archive or not to archive, that's the question!
Dear Gerard K,

Many thanks indeed for this. Like Gerard Bricogne, you also indicate that the decentralised location option is 'quite simple and very cheap in terms of centralised cost'. The SR facilities worldwide can, I hope, surely follow the lead taken by Diamond Light Source and PaN, the European consortium of SR and neutron facilities, and keep their data archives, and also assist authors with the DOI registration process for those datasets that result in publication. Linking to these DOIs from the PDB, for example, is, as you confirm, straightforward.

Gerard B's pressing of the above approach via the 'pilot project' within the IUCr DDD WG's various discussions, with a nicely detailed plan, brought home to me the merit of this approach for the even greater challenge of raw-data archiving for chemical crystallography - greater both in the number of datasets and because the SR facilities' role is much smaller. IUCr Journals also note the challenge of moving large quantities of data around, i.e. if the Journals were to try to host everything for chemical crystallography and thus become 'the centre' for these datasets.

So: universities are now establishing their own institutional repositories, driven largely by the Open Access demands of funders. For these to host raw datasets that underpin publications is a reasonable role in my view, and indeed they already have this category in the University of Manchester eScholar system, for example. I am set to explore locally whether they would accommodate all our lab's raw X-ray image datasets per annum that underpin our published crystal structures. It would be helpful if readers of this CCP4bb could kindly also explore with their own universities whether they have such an institutional repository and whether raw datasets could be accommodated. Please do email me off-list with this information if you prefer, but within the CCP4bb is also good.

Such an approach involving institutional repositories would also work, of course, for the 25% of MX structures that are from non-SR datasets.

All the best for a splendid PDB40 event.

Greetings,
John
Prof John R Helliwell DSc

On 28 Oct 2011, at 22:02, Gerard DVD Kleywegt ger...@xray.bmc.uu.se wrote:

Hi all,

It appears that during my time here at Cold Spring Harbor I have missed a small debate on CCP4BB (in which my name has been used in vain, to boot). I have not yet had time to read all the contributions, but would like to make a few points that will hopefully contribute to the discussion and keep it with two feet on Earth (as opposed to La-La Land, where the people live who think that image archiving can be done on a shoestring budget... more about this in a bit). Note: all of this is on personal title, i.e. not official wwPDB gospel. Oh, and sorry for the new subject line, but this way I can track the replies more easily.

It seems to me that there are a number of issues that need to be separated:

(1) the case for/against storing raw data
(2) implementation and resources
(3) funding
(4) location

I will say a few things about each of these issues in turn:

---

(1) Arguments in favour of and against the concept of storing raw image data, as well as possible alternative solutions that could address some of the issues at lower cost or complexity. I realise that my views carry a weight of 1.0, just like everybody else's, and many of the arguments and counter-arguments have already been made, so I will not add to them at this stage.

---

(2) Implementation details and required resources. If the community should decide that archiving raw data would be scientifically useful, then it has to decide how best to do it. This will determine the level of resources required. Questions include:

- What should be archived? (See Jim H's list from (a) to (z) or so.) An initial plan would perhaps aim for the images associated with the data used in the final refinement of deposited structures.
- How much data are we talking about per dataset/structure/year?
- Should it be stored close to the source (i.e., responsibility and costs with depositors or synchrotrons) or centrally (i.e., costs with some central resource)?

If it is going to be stored centrally, the cost will be substantial. For example, at the EBI - the European Bioinformatics Institute - we have 15 PB of storage. We pay about 1500 GBP (~2300 USD) per TB of storage (not the kind you buy at Dixons or Radio Shack, obviously). For stored data we have a data-duplication factor of ~8, i.e. every file is stored 8 times (at three data centres, plus back-ups, plus a data-duplication centre, plus unreleased versus public versions of the archive). (Note - this is only for the EBI/PDBe! RCSB and PDBj would have to acquire storage as well.) Moreover, disks have to be housed in a building (not free!), with cooling, security measures, security staff, ...
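(A back-of-the-envelope sketch of what central storage might cost, using Gerard K's figures of 1500 GBP per stored TB and a duplication factor of ~8; the annual deposition count and per-dataset image size below are assumptions for illustration, not wwPDB statistics:)

```python
# Back-of-the-envelope cost of central raw-image archiving, using the
# EBI figures quoted above (1500 GBP per stored TB, duplication ~8).
# DATASETS_PER_YEAR and GB_PER_DATASET are illustrative assumptions.
COST_PER_TB_GBP   = 1500
DUPLICATION       = 8      # every file held ~8 times across centres/back-ups
DATASETS_PER_YEAR = 8000   # assumption: order of annual PDB depositions
GB_PER_DATASET    = 10     # assumption: compressed images for one structure

tb_per_year   = DATASETS_PER_YEAR * GB_PER_DATASET / 1024
cost_per_year = tb_per_year * DUPLICATION * COST_PER_TB_GBP
print(f"~{tb_per_year:,.0f} TB/year raw -> ~{cost_per_year:,.0f} GBP/year")
# -> ~78 TB/year raw -> ~937,500 GBP/year (storage alone, recurring)
```

Even with these deliberately modest assumptions the recurring storage bill is of order a million GBP per year - which is the substance of the "not a shoestring budget" remark.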
Re: [ccp4bb] To archive or not to archive, that's the question!
One important issue to address is how to deal with the perceived reliability issues of the federated model, and how to start to approach the higher reliability of the centralized model described by Gerard K, but without incurring what seem at present to be unacceptable costs.

One answer comes from the approach followed in communications systems. If the probability of data loss in each communication subsystem is, say, 1/1000, then the probability of data loss in two independent copies of the same lossy system is only 1/1,000,000. We could apply that lesson to the federated data-image-archive model by asking each institution to partner with a second, independent and hopefully geographically distant, institution, with an agreement for each to host copies of the other's images. If we restrict that duplication protocol, at least at first, to those images strongly related to an actual publication/PDB deposition, the incremental cost of greatly improved reliability would be very low, with no disruption of the basic federated approach being suggested.

Please note that I am not suggesting that institutional repositories will have 1/1000 data-loss rates, but they will certainly have some data-loss rate, and this modest change in the proposal would help greatly to lower the impact of that data loss and allow us to go forward with greater confidence.

Regards,
Herbert
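(Herbert's figure is just the multiplication rule for independent failures; a minimal check, assuming partner-site losses really are independent, reproduces his numbers:)

```python
# Loss probability for k independent copies, each with per-site loss
# probability p. Independence between partner sites is assumed.
def loss_probability(p, k):
    return p ** k

p = 1e-3  # the illustrative 1/1000 per-site rate from the text
for k in (1, 2, 3):
    print(f"{k} copies: loss probability ~ {loss_probability(p, k):.0e}")
# 1 -> 1e-03, 2 -> 1e-06 (the 1/1,000,000 above), 3 -> 1e-09
```

The gain evaporates if the two sites can fail together (same campus, same funding decision, same flood), which is why the geographic-distance condition matters.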
Greetings, John
Prof John R Helliwell DSc

On 28 Oct 2011, at 22:02, Gerard DVD Kleywegt ger...@xray.bmc.uu.se wrote: [...]
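Herbert's replication arithmetic above is easy to sanity-check. A minimal sketch in Python (the loss rates are hypothetical, and the calculation assumes the partnered copies fail independently):

    def loss_probability(p_single, n_copies):
        """Probability that every copy of a dataset is lost,
        assuming the copies fail independently."""
        return p_single ** n_copies

    print(loss_probability(1e-3, 2))   # 1/1000 per site -> about 1e-06 for a pair
    print(loss_probability(1e-6, 2))   # 'six nines' becomes 'twelve nines'

The independence assumption is the crux of the argument, and it is exactly why the partner institution should be geographically distant.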
Re: [ccp4bb] To archive or not to archive, that's the question!
Dear Herbert, I imagine it likely that, e.g., the Univ Manchester eScholar system will have duplicate storage in place for the reasons you outline below. However, for it to be geographically distant is, to my reckoning, less likely, but still possible. I will add that further query to my first query to the eScholar user support re dataset sizes and DOI registration. Greetings, John
Prof John R Helliwell DSc

On 29 Oct 2011, at 15:49, Herbert J. Bernstein y...@bernstein-plus-sons.com wrote: [...]
Re: [ccp4bb] To archive or not to archive, that's the question!
Dear John, Most sound institutional data repositories use some form of off-site backup. However, not all of them do, and the standards of reliability vary. The advantages of an explicit partnering system are both practical and psychological. The practical part is the major improvement in reliability: even if we start at six nines, twelve nines is better. The psychological part is that members of the community can feel reassured that reliability has indeed been improved to levels at which they can focus on other, more scientific issues, instead of the question of reliability. Regards, Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
y...@dowling.edu
=====================================================

On Sat, 29 Oct 2011, Jrh wrote: [...]
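The partnering protocol discussed above could be audited quite cheaply: each site publishes a manifest of checksums for the images it holds, and its partner verifies its mirror against that manifest. A minimal sketch, assuming SHA-256 manifests exchanged as plain dictionaries; this is not an existing tool, and all paths and names are illustrative:

    import hashlib
    import pathlib

    def manifest(root):
        """Map relative file path -> SHA-256 digest for every file under root."""
        root = pathlib.Path(root)
        return {str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
                for p in root.rglob("*") if p.is_file()}

    def audit(local_root, partner_manifest):
        """Report files missing locally or differing from the partner's copies."""
        local = manifest(local_root)
        missing = sorted(set(partner_manifest) - set(local))
        corrupt = sorted(f for f in set(local) & set(partner_manifest)
                         if local[f] != partner_manifest[f])
        return missing, corrupt

Either site can then re-fetch anything reported missing or corrupt from its partner, which is the whole point of the two-copy arrangement.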
[ccp4bb] To archive or not to archive, that's the question!
Hi all, It appears that during my time here at Cold Spring Harbor I have missed a small debate on CCP4BB (in which my name has been used in vain to boot). I have not yet had time to read all the contributions, but would like to make a few points that hopefully contribute to the discussion and keep it with two feet on Earth (as opposed to La La Land, where the people live who think that image archiving can be done on a shoestring budget... more about this in a bit). Note: all of this is on personal title, i.e. not official wwPDB gospel. Oh, and sorry for the new subject line, but this way I can track the replies more easily.

It seems to me that there are a number of issues that need to be separated:

(1) the case for/against storing raw data
(2) implementation and resources
(3) funding
(4) location

I will say a few things about each of these issues in turn:

---

(1) Arguments in favour and against the concept of storing raw image data, as well as possible alternative solutions that could address some of the issues at lower cost or complexity. I realise that my views carry a weight=1.0 just like everybody else's, and many of the arguments and counter-arguments have already been made, so I will not add to these at this stage.

---

(2) Implementation details and required resources. If the community should decide that archiving raw data would be scientifically useful, then it has to decide how best to do it. This will determine the level of resources required to do it. Questions include:

- What should be archived? (See Jim H's list from (a) to (z) or so.) An initial plan would perhaps aim for the images associated with the data used in the final refinement of deposited structures.

- How much data are we talking about per dataset/structure/year? (A back-of-envelope sketch of these numbers follows after this message.)

- Should it be stored close to the source (i.e., responsibility and costs for depositors or synchrotrons) or centrally (i.e., costs for some central resource)? If it is going to be stored centrally, the cost will be substantial. For example, at the EBI - the European Bioinformatics Institute - we have 15 PB of storage. We pay about 1500 GBP (~2300 USD) per TB of storage (not the kind you buy at Dixons or Radio Shack, obviously). For stored data, we have a data-duplication factor of ~8, i.e. every file is stored 8 times (at three data centres, plus back-ups, plus a data-duplication centre, plus unreleased versus public versions of the archive). (Note: this is only for the EBI/PDBe! RCSB and PDBj will have to acquire storage as well.) Moreover, disks have to be housed in a building (not free!), with cooling, security measures, security staff, maintenance staff, electricity (substantial cost!), rental of a 1-10 Gb/s connection, etc. All hardware has a life-cycle of three years (barring failures) and then needs to be replaced (at lower cost, but still not free).

- If the data are going to be stored centrally, how will they get there? Using ftp will probably not be feasible.

- If the data are not stored centrally, how will long-term availability be enforced? (Otherwise I could have my data on a public server until my paper comes out in print, and then remove it.)

- What level of annotation will be required? There is no point in having zillions of files lying around if you don't know which structure/crystal/sample they belong to, at what wavelength they were recorded, whether they were used in refinement or not, etc.
- An issue that has not been raised yet, I think: who is going to validate that the images actually correspond to the structure factor amplitudes or intensities that were used in the refinement? This means that the data will have to be indexed, integrated, scaled, merged, etc., and finally compared to the deposited Fobs or Iobs. This will have to be done for *10,000 data sets a year*... And I can already imagine the arguments that will follow between depositors and re-processors about what software to use, what resolution cut-off, what outlier-rejection criteria, etc. How will conflicts and discrepancies be resolved? This could well end up taking a day of working time per data set, i.e. with 200 working days per year, one would need 50 *new* staff for this task alone. For comparison: worldwide, there is currently a *total* of ~25 annotators working for the wwPDB partners...

Not many of you know that (about 10 years ago) I spent probably an entire year of my life sorting out the mess that was the PDB structure factor files pre-EDS... We were apparently the first people to ever look at the tens of thousands of structure factor files and try to use all of them to calculate maps for the EDS server. (If there were others who attempted this before us, they had probably run away screaming.) This went well for many files, but there were many, many files that had problems. There were dozens of different kinds of issues: non-CIF files, CIF files with wrong headers, Is instead of Fs,
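As promised above, a back-of-envelope sketch of the storage and staffing arithmetic in this message. The per-dataset size and the reading of the 1500 GBP figure as an annual per-TB cost are assumptions made purely for illustration; the deposition rate, duplication factor, and validation estimates are the ones quoted above:

    datasets_per_year = 10_000      # rough deposition rate cited above
    gb_per_dataset    = 5           # hypothetical average raw-image volume
    duplication       = 8           # EBI's stated data-duplication factor
    gbp_per_tb        = 1_500       # EBI's quoted storage cost (assumed annual)

    tb_stored = datasets_per_year * gb_per_dataset * duplication / 1_000
    print(f"{tb_stored:.0f} TB/year, ~{tb_stored * gbp_per_tb:,.0f} GBP/year")

    days_per_dataset, working_days = 1, 200    # validation estimate above
    print(f"{datasets_per_year * days_per_dataset // working_days} new annotators")

On these (debatable) inputs: roughly 400 TB and 600,000 GBP per year for storage alone, plus the 50 extra staff mentioned above for validation.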
Re: [ccp4bb] To archive or not to archive, that's the question!
Gerard, I said in INCREASING order of influence/power, i.e. you are in first place. The joke comes from: "I used to think if there was reincarnation, I wanted to come back as the President or the Pope or a .400 baseball hitter. But now I want to come back as the bond market. You can intimidate everyone." --James Carville, Clinton campaign strategist. Thanks for the comprehensive reply. Regards, Colin

-----Original Message-----
From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Gerard DVD Kleywegt
Sent: 28 October 2011 22:03
To: ccp4bb
Subject: [ccp4bb] To archive or not to archive, that's the question!

[...]
Re: [ccp4bb] To archive or not to archive, that's the question!
> Gerard I said in INCREASING order of influence/power i.e. you are in first place.

Ooo! *Now* it makes sense! :-) --Gerard

[...]
Re: [ccp4bb] To archive or not to archive, that's the question!
On Friday, October 28, 2011 02:02:46 pm Gerard DVD Kleywegt wrote: I'm a tad disappointed to be only in fourth place, Colin!

What has the Pope ever done for crystallography? http://covers.openlibrary.org/b/id/5923051-L.jpg

--
Ethan A Merritt
Biomolecular Structure Center, K-428 Health Sciences Bldg
University of Washington, Seattle 98195-7742
Re: [ccp4bb] To archive or not to archive, that's the question!
On Friday, October 28, 2011 02:02:46 pm Gerard DVD Kleywegt wrote: I'm a tad disappointed to be only in fourth place, Colin! What has the Pope ever done for crystallography? http://covers.openlibrary.org/b/id/5923051-L.jpg

Fock'n'Pope! Great find, Ethan! So maybe he deserves fourth place after all. --Gerard

******************************************************************
Gerard J. Kleywegt
http://xray.bmc.uu.se/gerard   mailto:ger...@xray.bmc.uu.se
******************************************************************
The opinions in this message are fictional. Any similarity to actual opinions, living or dead, is purely coincidental.
******************************************************************
Little known gastromathematical curiosity: let z be the radius and a the thickness of a pizza. Then the volume of that pizza is equal to pi*z*z*a !
******************************************************************
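For what it's worth, the gastromathematical curiosity checks out: a pizza is a cylinder of radius z and thickness a, so its volume really is pi*z*z*a. A throwaway check in Python (the pizza dimensions are, of course, made up):

    from math import pi

    def pizza_volume(z, a):
        """Volume of a cylindrical pizza with radius z and thickness a."""
        return pi * z * z * a

    print(pizza_volume(15, 0.5))   # a 30 cm pizza, 5 mm thick: ~353 cm^3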
Re: [ccp4bb] To archive or not to archive, that's the question!
Dear Gerard, I think that a major achievement of this online debate will have been to actually get you to carry out a constructive analysis (an impressive one, I will be the first to say) of this question, instead of dismissing it right away. It is almost as great an achievement as getting the Pope to undergo psychoanalysis! (I am thinking here of the movie Habemus Papam.)

It is very useful to have the facts and figures you mention for the costs of full PDB officialdom for the storage of raw data. I think one could describe the first stage towards that, in the form I have been mentioning as the IUCr DDDWG pilot project, as first trying to see how to stop those raw images from disappearing, pending the mobilisation of more resources towards eventually putting them up in five-star accommodation (if they are thought to be earning their keep). I am again hopeful that anticipated difficulties at the five-star stage (with today's cost estimates) will not stop us from trying to do what is possible today in this pilot project, and I also hope that enough synchrotrons and depositors will volunteer to take part in it.

The extra logistical load of checking that submitted raw image sets do correspond to the deposited structure should be something that can be pushed down towards the synchrotron sources, as was mentioned for the proper book-keeping of metadata: as part of keeping tidy records linking user project databases to datasets, and through enhancements in data processing and structure determination pipelines to keep track of all stages of the derivation of the deposited results from the raw data. Not trivial, but not insuperable, and fully in the direction of more automation and more associated record keeping. This is just to say that it need not all land on the PDB's shoulders in an initially amorphous state.

In any case, thank you for devoting so much time and attention to this nuts-and-bolts discussion when there are so many tempting forms of high-octane entertainment around! With best wishes, Gerard (B.)

--
On Fri, Oct 28, 2011 at 11:02:46PM +0200, Gerard DVD Kleywegt wrote: [...]
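Gerard B's suggestion that pipelines keep "tidy records linking user project databases to datasets" is, at bottom, a metadata-schema question. Here is a minimal sketch of the kind of provenance record a processing pipeline might attach to a deposition; this is not any existing wwPDB or synchrotron schema, and every field name and value is an illustrative assumption:

    provenance = {
        "pdb_id":       "1abc",                        # hypothetical entry
        "raw_data_doi": "doi:10.0000/example.12345",   # hypothetical dataset DOI
        "facility":     "some-synchrotron/BL-X",       # beamline of origin
        "image_sets":   ["scan1_0001.cbf", "..."],     # file list or manifest hash
        "processing":   [
            {"step": "indexing/integration", "software": "XDS"},
            {"step": "scaling/merging",      "software": "SCALA"},
        ],
    }

A record like this, written automatically at each stage, is what would let the burden of matching raw images to deposited structure factors be pushed down to the synchrotrons rather than landing on the PDB.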
Re: [ccp4bb] To archive or not to archive, that's the question!
As the poster who mentioned the $1000-$3000 per terabyte per year figure, I should point out that the figure originated not from La La Land but from an NSF RDLM workshop in Princeton last summer. Certainly the actual costs may be higher or lower, depending on economies/diseconomies of scale and the required ancillary tasks to be performed. The base figure itself seems consistent with the GBP 1500 figure cited for the EBI.

That aside, the list presented seems very useful to the discussion. I would suggest adding to it the need to try to resolve the complex intellectual property issues involved. This might be a good time to try to get a consensus in the scientific community on what approach to IP law would best serve our interests going forward. The current situation seems a bit messy. Regards, Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
y...@dowling.edu
=====================================================

On Fri, 28 Oct 2011, Gerard DVD Kleywegt wrote: [...]
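As a closing footnote, the two cost figures quoted in this thread do agree with each other, as a quick check shows (the exchange rate used is the one implied by the 1500 GBP ~ 2300 USD equivalence quoted earlier, not an authoritative rate):

    gbp_per_tb  = 1_500
    usd_per_gbp = 2_300 / 1_500                        # ~1.53, implied above
    print(f"{gbp_per_tb * usd_per_gbp:.0f} USD/TB")    # 2300: inside the 1000-3000 range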