Re: [ccp4bb] To archive or not to archive, that's the question!
Dear all,

The discussion about keeping primary data, and what level of data can be considered 'primary', has - rather unsurprisingly - come up also in areas other than structural biology. An example is next-generation sequencing. A full dataset is a few terabytes, but post-processing reduces it to sub-Gb size. However, the post-processed data, as in our case, have suffered the inadequacy of computational reduction ... At least our institute has decided to back up the primary data in triplicate. For that reason our facility bought three -80 freezers - one on site in the basement, one at the top floor, and one off-site - and they keep the DNA to be sequenced. A sequencing run is already sub-1k$ and it will not become more expensive. So, if it's important, do it again. It's cheaper and it's better. At first sight, that does not apply to MX. Or does it?

So, maybe the question is not To archive or not to archive but What to archive. (Similarly, it never crossed my mind whether I should be or not be - I always wondered what to be.)

A.

On Oct 30, 2011, at 11:59, Kay Diederichs wrote:

At 20:59, Jrh wrote: ... So:- Universities are now establishing their own institutional repositories, driven largely by the Open Access demands of funders. For these to host raw datasets that underpin publications is a reasonable role in my view, and indeed they already have this category in the University of Manchester eScholar system, for example. I am set to explore locally here whether they would accommodate all our lab's raw X-ray image datasets per annum that underpin our published crystal structures. It would be helpful if readers of this CCP4bb could kindly also explore with their own universities whether they have such an institutional repository and whether raw datasets could be accommodated. Please do email me off-list with this information if you prefer, but within the CCP4bb is also good.

Dear John, I'm pretty sure that there exists no consistent policy of providing an institutional repository for deposition of scientific data at German universities or Max Planck institutes or Helmholtz institutions; at least I never heard of anything like this. More specifically, our University of Konstanz certainly does not have the infrastructure to provide this. I don't think that Germany is the only country that is an exception to any rule of availability of institutional repositories. Rather, I'm almost amazed that British and American institutions seem to support this. Thus I suggest not focusing exclusively on official institutional repositories, but exploring alternatives: distributed filestores like Google's BigTable, BitTorrent or others might be just as suitable - check out http://en.wikipedia.org/wiki/Distributed_data_store . I guess that any crystallographic lab could easily sacrifice/donate a TB of storage for the purposes of this project in 2011 (and maybe 2 TB in 2012, 3 in 2013, ...), but clearly the level of work to set this up should be kept as low as possible (a BitTorrent daemon seems simple enough). Just my 2 cents, Kay

Anastassis (Tassos) Perrakis, Principal Investigator / Staff Member
Department of Biochemistry (B8)
Netherlands Cancer Institute, Dept. B8, 1066 CX Amsterdam, The Netherlands
Tel: +31 20 512 1951 Fax: +31 20 512 1954 Mobile / SMS: +31 6 28 597791
Re: [ccp4bb] To archive or not to archive, that's the question!
Dear Tassos,

It is unclear whether this thread will be able to resolve your deep existential concerns about what to be, but you do introduce a couple of interesting points: (1) raw data archiving in areas (of biology) other than structural biology, and (2) archiving the samples rather than the verbose data that may have been extracted from them.

Concerning (1), I am grateful to Peter Keller here in my group for pointing me towards the Trace Archive of DNA sequences in mid-August, when we were reviewing for the n-th time the issue of raw data deposition under discussion in this thread, and its advantages over keeping only derived data extracted from the raw data. He found an example at http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=retrieve&val=12345&dopt=trace&size=1&retrieve=Submit . You can check the Quality Score box below the trace, and this will refresh the display to give a visual estimate of the reliability of the sequence. There is clearly a problem around position 210 that would not have been adequately dealt with by just retaining the most probable sequence. In this context, it has been found worthwhile to preserve the raw data, to make it possible to audit derived data against them. This is at least a very simple example of what you were referring to when you wrote about the inadequacy of computational reduction. In the MX context, this is rather similar to the contamination of integrated intensities by spots from parasitic lattices (which would still affect unmerged intensities, by the way - so upgrading the PDB structure factor file to unmerged data would take care of over-merging, but not of that contamination).

Concerning (2), I greatly doubt there would be an equivalent for MX: few people would have spare crystals to put to one side for a future repeat of a diffraction experiment (except in the case of lysozyme/insulin/thaumatin!). I can remember an esteemed colleague arguing 4-5 years ago that if you want to improve a deposited structure, you could simply repeat the work from scratch - a sensible position from the philosophical point of view (science being the art of the repeatable), but far less sensible under conditions of limited resources, and given also the difficulties of reproducing crystals. The real-life situation is more a Carpe diem one: archive what you have, as you may never see it again! Otherwise one would easily get drawn into the same kind of unrealistic expectations as people who get themselves frozen in liquid N2, with their blood replaced by DMSO, hoping to be brought back to life some day in the future ;-) .

With best wishes,

Gerard.
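Gerard's point about unmerged data is easy to make concrete: once the individual symmetry-equivalent observations are kept, one can at least test their mutual consistency before (re-)merging, which a merged-only archive forecloses. Below is a minimal sketch of such a check in Python; the input layout (a list of I, sigma(I) pairs for one unique reflection, already mapped to the asymmetric unit) and the 4-sigma cutoff are illustrative assumptions, not any program's actual algorithm. And, as Gerard notes, this catches inconsistent equivalents - it cannot catch contamination that affects every equivalent alike.

    import math

    def merge_with_outlier_check(observations, z_cut=4.0):
        # observations: list of (I, sigI) for ONE unique reflection,
        # already mapped to the asymmetric unit (assumed input layout).
        # A robust centre (the median) is insensitive to a single
        # contaminated spot, e.g. one overlapped by a parasitic lattice.
        med = sorted(i for i, _ in observations)[len(observations) // 2]
        outliers = [k for k, (i, s) in enumerate(observations)
                    if abs(i - med) / s > z_cut]
        kept = [(i, s) for k, (i, s) in enumerate(observations)
                if k not in outliers]
        # inverse-variance weighted mean of the consistent observations
        w = [1.0 / (s * s) for _, s in kept]
        i_mean = sum(wi * i for wi, (i, _) in zip(w, kept)) / sum(w)
        sig = math.sqrt(1.0 / sum(w))
        return i_mean, sig, outliers

    # toy data: four consistent observations and one contaminated spot
    obs = [(102.0, 5.0), (98.0, 6.0), (105.0, 5.5), (99.0, 5.0), (310.0, 6.0)]
    print(merge_with_outlier_check(obs))  # flags the fifth observation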
[ccp4bb] 2 senior positions at AstraZeneca
Two senior structural biology positions are available at AstraZeneca UK. One is associate director of the crystallography/crystallization group (recently mentioned on ccp4bb and still open), with reference RD237; the other is one position up and has just become available: director of protein structure and biophysics, with reference RD300. You can find the jobs directly via the links below, or search for the reference numbers on www.astrazeneca.com under careers. Both should also appear in Nature Jobs at some point soon.

http://jobs.astrazeneca.com/jobs/3370-associate-director-protein-structure-crystallography
http://jobs.astrazeneca.com/jobs/3443-director-of-structure-and-biophysics

Many thanks,
Richard Pauptit
Discovery Sciences, Alderley Park, AstraZeneca UK
Re: [ccp4bb] To archive or not to archive, that's the question!
Still, after hundreds (?) of emails on this topic, I haven't seen any convincing argument in favour of archiving data. The only convincing arguments are against, and are from Gerard K and Tassos. Why? The question is not what to archive, but still why we should archive all the data. Because software developers need more data? Should we raise all these efforts and costs because 10 developers worldwide need the data for ALL protein structures? Do they really need so much data? Wouldn't it be enough to build a repository of maybe 1000 datasets for development?

Does anyone really believe that our view of the actual problem, the function of the proteins, changes with the analysis of whatever scattering is still in the images but not used by today's software? Crystal structures are static snapshots, obtained under artificial conditions. In solution (still the physiological state) they might look different - not much, but at least far more dynamic. Does it therefore matter whether we know some side-chain positions better (in the crystal structure) when re-analysing the data? In turn, is our current software so bad that we would expect strong differences when re-analysing the data? No. And if the structures change upon re-analysis (more or less), who re-interprets the structures, who re-writes the papers? There are many, many cases where researchers re-did structures (or did structures closely related to already available ones, like mutants, structures from closely related species, etc.), also after 10 years. I guess they used the latest software in the different cases, and thus incorporated all the software development of those 10 years. And are the structures really different (beyond the introduced changes, mutations, etc.)? Different because of the software used?

The comparison with next-generation sequencing data is useful here, but only in the sense Tassos explained. Of course not every position in the genomic sequence is fixed. Therefore it is sometimes useful to look at the original data (the traces, as Gerard B pointed out). But we already know that every single organism is different (especially among eukaryotes), and therefore it is absolutely enough to store the computationally reduced and merged data. If one needs better, position-specific data, sequencing and comparing individual strains becomes necessary, as in the ENCODE project, the sequencing of about 100 Saccharomyces strains, the sequencing of 1000 Arabidopsis strains, etc. Discussions about single positions are useless if they are not statistically relevant; they need to be analysed in the context of populations, large cohorts of patients, etc. If we need personalized medicine adapted to personal genomes, we would also need personal sets of protein structures, which we cannot provide yet. Therefore, storing the DNA in the freezer is better and cheaper than storing all the raw sequencing data. Do you think a reviewer re-sequences, re-assembles, or re-annotates a genome, even if access to the raw reads were available? If you trust these data, why don't we trust our structure factors? Do you trust electron microscopy images, or movies of GFP-tagged proteins? Do you think what is presented for a single or a few visible cells is also found in all cells? And now, how many of you (if not everybody) use structures from yeast, Drosophila, mouse, etc. as MODELS for human proteins?

If we stick to this thinking, who would care about potential minor changes in the structures upon re-analysis (and in the light of this discussion, arguing about specific genomic sequence positions becomes unimportant as well)? Is any of the archived data useful without manual evaluation upon archiving? This is especially relevant for structures not solved yet. Do the images belong to the structure factors? If only images are available, where is the corresponding protein sequence, has it been sequenced, what was in the buffer/crystallization condition, what was used during protein purification, what was the intention of the crystallization - e.g. a certain functional state that the protein was forced into by artificial conditions - etc. etc.? Who wants to evaluate that, and how? The question is not whether we could do it. We could do it, but wouldn't it advance science far more if we spent the time and money on new projects rather than on evaluation, administration, etc.?

Be honest: how many of you have really, and completely, re-analysed your own data, deposited 10 years ago, with the latest software? What changes did you find? Did you have to re-write the discussions in your former publications? Do you think that the changes justify the efforts and costs of worldwide archiving of all data? Well, as always, there are single cases (some have been mentioned in earlier emails) where these things matter or mattered. But does this really justify all the future efforts and costs?
Re: [ccp4bb] To archive or not to archive, that's the question!
Dear All,

As someone who recently left crystallography for sequencing, I should modify Tassos's point that a full dataset is a few terabytes, but post-processing reduces it to sub-Gb size. My experience from HiSeqs is that full here means the base calls - equivalent to the unmerged HKLs - hardly raw data. NGS (short-read) sequencing is an imaging technique, and the images are more like 100 TB for a 15-day run on a single flow cell. The raw base calls are about 5 TB. The compressed, mapped data (a BAM file, for a human genome at 30x coverage) are about 120 GB. It is only a variant call file (VCF, the differences from a stated human reference genome) that is sub-Gb, and these files are - unsurprisingly - unsuited to detailed statistical analysis. Also, $1k is not yet an economic cost... The DNA information capacity in a single human body dwarfs the entire world disk capacity, so storing DNA is a no-brainer here. Sequencing groups are making very hard-nosed economic decisions about what to store - indeed it is a source of research in itself - but the scale of the problem is very much bigger.

My tuppence ha'penny is that depositing raw images along with everything else in the PDB is a nice idea but would have little impact on science (human/animal/plant health or understanding of biology).

1) If confined to structures in the PDB, the images would just be the ones giving the final best data - hence the ones least likely to have been problematic. I'd be more interested in SFs/maps for looking at ligand binding etc.

2) Unless this were done before paper acceptance, they would be of little use to referees seeking to review important structural papers. I'd like to see PDB validation reports (which could include automated data processing, perhaps culled from synchrotron sites, SFs and/or maps) made available to referees in advance of publication. This would be enabled by deposition, but could be achieved in other ways.

3) The datasets of interest to methods developers are unlikely to be the ones deposited. They should be in contact with synchrotron archives directly. Processing multiple lattices is a case in point here.

4) Remember that the average consumer of a PDB file is not a crystallographer - more likely a graduate student in a clinical lab, for whom things like occupancies and B-factors are far more serious concerns...

I'm not trivializing the issue, but importance is always relative. Are there outsiders on the panel to keep perspective?

Robert

--
Dr. Robert Esnouf, University Research Lecturer, ex-crystallographer and Head of Research Computing, Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford OX3 7BN, UK
Emails: rob...@strubi.ox.ac.uk and rob...@esnouf.com Tel: (+44) - 1865 - 287783 Fax: (+44) - 1865 - 287547
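For readers who want Robert's reduction chain at a glance, the arithmetic is trivial to script; the sizes below are simply the approximate figures quoted above, nothing more:

    # Approximate per-run data volumes quoted above (HiSeq, 15-day run;
    # BAM/VCF for one human genome at 30x coverage).
    TB, GB = 1e12, 1e9
    stages = [("raw images", 100 * TB),
              ("raw base calls", 5 * TB),
              ("mapped reads (BAM)", 120 * GB),
              ("variant calls (VCF)", 1 * GB)]
    for (name, size), (_, prev) in zip(stages[1:], stages[:-1]):
        print(f"{name}: {size / GB:,.0f} GB, "
              f"{prev / size:.0f}x smaller than the previous stage")

Run as-is, this prints reduction factors of roughly 20x, 42x and 120x per stage - which is the crux of the archiving decision each community faces.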
Re: [ccp4bb] To archive or not to archive, that's the question!
I was hesitant to add my opinion so far because I'm more used to listening to this forum than telling others what I think.

Why and what to deposit are absolutely interconnected. Once you decide why you want to do it, then you will probably know what the best format will be, and vice versa. Whether this deposition of raw images will or will not help future understanding of the biology I'm not sure. But storing those difficult datasets to help future software development sounds really far-fetched. It assumes that in the future crystallographers will never grow crystals that deliver difficult datasets. If that is the case, and in 10-20-30 years the next generation will be growing much better crystals, then they won't need such software development. If that is not the case, and once in a while (or more often) they will be getting something out of the ordinary, then software developers will take those datasets and develop whatever they need to handle such cases. Am I missing a point of the discussion here?

Regards,
Vaheh
Re: [ccp4bb] To archive or not to archive, that's the question!
I have no doubt there are software developers out there who have spent years building up their own personal collections of 'interesting' datasets, file formats and various oddities, which they take with them wherever they go and consider precious. Despite the fact that many bad datasets are collected daily at beamlines the world over, it is amazing how difficult it can be to find what you want when there is no open, single point-of-access repository to search. Simply asking the crystallographers and beamline scientists doesn't work: they are too busy doing their own jobs.

-- David
Re: [ccp4bb] To archive or not to archive, that's the question!
Dear Vaheh,

On Mon, Oct 31, 2011 at 03:18:07PM +0000, Oganesyan, Vaheh wrote:
> But to store those difficult datasets to help the future software development sounds really farfetched.

As far as I see the general plan, that would be a second stage (deposit all datasets) - the first one would be the datasets related directly to a given PDB entry.

> This assumes that in the future crystallographers will never grow crystals that will deliver difficult datasets.

Oh sure they will. And lots of those datasets will be available to developers ... being thrown a difficult problem under pressure is a very good way to get ideas, think outside the box, etc. However, developing solid algorithms is better done in a less hectic environment, with a large collection of similar problems (changing only one parameter at a time) on which to test a new method.

> If that is the case and in 10-20-30 years next generation will be growing much better crystals then they don't need such a software development.

They'll grow better crystals for the type of project we're currently struggling with, sure. But we'll still get poor crystals for projects we don't even attempt or tackle right now. Software development is a slow process, often working on a different timescale than the typical structure-solution project (obviously there are exceptions). So planning ahead for that time will prepare us. And yes, it will have an impact on the biology then. It's not just the here and now (and the next grant, the next high-profile paper) we should be thinking about.

> Am I missing a point of discussion here?

One small point maybe: there are very few developers out there - but a very large number of users who benefit from what they have done. Often the work is not very visible (It's just pressing a button or two ... so it must be trivial!) - which is a good thing: it has to be simple, robust, automatic and usable. I think if a large enough number of developers consider depositing images a very useful resource for their future development (and therefore a future benefit to a large number of users), it should be seriously considered, even if some of the advertised benefits have to be taken on trust.

Past developments in data processing have had a big impact on a lot of projects - high-profile or just the standard PhD-student nightmare - with often small return for the developers in terms of publications, grants or even citations (main paper or supplementary material). So maybe in the spirit of the festive season it is time to consider giving a little bit back? What is there to lose? Another 20 minutes of additional deposition work for the user, in return for maybe/hopefully saving a whole project 5 years down the line? Not a bad investment, it seems to me ...

Cheers

Clemens

--
Clemens Vonrhein, Ph.D.  vonrhein AT GlobalPhasing DOT com
Global Phasing Ltd., Sheraton House, Castle Park, Cambridge CB3 0AX, UK
BUSTER Development Group (http://www.globalphasing.com)
[ccp4bb] Archiving Images for PDB Depositions
Dear Crystallographers,

I am sending this to try to start a thread which addresses only the specific issue of whether to archive, at least as a start, the images corresponding to PDB-deposited structures. I believe there could be a real consensus about the low cost and usefulness of this degree of archiving, but the discussion keeps swinging around between all levels of archiving, obfuscating who is for what, and for what reason. What about this level, alone? All of the accompanying info is already entered into the PDB, so there would be no additional costs on that score. There could just be a simple link, added to the download files pulldown, which could say go to image archive, or something along those lines. Images would be pre-zipped, maybe even tarred, and people could just download from there. What's so bad?

The benefits are that sometimes there are structures in which resolution cutoffs might be unreasonable, or perhaps there is some potential radiation damage in the later frames that might be deleterious to interpretations, or perhaps there are ugly features in the images which are invisible or obscure in the statistics. In any case, it seems to me that this step would be pretty painless, as it is merely an extension of the current system - just add a link to the pulldown menu!

Best Regards,

Jacob Keller

--
Jacob Pearson Keller
Northwestern University
Medical Scientist Training Program
email: j-kell...@northwestern.edu
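To show how lightweight the user-facing side of this could be: if each entry's images sat behind a stable URL as a gzipped tarball, retrieval would be a few lines of standard-library Python. The URL scheme below is entirely made up for the sketch (no such endpoint exists); only the mechanics are meant to be illustrative.

    import tarfile
    import urllib.request

    def fetch_images(pdb_id, dest="."):
        # Hypothetical URL scheme, invented for illustration only.
        name = f"{pdb_id.lower()}_images.tar.gz"
        url = f"https://image-archive.example.org/{name}"
        # download the tarball, then unpack the frames into dest
        path, _ = urllib.request.urlretrieve(url, name)
        with tarfile.open(path, "r:gz") as tar:
            tar.extractall(dest)

    # fetch_images("1ABC")  # would pull and unpack the frames for entry 1ABC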
[ccp4bb] atomic scattering factors in REFMAC
Dear Refmac users,

I noticed that if I refine a structure containing SeMet, the Se atoms usually have big negative (red) difference-map peaks and high B-factors. As I understand from diffraction theory and from some discussions on CCP4bb, that may be because in REFMAC the atomic scattering factors are internally coded for copper radiation (CuKa). I tried the keyword anomalous wavelength 0.9683 and found that with it I got different values of the coefficient c for Se, Mn, and P, as shown in the REFMAC log file:

loop_
_atom_type_symbol
_atom_type_scat_Cromer_Mann_a1
_atom_type_scat_Cromer_Mann_b1
_atom_type_scat_Cromer_Mann_a2
_atom_type_scat_Cromer_Mann_b2
_atom_type_scat_Cromer_Mann_a3
_atom_type_scat_Cromer_Mann_b3
_atom_type_scat_Cromer_Mann_a4
_atom_type_scat_Cromer_Mann_b4
_atom_type_scat_Cromer_Mann_c
N  12.2126  0.0057  3.1322  9.8933  2.0125 28.9975  1.1663  0.5826 -11.5290
C   2.3100 20.8439  1.0200 10.2075  1.5886  0.5687  0.8650 51.6512   0.2156
H   0.4930 10.5109  0.3229 26.1257  0.1402  3.1424  0.0408 57.7997   0.0030
O   3.0485 13.2771  2.2868  5.7011  1.5463  0.3239  0.8670 32.9089   0.2508
SE 17.0006  2.4098  5.8196  0.2726  3.9731 15.2372  4.3543 43.8163  -1.0329
MN 11.2819  5.3409  7.3573  0.3432  3.0193 17.8674  2.2441 83.7543   1.3834
P   6.4345  1.9067  4.1791 27.1570  1.7800  0.5260  1.4908 68.1645   1.2650

As a result, the red peaks around Se are significantly lower, the Se B-factors are a bit smaller (e.g. 25.6 vs 23.1), and Rf is lowered by a bit more than 0.1% with the same input files. That looks pretty good. Still, I want to ask your opinion on the following:

1) Is this the proper way to specify atomic scattering factors? I found this keyword in the REFMAC documentation under the topic Simultaneous SAD experimental phasing and refinement, and I'm not sure whether I change something else when I specify it. I don't have separate F+, F- and corresponding SIGF+, SIGF- in my mtz, so SAD experimental phasing should not run.

2) Do you think it is safe to specify this keyword for every structure under refinement? Can it have drawbacks (apart from a wrong wavelength)? As I understand it, the theoretical Cromer-Mann curve can differ from experiment, but it is still better than not changing the scattering factors at all.

Thank you very much!!

With best regards,
Ivan Shabalin, Ph.D.
Research Associate, University of Virginia
4-224 Jordan Hall, 1340 Jefferson Park Ave., Charlottesville, VA 22908
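For anyone wanting to check numbers like these: the loop_ above holds the standard Cromer-Mann parameterisation of the spherical atomic scattering factor, f0(s) = a1*exp(-b1*s^2) + a2*exp(-b2*s^2) + a3*exp(-b3*s^2) + a4*exp(-b4*s^2) + c with s = sin(theta)/lambda, so a wavelength-dependent real correction folded into c is exactly what would make c change for Se once a wavelength is given (that interpretation of the c shift is my reading, not a statement from the REFMAC documentation). A quick evaluation in Python, with the Se coefficients copied from the log above:

    import math

    def f0(coefs, s):
        # coefs = (a1, b1, a2, b2, a3, b3, a4, b4, c); s in 1/Angstrom
        *ab, c = coefs
        return c + sum(a * math.exp(-b * s * s)
                       for a, b in zip(ab[0::2], ab[1::2]))

    # Se row from the log (c apparently already adjusted for 0.9683 A)
    se = (17.0006, 2.4098, 5.8196, 0.2726, 3.9731, 15.2372, 4.3543, 43.8163, -1.0329)
    print(f0(se, 0.0))    # about 30.1, i.e. Z = 34 plus a negative f'
    print(f0(se, 0.25))   # about 21.1 - falls off with resolution, as expected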
Re: [ccp4bb] Archiving Images for PDB Depositions
I have no problem with this idea as an opt-in. However I loathe being forced to do things - for my own good or anyone else's. But unless I read the tenor of this discussion completely wrongly, opt-in is precisely what is not being proposed.

Adrian Goldman

Sent from my iPhone
Re: [ccp4bb] Archiving Images for PDB Depositions
Pilot phase, opt-in - eventually, mandatory? Like structure factors?

Jacob
Re: [ccp4bb] Archiving Images for PDB Depositions
Loathe being forced to do things? You mean, like being forced to use programs developed by others at no cost to yourself?

I'm in a bit of a time-warp here - how exactly do users think our current suite of software got to be as astonishingly good as it is? Ten years ago people (non-developers) were saying exactly the same things - yet almost every talk on phasing and auto-building that I've heard ends up acknowledging the JCSG datasets. Must have been a waste of time then, I suppose.

phx.
Re: [ccp4bb] Archiving Images for PDB Depositions
Dear Adrian,

On Mon, Oct 31, 2011 at 06:29:50PM +0200, Adrian Goldman wrote:
> I have no problem with this idea as an opt-in. However I loathe being forced to do things - for my own good or anyone else's. But unless I read the tenor of this discussion completely wrongly, opt-in is precisely what is not being proposed.

I understood it slightly differently - see Gerard Bricogne's points in https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1110&L=CCP4BB&F=&S=&P=363135 - which sound very much like an opt-in. Such a starting point sounds very similar to the one we had with initial PDB submission (optional for publication) and then structure-factor deposition.

Cheers

Clemens

--
Clemens Vonrhein, Ph.D.  vonrhein AT GlobalPhasing DOT com
Global Phasing Ltd., Sheraton House, Castle Park, Cambridge CB3 0AX, UK
BUSTER Development Group (http://www.globalphasing.com)
Re: [ccp4bb] To archive or not to archive, that's the question!
The point is that science is not collecting stamps. Therefore the first question should always be Why. If you start with What, the discussion immediately switches to technical issues - how many TB or PB, how many $/EUR, how much manpower - and all the intense discussion will be blown away by a single Why. Nothing is for free. But if it would help science and mankind, nobody would hesitate to spend millions of $/EUR.

Supporting software development / software developers is a different question. If this had been the first question asked, the answer would never have been archiving all datasets worldwide / for all deposited structures, but rather: how could we, the community, build up a resource covering different kinds of problems (e.g. space groups, twinning, overlapping lattices, etc.)?

I still haven't got an answer to Why.

Best regards,
Martin
Re: [ccp4bb] To archive or not to archive, that's the question!
Dear Martin,

Thank you for this very clear message about your views on this topic. There is nothing like well-articulated dissenting views to force a real assessment of the initial arguments, and you have certainly provided that. As your presentation is modular, I will interleave my comments with your text, if you don't mind.

> Still, after hundreds (?) of emails on this topic, I haven't seen any convincing argument in favour of archiving data. The only convincing arguments are against, and are from Gerard K and Tassos. Why? The question is not what to archive, but still why we should archive all the data. Because software developers need more data? Should we raise all these efforts and costs because 10 developers worldwide need the data for ALL protein structures? Do they really need so much data? Wouldn't it be enough to build a repository of maybe 1000 datasets for development?

A first impression is that your remark rather looks down on those 10 developers worldwide - a view not out of keeping with that of structural biologists who have moved away from ground-level crystallography and view the latter as a mature technique, a euphemism for saying that no further improvements are likely nor even necessary. As Clemens Vonrhein has just written, it may be the very success of those developers that has given the benefit of what software can do to users who don't have the faintest idea of what it does, nor of how it does it, nor of what its limitations are and how to overcome them - and who therefore take it for granted.

Another side of the mature technique kiss of death is the underlying assumption that the demands placed on crystallographic methods are themselves static, and nothing could be more misleading. We get caught time and again by rushed shifts in technology, without the precaution of keeping raw data in case the first adaptations of the old methods do not perform as well as later ones. Let me quote an example: 3x3 CCD detectors. It was too quickly and hurriedly assumed that, after correcting the images recorded on these instruments for geometric distortions and flat-field response, one would get images that could be processed as if they came from image plates (or film). This turned out to be a mistake: corner effects were later diagnosed, which are partially correctable by a position-dependent modulation factor, applied for instance by XDS in response to the problem. Unfortunately, that correction is not just detector-dependent and applicable to all datasets recorded on a given detector: it is related to a spatial variation in the point-spread function, so you really need to reprocess each set of images to determine the necessary corrections. The tragic thing is that for a typical resolution limit and detector distance, these corners cut into the quick of your strongest secondary-structure-defining data. If you have kept your images, you can try to recover from that; otherwise, you are stuck with what can be seriously sub-optimal data. Imagine what this can do to SAD anomalous differences when Bijvoet pairs fall on detector positions where these corner effects are vastly different ...

Another example is that of the recent use of numerous microcrystals, each giving a very small amount of data, to assemble datasets for solving GPCR structures. The methods for doing this - for getting the indexing and integration of such thin slices of data, and the overall scaling, to behave - are still very rough. It would be pure insanity to throw these images away and not count on better algorithms coming along to improve the final data extractible from them.

> Does anyone really believe that our view of the actual problem, the function of the proteins, changes with the analysis of whatever scattering is still in the images but not used by today's software? Crystal structures are static snapshots, obtained under artificial conditions. In solution (still the physiological state) they might look different - not much, but at least far more dynamic. Does it therefore matter whether we know some side-chain positions better (in the crystal structure) when re-analysing the data? In turn, is our current software so bad that we would expect strong differences when re-analysing the data? No. And if the structures change upon re-analysis (more or less), who re-interprets the structures, who re-writes the papers?

I think that, rather than asking rhetorical questions about people's beliefs on such a general matter, one needs testimonies about real-life situations. We have helped a great many academic groups in the last 15 years: in every case, they ended up feeling really overjoyed that they had kept their images when they had, and immensely regretful when they hadn't. I noticed, for example, that your last PDB entry, 1LKX (2002), does not have structure factor data associated with it. It is therefore impossible to audit the model against the data from which it was derived.
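To picture the corner problem in more concrete terms: a correction of this kind is a smooth, position-dependent factor that matters only near the corners of the CCD modules, so whether a given reflection is affected depends on where it happens to land. Everything in the sketch below - the functional form, the 3% amplitude, the 50-pixel decay length - is invented for illustration; the real corrections (e.g. in XDS) are determined from the data themselves, which is precisely why the images must be kept.

    import math

    MODULE = 1024     # assumed module edge length, in pixels (illustrative)
    AMPLITUDE = 0.03  # assumed response loss right at a module corner
    DECAY = 50.0      # assumed decay length of the effect, in pixels

    def corner_correction(x, y):
        # factor by which to scale up an intensity observed at pixel (x, y)
        dx = min(x % MODULE, MODULE - x % MODULE)  # distance to nearest module edge in x
        dy = min(y % MODULE, MODULE - y % MODULE)  # ... and in y
        d = math.hypot(dx, dy)                     # distance to nearest module corner
        return 1.0 / (1.0 - AMPLITUDE * math.exp(-d / DECAY))

    # a Bijvoet mate landing on a corner is scaled ~3% relative to one mid-module
    print(corner_correction(1024, 2048))  # ~1.031, at a module corner
    print(corner_correction(512, 512))    # ~1.000, mid-module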
Re: [ccp4bb] But...what's on my tyrosine?
I would just model it with one water, especially if the resolution is worse than 1.8 A (I don't think you have better than that, based on the map). Only if the resolution is high and the R-factors are low would I worry about this peak.

Regards,
Ivan
Re: [ccp4bb] To archive or not to archive, that's the question!
I believe that archiving original images for published data sets could be very useful, if linked to the PDB. I have downloaded SFs from the PDB to use for re-refinement of the published model (if I think the electron density maps are misinterpreted) and personally had a different interpretation of the density (ion vs small ligand). With that in mind, re-processing from the original images could be useful for catching mistakes in processing (especially if a high R-factor or low I/sigma are reported), albeit it a small percentage of the time. As for difficult data sets, problematic cases, etc, I can see the importance of their availability by the preceding arguments. It seems to be most useful for software developers. In that case, I would suggest software developers to publicly request our difficult to process images, or create their own repository. Then they can store and use the data as they like. I would happily upload a few data sets. (Just a suggestion) Best Wishes, Kelly Daughtry *** Kelly Daughtry, Ph.D. Post-Doctoral Fellow, Raetz Lab Biochemistry Department Duke University Alex H. Sands, Jr. Building 303 Research Drive RM 250 Durham, NC 27710 P: 919-684-5178 *** On Mon, Oct 31, 2011 at 12:01 PM, Martin Kollmar m...@nmr.mpibpc.mpg.dewrote: The point is that science is not collecting stamps. Therefore the first question should always be Why. If you start with What the discussion immediately switches to technical issues like how many TB, PB etc. $/€, manpower. And all the intense discussion will blow out by one single Why. Nothing is for free. But if it would help science and mankind, nobody would hesitate to spend millions of $/€. Supporting software development / software developers is a different question. If this were the first question that someone would have asked the answer would have never been archiving all datasets worldwide / deposited structures, but how could we, the community, build up a resource with different kind of problems (e.g. space groups, twinning, overlapping lattices, etc.). I still didn't got an answer for Why. Best regards, Martin Am 31.10.2011 16:18, schrieb Oganesyan, Vaheh: I was hesitant to add my opinion so far because I'm used more to listen this forum rather than tell others what I think. Why and what to deposit are absolutely interconnected. Once you decide why you want to do it, then you will probably know what will be the best format and *vice versa*. Whether this deposition of raw images will or will not help in future understanding the biology better I'm not sure. But to store those difficult datasets to help the future software development sounds really farfetched. This assumes that in the future crystallographers will never grow crystals that will deliver difficult datasets. If that is the case and in 10-20-30 years next generation will be growing much better crystals then they don't need such a software development. If that is not the case, and once in a while (or more often) they will be getting something out of ordinary then software developers will take them and develop whatever they need to develop to consider such cases. Am I missing a point of discussion here? Regards, Vaheh -Original Message- From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UKCCP4BB@JISCMAIL.AC.UK] On Behalf Of Robert Esnouf Sent: Monday, October 31, 2011 10:31 AM To: CCP4BB@JISCMAIL.AC.UK Subject: Re: [ccp4bb] To archive or not to archive, that's the question! Dear All, As someone who recently left crystallography for sequencing, I should modify Tassos's point... 
A full data-set is a few terabytes, but post-processing reduces it to sub-Gb size.

My experience from HiSeqs is that 'full' here means the base calls - equivalent to the unmerged HKLs - hardly raw data. NGS (short-read) sequencing is an imaging technique, and the images are more like 100TB for a 15-day run on a single flow cell. The raw base calls are about 5TB. The compressed, mapped data (a BAM file, for a human genome at 30x coverage) come to about 120GB. It is only a variant call file (VCF, the differences from a stated human reference genome) that is sub-Gb, and these files are - unsurprisingly - unsuited to detailed statistical analysis. Also, $1k is not yet an economic cost... The DNA information capacity in a single human body dwarfs the entire world disk capacity, so storing DNA is a no-brainer here. Sequencing groups are making very hard-nosed economic decisions about what to store - indeed it is a source of research in itself - but the scale of the problem is very much bigger.

My tuppence ha'penny is that depositing raw images along with everything else in the PDB is a nice idea but would have little impact on science (human/animal/plant health or understanding of biology). 1) If confined to structures in the PDB, the images would just be the ones giving the final best data - hence the ones least likely to have been problematic. I'd be more interested in
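To make the scale of that reduction explicit, here is a back-of-envelope calculation over the sizes quoted above (plain Python; 1 TB is taken as 1024 GB and the sub-Gb VCF is rounded up to 1 GB, so the figures are indicative only):

    # Approximate sizes from the message above, in GB (illustrative only).
    stages = [
        ("raw images",            100 * 1024),  # ~100 TB per 15-day flow-cell run
        ("base calls",              5 * 1024),  # ~5 TB
        ("mapped BAM (30x)",              120),  # ~120 GB
        ("variant calls (VCF)",             1),  # sub-Gb, rounded up to 1 GB
    ]
    for (frm, a), (to, b) in zip(stages, stages[1:]):
        print("%-20s -> %-20s %6.0fx smaller" % (frm, to, a / b))
    # Prints roughly: 20x, 43x, 120x; overall, images -> VCF is a
    # reduction of about 100,000-fold.

The analogous MX chain (images -> unmerged intensities -> merged data -> deposited structure factors) is less extreme, but the same exercise makes explicit what each archiving choice throws away.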
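And on Kelly's point about pulling deposited data for re-refinement, the snippet below is a minimal sketch of automating that first step. It assumes the files.rcsb.org download pattern for structure-factor files, which should be checked against the current wwPDB services before use:

    import urllib.request  # Python 3 standard library

    def fetch_structure_factors(pdb_id, dest=None):
        # Assumed URL pattern: https://files.rcsb.org/download/<id>-sf.cif
        # (an assumption, not a documented guarantee - verify first).
        name = "%s-sf.cif" % pdb_id.lower()
        urllib.request.urlretrieve("https://files.rcsb.org/download/" + name,
                                   dest or name)
        return dest or name

    # fetch_structure_factors("1abc")  # "1abc" is a placeholder entry ID

If raw images were archived with a comparably simple download path, the same kind of audit could start from the diffraction data rather than from someone else's integration.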
Re: [ccp4bb] To archive or not to archive, that's the question!
Dear Martin, First of all I would like to say that I regret having made my earlier remark, and I apologise if you read it as a personal one - I just saw it as an example of a dataset it might have been useful to revisit if the data had been available in any form. I am sure that there are many skeletons in many cupboards, including my own :-) .

Otherwise, as the discussion does seem to be refocusing on the very initial proposal in gestation within the IUCr's DDDWG - i.e. voluntary involvement of depositors and of synchrotrons, so that questions of logistics and cost can be answered in the light of empirical evidence - your 'Why' question seems to be the only one left unanswered by this proposal. In this respect I wonder how you view the two examples I gave in my reply to your previous message, namely the corner-effects problem and the re-development of methods for collating data from numerous small, poorly diffracting crystals, as was done in the recent solution of GPCR structures. There also remains the example I have cited from the beginning, namely the integration of images displaying several overlapping lattices. (A sketch of what a catalogue entry for such problem cases might record appears at the end of this message.)

With best wishes, Gerard.

--

On Mon, Oct 31, 2011 at 05:01:38PM +0100, Martin Kollmar wrote:

The point is that science is not collecting stamps. Therefore the first question should always be 'Why'. If you start with 'What', the discussion immediately switches to technical issues like how many TB or PB, $/€, manpower. And all that intense discussion can be blown away by a single 'Why'. Nothing is for free, but if it would help science and mankind, nobody would hesitate to spend millions of $/€. Supporting software development / software developers is a different question. If that had been the first question asked, the answer would never have been to archive all datasets worldwide for all deposited structures, but rather: how could we, the community, build up a resource covering different kinds of problems (e.g. space groups, twinning, overlapping lattices, etc.)? I still haven't got an answer to 'Why'. Best regards, Martin

On 31.10.2011 16:18, Oganesyan, Vaheh wrote:

I was hesitant to add my opinion so far because I'm more used to listening to this forum than telling others what I think. Why and what to deposit are absolutely interconnected: once you decide why you want to do it, then you will probably know the best format, and *vice versa*. Whether this deposition of raw images will or will not help future understanding of the biology, I'm not sure. But storing those difficult datasets to help future software development sounds really far-fetched. It assumes that in the future crystallographers will never grow crystals that deliver difficult datasets. If that is the case, and in 10-20-30 years the next generation will be growing much better crystals, then they won't need such software development. If that is not the case, and once in a while (or more often) they get something out of the ordinary, then software developers will take those datasets and develop whatever they need to handle such cases. Am I missing the point of the discussion here? Regards, Vaheh

-----Original Message----- From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Robert Esnouf Sent: Monday, October 31, 2011 10:31 AM To: CCP4BB@JISCMAIL.AC.UK Subject: Re: [ccp4bb] To archive or not to archive, that's the question!

Dear All, As someone who recently left crystallography for sequencing, I should modify Tassos's point...

A full data-set is a few terabytes, but post-processing reduces it to sub-Gb size.
My experience from HiSeqs is that 'full' here means the base calls - equivalent to the unmerged HKLs - hardly raw data. NGS (short-read) sequencing is an imaging technique, and the images are more like 100TB for a 15-day run on a single flow cell. The raw base calls are about 5TB. The compressed, mapped data (a BAM file, for a human genome at 30x coverage) come to about 120GB. It is only a variant call file (VCF, the differences from a stated human reference genome) that is sub-Gb, and these files are - unsurprisingly - unsuited to detailed statistical analysis. Also, $1k is not yet an economic cost... The DNA information capacity in a single human body dwarfs the entire world disk capacity, so storing DNA is a no-brainer here. Sequencing groups are making very hard-nosed economic decisions about what to store - indeed it is a source of research in itself - but the scale of the problem is very much bigger.

My tuppence ha'penny is that depositing raw images along with everything else in the PDB is a nice idea but would have little impact on science (human/animal/plant health or understanding of biology). 1) If confined to structures in the PDB, the images would just be the ones giving the final best data - hence the ones least likely to have been problematic. I'd be more interested in
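Coming back to Martin's alternative of a community resource of problem cases rather than blanket archiving: even a very small amount of structured metadata per dataset would make such a resource searchable by kind of pathology (twinning, overlapping lattices, corner effects, ...). The sketch below is purely illustrative, and every field name in it is hypothetical rather than part of any existing schema:

    # Illustrative only: a minimal catalogue record for a repository of
    # 'difficult' datasets. Field names are hypothetical, not a standard.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ProblemDataset:
        images_url: str                    # where the raw images actually live
        space_group: str = ""
        problems: List[str] = field(default_factory=list)  # e.g. "twinning"
        notes: str = ""

    entry = ProblemDataset(
        images_url="https://example.org/datasets/123",  # placeholder location
        space_group="P 21",
        problems=["overlapping lattices", "corner effects"],
        notes="Two lattices; integrated intensities contaminated.",
    )

Whether such records live in a central index or alongside the images themselves is a separate question; the point is only that a handful of fields would already support the 'give developers the pathological cases' use that several posters describe.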
[ccp4bb] Postdoctoral position, bacterial membrane protein
Postdoctoral Position Available - Department of Biological Chemistry, University of Michigan

A position is open immediately for a highly motivated postdoctoral researcher to investigate the function and structure of an intrinsic membrane protein from pathogenic bacteria involved in capsule polysaccharide export. The researcher will develop an expression system to purify sufficient protein for crystallization and solid-state NMR experiments, as well as investigate function at the cellular level. See our recent paper (http://dx.doi.org/10.1021/bi101869h), which describes another protein from the same operon. A doctoral degree in biochemistry, microbiology, or a related field is required, preferably with experience in membrane proteins. Please email a CV, a short description of your research accomplishments and career plans, and the names of three references to Prof. Mark A. Saper, sa...@umich.edu.

Mark A. Saper, Ph.D. Associate Professor of Biological Chemistry University of Michigan Biophysics, 3040 Chemistry Building, 930 N University Ave, Ann Arbor MI 48109-1055, U.S.A. sa...@umich.edu phone (734) 764-3353 fax (734) 764-3323 http://www.biochem.med.umich.edu/?q=saper http://www.strucbio.umich.edu/