[Bioc-devel] Using integrated containers in Bioconductor packages
Hi package developers --

I found this article pretty interesting reading

http://www.nature.com/nbt/journal/v31/n10/full/nbt.2721.html?WT.ec_id=NBT-201310

especially of course the comments of Robert Gentleman and the reasons for success of R (external packages written by domain experts) and Bioconductor (interoperability between different analysis capabilities enabled by using similar data structures). It's also very important to provide 'integrated' containers that couple, say, a matrix of expression count data with the annotations of the genes / gene regions (rows) and sample phenotypic data (columns).

With these ideas in mind, I want to emphasize that new and existing Bioconductor packages should be re-using established data structures. With omics data it is very important to offer users a way to easily work with data across Bioconductor packages. While you might implement 'internal' functions that perform numerical calculations on an R `matrix`, say, the major input functions should really support GenomicRanges::SummarizedExperiment objects, rather than (in addition to?) plain old matrix objects.

The rowData of summarized experiments can minimally contain names like the rownames() of a matrix, but can typically contain much more useful information, e.g., the genomic coordinates of the regions of interest (as GRanges or GRangesList objects) and / or other attributes that are useful to your own analysis (GC content of each region?) or to the user (p-values from previous analysis?). Similarly the colData can be simple identifiers like colnames() of a matrix, but it's much more informative to tightly couple the phenotypic data about the samples. This makes it easy and error-free for the user to do things like subset both the phenotype and expression data by some phenotype of interest, e.g., se[, colData(se)$Gender %in% "Female"].

Return values should respect the row and column indices of the inputs as appropriate, so for instance it's easy for the user to add a matrix (assays(se)[["foo"]] <- foo(se, ...)), or vector or data.frame (preferably, DataFrame) (mcols(colData)$bar <- bar(se, ...)) of results to their summarized experiment. It may often be appropriate to do this work for the user, returning a SummarizedExperiment annotated with your additional results.

There are similar data structures for other types of data, e.g., Biobase::ExpressionSet for microarrays and in the flow cell packages. Feel free to ask on this list if you're looking for guidance. Not all return values are as simple as a vector, matrix, or data.frame, and of course one should not try to fit these into an inappropriate data structure.

Martin
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
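
A minimal sketch of the pattern described in this message, offered as an illustration rather than code from the original post. It assumes a recent Bioconductor release, where the class is provided by the SummarizedExperiment package (the message refers to GenomicRanges::SummarizedExperiment). The gene and sample names, the Gender values, and the logCounts and libSize results are made up for the example.

```r
# Assumption: the SummarizedExperiment package from Bioconductor is installed.
library(SummarizedExperiment)

# Toy count matrix: 4 genes (rows) by 3 samples (columns); values are made up.
counts <- matrix(1:12, nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# Phenotypic data, one row per sample, keyed by the matrix column names.
pheno <- DataFrame(Gender = c("Female", "Male", "Female"),
                   row.names = colnames(counts))

# Couple the assay data with the sample annotation in one container.
se <- SummarizedExperiment(assays = list(counts = counts), colData = pheno)

# One subsetting operation keeps counts and phenotype data in sync.
females <- se[, colData(se)$Gender %in% "Female"]

# Results with the same dimensions as the input go back in as a new assay ...
assays(se)[["logCounts"]] <- log2(assay(se, "counts") + 1)

# ... and per-sample results go back into colData.
colData(se)$libSize <- colSums(assay(se, "counts"))
```

Because the results are stored alongside the inputs, any later subsetting of `se` carries them along without further bookkeeping by the user.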
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
The 'foreach' framework does this sort of analysis using codetools at least in part. You may be able to build on what they have.

luke

On Mon, 4 Nov 2013, Ryan wrote:

On 11/4/13, 11:05 AM, Gabriel Becker wrote:

As a side note, I'm not sure that existence of a symbol is sufficient (it certainly is necessary). What about situations where the symbol exists but is stale compared to the value in the parent? Are we sure that can never happen?

I think this is a different issue. We want to detect when a function depends on variables outside that function in the user's workspace, or variables defined in a package that the user has loaded. I think we can assume that R child processes will be of the same version with the same set of installed packages, so package-defined variables will not have different values in child processes. For user variables, I think the goal should be to prevent (or at least highly discourage) dependencies on them entirely, so I don't think it matters what their value may be in the child. I realize this is somewhat counter to the question that started this thread, which was about exporting variables to the children, but I think it is the most straightforward approach.

As I believe someone noted earlier in the thread, Henrik's original problem of a recursive function is properly solved by using the Recall function.

-Ryan

--
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone: 319-335-3386
Department of Statistics and        Fax: 319-335-3017
Actuarial Science
241 Schaeffer Hall                  email: luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW: http://www.stat.uiowa.edu
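
As a rough illustration of the kind of analysis mentioned in this thread, codetools::findGlobals() lists the symbols a function uses but does not define itself. The sketch below is not how foreach or BiocParallel implement their detection; workspace_deps() and the toy functions are hypothetical names introduced only for this example.

```r
library(codetools)

# Hypothetical helper: which symbols used by 'fun' are defined in the
# environment where 'fun' was created (the workspace, for top-level code)?
workspace_deps <- function(fun, env = environment(fun)) {
  globals <- findGlobals(fun, merge = TRUE)
  found <- vapply(globals, exists, logical(1), envir = env, inherits = FALSE)
  globals[found]
}

scale_by <- 10                                   # a user variable

f <- function(x) x * scale_by                    # silently depends on 'scale_by'
g <- function(x, scale_by) x * scale_by          # dependency passed explicitly

# Recursion by name ties the function to a symbol in the workspace ...
fact_named <- function(n) if (n <= 1) 1 else n * fact_named(n - 1)

# ... whereas Recall() refers to the running function without naming it.
fact_recall <- function(n) if (n <= 1) 1 else n * Recall(n - 1)

deps <- lapply(list(f = f, g = g,
                    fact_named = fact_named,
                    fact_recall = fact_recall),
               workspace_deps)
```

In this sketch only `f` and `fact_named` report a workspace dependency, which is the situation a worker process started without the user's workspace would trip over.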
Re: [Bioc-devel] Bioc-devel Digest, Vol 116, Issue 3
p

Sent from a cell phone, please excuse bypos and terseness.

On Nov 3, 2013 3:04 AM, bioc-devel-requ...@r-project.org wrote:

Today's Topics:

1. Re: lumiT VST Normalisation Fails (Dario Strbenac)

--

Message: 1
Date: Sun, 3 Nov 2013 04:00:49 +
From: Dario Strbenac dstr7...@uni.sydney.edu.au
To: Pan Du dupan.m...@gmail.com
Cc: bioc-devel@r-project.org
Subject: Re: [Bioc-devel] lumiT VST Normalisation Fails
Message-ID: b1a39f01485eae1c874c83d71...@blupr01mb035.prod.exchangelabs.com
Content-Type: text/plain

I read the journal article about the variance stabilizing method. I understand how it uses the p-value in its algorithm now. Some better error checking and a clearer message to the user would be useful if the user created the object with new() from IDAT files. I will ask for the data to be exported again, with the necessary columns. It's also missing the number of beads column.

From: du4p...@gmail.com on behalf of Pan Du dupan.m...@gmail.com
Sent: Saturday, 2 November 2013 3:31 AM
To: Dario Strbenac
Cc: bioc-devel@r-project.org
Subject: Re: lumiT VST Normalisation Fails

Hi Dario

You are correct that the LumiBatch class only requires exprs and se.exprs. But for the VST transform in lumiT, the detection p-values help it better estimate the transformation. (For user flexibility, I will make it also accept a LumiBatch without a detection matrix.)

The SampleIDs are the prefixes of the column names that identify samples in the Illumina BeadStudio/GenomeStudio output file. The vignette lumi.pdf has some example plots. In order to match the controlData file with the expression data file, these column headers should match up.

Pan

On Thu, Oct 31, 2013 at 11:00 PM, Dario Strbenac dstr7...@uni.sydney.edu.au wrote:

Hello,

I have a LumiBatch object, but the lumiT function produces an error.

    class(treatmentBatch)
    [1] "LumiBatch"
    attr(,"package")
    [1] "lumi"
    treatmentBatch <- lumiT(treatmentBatch)
    Perform vst transformation ...
    Error in !assayDataValidMembers(assayData(x.lumi), "detection") : invalid argument type

Since I've created a valid object of class LumiBatch, it should work on that object without errors. In fact, the documentation of the class states: "The arguments to new should include exprs and se.exprs, others can be missing, in which case they are assigned default values." Nothing in the documentation of lumiT states that detection p-values are required, so it's a mystery to the lumi end-user what obscure format of parameters the package author expects them to provide.

Another example of this type of problem is

    treatmentBatch <- addControlData2lumi(controlTable, treatmentBatch)
    Error in addControlData2lumi(controlTable, treatmentBatch) : SampleID does not match up between controlData and x.lumi!

controlData is described as "a data.frame with first two columns as "controlType" and "ProbeID". The rest columns are the expression amplitudes for individual samples." Searching the entire PDF manual of the help pages also does not show SampleID described anywhere.

I am using lumi 2.14.0.

--
Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia

End of Bioc-devel Digest, Vol 116, Issue 3
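
The error reported in this thread is consistent with a validator that returns either TRUE or a character message being negated directly; that reading is an assumption, not something confirmed in the thread. Below is a base R sketch of that failure and of the kind of check and message asked for above. The valid_members() and check_detection() helpers are hypothetical, and a plain list stands in for the LumiBatch object.

```r
# Hypothetical validator following a "TRUE or error message" convention.
valid_members <- function(x, required) {
  absent <- setdiff(required, names(x))
  if (length(absent) == 0) {
    TRUE
  } else {
    paste("missing required element(s):", paste(absent, collapse = ", "))
  }
}

# Stand-in for an object that has exprs and se.exprs but no detection matrix.
batch <- list(exprs = matrix(1:4, 2), se.exprs = matrix(0.1, 2, 2))

# Negating the result directly fails when the validator returns a message,
# because '!' cannot be applied to a character string.
raw_error <- tryCatch(!valid_members(batch, "detection"),
                      error = function(e) conditionMessage(e))

# Testing with isTRUE() lets the function report what is actually missing.
check_detection <- function(x) {
  ok <- valid_members(x, "detection")
  if (!isTRUE(ok)) {
    stop("VST needs a 'detection' matrix of p-values: ", ok, call. = FALSE)
  }
  invisible(TRUE)
}

clear_error <- tryCatch(check_detection(batch),
                        error = function(e) conditionMessage(e))
```

The design choice here is simply to turn an internal type error into a message that names the missing input, so the user knows which column to export again.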
[Bioc-devel] Submitting two strongly related packages (software + data)
Hi,

I have never used a mailing list before so do not hesitate to let me know if I'm doing something wrong.

I have been growing increasingly fond of the Bioconductor project and, along with a small group, intend to submit a package. We have tried to follow package guidelines as much as possible and noticed a size limit on software packages (4MB). The example data sets will greatly exceed that quota (~150MB) so we figured we'd package the data separately (e.g. minfi and minfiData). In that case, what is the right order for submitting the packages? Of course, the data package will be completely documented first so should we

1. submit the data package, then the software package? (You will have to take our word that the software package is coming up.)
2. submit the software package, then the data package? (The software depends on the data though for the vignette!)
3. submit both at the same time.

Also, is there a way to test for Bioconductor compliance like the --as-cran option of the R CMD check command?

Thank you for your help. I'm looking forward to becoming an active developer in the Bioconductor community.

---
Nicolas De Jay
M.Sc. Student
Department of Human Genetics
Montreal Children's Hospital Research Institute, McGill University Health Centre
4060 Ste Catherine West, PT-239
Montreal, QC H3Z2Z3, Canada
T: (514) 412-4440 | E: nicolas.de...@mail.mcgill.ca
Re: [Bioc-devel] Submitting two strongly related packages (software + data)
Hi Nicolas,

- Original Message -
From: Nicolas De Jay nicolas.de...@mail.mcgill.ca
To: bioc-devel@r-project.org
Sent: Tuesday, November 5, 2013 4:02:20 PM
Subject: [Bioc-devel] Submitting two strongly related packages (software + data)

Hi, I have never used a mailing list before so do not hesitate to let me know if I'm doing something wrong. I have been growing increasingly fond of the Bioconductor project and, along with a small group, intend to submit a package. We have tried to follow package guidelines as much as possible and noticed a size limit on software packages (4MB). The example data sets will greatly exceed that quota (~150MB) so we figured we'd package the data separately (e.g. minfi and minfiData). In that case, what is the right order for submitting the packages? Of course, the data package will be completely documented first so should we

1. submit the data package, then the software package? (You will have to take our word that the software package is coming up.)
2. submit the software package, then the data package? (The software depends on the data though for the vignette!)
3. submit both at the same time.

Submit both at the same time, to the same issue in our tracker. This makes it easier for us to know that the packages go together.

Also, is there a way to test for Bioconductor compliance like the --as-cran option of the R CMD check command?

Not really. Part of Bioconductor compliance is a subjective review by a human looking at the package, and that can't be automated. Reading our guidelines carefully should minimize these problems. However, once you submit a package it will automatically be built and checked on all platforms, which may turn up some issues.

Dan

Thank you for your help. I'm looking forward to becoming an active developer in the Bioconductor community.
[Bioc-devel] cleaning up obsolete warnings
I'm getting a little tired of this one popping up:

In x %in% other :
  Starting with BioC 2.12, the behavior of %in% on GenomicRanges objects has changed to use *equality* instead of *overlap* for comparing elements between GenomicRanges objects 'x' and 'table'. Now 'x[i]' and 'table[j]' are considered to match when they are equal (i.e. 'x[i] == table[j]'), instead of when they overlap. This new behavior is consistent with base::`%in%`(). If you need the old behavior, please use:
    query %over% subject

Now that even release is Bioc 2.13, can we remove it? And there might be others.

Thanks,
Michael
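
For readers unfamiliar with the distinction the warning draws, here is a small sketch contrasting the two operators. It assumes the GenomicRanges package from Bioconductor is installed; the ranges are made up for the example.

```r
# Assumption: the GenomicRanges package from Bioconductor is installed.
library(GenomicRanges)

# Two toy sets of ranges on chr1 (coordinates are made up).
x   <- GRanges("chr1", IRanges(start = c(1, 20), end = c(10, 30)))
tbl <- GRanges("chr1", IRanges(start = c(5, 20), end = c(15, 30)))

# Equality: only chr1:20-30 appears, identically, in both objects.
by_equality <- x %in% tbl

# Overlap: chr1:1-10 also overlaps chr1:5-15, so both elements match.
by_overlap <- x %over% tbl
```

Code written for the pre-2.12 behavior would need `%over%` to keep treating the first range as a match.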