[Bioc-devel] Using integrated containers in Bioconductor packages

2013-11-05 Thread Martin Morgan

Hi package developers --

I found this article pretty interesting reading

  http://www.nature.com/nbt/journal/v31/n10/full/nbt.2721.html?WT.ec_id=NBT-201310

especially of course the comments of Robert Gentleman and the reasons for 
success of R (external packages written by domain experts) and Bioconductor 
(interoperability between different analysis capabilities enabled by using 
similar data structures). It's also very important to provide 'integrated' 
containers that couple, say, a matrix of expression count data with the 
annotations of the genes / gene regions (rows) and sample phenotypic data (columns).


With these ideas in mind, I want to emphasize that new and existing Bioconductor 
packages should be re-using established data structures. With omics data it is 
very important to offer users a way to easily work with data across Bioconductor 
packages. While you might implement 'internal' functions that perform numerical 
calculations on an R `matrix`, say, the major input functions should really 
support GenomicRanges::SummarizedExperiment objects, rather than (in addition 
to?) plain old matrix objects.
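
One way to follow this advice is to keep the numerical work in a method for plain matrices and add a thin SummarizedExperiment method that delegates to it. The sketch below assumes the current SummarizedExperiment package (the class lived in GenomicRanges when this message was written); the generic `rowScore` and the toy data are hypothetical, used only for illustration.

```r
library(SummarizedExperiment)

# Hypothetical generic, for illustration only.
setGeneric("rowScore", function(x, ...) standardGeneric("rowScore"))

# 'Internal' numerical work on a plain matrix.
setMethod("rowScore", "matrix", function(x, ...) rowMeans(x))

# User-facing method: pull out the first assay and delegate to the matrix method.
setMethod("rowScore", "SummarizedExperiment", function(x, ...) {
    rowScore(assay(x), ...)
})

# Toy data: 3 genes by 2 samples.
counts <- matrix(1:6, nrow = 3,
                 dimnames = list(paste0("gene", 1:3), c("s1", "s2")))
se <- SummarizedExperiment(assays = list(counts = counts))

scores <- rowScore(se)   # one value per row, named by gene
```

Because the matrix method does the real work, existing matrix-based callers keep working while users of the integrated container get the same result without unpacking it themselves.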


The rowData of summarized experiments can minimally contain names like the 
rownames() of a matrix, but can typically contain much more useful information, 
e.g., the genomic coordinates of the regions of interest (as GRanges 
or GRangesList objects) and / or other attributes that are useful to your own 
analysis (GC content of each region?) or to the user (p-values from previous 
analysis?). Similarly the colData can be simple identifiers like colnames() of a 
matrix, but it's much more informative to tightly couple the phenotypic data 
about the samples. This makes it easy and error-free for the user to do things 
like subset both the phenotype and expression data by some phenotype of 
interest, e.g., se[, colData(se)$Gender %in% "Female"].
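
The coordinated subsetting described above can be sketched as follows. This assumes the current SummarizedExperiment package; the count matrix, sample names, and `Gender` values are made up for illustration.

```r
library(SummarizedExperiment)

# Toy data: 3 genes by 4 samples.
counts <- matrix(1:12, nrow = 3,
                 dimnames = list(paste0("gene", 1:3), paste0("s", 1:4)))

# Phenotype data, one row per sample (column of 'counts').
pheno <- DataFrame(Gender = c("Female", "Male", "Female", "Male"),
                   row.names = colnames(counts))

se <- SummarizedExperiment(assays = list(counts = counts), colData = pheno)

# One subsetting operation keeps assay columns and phenotype rows in sync.
females <- se[, colData(se)$Gender %in% "Female"]

colnames(females)          # samples retained
colData(females)$Gender    # phenotype rows retained, in the same order
```

With a bare matrix and a separate data.frame the user would have to subset both objects and hope the orders agree; here that bookkeeping is done by the container.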


Return values should respect the row and column indices of the inputs as 
appropriate, so for instance it's easy for the user to add a matrix 
(assays(se)[["foo"]] <- foo(se, ...)), or vector or data.frame (preferably, 
DataFrame) mcols(colData)$bar <- bar(se, ...) of results to their summarized 
experiment. It may often be appropriate to do this work for the user, returning 
a SummarizedExperiment annotated with your additional results.
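
A minimal sketch of attaching results to the container, assuming the current SummarizedExperiment package. The data and the names `logcounts` and `libSize` are hypothetical; the per-sample result is stored with `colData(se)$libSize`, which is one accessor that works in current releases.

```r
library(SummarizedExperiment)

# Toy data: 3 genes by 2 samples.
counts <- matrix(c(0, 3, 7, 1, 15, 2), nrow = 3,
                 dimnames = list(paste0("gene", 1:3), c("s1", "s2")))
se <- SummarizedExperiment(
    assays = list(counts = counts),
    colData = DataFrame(Gender = c("Female", "Male"),
                        row.names = colnames(counts)))

# Matrix-shaped result: same dimensions and dimnames as the input assay.
assays(se)[["logcounts"]] <- log2(assay(se, "counts") + 1)

# Per-sample result: one value per column, stored next to the phenotype data.
colData(se)$libSize <- colSums(assay(se, "counts"))

assayNames(se)
colData(se)
```

Both assignments only succeed when the result has the same row or column extent as the container, which is a useful check that a return value really does respect the input indices.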


There are similar data structures for other types of data, e.g., 
Biobase::ExpressionSet for microarrays and in the flow cell packages. Feel free 
to ask on this list if you're looking for guidance.


Not all return values are as simple as a vector, matrix, or data.frame, and of 
course one should not try to fit this into an inappropriate data structure.


Martin
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-05 Thread luke-tierney

The 'foreach' framework does this sort of analysis using codetools at
least in part. You may be able to build on what they have.

luke

On Mon, 4 Nov 2013, Ryan wrote:



On 11/4/13, 11:05 AM, Gabriel Becker wrote:

As a side note, I'm not sure that existence of a symbol is sufficient (it
certainly is necessary). What about situations where the symbol exists but
is stale compared to the value in the parent? Are we sure that can never
happen?

I think this is a different issue. We want to detect when a function depends 
on variables outside that function in the user's workspace, or variables 
defined in a package that the user has loaded. I think we can assume that R 
child processes will be of the same version with the same set of installed 
packages, so package-defined variables will not have different values in 
child processes. For user variables, I think the goal should be to prevent 
(or at least highly discourage) dependencies on them entirely, so I don't 
think it matters what their value may be in the child. I realize this is 
somewhat counter to the question that started this thread, which was about 
exporting variables to the children, but I think it is the most 
straightforward approach. As I believe someone noted earlier in the thread, 
Henrik's original problem of a recursive function is properly solved by using 
the Recall function.


-Ryan
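
The kind of analysis discussed in this thread can be sketched with codetools, which ships with R. This is only an illustration of detecting free variables and of the Recall() suggestion, not a description of what foreach or BiocParallel actually do; the functions `f`, `g`, and `h` are made up.

```r
library(codetools)

scale_by <- 10                      # lives in the user's workspace
f <- function(x) x * scale_by       # silently depends on 'scale_by'

# Split the free symbols of 'f' into functions and variables.
globals_f <- findGlobals(f, merge = FALSE)
globals_f$variables                 # includes "scale_by"

# A self-recursive function refers to its own name as a global ...
g <- function(n) if (n < 2) n else g(n - 1) + g(n - 2)

# ... whereas Recall() removes that dependency on the workspace.
h <- function(n) if (n < 2) n else Recall(n - 1) + Recall(n - 2)

"g" %in% findGlobals(g, merge = FALSE)$functions
"h" %in% findGlobals(h, merge = FALSE)$functions
```

A function whose only free symbols resolve to attached packages can be shipped to a child process as-is; anything reported under `variables` is a candidate for an explicit argument instead.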




--
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:  319-335-3386
Department of Statistics and        Fax:    319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:  luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu



Re: [Bioc-devel] Bioc-devel Digest, Vol 116, Issum

2013-11-05 Thread Eugene Bolotin, PhD
p
Sent from a cell phone, please excuse bypos and terseness.
On Nov 3, 2013 3:04 AM, bioc-devel-requ...@r-project.org wrote:


 Today's Topics:

1. Re: lumiT VST Normalisation Fails (Dario Strbenac)


 --

 Message: 1
 Date: Sun, 3 Nov 2013 04:00:49 +
 From: Dario Strbenac dstr7...@uni.sydney.edu.au
 To: Pan Du dupan.m...@gmail.com
 Cc: bioc-devel@r-project.org bioc-devel@r-project.org
 Subject: Re: [Bioc-devel] lumiT VST Normalisation Fails
 Message-ID:
 
 b1a39f01485eae1c874c83d71...@blupr01mb035.prod.exchangelabs.com
 Content-Type: text/plain

 I read the journal article about the variance stabilizing method. I
 understand how it uses the p-value in its algorithm now. Some better error
 checking and message to the user would be useful, if the user created the
 object with new() from IDAT files. I will ask for the data to be exported
 again, with the necessary columns. It's also missing the number of beads
 column.

 
 From: du4p...@gmail.com du4p...@gmail.com on behalf of Pan Du 
 dupan.m...@gmail.com
 Sent: Saturday, 2 November 2013 3:31 AM
 To: Dario Strbenac
 Cc: bioc-devel@r-project.org
 Subject: Re: lumiT VST Normalisation Fails

 Hi Dario

 You are correct that the LumiBatch class only requires exprs and se.exprs.
 But for the VST transform in lumiT, the detection p-values help it better
 estimate the transformation. (For user flexibility, I will make it also
 accept a LumiBatch without a detection matrix.)

 As for the SampleIDs, they are the prefixes of the column names that identify
 samples in the Illumina BeadStudio/GenomeStudio output file.  The vignette
 lumi.pdf has some example plots. In order to match the controlData file with
 the expression data file, these column headers should match up.

 Pan


 On Thu, Oct 31, 2013 at 11:00 PM, Dario Strbenac 
 dstr7...@uni.sydney.edu.au wrote:
 Hello,

 I have a LumiBatch object, but the lumiT function produces an error.

  > class(treatmentBatch)
 [1] "LumiBatch"
 attr(,"package")
 [1] "lumi"
  > treatmentBatch <- lumiT(treatmentBatch)
 Perform vst transformation ...
 Error in !assayDataValidMembers(assayData(x.lumi), "detection") :
   invalid argument type

 Since I've created a valid object of class LumiBatch, it should work on
 that object without errors. In fact, the documentation of the class states:
 "The arguments to new should include exprs and se.exprs, others can be
 missing, in which case they are assigned default values." Nothing in the
 documentation of lumiT states that detection p-values are required, so it's
 a mystery to the lumi end-user what obscure format of parameters the
 package author expects them to provide.
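
The "invalid argument type" message above is what R reports when `!` is applied to a character string, which would happen if a validator returns a message instead of FALSE on failure. That cause is an assumption, not something confirmed in this thread. The sketch below uses a hypothetical stand-in validator (`has_elements`) in base R to show the failure and a more informative check.

```r
# Hypothetical stand-in for a validator that returns TRUE on success and a
# character message on failure.
has_elements <- function(available, required) {
    absent <- setdiff(required, available)
    if (length(absent) == 0L) {
        TRUE
    } else {
        paste("missing element(s):", paste(absent, collapse = ", "))
    }
}

res <- has_elements(c("exprs", "se.exprs"), "detection")

# Negating a character string raises an error rather than returning TRUE/FALSE.
negation <- tryCatch(!res, error = function(e) conditionMessage(e))

# isTRUE() handles both return types and allows an informative message.
check_elements <- function(available, required) {
    res <- has_elements(available, required)
    if (!isTRUE(res))
        stop("cannot run the transformation: ", res, call. = FALSE)
    invisible(TRUE)
}

msg <- tryCatch(check_elements(c("exprs", "se.exprs"), "detection"),
                error = function(e) conditionMessage(e))
msg
```

The second form tells the user which element is missing, which is the kind of error checking requested here.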

 Another example of this type of problem is

  > treatmentBatch <- addControlData2lumi(controlTable, treatmentBatch)
 Error in addControlData2lumi(controlTable, treatmentBatch) :
   SampleID does not match up between controlData and x.lumi!

 controlData is described as "a data.frame with first two columns as
 controlType and ProbeID. The rest columns are the expression amplitudes
 for individual samples." Searching the entire PDF manual of the help pages
 also does not show SampleID described anywhere.
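
Before calling a function that matches samples by column header, a user can check the headers directly. This base R sketch uses a made-up control table and sample names; it compares whole column names and does not reproduce whatever prefix matching lumi performs internally.

```r
# Hypothetical control table: two annotation columns, then one column per sample.
controlTable <- data.frame(controlType = c("housekeeping", "negative"),
                           ProbeID = c(101L, 102L),
                           sampleA = c(8.1, 6.3),
                           sampleB = c(8.4, 6.0))

# Hypothetical sample names taken from the expression data.
expressionSamples <- c("sampleA", "sampleC")

# Everything after the first two columns is treated as a sample column.
controlSamples <- colnames(controlTable)[-(1:2)]

# Samples present in the expression data but absent from the control table.
unmatched <- setdiff(expressionSamples, controlSamples)
if (length(unmatched) > 0L)
    message("samples without control data: ",
            paste(unmatched, collapse = ", "))
```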

 I am using lumi 2.14.0.

 --
 Dario Strbenac
 PhD Student
 University of Sydney
 Camperdown NSW 2050
 Australia




 End of Bioc-devel Digest, Vol 116, Issue 3


[Bioc-devel] Submitting two strongly related packages (software + data)

2013-11-05 Thread Nicolas De Jay
Hi,

I have never used a mailing list before so do not hesitate to let me know
if I'm doing something wrong.

I have been growing increasingly fond of the Bioconductor project and,
along with a small team, intend to submit a package.  We have tried to follow
package guidelines as much as possible and noticed a size limit on software
packages (4MB).  The example data sets will greatly exceed that quota
(~150MB) so we figured we'd package the data separately (e.g. minfi and
minfiData).  In that case, what is the right order for submitting the
packages?  Of course, the data package will be completely documented first
so should we

1. submit the data package, then the software package? (You will have to
take our word that the software package is coming up.)
2. submit the software package, then the data package? (The software
depends on the data though for the vignette!)
3. submit both at the same time.

Also, is there a way to test for Bioconductor compliance like the --as-cran
option of the R CMD check command?

Thank you for your help.  I'm looking forward to becoming an active
developer in the Bioconductor community.

---
*Nicolas De Jay *
M.Sc. Student
Department of Human Genetics
Montreal Children's Hospital Research Institute, McGill University Health
Centre
4060 Ste Catherine West, PT-239
Montreal, QC H3Z2Z3, Canada
T: (514) 412-4440 | E: nicolas.de...@mail.mcgill.ca



Re: [Bioc-devel] Submitting two strongly related packages (software + data)

2013-11-05 Thread Dan Tenenbaum
Hi Nicolas,

- Original Message -
 From: Nicolas De Jay nicolas.de...@mail.mcgill.ca
 To: bioc-devel@r-project.org
 Sent: Tuesday, November 5, 2013 4:02:20 PM
 Subject: [Bioc-devel] Submitting two strongly related packages (software +
 data)
 
 Hi,

 I have never used a mailing list before so do not hesitate to let me know
 if I'm doing something wrong.

 I have been growing increasingly fond of the Bioconductor project and,
 along with a small team, intend to submit a package.  We have tried to
 follow package guidelines as much as possible and noticed a size limit on
 software packages (4MB).  The example data sets will greatly exceed that
 quota (~150MB) so we figured we'd package the data separately (e.g. minfi
 and minfiData).  In that case, what is the right order for submitting the
 packages?  Of course, the data package will be completely documented first
 so should we

 1. submit the data package, then the software package? (You will have to
 take our word that the software package is coming up.)
 2. submit the software package, then the data package? (The software
 depends on the data though for the vignette!)
 3. submit both at the same time.

Submit both at the same time, to the same issue in our tracker. This makes it 
easier for us to know that the packages go together.


 
 Also, is there a way to test for Bioconductor compliance like the --as-cran
 option of the R CMD check command?

Not really. Part of Bioconductor compliance is a subjective review of the 
package by a human, and that can't be automated.
Reading our guidelines carefully should minimize these problems.
However, once you submit a package it will automatically be built and checked 
on all platforms, which may turn up some issues.

Dan


 
 Thank you for your help.  I'm looking forward to becoming an active
 developer in the Bioconductor community.
 


[Bioc-devel] cleaning up obsolete warnings

2013-11-05 Thread Michael Lawrence
I'm getting a little tired of this one popping up:

In x %in% other :
   Starting with BioC 2.12, the behavior of %in% on GenomicRanges objects
  has changed to use *equality* instead of *overlap* for comparing
  elements between GenomicRanges objects 'x' and 'table'. Now 'x[i]' and
  'table[j]' are considered to match when they are equal (i.e. 'x[i] ==
  table[j]'), instead of when they overlap. This new behavior is consistent
  with base::`%in%`(). If you need the old behavior, please use:

query %over% subject

Now that even the release is Bioc 2.13, can we remove it? And there might be
others.

Thanks,
Michael
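
For readers who have not met the change the warning describes, the difference between the two operators can be sketched as follows, assuming a current GenomicRanges installation. The ranges are made up for illustration.

```r
library(GenomicRanges)

# Two query ranges on chr1: 1-10 and 20-30.
x <- GRanges("chr1", IRanges(start = c(1, 20), end = c(10, 30)))

# Two table ranges on chr1: 5-10 (overlaps the first query but is not equal
# to it) and 20-30 (equal to the second query).
table <- GRanges("chr1", IRanges(start = c(5, 20), end = c(10, 30)))

by_equality <- x %in% table     # equality, consistent with base::`%in%`
by_overlap  <- x %over% table   # the old overlap-based behaviour

by_equality
by_overlap
```

The first query range overlaps an element of `table` without being equal to any of them, so it is the element on which the two operators disagree.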
