Re: [Bioc-devel] version increments for unchanged packages
On 06/11/2015 11:50 AM, Hervé Pagès wrote: Hi Stephanie, On 06/11/2015 11:33 AM, Gabe Becker wrote: Stephanie, As far as I know, it is so that package versions are unique to specific releases of bioconductor. This has the benefits of 1. providing assurances that that particular version of a package is tested and confirmed to work within that release, and 2. enforces that the source code/data for a particular version of a package appears only and exactly once within the Bioconductor SVN structure. Together, these prevent there being ambiguity when a package needs to be updated/fixed in the context of a particular release, both in terms of what version needs to be fixed and where that fix needs to be applied. Exactly. One more thing: because the exact version of R that is used to build binary packages is not necessarily the same between 2 versions of Bioconductor, we would end up with binary packages that are different (and possibly not interchangeable) but impossible to distinguish based on their name if we were not bumping versions (e.g. both would be GWASdata_1.4.0.zip). Could make troubleshooting nightmarish. I forgot that we don't build binaries for data experiment packages anymore. Because these packages don't need compilation, the source package can be installed on any platform without the need of any extra tool (e.g. Rtools or Xcode). That argument still applies for software packages though... H. Hope that makes sense, H. Best, ~G On Thu, Jun 11, 2015 at 10:40 AM, Stephanie M. Gogarten sdmor...@u.washington.edu wrote: Why is it that packages with no changes still get new version numbers at every release? For example, my experiment data package GWASdata has not changed since the last release, but the version was bumped from 1.4.0 to 1.6.0. I imagine most users expect that a change in version number indicates some change in content. Stephanie ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?
I also think that we're heading towards something like that. So genome(gr) - hg19 would: (a) Add any missing information to the seqinfo. (b) Sort the seqlevels in canonical order. (c) Change the seqlevels style to UCSC style if they are not. The 3 tasks are orthogonal. I guess most of the times people want an easy way to perform them all at once. We could easily support (a) and (b). This assumes that the current seqlevels are already valid hg19 seqlevels: si1 - Seqinfo(c(chrX, chrUn_gl000249, chr2, chr6_cox_hap2)) gr1 - GRanges(seqinfo=si1) hg19_si - Seqinfo(genome=hg19) ## (a): seqinfo(gr1) - merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)] seqinfo(gr1) # Seqinfo object with 4 sequences (1 circular) from hg19 genome: # seqnames seqlengths isCircular genome # chrX155270560 FALSE hg19 # chrUn_gl000249 38502 FALSE hg19 # chr2243199373 FALSE hg19 # chr6_cox_hap2 4795371 FALSE hg19 ## (b): seqlevels(gr1) - intersect(seqlevels(hg19_si), seqlevels(gr1)) seqinfo(gr1) # Seqinfo object with 4 sequences (1 circular) from hg19 genome: # seqnames seqlengths isCircular genome # chr2243199373 FALSE hg19 # chrX155270560 FALSE hg19 # chr6_cox_hap2 4795371 FALSE hg19 # chrUn_gl000249 38502 FALSE hg19 (c) is harder because seqlevelsStyle() doesn't know how to rename scaffolds yet: si2 - Seqinfo(c(X, HSCHRUN_RANDOM_CTG42, 2, HSCHR6_MHC_COX_CTG1)) gr2 - GRanges(seqinfo=si2) seqlevelsStyle(gr2) # [1] NCBI seqlevelsStyle(gr2) - UCSC seqlevels(gr2) # [1] chrX HSCHRUN_RANDOM_CTG42 chr2 # [4] HSCHR6_MHC_COX_CTG1 So we need to work on this. I'm not sure about using genome(gr) - hg19 for this. Right now it sets the genome column of the seqinfo with the supplied string and nothing else. Aren't there valid use cases for this? What about using seqinfo(gr) - hg19 instead? It kind of suggests that the whole seqinfo component actually gets filled. H. On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote: that's kind of always been my goal... Statistics is the grammar of science. Karl Pearson http://en.wikipedia.org/wiki/The_Grammar_of_Science On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence lawrence.mich...@gene.com mailto:lawrence.mich...@gene.com wrote: Maybe this could eventually support setting the seqinfo with: genome(gr) - hg19 Or is that being too clever? On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès hpa...@fredhutch.org mailto:hpa...@fredhutch.org wrote: Hi, FWIW I started to work on supporting quick generation of a standalone Seqinfo object via Seqinfo(genome=hg38) in GenomeInfoDb. It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2, bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, mm10, mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, galGal3, gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3, and sacCer2. I'll add more. See ?Seqinfo for some examples. Right now it fetches the information from internet every time you call it but maybe we should just store that information in the GenomeInfoDb package as Tim suggested? H. On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote: That would be perfect actually. And it would radically reduce modularize maintenance. Maybe that's the best way to go after all. Quite sensible. --t On Jun 3, 2015, at 12:46 PM, Vincent Carey st...@channing.harvard.edu mailto:st...@channing.harvard.edu wrote: It really isn't hard to have multiple OrganismDb packages in place -- the process of making new ones is documented and was given as an exercise in the EdX course. I don't know if we want to institutionalize it and distribute such -- I think we might, so that there would be Hs19, Hs38, mm9, etc. packages. They have very little content, they just coordinate interactions with packages that you'll already have. On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr. tim.tri...@gmail.com mailto:tim.tri...@gmail.com wrote: Right, I typically do that too, and if you're working on human data it isn't a big deal. What makes things a lot more of a drag is when you work on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where Mus.musculus is essentially a build ahead of Homo.sapiens. R seqinfo(Homo.sapiens) Seqinfo object with 93 sequences (1 circular) from hg19 genome R seqinfo(Mus.musculus) Seqinfo object with 66 sequences (1 circular) from mm10 genome: It's not as explicit as directly assigning the seqinfo from a genome
Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?
On 06/05/2015 11:43 AM, Michael Lawrence wrote: On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès hpa...@fredhutch.org wrote: I also think that we're heading towards something like that. So genome(gr) - hg19 would: (a) Add any missing information to the seqinfo. (b) Sort the seqlevels in canonical order. (c) Change the seqlevels style to UCSC style if they are not. The 3 tasks are orthogonal. I guess most of the times people want an easy way to perform them all at once. We could easily support (a) and (b). This assumes that the current seqlevels are already valid hg19 seqlevels: si1 - Seqinfo(c(chrX, chrUn_gl000249, chr2, chr6_cox_hap2)) gr1 - GRanges(seqinfo=si1) hg19_si - Seqinfo(genome=hg19) ## (a): seqinfo(gr1) - merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)] seqinfo(gr1) # Seqinfo object with 4 sequences (1 circular) from hg19 genome: # seqnames seqlengths isCircular genome # chrX155270560 FALSE hg19 # chrUn_gl000249 38502 FALSE hg19 # chr2243199373 FALSE hg19 # chr6_cox_hap2 4795371 FALSE hg19 ## (b): seqlevels(gr1) - intersect(seqlevels(hg19_si), seqlevels(gr1)) seqinfo(gr1) # Seqinfo object with 4 sequences (1 circular) from hg19 genome: # seqnames seqlengths isCircular genome # chr2243199373 FALSE hg19 # chrX155270560 FALSE hg19 # chr6_cox_hap2 4795371 FALSE hg19 # chrUn_gl000249 38502 FALSE hg19 (c) is harder because seqlevelsStyle() doesn't know how to rename scaffolds yet: si2 - Seqinfo(c(X, HSCHRUN_RANDOM_CTG42, 2, HSCHR6_MHC_COX_CTG1)) gr2 - GRanges(seqinfo=si2) seqlevelsStyle(gr2) # [1] NCBI seqlevelsStyle(gr2) - UCSC seqlevels(gr2) # [1] chrX HSCHRUN_RANDOM_CTG42 chr2 # [4] HSCHR6_MHC_COX_CTG1 So we need to work on this. I'm not sure about using genome(gr) - hg19 for this. Right now it sets the genome column of the seqinfo with the supplied string and nothing else. Aren't there valid use cases for this? Not sure. People would almost always want the seqname style and order to be consistent with the given genome. Agreed but my worry is that when they don't, then they would be left with no way to just set the genome column. H. What about using seqinfo(gr) - hg19 instead? It kind of suggests that the whole seqinfo component actually gets filled. Yea, but genome is so intuitive compared to seqinfo. H. On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote: that's kind of always been my goal... Statistics is the grammar of science. Karl Pearson http://en.wikipedia.org/wiki/The_Grammar_of_Science On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence lawrence.mich...@gene.com mailto:lawrence.mich...@gene.com wrote: Maybe this could eventually support setting the seqinfo with: genome(gr) - hg19 Or is that being too clever? On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès hpa...@fredhutch.org mailto:hpa...@fredhutch.org wrote: Hi, FWIW I started to work on supporting quick generation of a standalone Seqinfo object via Seqinfo(genome=hg38) in GenomeInfoDb. It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2, bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, mm10, mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, galGal3, gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3, and sacCer2. I'll add more. See ?Seqinfo for some examples. Right now it fetches the information from internet every time you call it but maybe we should just store that information in the GenomeInfoDb package as Tim suggested? H. On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote: That would be perfect actually. And it would radically reduce modularize maintenance. Maybe that's the best way to go after all. Quite sensible. --t On Jun 3, 2015, at 12:46 PM, Vincent Carey st...@channing.harvard.edu mailto:st...@channing.harvard.edu wrote: It really isn't hard to have multiple OrganismDb packages in place -- the process of making new ones is documented and was given as an exercise in the EdX course. I don't know if we want to institutionalize it and distribute such -- I think we might, so that there would be Hs19, Hs38, mm9, etc. packages. They have very little content, they just coordinate interactions with packages that you'll already have. On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr. tim.tri...@gmail.com mailto:tim.tri...@gmail.com wrote: Right, I typically do that too, and if you're working on human data it isn't a big deal. What makes
Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?
On 06/05/2015 01:19 PM, Michael Lawrence wrote: To support the multi-genome case, one could set the genome as a vector, one value for each seqname, and it would fix the style/seqlength per seqname. It could sort by the combination of seqname and species. Presumably it would do nothing for unknown genomes. But I agree that a standardizeSeqinfo() that amounts to genome(x) - genome(x) would make sense. I don't think people sort too often by seqnames (except to the natural ordering), That's what sort(), order(), rank() do by default: they sort by seqnames first, then by start, then by end, and finally by strand. but I could be wrong. I do sympathize though with the need for a low-level accessor. At least one would want a parameter for disabling the standardization. Ok. So the candidates are: (a) standardizeSeqinfo(gr) - hg19 (b) gr - standardizeSeqinfo(gr, hg19) (c) standardizeGenome(gr) - hg19 (d) gr - standardizeGenome(gr, hg19) (e) seqinfo(gr) - hg19 Is there a risk of confusion with keepStandardChromosomes where standard means a very different thing? I'll add 2 more: (f) normalizeSeqinfo(gr) - hg19 (g) gr - normalizeSeqinfo(gr, hg19) Anyway, we're not here yet. As pointed in an earlier post, there are still some missing pieces to complete the puzzle. Thanks, H. On Fri, Jun 5, 2015 at 12:54 PM, Kasper Daniel Hansen kasperdanielhan...@gmail.com wrote: In WGBS we frequently sequence a human with spikein from the lambda genome. In this case, most of the chromosomes of the Granges are from human, except one. This is a usecase where genome(GR) is not constant. I suggest, partly for compatibility, to keep genome, but perhaps do something like standardizeGenome() or something like this. I would indeed love, love, love a function which just cleans it up. Kasper On Fri, Jun 5, 2015 at 2:51 PM, Gabe Becker becker.g...@gene.com wrote: Herve, This is probably a naive question, but what usecases are there for creating an object with the wrong seqinfo for its genome? ~G On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence lawrence.mich...@gene.com wrote: On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès hpa...@fredhutch.org wrote: I also think that we're heading towards something like that. So genome(gr) - hg19 would: (a) Add any missing information to the seqinfo. (b) Sort the seqlevels in canonical order. (c) Change the seqlevels style to UCSC style if they are not. The 3 tasks are orthogonal. I guess most of the times people want an easy way to perform them all at once. We could easily support (a) and (b). This assumes that the current seqlevels are already valid hg19 seqlevels: si1 - Seqinfo(c(chrX, chrUn_gl000249, chr2, chr6_cox_hap2)) gr1 - GRanges(seqinfo=si1) hg19_si - Seqinfo(genome=hg19) ## (a): seqinfo(gr1) - merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)] seqinfo(gr1) # Seqinfo object with 4 sequences (1 circular) from hg19 genome: # seqnames seqlengths isCircular genome # chrX155270560 FALSE hg19 # chrUn_gl000249 38502 FALSE hg19 # chr2243199373 FALSE hg19 # chr6_cox_hap2 4795371 FALSE hg19 ## (b): seqlevels(gr1) - intersect(seqlevels(hg19_si), seqlevels(gr1)) seqinfo(gr1) # Seqinfo object with 4 sequences (1 circular) from hg19 genome: # seqnames seqlengths isCircular genome # chr2243199373 FALSE hg19 # chrX155270560 FALSE hg19 # chr6_cox_hap2 4795371 FALSE hg19 # chrUn_gl000249 38502 FALSE hg19 (c) is harder because seqlevelsStyle() doesn't know how to rename scaffolds yet: si2 - Seqinfo(c(X, HSCHRUN_RANDOM_CTG42, 2, HSCHR6_MHC_COX_CTG1)) gr2 - GRanges(seqinfo=si2) seqlevelsStyle(gr2) # [1] NCBI seqlevelsStyle(gr2) - UCSC seqlevels(gr2) # [1] chrX HSCHRUN_RANDOM_CTG42 chr2 # [4] HSCHR6_MHC_COX_CTG1 So we need to work on this. I'm not sure about using genome(gr) - hg19 for this. Right now it sets the genome column of the seqinfo with the supplied string and nothing else. Aren't there valid use cases for this? Not sure. People would almost always want the seqname style and order to be consistent with the given genome. What about using seqinfo(gr) - hg19 instead? It kind of suggests that the whole seqinfo component actually gets filled. Yea, but genome is so intuitive compared to seqinfo. H. On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote: that's kind of always been my goal... Statistics is the grammar of science. Karl Pearson http://en.wikipedia.org/wiki/The_Grammar_of_Science On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence lawrence.mich...@gene.com mailto:lawrence.mich...@gene.com wrote: Maybe this could eventually support setting the seqinfo with: genome(gr) - hg19 Or is that being too clever? On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès hpa
Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?
On 06/05/2015 01:39 PM, Tim Triche, Jr. wrote: how about just gr - addSeqinfo(gr, hg19) mmh, we don't really add a seqinfo. We transform the existing one. This transformation can be called standardization, or normalization, or... but I wouldn't call it addition. H. Statistics is the grammar of science. Karl Pearson http://en.wikipedia.org/wiki/The_Grammar_of_Science On Fri, Jun 5, 2015 at 1:36 PM, Hervé Pagès hpa...@fredhutch.org mailto:hpa...@fredhutch.org wrote: On 06/05/2015 01:19 PM, Michael Lawrence wrote: To support the multi-genome case, one could set the genome as a vector, one value for each seqname, and it would fix the style/seqlength per seqname. It could sort by the combination of seqname and species. Presumably it would do nothing for unknown genomes. But I agree that a standardizeSeqinfo() that amounts to genome(x) - genome(x) would make sense. I don't think people sort too often by seqnames (except to the natural ordering), That's what sort(), order(), rank() do by default: they sort by seqnames first, then by start, then by end, and finally by strand. but I could be wrong. I do sympathize though with the need for a low-level accessor. At least one would want a parameter for disabling the standardization. Ok. So the candidates are: (a) standardizeSeqinfo(gr) - hg19 (b) gr - standardizeSeqinfo(gr, hg19) (c) standardizeGenome(gr) - hg19 (d) gr - standardizeGenome(gr, hg19) (e) seqinfo(gr) - hg19 Is there a risk of confusion with keepStandardChromosomes where standard means a very different thing? I'll add 2 more: (f) normalizeSeqinfo(gr) - hg19 (g) gr - normalizeSeqinfo(gr, hg19) Anyway, we're not here yet. As pointed in an earlier post, there are still some missing pieces to complete the puzzle. Thanks, H. On Fri, Jun 5, 2015 at 12:54 PM, Kasper Daniel Hansen kasperdanielhan...@gmail.com mailto:kasperdanielhan...@gmail.com wrote: In WGBS we frequently sequence a human with spikein from the lambda genome. In this case, most of the chromosomes of the Granges are from human, except one. This is a usecase where genome(GR) is not constant. I suggest, partly for compatibility, to keep genome, but perhaps do something like standardizeGenome() or something like this. I would indeed love, love, love a function which just cleans it up. Kasper On Fri, Jun 5, 2015 at 2:51 PM, Gabe Becker becker.g...@gene.com mailto:becker.g...@gene.com wrote: Herve, This is probably a naive question, but what usecases are there for creating an object with the wrong seqinfo for its genome? ~G On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence lawrence.mich...@gene.com mailto:lawrence.mich...@gene.com wrote: On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès hpa...@fredhutch.org mailto:hpa...@fredhutch.org wrote: I also think that we're heading towards something like that. So genome(gr) - hg19 would: (a) Add any missing information to the seqinfo. (b) Sort the seqlevels in canonical order. (c) Change the seqlevels style to UCSC style if they are not. The 3 tasks are orthogonal. I guess most of the times people want an easy way to perform them all at once. We could easily support (a) and (b). This assumes that the current seqlevels are already valid hg19 seqlevels: si1 - Seqinfo(c(chrX, chrUn_gl000249, chr2, chr6_cox_hap2)) gr1 - GRanges(seqinfo=si1) hg19_si - Seqinfo(genome=hg19) ## (a): seqinfo(gr1) - merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)] seqinfo(gr1) # Seqinfo object with 4 sequences (1 circular) from hg19 genome: # seqnames seqlengths isCircular genome # chrX155270560 FALSE hg19 # chrUn_gl000249 38502 FALSE hg19 # chr2
Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?
On 06/05/2015 01:48 PM, Gabe Becker wrote: I dunno, standardizeSeqInfo just seems really long for a function name users are going to have to call. At the risk of annoying Herve further, what about gr - castSeqInfo(gr, gh19) gr! ? ~G On Fri, Jun 5, 2015 at 1:46 PM, Tim Triche, Jr. tim.tri...@gmail.com mailto:tim.tri...@gmail.com wrote: maybe standardizeSeqinfo or fixSeqinfo is clearer after all Statistics is the grammar of science. Karl Pearson http://en.wikipedia.org/wiki/The_Grammar_of_Science On Fri, Jun 5, 2015 at 1:41 PM, Gabe Becker becker.g...@gene.com mailto:becker.g...@gene.com wrote: On Fri, Jun 5, 2015 at 1:39 PM, Tim Triche, Jr. tim.tri...@gmail.com mailto:tim.tri...@gmail.com wrote: how about just gr - addSeqinfo(gr, hg19) Add sounds like it's, well, adding rather than replacing (Which it sometimes would do. gr - fixSeqInfo(gr, hg19) instead? ~G -- Gabriel Becker, Ph.D Computational Biologist Genentech Research -- Gabriel Becker, Ph.D Computational Biologist Genentech Research -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?
] BSgenome.Dmelanogaster.UCSC.dm6 BSgenome.Drerio.UCSC.danRer5 [15] BSgenome.Drerio.UCSC.danRer6 BSgenome.Drerio.UCSC.danRer7 [17] BSgenome.Ecoli.NCBI.20080805 BSgenome.Gaculeatus.UCSC.gasAcu1 [19] BSgenome.Ggallus.UCSC.galGal3 BSgenome.Ggallus.UCSC.galGal4 [21] BSgenome.Hsapiens.NCBI.GRCh38 BSgenome.Hsapiens.UCSC.hg17 [23] BSgenome.Hsapiens.UCSC.hg18BSgenome.Hsapiens.UCSC.hg19 [25] BSgenome.Hsapiens.UCSC.hg38 BSgenome.Mfascicularis.NCBI.5.0 [27] BSgenome.Mfuro.UCSC.musFur1 BSgenome.Mmulatta.UCSC.rheMac2 [29] BSgenome.Mmulatta.UCSC.rheMac3 BSgenome.Mmusculus.UCSC.mm10 [31] BSgenome.Mmusculus.UCSC.mm8BSgenome.Mmusculus.UCSC.mm9 [33] BSgenome.Ptroglodytes.UCSC.panTro2 BSgenome.Ptroglodytes.UCSC.panTro3 [35] BSgenome.Rnorvegicus.UCSC.rn4 BSgenome.Rnorvegicus.UCSC.rn5 [37] BSgenome.Rnorvegicus.UCSC.rn6 BSgenome.Scerevisiae.UCSC.sacCer1 [39] BSgenome.Scerevisiae.UCSC.sacCer2 BSgenome.Scerevisiae.UCSC.sacCer3 [41] BSgenome.Sscrofa.UCSC.susScr3 BSgenome.Tguttata.UCSC.taeGut1 Am I insane for suggesting this? It would make things a little easier for rtracklayer, most SummarizedExperiment and SE-derived objects, blah, blah, blah... Best, --t Statistics is the grammar of science. Karl Pearson http://en.wikipedia.org/wiki/The_Grammar_of_Science [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Rd] \alias{} -- rather \concept{} for conceptual links to help pages
Hi Martin, On 05/18/2015 05:14 AM, Martin Maechler wrote: From R-help, subject Variable number of loops I've opened a new thread, moving from R-help to R-devel .. Jim Lemon drjimle...@gmail.com on Sun, 17 May 2015 09:19:06 +1000 writes: Hi all, Given the number of help requests that involve permutations/combinations, and the less than obvious naming of the expand.grid function, perhaps adding an alias such as permute.elements or combine.elements might ease the tasks of both searchers and those offering help. Neither of the above names appear to be used at present. Jim Using \alias{} is not a very good thing here, since as you know they are *key*s that must remain unique if possible and they can be linked to -- which I think would not be helpful for 'expand.grid'. It seems to me that Jim was maybe suggesting to define an alias for the expand.grid function i.e. something like: permute.elements - expand.grid or combine.elements - expand.grid as a way to address the less than obvious naming of the expand.grid function. But maybe I misunderstood... Cheers, H. Rather, for quite a few years now, we have had \concept{} for adding search keywords, i.e., things that help.search() and hence ??topic will find. The other advantage of \concept{} is that you can use short phrases, i.e., \concept{all variable combinations} would be possible here. (Better wording proposals for this specific case are welcome! -- maybe privately). Martin Maechler, ETH Zurich On Sun, May 17, 2015 at 5:54 AM, Bert Gunter gunter.ber...@gene.com wrote: 1. Please always reply to the list unless there is a compelling reason to keep the discussion private. You will have a better chance of getting something useful that way. 2. I don't know what you mean by I don't have a fixed number of variables. You have to specify at least the number of variables and how many levels each has in order to work out what you requested, which is **NOT** the number of permutations but the number of combinations AFAICS, which is exactly what expand.grid will give you. 3. Maybe what you're looking for is the ... arguments in function calls, which would be used along the lines of: myfun - function( x,y,...) { ## some code combs - expand.grid(...) ## some more code } Any good R tutorial will tell you about this if this is unfamiliar. 4. Another possibility might be to deliver a list of named variables as an argument and then use do.call, e.g. myfun - (x,y, alist) { ## some code combs - do.call(expand.grid, alist) ## some more code } ?do.call and/or a tutorial for details. 5. Otherwise, maybe someone else can figure out what you're looking for. Cheers, Bert Bert Gunter Genentech Nonclinical Biostatistics (650) 467-7374 Data is not information. Information is not knowledge. And knowledge is certainly not wisdom. Clifford Stoll On Sat, May 16, 2015 at 11:16 AM, WRAY NICHOLAS nicholas.w...@ntlworld.com wrote: I might be but doesn't expand.grid need a defined and listed number of inputs? The problem I'm having is that the number of variables is not fixed, so I'm not sure whether I can reference the variable number of variables by using a vector -- haven't had time to try yet But thanks anyway Nick Wray On 16 May 2015 at 14:28, Bert Gunter gunter.ber...@gene.com wrote: Are you trying to reinvent ?expand.grid ? -- Bert Bert Gunter Genentech Nonclinical Biostatistics (650) 467-7374 Data is not information. Information is not knowledge. And knowledge is certainly not wisdom. Clifford Stoll [...] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Unexpected failure when calling new() with unnamed arg and
Thanks Martin for looking into this. H. On 05/13/2015 03:57 AM, Martin Maechler wrote: Hervé Pagès hpa...@fredhutch.org on Tue, 12 May 2015 15:18:42 -0700 writes: Hi, The man page for new() suggests that if 'a' is an object with slots slot1 and slot2 and C is a class that extends the class of 'a', then the 2 following calls should be equivalent: new(C, a, ...) new(C, slot1=a@slot1, slot2=a@slot2, ...) This is generally the case but I just ran into a situation where it's not. In the following example the former fails while the latter works: setClass(A, representation(slot1=numeric, slot2=logical)) setClass(B, contains=A, representation(design=formula)) setClass(C, contains=B) a - new(A, slot1=77, slot2=TRUE) new(C, a, design=x ~ y) # fails new(C, slot1=a@slot1, slot2=a@slot2, design=x ~ y) # works Note that new(B, a, design=x ~ y) works so the 3-level class hierarchy is really needed in order to reproduce. Probably related to this, I also noted that new(B) and/or new(C) return invalid objects: c - new(C) validObject(c) # Error in validObject(c) : # invalid class “C” object: invalid object for slot design # in class C: got class S4, should be or extend class formula is(c@design, formula) # [1] FALSE class(c@design) # [1] S4 Note that 'c' can be fixed: c@design - formula(NULL) validObject(c) # [1] TRUE Maybe something that the default initialize method should take care of? Another singularity that is maybe at the root of all of this is that the formula S4 class is virtual: showClass(formula) # Virtual Class formula [package methods] # # Slots: # # Name: .S3Class # Class: character # # Extends: oldClass so a bare call to new(formula) fails: new(formula) # Error in new(formula) : # trying to generate an object from a virtual class (formula) Shouldn't new(formula) just return an empty S3 formula (like formula(NULL) does), in the same way that new(integer) returns an empty ordinary integer vector? Interesting .. and at least worth following. One problem and historical reason for the current setup seems that the formula S3 class is not from 'base' but 'stats' : R's source, src/library/methods/R/BasicClasses.R, lines 524 ff has the following comment block | .OldClassesPrototypes is a list of S3 classes for which prototype | objects are known reasonable. The classes will reappear in | .OldClassesList, but will have been initialized first in | .InitBasicClasses. NB: the methods package will NOT set up | prototypes for S3 classes except those in package base and for ts | (and would rather not do those either). The package that owns the | S3 class should have code to call setOldClass in its | initialization. So, when John Chambers wrote this, he envisioned that the 'stats' package would do the correct thing for its own classes. OTOH, as history went, the stats package was never allowed to depend on methods. There are many other S3 classes from 'stats' which also end up similarly, being defined via setOldClass() and that itself produces a VIRTUAL class. Also, another part of the (R source) comment above is no longer quite accurate, e.g., data.frame is in .OldClassesPrototypes but not in .OldClassesList ... As I do agree that formula is much more basic than these other classes, I'm currently looking at tweaks to the methods (and stats) code, to get this to work as indeed - you mentioned above - we already allow empty S3 formula objects anyway. ... half an hour later : Indeed, I've been able to use the above information such that new(formula) and new(formula, y ~ x) work. However, your code above now --- with my changes --- would fail : setClass(A, representation(slot1=numeric, slot2=logical)) setClass(B, contains=A, representation(design=formula)) setClass(C, contains=B) Error in reconcilePropertiesAndPrototype(name, slots, prototype, superClasses, : B is not eligible to be the data part of another class (must be a basic class or a virtual class with no slots) So, I am not yet committing my changes to R-devel. Hopefully more on this, later today. Martin Maechler, ETH Zurich Thanks, H. sessionInfo() R version 3.2.0 Patched (2015-04-17 r68202) Platform: x86_64-unknown-linux-gnu (64-bit) Running under: Ubuntu 14.04.2 LTS -- Hervé Pagès Fred Hutchinson Cancer Research Center [..] -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Unexpected failure when calling new() with unnamed arg and
Hi, The man page for new() suggests that if 'a' is an object with slots slot1 and slot2 and C is a class that extends the class of 'a', then the 2 following calls should be equivalent: new(C, a, ...) new(C, slot1=a@slot1, slot2=a@slot2, ...) This is generally the case but I just ran into a situation where it's not. In the following example the former fails while the latter works: setClass(A, representation(slot1=numeric, slot2=logical)) setClass(B, contains=A, representation(design=formula)) setClass(C, contains=B) a - new(A, slot1=77, slot2=TRUE) new(C, a, design=x ~ y) # fails new(C, slot1=a@slot1, slot2=a@slot2, design=x ~ y) # works Note that new(B, a, design=x ~ y) works so the 3-level class hierarchy is really needed in order to reproduce. Probably related to this, I also noted that new(B) and/or new(C) return invalid objects: c - new(C) validObject(c) # Error in validObject(c) : # invalid class “C” object: invalid object for slot design # in class C: got class S4, should be or extend class formula is(c@design, formula) # [1] FALSE class(c@design) # [1] S4 Note that 'c' can be fixed: c@design - formula(NULL) validObject(c) # [1] TRUE Maybe something that the default initialize method should take care of? Another singularity that is maybe at the root of all of this is that the formula S4 class is virtual: showClass(formula) # Virtual Class formula [package methods] # # Slots: # # Name: .S3Class # Class: character # # Extends: oldClass so a bare call to new(formula) fails: new(formula) # Error in new(formula) : # trying to generate an object from a virtual class (formula) Shouldn't new(formula) just return an empty S3 formula (like formula(NULL) does), in the same way that new(integer) returns an empty ordinary integer vector? Thanks, H. sessionInfo() R version 3.2.0 Patched (2015-04-17 r68202) Platform: x86_64-unknown-linux-gnu (64-bit) Running under: Ubuntu 14.04.2 LTS locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Bioc-devel] exptData(SummarizedExperiment)
SummarizedExperiment was just an example. I agree it can be a little challenging for end users to know where to find a particular functionality but I'm not sure about using meta packages to address that. At least I feel we should probably avoid creating new meta packages out of the blue, with arbitrary limits and possibly endless discussions about what exactly goes in them. Also I don't think there is a single core but rather several domain-specific cores. What about using the existing workflow packages instead? A workflow package (like the variants package here http://bioconductor.org/help/workflows/variants/) covers a specific domain and loading it should load the core for that domain. Plus the user gets a great vignette as a bonus to get started so it's not just an empty shell. There are probably some shortcomings with workflow packages that would need to be addressed before they can serve as convenient meta packages though e.g. they're treated too differently from other BioC packages (e.g. they're not available via biocLite() and don't show up under the biocViews tree here http://bioconductor.org/packages/release/BiocViews.html). Nothing that seems impossible to address though... H. On 05/12/2015 03:22 PM, Michael Lawrence wrote: It's more general than SummarizedExperiment. I think people would appreciate a simple way to load the core, without having to remember, for example, that VCF reading is in VariantAnnotation. On Mon, May 11, 2015 at 9:51 PM, Hervé Pagès hpa...@fredhutch.org mailto:hpa...@fredhutch.org wrote: Hi Michael, On 05/11/2015 05:35 PM, Michael Lawrence wrote: Splitting stuff into different packages is good for modularity, but tough on the mind of the user. What about having some sort of meta package that simply loads the core infrastructure packages? Named something simple like Genomics or GenomicsCore. Don't know if we need this. For example, for all the SummarizedExperiment use cases I ran into, the end-user generally only needs to load the corresponding high-level package (DESeq2, VariantAnnotation, minfi, GenomicAlignments, etc...) and that takes care of loading all the low-level infrastructure packages. H. On Mon, May 11, 2015 at 5:10 PM, Hervé Pagès hpa...@fredhutch.org mailto:hpa...@fredhutch.org mailto:hpa...@fredhutch.org mailto:hpa...@fredhutch.org wrote: Hi Tim, The SummarizedExperiment class is being replaced with the RangedSummarizedExperiment class from the new SummarizedExperiment package. This is a work-in-progress and the name and internal representation of the RangedSummarizedExperiment class are not finalized yet. The main goal for now is to move all the SummarizedExperiment stuff from GenomicRanges to its own package. Anyway, metadata() is the replacement for exptData() on RangedSummarizedExperiment objects. It's on my list to add an exptData method for backward compatibility. Cheers, H. On 05/11/2015 04:37 PM, Tim Triche, Jr. wrote: who determined that breaking this would be a good idea?!? R ?SummarizedExperiment Help on topic 'SummarizedExperiment' was found in the following packages: Package Library GenomicRanges /home/tim/R/x86_64-pc-linux-gnu-library/3.2 SummarizedExperiment /home/tim/R/x86_64-pc-linux-gnu-library/3.2 R nrows - 200; ncols - 6 Rcounts - matrix(runif(nrows * ncols, 1, 1e4), nrows) RrowRanges - GRanges(rep(c(chr1, chr2), c(50, 150)), + IRanges(floor(runif(200, 1e5, 1e6)), width=100), + strand=sample(c(+, -), 200, TRUE)) RcolData - DataFrame(Treatment=rep(c(ChIP, Input), 3), + row.names=LETTERS[1:6]) Rsset - SummarizedExperiment(assays=SimpleList(counts=counts), + rowRanges=rowRanges, colData=colData) Rsset class: RangedSummarizedExperiment dim: 200 6 metadata(0): assays(1): counts rownames: NULL rowRanges metadata column names(0): colnames(6): A B ... E F colData names(1): Treatment RassayNames(sset) [1] counts Rassays(sset) - endoapply(assays(sset), asinh) Rhead
Re: [Bioc-devel] 'memory not mapped' in trimLRpatterns
Hi Michael, I finally got to fix this in Biostrings 2.36.1 (release) and 2.37.2 (devel). Both should become available via biocLite() on Friday around 11am (Seattle time). Thanks for your patience. Cheers, H. On 04/30/2015 10:50 AM, Hervé Pagès wrote: Hi Michael, Thanks for the reminder. I must confess this slipped out of my radar. I'll look into it today and will let you know. H. On 04/30/2015 08:42 AM, Michael Stadler wrote: Hi Herve, I stumbled again over the 'memory not mapped' issue in trimLRpatterns using updated versions of R and BioC-devel. I guess it does not hit people very often, but I would highly appreciate if it could be fixed. Many thanks, Michael PS: I can reproduce the issue using the code below under: R version 3.2.0 (2015-04-16) Platform: x86_64-unknown-linux-gnu (64-bit) Running under: Red Hat Enterprise Linux Server release 6.6 (Santiago) locale: [1] C attached base packages: [1] stats4parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] ShortRead_1.27.1GenomicAlignments_1.5.1 Rsamtools_1.21.3 [4] GenomicRanges_1.21.5GenomeInfoDb_1.5.1 BiocParallel_1.3.4 [7] Biostrings_2.37.0 XVector_0.9.0 IRanges_2.3.6 [10] S4Vectors_0.7.0 BiocGenerics_0.15.0 RColorBrewer_1.1-2 loaded via a namespace (and not attached): [1] lattice_0.20-31 bitops_1.0-6 grid_3.2.0 [4] futile.options_1.0.0 zlibbioc_1.15.0 hwriter_1.3.2 [7] latticeExtra_0.6-26 futile.logger_1.4.1 lambda.r_1.1.7 [10] Biobase_2.29.0 On 25.04.2014 13:11, Hervé Pagès wrote: Hi Michael, Thanks for the report. I'll look into this. H. On 04/22/2014 08:29 AM, Michael Stadler wrote: Dear Herve, We are hitting a 'memory not mapped' problem when using trimLRpatterns as detailed below. I did not manage to reproduce it with few sequences, so I have to refer to a publicly available sequence file with many reads. Even then, it occasionally runs through without problems. Also, our use-case may not be typical and be part of the problem - maybe the solution will be to change our use of trimLRpatterns. Here is some code to illustrate/reproduce the problem: library(Biostrings) library(ShortRead) Rpat - TGGAATTCTCGGGTGCCAAGG maxRmm - rep(0:2, c(6,3,nchar(Rpat)-9)) fq1 - DNAStringSet(c(TGGAATTCTCGGGTGCCAAGG, TGGAATTCTCGGGTGCCAAGG)) # the second read is not trimmed because it runs through the adaptor trimLRPatterns(subject=fq1, Rpattern=Rpat, max.Rmismatch=maxRmm) # A DNAStringSet instance of length 2 #width seq #[1] 4 #[2]28 TGGAATTCTCGGGTGCCAAGGTTT # as a workaround, we pad the adaptor with Ns and # increase the mismatch tolerance numNs - 90 maxRmm - c(maxRmm, 1:numNs+max(maxRmm)) Rpat - paste(c(Rpat, rep(N, numNs)), collapse=) # now, also the second read gets trimmed trimLRPatterns(subject=fq1, Rpattern=Rpat, max.Rmismatch=maxRmm) # A DNAStringSet instance of length 2 #width seq #[1] 4 #[2] 4 # to trigger the segmentation fault, many reads are needed download.file(ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000916/ERR000916_1.fastq.gz;, ERR000916_1.fastq.gz) fq2 - readFastq(ERR000916_1.fastq.gz) fq3 - trimLRPatterns(subject=fq2, Rpattern=Rpat, max.Rmismatch=maxRmm) # *** caught segfault *** #address 0x7f5109be4fed, cause 'memory not mapped' # #Traceback: # 1: .Call(.NAME, ..., PACKAGE = PACKAGE) # 2: .Call2(XStringSet_vmatch_pattern_at, pattern, subject, at, at.type, max.mismatch, min.mismatch, with.indels, fixed, ans.type, auto.reduce.pattern, PACKAGE = Biostrings) # 3: .matchPatternAt(pattern, subject, ending.at, 1L, max.mismatch, min.mismatch, with.indels, fixed, .to.ans.type(follow.index), auto.reduce.pattern) # 4: which.isMatchingEndingAt(pattern = Rpattern, subject = subject, ending.at = subject_width, max.mismatch = max.Rmismatch, with.indels = with.Rindels, fixed = Rfixed, auto.reduce.pattern = TRUE) # 5: which.isMatchingEndingAt(pattern = Rpattern, subject = subject, ending.at = subject_width, max.mismatch = max.Rmismatch, with.indels = with.Rindels, fixed = Rfixed, auto.reduce.pattern = TRUE) # 6: .computeTrimEnd(Rpattern, subject, max.Rmismatch, with.Rindels, Rfixed) # 7: .XStringSet.trimLRPatterns(Lpattern, Rpattern, subject, max.Lmismatch, max.Rmismatch, with.Lindels, with.Rindels, Lfixed, Rfixed, ranges) # 8: trimLRPatterns(Lpattern, Rpattern, sread(subject), max.Lmismatch, max.Rmismatch, with.Lindels, with.Rindels, Lfixed, Rfixed, ranges = TRUE) # 9: trimLRPatterns(Lpattern, Rpattern, sread(subject), max.Lmismatch, max.Rmismatch, with.Lindels, with.Rindels, Lfixed, Rfixed, ranges = TRUE) #10: eval(expr, envir, enclos) #11: eval(call, sys.frame(sys.parent())) #12: callGeneric(Lpattern, Rpattern, sread(subject), max.Lmismatch, max.Rmismatch, with.Lindels, with.Rindels, Lfixed, Rfixed, ranges = TRUE) #13: trimLRPatterns(subject = fq2, Rpattern = Rpat
Re: [Rd] Shouldn't vector indexing with negative out-of-range index give an error?
much on purpose. Making it a *warning* instead of the original error may have been both more cautious and more helpful for detecting programming errors. OTOH, John Chambers, the father of S and hence grandfather of R, may have had good reasons why it seemed more logical to silently ignore such out of bound negative indices: One could argue that x[-5] means leave away the 5-th element of x and if there is no 5-th element of x, leaving it away should be a no-op. After all this musing and history detection, my gut decision would be to only change the documentation which Ross forgot to change. But of course, it may be interesting to hear other programmeR's feedback on this. Martin __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Bioc-devel] as.character method for GenomicRanges?
On 04/27/2015 02:15 PM, Michael Lawrence wrote: It would be nice to have a single function call that would hide these details. It could probably be made more efficient also by avoiding multiple matching, unnecessary revmap lists, etc. tableAsGRanges() is not a good name but it conveys what I mean (does that make it actually good?). There is nothing specific to GRanges here. We're just reporting the frequency of unique elements in a metadata column so this belongs to the extended Vector API in the same way that findMatches/countMatches do. H. On Mon, Apr 27, 2015 at 12:23 PM, Hervé Pagès hpa...@fredhutch.org mailto:hpa...@fredhutch.org wrote: On 04/24/2015 11:41 AM, Michael Lawrence wrote: Taking this a bit off topic but it would be nice if we could get the GRanges equivalent of as.data.frame(table(x)), i.e., unique(x) with a count mcol. Should be easy to support but what should the API be like? This was actually the motivating use case for introducing findMatches/countMatches a couple of years ago: ux - unique(x) mcols(ux)$Freq - countMatches(ux, x) Don't know what a good API would be to make this even more straightforward though. Maybe via some extra argument to unique() e.g. 'with.freq'? This is kind of similar to the 'with.revmap' argument of reduce(). Note that unique() could also support the 'with.revmap' arg. Once it does, the 'with.freq' arg can also be implemented by just calling elementLengths() on the revmap metadata column. H. On Fri, Apr 24, 2015 at 10:54 AM, Hervé Pagès hpa...@fredhutch.org mailto:hpa...@fredhutch.org mailto:hpa...@fredhutch.org mailto:hpa...@fredhutch.org wrote: On 04/24/2015 10:18 AM, Michael Lawrence wrote: It is a great idea, but I'm not sure I would use it to implement table(). Allocating those strings will be costly. Don't we already have the 4-way int hash? Of course, my intuition might be completely off here. It does use the 4-way int hash internally. as.character() is only used at the very-end to stick the names on the returned table object. H. On Fri, Apr 24, 2015 at 9:59 AM, Hervé Pagès hpa...@fredhutch.org mailto:hpa...@fredhutch.org mailto:hpa...@fredhutch.org mailto:hpa...@fredhutch.org mailto:hpa...@fredhutch.org mailto:hpa...@fredhutch.org mailto:hpa...@fredhutch.org mailto:hpa...@fredhutch.org wrote: Hi Pete, Excellent idea. That will make things like table() work out-of-the-box on GenomicRanges objects. I'll add that. Thanks, H. On 04/24/2015 09:43 AM, Peter Haverty wrote: Would people be interested in having this: setMethod(as.character, GenomicRanges, function(x) { paste0(seqnames(x), :, start(x), -, end(x)) }) ? I find myself doing that a lot to make unique names or for output that goes to collaborators. I suppose we might want to tack on the strand if it isn't *. I have some code for going the other direction too, if there is interest. Pete Peter M. Haverty, Ph.D. Genentech, Inc. phave...@gene.com mailto:phave...@gene.com mailto:phave...@gene.com mailto:phave...@gene.com mailto:phave...@gene.com mailto:phave...@gene.com mailto:phave...@gene.com mailto:phave...@gene.com [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences
Re: [Bioc-devel] Biobase: Imports repeats Depends
Hi Henrik, I know it's not necessary and I know that the WRE manual and 'R CMD check' don't like this but I like to list in Imports what I import and to list in Depends what I want to see attached to the search() path. For the same reason that I like to explicitly export the generic functions that I define in my packages even if exporting the methods defined on these generics is enough (because it has the side effect to automatically export the generic). More generally it's about expressing intentions directly versus expressing them in an indirect manner (e.g. by relying on side effects). I think the latter is wrong. Hope that makes sense. Cheers, H. On 04/25/2015 01:12 PM, Henrik Bengtsson wrote: It seems unnecessary that BiocGenerics have the same package under Imports as under Depends. The former can be dropped. packageDescription(BiocGenerics) Package: BiocGenerics Title: S4 generic functions for Bioconductor Description: S4 generic functions needed by many Bioconductor packages. Version: 0.14.0 Author: The Bioconductor Dev Team Maintainer: Bioconductor Package Maintainer maintai...@bioconductor.org biocViews: Infrastructure Depends: methods, utils, graphics, stats, parallel Imports: methods, utils, graphics, stats, parallel Suggests: Biobase, S4Vectors, IRanges, GenomicRanges, AnnotationDbi, oligoClasses, oligo, affyPLM, flowClust, affy, DESeq2, MSnbase, annotate, RUnit License: Artistic-2.0 Collate: S3-classes-as-S4-classes.R normarg-utils.R update-utils.R . NeedsCompilation: no Packaged: 2015-04-17 03:42:27 UTC; biocbuild Built: R 3.3.0; ; 2015-04-25 20:08:25 UTC; windows /Henrik ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] as.character method for GenomicRanges?
On 04/24/2015 11:41 AM, Michael Lawrence wrote: Taking this a bit off topic but it would be nice if we could get the GRanges equivalent of as.data.frame(table(x)), i.e., unique(x) with a count mcol. Should be easy to support but what should the API be like? This was actually the motivating use case for introducing findMatches/countMatches a couple of years ago: ux - unique(x) mcols(ux)$Freq - countMatches(ux, x) Don't know what a good API would be to make this even more straightforward though. Maybe via some extra argument to unique() e.g. 'with.freq'? This is kind of similar to the 'with.revmap' argument of reduce(). Note that unique() could also support the 'with.revmap' arg. Once it does, the 'with.freq' arg can also be implemented by just calling elementLengths() on the revmap metadata column. H. On Fri, Apr 24, 2015 at 10:54 AM, Hervé Pagès hpa...@fredhutch.org mailto:hpa...@fredhutch.org wrote: On 04/24/2015 10:18 AM, Michael Lawrence wrote: It is a great idea, but I'm not sure I would use it to implement table(). Allocating those strings will be costly. Don't we already have the 4-way int hash? Of course, my intuition might be completely off here. It does use the 4-way int hash internally. as.character() is only used at the very-end to stick the names on the returned table object. H. On Fri, Apr 24, 2015 at 9:59 AM, Hervé Pagès hpa...@fredhutch.org mailto:hpa...@fredhutch.org mailto:hpa...@fredhutch.org mailto:hpa...@fredhutch.org wrote: Hi Pete, Excellent idea. That will make things like table() work out-of-the-box on GenomicRanges objects. I'll add that. Thanks, H. On 04/24/2015 09:43 AM, Peter Haverty wrote: Would people be interested in having this: setMethod(as.character, GenomicRanges, function(x) { paste0(seqnames(x), :, start(x), -, end(x)) }) ? I find myself doing that a lot to make unique names or for output that goes to collaborators. I suppose we might want to tack on the strand if it isn't *. I have some code for going the other direction too, if there is interest. Pete Peter M. Haverty, Ph.D. Genentech, Inc. phave...@gene.com mailto:phave...@gene.com mailto:phave...@gene.com mailto:phave...@gene.com [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org mailto:hpa...@fredhutch.org mailto:hpa...@fredhutch.org mailto:hpa...@fredhutch.org Phone: (206) 667-5791 tel:%28206%29%20667-5791 tel:%28206%29%20667-5791 Fax: (206) 667-1319 tel:%28206%29%20667-1319 tel:%28206%29%20667-1319 ___ Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org mailto:hpa...@fredhutch.org Phone: (206) 667-5791 tel:%28206%29%20667-5791 Fax: (206) 667-1319 tel:%28206%29%20667-1319 -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] as.character method for GenomicRanges?
On 04/24/2015 11:08 AM, Peter Haverty wrote: Good catch. We'll want the strand in case we need to go back to a GRanges. I would make the strand addition optional with the default of FALSE. It's nice to have a column of strings you can paste right into a genome browser (sorry Michael :-) ). I often pass my bench collaborators a spreadsheet with such a column. as.character(unstrand(gr)) ? 3 reasons I'm not too keen about 'ignore.strand=TRUE' being the default: (1) Many functions and methods in GenomicRanges/GenomicAlignments have an 'ignore.strand' argument. For consistency, the default value has been set to FALSE everywhere. Note that this was done even if this default doesn't reflect the most common use case (e.g. summarizeOverlaps). (2) I think it's good to have the default behavior of as.character() allow going back and forth between GRanges and character vector without losing the strand information. (3) The table method for Vector would break if as.character was ignoring the strand by default. Can be worked-around by implementing a method for GenomicRanges objects but... Hope that makes sense. H. Pete Peter M. Haverty, Ph.D. Genentech, Inc. phave...@gene.com mailto:phave...@gene.com On Fri, Apr 24, 2015 at 10:50 AM, Hervé Pagès hpa...@fredhutch.org mailto:hpa...@fredhutch.org wrote: On 04/24/2015 10:21 AM, Michael Lawrence wrote: Sorry, one more concern, if you're thinking of using as a range key, you will need the strand, but many use cases might not want the strand on there. Like for pasting into a genome browser. What about appending the strand only for GRanges objects that have at least one range that is not on *? setMethod(as.character, GenomicRanges, function(x) { if (length(x) == 0L) return(character(0)) ans - paste0(seqnames(x), :, start(x), -, end(x)) if (any(strand(x) != *)) ans - paste0(ans, :, strand(x)) ans } ) as.character(gr) [1] chr1:1-10 chr2:2-10 chr2:3-10 chr2:4-10 chr1:5-10 [6] chr1:6-10 chr3:7-10 chr3:8-10 chr3:9-10 chr3:10-10 strand(gr)[2:3] - c(-, +) as.character(gr) [1] chr1:1-10:* chr2:2-10:- chr2:3-10:+ chr2:4-10:* chr1:5-10:* [6] chr1:6-10:* chr3:7-10:* chr3:8-10:* chr3:9-10:* chr3:10-10:* H. On Fri, Apr 24, 2015 at 10:18 AM, Michael Lawrence micha...@gene.com mailto:micha...@gene.com mailto:micha...@gene.com mailto:micha...@gene.com wrote: It is a great idea, but I'm not sure I would use it to implement table(). Allocating those strings will be costly. Don't we already have the 4-way int hash? Of course, my intuition might be completely off here. On Fri, Apr 24, 2015 at 9:59 AM, Hervé Pagès hpa...@fredhutch.org mailto:hpa...@fredhutch.org mailto:hpa...@fredhutch.org mailto:hpa...@fredhutch.org wrote: Hi Pete, Excellent idea. That will make things like table() work out-of-the-box on GenomicRanges objects. I'll add that. Thanks, H. On 04/24/2015 09:43 AM, Peter Haverty wrote: Would people be interested in having this: setMethod(as.character, GenomicRanges, function(x) { paste0(seqnames(x), :, start(x), -, end(x)) }) ? I find myself doing that a lot to make unique names or for output that goes to collaborators. I suppose we might want to tack on the strand if it isn't *. I have some code for going the other direction too, if there is interest. Pete Peter M. Haverty, Ph.D. Genentech, Inc. phave...@gene.com mailto:phave...@gene.com mailto:phave...@gene.com mailto:phave...@gene.com [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer
Re: [Bioc-devel] as.character method for GenomicRanges?
On 04/24/2015 10:18 AM, Michael Lawrence wrote: It is a great idea, but I'm not sure I would use it to implement table(). Allocating those strings will be costly. Don't we already have the 4-way int hash? Of course, my intuition might be completely off here. It does use the 4-way int hash internally. as.character() is only used at the very-end to stick the names on the returned table object. H. On Fri, Apr 24, 2015 at 9:59 AM, Hervé Pagès hpa...@fredhutch.org mailto:hpa...@fredhutch.org wrote: Hi Pete, Excellent idea. That will make things like table() work out-of-the-box on GenomicRanges objects. I'll add that. Thanks, H. On 04/24/2015 09:43 AM, Peter Haverty wrote: Would people be interested in having this: setMethod(as.character, GenomicRanges, function(x) { paste0(seqnames(x), :, start(x), -, end(x)) }) ? I find myself doing that a lot to make unique names or for output that goes to collaborators. I suppose we might want to tack on the strand if it isn't *. I have some code for going the other direction too, if there is interest. Pete Peter M. Haverty, Ph.D. Genentech, Inc. phave...@gene.com mailto:phave...@gene.com [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org mailto:hpa...@fredhutch.org Phone: (206) 667-5791 tel:%28206%29%20667-5791 Fax: (206) 667-1319 tel:%28206%29%20667-1319 ___ Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Error on Windows when selecting a single chromosome from a BSgenome
Hi Bernat, Sorry for the delay in getting back to you. Like Dan, I cannot reproduce this either. Without more input from you (e.g. sessionInfo()), there is not much we can do. Note that an easy way to get the coordinates of the masked regions in a data.frame is with something like this: path - system.file(extdata, chr1.masks.rda, package=BSgenome.Hsapiens.UCSC.hg19.masked) load(path) as.data.frame(chr1.masks[[AGAPS]]) This is for the assembly gaps on chr1. You can do the same for any of the 3 other types of masks (intra-contig ambiguities, RepeatMasker, Tandem Repeats Finder) on any other chromosome/scaffold. Cheers, H. On 03/20/2015 02:50 PM, Dan Tenenbaum wrote: - Original Message - From: Bernat Gel b...@imppc.org To: bioc-devel@r-project.org Sent: Friday, March 20, 2015 8:06:59 AM Subject: [Bioc-devel] Error on Windows when selecting a single chromosome from a BSgenome Hi all, We are trying to get our first package (regioneR) included into Bioconductor. A couple of months ago it passed all the automatic checks but now, without code changes from our part fails on windows. I've been digging and it seems the problem arises when selecting a single chromosome from a BSgenome object. library(BSgenome.Hsapiens.UCSC.hg19.masked) BSgenome.Hsapiens.UCSC.hg19.masked[[chr1]] This works in linux and mac but in windows it kills R without any error message. Any idea of what's going on? Our final aim when doing that is to have a data.frame with the genomic coordinates of the masked regions, so if there's an alternative approach we could use it as a workaround. I can't reproduce this. Can you send the output of sessionInfo() (run after loading BSgenome.Hsapiens.UCSC.hg19.masked but before the line that causes the crash)? Dan Thanks a lot! Bernat ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] coverage(gr, weight='score') does not work when score(gr) is an Rle
Hi, On 04/17/2015 09:42 AM, Hervé Pagès wrote: Hi, I think we should just expand the Rle internally. That will produce a numeric vector of the length of the GRanges i.e. it will be the same size as the start and end components of the GRanges object itself. No big deal at all. I'll make that change. Done in GenomicRanges 1.20.2 (release) and 1.21.1 (devel). Cheers, H. H. On 04/17/2015 09:00 AM, Michael Lawrence wrote: Ideally it should be supported, but it would take some work as the coverage stuff is all in C. Could you give more details on your use case? For example, if you already have a range for every position on the chromosome, you could just extract the score column. I'm guessing it's more complicated than that. If the zeros are the problem, you could just subset the GRanges to remove the ranges with zero score, and then coerce the score to numeric before calling coverage. Michael 2015-04-17 8:00 GMT-07:00 Philip Lijnzaad p.lijnz...@umcutrecht.nl: Dear all, I'm puzzled by the following behaviour: Given n - 10 gr - GRanges(seqnames=Rle('A', n), ranges=IRanges(1:n, width=1), score=Rle(5,n)) If I do coverage(gr,weight='score') I get Error in .normarg_shift_or_weight(weight, weight, x) : 'weight' must be a numeric vector, a single string, or a list-like object Surely 'score' should be allowed to be an Rle? Especially given the fact that the return value of coverage(x,weight=score) when score is plain numeric vector is always an Rle ! Is this the expected behaviour? If so, I would argue that violates the principle of least suprise :-) The background to this is that I do numerical analysis on derived numerical data along my chromosomes. It contains many contiguous zeroes so it would be wasteful to cast everything down using as.numeric(). This is R version 3.01 on x86_64 Linux, Bioconductor version 2.13, package.version(IRanges) [1] 1.20.7 package.version(GenomicRanges) [1] 1.14.4 Regards, Philip -- Philip Lijnzaad, PhD Molecular Cancer Research University Medical Center (UMC), Utrecht Stratenum room 2.211 IM: plijnz...@jabber.org , philip.lijnz...@gmail.com P.O. Box 85060, 3508 AB Utrecht (Universiteitsweg 100, 3584 CG Utrecht) The Netherlands tel: +31 (0)8875 68464 -- De informatie opgenomen in dit bericht kan vertrouwelijk zijn en is uitsluitend bestemd voor de geadresseerde. Indien u dit bericht onterecht ontvangt, wordt u verzocht de inhoud niet te gebruiken en de afzender direct te informeren door het bericht te retourneren. Het Universitair Medisch Centrum Utrecht is een publiekrechtelijke rechtspersoon in de zin van de W.H.W. (Wet Hoger Onderwijs en Wetenschappelijk Onderzoek) en staat geregistreerd bij de Kamer van Koophandel voor Midden-Nederland onder nr. 30244197. Denk s.v.p aan het milieu voor u deze e-mail afdrukt. -- This message may contain confidential information and ...{{dropped:10}} ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] coverage(gr, weight='score') does not work when score(gr) is an Rle
On 04/17/2015 10:00 AM, Michael Lawrence wrote: Is that the case here? He has an Rle as an mcol in the GRanges, so in general expanding it will not align with the other components. Not sure what you mean. Can you give an example? H. On Fri, Apr 17, 2015 at 9:42 AM, Hervé Pagès hpa...@fredhutch.org mailto:hpa...@fredhutch.org wrote: Hi, I think we should just expand the Rle internally. That will produce a numeric vector of the length of the GRanges i.e. it will be the same size as the start and end components of the GRanges object itself. No big deal at all. I'll make that change. H. On 04/17/2015 09:00 AM, Michael Lawrence wrote: Ideally it should be supported, but it would take some work as the coverage stuff is all in C. Could you give more details on your use case? For example, if you already have a range for every position on the chromosome, you could just extract the score column. I'm guessing it's more complicated than that. If the zeros are the problem, you could just subset the GRanges to remove the ranges with zero score, and then coerce the score to numeric before calling coverage. Michael 2015-04-17 8:00 GMT-07:00 Philip Lijnzaad p.lijnz...@umcutrecht.nl mailto:p.lijnz...@umcutrecht.nl: Dear all, I'm puzzled by the following behaviour: Given n - 10 gr - GRanges(seqnames=Rle('A', n), ranges=IRanges(1:n, width=1), score=Rle(5,n)) If I do coverage(gr,weight='score') I get Error in .normarg_shift_or_weight(weight, weight, x) : 'weight' must be a numeric vector, a single string, or a list-like object Surely 'score' should be allowed to be an Rle? Especially given the fact that the return value of coverage(x,weight=score) when score is plain numeric vector is always an Rle ! Is this the expected behaviour? If so, I would argue that violates the principle of least suprise :-) The background to this is that I do numerical analysis on derived numerical data along my chromosomes. It contains many contiguous zeroes so it would be wasteful to cast everything down using as.numeric(). This is R version 3.01 on x86_64 Linux, Bioconductor version 2.13, package.version(IRanges) [1] 1.20.7 package.version(GenomicRanges) [1] 1.14.4 Regards, Philip -- Philip Lijnzaad, PhD Molecular Cancer Research University Medical Center (UMC), Utrecht Stratenum room 2.211 IM: plijnz...@jabber.org mailto:plijnz...@jabber.org , philip.lijnz...@gmail.com mailto:philip.lijnz...@gmail.com P.O. Box 85060, 3508 AB Utrecht (Universiteitsweg 100, 3584 CG Utrecht) The Netherlands tel: +31 (0)8875 68464 tel:%2B31%20%280%298875%2068464 -- De informatie opgenomen in dit bericht kan vertrouwelijk zijn en is uitsluitend bestemd voor de geadresseerde. Indien u dit bericht onterecht ontvangt, wordt u verzocht de inhoud niet te gebruiken en de afzender direct te informeren door het bericht te retourneren. Het Universitair Medisch Centrum Utrecht is een publiekrechtelijke rechtspersoon in de zin van de W.H.W. (Wet Hoger Onderwijs en Wetenschappelijk Onderzoek) en staat geregistreerd bij de Kamer van Koophandel voor Midden-Nederland onder nr. 30244197. Denk s.v.p aan het milieu voor u deze e-mail afdrukt. -- This message may contain confidential information and ...{{dropped:10}} ___ Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org mailto:hpa...@fredhutch.org Phone: (206) 667-5791 tel:%28206%29%20667-5791 Fax: (206) 667-1319 tel
Re: [Rd] behavior of as.integer(5000000000)
On 04/17/2015 08:24 AM, Martin Maechler wrote: Martin Maechler maech...@lynne.stat.math.ethz.ch on Fri, 17 Apr 2015 15:49:35 +0200 writes: Hervé Pagès hpa...@fredhutch.org on Mon, 13 Apr 2015 23:36:14 -0700 writes: On 04/13/2015 11:32 PM, Martin Maechler wrote: Hi, as.integer(50) [1] 2147483647 Warning message: inaccurate integer conversion in coercion as.integer(-50) [1] NA Warning message: inaccurate integer conversion in coercion Is this a bug or a feature? The man page suggests it's the latter: I think you mean the former, a bug. and I agree entirely, see the following 2 x 2 comparison : N - 5 * 8^-(0:7) as.integer(N) [1] NA NA NA NA 1220703125 152587890 190734862384185 Warning message: NAs introduced by coercion as.integer(-N) [1] NA NA NA NA -1220703125 -152587890 -19073486 [8]-2384185 Warning message: NAs introduced by coercion as.integer(as.character(N)) [1] 2147483647 2147483647 2147483647 2147483647 1220703125 152587890 190734862384185 Warning message: inaccurate integer conversion in coercion as.integer(as.character(-N)) [1] NA NA NA NA -1220703125 -152587890 -19073486 [8]-2384185 Warning message: inaccurate integer conversion in coercion ‘as.integer’ attempts to coerce its argument to be of integer type. The answer will be ‘NA’ unless the coercion succeeds. even though someone could always argue that coercion of 50 succeeded (for some definition of succeed). Also is there any reason why the warning message is different than with: as.integer(-50) [1] NA Warning message: NAs introduced by coercion In the case of as.integer(-50), it's not really that the conversion was inaccurate, it's a little bit worse than that. And knowing that NAs where introduced by coercion is important. Yes. The message is less a problem than the bug, but I agree we should try to improve it. Sounds good. Thanks Martin, I've committed a change to R-devel now, such that also this case returns NA with a warning, actually for the moment with both the old warning and the 'NAs introduced by coercion' warning. The nice thing about the old warning is that it explicitly mentions integer coercion. I currently think we should keep that property, and I'd propose to completely drop the inaccurate integer conversion in coercion warning (it is not used anywhere else currently) and replace it in this and other as.integer(.) cases with 'NAs introduced by integer coercion' (or something similar. ... improvements / proposals are welcome). Replying to myself: I've found 'NAs introduced by coercion to integer range' I like that we see coercion *to* integer instead of just integer coercion because the former indicates the direction of the coercion. I'm not that convinced with the range thing though. I think as.integer(c(78, a34, -50)) should emit only one warning and not try to categorize the reasons for getting an NA. Thanks, H. to be even more on spot, and so will commit it for today. Of course, amendment proposals are still welcome. Martin BTW, the fact that as.integer(-50) did produce an NA instead of -2147483647 so it would have been compatible with as.integer(50) was just another coincidence, namely that we currently code NA_integer_ by INT_MIN (for 32 bit integers, INT_MIN = 2147483648 = 2^31) [[but your C code must not rely on that, it is an implementation detail!]] Martin -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Bioc-devel] coverage(gr, weight='score') does not work when score(gr) is an Rle
Hi, I think we should just expand the Rle internally. That will produce a numeric vector of the length of the GRanges i.e. it will be the same size as the start and end components of the GRanges object itself. No big deal at all. I'll make that change. H. On 04/17/2015 09:00 AM, Michael Lawrence wrote: Ideally it should be supported, but it would take some work as the coverage stuff is all in C. Could you give more details on your use case? For example, if you already have a range for every position on the chromosome, you could just extract the score column. I'm guessing it's more complicated than that. If the zeros are the problem, you could just subset the GRanges to remove the ranges with zero score, and then coerce the score to numeric before calling coverage. Michael 2015-04-17 8:00 GMT-07:00 Philip Lijnzaad p.lijnz...@umcutrecht.nl: Dear all, I'm puzzled by the following behaviour: Given n - 10 gr - GRanges(seqnames=Rle('A', n), ranges=IRanges(1:n, width=1), score=Rle(5,n)) If I do coverage(gr,weight='score') I get Error in .normarg_shift_or_weight(weight, weight, x) : 'weight' must be a numeric vector, a single string, or a list-like object Surely 'score' should be allowed to be an Rle? Especially given the fact that the return value of coverage(x,weight=score) when score is plain numeric vector is always an Rle ! Is this the expected behaviour? If so, I would argue that violates the principle of least suprise :-) The background to this is that I do numerical analysis on derived numerical data along my chromosomes. It contains many contiguous zeroes so it would be wasteful to cast everything down using as.numeric(). This is R version 3.01 on x86_64 Linux, Bioconductor version 2.13, package.version(IRanges) [1] 1.20.7 package.version(GenomicRanges) [1] 1.14.4 Regards, Philip -- Philip Lijnzaad, PhD Molecular Cancer Research University Medical Center (UMC), Utrecht Stratenum room 2.211 IM: plijnz...@jabber.org , philip.lijnz...@gmail.com P.O. Box 85060, 3508 AB Utrecht (Universiteitsweg 100, 3584 CG Utrecht) The Netherlands tel: +31 (0)8875 68464 -- De informatie opgenomen in dit bericht kan vertrouwelijk zijn en is uitsluitend bestemd voor de geadresseerde. Indien u dit bericht onterecht ontvangt, wordt u verzocht de inhoud niet te gebruiken en de afzender direct te informeren door het bericht te retourneren. Het Universitair Medisch Centrum Utrecht is een publiekrechtelijke rechtspersoon in de zin van de W.H.W. (Wet Hoger Onderwijs en Wetenschappelijk Onderzoek) en staat geregistreerd bij de Kamer van Koophandel voor Midden-Nederland onder nr. 30244197. Denk s.v.p aan het milieu voor u deze e-mail afdrukt. -- This message may contain confidential information and ...{{dropped:10}} ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Rd] behavior of as.integer(5000000000)
Hi Martin, On 04/17/2015 06:49 AM, Martin Maechler wrote: Hervé Pagès hpa...@fredhutch.org on Mon, 13 Apr 2015 23:36:14 -0700 writes: On 04/13/2015 11:32 PM, Martin Maechler wrote: Hi, as.integer(50) [1] 2147483647 Warning message: inaccurate integer conversion in coercion as.integer(-50) [1] NA Warning message: inaccurate integer conversion in coercion Is this a bug or a feature? The man page suggests it's the latter: I think you mean the former, a bug. and I agree entirely, see the following 2 x 2 comparison : N - 5 * 8^-(0:7) as.integer(N) [1] NA NA NA NA 1220703125 152587890 190734862384185 Warning message: NAs introduced by coercion as.integer(-N) [1] NA NA NA NA -1220703125 -152587890 -19073486 [8]-2384185 Warning message: NAs introduced by coercion as.integer(as.character(N)) [1] 2147483647 2147483647 2147483647 2147483647 1220703125 152587890 190734862384185 Warning message: inaccurate integer conversion in coercion as.integer(as.character(-N)) [1] NA NA NA NA -1220703125 -152587890 -19073486 [8]-2384185 Warning message: inaccurate integer conversion in coercion ‘as.integer’ attempts to coerce its argument to be of integer type. The answer will be ‘NA’ unless the coercion succeeds. even though someone could always argue that coercion of 50 succeeded (for some definition of succeed). Also is there any reason why the warning message is different than with: as.integer(-50) [1] NA Warning message: NAs introduced by coercion In the case of as.integer(-50), it's not really that the conversion was inaccurate, it's a little bit worse than that. And knowing that NAs where introduced by coercion is important. Yes. The message is less a problem than the bug, but I agree we should try to improve it. Sounds good. Thanks Martin, I've committed a change to R-devel now, such that also this case returns NA with a warning, actually for the moment with both the old warning and the 'NAs introduced by coercion' warning. The nice thing about the old warning is that it explicitly mentions integer coercion. I currently think we should keep that property, and I'd propose to completely drop the inaccurate integer conversion in coercion warning (it is not used anywhere else currently) and replace it in this and other as.integer(.) cases with 'NAs introduced by integer coercion' (or something similar. ... improvements / proposals are welcome). Thanks. That's a much better warning message. BTW, the fact that as.integer(-50) did produce an NA instead of -2147483647 so it would have been compatible with as.integer(50) was just another coincidence, namely that we currently code NA_integer_ by INT_MIN (for 32 bit integers, INT_MIN = 2147483648 = 2^31) [[but your C code must not rely on that, it is an implementation detail!]] Yeah, I suspected that. Thanks for the fix. H. Martin -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] truncated warning messages
Hi, I was installing hundreds of packages on a machine with a single call to install.packages() and after a long time the call to install.packages() finally returned with the following warnings and errors: Warning messages: 1: packages ‘hgu133aprobe’, ‘hgu95av2.db’, ‘BSgenome.Celegans.UCSC.ce2’, ‘BSgenome.Mmusculus.UCSC.mm10’, ‘BSgenome.Dmelanogaster.UCSC.dm3.masked’, ‘BSgenome.Hsapiens.UCSC.hg18.masked’, ‘BSgenome.Hsapiens.UCSC.hg19.masked’, ‘BSgenome.Mmusculus.UCSC.mm9.masked’, ‘BSgenome.Scerevisiae.UCSC.sacCer2’, ‘BSgenome.Hsapiens.NCBI.GRCh38’, ‘BSgenome.Rnorvegicus.UCSC.rn5’, ‘drosophila2probe’, ‘hgu95av2probe’, ‘hgu95av2cdf’, ‘SNPlocs.Hsapiens.dbSNP.20100427’, ‘SNPlocs.Hsapiens.dbSNP.20101109’, ‘SNPlocs.Hsapiens.dbSNP.20110815’, ‘SNPlocs.Hsapiens.dbSNP141.GRCh38’, ‘XtraSNPlocs.Hsapiens.dbSNP141.GRCh38’, ‘org.Sc.sgd.db’, ‘org.Mm.eg.db’, ‘org.At.tair.db’, ‘hom.Hs.inp.db’, ‘GO.db’, ‘TxDb.Hsapiens.UCSC.hg18.knownGene’, ‘TxDb.Hsapiens.UCSC.hg19.knownGene’, ‘TxDb.Hsapiens.UCSC.hg38.knownGene’, ‘TxDb.Mmusculus.UCSC.mm9.knownGene’, ‘TxDb.Mmusculus.UCSC.mm10.knownGene’, ‘TxDb.Dmelanogaster.UCSC. [... truncated] 2: In install.packages(pkgs = doing, lib = lib, ...) : installation of package ‘humanStemCell’ had non-zero exit status 3: In install.packages(pkgs = doing, lib = lib, ...) : installation of package ‘AnnotationForge’ had non-zero exit status As you can see the warning message is truncated so I can't see what happened to these packages. What about having a message like package(s) not available: pkg1, pkg2, pkg3, etc... so when it's truncated we don't loose the important part of the story? I guess the same kind of change could be made to the other warning messages, and, more generally, to any warning/error message obtained by injecting an arbitrary (and potentially long) list of strings. Thanks, H. -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] behavior of as.integer(5000000000)
On 04/13/2015 11:32 PM, Martin Maechler wrote: Hi, as.integer(50) [1] 2147483647 Warning message: inaccurate integer conversion in coercion as.integer(-50) [1] NA Warning message: inaccurate integer conversion in coercion Is this a bug or a feature? The man page suggests it's the latter: I think you mean the former, a bug. and I agree entirely, see the following 2 x 2 comparison : N - 5 * 8^-(0:7) as.integer(N) [1] NA NA NA NA 1220703125 152587890 190734862384185 Warning message: NAs introduced by coercion as.integer(-N) [1] NA NA NA NA -1220703125 -152587890 -19073486 [8]-2384185 Warning message: NAs introduced by coercion as.integer(as.character(N)) [1] 2147483647 2147483647 2147483647 2147483647 1220703125 152587890 190734862384185 Warning message: inaccurate integer conversion in coercion as.integer(as.character(-N)) [1] NA NA NA NA -1220703125 -152587890 -19073486 [8]-2384185 Warning message: inaccurate integer conversion in coercion ‘as.integer’ attempts to coerce its argument to be of integer type. The answer will be ‘NA’ unless the coercion succeeds. even though someone could always argue that coercion of 50 succeeded (for some definition of succeed). Also is there any reason why the warning message is different than with: as.integer(-50) [1] NA Warning message: NAs introduced by coercion In the case of as.integer(-50), it's not really that the conversion was inaccurate, it's a little bit worse than that. And knowing that NAs where introduced by coercion is important. Yes. The message is less a problem than the bug, but I agree we should try to improve it. Sounds good. Thanks Martin, H. Martin -- Hervé Pagès ... -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] behavior of as.integer(5000000000)
Hi, as.integer(50) [1] 2147483647 Warning message: inaccurate integer conversion in coercion as.integer(-50) [1] NA Warning message: inaccurate integer conversion in coercion Is this a bug or a feature? The man page suggests it's the latter: ‘as.integer’ attempts to coerce its argument to be of integer type. The answer will be ‘NA’ unless the coercion succeeds. even though someone could always argue that coercion of 50 succeeded (for some definition of succeed). Also is there any reason why the warning message is different than with: as.integer(-50) [1] NA Warning message: NAs introduced by coercion In the case of as.integer(-50), it's not really that the conversion was inaccurate, it's a little bit worse than that. And knowing that NAs where introduced by coercion is important. Thanks, H. -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Bioc-devel] Suggestion: export function fetchChromLengthsFromEnsembl in GenomicFeatures
Hi jo, getChromInfoFromBiomart() is a higher level function that is exported, documented, and intended to be used by the user for that, rather than internal helpers fetchChromLengthsFromEnsembl() and fetchChromLengthsFromEnsemblPlants(). However, after some discussions here with Marc and Sonali, my understanding is that the GRanges objects made from the Ensembl GTF files are going to have their seqinfo updated soon so you won't need that. H. On 04/11/2015 12:37 PM, Rainer Johannes wrote: I wanted to ask whether it would be possible to export the functions fetchChromLengthsFromEnsembl and fetchChromLengthsFromEnsemblPlants in GenomicFeatures, as I find these methods quite useful to retrieve chromosome lengths. best, jo ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] New package: coMET
Hi Tiphaine, As you're probably already aware, a change in the latest release of Ensembl breaks its vignette: http://bioconductor.org/checkResults/3.1/bioc-LATEST/coMET/zin2-buildsrc.html On D-day, only packages that pass build/check will be made available to BioC 3.1 users via biocLite(). So please fix coMET asap so it can make it into BioC 3.1. Thanks, H. On 03/23/2015 12:23 PM, Martin, Tiphaine wrote: Hi all, I'm pleased to announce the availability of the coMET package (http://bioconductor.org/packages/devel/bioc/html/coMET.html , https://github.com/TiphaineCMartin/coMET), which visualises regional epigenome-wide association scan (EWAS) results and DNA co-methylation patterns. The R package is designed for human epigenetic data, but can also be applied to genomic and functional genomic datasets in any species such as gene expression. coMET exists also in a Shiny-based web application (http://www.epigen.kcl.ac.uk/comet), which allows users to run a pre-formated version (currently only for human genome). It is possible also to extract the correlations between omic features with their pvalues. Please let me know of any issues you encounter. Thanks, Tiphaine Martin Tiphaine Martin I PhD Student I TwinsUK Department of Twin Research, Kings College London, St Thomas' Hospital Campus, 3rd Floor South Wing Block D Westminster Bridge Road, London SE1 7EH, UK Email : tiphaine.mar...@kcl.ac.ukmailto:tiphaine.mar...@kcl.ac.uk Direct: +44 (0) 207 188 1508 Fax: +44 (0) 207 188 6761 London Microbiome Meeting, 28th of May 2015: http://londonmicrobiome.org/ Mathematical and Quantitative Genomics conference, 29th of May: http://quantgen.soc.srcf.net/qg15/ [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] issue about S4 slot has a dist object.
On 04/01/2015 05:05 PM, Michael Lawrence wrote: So this explains why I wasn't able to figure out how that package was importing graph, and, yes, I also thought it was strange that graph did not export it. The methods package explicitly conditions on the warn level, so it is apparently intentional. It just looks in the global class table for duplicates, so it does not pay attention to the namespace. It's not clear to what extent the methods package assumes that there are no duplicates in the class table; probably too much work to fix. As for sharing class definitions, perhaps the methods package should define it. It already defines classes from the stats package, like aov. We could start moving more stuff from BiocGenerics to methods. Sounds good. The more upstream these class definitions are the better. Just moved setOldClass(dist) from graph to BiocGenerics 0.13.11 and exported the dist class. H. On Wed, Apr 1, 2015 at 4:29 PM, Martin Morgan mtmor...@fredhutch.org mailto:mtmor...@fredhutch.org wrote: On 04/01/2015 03:52 PM, Hervé Pagès wrote: Hi, In the same way that we avoid having 2 packages define the same S4 generic function by moving the shared generic definitions to BiocGenerics, it seems that we should also avoid having 2 packages call setOldClass on the same S3 class. Like with S4 generic functions, we've already started to do this by putting some setOldClass statements in BiocGenerics (e.g. we've done it for the 'connection' classes 'file', 'url', 'gzfile', 'bzfile', etc..., see class?gzfile). So if nobody objects we'll do this for the 'dist' class too. Then you won't need to use setOldClass in your cogena package Zhilong. You'll just need to make sure that you import BiocGenerics. This sounds like an ok work-around to me. For the hard-core... One thing is that this is not seen when the package is loaded by itself, e.g., biocLite(zhilongjia/cogena) library(cogena) only when loaded by BiocCheck (on the source directory) BiocCheck::BiocCheck(cogena) * This is BiocCheck, version 1.3.13. * BiocCheck is a work in progress. Output and severity of issues may change. * Installing package... Note: the specification for S3 class dist in package 'cogena' seems equivalent to one from package 'graph': not turning on duplicate class definitions for this class. ^C This is because BiocCheck (indirectly?) Imports: graph. But the old class definition seems to 'leak' (even though graph is not on the search path, and the dist old class is not exported from graph, and BiocCheck doesn't import the non-exported dist class, and cogena doesn't Depend or Import graph!) Also of interest perhaps is that the Note is only printed when warn=1 (which BiocCheck also uses) (new R session) requireNamespace(graph) Loading required namespace: graph requireNamespace(cogena) Loading required namespace: cogena q() (new R session:) options(warn=1) requireNamespace(graph) Loading required namespace: graph requireNamespace(cogena) Loading required namespace: cogena Note: the specification for S3 class dist in package 'cogena' seems equivalent to one from package 'graph': not turning on duplicate class definitions for this class. Cheers, H. On 04/01/2015 03:28 PM, Michael Lawrence wrote: Using setOldClass is generally fine. In this case, the graph package is already defining the dist class, so you could just import that. The graph package might have to export it though. On Wed, Apr 1, 2015 at 3:15 PM, Zhilong Jia zhilong...@gmail.com mailto:zhilong...@gmail.com wrote: Hi, Here is the package. (https://tracker.bioconductor.__org/issue1204 https://tracker.bioconductor.org/issue1204 or https://github.com/zhilongjia/__cogena https://github.com/zhilongjia/cogena; ). When I biocCheck it, there is a note. Note: the specification for S3 class “dist” in package ‘cogena’ seems equivalent to one from package ‘graph’: not turning on duplicate class definitions for this class. In the source code, there are two R files are related with this issue, cogena_class.R and https://github.com/__zhilongjia/cogena/blob/master/__R/cogena_class.R https://github.com/zhilongjia/cogena/blob/master/R/cogena_class.R dist_class.R https://github.com/__zhilongjia/cogena/blob/master/__R
Re: [Bioc-devel] request to create BSgenome Bos_taurus_UMD3.1.1
So I went ahead and did BSgenome.Btaurus.UCSC.bosTau8. It should become available via biocLite() in the next couple of hours (in BioC devel only). H. On 03/26/2015 12:54 AM, Hervé Pagès wrote: Hi Byungkuk, On 03/25/2015 10:17 PM, Byungkuk Min wrote: Dear Dr.Pages, I've been trying to build a BSgenome for Bos_taurus_UMD3.1.1 from NCBI http://www.ncbi.nlm.nih.gov/assembly/GCF_03055.6/. But I'm getting the same error message... library(BSgenome) forgeBSgenomeDataPkg(/PATH/TO/genome.fa) Error: Line starting '1 ...' is malformed! I don't have much knowledge of R, so creating a custom BSgenome seems out of my league. I would like to kindly request the bovine BSgenome package. Do you really need *that* particular assembly (GCF_03055.6). Otherwise, there are already some bovine BSgenome packages available: library(BSgenome) grep(Btaurus, available.genomes(), value=TRUE) [1] BSgenome.Btaurus.UCSC.bosTau3 [2] BSgenome.Btaurus.UCSC.bosTau3.masked [3] BSgenome.Btaurus.UCSC.bosTau4 [4] BSgenome.Btaurus.UCSC.bosTau4.masked [5] BSgenome.Btaurus.UCSC.bosTau6 [6] BSgenome.Btaurus.UCSC.bosTau6.masked We don't have bosTau8 yet, which is the latest bovine assembly available at UCSC (they added it in June 2014) but I could add it. Note that despite its name (also Bos_taurus_UMD_3.1.1), bosTau8 is not the same assembly as the one you picked up on NCBI. Yours is: http://www.ncbi.nlm.nih.gov/assembly/GCF_03055.6 (the latest, from 2014/11/25), but bosTau8 is: http://www.ncbi.nlm.nih.gov/assembly/GCF_03055.5 (much older, from 2009/12/01) Anyway, if we ignore chrM and the thousands of scaffolds that are included in these assemblies, the sequences of the main chromosomes (i.e. chr1 to chr29 + chrX) are exactly the same in the 2 assemblies. So maybe the BSgenome package for bosTau8 will do? Note that the main advantage of making a BSgenome package for a UCSC assembly (instead of using an NCBI assembly) is that the BSgenome object then interoperates nicely with the TxDb object that one can easily make from one of the UCSC tracks available for that assembly (using the makeTxDbFromUCSC() function from the GenomicFeatures package). H. Thank you! *Byungkuk Min* Ph.D Candidate University of Science Technology, http://ust.ac.kr Korea Research Institute of Bioscience and Biotechnology http://www.kribb.re.kr 125 Gwahangno, Yuseong-Gu, Daejeon, Korea (305-806) Laboratory : 042-860-4423 FAX : 042-860-4608 Cell phone : 010-5209-7377 [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] zero-width ranges representing insertions
On 03/16/2015 07:36 PM, Valerie Obenchain wrote: On 03/16/2015 05:31 PM, Hervé Pagès wrote: On 03/16/2015 04:06 PM, Michael Lawrence wrote: On Mon, Mar 16, 2015 at 3:12 PM, Robert Castelo robert.cast...@upf.edu mailto:robert.cast...@upf.edu wrote: +1 IMO BioC could adopt the zero-width ranges representation for insertions, adapting readVcf(), writeVcf(), XtraSNPlocs.*, etc., to deal with each corresponding beast, be VCF, dbSNP or the like. Who knows, VCF could also change their representation in the future and it'll be a headache to update the affected packages if we decide to keep using its insertion representation internally to store variant ranges in BioC. That would break just about every tool, so let's hope not. There's a bunch of code on top of Bioc that currently depends on the current representation. For example, zero width ranges do not overlap anything, so they need special treatment to e.g. detect whether an insertion falls within a gene. There are real benefits to keeping the representation of indels consistent with the rest of the field (VCF). There was much thought put into this. Note that findOverlaps() now handles zero-width ranges. predictCoding and locateVariants had to work around the fact that findOverlaps didn't work on zero-length ranges. Now that we can handle zero-length overlaps this code should be updated (yes, my fault for not updating). How predictCoding and locateVariants currently handle insertions can be modified. The current behavior (bug) should not have no bearing on how we want to represent indels in the VCF class in general. I agree with Michael that much thought went into making the VCF class consistent with the VCF specs. While information has been added to the file format over the years I am not aware of any changes related to the positional representation of the variant (ie, it has been stable) and think it unlikely this would change in the future. Just to clarify, my point was that the ranges of the insertions as currently reported in VCF and VRanges objects cannot be used as-is in findOverlaps() to detect whether an insertion falls within a gene or where is falls exactly within a gene. So these ranges need special treatment like zero-width ranges do. H. Val Straight use of findOverlaps() on the ranges of a VCF object leads to some subtle problems on insertions. For example predictCoding() (which I guess uses findOverlaps() internally) reports strange things for these 2 insertions (1 right before and 1 right after the stop codon): library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb - TxDb.Hsapiens.UCSC.hg19.knownGene cdsBy(txdb, use.names=TRUE)$uc002wcw.3 GRanges object with 2 ranges and 3 metadata columns: seqnames ranges strand |cds_idcds_name exon_rank Rle IRanges Rle | integer character integer [1]chr20 [68351, 68408] + |206100NA 1 [2]chr20 [76646, 77058] + |206101NA 2 --- seqinfo: 93 sequences (1 circular) from hg19 genome library(VariantAnnotation) rowRanges(vcf) # hand-made VCF GRanges object with 2 ranges and 5 metadata columns: seqnames ranges strand | paramRangeID Rle IRanges Rle | factor ins before stop codonchr20 [77055, 77055] * | NA ins after stop codonchr20 [77058, 77058] * | NA REFALT QUAL FILTER DNAStringSet DNAStringSetList numeric character ins before stop codon T TG70 PASS ins after stop codon A AG70 PASS --- seqinfo: 1 sequence from hg19 genome Calling predictCoding(): library(BSgenome.Hsapiens.UCSC.hg19) predictCoding(vcf, txdb, Hsapiens) GRanges object with 2 ranges and 17 metadata columns: seqnames ranges strand | paramRangeID Rle IRanges Rle | factor ins before stop codonchr20 [77055, 77055] + | NA ins after stop codonchr20 [77058, 77058] + | NA REFALT QUAL FILTER DNAStringSet DNAStringSetList numeric character ins before stop codon T TG70 PASS ins after stop codon A AG70 PASS varAllele CDSLOCPROTEINLOC QUERYID DNAStringSet IRanges IntegerList integer ins before stop codon TG [468, 468] 156 1 ins after stop codon AG [471, 471] 157 2 TXID CDSID GENEID CONSEQUENCE character IntegerList character factor ins before stop codon 70477
Re: [Bioc-devel] zero-width ranges representing insertions
On 03/16/2015 04:06 PM, Michael Lawrence wrote: On Mon, Mar 16, 2015 at 3:12 PM, Robert Castelo robert.cast...@upf.edu mailto:robert.cast...@upf.edu wrote: +1 IMO BioC could adopt the zero-width ranges representation for insertions, adapting readVcf(), writeVcf(), XtraSNPlocs.*, etc., to deal with each corresponding beast, be VCF, dbSNP or the like. Who knows, VCF could also change their representation in the future and it'll be a headache to update the affected packages if we decide to keep using its insertion representation internally to store variant ranges in BioC. That would break just about every tool, so let's hope not. There's a bunch of code on top of Bioc that currently depends on the current representation. For example, zero width ranges do not overlap anything, so they need special treatment to e.g. detect whether an insertion falls within a gene. There are real benefits to keeping the representation of indels consistent with the rest of the field (VCF). There was much thought put into this. Note that findOverlaps() now handles zero-width ranges. Straight use of findOverlaps() on the ranges of a VCF object leads to some subtle problems on insertions. For example predictCoding() (which I guess uses findOverlaps() internally) reports strange things for these 2 insertions (1 right before and 1 right after the stop codon): library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb - TxDb.Hsapiens.UCSC.hg19.knownGene cdsBy(txdb, use.names=TRUE)$uc002wcw.3 GRanges object with 2 ranges and 3 metadata columns: seqnames ranges strand |cds_idcds_name exon_rank Rle IRanges Rle | integer character integer [1]chr20 [68351, 68408] + |206100NA 1 [2]chr20 [76646, 77058] + |206101NA 2 --- seqinfo: 93 sequences (1 circular) from hg19 genome library(VariantAnnotation) rowRanges(vcf) # hand-made VCF GRanges object with 2 ranges and 5 metadata columns: seqnames ranges strand | paramRangeID Rle IRanges Rle | factor ins before stop codonchr20 [77055, 77055] * | NA ins after stop codonchr20 [77058, 77058] * | NA REFALT QUAL FILTER DNAStringSet DNAStringSetList numeric character ins before stop codon T TG70 PASS ins after stop codon A AG70 PASS --- seqinfo: 1 sequence from hg19 genome Calling predictCoding(): library(BSgenome.Hsapiens.UCSC.hg19) predictCoding(vcf, txdb, Hsapiens) GRanges object with 2 ranges and 17 metadata columns: seqnames ranges strand | paramRangeID Rle IRanges Rle | factor ins before stop codonchr20 [77055, 77055] + | NA ins after stop codonchr20 [77058, 77058] + | NA REFALT QUAL FILTER DNAStringSet DNAStringSetList numeric character ins before stop codon T TG70 PASS ins after stop codon A AG70 PASS varAllele CDSLOCPROTEINLOC QUERYID DNAStringSet IRanges IntegerList integer ins before stop codon TG [468, 468] 156 1 ins after stop codon AG [471, 471] 157 2 TXID CDSID GENEID CONSEQUENCE character IntegerList characterfactor ins before stop codon 70477245938 frameshift ins after stop codon 70477245938 frameshift REFCODON VARCODON REFAA DNAStringSet DNAStringSet AAStringSet ins before stop codonAATAAT ins after stop codonTAATAA VARAA AAStringSet ins before stop codon ins after stop codon --- seqinfo: 1 sequence from hg19 genome PROTEINLOC, REFCODON, VARCODON, and CONSEQUENCE don't seem quite right to me. Could be that my hand-made vcf is missing some important data needeed by predictCoding() though... H. robert. On 3/16/15 9:00 PM, Kasper Daniel Hansen wrote: There would be a LOT of value in having core packages use exactly the same representation; I don't have any opinion about which one is best. Kasper On Mon, Mar 16, 2015 at 3:38 PM, Hervé Pagès hpa...@fredhutch.org mailto:hpa...@fredhutch.org mailto:hpa...@fredhutch.org mailto:hpa...@fredhutch.org wrote: On 03/16/2015 09:22
Re: [Bioc-devel] zero-width ranges representing insertions
Using a GAlignments object to show 3 different representations of the same insertion (three nucleotides (TTT) inserted between positions 10 and 11): library(GenomicAlignments) gal - GAlignments(Rle(chr1, 3), pos=c(10L, 10L, 11L), cigar=c(1M3I1M, 1M3I, 3I), strand=Rle(strand(+), 3), ref=c(-, A, ), alt=c(TTT, ATTT, TTT), who=c(dbSNP/HGVS, VariantAnnotation/VCF, XtraSNPlocs/UCSC)) gal GAlignments object with 3 alignments and 3 metadata columns: seqnames strand cigarqwidth start end width Rle Rle character integer integer integer integer [1] chr1 + 1M3I1M 51011 2 [2] chr1 +1M3I 41010 1 [3] chr1 + 3I 31110 0 njunc | ref alt who integer | character character character [1] 0 | - TTTdbSNP/HGVS [2] 0 | AATTT VariantAnnotation/VCF [3] 0 | TTT XtraSNPlocs/UCSC --- seqinfo: 1 sequence from an unspecified genome; no seqlengths In the context of a GAlignments object the alt allele can be seen as the sequence of the query (SEQ field in BAM). So we expect the number of nucleotides in the alt allele to match the qwidth: nchar(mcols(gal)$alt) == qwidth(gal) [1] FALSE TRUE TRUE Not true for the dbSNP/HGVS representation. The range of the insertion on the reference is: as(gal, GRanges) GRanges object with 3 ranges and 3 metadata columns: seqnamesranges strand | ref alt who Rle IRanges Rle | character character character [1] chr1 [10, 11] + | - TTT dbSNP/HGVS [2] chr1 [10, 10] + | AATTT VariantAnnotation/VCF [3] chr1 [11, 10] + | TTT XtraSNPlocs/UCSC --- seqinfo: 1 sequence from an unspecified genome; no seqlengths We expect its width to be the same as the nb of nucleotides in the ref alleles: width(gal) == nchar(mcols(gal)$ref) [1] FALSE TRUE TRUE Again, not true for the dbSNP/HGVS representation. H. On 03/16/2015 03:12 PM, Robert Castelo wrote: +1 IMO BioC could adopt the zero-width ranges representation for insertions, adapting readVcf(), writeVcf(), XtraSNPlocs.*, etc., to deal with each corresponding beast, be VCF, dbSNP or the like. Who knows, VCF could also change their representation in the future and it'll be a headache to update the affected packages if we decide to keep using its insertion representation internally to store variant ranges in BioC. robert. On 3/16/15 9:00 PM, Kasper Daniel Hansen wrote: There would be a LOT of value in having core packages use exactly the same representation; I don't have any opinion about which one is best. Kasper On Mon, Mar 16, 2015 at 3:38 PM, Hervé Pagès hpa...@fredhutch.org mailto:hpa...@fredhutch.org wrote: On 03/16/2015 09:22 AM, Michael Lawrence wrote: Yes, I think it would make sense for the Xtra package to follow the established convention in VariantAnnotation/VCF. I don't know. I agree it would be nice to use a more consistent representation of insertions across the software but I'm not convinced we should necessarily follow the VCF way, which is to include the base before the event in the ref and alt alleles as well as in the reported range. Note that there doesn't seem to be any consensus in the broader Bioinformatics community either. For example dbSNP and HGVS report the range that corresponds to the 2 flanking nucleotides but they don't include these nucleotides in the ref or alt alleles. VCF does the same as GFF3 which says start equals end and the implied site is to the right of the indicated base except that VCF wants to treat events that occur at position 1 in a special way. In that case VCF says the base *after* the event should be included (seems like the VCF authors want to avoid both: empty ranges and also ranges that start at POS 0). BTW it doesn't seem that VariantAnnotation::isInsertion() is aware of that special treatment. UCSC uses a zero-width range, and so does the XtraSNPlocs.* packages. The advantage of this representation is its simplicity. There is no special cases. It also reflects the notion that an insertion is a replacement of an empty string with a non-empty string. No need to include flanking nucleotides in the representation (which is artificial and distorts the real alt allele). H. On Mon, Mar 16, 2015 at 9:16 AM, Robert Castelo robert.cast...@upf.edu mailto:robert.cast...@upf.edu wrote: dear devel people, specially Val and Michael
Re: [Bioc-devel] zero-width ranges representing insertions
]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Submitting a new package
Hi Dan, On 03/13/2015 12:04 PM, Dan Tenenbaum wrote: - Original Message - From: Charles Determan Jr deter...@umn.edu To: Dan Tenenbaum dtene...@fredhutch.org Cc: Bioc-devel@r-project.org Sent: Friday, March 13, 2015 12:04:12 PM Subject: Re: [Bioc-devel] Submitting a new package Thanks Dan, could you confirm that the email statement ( packages NEAR bioconductor POINT org) means packa...@bioconductor.org ? I don't recognize the NEAR notation. Yes. Spammers are not supposed to recognize it either, which is why we didn't just say packa...@bioconductor.org. I think Charles means why NEAR when we normally use AT for this? Thanks, H. Dan Charles On Fri, Mar 13, 2015 at 1:59 PM, Dan Tenenbaum dtene...@fredhutch.org wrote: - Original Message - From: Charles Determan Jr deter...@umn.edu To: Bioc-devel@r-project.org Sent: Friday, March 13, 2015 11:57:24 AM Subject: [Bioc-devel] Submitting a new package I have finished a new package recently that I would like to submit to bioconductor. I have read through the package guidelines and I believe I understand the package submission process but embarrassingly I cannot figure out how to actually submit the package? Could someone kindly explain where I actually submit the package for review? See the very bottom of http://bioconductor.org/developers/package-submission/ Dan Regards, -- Dr. Charles Determan, PhD Integrated Biosciences [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Dr. Charles Determan, PhD Integrated Biosciences ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Package build error for windows system
Hi Liu, On 03/10/2015 10:08 AM, Liu, Yuanhang wrote: Hello, everyone, I recently submitted my first R package. Everything checks fine on my local machine. However, when I upload the tarball to the single package builder at bioconductor.org, it gave me the following error on windows system: * creating vignettes ...Warning: running command 'E:/packagebuilder/R/bin/x64/Rscript --vanilla --default-packages= -e tools::buildVignettes(dir = '.', tangle = TRUE)' had status 5 ERROR Error in withCallingHandlers(tryCatch(evalq((function (i) : object '.rcpp_warning_recorder' not found Don't worry about this one. The problem is probably on the build machine side (in the R it uses). Please ignore. The name of the windows build system is : moscato2 Windows Server 2008 R2 Enterprise SP1 (64-bit)/x64 What does this mean? I would really appreciate it if someone can help me with this. I also got a warning on one linux system (morelia Mac OS X Mavericks (10.9.5)/x86_64): * checking loading without being on the library search path ... WARNING Loading required package: IRanges Error: package BiocGenerics 0.13.4 is loaded, but = 0.13.6 is required by IRanges In addition: Warning message: version 0.13.6 of BiocGenerics masked by 0.13.4 in /private/tmp/Rtmp8nLYMq/RLIBS_9d027c2fae3 Execution halted mmh... again something doesn't look quite right on this build machine. Please ignore. Then I checked my version of BiocGenerics, which is 0.12.1, the latest version. This is the current release version. However remember that for testing before package submission you should be using BioC devel (BioC 3.1, which requires R 3.2), not BioC release (BioC 3.0). That's what the automatic single package builder uses. The latest version of BiocGenerics in devel is 0.13.6. Cheers, H. How come the system require 0.13.6? I also checked IRanges manual, which it says it depends on BiocGenerics(= 0.11.3). Would anybody tell me what does this mean? Thanks in advance! Regards Yuanhang Liu PhD Candidate Bioinformatics and Computational Biology laboratory Cellular Structural Biology, Department of Epidemiology Biostatistics UT Health Science Center at San Antonio email: liu...@uthscsa.educel: 210-784-0828 -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] CRAN package with Bioconductor dependencies
Hi Gordon, On 03/04/2015 02:12 PM, Gordon Brown wrote: Hi, Hi, Bioc-devel folks, Is there an accepted way to migrate a package from CRAN to BioC? CRAN's policy says, The package�s license must give the right for CRAN to distribute the package in perpetuity Is it enough to request that CRAN archive the package, then submit it as a new package to BioC? I own a CRAN package (msarc) that probably should have been in BioC from the start, but isn't. Suggestions? I can't speak to the CRAN policy side of things, but several packages have migrated from CRAN to Bioconductor. At a minimum, you need to make sure that the version in Bioconductor is higher than the version in CRAN, but you should also definitely ask CRAN to archive the package. The way Bioconductor works, your package will first be available only in the devel version, but will then become available in the release version after our next release. Currently we're scheduled to release 3.1 on April 17 and the deadline for package submissions for this release is March 27. Dan Thanks, Dan. I'll check with the CRAN folks to see what they say, and pass along anything significant regarding policy etc. It's not a crisis if there's a lag between getting it archived at CRAN and into the release version of Bioconductor. The CRAN folks will ask you to proceed the other way around. First submit your package to Bioconductor. Only after we release (in April), ask them to remove and archive your package. It makes sense since this ensures continuous availability of the package. Cheers, H. Cheers, - Gord ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Rd] [R] Why does R replace all row values with NAs
where you indexed using -which(x$c=6) is a bad idea: if none of the entries were 6 or more, this would be indexing with an empty vector, and you'd get nothing, not everything. Duncan Murdoch On Fri, Feb 27, 2015 at 9:13 AM, Duncan Murdoch murdoch.dun...@gmail.com wrote: On 27/02/2015 9:04 AM, Dimitri Liakhovitski wrote: I know how to get the output I need, but I would benefit from an explanation why R behaves the way it does. # I have a data frame x: x = data.frame(a=1:10,b=2:11,c=c(1,NA,3,NA,5,NA,7,NA,NA,10)) x # I want to toss rows in x that contain values =6. But I don't want to toss my NAs there. subset(x,c6) # Works correctly, but removes NAs in c, understand why x[which(x$c6),] # Works correctly, but removes NAs in c, understand why x[-which(x$c=6),] # output I need # Here is my question: why does the following line replace the values of all rows that contain an NA # in x$c with NAs? x[x$c6,] # Leaves rows with c=NA, but makes the whole row an NA. Why??? x[(x$c6) | is.na(x$c),] # output I need - I have to be super-explicit Thank you very much! Most of your examples (except the ones using which()) are doing logical indexing. In logical indexing, TRUE keeps a line, FALSE drops the line, and NA returns NA. Since x$c 6 is NA if x$c is NA, you get the third kind of indexing. Your last example works because in the cases where x$c is NA, it evaluates NA | TRUE, and that evaluates to TRUE. In the cases where x$c is not NA, you get x$c 6 | FALSE, and that's the same as x$c 6, which will be either TRUE or FALSE. Duncan Murdoch -- Dimitri Liakhovitski __ r-h...@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. [[alternative HTML version deleted]] __ r-h...@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [R] Why does R replace all row values with NAs
On 03/03/2015 02:17 PM, Gabriel Becker wrote: Stephanie, Actually, it's as.logical that isn't preserving matrix dimensions, because it coerces to a logical vector: x - matrix(sample(c(NA_integer_, 1:100), 500, replace=TRUE), nrow=50) dim(as.logical(x)) It's true, as.logical() doesn't help here but Stephanie is right, %in% does not preserve the dimensions either: dim(x %in% 1:5) NULL That's because match() itself doesn't preserve the dimensions: dim(match(x, 1:5)) NULL So maybe my fast is.true() should be: is.true - function(x) { ans - as.logical(x) %in% TRUE if (is.null(dim(x))) { names(ans) - names(x) } else { dim(ans) - dim(x) dimnames(ans) - dimnames(x) } ans } or something like that... H. NULL ~G On Tue, Mar 3, 2015 at 2:09 PM, Stephanie M. Gogarten sdmor...@u.washington.edu mailto:sdmor...@u.washington.edu wrote: On 3/3/15 1:26 PM, Hervé Pagès wrote: On 03/03/2015 02:28 AM, Martin Maechler wrote: Diverted from R-help : as it gets into musing about new R language primitives William Dunlap wdun...@tibco.com mailto:wdun...@tibco.com on Fri, 27 Feb 2015 08:04:36 -0800 writes: You could define functions like is.true - function(x) !is.na http://is.na(x) x is.false - function(x) !is.na http://is.na(x) !x and use them in your selections. E.g., x - data.frame(a=1:10,b=2:11,c=c(__1,NA,3,NA,5,NA,7,NA,NA,10)) x[is.true(x$c = 6), ] a b c 7 7 8 7 10 10 11 10 Bill Dunlap TIBCO Software wdunlap tibco.com http://tibco.com Yes; the Matrix package has had these is0 - function(x) !is.na http://is.na(x) x == 0 isN0 - function(x) is.na http://is.na(x) | x != 0 is1 - function(x) !is.na http://is.na(x) x # also == isTRUE componentwise Note that using %in% to block propagation of NAs is about 2x faster: x - sample(c(NA_integer_, 1:1), 50, replace=TRUE) microbenchmark(as.logical(x) %in% TRUE, !is.na http://is.na(x) x) Unit: milliseconds expr minlq mean medianuq as.logical(x) %in% TRUE 6.034744 6.264382 6.999083 6.29488 6.346028 !is.na http://is.na(x) x 11.202808 11.402437 11.469101 11.44848 11.517576 max neval 40.36472 100 tel:40.36472%20%20%20100 11.90916 100 Unfortunately %in% does not preserve matrix dimensions: x - matrix(sample(c(NA_integer_, 1:100), 500, replace=TRUE), nrow=50) dim(x) [1] 50 10 dim(!is.na http://is.na(x) x) [1] 50 10 dim(as.logical(x) %in% TRUE) NULL Stephanie namespace hidden for a while [note the comment of the last one!] and using them for readibility in its own code. Maybe we should (again) consider providing some versions of these with R ? The Matrix package also has had fast allFalse - all0 - function(x) .Call(R_all0, x) anyFalse - any0 - function(x) .Call(R_any0, x) ## ## anyFalse - function(x) isTRUE(any(!x)) ## ~= any0 ## any0 - function(x) isTRUE(any(x == 0)) ## ~= anyFalse namespace hidden as well, already, which probably could also be brought to base R. One big reason to *not* go there (to internal C code) at all with R is that S3 and S4 dispatch for '==' ('!=', etc, the 'Compare' group generics) and 'is.na http://is.na() have been known and package writers have programmed methods for these. To ensure that S3 and S4 dispatch works correctly also inside such new internals is much less easily achieved, and so such a C-based internal function is0() would no longer be equivalent with!is.na http://is.na(x) x == 0 as soon as 'x' is an object with a '==', 'Compare' and/or an is.na http://is.na() method. Excellent point. Thank you! It really makes a big difference for developers who maintain a complex hierarchy of S4 classes and methods, when functions like is.true, anyFalse, etc..., which can be expressed in terms of more basic operations like ==, !=, !, is.na http://is.na, etc..., just work out-of-the-box on objects for which these basic operations
Re: [Bioc-devel] Changes to the SummarizedExperiment Class
On 03/03/2015 03:06 PM, Peter Haverty wrote: I'd like to see a basic class that takes a DataFrame and a sub-class that takes a GRanges. Yes. I still think GRanges should be a subclass of DataFrame, which would make this easy, but I don't seem to be winning that argument. Just impossible. As Michael mentioned back in November, they have conflicting APIs. While the hood is up, can we try some different names? SummarizedExperiment never seemed like a great fit to me because it doesn't necessarily contain experiments or summaries thereof. It's a collection of like-sized rectangular things with metadata on the two dimensions. Maybe the name could reflect what it holds rather than a common use case? AnnotatedMatrixList? We actually need 2 names: 1 for the parent class, 1 for the child. I'm starting to think that introducing 2 new names would maybe make the migration a little bit easier, especially since the plan is to move the refactored SummarizedExperiment to its own package. With 2 new names we can start the new package, implement the 2 new classes in it, and have the old SummarizedExperiment (in GenomicRanges) and the 2 new classes peacefully cohabit during the time of the migration. Cheers, H. Anyway, I'm excited to see a version on the way that takes a DataFrame as rowData. I'm glad you guys are working on that. Regards, Pete Peter M. Haverty, Ph.D. Genentech, Inc. phave...@gene.com On Tue, Mar 3, 2015 at 2:57 PM, Michael Lawrence lawrence.mich...@gene.com wrote: Seems like rowData could be made to work universallly through coercion. rowRanges would not, however, and one would like a convenient mechanism to condition on whether range information is available. One way is to introduce a new class and rely on dispatch. But that adds complexity. On Tue, Mar 3, 2015 at 2:44 PM, Gabe Becker becker.g...@gene.com wrote: Jim et al., Why have two accessors (rowRanges, rowData), each of which are less flexible than the underlying structure and thus will fail (return NULL? or GRanges()/DataFrame() ?) in some proportion of valid objects? ~G On Tue, Mar 3, 2015 at 2:37 PM, Jim Hester james.f.hes...@gmail.com wrote: Motivated by the discussion thread from November ( https://stat.ethz.ch/ pipermail/bioc-devel/2014-November/006686.html) the Bioconductor core team is planning on making changes to the SummarizedExperiment class. Our end goal is to allow the @rowData slot to become more flexible and hold either a DataFrame or GRanges type object. To this end we have currently deprecated the current rowData accessor in favor of a rowRanges accessor. This change has resulted in a few broken builds in devel, which we are in the process of fixing now. We will contact any package authors directly if needed for this migration. The rowData accessor will be deprecated in this release, however eventually the plan is to re-purpose this function to serve as an accessor for DataFrame data on the rows. Please let us know if you have any questions with the above and if you need any assistance with the transition. [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Gabriel Becker, Ph.D Computational Biologist Genentech Research [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Rd] R-devel does not update the C++ returned variables
On 03/02/2015 01:00 PM, Hervé Pagès wrote: Hi, On 03/02/2015 12:18 PM, Dénes Tóth wrote: On 03/02/2015 04:37 PM, Martin Maechler wrote: On 2 March 2015 at 09:09, Duncan Murdoch wrote: | I generally recommend that people use Rcpp, which hides a lot of the | details. It will generate your .Call calls for you, and generate the | C++ code that receives them; you just need to think about the real | problem, not the interface. It has its own learning curve, but I think | it is easier than using the low-level code that you need to work with .Call. Thanks for that vote, and I second that. And these days the learning is a lot flatter than it was a decade ago: R Rcpp::cppFunction(NumericVector doubleThis(NumericVector x) { return(2*x); }) R doubleThis(c(1,2,3,21,-4)) [1] 2 4 6 42 -8 R That defined, compiled, loaded and run/illustrated a simple function. Dirk Indeed impressive, ... and it also works with integer vectors something also not 100% trivial when working with compiled code. When testing that, I've went a step further: ## now test: require(microbenchmark) i - 1:10 Note that the relative speed of the algorithms also depends on the size of the input vector. i + i becomes the winner for longer vectors (e.g. i - 1:1e6), but a proper Rcpp version is still approximately twice as fast. The difference in speed is probably due to the fact that R does safe arithmetic. C or C++ do not: doubleThisInt(i) [1] 2147483642 2147483644 2147483646 NA -2147483646 -2147483644 2L * i [1] 2147483642 2147483644 2147483646 NA NA NA Warning message: In 2L * i : NAs produced by integer overflow That was with i - as.integer(2^30-4) + 1:6 Cheers, H. H. Rcpp::cppFunction(NumericVector doubleThisNum(NumericVector x) { return(2*x); }) Rcpp::cppFunction(IntegerVector doubleThisInt(IntegerVector x) { return(2*x); }) i - 1:1e6 mb - microbenchmark::microbenchmark(doubleThisNum(i), doubleThisInt(i), i*2, 2*i, i*2L, 2L*i, i+i, times=100) plot(mb, log=y, notch=TRUE) (mb - microbenchmark(doubleThis(i), i*2, 2*i, i*2L, 2L*i, i+i, times=2^12)) ## Lynne (i7; FC 20), R Under development ... (2015-03-02 r67924): ## Unit: nanoseconds ## expr min lq mean median uq max neval cld ## doubleThis(i) 762 985 1319.5974 1124 1338 17831 4096 b ## i * 2 124 151 258.4419164 221 4 4096 a ## 2 * i 127 154 266.4707169 216 20213 4096 a ## i * 2L 143 164 250.6057181 234 16863 4096 a ## 2L * i 144 177 269.5015193 237 16119 4096 a ## i + i 152 183 272.6179199 243 10434 4096 a plot(mb, log=y, notch=TRUE) ## hmm, looks like even the simple arithm. differ slightly ... ## ## == zoom in: plot(mb, log=y, notch=TRUE, ylim = c(150,300)) dev.copy(png, file=mbenchm-doubling.png) dev.off() # [ - why do I need this here for png ??? ] ##-- see the appended *png graphic Those who've learnt EDA or otherwise about boxplot notches, will know that they provide somewhat informal but robust pairwise tests on approximate 5% level. From these, one *could* - possibly wrongly - conclude that 'i * 2' is significantly faster than both 'i * 2L' and also 'i + i' which I find astonishing, given that i is integer here... Probably no reason for deep thoughts here, but if someone is enticed, this maybe slightly interesting to read. Martin Maechler, ETH Zurich __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] R-devel does not update the C++ returned variables
Hi, On 03/02/2015 12:18 PM, Dénes Tóth wrote: On 03/02/2015 04:37 PM, Martin Maechler wrote: On 2 March 2015 at 09:09, Duncan Murdoch wrote: | I generally recommend that people use Rcpp, which hides a lot of the | details. It will generate your .Call calls for you, and generate the | C++ code that receives them; you just need to think about the real | problem, not the interface. It has its own learning curve, but I think | it is easier than using the low-level code that you need to work with .Call. Thanks for that vote, and I second that. And these days the learning is a lot flatter than it was a decade ago: R Rcpp::cppFunction(NumericVector doubleThis(NumericVector x) { return(2*x); }) R doubleThis(c(1,2,3,21,-4)) [1] 2 4 6 42 -8 R That defined, compiled, loaded and run/illustrated a simple function. Dirk Indeed impressive, ... and it also works with integer vectors something also not 100% trivial when working with compiled code. When testing that, I've went a step further: ## now test: require(microbenchmark) i - 1:10 Note that the relative speed of the algorithms also depends on the size of the input vector. i + i becomes the winner for longer vectors (e.g. i - 1:1e6), but a proper Rcpp version is still approximately twice as fast. The difference in speed is probably due to the fact that R does safe arithmetic. C or C++ do not: doubleThisInt(i) [1] 2147483642 2147483644 2147483646 NA -2147483646 -2147483644 2L * i [1] 2147483642 2147483644 2147483646 NA NA NA Warning message: In 2L * i : NAs produced by integer overflow H. Rcpp::cppFunction(NumericVector doubleThisNum(NumericVector x) { return(2*x); }) Rcpp::cppFunction(IntegerVector doubleThisInt(IntegerVector x) { return(2*x); }) i - 1:1e6 mb - microbenchmark::microbenchmark(doubleThisNum(i), doubleThisInt(i), i*2, 2*i, i*2L, 2L*i, i+i, times=100) plot(mb, log=y, notch=TRUE) (mb - microbenchmark(doubleThis(i), i*2, 2*i, i*2L, 2L*i, i+i, times=2^12)) ## Lynne (i7; FC 20), R Under development ... (2015-03-02 r67924): ## Unit: nanoseconds ## expr min lq mean median uq max neval cld ## doubleThis(i) 762 985 1319.5974 1124 1338 17831 4096 b ## i * 2 124 151 258.4419164 221 4 4096 a ## 2 * i 127 154 266.4707169 216 20213 4096 a ## i * 2L 143 164 250.6057181 234 16863 4096 a ## 2L * i 144 177 269.5015193 237 16119 4096 a ## i + i 152 183 272.6179199 243 10434 4096 a plot(mb, log=y, notch=TRUE) ## hmm, looks like even the simple arithm. differ slightly ... ## ## == zoom in: plot(mb, log=y, notch=TRUE, ylim = c(150,300)) dev.copy(png, file=mbenchm-doubling.png) dev.off() # [ - why do I need this here for png ??? ] ##-- see the appended *png graphic Those who've learnt EDA or otherwise about boxplot notches, will know that they provide somewhat informal but robust pairwise tests on approximate 5% level. From these, one *could* - possibly wrongly - conclude that 'i * 2' is significantly faster than both 'i * 2L' and also 'i + i' which I find astonishing, given that i is integer here... Probably no reason for deep thoughts here, but if someone is enticed, this maybe slightly interesting to read. Martin Maechler, ETH Zurich __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Bioc-devel] 100k SVN commits...
On 02/27/2015 03:28 PM, Martin Morgan wrote: I noticed that we're at r99955 in svn; who will be the lucky person making the 100,000th commit? depends how many zeroes after the one you're gonna put on the check... 5? ;-) H. Martin -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] BamTallyParam argument 'which'
FWIW, and after checking how pileup() handles this, I thought I should report: library(GenomicAlignments) bamfile - system.file(extdata, ex1.bam, package=Rsamtools) Then: stackStringsFromBam(bamfile, param=seq1:1-6) A DNAStringSet instance of length 4 width seq [1] 6 CACTAG [2] 6 ++CTAG [3] 6 AG [4] 6 +G which - GRanges(seq1, IRanges(c(1, 5, 1), c(2, 6, 3))) pileup(bamfile, scanBamParam=ScanBamParam(which=which)) seqnames pos strand nucleotide count which_label 1 seq1 1 + C 1seq1:1-2 2 seq1 2 + A 1seq1:1-2 3 seq1 5 + A 3seq1:5-6 4 seq1 6 + G 4seq1:5-6 5 seq1 1 + C 1seq1:1-3 6 seq1 2 + A 1seq1:1-3 7 seq1 3 + C 2seq1:1-3 This output looks clean to me: (a) Nucleotide positions that belong to more than 1 range in 'which' are treated as distinct positions, and (b) reads that overlap with 2 non-overlapping ranges (e.g. read #1 and ranges #1 and #2) only contribute at most once at each position covered by these ranges. H. On 02/25/2015 01:21 PM, Hervé Pagès wrote: Hi, On 02/25/2015 06:37 AM, Gabe Becker wrote: I think we need to be a little careful of trying to know the users intentions better than they do here. Reduce is a (very) easy operation to do on a GRanges, so if the user didn't, are we really safe assuming they meant to when passing the GRanges as a which? I would argue for the samtools way, not because samtools does it (though consistency is good) but because it allows the user to do more things, while not making it that painful to do the thing that they might want most often. I agree with Michael that an additional argument might be a good middle ground. Maybe another good middle ground is to issue a warning when 'which' contains overlapping ranges. The warning could suggest to the user to reduce() the ranges first. Maybe the warning should also point out that reducing doesn't completely eliminate the risk of selecting the same record more than once (at least that's the case for the readGAlignment* functions, but I don't know if it holds for BamTallyParam). The risk is actually higher with paired-end reads where the same pair is selected once for each range in 'which' that overlaps with at least one of the 2 mates in the pair. But as already discussed here (or maybe it was on the old bioconductor list, don't remember), better solutions to the duplicated selection problem could be achieved via one of the following: (1) Expose some sort of unique id for the records in a BAM file. I agree with Michael that duplicated selection is incompatible with summarization tools like BamTallyParam or pileup(). Having access to a unique id would completely solve the problem. (2) Introduce an argument similar to 'which' but with a slightly different semantic e.g. select records that *start* in one of the specified ranges (for single-end reads), or select pairs of records for which the *first* mate starts in one of the specified ranges. Advantages of this semantic: (a) If the ranges don't overlap, then no duplicates. (b) It can be used in the context of tiling the genome and processing the reads by tile. This plays well with parallelization (the semantic of 'which' does not). Inconvenient: the arbitrary nature of the selection criteria and its lack of symmetry is incompatible with summarization tools like BamTallyParam or pileup(). Note that (1) and (2) are not exclusive. Cheers, H. ~G On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres lcoll...@jhu.edu mailto:lcoll...@jhu.edu wrote: Related to my post on a separate thread (https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html), I think that if 'which' is not being reduced by default, a simple example showing the effects of this could be included in the functions that have such an argument. Also note that 'reducing' could lead to unintended results. For example, in the help page for GenomicAlignments::readGAlignments, after the 'gal4' example it would be nice to add something like this: ## Note that if overlapping ranges are provided in 'which' ## reads could be selected more than once. This would artificually ## increase the coverage or affect other downstream results. ## If you 'reduce' the ranges, reads that originally overlapped ## two disjoint segments will be included. which_dups - RangesList(seq1=rep(IRanges(1000, 2000), 2), seq2=IRanges(c(100, 1000), c(1000, 2000))) param_dups - ScanBamParam(which=which_dups) param_reduced - ScanBamParam(which=reduce(which_dups)) gal4_dups - readGAlignments(bamfile, param=param_dups) gal4_reduced
Re: [Bioc-devel] BamTallyParam argument 'which'
@r-project.org mailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Gabriel Becker, Ph.D Computational Biologist Genentech Research -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] BamTallyParam argument 'which'
On 02/23/2015 11:05 AM, Leonard Goldstein wrote: Hi Michael and Thomas, I ran into the same problem in the past (i.e. when I started working with functions like scanBam I expected them not to return the same alignment multiple times) One thing to consider might be that returning alignments multiple times is consistent with the behavior of the samtools view command. Quoting from the samtools manual: “Important note: when multiple regions are given, some alignments may be output multiple times if they overlap more than one of the specified regions.” Thanks Leonard for pointing this out. This is indeed the reason why all the functions in Rsamtools and GenomicAlignments that take a 'which' argument (via a ScanBamParam object) treat it the samtools way, that is, as a vector of meaningful loci. Most of them track the index of the individual loci via a which_label metadata column. See for example Rsamtools::pileup() and all the readGAlignment*() functions in the GenomicAlignments package. FWIW the man page for the readGAlignment*() functions contains the following note: Note that a given record is loaded one time for each region it belongs to (this is a scanBam() feature, readGAlignments() is based on scanBam()). but maybe this should be emphasized a little bit more. Cheers, H. Maybe there is an argument for keeping things consistent with samtools? As you said, if documented properly, the user can decide whether to reduce regions specified in which or not. Leonard On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence lawrence.mich...@gene.com wrote: We should at leaast try to avoid surprising the user. Seems like most people expect which to be a simple restriction, so I think for now I will just reduce the which, and if someone has a use case for separate queries, we can address it in the future. On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann sandmann.tho...@gene.com wrote: Personally, I don't have a use case with meaningful loci worth tracking, so keeping it simple would work for me. Incidentally, would it be good to deal with the 'which' parameter in a consistent way across different methods ? I just saw this recent post on the mailing list in which a used got confused by duplicate counts returned after passing 'which' to scanBamParam: https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html --- Thomas Sandmann, PhD Computational biologist Genentech, Inc. 1 DNA Way South San Francisco, CA 94080 USA Phone: +1 650 225 6273 Fax: +1 650 225 5389 Email: sandmann.tho...@gene.com If a man will begin with certainties, he shall end in doubts; but if he will be content to begin with doubts he shall end in certainties. -- Sir Francis Bacon On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence lawrence.mich...@gene.com wrote: We just have to decide which is the more useful interpretation of which -- as a simple restriction, or as a vector of meaningful locii, which will be analyzed individually? I would actually favor the first one (the same as yours), just because it's simpler. To keep track of the query ranges, we would need to add a new column to the returned object, which will more often than not just be clutter. I guess we could introduce a new parameter, reduceWhich which defaults to TRUE and reduces the which. If FALSE, it instead adds the column mapping back to the original which ranges. On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann sandmann.tho...@gene.com wrote: Hi Michael, ah, I see. I hadn't realized that returning the pileups separately for each region could be a desired feature, but that makes sense. I agree, as it is easy for the user to 'reduce' the ranges beforehand your first option (e.g. returning the ID of the range) would be more flexible. Perhaps you would consider adding a sentence to the documentation of 'which' on BamTallyParam's help page explaining that users might want to 'reduce' their ranges beforehand if they are only interested in a single tally for each base ? Thanks a lot ! Thomas [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Fwd: readGAlignmentsFromBam function
Hi Karolis, Please use the bioc-devel mailing list (to which you should be subscribed) to ask this type of question. If you manually run the example in your package (using BioC devel, which requires R-3.2), you'll get: prebs_values - calc_prebs(bam_files, manufacturer_cdf_mapping) Inferred name for CDF package: HGU133Plus2_mapping.txt - hgu133plus2cdf [1] Finished: /home/hpages/R/R-3.2.r67773/library/prebsdata/sample_bam_files/input1.bam [1] Finished: /home/hpages/R/R-3.2.r67773/library/prebsdata/sample_bam_files/input2.bam Note: Some probe IDs contain duplicates. Estimated values for Bayesian prior: Alpha=1e-04 Beta=0.0136640101662362 Note: 41457 probe sequences are missing in _mapping.txt file. Normalizing Calculating Expression Warning messages: 1: 'readGAlignmentsFromBam' is deprecated. Use 'readGAlignments' instead. See help(Deprecated) 2: 'readGAlignmentsFromBam' is deprecated. Use 'readGAlignments' instead. See help(Deprecated) Hope this helps. Too bad 'R CMD check' truncates the deprecation message :-/ Cheers, H. On 02/19/2015 01:15 PM, Michael Lawrence wrote: -- Forwarded message -- From: Karolis Uziela karolis.uzi...@scilifelab.se Date: Feb 19, 2015 9:22 AM Subject: readGAlignmentsFromBam function To: Michael Lawrence lawrence.mich...@gene.com Cc: bioconduc...@r-project.org bioconduc...@r-project.org Hi, I have looked at R check results for the devel version of Bioconductor and I noticed that the package I developed (prebs) gives this warning: Warning: 'readGAlignmentsFromBam' is deprecated. What is the new function that should be used instead of readGAlignmentsFromBam (GenomicAlignments package)? Do the changes affect readGAlignmentPairs function, too? Regards, Karolis -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Masked version of BSgenome.Hsapiens.NCBI.GRCh38?
Hi Ulrich, I was not sure about how much demand there is for the masked BSgenome packages in general so I was just waiting for someone to ask. Note that the masks are typically generated from data available at UCSC so it sounds that it's time to make BSgenome.Hsapiens.NCBI.hg38 and BSgenome.Hsapiens.NCBI.hg38.masked available. I'll prepare the 2 packages in the next couple of weeks and post back here when they are ready for download. Cheers, H. On 02/06/2015 06:13 AM, Ulrich Bodenhofer wrote: Hi, The latest human genome build GRCh38 has been around in Bioconductor for some while (package BSgenome.Hsapiens.NCBI.GRCh38). As far as I can tell, however, there is currently no package that provides easy access to masked/unmasked regions in the genome (like there is a .masked version that wraps BSgenome objects into MaskedBSgenome objects for many other genomes, e.g. there is BSgenome.Hsapiens.UCSC.hg19.masked for hg19). Here is my question: does anybody have plans to include a package BSgenome.Hsapiens.NCBI.GRCh38.masked (or under a different name) into Bioconductor 3.1? At least I could not find anything in the current development branch. Thanks and best regards, Ulrich *Dr. Ulrich Bodenhofer* Associate Professor Institute of Bioinformatics *Johannes Kepler University* Altenberger Str. 69 4040 Linz, Austria Tel. +43 732 2468 4526 Fax +43 732 2468 4539 bodenho...@bioinf.jku.at mailto:bodenho...@bioinf.jku.at http://www.bioinf.jku.at/ http://www.bioinf.jku.at ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Rd] :: and ::: as .Primitives?
Hi, On 01/23/2015 07:01 AM, luke-tier...@uiowa.edu wrote: On Thu, 22 Jan 2015, Michael Lawrence wrote: On Thu, Jan 22, 2015 at 11:44 AM, luke-tier...@uiowa.edu wrote: For default methods there ought to be a way to create those so the default method is computed at creation or load time and stored in an environment. We had considered that, but we thought the definition of the function would be easier to interpret if it explicitly specified the namespace, instead of using tricks with environments. The same applies for memoizing the lookup in front of a loop. interpret in what sense (human reader or R interpreter)? In either case I'm not convinced. From a developer perspective, especially when debugging, when we do selectMethod(match, ...) and it turns out that this returns the default method, it's good to see: Method Definition (Class derivedDefaultMethod): function (x, table, nomatch = NA_integer_, incomparables = NULL, ...) base::match(x, table, nomatch = nomatch, incomparables = incomparables, ...) environment: namespace:BiocGenerics Signatures: x table target DataFrame ANY defined ANY ANY rather than some obscure/uninformative body. I hope we can keep that. The implementation of these functions is almost simpler in C than it is in R, so there is relatively little risk to this change. But I agree the benefits are also somewhat minor. I don't disagree, but it remains that even calling the C version has costs that should not need to be paid. But maybe we can leave that to the compiler/byte code engine. Optimizing references to symbols resolved statically to name spaces and imports is on the to do list, and with a little care that mechanism should work for foo::bar uses as well. That would be great. Thanks! H. Best, luke For other cases if I want to use foo::bar many times, say in a loop, I would do foo_bar - foo::bar and use foo_bar, or something along those lines. When :: and ::: were introduce they were intended primarily for reflection and debugging, so speed was not an issue. ::: is still really only reliably usable that way, and making it faster may just encourage bad practice. :: is different and there are good arguments for using it in code, but I'm not yet seeing good arguments for use in ways that would be performance-critical, but I'm happy to be convinced otherwise. If there is a need for a faster :: then going to a SPECIALSXP is fine; it would also be good to make the byte code compiler aware of it, and possibly to work on ways to improve the performance further e.g. through cacheing. Best, luke On Thu, 22 Jan 2015, Peter Haverty wrote: Hi all, When S4 methods are defined on base function (say, match), the function becomes a method with the body base::match(x,y). A call to such a function often spends more time doing :: than in the function itself. I always assumed that :: was a very low-level thing, but it turns out to be a plain old function defined in base/R/namespace.R. What would you all think about making :: and ::: .Primitives? I have submitted some examples, timings, and a patch to the R bug tracker (https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=16134). I'd be very interested to hear your thoughts on the matter. Regards, Pete Peter M. Haverty, Ph.D. Genentech, Inc. phave...@gene.com __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel -- Luke Tierney Ralph E. Wareham Professor of Mathematical Sciences University of Iowa Phone: 319-335-3386 Department of Statistics andFax: 319-335-3017 Actuarial Science 241 Schaeffer Hall email: luke-tier...@uiowa.edu Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Bioc-devel] IRanges findOverlaps Result Different for Recent Update
Hi guys, Indeed, the Hits object returned by findOverlaps() is not fully sorted anymore. Now it's sorted by query hit *only* and not by query hit *and* subject hit. Fully sorting a big Hits object has a high cost, both in terms of time and memory footprint. The partial sorting is *much* cheaper: it's done using a tabulated sorting algo implemented in C that works in linear time. The partial sorting is important: it allows a very common transformation like as(hits, List) to be super fast. But the full sorting was overkill and generally not needed. Also note that the full sorting was never enforced via the validity method for Hits objects (and t(hits) was breaking that order in BioC 3.1). Now the validity method for Hits enforces the partial sorting and t(hits) preserves it. There were only 3 or 4 packages that broke in devel because of that change (typically the change broke their unit tests). I fixed them (except Repitools, but it's still on my list). The fix is easy: if having the hits fully sorted matters, just use sort() on the Hits object. The man page for ?findOverlaps will soon be updated to reflect these changes. Cheers, H. On 01/15/2015 06:42 AM, Kasper Daniel Hansen wrote: Has it ever been documented that the return object is sorted in a specific way? I just want to make sure we think about whether that is something we want to enforce giving the possibility of using a different algorithm in the future. We could also address this by implementing (perhaps it already exists) a sort() method for the return object. That would still break existing code though. Best, Kasper On Wed, Jan 14, 2015 at 11:13 PM, Michael Lawrence lawrence.mich...@gene.com wrote: I bet there is a lot of code that depends on having the hits (conveniently) ordered by query,subject index, so we should try to restore the previous behavior. On Wed, Jan 14, 2015 at 8:00 PM, Dario Strbenac dstr7...@uni.sydney.edu.au wrote: Hello, For an identical query, the matrix results are in a different order. Consider the subject hits of the last two rows : mapping# R Under development (unstable) (2015-01-13 r67453) and IRanges 2.1.35 queryHits subjectHits [1,] 1 1 [2,] 1 4 [3,] 2 2 [4,] 4 1 [5,] 4 4 [6,] 6 7 [7,] 6 6 mapping# R Under development (unstable) (2015-01-13 r67453) and IRanges 2.0.1 queryHits subjectHits [1,] 1 1 [2,] 1 4 [3,] 2 2 [4,] 4 1 [5,] 4 4 [6,] 6 6 [7,] 6 7 This causes some values to be extracted in a different order by our annotationLookup function, and causes an error for the development version of Repitools on a test case which uses all.equal to compare a list to a correct list, but not for the release version which uses the release version of IRanges. Should I update the test case to have a new expected result, or is this new characteristic of findOverlaps likely to revert to the previous output soon ? The two sets of intervals to produce this result are anno and probesGR, defined in the tests.R file in the Repitools package. -- Dario Strbenac PhD Student University of Sydney Camperdown NSW 2050 Australia ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] IRanges findOverlaps Result Different for Recent Update
Hi Michael, On 01/15/2015 11:59 AM, Michael Lawrence wrote: My concern is mostly in user code not seen in Bioc svn. I understand but the fate of that code is to get out of sync sooner or later. And sooner rather than later if it relies on undocumented behavior. But perhaps the partial sorting (by query) is sufficient for many of those. It seems to be sufficient for more than 99.5% of the packages in BioC svn :-) Note that keeping Hits objects partially sorted instead of fully sorted not only speeds up findOverlaps() but also basic operations on Hits objects like union(), t(), etc... Since we are on it, I should also mention that new in BioC 3.1 is a Hits() constructor function which takes care of partially sorting the hits, selectHits() for selecting hits in the same way the 'select' arg of findOverlaps() does, and all the comparison operations (==, =, order, sort, rank, etc..., see ?`Hits-comparison` in S4Vectors). Cheers, H. On Thu, Jan 15, 2015 at 11:34 AM, Hervé Pagès hpa...@fredhutch.org mailto:hpa...@fredhutch.org wrote: Hi guys, Indeed, the Hits object returned by findOverlaps() is not fully sorted anymore. Now it's sorted by query hit *only* and not by query hit *and* subject hit. Fully sorting a big Hits object has a high cost, both in terms of time and memory footprint. The partial sorting is *much* cheaper: it's done using a tabulated sorting algo implemented in C that works in linear time. The partial sorting is important: it allows a very common transformation like as(hits, List) to be super fast. But the full sorting was overkill and generally not needed. Also note that the full sorting was never enforced via the validity method for Hits objects (and t(hits) was breaking that order in BioC 3.1). Now the validity method for Hits enforces the partial sorting and t(hits) preserves it. There were only 3 or 4 packages that broke in devel because of that change (typically the change broke their unit tests). I fixed them (except Repitools, but it's still on my list). The fix is easy: if having the hits fully sorted matters, just use sort() on the Hits object. The man page for ?findOverlaps will soon be updated to reflect these changes. Cheers, H. On 01/15/2015 06:42 AM, Kasper Daniel Hansen wrote: Has it ever been documented that the return object is sorted in a specific way? I just want to make sure we think about whether that is something we want to enforce giving the possibility of using a different algorithm in the future. We could also address this by implementing (perhaps it already exists) a sort() method for the return object. That would still break existing code though. Best, Kasper On Wed, Jan 14, 2015 at 11:13 PM, Michael Lawrence lawrence.mich...@gene.com mailto:lawrence.mich...@gene.com wrote: I bet there is a lot of code that depends on having the hits (conveniently) ordered by query,subject index, so we should try to restore the previous behavior. On Wed, Jan 14, 2015 at 8:00 PM, Dario Strbenac dstr7...@uni.sydney.edu.au mailto:dstr7...@uni.sydney.edu.au wrote: Hello, For an identical query, the matrix results are in a different order. Consider the subject hits of the last two rows : mapping# R Under development (unstable) (2015-01-13 r67453) and IRanges 2.1.35 queryHits subjectHits [1,] 1 1 [2,] 1 4 [3,] 2 2 [4,] 4 1 [5,] 4 4 [6,] 6 7 [7,] 6 6 mapping# R Under development (unstable) (2015-01-13 r67453) and IRanges 2.0.1 queryHits subjectHits [1,] 1 1 [2,] 1 4 [3,] 2 2 [4,] 4 1 [5,] 4 4 [6,] 6 6 [7,] 6 7 This causes some values to be extracted in a different order by our annotationLookup function, and causes an error for the development version of Repitools on a test case which uses all.equal to compare a list to a correct list, but not for the release version which uses the release
Re: [Rd] RFC: getifexists() {was [Bug 16065] exists ...}
Hi, On 01/08/2015 07:02 AM, Martin Maechler wrote: Adding an optional argument to get (and mget) like val - get(name, where, ..., value.if.not.found=NULL ) (*) would be useful for many. HOWEVER, it is possible that there could be some confusion here: (*) can give a NULL because either x exists and has value NULL, or because x doesn't exist. If that matters, the user would need to be careful about specifying a value.if.not.found that cannot be confused with a valid value of x. Exactly -- well, of course: That problem { NULL can be the legit value of what you want to get() } was the only reason to have a 'value.if.not' argument at all. Note that this is not about a universal replacement of the if(exists(..)) { .. get(..) } idiom, FWIW, if(exists(..)) { x - get(..) } is not safe because it's not atomic. I've seen situations where exists() returns TRUE but then get() fails to find the symbol (even if called immediately after exists()). After scratching my head for a while I found out that the symbol was removed by some finalizer function defined somewhere (not on the environment exists() and gets() were looking at, of course). And since garbage collection is triggered between the moment exists() sees the symbol and get() tries to get it, the finalizer was executed and the symbol removed. After that, I started to systematically use x - try(get(...)) instead (which is atomic). Cheers, H. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] setequal: better readability, reduced memory footprint, and minor speedup
On 01/08/2015 01:30 PM, peter dalgaard wrote: If you look at the definition of %in%, you'll find that it is implemented using match, so if we did as you suggest, I give it about three days before someone suggests to inline the function call... But you wouldn't bet money on that right? Because you know you would loose. Readability of source code is not usually our prime concern. Don't sacrifice readability if you do not have a good reason for it. What's your reason here? Are you seriously suggesting that inlining makes a significant difference? As Michael pointed out, the expensive operation here is the hashing. But sadly some people like inlining and want to use it everywhere: it's easy and they feel good about it, even if it hurts readability and maintainability (if you use x %in% y instead of the inlined version, the day someone changes the implementation of x %in% y for something faster, or fixes a bug in it, your code will automatically benefit, right now it won't). More simply put: good readability generally leads to better code. The idea does have some merit, though. Apropos, why is there no setcontains()? Wait... shouldn't everybody use all(match(x, y, nomatch = 0L) 0L) ? H. -pd On 06 Jan 2015, at 22:02 , Hervé Pagès hpa...@fredhutch.org wrote: Hi, Current implementation: setequal - function (x, y) { x - as.vector(x) y - as.vector(y) all(c(match(x, y, 0L) 0L, match(y, x, 0L) 0L)) } First what about replacing 'match(x, y, 0L) 0L' and 'match(y, x, 0L) 0L' with 'x %in% y' and 'y %in% x', respectively. They're strictly equivalent but the latter form is a lot more readable than the former (isn't this the raison d'être of %in%?): setequal - function (x, y) { x - as.vector(x) y - as.vector(y) all(c(x %in% y, y %in% x)) } Furthermore, replacing 'all(c(x %in% y, y %in x))' with 'all(x %in% y) all(y %in% x)' improves readability even more and, more importantly, reduces memory footprint significantly on big vectors (e.g. by 15% on integer vectors with 15M elements): setequal - function (x, y) { x - as.vector(x) y - as.vector(y) all(x %in% y) all(y %in% x) } It also seems to speed up things a little bit (not in a significant way though). Cheers, H. -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] setequal: better readability, reduced memory footprint, and minor speedup
Hi, Current implementation: setequal - function (x, y) { x - as.vector(x) y - as.vector(y) all(c(match(x, y, 0L) 0L, match(y, x, 0L) 0L)) } First what about replacing 'match(x, y, 0L) 0L' and 'match(y, x, 0L) 0L' with 'x %in% y' and 'y %in% x', respectively. They're strictly equivalent but the latter form is a lot more readable than the former (isn't this the raison d'être of %in%?): setequal - function (x, y) { x - as.vector(x) y - as.vector(y) all(c(x %in% y, y %in% x)) } Furthermore, replacing 'all(c(x %in% y, y %in x))' with 'all(x %in% y) all(y %in% x)' improves readability even more and, more importantly, reduces memory footprint significantly on big vectors (e.g. by 15% on integer vectors with 15M elements): setequal - function (x, y) { x - as.vector(x) y - as.vector(y) all(x %in% y) all(y %in% x) } It also seems to speed up things a little bit (not in a significant way though). Cheers, H. -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Bioc-devel] Use Imports instead of Depends in the DESCRIPTION files of bioconductor packages
Hi Gordon, My guess is that it has to do with how many symbols get exported. For example on my machine, doing library(limma) in a fresh session takes 0.261s and triggers export of 292 symbols (as reported by ls(..., all.names=TRUE)). Doing library(GenomicRanges) in a fresh session takes 2.724s and triggers export of 1581 symbols (counting the symbols exported by all the packages that get loaded). Michael it's great to hear that somebody is working on speeding up the code in charge of this. Happy New Year everybody! H. On 12/31/2014 06:07 PM, Gordon K Smyth wrote: Hi Michael, What aspect of the methods package causes the slowness? There are many packages (limma for one) that depend on methods but load quickly. Regards Gordon Date: Wed, 31 Dec 2014 09:17:01 -0800 From: Michael Lawrence lawrence.mich...@gene.com To: Peng Yu pengyu...@gmail.com Cc: Bioconductor Package Maintainer maintai...@bioconductor.org, bioc-devel@r-project.org bioc-devel@r-project.org Subject: Re: [Bioc-devel] [devteam-bioc] Use Imports instead of Depends in the DESCRIPTION files of bioconductor packages. The slowness is due to the methods package. We're working on it. Michael On Wed, Dec 31, 2014 at 8:47 AM, Peng Yu pengyu...@gmail.com wrote: On Wed, Dec 31, 2014 at 9:41 AM, Martin Morgan mtmor...@fredhutch.org wrote: On 12/24/2014 07:31 PM, Maintainer wrote: Hi, Many bioconductor packages Depends on other packages but not Imports other packages. (e.g., IRanges Depends on BiocGenerics.) Imports is usually preferred to Depends. http://stackoverflow.com/questions/8637993/better-explanation-of-when-to-use-imports-depends http://obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/ Could the unnecessary Depends be forced to be replaced by Imports? This should improve the package load time significantly. R package symbols and other objects are collated at build time into a 'name space'. When used, - Import: loads the name space from disk. - Depends: loads the name space from disk, and attaches it to the search() path. Attaching is very inexpensive compared to loading, so there is no speed improvement gained by Import'ing instead of Depend'ing. Yes. For example, changing Depends to Imports does not improve the package load time much. But loading a package in 4 sec seems to be too long. system.time(suppressPackageStartupMessages(library(MBASED))) user system elapsed 4.404 0.100 4.553 For example, it only takes 10% of the time to load ggplot2. It seems that many bioconductor packages have similar problems. system.time(suppressPackageStartupMessages(library(ggplot2))) user system elapsed 0.394 0.036 0.460 The main reason to Depend: on a package is because the symbols defined by the package are needed by the end-user. Import'ing a package is appropriate when the package provides functionality only relevant to the package author. What causes the load time to be too long? Is it because exporting too many functions from all dependent packages to the global namespace? There are likely to be specific packages that mis-use Depends; packages such as IRanges, GenomicRanges, etc use Depends: as intended, to provide functions that are useful to the end user. Maintainers are certainly encouraged to think carefully about adding packages providing functionality irrelevant to the end-user to the Depends: field. The codetoolsBioC package (available from svn, see http://bioconductor.org/developers/how-to/source-control/) provides some mostly reliable hints to package authors about correctly formulating a NAMESPACE file to facilitate using Imports: instead of Depends:. General questions about Bioconductor packages should be addressed to the support forum https://support.bioconductor.org. Questions about Bioconductor development (such as this) should be addressed to the bioc-devel mailing list (subscription required) https://stat.ethz.ch/mailman/listinfo/bioc-devel. I have cc'd the bioc-devel mailing list; I hope that is ok. -- Regards, Peng __ The information in this email is confidential and inte...{{dropped:20}} ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] bamFlagAsBitMatrix error (?bug)
Hi, On 12/14/2014 01:29 PM, Valerie Obenchain wrote: This should be fixed now (thanks Martin). We recently deprecated the flag 'isNotPrimaryRead' in favor of 'isNotPrimaryAlignment' (Rsamtools devel). The new name is actually isSecondaryAlignment. H. This affected GenomicAlignments and others ... ~/b/Rpacks$ grep -lr isNotPrimaryRead * CoverageView/R/cov.matrix.R CoverageView/R/cov.interval.R gage/vignettes/RNA-seqWorkflow.Rnw hiReadsProcessor/R/hiReadsProcessor.R QDNAseq/man/binReadCounts.Rd QDNAseq/R/binReadCounts.R Rsamtools/NEWS Rsamtools/man/ScanBamParam-class.Rd Rsamtools/R/methods-ScanBamParam.R SplicingGraphs/inst/unitTests/test_countReads-methods.R SplicingGraphs/inst/scripts/TSPC-utils.R SplicingGraphs/vignettes/SplicingGraphs.Rnw SplicingGraphs/man/assignReads.Rd SplicingGraphs/man/txpath-methods.Rd systemPipeR/R/utilities.R These packages should be clear (of this error) on tomorrow's builds. Thanks. Val On 12/14/2014 09:15 AM, Valerie Obenchain wrote: Hi Sean, Yes, we are aware of the problem, thanks. Hopefully this will be resolved for tomorrows builds ... we'll post back when it's fixed. Val On 12/14/2014 08:33 AM, Sean Davis wrote: Hi, Martin, Val, and Herve. This looks like a little problem with the bitnames in Rsamtools/GenomicAlignments. Perhaps this is related to some bitnames being deprecated? Thanks, Sean library(GenomicAlignments) sbp = ScanBamParam(which=cds[1:100],flag=scanBamFlag(isDuplicate = FALSE)) x = readGAlignmentPairs(LOCALBAMS[180],param=sbp) Error in bamFlagAsBitMatrix(flag1, bitnames = isNotPrimaryRead) : invalid bitname(s): isNotPrimaryRead traceback() 8: stop(invalid bitname(s): , in1string) 7: bamFlagAsBitMatrix(flag1, bitnames = isNotPrimaryRead) 6: .make_GAlignmentPairs_from_GAlignments(gal, use.mcols = use.mcols) 5: readGAlignmentPairsFromBam(bam, character(), use.names = use.names, param = param, with.which_label = with.which_label) 4: readGAlignmentPairsFromBam(bam, character(), use.names = use.names, param = param, with.which_label = with.which_label) 3: readGAlignmentPairsFromBam(file = file, use.names = use.names, ...) 2: readGAlignmentPairsFromBam(file = file, use.names = use.names, ...) 1: readGAlignmentPairs(LOCALBAMS[180], param = sbp) sessionInfo() R Under development (unstable) (2014-11-18 r66997) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] C attached base packages: [1] stats4parallel stats graphics grDevices datasets utils methods base other attached packages: [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.0.0 GenomicFeatures_1.19.7 AnnotationDbi_1.29.11 [4] GenomicAlignments_1.3.14stringr_0.6.2 plotrix_3.5-10 [7] pd.huex.1.0.st.v2_3.10.0RSQLite_1.0.0 DBI_0.3.1 [10] limma_3.23.2oligo_1.31.0 Biobase_2.27.0 [13] oligoClasses_1.29.3 knitr_1.8 VariantAnnotation_1.13.19 [16] Rsamtools_1.19.15 Biostrings_2.35.7 XVector_0.7.3 [19] GenomicRanges_1.19.21 GenomeInfoDb_1.3.7 IRanges_2.1.33 [22] S4Vectors_0.5.14BiocGenerics_0.13.3 BiocInstaller_1.17.1 loaded via a namespace (and not attached): [1] BBmisc_1.8BSgenome_1.35.8 BatchJobs_1.5 BiocParallel_1.1.9RCurl_1.95-4.5XML_3.98-1.1 [7] affxparser_1.39.3-1 affyio_1.35.0 base64enc_0.1-2 biomaRt_2.23.5bit_1.1-12bitops_1.0-6 [13] brew_1.0-6checkmate_1.5.1 codetools_0.2-9 digest_0.6.6 evaluate_0.5.5fail_1.2 [19] ff_2.2-13 foreach_1.4.2 formatR_1.0 htmltools_0.2.6 iterators_1.0.7 knitrBootstrap_1.0.0 [25] markdown_0.7.4preprocessCore_1.29.0 rmarkdown_0.3.3 rtracklayer_1.27.6sendmailR_1.2-1 splines_3.2.0 [31] tools_3.2.0 yaml_2.1.13 zlibbioc_1.13.0 [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Error when reading BAM files with readGAlignmentPairs
Hi Leonard, You can also see this error on the build report: http://bioconductor.org/checkResults/3.1/bioc-LATEST/GenomicAlignments/zin2-buildsrc.html This is a known issue that is the consequence of a recent change in Rsamtools. We'll sort it out. Thanks for your patience, H. On 12/12/2014 12:51 PM, Leonard Goldstein wrote: Dear Bioc developers, There seems to be an issue with GenomicAlignments or dependent packages as I can no longer read BAM files with readGAlignmentPairs that could be read without problems previously (see example below). I assume this is a general problem but can provide test files if necessary. Leonard readGAlignmentPairs(file) Error in .Call2(Integer_explode_bits, x, bitpos, PACKAGE = S4Vectors) : 'bitpos' must contain values = 1 sessionInfo() R Under development (unstable) (2014-11-04 r66932) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] GenomicAlignments_1.3.14 Rsamtools_1.19.14Biostrings_2.35.6 [4] XVector_0.7.3GenomicRanges_1.19.21GenomeInfoDb_1.3.7 [7] IRanges_2.1.32 S4Vectors_0.5.14 BiocGenerics_0.13.3 loaded via a namespace (and not attached): [1] base64enc_0.1-2BatchJobs_1.5 BBmisc_1.8 BiocParallel_1.1.9 [5] bitops_1.0-6 brew_1.0-6 checkmate_1.5.0codetools_0.2-9 [9] DBI_0.3.1 digest_0.6.6 fail_1.2 foreach_1.4.2 [13] iterators_1.0.7RSQLite_1.0.0 sendmailR_1.2-1stringr_0.6.2 [17] tools_3.2.0zlibbioc_1.13.0 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] SummarizedExperiment vs ExpressionSet
/bioc-devel -- Laurent Gatto http://cpu.sysbiol.cam.ac.uk/ ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Can't seem to use useDevel(); biocLite() for BioC 3.1
Hi, On 11/02/2014 10:58 PM, Dan Tenenbaum wrote: - Original Message - From: Leonardo Collado Torres lcoll...@jhu.edu To: Kasper Daniel Hansen kasperdanielhan...@gmail.com Cc: bioc-devel@r-project.org Sent: Sunday, November 2, 2014 9:26:25 PM Subject: Re: [Bioc-devel] Can't seem to use useDevel(); biocLite() for BioC 3.1 Thanks, I incorrectly thought that I was using the same R version as the Bioc-build machines since the r66923 part matched. Now I realize that this tag is used in all of the newest R builds at http://r.research.att.com/. It's admittedly confusing; Agreed. But the error message issued by useDevel() in the current release adds to the confusion: library(BiocInstaller) Bioconductor version 3.0 (BiocInstaller 1.16.0), ?biocLite for help useDevel() Error: 'devel' version already in use What about something like: Error: you need R 3.2 to run BioC 'devel' version Thanks, H. the question of whether to use R-devel with Bioc-devel has a different answer every six months, but the answer can always be found here: http://www.bioconductor.org/developers/how-to/useDevel/ Also, when looking at the devel build report, you should probably focus more on the version portion of the R version column than on the SVN revision number. Dan On Sun, Nov 2, 2014 at 9:57 PM, Kasper Daniel Hansen kasperdanielhan...@gmail.com wrote: You should not be using R-3.1.2 patched with the current devel version of Bioconductor; use R-devel. Best, Kasper On Sun, Nov 2, 2014 at 9:32 PM, Leonardo Collado Torres lcoll...@jhu.edu wrote: Hi, I can't seem to install devel packages via biocLite() and I wonder if something is broken or if I'm missing something. For example, take GenomeInfoDb which is at 1.3.6 and is passing all checks. The usual code using a fresh R 3.1.2-patched install isn't working as it downloads the latest release version (1.2.2): source(http://bioconductor.org/biocLite.R;) Bioconductor version 3.0 (BiocInstaller 1.16.0), ?biocLite for help ## Was expecting 3.1 here useDevel() Error: 'devel' version already in use biocLite('GenomeInfoDb') BioC_mirror: http://bioconductor.org Using Bioconductor version 3.0 (BiocInstaller 1.16.0), R version 3.1.2. Installing package(s) 'GenomeInfoDb' trying URL 'http://bioconductor.org/packages/3.0/bioc/bin/macosx/contrib/3.1/GenomeInfoDb_1.2.2.tgz' Content type 'application/x-gzip' length 404120 bytes (394 KB) opened URL == downloaded 394 KB The downloaded binary packages are in /var/folders/cx/n9s558kx6fb7jf5z_pgszgb8gn/T//Rtmpxp29xo/downloaded_packages sessionInfo() R version 3.1.2 Patched (2014-11-01 r66923) ## Note that it matches the current R version used by the Bioc-devel build machines Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BiocInstaller_1.16.0 colorout_1.0-2 loaded via a namespace (and not attached): [1] tools_3.1.2 For my computer it's not problem because I can download the pkg via svn and install locally. But it breaks my tests in TravisCI which relies on biocLite(). Cheers, Leo ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Rd] ScalarLogical and setAttrib
Hi, The problem is better illustrated with: library(inline) test2 - cfunction(body = ' SEXP success = PROTECT(ScalarLogical(0)); setAttrib(success, install(foo), mkString(bar)); UNPROTECT(1); return success; ') test3 - cfunction(body = ' SEXP success = PROTECT(ScalarLogical(0)); LOGICAL(success)[0] = 1; UNPROTECT(1); return success; ') Then: test2() [1] FALSE attr(,foo) [1] bar test3() [1] TRUE attr(,foo) [1] bar Looks like someone assumed that the SEXP returned by ScalarLogical() would never be touched before being returned to R (which is probably true 99% of the time but not 100%). Cheers, H. On 10/31/2014 03:58 PM, Jeroen Ooms wrote: Is it expected that attributes set on a LGLSXP created by ScalarLogical will apply to all future objects created by ScalarLogical as well? For example: the 'test1' function below returns FALSE and 'test2' returns FALSE with an attribute: library(inline) test1 - cfunction(body = 'return ScalarLogical(0);') test2 - cfunction(body = ' SEXP success = PROTECT(ScalarLogical(0)); setAttrib(success, install(foo), mkString(bar)); UNPROTECT(1); return success; ') However after running test2(), then test1() will also return the attribute: test1() [1] FALSE test2() [1] FALSE attr(,foo) [1] bar test1() [1] FALSE attr(,foo) [1] bar It seems like ScalarLogical returns a singleton object, which is not the case for ScalarInteger or ScalarReal. I am currently working around this using duplicate(ScalarLogical(0)), but was quite surprised by this behavior of ScalarLogical. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] ScalarLogical and setAttrib
Hi, The problem is better illustrated with: library(inline) test2 - cfunction(body = ' SEXP success = PROTECT(ScalarLogical(0)); setAttrib(success, install(foo), mkString(bar)); UNPROTECT(1); return success; ') test3 - cfunction(body = ' SEXP success = PROTECT(ScalarLogical(0)); LOGICAL(success)[0] = 1; UNPROTECT(1); return success; ') Then: test2() [1] FALSE attr(,foo) [1] bar test3() [1] TRUE attr(,foo) [1] bar Looks like someone assumed that the SEXP returned by ScalarLogical() would never be touched before being returned to R (which is probably true 99% of the time but not 100%). Cheers, H. On 10/31/2014 03:58 PM, Jeroen Ooms wrote: Is it expected that attributes set on a LGLSXP created by ScalarLogical will apply to all future objects created by ScalarLogical as well? For example: the 'test1' function below returns FALSE and 'test2' returns FALSE with an attribute: library(inline) test1 - cfunction(body = 'return ScalarLogical(0);') test2 - cfunction(body = ' SEXP success = PROTECT(ScalarLogical(0)); setAttrib(success, install(foo), mkString(bar)); UNPROTECT(1); return success; ') However after running test2(), then test1() will also return the attribute: test1() [1] FALSE test2() [1] FALSE attr(,foo) [1] bar test1() [1] FALSE attr(,foo) [1] bar It seems like ScalarLogical returns a singleton object, which is not the case for ScalarInteger or ScalarReal. I am currently working around this using duplicate(ScalarLogical(0)), but was quite surprised by this behavior of ScalarLogical. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Bioc-devel] getting a listing of all src packages
Hi Darin, We're grateful that you're willing to do this! Please be aware that a particular version of BioC is aimed to be used with a particular version of R only. E.g. the current release (BioC 3.0) is aimed to be used with R-3.1.1. Also some BioC packages depend on particular versions of CRAN packages. Only the CRAN packages used by our build system have been tested to be compatible with BioC 3.0: http://bioconductor.org/checkResults/3.0/bioc-LATEST/ You can see the list of CRAN packages used by the latest builds by clicking on the link under 'Installed pkgs' in the top table. Note that, because these builds are currently live (they run every night and will stop in April 2015), the CRAN packages they use are always the latest packages available on CRAN. Cheers, H. On 10/30/2014 06:22 AM, Sean Davis wrote: Hi, Darin. You might find these instructions useful: http://www.bioconductor.org/about/mirrors/mirror-how-to/ Basically, you can use rsync to get the repository and keep it up-to-date. Sean On Thu, Oct 30, 2014 at 9:17 AM, Darin Perusich da...@darins.net wrote: Hello All, Would it be possible to get a listing of all package sources from /packages/release/bioc/src/contrib or make the location browseable? I'm trying to download all the bioconductor tarball's so I can create rpm packages of them for inclusion in the openSUSE Build Service. I've already done this for all the CRAN packages and would like to include bioconductor packages as well. https://build.opensuse.org/project/show/devel:languages:R:released https://build.opensuse.org/project/show/devel:languages:R:CRAN-packages -- Later, Darin ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] depends on packages providing classes
typically be of great utility. Given your list above, I would think that depending on GenomicRanges would often be sufficient, and IRanges/S4Vectors would not require dependency assertion. I would think that GenomeInfoDb should be a voluntary attachment for a specific session. These are just my guesses -- I doubt there will be complete consensus, but I have started to think very critically about using Depends, and I think it is better when its use is minimized. That of course leads to the R CMD check NOTE on depending on too many packages I guess I should ignore that one. Best, Kasper [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] depends on packages providing classes
On 10/28/2014 12:42 PM, Vincent Carey wrote: On Tue, Oct 28, 2014 at 2:29 PM, Hervé Pagès hpa...@fredhutch.org mailto:hpa...@fredhutch.org wrote: Hi, On 10/28/2014 08:48 AM, Vincent Carey wrote: On Tue, Oct 28, 2014 at 11:23 AM, Kasper Daniel Hansen kasperdanielhan...@gmail.com mailto:kasperdanielhan...@gmail.com wrote: Well, first I want to make sure that there is not something special regarding S4 methods and classes. I have a feeling that they are a special case. Second, while I agree with Jim's general opinion, it is a little bit different when I have return objects which are defined in other packages. If I don't depend on this other package, the user is hosed wrt. the return object, unless I manually export all classes from this other In what sense? If you return an instance of GRanges, certain things can be done even if GenomicRanges is not attached. Yes certain things maybe, but it's hard to predict which ones. You can get values of slots, for example. With the following little package %vjcair cat foo/NAMESPACE importFrom(IRanges, IRanges) importClassesFrom(__GenomicRanges, GRanges) importFrom(GenomicRanges, GRanges) export(myfun) %vjcair cat foo/DESCRIPTION Package: foo Title: foo Version: 0.0.0 Author: VJ Carey st...@channing.harvard.edu mailto:st...@channing.harvard.edu Description: Suggests: Depends: Imports: GenomicRanges Maintainer: VJ Carey st...@channing.harvard.edu mailto:st...@channing.harvard.edu License: Private LazyLoad: yes %vjcair cat foo/R/* myfun = function(seqnames=1, ranges=IRanges(1,2), ...) GRanges(seqnames=seqnames, ranges=ranges, ...) The following works: library(foo) x = myfun() x GRanges object with 1 range and 0 metadata columns: seqnamesranges strand Rle IRanges Rle [1]1[1, 2] * --- seqinfo: 1 sequence from an unspecified genome; no seqlengths So the show method works, even though I have not touched it. (I did not expect it to work, in fact.) Exactly. Let's call it luck ;-) Additionally, I can get access to slots. The end user should never try to access slots directly but use getters and setters instead. And most getters and setters for GRanges objects are defined and documented in the GenomicRanges package. Those that are not are defined in packages that GenomicRanges depends on. But ranges() fails. If I, the user, want to use it, I need to arrange for that. IMO if your package returns a GRanges object to the user, then the user should be able to access the man page for GRanges objects with ?GRanges. Oddly enough, that seems to be incorrect. I added a man page to foo that has a \link[GenomicRanges]{GRanges-class}. I ran help.start and the cross reference from my man page succeeds. Furthermore with the sessionInfo below, ?GRanges succeeds at the CLI. Did you try to run example(GRanges)? I'm not sure that will work. For example after I do library(rtracklayer), I can indeed do ?DNAStringSet at the command line (I'm surprised this works), but then example(DNAStringSet) fails: example(DNAStringSet) Warning message: In example(DNAStringSet) : no help found for ‘DNAStringSet’ I'm also surprised this is just a warning but that's another story... H. I am not trying to defend the NOTE but the principle of minimizing Depends declarations needs to be considered critically, and I am just exploring the space. ?GRanges # it worked as usual in the tty sessionInfo() R version 3.1.1 (2014-07-10) Platform: x86_64-apple-darwin13.1.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices datasets utils tools methods [8] base other attached packages: [1] foo_0.0.0rmarkdown_0.3.8 knitr_1.6 [4] weaver_1.31.0codetools_0.2-9 digest_0.6.4 [7] BiocInstaller_1.16.0 loaded via a namespace (and not attached): [1] BiocGenerics_0.11.5 evaluate_0.5.5formatR_1.0 [4] GenomeInfoDb_1.1.26 GenomicRanges_1.17.48 htmltools_0.2.6 [7] IRanges_1.99.32 parallel_3.1.1S4Vectors_0.2.8 [10] stats4_3.1.1 stringr_0.6.2 XVector_0.5.8 And that works only if the GenomicRanges package is attached. Attaching GenomicRanges will also attach other packages that GenomicRanges depends
Re: [Rd] No error when assigning values to an empty vector/matrix/array
Hi Henrik, On 10/23/2014 08:10 PM, Henrik Bengtsson wrote: Assigning one or more values to a vector/matrix/array x for which length(x) == 0 gives no error, e.g. x - integer(0) x[] - 1:2 x - matrix(nrow=0, ncol=1) x[] - 1:2 x[,1] - 1:2 x - array(dim=c(0,1,1)) x[] - 1:2 x[,1,1] - 1:2 whereas x - integer(1) x[] - 1:2 Warning message: In x[] - 1:2 : number of items to replace is not a multiple of replacement length x - matrix(nrow=1, ncol=1) x[] - 1:2 Warning message: In x[] - 1:2 : number of items to replace is not a multiple of replacement length x[,1] - 1:2 Error in x[, 1] - 1:2 : number of items to replace is not a multiple of replacement length x - array(dim=c(1,1,1)) x[] - 1:2 Warning message: In x[] - 1:2 : number of items to replace is not a multiple of replacement length x[,1,1] - 1:2 Error in x[, 1, 1] - 1:2 : number of items to replace is not a multiple of replacement length Is this intended by design or is it a bug that should be reported? Since [- supports truncating of the right value, why an exception should be made when the left vector has length 0? Also note that these warnings or errors are complaining that the number of items to replace (left length) is not a multiple of replacement length (right length). This suggests that when the left length is a multiple of the right length, everything is fine. And this is actually the case when the left length is 0. Because 0 is a multiple of anything. So in that case, the right value is truncated to length 0 and no warning is issued. Makes sense to me. Cheers, H. /Henrik __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] No error when assigning values to an empty vector/matrix/array
Hi, On 10/24/2014 06:58 AM, S Ellison wrote: Also note that these warnings or errors are complaining that the number of items to replace (left length) is not a multiple of replacement length (right length). This suggests that when the left length is a multiple of the right length, everything is fine. And this is actually the case when the left length is 0. Because 0 is a multiple of anything. So in that case, the right value is truncated to length 0 and no warning is issued. Makes sense to me. Thanks Hervé, you gave the perfect explanation/rationale for this being consistent. This explains why a check for exact multiple of replacement length does not trigger a warning, but surely that is not sensible in the length 0 case. In all other cases, this check warns when there will be truncation of the replacement, and that seems to me the sensible intent of the check. A silent truncation to nothing is surely not the intended behaviour. Yes truncation should not be silent. But truncation to length zero should not be seen as a special case either. What would be more sensible is that [- recognizes the 2 distinct situations that deserve a warning: 1. Truncation (i.e. when left length is right length). 2. Left length is right length AND left length is not a multiple of right length. Then the warning we get should be clear about which situation was detected. So we would get a sensible warning all the time, even when left length is 0. H. I can't help feeling that the 'check for multiple of length' was a neat portmanteau check for several possible problems when recycling is allowed, but that the possibility of assigning to a length 0 object was not considered. I'd suggest logging it as an issue to for R-core to at least look at and either to fix or to at least warn of in documentation. S Ellison *** This email and any attachments are confidential. Any u...{{dropped:26}} __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Bioc-devel] Documentation on how to update a BioC experimental package?
Hi Henrik, FWIW there is some additional information here: https://stat.ethz.ch/pipermail/bioc-devel/2014-January/005156.html In particular the section starting at One important difference with the svn software repository provides some useful hints on how to commit changes to the data (see step 4.). H. On 10/23/2014 07:46 PM, Dan Tenenbaum wrote: - Original Message - From: Vincent Carey st...@channing.harvard.edu To: Henrik Bengtsson h...@biostat.ucsf.edu Cc: bioC-devel bioc-de...@stat.math.ethz.ch Sent: Thursday, October 23, 2014 7:45:13 PM Subject: Re: [Bioc-devel] Documentation on how to update a BioC experimental package? There is a README.txt in the pkgs folder. I will attach it. I think this is accurate, but there may be something else on the site. See the Experiment Data Packages section of http://www.bioconductor.org/developers/how-to/source-control/ Dan On Thu, Oct 23, 2014 at 10:23 PM, Henrik Bengtsson h...@biostat.ucsf.edu wrote: It's been a while since I worked with experimental packages. Where can I find documentation on how to (Subversion) update our AffymetrixDataTestFiles package with additional data files? All I know is that the SVN repository only contains a stub of the package and http://www.bioconductor.org/developers/package-guidelines/#package-types provides little information and basically only point to the devel mailing list. Thanks, Henrik ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Why be default 'D' is not dropped from coverage?
Hi Leo, On 10/18/2014 10:50 AM, Leonardo Collado Torres wrote: Hi, A collaborator of mine is working on a new software and we while we were doing a sanity check to compare the base-level coverage from BAM files and bigWig files generated from his software we realized that by default bases corresponding to a 'D' on the CIGAR string get counted when reading coverage from BAM files. That is: getMethod('coverage', 'GAlignments') Method Definition: function (x, shift = 0L, width = NULL, weight = 1L, ...) { .local - function (x, shift = 0L, width = NULL, weight = 1L, method = c(auto, sort, hash), drop.D.ranges = FALSE) coverage(grglist(x, drop.D.ranges = drop.D.ranges), shift = shift, width = width, weight = weight, method = method) .local(x, shift, width, weight, ...) } environment: namespace:GenomicAlignments Signatures: x target GAlignments defined GAlignments packageVersion('GenomicAlignments') [1] ‘1.2.0’ You can see that this is default elsewhere, for example: extractAlignmentRangesOnReference function (cigar, pos = 1L, drop.D.ranges = FALSE, f = NULL) { if (!isTRUEorFALSE(drop.D.ranges)) stop('drop.D.ranges' must be TRUE or FALSE) if (drop.D.ranges) { ops - c(M, =, X, I) } else { ops - c(M, =, X, I, D) } cigarRangesAlongReferenceSpace(cigar, flag = NULL, pos = pos, f = f, ops = ops, drop.empty.ranges = FALSE, reduce.ranges = TRUE) } environment: namespace:GenomicAlignments What is the rationale for setting `drop.D.ranges` by default to FALSE? Because last time I checked (but that was 3 or 4 years ago), that's what Samtools pileup was doing. Cheers, H. Thanks, Leo -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] runnable examples for internal function ?
Hi Tiphaine, On 10/15/2014 06:30 AM, Martin, Tiphaine wrote: Hi Herve, Thank you for your email. very useful. But I updated my NAMESPACE in order to list only functions that I want to export. Even if I did that and if in addition, I keep different man files for functions that I do not want to export, I have this error message. But if I removed them, I don't have error messages. I kept different man files for functions that I do not want to export to help my colleagues. It seems that currently, it is not recommended in Bioconductor. By colleagues I guess you mean co-developers, not users of your package right? Yes it seems reasonable to expect 'R CMD BiocCheck' to only check \value and \example for man pages that document exported things. We should fix that. Note that it's questionable whether it's a good idea or not to keep around man pages for non exported things. IMO it's better to document internal helpers using in-source comments. And since you already have those in-source comments in place anyway (because you're using roxygen2), all you need to do is find a way to block roxygen2 from generating the man pages for these internal helpers. Co-developers of your package will still be able to see useful information and that information will be placed where developers expect it to be: in the source code itself. My 2 cents. Cheers, H. thank you for your help, Tiphaine From: Hervé Pagès hpa...@fhcrc.org Sent: 14 October 2014 03:02 To: Martin, Tiphaine; Michael Lawrence; Dan Tenenbaum Cc: bioc-devel@r-project.org Subject: Re: [Bioc-devel] runnable examples for internal function ? Hi Tiphaine, On 10/13/2014 04:31 PM, Martin, Tiphaine wrote: Hi, I wrote a list of functions used in my package called coMET. I decided with my colleagues to try to publish it in Bioconductor. Currently, it has not been yet submitted to bioconductor because I would like to be sure that it satisfies all your guidelines. A list of functions is useful only internally. I saw that there are two methods to remove the function from the package index and to disable some of their automated tests: the first method is to put a dot as first letter, the second method is to put the internal keyword (http://cran.r-project.org/web/packages/roxygen2/vignettes/rdkeywords.html). That link is to the documentation of the roxygen2 package. Are you using roxygen2 to develop your package? You didn't say so. You say that you saw that prefixing the function name with a dot or putting the 'internal' keyword in the man page will 'remove the function from the package index and disable some of its automated tests'. I guess that's something you saw in the roxygen2 documentation right? I don't use roxygene2 myself so I don't know what's the roxygen2 way to control what's exposed or not to the user. However I know many BioC developers use it for their package so maybe they'll be able to provide some good advice here. Keep in mind that using a tool like roxygen2 adds an extra layer of convenience when developing your package but, unfortunately, it doesn't completely exempt you from learning and understanding some basic concepts like NAMESPACE, man pages aliases, keywords, etc... The ultimate reference for these things is still the Writing R Extensions manual: http://cran.r-project.org/doc/manuals/r-release/R-exts.html So, and FWIW, below I'll describe quickly how you need to proceed when you don't use a fancy tool like roxygen2 to automatically generate the NAMESPACE and man pages in your package: 1) The real true way to not expose a function to the user is to not export it. What one exports from a package is controlled via the NAMESPACE file. So first you need to learn about how to use the NAMESPACE file to control exactly what you want to expose to the user. See Writing R Extensions manual for the details. 2) Every symbol that is exported needs to be documented, that is, there must be a man page with an alias for this symbol. And of course opening the man page at the R prompt with ?symbol should display useful information about that symbol. 'R CMD check' is the tool that will check that every exported symbol is documented. It will also perform many checks on the man pages to make sure that they are formatted properly. 3) As Dan said previously, any function that is exported also needs to have runnable examples and a \value section in its man page. Note that this is a Bioconductor requirement, 'R CMD check' doesn't check that. 'R CMD BiocCheck' is the tool that will perform this check and any other Bioconductor specific check. I decided to use the second method to reduce to rewrite now all my package. Note that, for plain package development (i.e. without using a a fancy tool like roxygen2), using the internal keyword in a man page has no effect on whether the function is exported or not. When I run the new
Re: [Bioc-devel] warning when merging disjoint sets of seqlevels
Hi Michael, On 10/12/2014 02:02 PM, Michael Lawrence wrote: This recently became a warning, and I am not sure why. Yes, in the overlap case, that might be something to worry about. But a perfectly reasonable use case of merge,Seqinfo is to merge two disjoint sets of seqlevels. Now we're forced to use suppressWarnings() for that. For that use case, nothing has changed, we've always had a warning and we still have it. With BioC 2.14: merge(Seqinfo(chr1), Seqinfo(chr2)) Seqinfo of length 2 seqnames seqlengths isCircular genome chr1 NA NA NA chr2 NA NA NA Warning message: In .Seqinfo.mergexy(x, y) : Each of the 2 combined objects has sequence levels not in the other: - in 'x': chr1 - in 'y': chr2 Make sure to always combine/compare objects based on the same reference genome (use suppressWarnings() to suppress this warning). With BioC 3.0: merge(Seqinfo(chr1), Seqinfo(chr2)) Seqinfo object with 2 sequences from an unspecified genome; no seqlengths: seqnames seqlengths isCircular genome chr1 NA NA NA chr2 NA NA NA Warning message: In .Seqinfo.mergexy(x, y) : The 2 combined objects have no sequence levels in common. (Use suppressWarnings() to suppress this warning.) It's just that the warning is different. The intention was to make the new warning more to the point. H. Michael [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] warning when merging disjoint sets of seqlevels
On 10/13/2014 07:45 AM, Michael Lawrence wrote: But the behavior has changed for this case: merge(Seqinfo(chr1), Seqinfo()) Before it was each has seqlevels not in the other but now it is these two sets are disjoint. Subtle difference, but I think it's important? What has changed is that now we get the warning for the edge-case where one of the 2 Seqinfo objects is empty. In that particular case the warning is arguably not useful so I just removed it. H. On Sun, Oct 12, 2014 at 11:46 PM, Hervé Pagès hpa...@fhcrc.org mailto:hpa...@fhcrc.org wrote: Hi Michael, On 10/12/2014 02:02 PM, Michael Lawrence wrote: This recently became a warning, and I am not sure why. Yes, in the overlap case, that might be something to worry about. But a perfectly reasonable use case of merge,Seqinfo is to merge two disjoint sets of seqlevels. Now we're forced to use suppressWarnings() for that. For that use case, nothing has changed, we've always had a warning and we still have it. With BioC 2.14: merge(Seqinfo(chr1), Seqinfo(chr2)) Seqinfo of length 2 seqnames seqlengths isCircular genome chr1 NA NA NA chr2 NA NA NA Warning message: In .Seqinfo.mergexy(x, y) : Each of the 2 combined objects has sequence levels not in the other: - in 'x': chr1 - in 'y': chr2 Make sure to always combine/compare objects based on the same reference genome (use suppressWarnings() to suppress this warning). With BioC 3.0: merge(Seqinfo(chr1), Seqinfo(chr2)) Seqinfo object with 2 sequences from an unspecified genome; no seqlengths: seqnames seqlengths isCircular genome chr1 NA NA NA chr2 NA NA NA Warning message: In .Seqinfo.mergexy(x, y) : The 2 combined objects have no sequence levels in common. (Use suppressWarnings() to suppress this warning.) It's just that the warning is different. The intention was to make the new warning more to the point. H. Michael [[alternative HTML version deleted]] _ Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/__listinfo/bioc-devel https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org mailto:hpa...@fhcrc.org Phone: (206) 667-5791 tel:%28206%29%20667-5791 Fax: (206) 667-1319 tel:%28206%29%20667-1319 -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] runnable examples for internal function ?
it to avoid listing functions in the index (RGtk2 has thousands of functions). But in general, it's not a good idea to be exporting things that are truly internal. Perhaps internal methods on exported generics, but even then, it's probably best to keep the aliases with the generic, or something. On Mon, Oct 13, 2014 at 1:33 PM, Dan Tenenbaum dtene...@fhcrc.orgmailto:dtene...@fhcrc.org wrote: - Original Message - From: Tiphaine Martin tiphaine.mar...@kcl.ac.ukmailto:tiphaine.mar...@kcl.ac.uk To: bioc-devel@r-project.orgmailto:bioc-devel@r-project.org Sent: Monday, October 13, 2014 1:29:08 PM Subject: [Bioc-devel] runnable examples for internal function ? Hi, I would like to know if I need to add a runnable example for each function that has keyword either internal or not. I don't know what you mean by this, but maybe I should. I ask that because with BiocCheck, version 1.0.2, I had a message for function with keyword internal such as Checking exported objects have runnable examples... * CONSIDER: Adding runnable examples to the following man pages which document exported objects: If the function is exported, it must have a runnable example. and now, I have an error message with BiocCheck, version 1.1.18 * Checking exported objects have runnable examples... Error in if (line == ## No test: || insideDontTest || line == ## End(No test)) { : missing value where TRUE/FALSE needed Calls: Anonymous ... checkExportsAreDocumented - doesManPageHaveRunnableExample - removeDontTest Execution halted Can you file an issue at https://github.com/Bioconductor/BiocCheck/issues and include the name of the package? I will get to it as soon as I can. Dan Regards, Tiphaine Tiphaine Martin PhD Research Student | King's College The Department of Twin Research Genetic Epidemiology | Genetics Molecular Medicine Division St Thomas' Hospital 4th Floor, Block D, South Wing SE1 7EH, London United Kingdom email : tiphaine.mar...@kcl.ac.ukmailto:tiphaine.mar...@kcl.ac.uk Fax: +44 (0) 207 188 6761tel:%2B44%20%280%29%20207%20188%206761 [[alternative HTML version deleted]] ___ Bioc-devel@r-project.orgmailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.orgmailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] NAMESPACE question
. Is there a reproducible example? I see in your code there are several places where you require() or library() various packages. I think one of these Depends: on GenomicRanges, and the messages you see are the effect of moving GenomicRanges from 'loaded' to 'attached'. You can see the effect with library(qpgraph) sessionInfo() ## GenomicRanges loaded but not attached library(GenomicRanges) ## information about the package being attached Probably in your code you do not actually want to require() ad hoc packages and influence the user search path (and implicitly rely on search path order for correct functionality), but rather to requireNamespace(foo); foo::fun(...) (or possibly loadNamespace()). Complicated! Martin robert. On 10/07/2014 05:05 PM, Michael Lawrence wrote: Does that happen with the other methods or just [? As a last resort, you could just drop the import (because [ is a primitive, it should just work). On Tue, Oct 7, 2014 at 3:08 AM, Robert Castelo robert.cast...@upf.edu mailto:robert.cast...@upf.edu wrote: hi Martin, On 10/06/2014 07:24 PM, Martin Morgan wrote: [...] There are two 'as.vector' generics, one defined in Matrix and one in BiocGenerics (and made available via IRanges). These generics have different methods showMethods(Matrix::as.vector) Function: as.vector (package base) x=abIndex, mode=ANY x=abIndex, mode=character x=ANY, mode=ANY x=dgCMatrix, mode=missing x=dgeMatrix, mode=missing x=diagonalMatrix, mode=missing x=dsCMatrix, mode=missing x=ldenseMatrix, mode=missing x=Matrix, mode=missing x=ndenseMatrix, mode=missing x=sparseVector, mode=character x=sparseVector, mode=missing showMethods(BiocGenerics::as.__vector) Function: as.vector (package BiocGenerics) x=ANY x=AtomicList x=Rle x=XDouble x=XInteger x=XRaw x=XString x=XStringSet so it's important that your code clearly distinguish between generics. One possibility is to remove importMethodsFrom(IRanges, as.vector) from the NAMESPACE, and explicitly use IRanges::as.vector(...) in your code. ok, i've done this as it is the easiest at the moment to meet the release schedule. i guess that in the future i should try to avoid using the '::' operator by importing exclusively what is needed from each package. codetoolsBioC::__writeNamespaceImports(__qpgraph) might provide you with some guidance (it's not 100% reliable; available via svn at https://hedgehog.fhcrc.org/__bioconductor/trunk/madman/__Rpacks/codetoolsBioC https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/codetoolsBioC) about what functionality is being imported. thanks for the heads up about codetoolsBioC, i've tried it out and seen that some of the suggested imports are not necessary but some others i was really missing them (which makes me wonder how was it possible that he package did not break at those points). one further question related to NAMESPACE. i subset GRanges objects in the package via the '[' operator, i've included this into the NAMESPACE file as: importMethodsFrom(__GenomicRanges, c, cbind, rbind, mcols-, start, end, strand, sort, [, [-, [[, [[-, $, $-) however, when the package reaches a subset operation x[i] with x being a GRanges object, an entire package loading sequence starts: Loading required package: GenomicRanges Loading required package: BiocGenerics Loading required package: parallel Attaching package: ‘BiocGenerics’ [... etc ...] which may look a bit odd to the user. for every other imported method the package uses them silently without loading the corresponding package, am i importing '[' for GRanges objects from the wrong package? is there a way to import '[' so that my package can use it without triggering that package loading sequence? thanks again! robert. _ Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/__listinfo/bioc-devel https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] a warning we may not need any more?
Hi Vince, Good idea. I made that change in GenomeInfoDb 1.1.26: library(GenomeInfoDb) x - Seqinfo(seqnames=c(chr1, chr2, chr3, chrM), seqlengths=c(100, 200, NA, 15), isCircular=c(NA, FALSE, FALSE, TRUE), genome=toy) y - Seqinfo(seqnames=c(chr3, chr4, chrM), seqlengths=c(300, NA, 15)) merge(x, y) # warning genome(y) - genome(x)[[1]] merge(x, y) # no warning Cheers, H. On 10/09/2014 12:04 PM, Vincent Carey wrote: *Warning message:* *In .Seqinfo.mergexy(x, y) :* * Each of the 2 combined objects has sequence levels not in the other:* * - in 'x': chr4, chr16, chr10, chr11, chr18, chr20, chr22* * - in 'y': chrY* * Make sure to always combine/compare objects based on the same reference* * genome (use suppressWarnings() to suppress this warning).* *It seems to me that if we can check equality of the assigned genomes, we should* *not issue this warning.* [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] S4Vectors unit tests
On 10/07/2014 10:35 AM, Michael Lawrence wrote: Looks like we still have to move over the DataFrame, etc tests from IRanges? Yes this and other leftovers. H. Michael [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] minor suggestion for BiocStyle?
://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] BSgenome forge file input file restrictions
Hi Florian, True. These restrictions don't make much sense these days anymore! Some of them are gone in the devel version of BSgenome. The BSgenomeForge vignette in devel now says: The sequence data must be in a single twoBit file (e.g. musFur1.2bit) or in a collection of FASTA files (possibly gzip-compressed). I guess I should also support a single FASTA file. H. On 09/29/2014 01:36 AM, Hahne, Florian wrote: Hi all, I was wondering whether some of the rather arbitrary restrictions on input files for the process of forging as new Bsgenome package could be liftet. In particular: Why do we need all chromosomes in individual files? Couldn�t the function be smart enough to just extract the relevant bits from a single file containing all chromosomes? Or even from several such files? Why are gzipped files not allowed? Pretty much all tools in Biostrings seem to be able to deal with gzipped fasta files these days. Thanks, Florian [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Importing summary from AnnotationDbi, Category, GOstats
Hi Jim, On 09/23/2014 09:42 AM, James W. MacDonald wrote: There is an issue on the support site having to do with the inability to import the summary function from the namespaces from the packages listed in the subject line. I see the same problem/errors with my affycoretools package. When you load ReportingTools, you get the following warnings: Warning messages: 1: In namespaceImportMethods(ns, loadNamespace(j - imp[[1L]], c(lib.loc, : No generic function found corresponding to requested imported methods for summary from package AnnotationDbi (malformed exports?) 2: In namespaceImportMethods(ns, loadNamespace(j - imp[[1L]], c(lib.loc, : No generic function found corresponding to requested imported methods for summary from package Category (malformed exports?) 3: In namespaceImportMethods(ns, loadNamespace(j - imp[[1L]], c(lib.loc, : No generic function found corresponding to requested imported methods for summary from package GOstats (malformed exports?) Is this because of the following changes? r94092 | hpa...@fhcrc.org | 2014-09-12 18:02:34 -0700 (Fri, 12 Sep 2014) | 7 lines Changed paths: M /trunk/madman/Rpacks/AnnotationDbi/DESCRIPTION M /trunk/madman/Rpacks/AnnotationDbi/NAMESPACE M /trunk/madman/Rpacks/AnnotationDbi/R/createAnnObjs-utils.R M /trunk/madman/Rpacks/AnnotationDbi/R/createAnnObjs.NCBIORG_DBs.R M /trunk/madman/Rpacks/AnnotationDbi/R/methods-Inparanoid.R M /trunk/madman/Rpacks/AnnotationDbi/R/methods-Inparanoid8.R M /trunk/madman/Rpacks/AnnotationDbi/R/methods-geneCentricDbs.R - Drop dependency on IRanges (stuff from IRanges needed by AnnotationDbi is now in S4Vectors). - Add dependency on stats4 and import summary() from it. This is an S4 generic and AnnotationDbi defines and exports S4 methods for it. - Address the A package almost never needs to use ::: for its own objects NOTE from 'R CMD check'. which included Index: NAMESPACE === --- NAMESPACE (revision 94000) +++ NAMESPACE (working copy) @@ -1,11 +1,11 @@ import(methods) import(utils) +importFrom(stats4, summary) import(Biobase) import(DBI) import(RSQLite) import(BiocGenerics) -importFrom(S4Vectors, isTRUEorFALSE, isSingleString, metadata) -importFrom(IRanges, elementLengths) +importFrom(S4Vectors, isTRUEorFALSE, isSingleString, metadata, elementLengths) importFrom(GenomeInfoDb, genomeStyles, extractSeqlevels, mapSeqlevels) exportClasses( @@ -169,9 +169,6 @@ intraIDMapper, idConverter, -#Needs to be exported from RSQLite -summary, - ## AnnotationDb metadata, so it appears summary is no longer exported? Yes, no need to export summary from AnnotationDbi because summary() is a generic defined in the stats4 package. Before I made that change, AnnotationDbi was implicitly promoting base::summary() to an S4 generic which was then a different generic from stats4::summary(). Having these 2 distinct summary() generics was causing the usual troubles when a user had AnnotationDbi and stats4 in their search path. Also it was breaking the attract package in some obscure way. The warnings you get when you load ReportingTools will hopefully go away if you reinstall the package. Let me know if it doesn't. Hope this helps, H. Best, Jim -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] need for consistent coordinate mapping API
Hi, It's an interesting problem. Right now mapCoords() has some limitations. For example I can use it to map from reference sequence to read cycle but not the other way around. Or from reference genome to transcriptome but not the other way around (this reverse mapping is actually what low level util transcriptLocs2RefLocs() does). Ideally we should be able to easily go back and forth between 2 coordinate spaces. Also we need to think about how more complex use cases (like Laurent's one) could be handled by mapCoords(). Not clear. We might need to change mapCoords's design a little bit, or maybe we'll need something else. H. On 09/19/2014 10:40 AM, Laurent Gatto wrote: On 19 September 2014 18:07, Michael Lawrence wrote: Hi guys, This is the problem of mapping back and forth between coordinate spaces, such as between genomic and transcript space. I think there was some progress this release cycle (introduction of mapCoords generic, etc), but I think there is yet more to do. For example, transcriptLocs2RefLocs could be given a ranges-based wrapper that conforms to the mapCoords API somehow. Could we please put this on the TODO list of someone (in Seattle) for the next release cycle? And I would be very interested in (and slowly working towards) generalising this to proteomics data. For now, there is a rather long description in [1], but eventually, it should be standardised. I was not aware of mapCoords and will read about it. Laurent [1] http://bioconductor.org/packages/devel/bioc/vignettes/Pbase/inst/doc/mapping.html Michael [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
[Bioc-devel] some turbulences due to IRanges/S4Vectors split
Hi developers, I recently moved SimpleList and DataFrame from IRanges to S4Vectors. That breaks a number of packages as you can see on today's build/check report: http://bioconductor.org/checkResults/3.0/bioc-LATEST/ Will fix today. Sorry for the inconvenience. Cheers, H. ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method
OK so let's go for a 1 line summarization of the seqinfo. Vince is that OK if we keep this at the bottom of the object? That way it will always be visible, even when the object requires more than 1 screen to display (e.g. when it has a lot of metadata cols). Will look something like: gr GRanges with 3 ranges and 0 metadata columns: seqnames ranges strand RleIRanges Rle [1]chr14 [19069583, 19069654] + [2]chr14 [19363738, 19363809] + [3]chr14 [19363755, 19363826] - [4]chr14 [19369799, 19369870] + seqinfo: 60 seqlevels (2 circular) on 2 genomes (hg19, mm10); no seqlengths Thanks, H. On 09/09/2014 06:38 AM, Michael Lawrence wrote: Agreed, that looks a lot nicer. On Tue, Sep 9, 2014 at 4:42 AM, Martin Morgan mtmor...@fhcrc.org wrote: On 09/09/2014 04:02 AM, Michael Lawrence wrote: I'm in favor of this display. The seqinfo output at the bottom has always been annoying (over-emphasized). the fact that the lengths are 'NA' can be a helpful prompt to do something about it, e.g., add seqinfo when inputting the data. Also they are helpful when one is told that seqlengths are incompatible during, e.g., findOverlaps. But I like the idea of less but more informative display of seqinfo, along the lines suggested by Vince. seqinfo: 60 seqlevels (2 circular) on 2 genomes (hg19, mm10); 60 'NA' seqlengths Martin On Mon, Sep 8, 2014 at 10:08 PM, Vincent Carey st...@channing.harvard.edu wrote: On Tue, Sep 9, 2014 at 12:30 AM, Hervé Pagès hpa...@fhcrc.org wrote: On 09/08/2014 06:42 PM, Michael Lawrence wrote: Instead of printing out multiple lines of a table that is rarely of interest, could we develop Peter's idea toward something like: hg19:chr1 hg19:chr2 ... [lengths ...] Not sure what condensed notation would be useful for circularity. I don't know either. I'm worried that this would make the seqinfo stuff look like a named vector and that the user would expect hg19:chr1, hg19:chr2, etc... to be valid names. With the table-like layout, some screen real estate can always be saved by printing less lines: What I had in mind was gr GRanges with 3 ranges and 0 metadata columns: genome: hg19 seqnames ranges strand RleIRanges Rle [1]chr14 [19069583, 19069654] + [2]chr14 [19363738, 19363809] + [3]chr14 [19363755, 19363826] - [4]chr14 [19369799, 19369870] + you could then probably dispense with the seqlengths. i have never found them too useful except as a key to the genome. if there are multiple genomes, we have something like genomes: hg19, mm9 the point is to make it prominent, particularly at a time of transition. --- seqinfo: 60 seqlevels (2 circulars) on 2 genomes (hg19, mm10) --- seqlevelsseqlengths isCircular genome chr1 249250621 NA hg19 chr10 135534747 NA hg19 ... ...... ... chrX 155270560 NA hg19 chrY 59373566 NA hg19 I agree that the exact content of the seqinfo table itself is rarely of interest so printing only 3 or 4 lines is OK. IMO it's important to make the user aware of the existence of this hidden table and to display it like what it really is (i.e. a table). Also displaying the column names is a well established tradition and serves the purpose of providing a quick summary of the accessors that are available to access those fields. H. On Mon, Sep 8, 2014 at 5:21 PM, Peter Hickey hic...@wehi.edu.au mailto:hic...@wehi.edu.au wrote: Perhaps it might be useful to have some way of highlighting if any of the chromosomes are circular or highlighting if there are multiple genomes present? Otherwise this information might be hidden in the … Cheers, Pete On 09/09/2014, at 9:44 AM, Hervé Pagès hpa...@fhcrc.org mailto:hpa...@fhcrc.org wrote: On 09/08/2014 02:28 PM, Peter Hickey wrote: Just a vote for still allowing for multiple genomes in a Seqinfo object (in a GRanges object). My use case is in bisulfite-sequencing experiments where there is often a spike-in of a lambda phage genome along with the genome of interest (human or mouse). It's often useful to keep all data from a single library together in the same objet but process according to genome(x) for each seqlevel. Note taken. Thanks Pete! It's always great to know about concrete use cases. FWIW, I like Vincent's proposal of selectSome(unique(genome(x))) in the show method. Or what about displaying
Re: [Bioc-devel] writeVcf performance
mailto:Bioc-devel@r-project.org mailto:Bioc-devel@r-project. mailto:Bioc-devel@r-project.org mailto:Bioc-devel@r-project.__org mailto:Bioc-devel@r-project.org mailto:Bioc-devel@r-project mailto:Bioc-devel@r-project. mailto:Bioc-devel@r-project mailto:Bioc-devel@r-project.__org mailto:Bioc-devel@r-project. mailto:Bioc-devel@r-project.org mailto:Bioc-devel@r-project.__org mailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel https://stat.ethz.ch/mailman/__listinfo/bioc-devel https://stat.ethz.ch/mailman/__listinfo/bioc-devel https://stat.ethz.ch/mailman/listinfo/bioc-devel https://stat.ethz.ch/mailman/__listinfo/bioc-devel https://stat.ethz.ch/mailman/listinfo/bioc-devel https://stat.ethz.ch/mailman/listinfo/bioc-devel https://stat.ethz.ch/mailman/__listinfo/bioc-devel https://stat.ethz.ch/mailman/__listinfo/bioc-devel https://stat.ethz.ch/mailman/listinfo/bioc-devel https://stat.ethz.ch/mailman/listinfo/bioc-devel https://stat.ethz.ch/mailman/__listinfo/bioc-devel https://stat.ethz.ch/mailman/listinfo/bioc-devel https://stat.ethz.ch/mailman/__listinfo/bioc-devel https://stat.ethz.ch/mailman/__listinfo/bioc-devel https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Computational Biologist Genentech Research [[alternative HTML version deleted]] _ Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailto:Bioc-devel@r-project.__org mailto:Bioc-devel@r-project.org mailto:Bioc-devel@r-project. mailto:Bioc-devel@r-project.org mailto:Bioc-devel@r-project.__org mailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/__listinfo/bioc-devel https://stat.ethz.ch/mailman/listinfo/bioc-devel https://stat.ethz.ch/mailman/listinfo/bioc-devel https://stat.ethz.ch/mailman/__listinfo/bioc-devel https://stat.ethz.ch/mailman/listinfo/bioc-devel https://stat.ethz.ch/mailman/__listinfo/bioc-devel https://stat.ethz.ch/mailman/__listinfo/bioc-devel https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 tel:%28206%29%20667-2793 tel:%28206%29%20667-2793 tel:%28206%29%20667-2793 -- Computational Biologist Genentech Research -- Computational Biologist Genentech Research -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 tel:%28206%29%20667-2793 tel:%28206%29%20667-2793 -- Computational Biologist Genentech Research _ Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/__listinfo/bioc-devel https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Computational Biologist Genentech Research ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health
Re: [Bioc-devel] writeVcf performance
Ah, ok. I should have 'svn up' and re-tried 'R CMD check' before reporting this. Thanks and sorry for the noise. H. On 09/09/2014 03:09 PM, Valerie Obenchain wrote: Hi Herve, This unit test passes in VA 1.11.30 (the current version in svn). It was related to writeVcf(), not the IRanges/S4Vector stuff. My fault, not yours. Val On 09/09/2014 02:47 PM, Hervé Pagès wrote: Hi Val, On 09/09/2014 02:12 PM, Valerie Obenchain wrote: Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the process of moving SimpleList and DataFrame from IRanges to S4Vectors; finished up today I think. I fixed VariantAnnotation's NAMESPACE this morning but 'R CMD check' failed for me because of an unit test error in test_VRanges_vcf(). Here is how to quickly reproduce: library(VariantAnnotation) library(RUnit) source(path/to/VariantAnnotation/inst/unitTests/test_VRanges-class.R) dest - tempfile() vr - make_TARGET_VRanges_vcf() writeVcf(vr, dest) vcf - readVcf(dest, genome = hg19) perm - c(1, 7, 8, 4, 2, 10) vcf.vr - as(vcf, VRanges)[perm] genome(vr) - hg19 checkIdenticalVCF(vr, vcf.vr) # Error in checkIdentical(orig, vcf) : FALSE Hard for me to tell whether this is related to DataFrame moving from IRanges to S4Vectors or to a regression in writeVcf(). Do you think you can have a look? Thanks for the help and sorry for the trouble. H. Anyhow, if you get VariantAnnotation from svn you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe rtracklayer). I'm working on the 'chunking' approach next. It looks like we can still gain from adding that. Valerie On 09/08/2014 12:00 PM, Valerie Obenchain wrote: fyi Martin found a bug with the treatment of list data (ie, Number = '.') in the header. Working on a fix ... Val On 09/08/2014 08:43 AM, Gabe Becker wrote: Val, That is great. I'll check this out and test it on our end. ~G On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain voben...@fhcrc.org mailto:voben...@fhcrc.org wrote: The new writeVcf code is in 1.11.28. Using the illumina file you suggested, geno fields only, writing now takes about 17 minutes. hdr class: VCFHeader samples(1): NA12877 meta(6): fileformat ApplyRecalibration ... reference source fixed(1): FILTER info(22): AC AF ... culprit set geno(8): GT GQX ... PL VF param = ScanVcfParam(info=NA) vcf = readVcf(fl, , param=param) dim(vcf) [1] 516127621 system.time(writeVcf(vcf, out.vcf)) user system elapsed 971.0326.568 1004.593 In 1.11.28, parsing of geno data was moved to C. If this didn't speed things up enough we were planning to implement 'chunking' through the VCF and/or move the parsing of info to C, however, it looks like geno was the bottleneck. I've tested a number of samples/fields combinations in files with = .5 million rows and the improvement over writeVcf() in release is ~ 90%. Valerie On 09/04/14 15:28, Valerie Obenchain wrote: Thanks Gabe. I should have something for you on Monday. Val On 09/04/2014 01:56 PM, Gabe Becker wrote: Val and Martin, Apologies for the delay. We realized that the Illumina platinum genome vcf files make a good test case, assuming you strip out all the info (info=NA when reading it into R) stuff. ftp://platgene:G3n3s4me@ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz took about ~4.2 hrs to write out, and is about 1.5x the size of the files we are actually dealing with (~50M ranges vs our ~30M). Looking forward a new vastly improved writeVcf :). ~G On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence lawrence.mich...@gene.com mailto:lawrence.mich...@gene.com mailto:lawrence.michael@gene.__com mailto:lawrence.mich...@gene.com wrote: Yes, it's very clear that the scaling is non-linear, and Gabe has been experimenting with a chunk-wise + parallel algorithm. Unfortunately there is some frustrating overhead with the parallelism. But I'm glad Val is arriving at something quicker. Michael On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org wrote: On 08/27/2014 11:56 AM, Gabe Becker wrote: The profiling I attached in my previous email is for 24 geno fields, as I said, but our typical usecase involves only ~4-6 fields
Re: [Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method
Hi Vince, Yes it would make sense to have the show method report the genome when genome(x) contains a unique non-NA value. I think the main use case for having the genome defined at the sequence level instead of the whole object level is metagenomics. Maybe Michael has some other good use cases to share since IIRC he requested the addition of the genome field a couple of years ago and made the case for having it defined at the sequence level. Cheers, H. On 09/08/2014 07:21 AM, Vincent Carey wrote: For GRanges x, my naive expectation is that genome(x) returns a length- one tag identifying the genome to which chromosomal coordinates correspond. The genome() method seems to have sequence-specific semantics, which makes sense, but when we identify sequence with chromosome, it seems too complicated. Is there a use case for a GRanges with sequences from several different genomes? One reason I am inquiring is that I feel it would be nice to have the GRanges show() method report, prominently, the genome in use (or NA if unspecified). This could be accomplished by reporting unique(genome(x)), and perhaps that would be satisfactory. after example(genome) : seqinfo(txdb) Seqinfo of length 15 seqnames seqlengths isCircular genome CH2L 23011544 FALSEdm3 CH2R 21146708 FALSEdm3 CH3L 24543557 FALSEdm3 CH3R 27905053 FALSEdm3 CH4 1351857 FALSEdm3 ... ......... CH3LHet 2555491 FALSEdm3 CH3RHet 2517507 FALSEdm3 CHXHet 204112 FALSEdm3 CHYHet 347038 FALSEdm3 CHUextra 29004656 FALSEdm3 genome(seqinfo(txdb)) CH2L CH2R CH3L CH3R CH4 CHX CHUM dm3dm3dm3dm3dm3dm3dm3dm3 CH2LHet CH2RHet CH3LHet CH3RHet CHXHet CHYHet CHUextra dm3dm3dm3dm3dm3dm3dm3 [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method
On 09/08/2014 11:41 AM, Michael Lawrence wrote: I might have requested the genome annotation, but I'm pretty sure it wasn't me who decide on tracking it on a per-sequence basis. OK, maybe. Don't trust my memory too much on this. No regrets though. I think it was the right thing to do ;-) Just because the SAM/BAM format does it is a good enough reason for us to do it too. According to the SAM Spec, the header can look like: @HD VN:1.3 SO:coordinate @SQ SN:chr1_hg19LN:45 AS:hg19 @SQ SN:chr1_mm10LN:42 AS:mm10 The only problem is that seqinfo(BamFile(test.bam)) seems to ignore the AS tag (genome assembly identifier) at the moment: seqinfo(BamFile(test.bam)) Seqinfo of length 2 seqnames seqlengths isCircular genome chr1_hg1945 NA NA chr1_mm1042 NA NA Hopefully that can be addressed. But that's a separate issue... Cheers, H. I could imagine use cases for that though, e.g., when diagnosing sequencing contamination (like human vs. mouse). But most other tools and file formats expect a single genome per track, so, for example, rtracklayer has an internal function singleGenome() to take care of this. On Mon, Sep 8, 2014 at 10:50 AM, Hervé Pagès hpa...@fhcrc.org mailto:hpa...@fhcrc.org wrote: Hi Vince, Yes it would make sense to have the show method report the genome when genome(x) contains a unique non-NA value. I think the main use case for having the genome defined at the sequence level instead of the whole object level is metagenomics. Maybe Michael has some other good use cases to share since IIRC he requested the addition of the genome field a couple of years ago and made the case for having it defined at the sequence level. Cheers, H. On 09/08/2014 07:21 AM, Vincent Carey wrote: For GRanges x, my naive expectation is that genome(x) returns a length- one tag identifying the genome to which chromosomal coordinates correspond. The genome() method seems to have sequence-specific semantics, which makes sense, but when we identify sequence with chromosome, it seems too complicated. Is there a use case for a GRanges with sequences from several different genomes? One reason I am inquiring is that I feel it would be nice to have the GRanges show() method report, prominently, the genome in use (or NA if unspecified). This could be accomplished by reporting unique(genome(x)), and perhaps that would be satisfactory. after example(genome) : seqinfo(txdb) Seqinfo of length 15 seqnames seqlengths isCircular genome CH2L 23011544 FALSEdm3 CH2R 21146708 FALSEdm3 CH3L 24543557 FALSEdm3 CH3R 27905053 FALSEdm3 CH4 1351857 FALSEdm3 ... ......... CH3LHet 2555491 FALSEdm3 CH3RHet 2517507 FALSEdm3 CHXHet 204112 FALSEdm3 CHYHet 347038 FALSEdm3 CHUextra 29004656 FALSEdm3 genome(seqinfo(txdb)) CH2L CH2R CH3L CH3R CH4 CHX CHUM dm3dm3dm3dm3dm3dm3 dm3dm3 CH2LHet CH2RHet CH3LHet CH3RHet CHXHet CHYHet CHUextra dm3dm3dm3dm3dm3dm3dm3 [[alternative HTML version deleted]] _ Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/__listinfo/bioc-devel https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org mailto:hpa...@fhcrc.org Phone: (206) 667-5791 tel:%28206%29%20667-5791 Fax: (206) 667-1319 tel:%28206%29%20667-1319 _ Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/__listinfo/bioc-devel https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method
On 09/08/2014 02:28 PM, Peter Hickey wrote: Just a vote for still allowing for multiple genomes in a Seqinfo object (in a GRanges object). My use case is in bisulfite-sequencing experiments where there is often a spike-in of a lambda phage genome along with the genome of interest (human or mouse). It's often useful to keep all data from a single library together in the same objet but process according to genome(x) for each seqlevel. Note taken. Thanks Pete! It's always great to know about concrete use cases. FWIW, I like Vincent's proposal of selectSome(unique(genome(x))) in the show method. Or what about displaying the genome next to the seqlevel it's associated with? Like e.g.: gr GRanges with 3 ranges and 0 metadata columns: seqnames ranges strand RleIRanges Rle [1]chr14 [19069583, 19069654] + [2]chr14 [19363738, 19363809] + [3]chr14 [19363755, 19363826] - [4]chr14 [19369799, 19369870] + --- seqinfo: seqlevels seqlengths isCircular genome chr1 249250621 NA hg19 chr10 135534747 NA hg19 chr11 135006516 NA hg19 ... ......... chrUn_gl000249 38502 NA hg19 chrX 155270560 NA hg19 chrY59373566 NA hg19 That way, we also raise awareness about the isCircular field. The current choice to only display the seqlengths pre-dates the existence of the seqinfo slot but might be a little bit misleading those days since it only exposes some arbitrary seqinfo fields. H. Cheers, Pete I might have requested the genome annotation, but I'm pretty sure it wasn't me who decide on tracking it on a per-sequence basis. I could imagine use cases for that though, e.g., when diagnosing sequencing contamination (like human vs. mouse). But most other tools and file formats expect a single genome per track, so, for example, rtracklayer has an internal function singleGenome() to take care of this. On Mon, Sep 8, 2014 at 10:50 AM, Herv? Pag?s hpa...@fhcrc.org wrote: Hi Vince, Yes it would make sense to have the show method report the genome when genome(x) contains a unique non-NA value. I think the main use case for having the genome defined at the sequence level instead of the whole object level is metagenomics. Maybe Michael has some other good use cases to share since IIRC he requested the addition of the genome field a couple of years ago and made the case for having it defined at the sequence level. Cheers, H. On 09/08/2014 07:21 AM, Vincent Carey wrote: For GRanges x, my naive expectation is that genome(x) returns a length- one tag identifying the genome to which chromosomal coordinates correspond. The genome() method seems to have sequence-specific semantics, which makes sense, but when we identify sequence with chromosome, it seems too complicated. Is there a use case for a GRanges with sequences from several different genomes? One reason I am inquiring is that I feel it would be nice to have the GRanges show() method report, prominently, the genome in use (or NA if unspecified). This could be accomplished by reporting unique(genome(x)), and perhaps that would be satisfactory. after example(genome) : seqinfo(txdb) Seqinfo of length 15 seqnames seqlengths isCircular genome CH2L 23011544 FALSEdm3 CH2R 21146708 FALSEdm3 CH3L 24543557 FALSEdm3 CH3R 27905053 FALSEdm3 CH4 1351857 FALSEdm3 ... ......... CH3LHet 2555491 FALSEdm3 CH3RHet 2517507 FALSEdm3 CHXHet 204112 FALSEdm3 CHYHet 347038 FALSEdm3 CHUextra 29004656 FALSEdm3 genome(seqinfo(txdb)) CH2L CH2R CH3L CH3R CH4 CHX CHUM dm3dm3dm3dm3dm3dm3dm3dm3 CH2LHet CH2RHet CH3LHet CH3RHet CHXHet CHYHet CHUextra dm3dm3dm3dm3dm3dm3dm3 [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel Peter Hickey, PhD Student/Research Assistant, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052,
Re: [Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method
On 09/08/2014 06:42 PM, Michael Lawrence wrote: Instead of printing out multiple lines of a table that is rarely of interest, could we develop Peter's idea toward something like: hg19:chr1 hg19:chr2 ... [lengths ...] Not sure what condensed notation would be useful for circularity. I don't know either. I'm worried that this would make the seqinfo stuff look like a named vector and that the user would expect hg19:chr1, hg19:chr2, etc... to be valid names. With the table-like layout, some screen real estate can always be saved by printing less lines: gr GRanges with 3 ranges and 0 metadata columns: seqnames ranges strand RleIRanges Rle [1]chr14 [19069583, 19069654] + [2]chr14 [19363738, 19363809] + [3]chr14 [19363755, 19363826] - [4]chr14 [19369799, 19369870] + --- seqinfo: 60 seqlevels (2 circulars) on 2 genomes (hg19, mm10) --- seqlevelsseqlengths isCircular genome chr1 249250621 NA hg19 chr10 135534747 NA hg19 ... ......... chrX 155270560 NA hg19 chrY 59373566 NA hg19 I agree that the exact content of the seqinfo table itself is rarely of interest so printing only 3 or 4 lines is OK. IMO it's important to make the user aware of the existence of this hidden table and to display it like what it really is (i.e. a table). Also displaying the column names is a well established tradition and serves the purpose of providing a quick summary of the accessors that are available to access those fields. H. On Mon, Sep 8, 2014 at 5:21 PM, Peter Hickey hic...@wehi.edu.au mailto:hic...@wehi.edu.au wrote: Perhaps it might be useful to have some way of highlighting if any of the chromosomes are circular or highlighting if there are multiple genomes present? Otherwise this information might be hidden in the … Cheers, Pete On 09/09/2014, at 9:44 AM, Hervé Pagès hpa...@fhcrc.org mailto:hpa...@fhcrc.org wrote: On 09/08/2014 02:28 PM, Peter Hickey wrote: Just a vote for still allowing for multiple genomes in a Seqinfo object (in a GRanges object). My use case is in bisulfite-sequencing experiments where there is often a spike-in of a lambda phage genome along with the genome of interest (human or mouse). It's often useful to keep all data from a single library together in the same objet but process according to genome(x) for each seqlevel. Note taken. Thanks Pete! It's always great to know about concrete use cases. FWIW, I like Vincent's proposal of selectSome(unique(genome(x))) in the show method. Or what about displaying the genome next to the seqlevel it's associated with? Like e.g.: gr GRanges with 3 ranges and 0 metadata columns: seqnames ranges strand RleIRanges Rle [1]chr14 [19069583, 19069654] + [2]chr14 [19363738, 19363809] + [3]chr14 [19363755, 19363826] - [4]chr14 [19369799, 19369870] + --- seqinfo: seqlevels seqlengths isCircular genome chr1 249250621 NA hg19 chr10 135534747 NA hg19 chr11 135006516 NA hg19 ... ......... chrUn_gl000249 38502 NA hg19 chrX 155270560 NA hg19 chrY59373566 NA hg19 That way, we also raise awareness about the isCircular field. The current choice to only display the seqlengths pre-dates the existence of the seqinfo slot but might be a little bit misleading those days since it only exposes some arbitrary seqinfo fields. H. Cheers, Pete I might have requested the genome annotation, but I'm pretty sure it wasn't me who decide on tracking it on a per-sequence basis. I could imagine use cases for that though, e.g., when diagnosing sequencing contamination (like human vs. mouse). But most other tools and file formats expect a single genome per track, so, for example, rtracklayer has an internal function singleGenome() to take care of this. On Mon, Sep 8, 2014 at 10:50 AM, Herv? Pag?s hpa...@fhcrc.org mailto:hpa...@fhcrc.org wrote: Hi Vince, Yes it would make sense to have the show method report the genome when genome(x) contains a unique
[Bioc-devel] bug when coercing from list to SimpleList
Hi Michael, I found the following bug when coercing a list to a SimpleList with IRanges devel (not with IRanges release): library(IRanges) x - list(a=matrix(rep(a, 6), nrow=3), b=array(rep(b, 24), dim=c(3,4,2))) Then: sapply(as(x, SimpleList), class) ab matrix matrix lapply(as(x, SimpleList), dim) $a [1] 3 2 $b [1] 24 1 The array was turned into a matrix! Note that the SimpleList() constructor behaves as expected: sapply(SimpleList(x), class) ab matrix array lapply(SimpleList(x), dim) $a [1] 3 2 $b [1] 3 4 2 Do you think you can have a look? Thanks, H. sessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] IRanges_1.99.25 S4Vectors_0.1.5 BiocGenerics_0.11.4 loaded via a namespace (and not attached): [1] stats4_3.1.0 -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Mistake in extractTranscriptSeqs() {from: GenomicFeatures} when transcripts are on minus strand
...AGGACGACAGCAGCAGCAGCGCGTCTCCTTCAG [2] 147 GGAGCTCCCAGGGAAGTGGTTGATCCGGTG...TGGAGAAGCAGCAGCAGAAGGAGCAGGAGCAAG [3]69 TGAGAGCCACGAGCCAAGGTGGGCACTTGATGTCGGATCTCTTCAACAAGCTGGTCATGAGGCGCAAGG [4] 468 GCATCTCTGGGAAAGGACCTCTGGTGAGGG...TGCAATAAAGGATCTCTAGCTGTGCAGGA Now with extractTranscriptSeqs(): tx_seq - extractTranscriptSeqs(genome, transcripts[uc009vis.3]) tx_seq A DNAStringSet instance of length 1 width seq names [1] 843 ACCTACAAGATTACTAACA...AAAGGATCTCTAGCTGTGCAGGA uc009vis.3 'tx_seq' contains the 4 sequences returned by getSeq() and concatenated: tx_seq == unlist(exon_seqs) [1] TRUE Using UCSC online query tools will give you the same sequence: http://genome.ucsc.edu/cgi-bin/hgGene?db=hg19hgg_gene=uc009vis.3 (Click on Genomic Sequence, then check 5' UTR Exons, CDS Exons, and 3' UTR Exons. No introns, no promoter/upstream, no downstream sequences.) Hope this helps clarifying things. H. sessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] C attached base packages: [1] grDevices datasets grid parallel stats graphics utils methods base other attached packages: [1] GenomicFeatures_1.17.14AnnotationDbi_1.27.9 Biobase_2.25.0 BSgenome.Hsapiens.UCSC.hg19_1.3.99 [5] BSgenome_1.33.8Biostrings_2.33.13 XVector_0.5.7 spliceR_1.7.1 [9] plyr_1.8.1 RColorBrewer_1.0-5 VennDiagram_1.6.7 cummeRbund_2.7.2 [13] Gviz_1.9.11rtracklayer_1.25.13 GenomicRanges_1.17.28 GenomeInfoDb_1.1.18 [17] IRanges_1.99.24S4Vectors_0.1.2 fastcluster_1.1.13 reshape2_1.4 [21] ggplot2_1.0.0 RSQLite_0.11.4 DBI_0.2-7 BiocGenerics_0.11.4 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] subset fails on a DataFrame with 0 cols
On 07/28/2014 07:52 PM, Michael Lawrence wrote: Thanks, should be fixed in devel S4Vectors. Indeed. Thanks! H. On Mon, Jul 28, 2014 at 6:46 PM, Hervé Pagès hpa...@fhcrc.org mailto:hpa...@fhcrc.org wrote: Hi Michael, Works if the DataFrame has columns: library(IRanges) DF1 - DataFrame(aa=letters[1:4]) DF0 - DF1[0] Then: DF1 DataFrame with 4 rows and 1 column aa character 1 a 2 b 3 c 4 d DF0 DataFrame with 4 rows and 0 columns subset(DF1, c(FALSE, TRUE)) DataFrame with 2 rows and 1 column aa character 1 b 2 d OK. But: subset(DF0, c(FALSE, TRUE)) Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) : subscript is a logical vector with out-of-bounds TRUE values This currently causes subset() to fail on a GRanges object with no metadata cols. Thanks, H. -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org mailto:hpa...@fhcrc.org Phone: (206) 667-5791 tel:%28206%29%20667-5791 Fax: (206) 667-1319 tel:%28206%29%20667-1319 _ Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/__listinfo/bioc-devel https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Distinction between release and devel package websites
Hi Henrik, On 07/24/2014 06:03 AM, Henrik Bengtsson wrote: On Thu, Jul 24, 2014 at 3:16 AM, Hervé Pagès hpa...@fhcrc.org wrote: Hi Andrzej, On 07/22/2014 02:28 PM, Andrzej Oleś wrote: Hi Hervé, thank you for the demo! Yes, this is definitely much more clear than just a different color. Maybe we could first implement this idea on the build/check report websites and see how the uptake will be? I always keep getting confused by the colors which keep changing with every release cycle anyway... Done: http://bioconductor.org/checkResults/3.0/bioc-LATEST/index.html http://bioconductor.org/checkResults/3.0/data-experiment-LATEST/index.html Will revert back if there is too much complaining about this... Thanks for all updates. The latter hurts my eyes though - I didn't modify the data experiment report to use a less bright logo yet. It should auto-update the next time the report is generated though. may I instead suggest to use a left/right side bar with text Bioconductor developer version (3.0) written sideways. We need to deal with a space problem. The report is already too wide. Alternatively, add developer and release tag after each package's version, e.g. affycompData 1.2.0 (BioC release 2.14) and affycompData 1.3.0 (BioC developer 3.0) (possible on a separate line). Space problem again. BTW, the BioC version is nowhere to be seen on the current check results pages; it's only from the URL you can infer this. It's in the title! While at it, it would be great to have links on the check pages that quickly takes to the corresponding devel/release versions. The check page for devel has links to the devel landing pages and the check page for release has links to the release release landing pages. Just click on the name of the package and it will take you to the corresponding landing page. Also, if one could jump directly to a particular package, that would also be great, e.g. http://bioconductor.org/checkResults/3.0/data-experiment-LATEST/index.html#gageData http://bioconductor.org/checkResults/3.0/bioc-LATEST/aroma.light/ Cheers, H. Thxs, Henrik Cheers, H. Cheers, Andrzej On Tue, Jul 22, 2014 at 10:04 PM, Hervé Pagès hpa...@fhcrc.org wrote: Hi Andrzej, On 07/22/2014 10:14 AM, Andrzej Oleś wrote: Hi all, I think having links is useful, e.g. for someone who uses BioC release but wants to install by hand a particular package from the devel branch. Distinct colors between release and devel make sense only if one understands their meaning, which in the end might prove not to be very useful. I was thinking of something like this: http://www.bioconductor.org/checkResults/3.0/data-experiment-LATEST/ Just a demo. This will be reset to the usual background tomorrow. Cheers, H. I would rather recommend emphasizing the distinction between release and devel in clear text across the package landing page, possibly in multiple places, e.g. somewhere close to the actual package version number; for instance, add the word devel after the version number with a tooltip which will give some explanation/warning that this is not the stable release version. The concept of a notification box is far from ideal because it tends to be annoing to the user and once dismissed 'forever' the user won't be warned in the future. I think that the actual problem arises from the fact that the release landing pages are not clearly prioritized over the devel ones. Maybe this could be addressed by preventing the devel pages from being harvested by google? It could make also sense to emphasize (bold face, color, ...) the package release landing page on the result list returned by the search engine on the BioC website. Currently, the results for release and devel differ only in their relative path, which can be easily overlooked, and both say Package Home, see example below: Bioconductor - DESeq2 - /packages/release/bioc/html/DESeq2.html Bioconductor - DESeq2 Home Bioconductor - DESeq2 - /packages/devel/bioc/html/DESeq2.html Bioconductor - DESeq2 Home Cheers, Andrzej On Tue, Jul 22, 2014 at 6:26 PM, James W. MacDonald jmac...@uw.edu wrote: Given that we have an ongoing problem with people inadvisedly clicking and installing things, is there a good rationale for allowing them to do so? Perhaps the landing page for each package should be stripped of links and replaced with some indication of the availability for each package on the various operating systems. There could also be a note indicating that people can install using biocLite(). Best, Jim On 7/22/2014 11:48 AM, Dan Tenenbaum wrote: Seems like there are two problems, first that the release and devel pages look similar, but more importantly that people are downloading and installing from the package pages when they should be using biocLite(). I am open to the suggestions for making the release/devel pages look more different from each other, but I think something needs
Re: [Bioc-devel] A new way to manage side-by-side BioC release and Devel installs
Hi Gabe, Thanks for the heads up, sounds promising. Note that one major difficulty of switching between BioC release and devel is that between October and April it also requires to switch between R release and devel. How is that handled in switchr? Thanks, H. On 07/24/2014 08:42 AM, Gabe Becker wrote: Hey all, One of the things that has come up from the recent Release/Devel distinction thread on this list is that people don't consider there to be an easy way of handling both at the same time. I'd like to offer an alternative based on the switchr https://github.com/gmbecker/switchr package I'm developing. switchr is designed to allow seamless switching between distinct sets of installed packages. It also has build in support for the BioC release/devel distinction, like so: * library(switchr)* *switchTo(BiocDevel)* trying URL ' http://www.bioconductor.org/packages/3.0/bioc/bin/windows/contrib/3.1/BiocInstaller_1.15.5.zip ' Content type 'application/zip' length 109769 bytes (107 Kb) opened URL downloaded 107 Kb package 'BiocInstaller' successfully unpacked and MD5 sums checked Switched to the 'BioC_3.0' computing environment. 31 packages are currently available. Packages installed in your site library ARE suppressed. To switch back to your previous environment type switchBack() * BiocInstaller::biocVersion()* [1] '3.0' * switchBack()* Reverted to the 'original' computing environment. 40 packages are currently available. Packages installed in your site library ARE NOT suppressed. To switch back to your previous environment type switchBack() * BiocInstaller::biocVersion()* [1] '2.14' Note that this only works if you have an R version able to install the devel version of the BiocInstaller package, but that will be true of any solution using BioC Devel. I'm working on making the failure more graceful when this is not the case, and on writing/updating the documentation more generally. switchr is more widely useful than this, which I will be talking about publicly soon, but since it is so on-topic I figured I'd give this list a heads-up on this aspect of it. Please feel free to try switchr out, any feedback is appreciated. ~G -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Distinction between release and devel package websites
Hi Andrzej, On 07/22/2014 10:14 AM, Andrzej Oleś wrote: Hi all, I think having links is useful, e.g. for someone who uses BioC release but wants to install by hand a particular package from the devel branch. Distinct colors between release and devel make sense only if one understands their meaning, which in the end might prove not to be very useful. I was thinking of something like this: http://www.bioconductor.org/checkResults/3.0/data-experiment-LATEST/ Just a demo. This will be reset to the usual background tomorrow. Cheers, H. I would rather recommend emphasizing the distinction between release and devel in clear text across the package landing page, possibly in multiple places, e.g. somewhere close to the actual package version number; for instance, add the word devel after the version number with a tooltip which will give some explanation/warning that this is not the stable release version. The concept of a notification box is far from ideal because it tends to be annoing to the user and once dismissed 'forever' the user won't be warned in the future. I think that the actual problem arises from the fact that the release landing pages are not clearly prioritized over the devel ones. Maybe this could be addressed by preventing the devel pages from being harvested by google? It could make also sense to emphasize (bold face, color, ...) the package release landing page on the result list returned by the search engine on the BioC website. Currently, the results for release and devel differ only in their relative path, which can be easily overlooked, and both say Package Home, see example below: Bioconductor - DESeq2 - /packages/release/bioc/html/DESeq2.html Bioconductor - DESeq2 Home Bioconductor - DESeq2 - /packages/devel/bioc/html/DESeq2.html Bioconductor - DESeq2 Home Cheers, Andrzej On Tue, Jul 22, 2014 at 6:26 PM, James W. MacDonald jmac...@uw.edu wrote: Given that we have an ongoing problem with people inadvisedly clicking and installing things, is there a good rationale for allowing them to do so? Perhaps the landing page for each package should be stripped of links and replaced with some indication of the availability for each package on the various operating systems. There could also be a note indicating that people can install using biocLite(). Best, Jim On 7/22/2014 11:48 AM, Dan Tenenbaum wrote: Seems like there are two problems, first that the release and devel pages look similar, but more importantly that people are downloading and installing from the package pages when they should be using biocLite(). I am open to the suggestions for making the release/devel pages look more different from each other, but I think something needs to be done about the second problem as well. Perhaps a popup that comes up when you click on a package tarball saying The best way to install this is with biocLite(); are you sure you want to download it? Whatever we do probably won't happen until after BioC2014. Dan - Original Message - From: Julian Gehring julian.gehr...@embl.de To: Hervé Pagès hpa...@fhcrc.org, Michael Lawrence lawrence.mich...@gene.com, Vincent Carey st...@channing.harvard.edu Cc: bioc-devel@r-project.org Sent: Tuesday, July 22, 2014 8:07:29 AM Subject: Re: [Bioc-devel] Distinction between release and devel package websites Hi, Tooltips that appear while hovering over selected links are easy to miss. This alone will likely not be clear enough. We should convey the information that the entire website presents a different version of the package. The idea of a notification box that can be made visible by the individual user seems tempting. One can combine this with an optional cookie, to remember the state between browser sessions. Changing the layout of the devel page itself will also be helpful to make the distinction more pronounced. Hopefully we could approach this in a way that maintains the nice design of the bioc website. Best Julian On 21.07.2014 21:50, Hervé Pagès wrote: Hi, In addition to these suggestions, how about using a special background color for package landing pages in devel? Cheers, H. On 07/21/2014 07:32 PM, Michael Lawrence wrote: Or an unobtrusive notification box that drops down from the top of the page, saying something like this is devel; there would be a dismiss button and a checkbox for whether to show again. The user is free to simply ignore it and proceed as normal. On Mon, Jul 21, 2014 at 7:10 PM, Vincent Carey st...@channing.harvard.edu wrote: how about a tooltip that reads installation via biocLite() is the recommended approach to Bioconductor software acquisition, other approaches may lead to inconsistent package-sets that appears when a reader hovers over a tarball. i would imagine that this is how the wrong package gets installed, by manually using an inappropriate tarball. wrong documentation is not so easy but the doc on the devel branch might have a different tooltip
Re: [Bioc-devel] Distinction between release and devel package websites
Hi, In addition to these suggestions, how about using a special background color for package landing pages in devel? Cheers, H. On 07/21/2014 07:32 PM, Michael Lawrence wrote: Or an unobtrusive notification box that drops down from the top of the page, saying something like this is devel; there would be a dismiss button and a checkbox for whether to show again. The user is free to simply ignore it and proceed as normal. On Mon, Jul 21, 2014 at 7:10 PM, Vincent Carey st...@channing.harvard.edu wrote: how about a tooltip that reads installation via biocLite() is the recommended approach to Bioconductor software acquisition, other approaches may lead to inconsistent package-sets that appears when a reader hovers over a tarball. i would imagine that this is how the wrong package gets installed, by manually using an inappropriate tarball. wrong documentation is not so easy but the doc on the devel branch might have a different tooltip cautioning the readers to be sure they want to read the doc on the devel version. On Mon, Jul 21, 2014 at 9:39 PM, Julian Gehring julian.gehr...@embl.de wrote: Hi, Can we make the package websites for the devel and release version of a package more distinguishable? To elaborate on this: In the past, I have seen several users having problems with using bioconductor because they ended up on the wrong page (mostly the devel page when they would have needed the release). This resulted in getting the wrong documentation or installing the wrong package. The pages are well designed, and there is no reason to change this. However, the websites for the devel and release version of a package look almost identical, and that these two get confused seems to happen to many users (me included). If you search for a package within the bioc website, the release version always comes first in the search results. If you are coming from the outside (e.g. google), this may not be the case. In fact, googling a few packages names often returned only the devel page in the top 10 search results. What are the feelings regarding this? We could add a header section on the devel page that states that this is an unstable version not meant to be used in production settings, and provide a link to the respective release version? Best wishes Julian ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] ggbio 1.13.11 fails to load due to namespace change in IRanges
colorspace_1.2-4 DBI_0.2-7 [16] dichromat_2.0-0 digest_0.6.4 fail_1.2 foreach_1.4.2Formula_1.1-2 [21] GenomeInfoDb_1.1.12 GenomicAlignments_1.1.21 GenomicFeatures_1.17.12 GenomicRanges_1.17.24GGally_0.4.6 [26] grid_3.1.0 gridExtra_0.9.1 gtable_0.1.2 Hmisc_3.14-4 IRanges_1.99.22 [31] iterators_1.0.7 lattice_0.20-29 latticeExtra_0.6-26 MASS_7.3-33 munsell_0.4.2 [36] plyr_1.8.1 proto_0.3-10 RColorBrewer_1.0-5 Rcpp_0.11.2 RCurl_1.95-4.1 [41] reshape_0.8.5reshape2_1.4 Rsamtools_1.17.31RSQLite_0.11.4 rtracklayer_1.25.13 [46] S4Vectors_0.1.2 scales_0.2.4 sendmailR_1.1-2 splines_3.1.0stats4_3.1.0 [51] stringr_0.6.2survival_2.37-7 tools_3.1.0 XML_3.98-1.1 XVector_0.5.7 [56] zlibbioc_1.11.1 packageVersion(ggbio) [1] ‘1.13.11’ ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Confusing namespace issue with IRanges 1.99.17
parallel_3.1.0 S4Vectors_0.1.0 stats4_3.1.0tools_3.1.0 foo ## Load data foo load(~/Desktop/DF.Rdata) foo ## Run function foo result - foo(DF) R version 3.1.0 (2014-04-10) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] fooPkg_0.0.1 loaded via a namespace (and not attached): [1] BiocGenerics_0.11.3 IRanges_1.99.17 parallel_3.1.0 S4Vectors_0.1.0 stats4_3.1.0tools_3.1.0 Loading required package: parallel Attaching package: ‘BiocGenerics’ The following objects are masked from ‘package:parallel’: clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB The following object is masked from ‘package:stats’: xtabs The following objects are masked from ‘package:base’: anyDuplicated, append, as.data.frame, as.vector, cbind, colnames, do.call, duplicated, eval, evalq, Filter, Find, get, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rep.int, rownames, sapply, setdiff, sort, table, tapply, union, unique, unlist R version 3.1.0 (2014-04-10) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] IRanges_1.99.17 S4Vectors_0.1.0 BiocGenerics_0.11.3 fooPkg_0.0.1 loaded via a namespace (and not attached): [1] stats4_3.1.0 tools_3.1.0 The same thing happens with the following setup: R version 3.1.1 RC (2014-07-07 r66083) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices datasets utils methods [8] base other attached packages: [1] IRanges_1.99.17 S4Vectors_0.1.0 BiocGenerics_0.11.3 [4] fooPkg_0.0.1colorout_1.0-2 loaded via a namespace (and not attached): [1] stats4_3.1.1 tools_3.1.1 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] [BioC] granges() method for GenomicRanges objects akin to ranges()...
Hi, On 05/13/2014 01:15 AM, Julian Gehring wrote: Hi, In summary, would it be feasible to add to 'GenomicRanges'? 1) A 'granges(x, use.mcols=FALSE, ...)' method with signature 'GRanges' that converts to a 'GRanges' object and optionally drops the mcols (if 'use.mcols' is TRUE) Will do. 2) A 'dropMcols' or 'dropmcols' method with signature 'GRanges' that is a wrapper for mcols(x) - NULL How about setMcols(), which is more general than dropmcols()? Thanks, H. If I can be of help in providing a patch for this, please let me know. Best wishes Julian On 05.05.2014 23:29, Hervé Pagès wrote: On 05/05/2014 02:12 PM, Cook, Malcolm wrote: On 05/05/2014 01:00 PM, Cook, Malcolm wrote: Wondering, Is it too off the beaten track to expect `mcols-`(x,NULL) args(`mcols-`) function (x, ..., value) Arguments after the ellipsis must be named: `mcols-`(x, value=NULL) Herve - Great - of course - so - does this not provide the means requested by the original poster? I think Tim also wanted 'x' to be downgraded to a GRanges instance, like Julian's grangesPlain() does. We could use granges() for that. Deciding of an idiom that can be used inline for just dropping the mcols would be good too. `mcols-`(x, value=NULL) is a little bit tricky, ugly, and error prone as you noticed. These are probably enough reasons for not choosing it as *the* idiom. Its only advantage is that it doesn't introduce a new symbol. H. Nothing we can do about this. Cheers, H. to work? hint: it does not -Original Message- From: bioc-devel-boun...@r-project.org [mailto:bioc-devel-boun...@r-project.org] On Behalf Of Hervé Pagès Sent: Monday, May 05, 2014 1:28 PM To: Kasper Daniel Hansen; Michael Lawrence Cc: Johnston, Jeffrey; ttri...@usc.edu; bioc-devel@r-project.org; bioconduc...@r-project.org Subject: Re: [Bioc-devel] [BioC] granges() method for GenomicRanges objects akin to ranges()... Hi, I have no problem using granges() for that. Just to clarify: (a) it would propagate the names() (b) it would drop the metadata() (c) the mcols() would propagate only if 'use.mcols=TRUE' was specified ('use.mcols' is FALSE by default) (d) it would return a GRanges *instance* i.e. input object 'x' would be downgraded to GRanges if it extends GRanges @Kasper: granges() on SummarizedExperiment ignores the 'use.mcols' arg and always propagates the mcols. Alternatively you can use rowData() which also propagates the mcols. granges() is actually just an alias for rowData() on SummarizedExperiment objects. H. On 05/05/2014 10:31 AM, Kasper Daniel Hansen wrote: I agree with Michael on this. I can see why, in some usage cases, granges() is convenient to have with use.mcols=FALSE (which seems to have been added in the latest release). But in my usage of granges(), where I call granges() on objects like SummarizedExperiments and friends, I have been expecting granges() to give me the GRange component of the object. Not a crippled version of the GRange component. This is - to me - very counter intuitive and I wish I had seen this earlier. It is particular frustrating that this default is part of the generic. Best, Kasper On Mon, May 5, 2014 at 12:11 PM, Michael Lawrence lawrence.mich...@gene.com wrote: In my opinion, granges() is not very clear as to the intent. The mcols are part of the GRanges, so why would calling granges() drop them? I think we want something similar to unclass(), unname(), etc. This why I suggested dropmcols(). On Mon, May 5, 2014 at 8:17 AM, Tim Triche, Jr. tim.tri...@gmail.com wrote: That's exactly what I was after -- the generic is already defined, so why not use it? --t On May 5, 2014, at 7:42 AM, Julian Gehring julian.gehr...@embl.de wrote: Hi, On 05.05.2014 16:22, Martin Morgan wrote: generalize as setMcols, like setNames? setMcols(x, NULL) How about Tim's original suggestion, to add a 'granges' method that works on a 'GRanges' input? The current definition granges(x, use.mcols=FALSE, ...) seem suited for this. Best wishes Julian [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] [BioC] granges() method for GenomicRanges objects akin to ranges()...
On 07/08/2014 11:29 AM, Michael Lawrence wrote: On Tue, Jul 8, 2014 at 10:36 AM, Julian Gehring julian.gehr...@embl.de wrote: Hi Herve, 2) A 'dropMcols' or 'dropmcols' method with signature 'GRanges' that is a wrapper for mcols(x) - NULL How about setMcols(), which is more general than dropmcols()? Do you mean a function like: setMcols - function(x, value = NULL) { mcols(x) = value return(x) } I'd be fine with this. However, some argued before that setting to NULL may be counterintuitive for non-advanced users. Probably best to have both setMcols and dropMcols. OK. Let's go for both. Thanks, H. Best wishes Julian ___ Bioconductor mailing list bioconduc...@r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane. science.biology.informatics.conductor [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] makeTranscriptDbFrom... AnnotationHub
Hi Michael, On 07/08/2014 12:11 PM, Michael Love wrote: The recent TranscriptDb thread reminded me of a question: are there plans (or am I missing the function) to easily get a TranscriptDb out of the AnnotationHub objects? It would be great to have a preprocessed Ensembl txdb like we have for UCSC. I think the 1st thing we should do is have a makeTranscriptDbFromGRanges() function. It should not be too hard because we already have the code :) Marc wrote it. But it's currently part of the makeTranscriptDbFromGFF() function. Roughly speaking this function does 2 things: (1) import the GFF or GTF file as a GRanges object, then (2) turn that GRanges object into a TranscriptDb object. So we should move the code that does (2) into a separate function, the makeTranscriptDbFromGRanges() function, and have makeTranscriptDbFromGFF() call it internally. Then you could call makeTranscriptDbFromGRanges() on any of these GFF- or GTF-based GRanges objects you get from AnnotationHub. We'll work on this soon and announce here when it becomes available. Cheers, H. ah - AnnotationHub() gr - ah$ensembl.release.73.gtf.homo_sapiens.Homo_sapiens.GRCh37.73.gtf_0.0.1.RData gr GRanges with 2268089 ranges and 12 metadata columns: seqnames ranges strand | source Rle IRanges Rle | factor [1]1 [11869, 12227] + | processed_transcript [2]1 [12613, 12721] + | processed_transcript [3]1 [13221, 14409] + | processed_transcript [4]1 [11872, 12227] + | unprocessed_pseudogene [5]1 [12613, 12721] + | unprocessed_pseudogene ... ......... ...... [2268085] MT [14747, 15887] + | protein_coding [2268086] MT [14747, 15887] + | protein_coding [2268087] MT [14747, 14749] + | protein_coding [2268088] MT [15888, 15953] + |Mt_tRNA [2268089] MT [15956, 16023] - |Mt_tRNA type score phase gene_id transcript_id factor numeric integer character character [1]exon NA NA ENSG0223972 ENST0456328 [2]exon NA NA ENSG0223972 ENST0456328 [3]exon NA NA ENSG0223972 ENST0456328 [4]exon NA NA ENSG0223972 ENST0515242 [5]exon NA NA ENSG0223972 ENST0515242 ... ... ... ... ... ... [2268085]exon NA NA ENSG0198727 ENST0361789 [2268086] CDS NA 0 ENSG0198727 ENST0361789 [2268087] start_codon NA 0 ENSG0198727 ENST0361789 [2268088]exon NA NA ENSG0210195 ENST0387460 [2268089]exon NA NA ENSG0210196 ENST0387461 exon_number gene_name gene_biotype transcript_name numeric charactercharacter character [1] 1 DDX11L1 pseudogene DDX11L1-002 [2] 2 DDX11L1 pseudogene DDX11L1-002 [3] 3 DDX11L1 pseudogene DDX11L1-002 [4] 1 DDX11L1 pseudogene DDX11L1-201 [5] 2 DDX11L1 pseudogene DDX11L1-201 ... ... ...... ... [2268085] 1 MT-CYB protein_coding MT-CYB-201 [2268086] 1 MT-CYB protein_coding MT-CYB-201 [2268087] 1 MT-CYB protein_coding MT-CYB-201 [2268088] 1 MT-TTMt_tRNA MT-TT-201 [2268089] 1 MT-TPMt_tRNA MT-TP-201 exon_id protein_id character character [1] ENSE2234944NA [2] ENSE3582793NA [3] ENSE2312635NA [4] ENSE2234632NA [5] ENSE3608237NA ... ... ... [2268085] ENSE1436074NA [2268086]NA ENSP0354554 [2268087]NANA [2268088] ENSE1544475NA [2268089] ENSE1544473NA --- seqlengths: 1 2 ... MT NA NA ... NA ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O