Re: [Bioc-devel] IPI numbers in annotation packages
You need to scroll down that script a ways... Look for 'yeast'.

On Mon, Oct 5, 2015 at 6:11 AM, James W. MacDonald <jmac...@uw.edu> wrote:

> Hi Marc,
>
> That script has this in it:
>
> ## For now just get data for the ones that we have traditionally supported
> ## I don't even know if the other species are available...
> speciesList = c("chipsrc_human.sqlite",
>                 "chipsrc_rat.sqlite",
>                 "chipsrc_chicken.sqlite",
>                 "chipsrc_zebrafish.sqlite",
>                 # "chipsrc_worm.sqlite",
>                 # "chipsrc_fly.sqlite",
>                 "chipsrc_mouse.sqlite",
>                 "chipsrc_bovine.sqlite"
>                 # "chipsrc_arabidopsis.sqlite" ## this is available and could be "activated"
>                 ## But to activate arabidopsis, remember you have to pre-add the tables...
>                 # "chipsrc_canine.sqlite",
>                 # "chipsrc_rhesus.sqlite",
>                 # "chipsrc_chimp.sqlite",
>                 # "chipsrc_anopheles.sqlite"
>                 )
>
> And there is no mention of yeast anywhere. If I search all the scripts for,
> say, 'INSERT INTO pfam', I get
>
> custom_anno/script/bindb.sql
> 328:INSERT INTO pfam
>
> pfam/script/srcdb_pfam.sql
> 202:-- INSERT INTO pfamb
>
> organism_annotation/script/bindb_yeast.sql
> 441:-- INSERT INTO pfam
>
> yeast/script/bindb.sql
> 241:-- INSERT INTO pfam
>
> The first one is just doing all the metadata tables, and the other three
> are in code blocks that are commented out. Is it possible that you used a
> script that didn't make it into svn?
>
> Jim
>
> On Sun, Oct 4, 2015 at 2:36 PM, Marc Carlson <mrj...@gmail.com> wrote:
>
>> Hi Jim,
>>
>> You asked me on Friday where the PFAM IDs for yeast came from and I
>> couldn't recall because at the moment I was at Seattle Children's (and
>> thus nowhere near my copy of my source code). But I also said I would
>> look into it for you later (and I have). Here is what my code tells me:
>> ever since IPI shut down, we have been getting the PFAM and IPI data
>> from UniProt. There is a script in the UniProt.ws package called
>> processDataForBuild.R that is supposed to be called by the script
>> "src_build.sh" (it's the last thing that script does). That code should
>> get the pfam data from yeast for you. Please note that yeast required a
>> lot of special code to get it processed. Nothing with yeast annotations
>> is ever easy. It's like karmic accounting to compensate for all the
>> bread and beer. ;)
>>
>> Let me know if you need any more explanations about what is in there.
>> Because of the crazy timing, before I left I built and pushed into devel
>> a fresh set of .DB0s and core packages (in late August) just in case it
>> was too crazy to do a refresh right now. But it sounds like you won't
>> need that.
>>
>> Marc
>>
>> On Sun, Oct 4, 2015 at 6:27 AM, James W. MacDonald <jmac...@uw.edu>
>> wrote:
>>
>>> I am building the annotation db0 packages for the upcoming Bioconductor
>>> release, which are used to generate all the orgDb and chip annotation
>>> packages that we distribute. Up to the previous release we have always
>>> included IPI identifiers (as part of the table containing the PROSITE
>>> and PFAM IDs). Unfortunately, IPI <https://www.ebi.ac.uk/IPI> is no
>>> longer maintained (since 2011), and UniProt, which is where we got data
>>> for the last few releases, has now dropped support as well.
>>>
>>> Given that this annotation source is no longer maintained, I decided to
>>> exclude these IDs from the current build of the following db0 packages:
>>>
>>> - rat.db0
>>> - chicken.db0
>>> - zebrafish.db0
>>> - mouse.db0
>>> - bovine.db0
>>> - human.db0
>>>
>>> In addition, it is not clear to me (nor can Marc recall) where the data
>>> for PFAM in the yeast.db0 package comes from. Given that we are pretty
>>> far behind schedule for these packages, I have excluded that table as
>>> well.
>>>
>>> If this will break anybody's package, or if there are people who rely
>>> on these IDs, I can just parse them out of the last release and
>>> deprecate, so you will have the IDs for one more release. However, if
>>> nobody cares about such things, I will just go with what we have.
>>> Please speak up if this will affect you.
>>>
>>> --
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> University of Washington
>>> Environmental and Occupational Health Sciences
>>> 4225 Roosevelt Way NE, # 100
>>> Seattle WA 98105-6099
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
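The search Jim describes (finding 'INSERT INTO pfam' and then checking whether each hit sits in commented-out SQL) can be sketched in base R. This is only an illustration: the `sql_lines` vector is made-up stand-in text, not the contents of the real build scripts, and a leading `--` is assumed to be the only form of comment.

```r
# Made-up stand-in for the lines of a SQL build script (not the real files).
sql_lines <- c(
  "-- INSERT INTO pfam",
  "INSERT INTO pfam",
  "  -- INSERT INTO pfamb",
  "SELECT 1;"
)

# Every line that mentions the statement, commented out or not.
hits <- grep("INSERT INTO pfam", sql_lines, fixed = TRUE)

# Assumption: a line is inactive when it starts with the SQL comment marker "--".
commented <- grepl("^[[:space:]]*--", sql_lines[hits])

# Line numbers of the statements that would actually run.
active <- hits[!commented]
```

Applied per file with `readLines()`, the same two steps would separate the one live statement from the three commented-out ones reported above.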
Re: [Bioc-devel] Txdb Issues - all exon names are NA's ?
Works for me.

Marc

On Tue, Sep 22, 2015 at 6:03 PM, Hervé Pagès <hpa...@fredhutch.org> wrote:

> Hi Marc,
>
> On 09/22/2015 05:39 PM, Marc Carlson wrote:
>
>> Herve is right. UCSC doesn't give us this information. And actually, I
>> think it's pretty rare to see exon names from anybody. So it's weird
>> to me that they are a default return value for this method.
>
> Ensembl does provide exon names/ids so any TxDb object that was created
> with makeTxDbFromBiomart("ensembl", ...) should have them:
>
>   library(GenomicFeatures)
>   txdb <- makeTxDbFromBiomart("ensembl", dataset="celegans_gene_ensembl")
>   exonsBy(txdb, use.names=TRUE)$Y74C9A.2a.2
>   # GRanges object with 4 ranges and 3 metadata columns:
>   #       seqnames         ranges strand |   exon_id          exon_name exon_rank
>   #          <Rle>      <IRanges>  <Rle> | <integer>        <character> <integer>
>   #   [1]        I [10413, 10585]      + |         1  WBGene00022276.e1         1
>   #   [2]        I [11618, 11689]      + |         6  WBGene00022276.e6         2
>   #   [3]        I [14951, 15160]      + |        11 WBGene00022276.e11         3
>   #   [4]        I [16473, 16842]      + |        14 WBGene00022276.e14         4
>   #   ---
>   #   seqinfo: 7 sequences (1 circular) from an unspecified genome
>
> Note that the *By() extractors don't let the user choose which column
> to return at the moment so that's why it was decided (a long time ago)
> to return exon internal ids *and* names (better more than less).
>
> H.
>
>> Marc
>>
>> On Tue, Sep 22, 2015 at 5:29 PM, Hervé Pagès <hpa...@fredhutch.org>
>> wrote:
>>
>>> Hi Sonali,
>>>
>>> UCSC doesn't provide names for the exons of their gene models.
>>> See the table where this data is coming from:
>>>
>>> https://genome.ucsc.edu/cgi-bin/hgTables?db=hg19_group=genes_track=knownGene_table=knownGene_doSchema=describe+table+schema
>>>
>>> The exon information is all coming from the exonStarts and exonEnds
>>> columns. No exon names!
>>>
>>> H.
>>>
>>> PS: Maybe this would better be asked on the support site.
Re: [Bioc-devel] Txdb Issues - all exon names are NA's ?
Herve is right. UCSC doesn't give us this information. And actually, I think it's pretty rare to see exon names from anybody. So it's weird to me that they are a default return value for this method.

Marc

On Tue, Sep 22, 2015 at 5:29 PM, Hervé Pagès <hpa...@fredhutch.org> wrote:

> Hi Sonali,
>
> UCSC doesn't provide names for the exons of their gene models.
> See the table where this data is coming from:
>
> https://genome.ucsc.edu/cgi-bin/hgTables?db=hg19_group=genes_track=knownGene_table=knownGene_doSchema=describe+table+schema
>
> The exon information is all coming from the exonStarts and exonEnds
> columns. No exon names!
>
> H.
>
> PS: Maybe this would better be asked on the support site.
>
> On 09/22/2015 04:44 PM, Arora, Sonali wrote:
>
>> Hi everyone,
>>
>> I was trying to get the exons by gene from a txdb object but I get NA's
>> for all exon_name's. Please advise.
>>
>> > library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>> > txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>> > ebg2 <- exonsBy(txdb, by="gene")
>> >
>> > ebg2
>> GRangesList object of length 23459:
>> $1
>> GRanges object with 15 ranges and 2 metadata columns:
>>        seqnames               ranges strand   |   exon_id
>>           <Rle>            <IRanges>  <Rle>   | <integer>
>>    [1]    chr19 [58858172, 58858395]      -   |    250809
>>    [2]    chr19 [58858719, 58859006]      -   |    250810
>>    [3]    chr19 [58859832, 58860494]      -   |    250811
>>    [4]    chr19 [58860934, 58862017]      -   |    250812
>>    [5]    chr19 [58861736, 58862017]      -   |    250813
>>    ...      ...                  ...    ... ...       ...
>>   [11]    chr19 [58868951, 58869015]      -   |    250821
>>   [12]    chr19 [58869318, 58869652]      -   |    250822
>>   [13]    chr19 [58869855, 58869951]      -   |    250823
>>   [14]    chr19 [58870563, 58870689]      -   |    250824
>>   [15]    chr19 [58874043, 58874214]      -   |    250825
>>          exon_name
>>        <character>
>>    [1]        <NA>
>>    [2]        <NA>
>>    [3]        <NA>
>>    [4]        <NA>
>>    [5]        <NA>
>>    ...         ...
>>   [11]        <NA>
>>   [12]        <NA>
>>   [13]        <NA>
>>   [14]        <NA>
>>   [15]        <NA>
>>
>> $10
>> GRanges object with 2 ranges and 2 metadata columns:
>>       seqnames               ranges strand | exon_id exon_name
>>   [1]     chr8 [18248755, 18248855]      + |  113603      <NA>
>>   [2]     chr8 [18257508, 18258723]      + |  113604      <NA>
>>
>> ...
>> <23457 more elements>
>> ---
>> seqinfo: 93 sequences (1 circular) from hg19 genome
>> > testgr <- unlist(ebg2)
>> > table(is.na(mcols(testgr)$exon_name))
>>
>>   TRUE
>> 272776
>> > sessionInfo()
>> R version 3.2.2 RC (2015-08-09 r68965)
>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>> Running under: Windows 7 x64 (build 7601) Service Pack 1
>>
>> locale:
>> [1] LC_COLLATE=English_United States.1252
>> [2] LC_CTYPE=English_United States.1252
>> [3] LC_MONETARY=English_United States.1252
>> [4] LC_NUMERIC=C
>> [5] LC_TIME=English_United States.1252
>>
>> attached base packages:
>> [1] stats4    parallel  stats     graphics  grDevices utils
>> [7] datasets  methods   base
>>
>> other attached packages:
>> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.1
>> [2] GenomicFeatures_1.21.29
>> [3] AnnotationDbi_1.31.18
>> [4] Biobase_2.29.1
>> [5] GenomicRanges_1.21.28
>> [6] GenomeInfoDb_1.5.16
>> [7] IRanges_2.3.21
>> [8] S4Vectors_0.7.18
>> [9] BiocGenerics_0.15.6
>>
>> loaded via a namespace (and not attached):
>>  [1] XVector_0.9.4              zlibbioc_1.15.0
>>  [3] GenomicAlignments_1.5.17   BiocParallel_1.3.52
>>  [5] tools_3.2.2                SummarizedExperiment_0.3.9
>>  [7] DBI_0.3.1                  lambda.r_1.1.7
>>  [9] futile.logger_1.4.1        rtracklayer_1.29.27
>> [11] futile.options_1.0.0       bitops_1.0-6
>> [13] RCurl_1.95-4.7             biomaRt_2.25.3
>> [15] RSQLite_1.0.0              Biostrings_2.37.8
>> [17] Rsamtools_1.21.17          XML_3.98-1.3
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fredhutch.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
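Sonali's `table(is.na(...))` check generalizes to any metadata column. The sketch below uses a plain data frame as a stand-in for `mcols(unlist(ebg2))` so that it runs without Bioconductor; the three rows are copied from the output above, and the approach (flag columns that are entirely NA, then drop them) is an illustration rather than anything GenomicFeatures does itself.

```r
# Toy stand-in for mcols(unlist(exonsBy(txdb, by = "gene"))).
exon_meta <- data.frame(
  exon_id   = c(250809L, 250810L, 113603L),
  exon_name = NA_character_,   # UCSC knownGene supplies no exon names
  stringsAsFactors = FALSE
)

# TRUE for every metadata column that carries no information at all.
all_na <- vapply(exon_meta, function(col) all(is.na(col)), logical(1))

# Keep only the informative columns.
exon_meta_clean <- exon_meta[, !all_na, drop = FALSE]
```

For a UCSC-derived TxDb this would flag `exon_name`, while an Ensembl-derived one (as in Hervé's example) would keep it.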
Re: [Bioc-devel] rtracklayer bug?
Hi Arne,

So this time when I look at the bioc-devel email list, I don't see a record for this last name (or this email). In fact the only way I could be sure it was you was that your post was the same... ;)

If you want to post from gmail, then you will need to subscribe the gmail address to the list here:

https://stat.ethz.ch/mailman/listinfo/bioc-devel

Marc

On 06/30/2015 02:26 AM, Arne Müller wrote:

> Hello,
>
> I think there's a problem in the UCSCSession initializer in rtracklayer:
>
> setMethod("initialize", "UCSCSession",
>           function(.Object, url = "http://genome.ucsc.edu/cgi-bin/",
>                    user = NULL, session = NULL, force = FALSE, ...)
>           {
>             .Object@url <- url
>             .Object@views <- new.env()
>             gwURL <- ucscURL(.Object, "gateway")
>             if (force) {
>               gwURL <- paste0(gwURL, '?redirect="manual"')
>             }
>             gw <- httpGet(gwURL, cookiefile = tempfile(), header = TRUE,
>                           .parse = FALSE)
>             if (grepl("redirectTd", gw)) {
>               url <- sub(".*?a href=\"h([^[:space:]]+cgi-bin/).*", "h\\1", gw)
>               return(initialize(.Object, url, user=user, session=session,
>                                 force=TRUE, ...))
>             }
>             cookie <- grep("Set-[Cc]ookie: hguid[^=]*=", gw)
>             if (!length(cookie))
>               stop("Failed to obtain 'hguid' cookie")
>             hguid <- sub(".*Set-Cookie: (hguid[^=]*=[^;]*);.*", "\\1", gw)
>             .Object@hguid <- hguid
>             if (!is.null(user) && !is.null(session)) { ## bring in other session
>               ucscGet(.Object, "tracks",
>                       list(hgS_doOtherUser = "submit",
>                            hgS_otherUserName = user,
>                            hgS_otherUserSessionName = session))
>             }
>             .Object
>           })
>
> Shouldn't '...' be passed to httpGet, which in turn passes it to RCurl, i.e.
>
> gw <- httpGet(gwURL, cookiefile = tempfile(), header = TRUE,
>               .parse = FALSE, ...)
>
> ?
>
> We run an internal instance of the UCSC genome browser and need to pass a
> cookie to all http-requests. The problem is that
>
> session = new('UCSCSession', url=myInternalURL, cookie=myAuthCookie)
>
> does not pass the 'cookie' argument to httpGet.
>
> Regards,
>
> Arne
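The behaviour Arne reports comes down to whether a wrapper forwards `...` to the function it calls. The sketch below shows that mechanism in base R only; `http_get()`, `init_dropping()` and `init_forwarding()` are hypothetical stand-ins and are not rtracklayer or RCurl functions, and the URL and cookie value are placeholders.

```r
# Hypothetical stand-in for a low-level HTTP helper: it only reports
# which extra arguments reached it.
http_get <- function(url, ..., header = FALSE) {
  list(url = url, header = header, extra = list(...))
}

# Wrapper that accepts `...` but never forwards it (the reported behaviour).
init_dropping <- function(url, ...) {
  http_get(url, header = TRUE)
}

# Wrapper that forwards `...` (the proposed change).
init_forwarding <- function(url, ...) {
  http_get(url, header = TRUE, ...)
}

dropped   <- init_dropping("http://example.org/cgi-bin/", cookie = "auth=abc")
forwarded <- init_forwarding("http://example.org/cgi-bin/", cookie = "auth=abc")
```

In the first wrapper the `cookie` argument is silently discarded; in the second it arrives in `forwarded$extra$cookie`.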
Re: [Bioc-devel] Changes in AnnotationDbi
OK Jim,

I will put very simple messages in (one liners) that will simply state whether the relationship between keys and the requested columns was 1:1, 1:many, many:1, or many:many. Hopefully this will represent an acceptable compromise.

Marc

On 06/05/2015 08:37 AM, James W. MacDonald wrote:

> I agree that a warning is probably not the way to go, as it does imply
> that there might have been something wrong with either the input or
> output. Plus, not everybody understands the distinction between error
> and warning. And having additional documentation can't possibly hurt.
> But that assumes that most/some/all of the end users both peruse and
> understand the documentation, which we all know is not the case.
>
> The main issue, for me at least, is that a significant proportion of
> people seem to think there is some sort of uniqueness imposed on things
> like Entrez Gene IDs and Hugo symbols, etc. While that is the ultimate
> goal, we do not have and maybe never will achieve unique IDs for each
> annotatable object. I used to work for a PI who was a very smart, well
> informed statistical geneticist who was absolutely shocked when I
> informed her that a) there are SNPs in dbSNP that have more than one RS
> ID, and that b) there are RS IDs in dbSNP that have been assigned to
> multiple SNPs. She just assumed that there was a one-to-one RS ID - SNP
> mapping.
>
> So this is to me the crux of the problem. It is perfectly valid to
> return one-to-many mappings, and that is what should be expected by
> those of us who already understand such things. But for those of us who
> are ignorant of the details, and those who assume uniqueness of IDs, it
> would be really nice if they got a message telling them something like
>
>   Please note that there are one-to-many mappings between the input and
>   output IDs, so the output is longer than your input vector. Please see
>   ?select for more detail.
>
> And if the message is objectionable to some, you could give the option
> for people to set a global flag to shut it off. Something like
>
>   if(!pleaseMakeItStop)
>       message("message goes here")
>
> and they could set pleaseMakeItStop = TRUE in their .Rprofile
>
> Is that a reasonable compromise?
>
> Jim
>
> On Thu, Jun 4, 2015 at 6:06 PM, Marc Carlson <mcarl...@fredhutch.org>
> wrote:
>
>> Hi Jim,
>>
>> I do agree that the warning was protective for that (this is why I put
>> it there). But it was also annoying for many and a source of some
>> confusion because when people see a warning() they think that something
>> has gone wrong with the code that was just run.
>> [...]
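Jim's opt-out idea can be sketched with R's standard `options()` mechanism. Everything here is an assumption made for illustration: `notify_one_to_many()` is a hypothetical helper, and `pleaseMakeItStop` is just the flag name floated in the thread, not an existing AnnotationDbi option.

```r
# Hypothetical helper: emit a one-line note when a lookup returned more
# than one row for some key, unless the user has opted out.
notify_one_to_many <- function(result_keys) {
  has_dups <- anyDuplicated(result_keys) > 0
  if (has_dups && !isTRUE(getOption("pleaseMakeItStop"))) {
    message("Note: one-to-many mappings between input and output IDs; ",
            "the output is longer than the input. See ?select.")
  }
  invisible(has_dups)
}

result_keys <- c("p1", "p2", "p2")   # "p2" matched two IDs

flag_default <- notify_one_to_many(result_keys)    # prints the note

old <- options(pleaseMakeItStop = TRUE)            # e.g. set in .Rprofile
flag_silenced <- notify_one_to_many(result_keys)   # stays quiet
options(old)                                       # restore previous setting
```

Using `message()` rather than `warning()` keeps the note informational, and the option lets users who already expect one-to-many results turn it off in one place.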
Re: [Bioc-devel] Changes in AnnotationDbi
Hi Jim,

I do agree that the warning was protective for that (this is why I put it there). But it was also annoying for many and a source of some confusion because when people see a warning() they think that something has gone wrong with the code that was just run. And in this case the select method was actually doing exactly what it was supposed to be doing. What it was actually warning you about was what you did separately in that assignment to fit2, which is the step right after the select method already did its work. And I can understand why that seems a little bit confusing since you are basically telling someone to be careful with the data you just gave them.

Now I could replace it with a message() I guess, but in cases like this where the warning is about something that happens outside of the function you are calling, shouldn't that probably be handled by documentation? Or at least, that is the argument that finally persuaded me to remove it. That and the fact that almost every call to select() ended up accompanied by the warning you mentioned, because it turns out that perfect 1:1 relationships are pretty rare for annotation data. Very often, you are going to get back multiple results.

But I didn't just remove the warning, I also supplied an alternative for people who have a real need for consistent 1:1 mapping. The mapIds() method takes most of the same arguments as select, except that unlike select(), it only looks up one column and it always returns a vector that is the same size as the vector that came in. So for your example, you could do something like this pseudocode here:

mapIds(chippackage, featureNames(eset), column="ENTREZID", keytype="PROBEID")

And mapIds will follow a rule specified by the default value for the multiVals argument so that you can get back your results in a 1:1 manner. And if you don't like any of the options available for the multiVals argument, you can make your own function and pass it in.

Anyhow please continue to let us know what you think.

Marc

On 06/04/2015 10:50 AM, James W. MacDonald wrote:

> In the last release, the warning message from select() telling people
> that their results include one-to-many mappings was removed. While some
> may find this warning annoying, I think silently returning something
> unexpected to our users is dangerous.
>
> In other words, for me it is a common practice to do something like this:
>
> fit <- lmFit(eset, design)
> fit2 <- eBayes(fit)
> gns <- select(chippackage, featureNames(eset), c("ENTREZID","SYMBOL"))
> gns <- gns[!duplicated(gns[,1]),]
> fit2$genes <- gns
>
> I add in the step where dups are removed because I already know they are
> there. But a naive user might instead do
>
> fit2$genes <- select(chippackage, featureNames(eset), c("ENTREZID","SYMBOL"))
>
> Which will work just fine, but then all the annotation (except for the
> first few lines) will now be completely incorrect, and there wasn't a
> warning to let the end user know that they may have made a mistake.
>
> lmFit() will parse the featureData slot of an ExpressionSet and use those
> data for annotation, so that gives some hypothetical protections, for
> those who first put their annotation data into their ExpressionSet.
> However, ?eSet says:
>
>   'featureData': Contains variables describing features (i.e., rows in
>   'assayData') unique to this experiment. Use the 'annotation' slot to
>   efficiently reference feature data common to the annotation package
>   used in the experiment. Class: 'AnnotatedDataFrame-class'
>
> Which to me indicates that the featureData slot isn't really intended to
> contain annotation data, but instead some unique information that
> pertains to a given experiment. But maybe I misunderstand. Is the
> featureData slot actually intended for annotation data? If not, what is
> the intended pipeline for annotating data in an ExpressionSet?
>
> Am I alone in being concerned about this?
>
> Best,
>
> Jim
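The difference Marc describes between a select()-style lookup and a mapIds()-style lookup can be illustrated without any annotation package. The sketch below emulates both in base R on a made-up table; the probe and Entrez IDs are invented, and the "first match wins" rule is only an assumed stand-in for a multiVals-like policy.

```r
# Toy annotation table standing in for a chip package (IDs are invented).
anno <- data.frame(
  PROBEID  = c("p1", "p2", "p2", "p3"),
  ENTREZID = c("10", "20", "21", "30"),
  stringsAsFactors = FALSE
)
probes <- c("p1", "p2", "p3")   # stands in for featureNames(eset)

# select()-like lookup: one row per match, so "p2" contributes two rows
# and the result no longer lines up with `probes`.
sel <- anno[anno$PROBEID %in% probes, ]

# Jim's manual fix: keep the first row for each probe.
dedup <- sel[!duplicated(sel$PROBEID), ]

# mapIds()-like lookup: exactly one value per input key, in input order.
first_match <- anno$ENTREZID[match(probes, anno$PROBEID)]
names(first_match) <- probes
```

Here `sel` has four rows for three probes, which is the misalignment Jim warns about when such a result is assigned straight into `fit2$genes`, while `dedup` and `first_match` both line up with the input.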
Re: [Bioc-devel] AnnotationHubData Error: Access denied: 530
Hi Johannes,

We are already planning to upgrade those objects to have that information when they are downloaded... Sonali is actually working on that right now. She will probably have updated that information by the end of the week or so. It's a lot of files to update, but this is already in progress. So if you are willing to wait a few days you can probably save yourself some headaches...

Marc

On 04/11/2015 12:13 PM, Rainer Johannes wrote:

> Hi Marc,
>
> you're right. I'll start with option 1. For that it would however be
> really nice to have the seqinfo available in the GRanges as mentioned in
> my previous mail. In the meantime I'll try to fetch the chrom lengths
> myself but would be nice to have all that ready in the GRanges at some
> point.
>
> cheers, jo
>
> On 11 Apr 2015, at 00:54, Marc Carlson <mcarl...@fredhutch.org> wrote:
>
>> On 04/10/2015 12:18 PM, Rainer Johannes wrote:
>>
>>> dear Sonali, Herve,
>>> [...]
Re: [Bioc-devel] AnnotationHubData Error: Access denied: 530
On 04/10/2015 12:18 PM, Rainer Johannes wrote:

> dear Sonali, Herve,
>
> On 10 Apr 2015, at 19:59, Hervé Pagès <hpa...@fredhutch.org> wrote:
>
>> Hi Johannes, Sonali,
>>
>> On 04/10/2015 09:40 AM, Arora, Sonali wrote:
>>
>>> Hi Rainer,
>>>
>>> Just to be clear - what do you want to be available from
>>> AnnotationHub() in the end?
>>>
>>> Currently the GTF files from Ensembl are already present inside the
>>> AnnotationHub
>>>
>>> library(AnnotationHub)
>>> ah = AnnotationHub()
>>> gtf <- query(ah, "GTF")
>>> gtf <- query(gtf, "Ensembl")
>>> gtf[1]
>>> gtf[[1]] # returned to you as GenomicRanges object.
>>>
>>> - why not get the GTF files directly from AnnotationHub instead of
>>> getting them from the ftp site? Then you can make your EnsDb classes
>>> from these GRanges. It will also make your recipe faster because you
>>> will not have to download the file and parse the object.
>>
>> A GRanges object is not the same as a GTF file and I guess Johannes
>> wants access to the GTF file. Are these GTF files available on
>> AnnotationHub?
>
> yes, you're right. I wanted access to the GTF file and most likely
> understood the AnnotationHub idea wrong... my idea was to build a recipe
> that takes as input the GTF file (as the makeEnsemblGtfToGRanges) and
> generates from that the EnsDb SQLite database file. I thought that these
> SQLite files would be generated on the fly on the user's computer, but I
> guess that stuff is processed once and stored on your servers, right?

Hi Johannes,

So you have several options actually. We sometimes store the files in S3 and then send them down/cache them as requested and other times the hub can just point to an existing ftp site (and files get transformed/cached on the fly when users ask for them). So you have three choices here:

1) You could just write a function that takes in one of the processed GRanges objects and transforms it into an EnsDb object. This should be straightforward and is probably your easiest option since you won't have to write a recipe OR have any code included into the AnnotationHub. You can basically just take advantage of the fact that these data are already there in the hub waiting to be used.

2) You could write R code that transforms a GTF file into a sqlite file and ALSO a recipe to call that (and create metadata) for all the GTF files. This will be more work than #1 since you will have to write both a recipe and port any code that you have for generating the DB files. But when you are done you would be able to have your resources come right out of the AnnotationHub.

3) You could write R code to process a GRanges object into an EnsDb object and then also write a recipe so that your data resources can be served up directly from the AnnotationHub, but still take advantage of what is already there (GRanges). No new data would need to be added to the hub since new metadata records could allow users to transform the data into EnsDb objects on the fly. This is an elegant solution, but it will still take more effort than option #1.

If I were you, I would start with option #1. That way if (after I got that working) I still wanted things to be more elegant, then I could then add a recipe (thus evolving the strategy into option #3)...

Marc

>> @Johannes - Here is one alternative: You could take a different
>> approach and implement some equivalent of makeTxDbFromGRanges() for
>> EnsDb objects. So people could just do:
>>
>> library(ensembldb)
>> ensdb <- makeEnsDbFromGRanges(gtf[[1]])
>>
>> like they can do right now with makeTxDbFromGRanges():
>>
>> library(GenomicFeatures)
>> txdb <- makeTxDbFromGRanges(gtf[[1]])
>>
>> That way you don't need a recipe or try to add things to AnnotationHub
>> at all.
>
> that's a good idea, I will implement that too. just want to make sure
> that I can get all data I'll need (also the genome build version,
> Ensembl version etc from the GRanges, most likely I have to guess that
> from the file name of the RData file).
>
>> @Sonali - These GRanges objects I get from AnnotationHub have no genome
>> information and their seqlevels are not sorted:
>>
>> seqinfo(gtf[[1]])
>> Seqinfo object with 22 sequences from an unspecified genome; no seqlengths:
>>   seqnames seqlengths isCircular genome
>>   X                NA         NA     NA
>>   9                NA         NA     NA
>>   8                NA         NA     NA
>>   7                NA         NA     NA
>>   6                NA         NA     NA
>>   ...             ...        ...    ...
>>   12               NA         NA     NA
>>   11               NA         NA     NA
>>   10               NA         NA     NA
>>   1                NA         NA     NA
>>   MT               NA         NA     NA
>>
>> I know it's easy enough to sort the seqlevels with sortSeqlevels() but
>> what about having these things done by the recipe instead?
>
> I also have a suggestion there: what if you used also the
> fetchChromLengthsFromEnsembl from the GenomicFeatures package? the GTF
> files are anyway from Ensembl, so getting the seqinfo from there would
> make sense... and I wouldn't have to fetch
Re: [Bioc-devel] Feature Request--add host and port to makeTxDbPackageFromBiomart
This is done BTW.

Marc

On 02/27/2015 02:43 PM, Marc Carlson wrote:

Hi Sean,

This seems like a solid suggestion. I have put it into my queue.

Marc

On 02/27/2015 04:41 AM, Sean Davis wrote:

Hi, Marc. Since Ensembl has switched to GRCh38 for their most recent builds, to get access to GRCh37 data now requires a different host and port for biomaRt. These are exposed in the makeTxDbFromBiomart, but not the accompanying functionality to directly make a package. Would it make sense to add host and port as arguments for the latter?

Thanks, Sean

___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
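For readers who want to see what the request amounts to, here is a hedged sketch of pointing makeTxDbFromBiomart() at Ensembl's GRCh37 archive. It assumes the archive host name grch37.ensembl.org, the mart name ENSEMBL_MART_ENSEMBL, a GenomicFeatures version whose makeTxDbFromBiomart() accepts a host argument (more recent Bioconductor releases provide this function through the txdbmaker package instead), and network access. The wrapper name make_grch37_txdb is made up for illustration.

```r
# Hypothetical wrapper; nothing is downloaded until the function is called.
make_grch37_txdb <- function(host = "grch37.ensembl.org") {
  if (!requireNamespace("GenomicFeatures", quietly = TRUE)) {
    stop("The GenomicFeatures package is required for this sketch.")
  }
  # Assumption: this host serves the GRCh37 build of the human gene dataset.
  GenomicFeatures::makeTxDbFromBiomart(
    biomart = "ENSEMBL_MART_ENSEMBL",
    dataset = "hsapiens_gene_ensembl",
    host    = host
  )
}
```

Calling make_grch37_txdb() contacts Ensembl, so it is not run here. The package-building counterpart, makeTxDbPackageFromBiomart(), is the function for which Sean asks that the same host and port arguments be exposed.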
Re: [Bioc-devel] keys function of org.Pf.plasmo.db gives an error.
Hi Paolo,

The ORF type has never been available for that package. This is a bug in the columns method (which I will now fix). Thanks for reporting it.

Marc

On 03/18/2015 03:00 AM, Paolo Martini wrote:

Dear Bioconductor,

I am working with the annotation package org.Pf.plasmo.db. I tried to get the keys from the ORF column.

    library(org.Pf.plasmo.db)
    columns(org.Pf.plasmo.db)
     [1] "ORF" "ENZYME" "PATH" "SYMBOL" "GENENAME"
     [6] "GO" "EVIDENCE" "ONTOLOGY" "GOALL" "EVIDENCEALL"
    [11] "ONTOLOGYALL" "ALIAS2ORF"
    keys(org.Pf.plasmo.db, "ORF")
    Error in sqliteSendQuery(con, statement, bind.data) :
      error in statement: no such table: sgd

I tried both the stable and the devel version, but neither worked. To my knowledge sgd is related to S. cerevisiae. Is the ORF keytype available for Malaria?

Thanks a lot.

    sessionInfo()
    R Under development (unstable) (2015-03-16 r67994)
    Platform: x86_64-unknown-linux-gnu (64-bit)
    Running under: Ubuntu 14.10

    locale:
     [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
     [9] LC_ADDRESS=C LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] parallel stats4 stats graphics grDevices utils datasets
    [8] methods base

    other attached packages:
    [1] org.Pf.plasmo.db_3.1.0 RSQLite_1.0.0 DBI_0.3.1
    [4] AnnotationDbi_1.29.17 GenomeInfoDb_1.3.13 IRanges_2.1.43
    [7] S4Vectors_0.5.22 Biobase_2.27.2 BiocGenerics_0.13.6
Re: [Bioc-devel] keys function of org.Pf.plasmo.db gives an error.
Ok I spoke a little too quickly earlier. Looking back, there should indeed be a field from columns() called 'ORF' and it is basically used here as the central gene ID for this single package. But the code for keys (and columns) was getting this 'ORF' conflated with the other 'ORF' used in resources from SGD (which this has nothing to do with since the data all comes from plasmoDB). Anyhow, I have fixed the software bugs and pushed a patch to release and devel. It should be up in about a day. Marc On 03/18/2015 10:23 AM, Marc Carlson wrote: Hi Paolo, The ORF type has never been available for that package. This is a bug in the columns method (which I will now fix). Thanks for reporting it. Marc On 03/18/2015 03:00 AM, Paolo Martini wrote: Dear Bioconductor, I am working with the annotation package org.Pf.plasmo.db. I tried to get the keys from the ORF column. library(org.Pf.plasmo.db) columns(org.Pf.plasmo.db) [1] ORF ENZYME PATHSYMBOL GENENAME [6] GO EVIDENCEONTOLOGYGOALL EVIDENCEALL [11] ONTOLOGYALL ALIAS2ORF keys(org.Pf.plasmo.db, ORF) Error in sqliteSendQuery(con, statement, bind.data) : error in statement: no such table: sgd I tried both stable and devel version but neither the stable nor the devel seemed to work. To my knowledge sgd is related to S. cerevisiae. Is the ORF keytype availble for Malaria? Thanks a lot. 
sessionInfo() R Under development (unstable) (2015-03-16 r67994) Platform: x86_64-unknown-linux-gnu (64-bit) Running under: Ubuntu 14.10 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4stats graphics grDevices utils datasets [8] methods base other attached packages: [1] org.Pf.plasmo.db_3.1.0 RSQLite_1.0.0 DBI_0.3.1 [4] AnnotationDbi_1.29.17 GenomeInfoDb_1.3.13IRanges_2.1.43 [7] S4Vectors_0.5.22 Biobase_2.27.2 BiocGenerics_0.13.6 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Feature Request--add host and port to makeTxDbPackageFromBiomart
Hi Sean,

This seems like a solid suggestion. I have put it into my queue.

Marc

On 02/27/2015 04:41 AM, Sean Davis wrote:

Hi, Marc. Since Ensembl has switched to GRCh38 for their most recent builds, to get access to GRCh37 data now requires a different host and port for biomaRt. These are exposed in the makeTxDbFromBiomart, but not the accompanying functionality to directly make a package. Would it make sense to add host and port as arguments for the latter?

Thanks, Sean
Re: [Bioc-devel] OrganismDb and associated TxDb
Hi Vince,

First of all thank you for using OrganismDb objects. You raise some interesting points though about keeping these APIs better synchronized that I feel point to some deficiencies in the current design. I spoke with Herve about this and we are puzzling over possibly using inheritance to make this a bit easier for maintenance.

Marc

On 02/13/2015 10:01 AM, Vincent Carey wrote:

Gviz has a nice way of working with TxDb instances to derive gene models. It can be cumbersome to refer to a TxDb instance, and the Homo.sapiens OrganismDb instance is very convenient to work with. I do not see any straightforward way to extract a reference to a TxDb from Homo.sapiens. I could traverse the graph slot but class?OrganismDb makes no reference to this.

In summary, I think it would be good to document the OrganismDb API and to think about preferences for using OrganismDb as opposed to TxDb and OrgDb (org.Hs.eg.db) whenever possible.

BTW I attempted to 'patch' Gviz by substituting OrganismDb for TxDb -- there are only two references to TxDb in Gviz ... and it would seem that the necessary operations apply to OrganismDb just as well as to TxDb. But the APIs are not in sync ... I ran into seqlevels0 ... and that is something of a mystery.
Re: [Bioc-devel] Package submission with library requirement
Hi Avinash,

So the argument for the importance of reproducible research *definitely* resonates with us here as it is a major goal of ours. However, while the decision to use the same library as your paper helps to make the immediate work more reproducible, it simultaneously hampers others from benefiting from that because of the engineering problems that it creates for end users. Ultimately, this project has a longstanding commitment to try and provide not only a way for your previous work to be validated, but also a way for others to build upon and eventually extend that previous work. And both of these goals are crucial if your package is to be valuable to the greater scientific community over the long term.

Anyhow I will try and work with you on our issue tracker to see if we can find a way to resolve this with you. Thanks for contributing!

Marc

On 01/26/2015 08:14 AM, avinash sahu wrote:

Hi Dan,

Now I have included source code of the rsampl.h in the GOAL package. Although rlecuyer is a good candidate for a random number generator, I currently avoid using it because I wanted the results of our submitted manuscript to be completely reproducible. I can reproduce the results using the ransampl library by setting the seed that I have stored. Changing to another random generator library would mean that I have to recheck that the results of the manuscript are reproducible and possibly change some of them, which is not possible at this stage. I will reserve that inclusion for the future. I have resubmitted the GOAL package.

thanks
avi

On Fri, Jan 23, 2015 at 9:37 PM, Levi Waldron levi.wald...@hunter.cuny.edu wrote:

On Fri, Jan 23, 2015 at 1:58 PM, Dan Tenenbaum dtene...@fredhutch.org wrote:

However, you should consider using Rlecuyer as it has no external dependencies (see Levi's post to this thread). Then your package should build on windows.

I think so too - it's also a standard solution in R, implemented natively in r-core's parallel library and suggested by the snow library. I used it in my pensim library before transitioning to parallel, and have tested its streams on hyperthreaded CPUs and clusters.

--
Levi Waldron
Assistant Professor of Biostatistics
City University of New York School of Public Health, Hunter College
2180 3rd Ave Rm 538 New York NY 10035-4003
phone: 212-396-7747
www.waldronlab.org
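Levi's remark that L'Ecuyer streams are available natively through the parallel package can be made concrete with a small sketch. It uses only RNGkind("L'Ecuyer-CMRG") and parallel::nextRNGStream(), both of which ship with R; the helper name draw_streams and its arguments are made up for illustration, and this is not a statement about how the GOAL package or Rlecuyer work internally.

```r
library(parallel)

# Hypothetical helper: draw n_draws uniforms from each of n_streams
# independent L'Ecuyer-CMRG streams, all derived from a single seed.
draw_streams <- function(seed, n_streams, n_draws) {
  old_kind <- RNGkind("L'Ecuyer-CMRG")          # returns the previous settings
  on.exit(do.call(RNGkind, as.list(old_kind)), add = TRUE)
  set.seed(seed)
  stream <- get(".Random.seed", envir = globalenv())
  out <- vector("list", n_streams)
  for (i in seq_len(n_streams)) {
    # Install the current stream state, draw from it, then jump ahead.
    assign(".Random.seed", stream, envir = globalenv())
    out[[i]] <- runif(n_draws)
    stream <- nextRNGStream(stream)
  }
  out
}

first  <- draw_streams(seed = 42, n_streams = 3, n_draws = 5)
second <- draw_streams(seed = 42, n_streams = 3, n_draws = 5)
print(identical(first, second))
```

Storing the single seed is enough to regenerate every stream, which is the reproducibility property the thread is concerned with.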
Re: [Bioc-devel] Replacing deprecated org.Hs.egCHR and friends
Hi Peter,

I would add that you can see a listing of all the currently pre-manufactured TxDb packages here:

http://www.bioconductor.org/packages/devel/BiocViews.html#___TxDb

And for convenience you can also use an OrganismDb package to connect the contents of the TxDb package with the older org packages. You can learn more about those (and the other annotation resources) here:

http://www.bioconductor.org/help/workflows/annotation/annotation/

Hope this helps you to be better acquainted!

Marc

On 01/13/2015 07:30 AM, James W. MacDonald wrote:

Hi Peter,

This isn't a devel question. Next time please ask this sort of thing on the support site.

As for the message, it seems pretty clear to me. The org.Hs.eg.db package doesn't have the chromosomal location data any more, but the relevant TxDb package does have those data, in a much more useful format. The message can't be any more explicit than that, as there is more than one TxDb package for human.

You could have hypothetically gone to the annotation data page (http://bioconductor.org/packages/release/BiocViews.html#___AnnotationData) and searched for, say, 'TxDb', in which case you would see three packages with names like TxDb.Hsapiens.UCSC.hg19.knownGene. Which one you decide to use depends on the build/source you care about. And if you are completely unfamiliar with these packages, you need to read the GenomicFeatures vignette.

Best, Jim

On Tue, Jan 13, 2015 at 12:34 AM, Peter Langfelder peter.langfel...@gmail.com wrote:

Hi all,

can anyone please explain or point me to an explanation of how to replace org.Hs.egCHR and friends that appear to be deprecated in the devel version? The deprecation message isn't very helpful. Thanks!

    x = org.Hs.egCHR
    Warning message:
    In (function () : org.Hs.egCHR is deprecated. Please use an appropriate TxDb object or package for this kind of data.
    sessionInfo()
    R Under development (unstable) (2014-11-24 r67057)
    Platform: x86_64-unknown-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
     [9] LC_ADDRESS=C LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] parallel stats4 stats graphics grDevices utils datasets
    [8] methods base

    other attached packages:
     [1] org.Hs.eg.db_3.0.0 RSQLite_1.0.0 DBI_0.3.1
     [4] AnnotationDbi_1.29.12 GenomeInfoDb_1.3.12 IRanges_2.1.35
     [7] S4Vectors_0.5.16 Biobase_2.27.1 BiocGenerics_0.13.4
    [10] BiocInstaller_1.17.3

    loaded via a namespace (and not attached):
    [1] tools_3.2.0
Re: [Bioc-devel] FW: GO offspring consistency
Hi Jelle,

Thank you for your patience in waiting for my answer here. It took me a lot longer to properly test and validate this than I initially expected.

So if you look at amigo you can see these graph views that show you what the current terms up and downstream of a given GO term should be:

http://amigo.geneontology.org/amigo/term/GO:0006915

vs its offspring term.

http://amigo.geneontology.org/amigo/term/GO:0042981

And you can see (if you click on the inferred tree view for GO:0006915) that GO:0042981 is actually listed there as an offspring term. Which just leaves us with the mystery of why:

    all(subsetapt %in% setapt)

would ever return FALSE.

Now to do some more digging, if we carry your example one step further we can do this to extract the specific terms that have this surprising result:

    subsetapt[!subsetapt %in% setapt]

And let's look closer at the very 1st result (out of 3) that we see: GO:0035602. So now we would expect that:

    GO:0006915 -> GO:0042981 -> GO:0035602

Especially since the very latest amigo diagrams show this set of relationships for this term.

http://amigo.geneontology.org/amigo/term/GO:0035602

But if we look more closely at this term we can notice something unusual about it. Specifically, if you look at the Graph views you will see that it has a 'part of' rather than an 'is a' relationship to the rest of the DAG. An examination of the other two non-compliant terms indicates that they too have this kind of relationship:

http://amigo.geneontology.org/amigo/term/GO:0044336
http://amigo.geneontology.org/amigo/term/GO:0044337

Also of interest is the fact that the highest level term you tested (GO:0006915) has a broader kind of relationship to the rest of the DAG. Now please hold onto those thoughts while I tell you another important fact.

http://amigo.geneontology.org/amigo/term/GO:0006915

The contents of the GOBPOFFSPRING mapping are ultimately derived from the graph_path table that you can find here:

http://geneontology.org/page/lead-database-schema#go-optimisations.table.graph-path

And they are indeed a faithful representation of what is in that table (from GO). That is, the source files, both when I made the latest GO.db package for the October release and now, have the same properties for their set of relationships as you pointed out. So for our 1st example, in both places you will find that GO:0035602 is listed as having an implied link when you ask for GO:0042981 but not when you ask for GO:0006915.

So the very unsatisfying answer to your question is that the terms have this relationship because that is what the data at GO say. :P

But the (hopefully) more satisfying answer is that the kind of relationships that these terms have to each other creates implications for whether or not they can be transitively associated in the GO graph_path table. That is, the child term GO:0035602 is not able to be implicitly linked to GO:0006915 because that term has a 'regulates' relationship to the offspring terms and *also* because GO:0035602 has a 'part of' relationship (instead of an 'is a' relationship) to its parent terms. And those issues don't crop up between the other terms in this part of the graph.

I hope this explains things better for you,

Marc

On 12/02/2014 04:29 AM, jelle.goe...@radboudumc.nl wrote:

Hi All,

When working with the GO.db package we ran into a seeming inconsistency in the GOBPOFFSPRING object. It seems that a term's offspring may have offspring that is not offspring of the term itself. This seems inconsistent with the DAG structure of gene ontology.
    library(GO.db)
    xx <- as.list(GOBPOFFSPRING)
    setapt <- xx$"GO:0006915"    # apoptosis
    subsetapt <- xx$"GO:0042981" # offspring of apoptosis
    "GO:0042981" %in% setapt
    [1] TRUE
    all(subsetapt %in% setapt)
    [1] FALSE

Is there something wrong or are we misunderstanding the GOBPOFFSPRING object?

Best wishes,
Jelle

The Radboud university medical center is listed in the Commercial Register of the Chamber of Commerce under file number 41055629.
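The explanation in this thread, that offspring are inferred through typed relationships and that some combinations of relationship types yield no inferred link, can be illustrated with a toy graph in base R. Everything here is an assumption made for illustration: the term names are invented, the helper offspring_sketch is hypothetical, and the composition table is a simplified reading of how such rules might look (is_a and part_of compose freely, while a part_of step sitting below a regulates step gives no inference). It is not the code or the data behind GO's graph_path table.

```r
# Toy edges: child --rel--> parent (names are made up, not real GO terms).
edges <- data.frame(
  child  = c("regulation_of_X", "signalling_step"),
  parent = c("process_X",       "regulation_of_X"),
  rel    = c("regulates",       "part_of"),
  stringsAsFactors = FALSE
)

# Assumed composition table: compose[lower, upper] is the relation implied
# when a 'lower' edge sits below an 'upper' path; NA means no inference.
rels <- c("is_a", "part_of", "regulates")
compose <- matrix(NA_character_, 3, 3, dimnames = list(rels, rels))
compose["is_a", ] <- rels
compose["part_of", c("is_a", "part_of")] <- "part_of"
compose["regulates", c("is_a", "part_of")] <- "regulates"

# Hypothetical helper: all terms with an inferable path up to 'term'.
offspring_sketch <- function(term, edges, compose) {
  found <- character(0)                       # "child|relation" pairs seen
  queue <- list(list(node = term, rel = NA_character_))
  while (length(queue) > 0) {
    cur <- queue[[1]]
    queue <- queue[-1]
    hits <- edges[edges$parent == cur$node, , drop = FALSE]
    for (i in seq_len(nrow(hits))) {
      rel <- if (is.na(cur$rel)) hits$rel[i] else compose[hits$rel[i], cur$rel]
      if (is.na(rel)) next                    # no inferred relationship
      key <- paste(hits$child[i], rel, sep = "|")
      if (key %in% found) next
      found <- c(found, key)
      queue[[length(queue) + 1]] <- list(node = hits$child[i], rel = rel)
    }
  }
  unique(sub("\\|.*$", "", found))
}

setx    <- offspring_sketch("process_X", edges, compose)
subsetx <- offspring_sketch("regulation_of_X", edges, compose)
print(all(subsetx %in% setx))
```

In this toy graph "signalling_step" is offspring of "regulation_of_X", which is itself offspring of "process_X", yet "signalling_step" is not offspring of "process_X", mirroring the pattern reported for GO:0035602.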
Re: [Bioc-devel] About Hg38 BSgenome
Hi Raffaele,

You are in luck today because while we normally do *not* have mechanisms to harmonize the non-standard chromosome names, for this specific case Herve wrote some code to handle it. So you want to look at this:

    library(GenomeInfoDb)
    ?fetchExtendedChromInfoFromUCSC

Marc

On 12/02/2014 07:15 AM, Julian Gehring wrote:

Hi Raffaele,

Ignore my last post completely, it was overly optimistic: The 'BSgenome.Hsapiens.NCBI.GRCh38' package contains the genomic sequence that is identical between GRCh38 and hg38. The naming of the chromosomes is different. For the toplevel chromosomes, the names can be easily converted:

    library(BSgenome.Hsapiens.NCBI.GRCh38)
    library(TxDb.Hsapiens.UCSC.hg38.knownGene)
    bs = BSgenome.Hsapiens.NCBI.GRCh38
    seqlevelsStyle(bs) = "UCSC" ## convert to UCSC style
    seqlevels(BSgenome.Hsapiens.NCBI.GRCh38)
    seqlevels(bs)
    seqlevels(TxDb.Hsapiens.UCSC.hg38.knownGene)

However, this does not work for the non-toplevel chrs, e.g.: 'HSCHR19KIR_RP5_B_HAP_CTG3_1' does not have a corresponding sequence in the 'TxDb.Hsapiens.UCSC.hg38.knownGene' (and also won't be converted).

Best
Julian

Julian Gehring (12/02/14 15:44):

Hi Raffaele,

You can find it under the name BSgenome.Hsapiens.NCBI.GRCh38

http://bioconductor.org/packages/release/data/annotation/html/BSgenome.Hsapiens.NCBI.GRCh38.html

The naming of the chromosomes has been harmonized between UCSC and GRCh with the new release, so there should be no need for two versions at the genome level.

Best
Julian

On Tue, Dec 2, 2014 at 15:12, Raffaele Adolfo Calogero wrote:

Dear Bioc Team,

I am the maintainer of the chimera package. Recently some of the users asked for the possibility to use chimera with fusions detected on the hg38 human genome. I checked for the availability of hg38 as a BSgenome but I did not find it in the Bioc repository, although TxDb.Hsapiens.UCSC.hg38.knownGene is there. I would like to know whether a release of hg38 as a BSgenome is planned, maybe in the next Bioc release. If it is not planned, could you please suggest what I should read to build it?

Cheers
Raffaele
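Julian's distinction between toplevel chromosome names (easy to convert) and non-toplevel names (left unconverted) can be illustrated without any Bioconductor packages. The following base-R sketch assumes the usual convention that NCBI names 1..22, X, Y become chr1..chr22, chrX, chrY and that MT becomes chrM; the helper name ncbi_to_ucsc_sketch is made up, and this is not how seqlevelsStyle() is implemented.

```r
# Hypothetical helper: rename toplevel NCBI-style names to UCSC style.
# Names it does not recognise (e.g. haplotype scaffolds) are left as they are.
ncbi_to_ucsc_sketch <- function(seqnames) {
  toplevel <- c(as.character(1:22), "X", "Y")
  out <- seqnames
  is_top <- seqnames %in% toplevel
  out[is_top] <- paste0("chr", seqnames[is_top])
  out[seqnames == "MT"] <- "chrM"
  out
}

converted <- ncbi_to_ucsc_sketch(c("1", "X", "MT", "HSCHR19KIR_RP5_B_HAP_CTG3_1"))
print(converted)
```

For real objects, seqlevelsStyle() together with the fetchExtendedChromInfoFromUCSC() function Marc points to is the route suggested in the thread.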
[Bioc-devel] Google hangout on Wed December 10th for new package authors
Hello new package authors, Based on the number of new software packages being submitted to the project it seems that Bioconductor is more popular than ever. Last release we added a hundred and ten new packages (a new record). A lot of the popularity of this project is because Bioconductor packages have to live up to certain minimal standards (Nature Genetics thinks so too, e.g., http://www.nature.com/ng/journal/v46/n1/full/ng.2869.html). For example every Bioconductor package is expected to: 1) provide complete documentation so that new users will know how to use them 2) contain working examples that are run when the package is checked by the build system so that failure can be detected early. 3) cooperate with related packages within the project so as to facilitate code reuse and support reproducible research. We hope you will agree that having such package guidelines is a big win for the whole community. To help *you* contribute to Bioconductor, we are going to have a Google hangout (on air) to allow you to tune in, listen to some tips from Bioconductor package reviewers and then open up the forum for questions. Webinar Invitation: Contributing your package to Bioconductor: guidelines and overview Date: December 10, 2014 Time: 8:00 AM PST /11:00 AM EST Please 'tune in' December 10th at 8AM PST for a Google Hangout to discuss new package contributions. And learn how to maximize the value of your package contribution to the Bioconductor community. ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] AnnotationDbi::loadDb() now requires dbType and dbPackage?
Actually they are required in one sense. Just not required of an end user who would be calling the function (so not for that manual page). But they are required in a separate internal sense (which is what the error is referring to): the database that is loaded must contain metadata that specifies this information. But this is really just an implementation detail that only people who make such databases ever really need to know about. The error in this case just happens to be the 1st one that is hit when your package tries to load a database file that was recently renamed. By the time you read this post though, the problem caused by that test database resource being renamed has already been fixed.

Marc

On 10/10/2014 08:28 AM, Leonardo Collado Torres wrote:

On Fri, Oct 10, 2014 at 11:24 AM, Leonardo Collado Torres lcoll...@jhu.edu wrote:

Hi,

I think that the docs for ?loadDb (AnnotationDbi) need to be updated as described below. According to ?loadDb in AnnotationDbi 1.27.19:

    dbType     dbType - not required
    dbPackage  dbPackage - not required

However, R CMD check for derfinder 0.99.5 and R CMD build for GenomicFeatures 1.17.21 are failing due to dbType not being specified. See

http://bioconductor.org/checkResults/devel/bioc-LATEST/derfinder/oaxaca-checksrc.html

and

http://bioconductor.org/checkResults/devel/bioc-LATEST/GenomicFeatures/zin1-buildsrc.html

So I guess that the documentation needs to be updated or something went wrong after GenomicFeatures 1.17.20 (with AnnotationDbi 1.17.21)

Err, I meant 1.27.19 here because I could get it to work then (see further below).

Anyhow, I'll change the example in derfinder::makeGenomicState and specify dbType and dbPackage

    library('GenomicFeatures')
    samplefile <- system.file('extdata', 'UCSC_knownGene_sample.sqlite',
                              package='GenomicFeatures')
    old <- loadDb(samplefile)
    new <- loadDb(samplefile, dbType = 'TxDb', dbPackage = 'GenomicFeatures')
    ## For some reason they are not identical.
## My guess is that each one has a different connection, while everything else is the ## same. identical(old, new) [1] FALSE ## However, by 'eye' they look the same old TxDb object: | Db type: TranscriptDb | Supporting package: GenomicFeatures | Data source: UCSC | Genome: hg18 | Genus and Species: Homo sapiens | UCSC Table: knownGene | Resource URL: http://genome.ucsc.edu/ | Type of Gene ID: Entrez Gene ID | Full dataset: no | miRBase build ID: NA | transcript_nrow: 135 | exon_nrow: 544 | cds_nrow: 324 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2012-04-13 14:47:54 -0700 (Fri, 13 Apr 2012) | GenomicFeatures version at creation time: 1.9.4 | RSQLite version at creation time: 0.11.1 | DBSCHEMAVERSION: 1.0 new TxDb object: | Db type: TranscriptDb | Supporting package: GenomicFeatures | Data source: UCSC | Genome: hg18 | Genus and Species: Homo sapiens | UCSC Table: knownGene | Resource URL: http://genome.ucsc.edu/ | Type of Gene ID: Entrez Gene ID | Full dataset: no | miRBase build ID: NA | transcript_nrow: 135 | exon_nrow: 544 | cds_nrow: 324 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2012-04-13 14:47:54 -0700 (Fri, 13 Apr 2012) | GenomicFeatures version at creation time: 1.9.4 | RSQLite version at creation time: 0.11.1 | DBSCHEMAVERSION: 1.0 devtools::session_info() Session info-- setting value version R version 3.1.1 (2014-07-10) system x86_64, darwin10.8.0 ui AQUA language (EN) collate en_US.UTF-8 tz America/New_York Packages-- package * version date source AnnotationDbi * 1.27.19 2014-10-05 Bioconductor base64enc 0.1.22014-06-26 CRAN (R 3.1.0) BatchJobs 1.4 2014-09-24 CRAN (R 3.1.1) BBmisc 1.7 2014-06-21 CRAN (R 3.1.0) Biobase * 2.25.1 2014-10-09 Bioconductor BiocGenerics * 0.11.5 2014-09-13 Bioconductor BiocParallel0.99.25 2014-10-02 Bioconductor biomaRt 2.21.5 2014-10-07 Bioconductor Biostrings 2.33.14 2014-09-09 Bioconductor bitops 1.0.62013-08-17 CRAN (R 3.1.0) brew1.0.62011-04-13 CRAN (R 
3.1.0) checkmate 1.4 2014-09-03 CRAN (R 3.1.1) codetools 0.2.92014-08-21 CRAN (R 3.1.1) DBI 0.3.12014-09-24 CRAN (R 3.1.1) devtools1.6.12014-10-07 CRAN (R 3.1.1) digest 0.6.42013-12-03 CRAN (R 3.1.0) fail1.2 2013-09-19 CRAN (R 3.1.0) foreach 1.4.22014-04-11 CRAN (R 3.1.0) futile.logger 1.3.72014-01-23 CRAN (R 3.1.0) futile.options 1.0.02010-04-06 CRAN (R 3.1.0) GenomeInfoDb * 1.1.25 2014-10-02 Bioconductor GenomicAlignments 1.1.30
Re: [Bioc-devel] new error(?) related to annotation: illuminaHumanv1CHR is deprecated
Hi Vince, You raise an important point that a common use of the chipDb objects will become overly complicated with this change. Especially since chip platforms should really have an implicit genome that they were designed for from the get go. And since annotations packages are being build right now there isn't time to address all of these problems optimally so I am going to put these deprecations on ice till sometime after the release. Thanks for you feedback, Marc On 09/22/2014 08:03 PM, Vincent Carey wrote: Thanks for the clarification. Isn't there a way via active bindings to preserve the interfaces conferred by e.g., illuminaHumanv1CHRLOC, so that queries to the object (no longer a Bimap) succeed with the endorsed metadata? the chipDb packages would be revised to use a new protocol for these queries that go through TxDb. On Mon, Sep 22, 2014 at 8:38 PM, Marc Carlson mcarl...@fhcrc.org mailto:mcarl...@fhcrc.org wrote: Hi Vince, So if you wanted to do this manually, then the thing you would want to do is to get a gene ID from the probe and to take that to a TranscriptDb object (again: that is if you wanted to do it manually). Alternatively, if you had an OrganismDb object then this association would be handled for you (where it would be spelled out explicitly). The explicit nature is what we are after here since where a gene is expected to be (chromosome wise) can depend on the build of genome you are using. As people move between standard genomes and eventually to custom ones, we needed to decouple this kind of data from the organism packages (which are only ever intended to hold gene-centered data). Marc On 09/21/2014 08:21 AM, Vincent Carey wrote: On Sun, Sep 21, 2014 at 11:07 AM, Martin Morgan mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org wrote: On 09/21/2014 07:44 AM, Vincent Carey wrote: this is coming out of the build system for GGtools ... 
not easy to find as the problem seems to cause emission of megabytes of warnings illuminaHumanv1CHR is deprecated as the data is better accessed from another location. Please use an appropriate TxDb object or package for this kind of data. i don't see the deprecation in the doc for illuminaHumanv1.db and i cannot get a get() to throw it. i also don't see this on the devel version package landing page Marc will likely reply on Monday. But the intention is that the CHR bimaps in *db packages Marc curates are being deprecated. The deprecation itself occurs in AnnotationDbi, I think. The reason is the lack of provenance for this information -- what genome build does it refer to? -- and its availability from other sources (i.e., the TxDb packages) with provenance. Nice to hear about the streamlining and improved provenance. I confess I don't see how to get a probe-chr mapping out of TxDb -- is there something new in there? A select operation that can resolve queries about manufacturer identifiers? illuminaHumanv1CHR illuminaHumanv1CHR is deprecated as the data is better accessed from another location. Please use an appropriate TxDb object or package for this kind of data. CHR map for chip illuminaHumanv1 (object of class ProbeAnnDbBimap) I think this is currently a message, but should be a warning. AnnotationDbi is not building successfully, so its biocLite() version and landing page are not in sync with an svn checkout (used by the build system); to replicate on your own system requires an svn install, at least until AnnotationDbi builds successfully. OK, so I can get the message now. But I think more details need to be supplied if we are to drop references to *CHR. I guess the megabytes of warnings come from code in GGtools or elsewhere; maybe there's a Indeed. Perhaps unrelated to this. convenient way of aggregating them (hopefully before throwing the warning, since that can be quite expensive). 
Martin [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailto:Bioc-devel@r
Re: [Bioc-devel] deprecated org.Hs.egCHRLOC UPDATE: the as.list() behaves differently in stable and devel configuration
Hi Raffaele,

This problem should be resolved in devel at this time. Please update to the latest version of AnnotationDbi (1.27.14) and try again.

Marc

On 09/23/2014 02:36 PM, calogero UNITO wrote:

Hi Vincent,

I have further investigated the error I have in the vignette of the chimera devel package. It is related to the libraries used to access org.Hs.eg.db in the devel branch.

In the presence of the following libraries (stable branch):

    [1] org.Hs.eg.db_2.14.0 RSQLite_0.11.4 DBI_0.3.0
    [4] AnnotationDbi_1.26.0 GenomeInfoDb_1.0.2 Biobase_2.24.0
    [7] BiocGenerics_0.10.0 BiocInstaller_1.14.2

If I extract the start and end positions for the chromosome location from org.Hs.eg:

    chr.tmps <- as.list(org.Hs.egCHRLOC)
    chr.tmpe <- as.list(org.Hs.egCHRLOCEND)
    as.numeric(chr.tmps[1:3])
    [1] -58858172 18248755 -43248163
    as.numeric(chr.tmpe[1:3])
    [1] -58864865 18258723 -43280376

I get different numbers for the start and end of a gene.

When I use the libraries from the devel branch:

    [1] org.Hs.eg.db_2.14.0 RSQLite_0.11.4 DBI_0.3.0
    [4] AnnotationDbi_1.27.13 GenomeInfoDb_1.1.19 IRanges_1.99.28
    [7] S4Vectors_0.2.4 Biobase_2.25.0 BiocGenerics_0.11.5

    chr.tmps <- as.list(org.Hs.egCHRLOC)
    org.Hs.egCHRLOC is deprecated as the data is better accessed from another location. Please use an appropriate TxDb object or package for this kind of data.
    chr.tmpe <- as.list(org.Hs.egCHRLOCEND)
    org.Hs.egCHRLOC is deprecated as the data is better accessed from another location. Please use an appropriate TxDb object or package for this kind of data.
    as.numeric(chr.tmps[1:3])
    [1] -58858172 18248755 -43248163
    as.numeric(chr.tmpe[1:3])
    [1] -58858172 18248755 -43248163

The values associated with the start and end of the gene are the same. This is actually the reason why I get errors in the vignette of the chimera package.
R 3.1.1: sessionInfo() R version 3.1.1 (2014-07-10) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base Then I installed the basic configuration needed to use org.Hs.eg.db in the actual stable release: source(http://bioconductor.org/biocLite.R;) biocLite(org.Hs.eg.db) library(org.Hs.eg.db) sessionInfo() R version 3.1.1 (2014-07-10) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] org.Hs.eg.db_2.14.0 RSQLite_0.11.4 DBI_0.3.0 [4] AnnotationDbi_1.26.0 GenomeInfoDb_1.0.2 Biobase_2.24.0 [7] BiocGenerics_0.10.0 BiocInstaller_1.14.2 loaded via a namespace (and not attached): [1] IRanges_1.22.10 stats4_3.1.1tools_3.1.1 Then I used the devel release packages: |library(BiocInstaller) useDevel()| library(org.Hs.eg.db) sessionInfo() R version 3.1.1 (2014-07-10) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] C attached base packages: [1] parallel stats4stats graphics grDevices utils datasets [8] methods base other attached packages: [1] org.Hs.eg.db_2.14.0 RSQLite_0.11.4DBI_0.3.0 [4] AnnotationDbi_1.27.13 GenomeInfoDb_1.1.19 IRanges_1.99.28 [7] S4Vectors_0.2.4 Biobase_2.25.0BiocGenerics_0.11.5 On 23/09/14 12:41, Vincent Carey wrote: Of note, this is not an error and seems at this time not even to be a warning. A message is emitted indicating the deprecation, so we have a release to figure out how to deal with the fact that the *CHR/*CHRLOC entities will go away in the next release. There are various possible workarounds. Some more commentary will be forthcoming. On Tue, Sep 23, 2014 at 3:52 AM, calogero UNITO raffaele.calog...@unito.it wrote: Hi, I am the maintainer of chimera package and I am getting the following error in the develop version: org.Hs.egCHRLOC is deprecated as the data is better accessed from another location. 
Please use an appropriate TxDb object or package for this kind of data. Could you please tell me which package I should use instead of org.Hs.eg.db?

Cheers
Raf

--
Prof. Raffaele A. Calogero
Bioinformatics and Genomics Unit
MBC Centro di Biotecnologie Molecolari
Via Nizza 52, Torino 10126
Tel. ++39 0116706457
Fax ++39 0112366457
Mobile ++39 827080
email: raffaele.calog...@unito.it raffaele.calog...@gmail.com
www: http://www.bioinformatica.unito.it

___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
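The deprecation message in this thread points to TxDb packages as the replacement for org.Hs.egCHRLOC and org.Hs.egCHRLOCEND. A minimal sketch of that route follows. It assumes the TxDb.Hsapiens.UCSC.hg19.knownGene package is installed; the choice of hg19 and the three example Entrez Gene IDs are illustrative assumptions, not something the thread prescribes.

```r
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# genes() returns a GRanges with one range per gene, named by gene ID
# (Entrez Gene IDs for this particular TxDb).
gene_ranges <- genes(txdb)

# Example IDs (an assumption for illustration); keep only those the TxDb knows.
ids <- intersect(c("1", "10", "100"), names(gene_ranges))
hits <- gene_ranges[ids]

# Chromosome, start, end and strand now come from one object tied to a
# known genome build, rather than from two separate bimaps.
coords <- data.frame(
    gene_id = names(hits),
    chr     = as.character(seqnames(hits)),
    start   = start(hits),
    end     = end(hits),
    strand  = as.character(strand(hits))
)
coords
```

Unlike the CHRLOC bimaps, which encode the strand as the sign of the position, the GRanges keeps strand as a separate field, so start is never larger than end.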
Re: [Bioc-devel] new error(?) related to annotation: illuminaHumanv1CHR is deprecated
Hi Vince,

So if you wanted to do this manually, then the thing you would want to do is to get a gene ID from the probe and to take that to a TranscriptDb object (again: that is if you wanted to do it manually). Alternatively, if you had an OrganismDb object then this association would be handled for you (where it would be spelled out explicitly). The explicit nature is what we are after here since where a gene is expected to be (chromosome-wise) can depend on the build of genome you are using. As people move between standard genomes and eventually to custom ones, we needed to decouple this kind of data from the organism packages (which are only ever intended to hold gene-centered data).

Marc

On 09/21/2014 08:21 AM, Vincent Carey wrote:

On Sun, Sep 21, 2014 at 11:07 AM, Martin Morgan mtmor...@fhcrc.org wrote:

On 09/21/2014 07:44 AM, Vincent Carey wrote:

this is coming out of the build system for GGtools ... not easy to find as the problem seems to cause emission of megabytes of warnings

    illuminaHumanv1CHR is deprecated as the data is better accessed from another location. Please use an appropriate TxDb object or package for this kind of data.

i don't see the deprecation in the doc for illuminaHumanv1.db and i cannot get a get() to throw it. i also don't see this on the devel version package landing page

Marc will likely reply on Monday. But the intention is that the CHR bimaps in *db packages Marc curates are being deprecated. The deprecation itself occurs in AnnotationDbi, I think. The reason is the lack of provenance for this information -- what genome build does it refer to? -- and its availability from other sources (i.e., the TxDb packages) with provenance.

Nice to hear about the streamlining and improved provenance. I confess I don't see how to get a probe-chr mapping out of TxDb -- is there something new in there? A select operation that can resolve queries about manufacturer identifiers?

    illuminaHumanv1CHR
    illuminaHumanv1CHR is deprecated as the data is better accessed from another location. Please use an appropriate TxDb object or package for this kind of data.
    CHR map for chip illuminaHumanv1 (object of class ProbeAnnDbBimap)

I think this is currently a message, but should be a warning. AnnotationDbi is not building successfully, so its biocLite() version and landing page are not in sync with an svn checkout (used by the build system); to replicate on your own system requires an svn install, at least until AnnotationDbi builds successfully.

OK, so I can get the message now. But I think more details need to be supplied if we are to drop references to *CHR.

I guess the megabytes of warnings come from code in GGtools or elsewhere; maybe there's a convenient way of aggregating them (hopefully before throwing the warning, since that can be quite expensive).

Indeed. Perhaps unrelated to this.

Martin

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793

___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
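The manual route Marc describes in this thread (probe to gene ID, then gene ID to a TxDb) can be sketched roughly as follows. This is an illustration under assumptions, not the package authors' recommended recipe: it assumes illuminaHumanv1.db and TxDb.Hsapiens.UCSC.hg19.knownGene are installed, that hg19 is the build you actually want, and that at least some of the sampled probes map to a gene known to that TxDb.

```r
library(AnnotationDbi)
library(illuminaHumanv1.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# Step 1: manufacturer probe ID -> Entrez Gene ID, from the chip package.
probes <- head(keys(illuminaHumanv1.db, keytype = "PROBEID"), 20)
probe2gene <- select(illuminaHumanv1.db, keys = probes,
                     columns = "ENTREZID", keytype = "PROBEID")

# Step 2: gene ID -> chromosome, from a TxDb with an explicit genome build.
# Only query IDs the TxDb actually contains, so select() gets valid keys.
mapped <- probe2gene$ENTREZID[!is.na(probe2gene$ENTREZID)]
gene_ids <- intersect(mapped, keys(txdb, keytype = "GENEID"))
gene2chr <- unique(select(txdb, keys = gene_ids,
                          columns = "TXCHROM", keytype = "GENEID"))

# Step 3: join the two tables on the gene ID; unmapped probes keep NA.
probe2chr <- merge(probe2gene, gene2chr,
                   by.x = "ENTREZID", by.y = "GENEID", all.x = TRUE)
head(probe2chr)
```

The genome build is now an explicit choice made when picking the TxDb, which is the provenance the deprecated *CHR maps lacked.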
[Bioc-devel] The release is fast approaching. Information about upcoming deadlines.
Hello package contributors, Please note that next Thursday is the deadline for submitting new packages if you want them to make it into the upcoming October release. You can see the release schedule here: http://www.bioconductor.org/developers/release-schedule/ Please also take note of the upcoming deadlines on October 2nd and 6th (for existing packages). Thanks again for participating! Marc ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
[Bioc-devel] important announcement
Hello, This is a second warning that in less than a week we plan to roll out the new support site for Bioconductor. *Important* Once the support site is 'live', posts to the Bioconductor mailing list will receive an automatic reply indicating that it is no longer in service and directing you to the new site. This change affects the 'bioconductor' mailing list; the 'bioc-devel' mailing list will continue to function as before. Marc ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
[Bioc-devel] Announcement about the new support site
Hello, Thank you to those who have participated in the Beta for our new support site. The beta period is now over, and we are getting ready for a formal launch of the site during the week of September 15th. *Important* Once the support site is 'live', posts to the Bioconductor mailing list will receive an automatic reply indicating that it is no longer in service and directing you to the new site. This change affects the 'bioconductor' mailing list; the 'bioc-devel' mailing list will continue to function as before. Marc ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Please help us try out our new support site!
Hi Stephanie,

There are two kinds of tag tracking on offer and they should be independent of each other. So on the site when you log in, there is a tab called 'My Tags'. That tab is supposed to list any posts that are tagged with your tags of interest and you can look at them whenever you log in. In contrast, the 'Watched Tags' field should actually email you whenever someone hits one of the tags you list in that field. So you might want these two lists to have slightly different contents depending on how often you want to get emailed etc. Also related to this, we have already retroactively tagged and imported the past 11+ years of older posts that mention bioconductor package names or biocViews terms. So you should be able to put tags for your packages of interest into 'My Tags' and then see related older posts listed under your My Tags tab right away.

Marc

On 08/19/2014 09:51 AM, Stephanie M. Gogarten wrote:

Are tags in My Tags automatically Watched? Or should I enter those tags in both fields? I really like the option to get email when my packages are mentioned. I think it will mean that users get help faster, since those of us who are not constantly watching the mailing list will see relevant questions right away.

Stephanie

On 8/18/14 12:15 PM, Marc Carlson wrote:

Hello! This is a message to announce the beta test for our new support site. We hope to replace the regular Bioconductor mailing list with this site soon and we have imported the past 11+ years of mailing list discussion into this new site. If you would like to help us test it out you can do that by logging in here: https://support.bioconductor.org

For those of you who have posted to the bioconductor mailing list before, you will probably want to recover your well-earned reputation from previous posts and answers. To do that you will need to scroll to the bottom of the login page and click the link that says 'Forgot Password?'. This should get you started with your mailing list email address which will already be linked to your previous posts. And if you have never posted, then you can start a new account from that same page.

As you explore the beta, you may come across things that you would like to see changed or that you feel are not working right. This site is based on a fork of Biostars, and we ask that you please post such questions to our github repository for this: https://github.com/Bioconductor/support.bioconductor.org/issues

We aspire to switch over to this new site in early September, but we are leaving the schedule flexible depending on how well the beta site works. Also: please note that posts made to the new site during the beta will disappear after the test period. We want you to help us test it, but this is not the live deployment phase quite yet. I expect there will probably be some other questions about this big transition. So please ask them as needed and we will try to answer them the best we can.

Marc

___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
[Bioc-devel] Please help us try out our new support site!
Hello! This is a message to announce the beta test for our new support site. We hope to replace the regular Bioconductor mailing list with this site soon and we have imported the past 11+ years of mailing list discussion into this new site. If you would like to help us test it out you can do that by logging in here: https://support.bioconductor.org For those of you who have posted to the bioconductor mailing list before, you will probably want to recover your well-earned reputation from previous posts and answers. To do that you will need to scroll to the bottom of the login page and click the link that says 'Forgot Password?'. This should get you started with your mailing list email address which will already be linked to your previous posts. And if you have never posted, then you can start a new account from that same page. As you explore the beta, you may come across things that you would like to see changed or that you feel are not working right. This site is based on a fork of Biostars, and we ask that you please post such questions to our github repository for this: https://github.com/Bioconductor/support.bioconductor.org/issues We aspire to switch over to this new site in early September, but we are leaving the schedule flexible depending on how well the beta site works. Also: please note that posts made to the new site during the beta will disappear after the test period. We want you to help us test it, but this is not the live deployment phase quite yet. I expect there will probably be some other questions about this big transition. So please ask them as needed and we will try to answer them the best we can. Marc ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Please help us try out our new support site!
Hi,

So CodersCrowd looks like a pretty neat tool. But what we are trying to accomplish here is less ambitious. Basically there are a host of problems that crop up from using a mailing list at the scale at which we currently use the main bioconductor mailing list. And we are hoping that our new support site will help out with some of them. To give you an idea of what we were thinking, I will list just a few of these problems (in no particular order):

1 - Scale: With 3500 subscribers to the bioconductor mailing list, there is a lot of traffic. This means that a lot of people will either only guest post or they will post and then immediately unsubscribe. As a result, a lot of the people who could most make use of this information are currently having trouble getting access to it. Having a web site means that we can make sure these questions (and their answers) are easily searchable and can be read by anyone. The other side of this change is that if you are carefully answering questions, your well-earned reputation and your careful answers should now find a wider audience.

2 - Repetition: A lot of times the same questions get asked over and over again. This is bad for everyone, but is especially annoying for our package authors, who sometimes have to spend a lot of their time answering similar questions over and over again. Our hope is that by capturing your responses in a searchable format the first time, more users will be able to discover your hard work and thus benefit from it later on.

3 - Mistakes: Sometimes some spam or something embarrassing will get through. And with mailing lists everything that happens is written in permanent ink. We would rather be able to delete spam from the public record and, when appropriate, let you amend statements so that they better reflect what you intended.
It's also our hope that by amending your answers to questions, you can keep your answers current instead of crafting new responses from scratch each time.

Anyhow, those are just some of the problems that we were hoping to address. I hope that the new site helps with these.

Marc

On 08/18/2014 03:56 PM, Aniba, Radhouane wrote:

Hi Marc, I am a bit surprised to see that move to a Biostars-like website. Why another Q&A website? Why don't you consider a sandbox-style website like CodersCrowd, which already has a Docker image of R and Bioconductor where users can reproduce their bugs? Just saying ... R/Bioconductor support involves deep programming problem solving, and the support site should reflect that, not just be a copy and paste (fork) of Biostars (I have nothing against Biostars, btw). That's my personal opinion of course :)

Rad

On Aug 18, 2014, at 12:15 PM, Marc Carlson mcarl...@fhcrc.org wrote:

Hello! This is a message to announce the beta test for our new support site. We hope to replace the regular Bioconductor mailing list with this site soon and we have imported the past 11+ years of mailing list discussion into this new site. If you would like to help us test it out you can do that by logging in here: https://support.bioconductor.org

For those of you who have posted to the bioconductor mailing list before, you will probably want to recover your well-earned reputation from previous posts and answers. To do that you will need to scroll to the bottom of the login page and click the link that says 'Forgot Password?'. This should get you started with your mailing list email address which will already be linked to your previous posts. And if you have never posted, then you can start a new account from that same page.

As you explore the beta, you may come across things that you would like to see changed or that you feel are not working right. This site is based on a fork of Biostars, and we ask that you please post such questions to our github repository for this: https://github.com/Bioconductor/support.bioconductor.org/issues

We aspire to switch over to this new site in early September, but we are leaving the schedule flexible depending on how well the beta site works. Also: please note that posts made to the new site during the beta will disappear after the test period. We want you to help us test it, but this is not the live deployment phase quite yet. I expect there will probably be some other questions about this big transition. So please ask them as needed and we will try to answer them the best we can.

Marc

___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
[Bioc-devel] Question about which new organism resources to create
Hi everyone,

As many of you already know, we have long provided organism annotation packages that give gene-based annotations for selected organisms. And we intend to keep doing that. But these days there is also a lot of other data at NCBI that could be used to make gene-based databases for other organisms. And at the same time, there is greater and greater demand for annotations from other organisms too. So I aim to make organism-based gene databases for a wider range of organisms. However, instead of just making more packages, I intend to put these DBs into the AnnotationHub. You can get an idea of what access will be like by looking at the inparanoid8 objects that were put in for the last release.

    library(AnnotationHub)
    ah = AnnotationHub()
    hs8 = ah$inparanoid8.Orthologs.hom.Homo_sapiens.inp8.sqlite
    hs8
    columns(hs8)
    k = head(keys(hs8, 'TOXOPLASMA_GONDII'))
    select(hs8, k, 'HOMO_SAPIENS', 'TOXOPLASMA_GONDII')
    ## etc.

Anyhow, my reason for posting is that I am now looking at all the NCBI data that could be used for annotation packages and trying to decide what to include. About half of the 14 thousand potential critters in the NCBI dataset only have about one gene annotated. I am guessing that it is not worth anyone's time to pre-process those organisms that have only one gene. Or is it? If you think it might be, now would probably be a good time to speak up. How many annotations do you guys want/expect in an organism package before it becomes annoying that you even downloaded it?

Thanks in advance for your opinions,

Marc

___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] RE : AnnotationDbi and select function
Also,

There is nothing wrong with using GENEID the way that you initially did. It was just a small bug that prevented some internal subsetting from working properly and that is now fixed. It just happened that GENEID was equivalent to ENTREZID in this case. And that ends up making it a slower choice just because the software has to do more work (in case GENEID is something else). So since you know that these are in fact ENTREZIDs, you can take Jim's suggestion as a shortcut and thus get a little performance boost. But ENTREZID is a more specific thing to request than GENEID (which could potentially be another kind of ID). So the two things (GENEID and ENTREZID) are not always the same kind of thing. They just happened to both be ENTREZID in *this* case. In a different scenario GENEID from the associated TranscriptDb might be something like an ensembl gene ID. And then to use a shortcut would mean using ENSEMBL instead of ENTREZID to do the shortcut... In contrast: GENEID should normally always work (but it should also be a tiny bit slower). Sorry if you know all this stuff, but I think it's better to be explicit than to say too little.

Marc

On 03/12/2014 02:53 PM, Marc Carlson wrote:

I just checked a fix in for this bug to GenomicFeatures (which happens to be where the problem was). It should percolate out to the build system soon.

Marc

On 03/12/2014 02:19 PM, Servant Nicolas wrote:

Hi guys, Thanks for your feedback. Indeed I put GENEID because it is used in the txdb database.

    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
    columns(txdb)
    [1] "CDSID" "CDSNAME" "CDSCHROM" "CDSSTRAND" "CDSSTART"
    [6] "CDSEND" "EXONID" "EXONNAME" "EXONCHROM" "EXONSTRAND"
    [11] "EXONSTART" "EXONEND" "GENEID" "TXID" "EXONRANK"
    [16] "TXNAME" "TXCHROM" "TXSTRAND" "TXSTART" "TXEND"

I will move to ENTREZID which is much faster!
I'm glad it could help

Nicolas

From: bioc-devel-boun...@r-project.org [bioc-devel-boun...@r-project.org] on behalf of Marc Carlson [mcarl...@fhcrc.org]
Sent: Wednesday, 12 March 2014 20:18
To: bioc-devel@r-project.org
Subject: Re: [Bioc-devel] AnnotationDbi and select function

Thanks Nicolas! That's a good bug. I will work on a fix. The reason why James's workaround here functions is that the number of databases it has to query is fewer by one. It is also faster for this reason. So when you say GENEID you mean the ids used in the associated txdb database, which means that these have to be checked against that DB (and anything related to it extracted) and then merged with the results of the symbol information by joining on the foreign key for these two DBs. So that's actually much more complex than just extracting all the same data from just the org package, even though the end result (in this case) is the same. The bug is probably happening in the associated merge step.

Marc

On 03/12/2014 10:06 AM, James W. MacDonald wrote:

Hi Nicolas,

On 3/12/2014 12:39 PM, Servant Nicolas wrote:

Dear all, I have an error using the select function from the AnnotationDbi package. I try to convert some geneID into Symbol, but for some strange reason it crashed.
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
    isActiveSeq(txdb)[seqlevels(txdb)] <- FALSE
    isActiveSeq(txdb)[c("chr16", "chr1")] <- TRUE
    geneGR <- exonsBy(txdb, "gene")
    library(Homo.sapiens)
    symbol <- select(Homo.sapiens, keys = names(geneGR), keytype = "GENEID", columns = "SYMBOL")
    Error in head(select(Homo.sapiens, keys = names(geneGR)[1:1001], keytype = "GENEID", :
      error in evaluating the argument 'x' in selecting a method for function 'head':
      Error in res[, .reverseColAbbreviations(x, cnames), drop = FALSE] :
    length(geneGR)
    [1] 3269
    ## The first 1K work
    symbol <- select(Homo.sapiens, keys = names(geneGR)[1:1000], keytype = "GENEID", columns = "SYMBOL")
    ## The 1K+1 does not !
    symbol <- select(Homo.sapiens, keys = names(geneGR)[1:1001], keytype = "GENEID", columns = "SYMBOL")
    Error in res[, .reverseColAbbreviations(x, cnames), drop = FALSE] :
      incorrect number of dimensions

It looks like I cannot convert more than 1K elements?? Any reason for that? Thank you very much

Nicolas

Not sure what 'GENEID' is in this context - it appears to be Entrez Gene. But anyway, if you use ENTREZID instead, it works fine:

    symbol <- select(Homo.sapiens, names(geneGR), "SYMBOL", "ENTREZID")
    symbol <- select(Homo.sapiens, names(geneGR), "GENEID", "ENTREZID")
    Error in res[, .reverseColAbbreviations(x, cnames), drop = FALSE] :
      incorrect number of dimensions
    symbol <- select(Homo.sapiens, names(geneGR)[1:1000], "GENEID", "ENTREZID")
    symbol <- select(Homo.sapiens, names(geneGR)[1:1001], "GENEID", "ENTREZID")
    Error in res[, .reverseColAbbreviations(x, cnames), drop = FALSE] :
      incorrect number of dimensions

Best,
Jim

    sessionInfo()
    R Under development (unstable) (2014-03-05
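The GENEID versus ENTREZID distinction discussed in this thread can be sketched as below. This is an illustration only: it assumes the Homo.sapiens package is installed and, as Marc notes, relies on the fact that for this particular OrganismDb the TxDb gene IDs happen to be Entrez Gene IDs. With a TxDb built on other identifiers, the second call would need a different keytype.

```r
library(Homo.sapiens)

# A handful of gene IDs as the underlying TxDb knows them.
ids <- head(keys(Homo.sapiens, keytype = "GENEID"), 5)

# Route 1: GENEID. The query goes through the TxDb and is then joined to
# the org package on the foreign key, so two databases are involved.
by_geneid <- select(Homo.sapiens, keys = ids,
                    columns = "SYMBOL", keytype = "GENEID")

# Route 2: ENTREZID. Only the org package needs to be consulted, which is
# why it is the faster shortcut when the IDs really are Entrez Gene IDs.
by_entrez <- select(Homo.sapiens, keys = ids,
                    columns = "SYMBOL", keytype = "ENTREZID")

by_geneid
by_entrez
```

The two results should carry the same symbols here, but only the first call stays correct if the TxDb is swapped for one keyed on another kind of gene ID.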
Re: [Bioc-devel] Update policy on experiment data and annotation packages
Hi Julian,

This is a complicated issue for us and we have to choose our next move carefully since we don't have unlimited resources. Especially not with respect to time. But I wanted to let you know that we appreciate your comment and that we are still thinking about it.

Marc

On 10/10/2013 03:05 AM, Julian Gehring wrote:

Hi, What is the consensus on updating data in experiment data and annotation packages? The bioc website [1] does not state any differences between the two package types in terms of updating their content. From the bioc core, I have the information that (a) experimental data packages should represent 'frozen' data and not get updated over release cycles, while (b) annotation packages should get updated with every release cycle. Should we add this information to the website? I'm curious what this means for experimental data that accumulates over time, i.e. data from big consortia, as represented by e.g. 'curatedOvarianData', 'SomaticCancerAlterations', and others. Should one create a new package with each release cycle (indicating the data version in the package name, as the 'SNPlocs*' packages) to ensure reproducibility? Or update an annotation package with each release, and try to ensure backwards compatibility within the package itself?

Best wishes
Julian

[1] http://bioconductor.org/developers/package-guidelines/#package-types

___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
[Bioc-devel] Courtesy message about the upcoming release
Hello everyone, This is a courtesy message to remind all package developers that next Wednesday the 9th of October is the deadline for all packages to pass the build system without any errors or warnings. Please have a look at our build system for the development branch and make sure that the packages you maintain are not causing any errors or warnings: http://www.bioconductor.org/checkResults/2.13/bioc-LATEST/ Also you can see our release schedule here if you have questions about the dates: http://www.bioconductor.org/developers/release-schedule/ Marc ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] isActiveSeq deprecated
I actually considered this, but I opted to do it this way just for the sake of being consistent (which was my whole mission for implementing seqlevels in here in the first place). Now I could make it more convenient here and break consistency with how it is used elsewhere, but what do people prefer? Consistent or convenient?

Marc

On 09/18/2013 10:40 AM, Hervé Pagès wrote:

Hi Marc, Wouldn't it make sense to just ignore the 'force' arg when dropping the seqlevels of a TranscriptDb? The 'force' argument is FALSE by default and this prevents seqlevels<- from shrinking GRanges or other vector-like objects when the user tries to drop seqlevels that are in use. Internally seqlevels<- calls seqlevelsInUse() to get the seqlevels currently in use and see if they intersect with the seqlevels to drop. In the TranscriptDb situation, people always have to use 'force=TRUE' to drop seqlevels, regardless of whether the levels to drop are in use or not (the seqlevelsInUse() getter not being defined for TranscriptDb objects, I suspect seqlevels<- doesn't look at this). So maybe 'force' could just be ignored for TranscriptDb objects? That would make seqlevels<- a little bit more user-friendly on those objects.

Thanks,
H.

On 09/13/2013 10:38 AM, Marc Carlson wrote:

Hi Florian, Yes we are trying to make things more uniform. seqlevels() lets you rename as well as deactivate chromosomes you want to ignore, so it was really redundant with isActiveSeq(). So we are moving away from isActiveSeq() just so that users have less to learn about. The reason why isActiveSeq was different from seqlevels was just because it was born for a TranscriptDb (which is based on an annotation database) instead of being born on a GRanges object. So seqlevels was the more general tool.

Marc

On 09/13/2013 07:24 AM, Hahne, Florian wrote:

Hi Marc, I saw these warnings in Gviz, but they stem from GenomicFeatures

    Warning messages:
    1: 'isActiveSeq' is deprecated. Use 'seqlevels' instead. See help(Deprecated) and help(GenomicFeatures-deprecated).
    2: 'isActiveSeq' is deprecated. Use 'seqlevels' instead. See help(Deprecated) and help(GenomicFeatures-deprecated).
    3: 'isActiveSeq<-' is deprecated. Use 'seqlevels' instead. See help(Deprecated) and help(GenomicFeatures-deprecated).
    4: 'isActiveSeq<-' is deprecated. Use 'seqlevels' instead. See help(Deprecated) and help(GenomicFeatures-deprecated).
    5: 'isActiveSeq' is deprecated. Use 'seqlevels' instead. See help(Deprecated) and help(GenomicFeatures-deprecated).
    6: 'isActiveSeq<-' is deprecated. Use 'seqlevels' instead. See help(Deprecated) and help(GenomicFeatures-deprecated).

So has the whole idea of active chromosomes in the database been dropped? I could not find anything in the change notes. Do I get it right that you can now do

    seqlevels(txdb, force=TRUE) <- "chr1"

if you just want the first chromosome to be active?

Florian

___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] isActiveSeq deprecated
Thanks Florian,

I just checked in a fix for this. Please let me know if you find any other quirks.

Marc

On 09/16/2013 05:33 AM, Hahne, Florian wrote:

Hey Marc, I think your move towards seqlevels is not quite working yet:

    samplefile <- system.file("extdata", "UCSC_knownGene_sample.sqlite", package="GenomicFeatures")
    txdb <- loadDb(samplefile)
    ## This works fine
    fiveUTRsByTranscript(txdb)
    ## This breaks
    seqlevels(txdb, force=TRUE) <- "chr6"
    fiveUTRsByTranscript(txdb)
    Error in relist(x, f) : shape of 'skeleton' is not compatible with 'NROW(flesh)'

Deep in the guts of this you are trying to build a GRanges object with NAs as seqlevels, and it doesn't really like that.

Florian

    sessionInfo()
    R version 3.0.1 RC (2013-05-12 r62736)
    Platform: i386-apple-darwin12.3.0/i386 (32-bit)
    locale:
    [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
    attached base packages:
    [1] parallel grid stats graphics grDevices utils datasets
    [8] methods base
    other attached packages:
    [1] GenomicFeatures_1.13.37 AnnotationDbi_1.23.23 Biobase_2.21.7
    [4] GenomicRanges_1.13.43 XVector_0.1.4 IRanges_1.19.36
    [7] BiocGenerics_0.7.5 Gviz_1.5.11 BiocInstaller_1.11.4
    loaded via a namespace (and not attached):
    [1] biomaRt_2.17.2 Biostrings_2.29.18 biovizBase_1.9.2
    [4] bitops_1.0-6 BSgenome_1.29.1 cluster_1.14.4
    [7] colorspace_1.2-2 DBI_0.2-7 dichromat_2.0-0
    [10] Hmisc_3.12-2 labeling_0.2 lattice_0.20-23
    [13] munsell_0.4.2 plyr_1.8 RColorBrewer_1.0-5
    [16] RCurl_1.95-4.1 rpart_4.1-3 Rsamtools_1.13.39
    [19] RSQLite_0.11.4 rtracklayer_1.21.11 scales_0.2.3
    [22] stats4_3.0.1 stringr_0.6.2 tools_3.0.1
    [25] XML_3.98-1.1 zlibbioc_1.7.0

From: Marc Carlson mcarl...@fhcrc.org
Date: Friday, September 13, 2013 7:38 PM
To: Florian Hahne florian.ha...@novartis.com
Cc: bioc-devel@r-project.org
Subject: Re: isActiveSeq deprecated

Hi Florian, Yes we are trying to make things more uniform. seqlevels() lets you rename as well as deactivate chromosomes you want to ignore, so it was really redundant with isActiveSeq(). So we are moving away from isActiveSeq() just so that users have less to learn about. The reason why isActiveSeq was different from seqlevels was just because it was born for a TranscriptDb (which is based on an annotation database) instead of being born on a GRanges object. So seqlevels was the more general tool.

Marc

On 09/13/2013 07:24 AM, Hahne, Florian wrote:

Hi Marc, I saw these warnings in Gviz, but they stem from GenomicFeatures

    Warning messages:
    1: 'isActiveSeq' is deprecated. Use 'seqlevels' instead. See help(Deprecated) and help(GenomicFeatures-deprecated).
    2: 'isActiveSeq' is deprecated. Use 'seqlevels' instead. See help(Deprecated) and help(GenomicFeatures-deprecated).
    3: 'isActiveSeq<-' is deprecated. Use 'seqlevels' instead. See help(Deprecated) and help(GenomicFeatures-deprecated).
    4: 'isActiveSeq<-' is deprecated. Use 'seqlevels' instead. See help(Deprecated) and help(GenomicFeatures-deprecated).
    5: 'isActiveSeq' is deprecated. Use 'seqlevels' instead. See help(Deprecated) and help(GenomicFeatures-deprecated).
    6: 'isActiveSeq<-' is deprecated. Use 'seqlevels' instead. See help(Deprecated) and help(GenomicFeatures-deprecated).

So has the whole idea of active chromosomes in the database been dropped? I could not find anything in the change notes. Do I get it right that you can now do

    seqlevels(txdb, force=TRUE) <- "chr1"

if you just want the first chromosome to be active?

Florian

___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] isActiveSeq deprecated
Hi Florian,

Yes we are trying to make things more uniform. seqlevels() lets you rename as well as deactivate chromosomes you want to ignore, so it was really redundant with isActiveSeq(). So we are moving away from isActiveSeq() just so that users have less to learn about. The reason why isActiveSeq was different from seqlevels was just because it was born for a TranscriptDb (which is based on an annotation database) instead of being born on a GRanges object. So seqlevels was the more general tool.

Marc

On 09/13/2013 07:24 AM, Hahne, Florian wrote:

Hi Marc, I saw these warnings in Gviz, but they stem from GenomicFeatures

    Warning messages:
    1: 'isActiveSeq' is deprecated. Use 'seqlevels' instead. See help(Deprecated) and help(GenomicFeatures-deprecated).
    2: 'isActiveSeq' is deprecated. Use 'seqlevels' instead. See help(Deprecated) and help(GenomicFeatures-deprecated).
    3: 'isActiveSeq<-' is deprecated. Use 'seqlevels' instead. See help(Deprecated) and help(GenomicFeatures-deprecated).
    4: 'isActiveSeq<-' is deprecated. Use 'seqlevels' instead. See help(Deprecated) and help(GenomicFeatures-deprecated).
    5: 'isActiveSeq' is deprecated. Use 'seqlevels' instead. See help(Deprecated) and help(GenomicFeatures-deprecated).
    6: 'isActiveSeq<-' is deprecated. Use 'seqlevels' instead. See help(Deprecated) and help(GenomicFeatures-deprecated).

So has the whole idea of active chromosomes in the database been dropped? I could not find anything in the change notes. Do I get it right that you can now do

    seqlevels(txdb, force=TRUE) <- "chr1"

if you just want the first chromosome to be active?

Florian

___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
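The move from isActiveSeq() to seqlevels() described in this thread can be sketched as follows. This is a hedged illustration, not the exact code of the time: it assumes a recent GenomicFeatures together with TxDb.Hsapiens.UCSC.hg19.knownGene, where the setter no longer needs the force=TRUE argument shown in the thread and where seqlevels0() returns the levels originally stored in the database.

```r
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# Old idiom, deprecated in the thread above:
#   isActiveSeq(txdb)[seqlevels(txdb)] <- FALSE
#   isActiveSeq(txdb)["chr1"] <- TRUE

# seqlevels-based equivalent: restrict the TxDb to chr1.
# At the time of the thread this was written as
#   seqlevels(txdb, force = TRUE) <- "chr1"
seqlevels(txdb) <- "chr1"

# Extractors now only see the retained chromosome.
tx_chr1 <- transcripts(txdb)
active <- unique(as.character(seqnames(tx_chr1)))
active

# Restore the full set of sequence levels stored in the database, so that
# later code using the same TxDb object is not silently restricted.
seqlevels(txdb) <- seqlevels0(txdb)
n_levels <- length(seqlevels(txdb))
n_levels
```

Because a TxDb is a reference object, the restriction persists until it is reset, which is why the sketch ends by restoring the original levels.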
Re: [Bioc-devel] svn for annotation packages
Yes, I try to get updates on everything every time.

Marc

On 09/12/2013 10:38 AM, Kasper Daniel Hansen wrote:
> Ok, sounds good. This is especially nice to know for the annotation packages which are hand created as opposed to being created by some script.
>
> Kasper
>
> On Thu, Sep 12, 2013 at 1:22 PM, Marc Carlson <mcarl...@fhcrc.org> wrote:
>> Hi Kasper,
>>
>> You should get an email from me in the coming weeks with instructions regarding the upcoming release. If you need something changed before then, please send me an email.
>>
>> Marc
>>
>> On 09/11/2013 06:17 PM, Dan Tenenbaum wrote:
>>> Annotation packages are not in svn. Send your changes to Marc.
>>>
>>> Dan
>>>
>>> Kasper Daniel Hansen <kasperdanielhan...@gmail.com> wrote:
>>>> What is the url? Or should I not work from subversion, if I want to update a package? The HOWTO is a bit unclear.
>>>>
>>>> (Want to work on IlluminaHumanMethylation450kannotation.ilmn_v1.2)
>>>>
>>>> Best,
>>>> Kasper
Re: [Bioc-devel] svn for annotation packages
Hi Kasper,

You should get an email from me in the coming weeks with instructions regarding the upcoming release. If you need something changed before then, please send me an email.

Marc

On 09/11/2013 06:17 PM, Dan Tenenbaum wrote:
> Annotation packages are not in svn. Send your changes to Marc.
>
> Dan
>
> Kasper Daniel Hansen <kasperdanielhan...@gmail.com> wrote:
>> What is the url? Or should I not work from subversion, if I want to update a package? The HOWTO is a bit unclear.
>>
>> (Want to work on IlluminaHumanMethylation450kannotation.ilmn_v1.2)
>>
>> Best,
>> Kasper
Re: [Bioc-devel] heavy vignette
Hi Carles,

In general there are three steps to consider in turn:

1) Look at the repository of experiment data packages. If there are existing packages there with data that you can use, then you probably want to use those.

2) Look for ways to make the data smaller and still get your testing/examples done. Maybe you don't need 30 CEL files? Maybe you only need the ExpressionSet object that they eventually result in?

3) If you still really need all 30 raw files, then it sounds like you might need to make a new package to hold them. If that is the case, then please document them carefully. This is so that others who come along can find them at step 1 above...

Marc

On 07/23/2013 03:27 AM, Hernandez Ferrer, Carles wrote:
> Hello to everyone,
>
> Related to vignette creation: I'm developing an R package to preprocess raw Affymetrix data for two other packages (MAD and inveRsion). The idea is to publish it in Bioconductor, so an executable vignette must be written, but to test the package functionality I need ~30 CEL files (from 45.5 Mb to 70.0 Mb approx. per file) for each of the allowed technologies (4 technologies).
>
> How do you recommend I develop the vignette or store the needed data?
>
> Carles Hernandez-Ferrer
> Centre for Research in Environmental Epidemiology - CREAL
> Parc de Recerca Biomèdica de Barcelona - PRBB
> Doctor Aiguader, 88 | 08003 Barcelona, Spain
> chernan...@creal.cat | 93 214 75 78
> www.creal.cat
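
Step 2 above can be sketched in base R: keep only a processed object, subset it to what the examples need, and save it compressed. The matrix below is simulated and the file names are temporary placeholders; a real package would start from its own processed data, so treat this as an illustration of the idea rather than a recipe.

```r
# Simulated stand-in for a processed expression matrix (probes x samples).
set.seed(1)
expr <- matrix(rnorm(5000 * 30), nrow = 5000,
               dimnames = list(paste0("probe", 1:5000), paste0("sample", 1:30)))

# Keep a small subset that is still enough to exercise the examples.
expr_small <- expr[1:500, 1:6]

full_file  <- tempfile(fileext = ".rds")
small_file <- tempfile(fileext = ".rds")
saveRDS(expr, full_file, compress = "xz")
saveRDS(expr_small, small_file, compress = "xz")

# Compare on-disk sizes in bytes.
sizes <- c(full = file.size(full_file), small = file.size(small_file))
print(sizes)
```

How small the subset can be depends entirely on what the vignette has to demonstrate.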
Re: [Bioc-devel] Fwd: [BioC] Problem reading VCF file using readVcf from package VariantAnnotation
Hi Michael,

Yes.

    library(Homo.sapiens)
    cols(Homo.sapiens)
    txs <- transcripts(Homo.sapiens, columns=c("SYMBOL"))
    exs <- exons(Homo.sapiens, columns=c("SYMBOL"))

Marc

On 04/30/2013 03:07 PM, Michael Lawrence wrote:
> Hi Marc,
>
> Do you know if it is easy yet to get the gene symbols returned as a result of e.g. a transcripts() or exons() call?
>
> Michael
>
> On Tue, Apr 30, 2013 at 2:16 PM, Marc Carlson <mcarl...@fhcrc.org> wrote:
>> Related to this: I have added getters for seqinfo (and friends) for the OrganismDb objects. I have not added the setters yet though, since that requires some refactoring of what an OrganismDb object actually is internally. But I intend to do this also.
>>
>> Marc
>>
>> On 04/25/2013 09:32 AM, Valerie Obenchain wrote:
>>> Hi Vince, Kasper,
>>>
>>> cc'ing Herve and Marc.
>>>
>>> I think we have a couple of things going on, so I wanted to clarify.
>>>
>>> The 'genome' argument to readVcf() is assigned to the GRanges in rowData with the genome<- setter. This is where .normargGenome() gets called.
>>>
>>>     setReplaceMethod("genome", "Seqinfo",
>>>         function(x, value)
>>>         {
>>>             x@genome <- .normargGenome(value, seqnames(x))
>>>             x
>>>         }
>>>     )
>>>
>>> If the 'genome' replacement value is named, the name(s) must match the seqnames, not the build. So we aren't talking about matching compatible builds,
>>>
>>>     fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
>>>     vcf <- readVcf(fl, c(b37="hg19"))   ## this is wrong
>>>     vcf <- readVcf(fl, c(hg19="hg19"))  ## also wrong
>>>
>>> Instead the name must be the seqname, the value is the build,
>>>
>>>     vcf <- readVcf(fl, c("20"="hg19"))  ## correct
>>>     vcf <- readVcf(fl, "hg19")          ## also correct
>>>
>>> This requirement for 'genome' is not well documented on ?readVcf or ?Seqinfo. We can fix that.
>>>
>>> The second thing is the issue of a flexible mapping between seqinfo metadata for different institutions. Herve and Marc have worked on this in AnnotationDbi. They can explain more about the 'SeqnameStyle' and how it might be used more widely.
>>>
>>> Val
>>>
>>> On 04/25/2013 06:54 AM, Kasper Daniel Hansen wrote:
>>>> An official comment on this
>>>> http://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19
>>>> with some more info in this discussion
>>>> https://groups.google.com/a/soe.ucsc.edu/forum/?fromgroups=#!topic/genome/hFp-dGG9gBs
>>>>
>>>> Essentially it seems that b37 has been patched, and this patched release is not reflected in hg19 but may be (I don't know) reflected in the b37 download from NCBI.
>>>>
>>>> Kasper
>>>>
>>>> On Thu, Apr 25, 2013 at 9:49 AM, Kasper Daniel Hansen <kasperdanielhan...@gmail.com> wrote:
>>>>> I agree with Vincent. I have seen code from Herve in a package with some standardization of chromosome names, and this code could perhaps be used more widely so we don't have all the problems with chr1 vs chr01 vs 1.
>>>>>
>>>>> However, in this particular case, if Ulrich is actually interested in the mitochondrial genome, he has a problem. hg19, which is the genome version from UCSC, is considered equal to NCBI's b37. However, as far as I understand, UCSC screwed up with the mitochondrial genome and used an old version for their hg19. So the error message is in many ways right here: the two genomes are slightly different because they have different mitochondrial genomes.
>>>>>
>>>>> Kasper
[Bioc-devel] Bioconductor and the Google Summer of Code
Hello everyone,

This year Bioconductor is participating in the Google Summer of Code. Interested parties can see our GSOC page here:

https://google-melange.appspot.com/gsoc/org/google/gsoc2013/bioconductor

Also, we know that we only have a few ideas on our ideas list this year. It's because we want to start small this year and only take on a few projects under careful mentorship. But that doesn't mean that we don't want to hear from the community about what you would like to see in the future. And of course we would also like to hear from any students out there who might want to participate!

So if you are a student who wants to participate, please email our special list set up for this purpose (gsoc-b...@lists.fhcrc.org) and tell us about yourself and your interests.

Marc
[Bioc-devel] Next Wednesday is the deadline for all new package submissions.
Hello everyone,

If you are planning to submit a package to the project in time for the upcoming release, please be sure to do so by next Wednesday (the 13th). You can see the deadline in our release schedule here:

http://www.bioconductor.org/developers/release-schedule/

Thanks again,

Marc
Re: [Bioc-devel] I would like to publish a bioconductor package.
Hi Davide,

Lots of good advice here. The main goal with two packages is to minimize dependencies for the experiment data package, as this is presumed to be the less specialized package. Whenever you have a package ready, please be sure to follow the instructions on the link that Herve provided.

Thanks in advance for your interest in contributing to the project,

Marc

On 03/04/2013 08:19 AM, Kasper Daniel Hansen wrote:
> This is kind of a chicken and egg problem.
>
> If the data in the experiment data package is in base R containers (like just a matrix etc.), it is pretty clear: the data package does not depend on anything, and the methods package either suggests or depends on the data package.
>
> However, in most cases, the data will be in some (S4) container defined in the methods package. In that case I usually let the data package depend on the methods package and I let the methods package suggest the data package. Then you need to start each example with something like
>
>     if(require(DATAPACKAGE)) { CODE }
>
> I have done this for minfi/minfiData and bsseq/bsseqData.
>
> Kasper
>
> On Mon, Mar 4, 2013 at 10:15 AM, Davide Rambaldi <davide.ramba...@ieo.eu> wrote:
>>> You can solve the package size issue by putting your example data in a separate experiment data package (http://www.bioconductor.org/packages/release/data/experiment/).
>>>
>>> Stephanie
>>
>> I fixed the package size issue with a secondary experiment data package (flowFitExampleData).
>>
>> It is not clear to me how to fix the dependencies between the 2 packages. My setup (I am trying to duplicate the affy/affydata setup…):
>>
>>     flowFit/DESCRIPTION
>>         Suggests: flowFitExampleData
>>
>>     flowFitExampleData/DESCRIPTION
>>         Depends: flowFit
>>
>> And a lot of (maybe not necessary?) if (require(flowFitExampleData)) in the examples.
>>
>> Is it correct?
>>
>> Davide
>>
>> P.S.: tested the package on OSX and Linux with R 3.0 unstable for BUILD and CHECK and it's OK… (me vs inconsolata.sty: 1-0). For Windows, well, I will try to do it … maybe I will ask for more help ...
>>
>> On Feb 27, 2013, at 5:25 PM, Stephanie M. Gogarten wrote:
>>> You can solve the package size issue by putting your example data in a separate experiment data package (http://www.bioconductor.org/packages/release/data/experiment/).
>>>
>>> Stephanie
>>>
>>> On 2/27/13 3:03 AM, Davide Rambaldi wrote:
>>>> Hi all,
>>>>
>>>> I am working on a library called flowFit. The purpose of this library is to analyze the FACS data coming from proliferation tracking dye studies.
>>>>
>>>> The library depends on the flowCore and flowViz Bioconductor libraries and uses minpack.lm (Levenberg-Marquardt algorithm) to fit a set of peaks over the FACS data.
>>>>
>>>> A typical experimental pipeline:
>>>>
>>>> 1) Acquire with FACS a sample of unlabelled cells
>>>> 2) Acquire with FACS a sample of labeled and unstimulated cells (the Parent Population)
>>>> 3) Acquire with FACS a sample of labeled and stimulated cells (the Proliferative Population)
>>>>
>>>> In R we can use the flowCore functions to transform the raw data and to gate the population of interest. Once we have gated the correct population, with 2 commands of flowFit you can perform the fitting:
>>>>
>>>>     library(flowFit)
>>>>     parent <- parentFitting(QuahAndParish[[1]], "FITC-A")
>>>>     fitting <- proliferationFitting(QuahAndParish[[2]], "FITC-A",
>>>>                                     parent@parentPeakPosition,
>>>>                                     parent@parentPeakSize)
>>>>
>>>> The function can also generate some graphical output with:
>>>>
>>>>     plot(fitting)
>>>>
>>>> To demonstrate the correctness of the fitting I have made some in silico simulations and a retrospective analysis of the data from the paper:
>>>>
>>>> New and improved methods for measuring lymphocyte proliferation in vitro and in vivo using CFSE-like fluorescent dyes, Benjamin J.C. Quah, Christopher R. Parish, Journal of Immunological Methods (2012)
>>>>
>>>> In this paper, the same population of lymphocytes (proliferating under the same growth conditions) was stained with 3 different proliferation tracking dyes: if the fitting algorithm is working as expected, we expect to estimate the same % of cells per generation in the 3 samples.
>>>>
>>>> Comparing the 3 samples we didn't see any significant difference in the estimated % of cells per generation, suggesting that the algorithm is correctly estimating the % of cells per generation.
>>>>
>>>> I have posted a graphical output example with the Quah and Parish data (pdf) here: http://dl.dropbox.com/u/40644496/QuahAndPArishOut.pdf
>>>>
>>>> The dataset will be included in the library (in the data subdir).
>>>>
>>>> Currently I am writing the vignette (I am following the guidelines in http://www.bioconductor.org/developers/package-guidelines/) and fixing some graphical bugs (like the oversized legend …).
>>>>
>>>> The package passes R CMD build and R CMD check (time: 86 seconds) with no errors on OSX and Linux (I have to find a Windows machine somewhere ...). I still have to test with the R-devel version of R.
>>>>
>>>> The library is bigger than expected (4.2 Mb) because the example datasets (FCS files converted to .Rdata) are big (3.7M) and I don't know how
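
The guard Kasper describes, where examples in the methods package only run when the suggested data package is installed, might look like the sketch below. The package name is a deliberately nonexistent placeholder and run_example() is a hypothetical helper, not part of flowFit, minfi or bsseq. requireNamespace() is used here in place of require() so that nothing is attached as a side effect; that substitution is a choice made for this sketch, not something stated in the thread.

```r
# Hypothetical helper: run example code only if a suggested package is available.
run_example <- function(pkg, code) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    force(code)  # 'code' is a lazily evaluated argument, so it only runs on this branch
    invisible(TRUE)
  } else {
    message("Package '", pkg, "' is not installed; skipping example.")
    invisible(FALSE)
  }
}

# Placeholder name standing in for the experiment data package.
ran <- run_example("notARealExampleDataPkg", {
  stop("never reached when the data package is missing")
})
print(ran)
```

In the arrangement discussed in the thread, the methods package would list the data package under Suggests and the data package would list the methods package under Depends.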
Re: [Bioc-devel] makeTranscriptDBFromGFF v. Flybase GFF
Hi Malcolm,

Not too much that hasn't been mentioned before. So I bet that many people can probably walk past this one.

Both GFF and GTF files have many of the same things that come up when you use them. They both are being used for things today (like transcriptomes) which represent a pretty specific use case. And both these file formats were designed a while ago now, and some kinds of information (like exon rank) that are completely crucial for doing something like a transcriptome are therefore still optional when making a GFF or GTF file. Also, because these file formats are very flexible and general in their specification, it is possible for them to be either overly sparse, OR overly loaded with unnecessary stuff (depending on what you were planning to use them for).

So it is completely possible that the Ensembl file may be smaller and yet still contain what you need. Or it might not be smaller. You will simply have to check it and see how it compares.

If you are using my function makeTranscriptDbFromGFF() from the GenomicFeatures package, it will try to check whether the file has all the required information as it processes it into a TranscriptDb object. If you are calling this, the only thing you really have to be extra careful about is the exon rank attribute. This function can guess at that information for you, but I am betting you don't want that if you can avoid it (which is why you will get a warning if this happens). So for these data, you really want to point to an attribute that has that information (if that is possible).

In addition to seeing problems where a file will have too much or too little information, you will also sometimes see a file that is formatted in some peculiar way that requires you to translate it into a more typical looking GFF or GTF file. This can happen because, as I mentioned above, the file formats are fairly general and open to some interpretation by those who write them out.

In general I think the most important piece of advice is that you should always look at GFF or GTF files in person before you try to use them, because you can't really be sure what kind of information will be in there unless you do. The bottom line is that both Ensembl and Flybase are reputable places to get data from. But because they are different places, they may produce dramatically different looking GFF or GTF files.

Also related to this, please be sure to use the very latest version of makeTranscriptDbFromGFF() from the devel branch, as I have made some improvements for performance since the release.

I hope this helps,

Marc

On 02/11/2013 03:13 PM, Cook, Malcolm wrote:
> Marc et al.,
>
> A colleague of mine (cc:ed) is experiencing memory bloat using makeTranscriptDbFromGFF on dmel GFF from Flybase.org.
>
> I told him of my success in using Ensembl's GTF-ization but that I would check in with you (et al.).
>
> So: do you have any advice/warnings/gotchas/toldyasos/caveats re: applying makeTranscriptDbFromGFF to Flybase?
>
> Thanks!
>
> Cheers,
>
> Malcolm
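
To make the point about guessing exon rank concrete, here is a minimal base R sketch of one way a rank could be inferred from coordinates alone when a file does not supply it. This is an assumption-laden illustration, not the logic used inside GenomicFeatures: the table, the column names and infer_exon_rank() are all invented, and ordering by position ignores cases (such as trans-splicing) where genomic order and transcript order disagree, which is exactly why an explicit rank attribute is preferable.

```r
# Toy exon table; coordinates and IDs are invented for illustration.
exons <- data.frame(
  tx_id  = c("tx1", "tx1", "tx1", "tx2", "tx2"),
  strand = c("+",   "+",   "+",   "-",   "-"),
  start  = c(500,   100,   300,   900,   700),
  end    = c(600,   200,   400,   1000,  800),
  stringsAsFactors = FALSE
)

# Guess exon rank from position: ascending start on "+", descending on "-".
infer_exon_rank <- function(df) {
  key <- ifelse(df$strand == "-", -df$start, df$start)
  df$exon_rank <- ave(key, df$tx_id,
                      FUN = function(k) rank(k, ties.method = "first"))
  df[order(df$tx_id, df$exon_rank), ]
}

ranked <- infer_exon_rank(exons)
print(ranked)
```

A guess like this is only as good as the assumption that exons are transcribed in genomic order, so it is worth treating any inferred rank with suspicion.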
Re: [Bioc-devel] [GenomicFeatures] no pkgName found. makeTxDbPackage() called after txdb created from GFF3 file
Hi Malcolm,

In general I have found Ensembl to be really great and I expect that their GTF files are probably fine. Usually the exon rank is the first thing you will see left out when a GTF file is cutting corners, and you are correct that they seem to be including that. I ran the one for Homo sapiens through makeTranscriptDbFromGFF() and everything appears to be in working order.

I wanted to warn Tengfei about this because I worry that most people will be surprised to learn that the GTF file format comes with fewer guarantees about the data included than they might have expected.

I also mentioned it because I noticed that his call to makeTranscriptDbFromGFF() did not specify an exonRankAttributeName, which strongly implies to me that his file might not have had that information present. The assumption was that if he had that information, he would have supplied that argument so that he could make use of it. But another possibility is that Tengfei just didn't need that information at all, in which case this will all just be another (possibly unwarranted) public service message. If that is the case, I apologise for the noise.

Marc

On 02/08/2013 06:19 PM, Cook, Malcolm wrote:
>> Hi Tengfei,
>>
>> Yes that looks like an oversight. Thanks for reporting that! I will extend makeTxDbPackage so that it's more accommodating of these newer transcriptDbs. If you want to help me out, you could call saveDb() on your gmax189 object and send me the .sqlite file that you save it to.
>>
>> Also, if you have any alternate options for importing your data (other than using GFF or GTF): I think you probably should consider it. The file specifications for these filetypes are missing key details and so you can very easily get a legal GFF or GTF file that is actually missing important details from its contents. For example, they can commonly lack information about the order of the exons for a given transcript, which can render them difficult (or impossible) to use for transcript work. But for these specifications, that information is optional.
>
> Marco, do you have any comment on Ensembl GTF (which has exon order) in this regard?
>
> Thanks,
>
> Malcolm
>
>> Marc
>>
>> On 02/06/2013 09:46 PM, Tengfei Yin wrote:
>>> Dear all,
>>>
>>> I am trying to build a txdb object from gff3 for soybean data and try to make it a package. Code used like this:
>>>
>>>     gmax189 <- makeTranscriptDbFromGFF("~/Gmax_189_gene_exons.gff3",
>>>                                        format = "gff3",
>>>                                        species = "Glycine max",
>>>                                        dataSource = "http://www.phytozome.org/")
>>>     makeTxDbPackage(txdb = gmax189,
>>>                     version = "0.9.1",
>>>                     maintainer = "Tengfei Yin",
>>>                     author = "Tengfei Yin",
>>>                     destDir = ".",
>>>                     license = "Artistic-2.0")
>>>
>>> Error message:
>>>
>>>     Error in gsub("_", "", pkgName) :
>>>       error in evaluating the argument 'x' in selecting a method for function
>>>       'gsub': Error: object 'pkgName' not found
>>>
>>> Looks like my dataSource should be either BioMart or UCSC, otherwise no pkgName will be produced in function .makePackageName?
>>>
>>> Or should I build the annotation package in some other way?
>>>
>>> Thanks a lot
>>>
>>> Tengfei
>>>
>>> my sessionInfo
>>>
>>> > sessionInfo()
>>> R Under development (unstable) (2013-01-21 r61728)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>  [7] LC_PAPER=C                 LC_NAME=C
>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] GenomicFeatures_1.11.8 AnnotationDbi_1.21.10  Biobase_2.19.2
>>> [4] GenomicRanges_1.11.28  IRanges_1.17.31        BiocGenerics_0.5.6
>>>
>>> loaded via a namespace (and not attached):
>>>  [1] biomaRt_2.15.0     Biostrings_2.27.10 bitops_1.0-5       BSgenome_1.27.1
>>>  [5] DBI_0.2-5          RCurl_1.95-3       Rsamtools_1.11.15  RSQLite_0.11.2
>>>  [9] rtracklayer_1.19.9 stats4_3.0.0       tools_3.0.0        XML_3.95-0.1
>>> [13] zlibbioc_1.5.0
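
One quick way to check whether a GTF file carries exon order before building anything from it is to look at the attribute column directly. The base R sketch below does this on two made-up GTF-style lines; exon_number is the attribute key that Ensembl GTF files use for exon order, while get_attr() and the sample records are invented for illustration.

```r
# Two made-up GTF-style lines; the 9th column holds 'key "value";' attributes.
gtf_lines <- c(
  'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "tx1"; exon_number "1";',
  'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";'
)

fields <- strsplit(gtf_lines, "\t", fixed = TRUE)
attrs  <- vapply(fields, function(f) f[9], character(1))

# Pull out the value of one attribute key; NA where the key is absent.
get_attr <- function(x, key) {
  pattern <- paste0(key, ' "([^"]*)"')
  hits <- regmatches(x, regexec(pattern, x))
  vapply(hits,
         function(h) if (length(h) == 2) h[2] else NA_character_,
         character(1))
}

exon_number <- get_attr(attrs, "exon_number")
print(exon_number)
print(sum(is.na(exon_number)))  # number of exon records without rank information
```

If every exon record has the key, its name is the natural thing to pass as the exonRankAttributeName argument mentioned above; if some records lack it, the file falls into the "legal but incomplete" category the thread warns about.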
Re: [Bioc-devel] [GenomicFeatures] no pkgName found. makeTxDbPackage() called after txdb created from GFF3 file
Hi Tengfei,

Yes, that looks like an oversight. Thanks for reporting that! I will extend makeTxDbPackage so that it's more accommodating of these newer transcriptDbs. If you want to help me out, you could call saveDb() on your gmax189 object and send me the .sqlite file that you save it to.

Also, if you have any alternate options for importing your data (other than using GFF or GTF): I think you probably should consider it. The file specifications for these filetypes are missing key details and so you can very easily get a legal GFF or GTF file that is actually missing important details from its contents. For example, they can commonly lack information about the order of the exons for a given transcript, which can render them difficult (or impossible) to use for transcript work. But for these specifications, that information is optional.

Marc

On 02/06/2013 09:46 PM, Tengfei Yin wrote:
> Dear all,
>
> I am trying to build a txdb object from gff3 for soybean data and try to make it a package. Code used like this:
>
>     gmax189 <- makeTranscriptDbFromGFF("~/Gmax_189_gene_exons.gff3",
>                                        format = "gff3",
>                                        species = "Glycine max",
>                                        dataSource = "http://www.phytozome.org/")
>     makeTxDbPackage(txdb = gmax189,
>                     version = "0.9.1",
>                     maintainer = "Tengfei Yin",
>                     author = "Tengfei Yin",
>                     destDir = ".",
>                     license = "Artistic-2.0")
>
> Error message:
>
>     Error in gsub("_", "", pkgName) :
>       error in evaluating the argument 'x' in selecting a method for function
>       'gsub': Error: object 'pkgName' not found
>
> Looks like my dataSource should be either BioMart or UCSC, otherwise no pkgName will be produced in function .makePackageName?
>
> Or should I build the annotation package in some other way?
>
> Thanks a lot
>
> Tengfei
>
> my sessionInfo
>
> > sessionInfo()
> R Under development (unstable) (2013-01-21 r61728)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] GenomicFeatures_1.11.8 AnnotationDbi_1.21.10  Biobase_2.19.2
> [4] GenomicRanges_1.11.28  IRanges_1.17.31        BiocGenerics_0.5.6
>
> loaded via a namespace (and not attached):
>  [1] biomaRt_2.15.0     Biostrings_2.27.10 bitops_1.0-5       BSgenome_1.27.1
>  [5] DBI_0.2-5          RCurl_1.95-3       Rsamtools_1.11.15  RSQLite_0.11.2
>  [9] rtracklayer_1.19.9 stats4_3.0.0       tools_3.0.0        XML_3.95-0.1
> [13] zlibbioc_1.5.0
[Rd] question about assignment warnings for replacement methods
Hi,

I have seen several packages that with the most recent version of R are giving a warning like this:

    Assignments in \usage in documentation object 'marginalData-methods':
      marginalData(object) = value

I assume that this is to prevent people from making assignments in their usage statements (which seems completely understandable). But what about the case above? This is a person who just wants to show the proper usage for a replacement method. IOW they just want to write something that looks like what you actually do when you use a replacement method. They just want to show users how to do something like this:

    replacementMethod(object) <- newValue

So is that really something that should not be allowed in a usage statement?

Marc
Re: [Rd] question about assignment warnings for replacement methods
Thank you for the clarifications, Duncan.

Marc

On 04/05/2011 11:15 AM, Duncan Murdoch wrote:
> On 05/04/2011 1:51 PM, Marc Carlson wrote:
>> Hi,
>>
>> I have seen several packages that with the most recent version of R are giving a warning like this:
>>
>>     Assignments in \usage in documentation object 'marginalData-methods':
>>       marginalData(object) = value
>>
>> I assume that this is to prevent people from making assignments in their usage statements (which seems completely understandable). But what about the case above? This is a person who just wants to show the proper usage for a replacement method. IOW they just want to write something that looks like what you actually do when you use a replacement method. They just want to show users how to do something like this:
>>
>>     replacementMethod(object) <- newValue
>>
>> So is that really something that should not be allowed in a usage statement?
>
> If replacementMethod was a replacement function, then
>
>     replacementMethod(object) <- newValue
>
> is supposed to be fine. But if it is an S3 method, it should be
>
>     \method{replacementMethod}{class}(object) <- newValue
>
> and if it is an S4 method I think it should be
>
>     \S4method{replacementMethod}{signature_list}(object) <- newValue
>
> (though the manual suggests using the S3 style, I'm not sure how literally to take it).
>
> Duncan Murdoch
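
For context, this is roughly what an S4 replacement method looks like on the R side, with the matching \usage forms from Duncan's reply noted in comments. The generic name marginalData is borrowed from the warning quoted in the thread; the class "Toy" and its slot are hypothetical, so this is a sketch of the pattern rather than the code of the package that triggered the warning.

```r
library(methods)

# Hypothetical class and slot, invented for this sketch.
setClass("Toy", representation(marginalData = "numeric"))

setGeneric("marginalData", function(object) standardGeneric("marginalData"))
setGeneric("marginalData<-",
           function(object, value) standardGeneric("marginalData<-"))

setMethod("marginalData", "Toy", function(object) object@marginalData)

# setReplaceMethod("f", ...) defines the method for the generic "f<-".
setReplaceMethod("marginalData", "Toy", function(object, value) {
  object@marginalData <- value
  validObject(object)
  object
})

x <- new("Toy", marginalData = c(1, 2))
marginalData(x) <- c(3, 4, 5)   # what a user actually types
print(marginalData(x))

# Rd \usage forms, following Duncan's reply (written with <-, not =):
#   plain replacement function:  marginalData(object) <- value
#   S4 replacement method:       \S4method{marginalData}{Toy}(object) <- value
```

Note that the warning quoted in the thread shows the usage written with `=`, whereas the forms Duncan describes as fine use `<-`.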